How to Build a Reliable Well-Architected Framework
In this article, we will explore the Reliability pillar of the AWS Well-Architected Framework, examining best practices for cloud-based operations, including change management and disaster recovery.
Related articles in the Well-Architected series:
- Overview of All 5 Pillars
- Security Pillar
- Operational Excellence Pillar
- Performance Efficiency Pillar
- Cost Optimization Pillar
In the world of DevOps, the end goal is to ensure that what you build works as intended, and only as intended. The Reliability pillar of the Amazon Web Services (AWS) Well-Architected Framework is all about ensuring that a workload performs as expected, consistently and reliably. It is that simple and yet incredibly difficult to achieve. The Reliability pillar details design principles that, when followed, can help you design reliable cloud architecture.
Five design principles for the Reliability pillar
- Automatically recover from failure by tracking metrics known as key performance indicators (KPIs). A KPI specifies the level of reliability that a business needs to achieve in order for cloud workloads to perform properly. With an established KPI, it is possible to monitor performance levels in relation to the reliability of systems within the organization. Once a metric is established, you can set up alerts to notify you if systems are reaching a point of failure. Going one step further, with those metrics in place, it is possible to configure automatic recovery steps that would be triggered by reaching KPIs. When done well, the failure can be predicted and prevented automatically.
- Build and test recovery procedures. Failure will happen, and if you work with that assumption, you can minimize impact to customers and the business when things are not working well or actually fail.
- Scale horizontally to increase aggregate workload availability. When designing and building cloud services, it is normal to figure out the performance needed and build one server to meet that need. Instead, what if we divided that need by five and built five smaller servers that work together to meet the original need? Now, if one of those servers fails, it is not an automatic Denial of Service (DoS), as there are four other servers to support connectivity.
- Stop guessing capacity. Use capacity management to ensure you have the proper resources now and can adapt to changes in the future. There is logic to calculating and planning for the needs of the business before reaching your current maximum capability. It is essential that there is a team working on this to predict your growth and allow appropriate expansion.
- Manage change in automated tasks. Changes to anything automatically handled within the cloud need to be managed through a change management process.
Reliability is designed; it does not occur by accident. To design it properly, you need to understand your business’s requirements, which is an ongoing theme within the Well-Architected Framework. It is crucial to understand what users and customers need from the network and systems in order to build a reliable cloud. With the requirements in hand, it is possible to design a workload that enables the business rather than disrupting it through a lack of functionality. Planning for automatic detection of, and response to, failures is essential.
Level of availability of your cloud workloads
Level of availability, or service availability, is defined by a metric. Metrics must be specific and measurable in order to be usable. Availability is expressed as a percentage: the proportion of time that a service is available to its users. In this case, available means functional; for example, the user must be able to access data and functions. When expressing the percentage of availability, it is important to include both unplanned and planned downtime. Read all vendor contracts carefully and do not assume they are written this way. Many vendors do not include planned downtime in their promises; however, AWS does.
This calculation results in the table of nines. We talk about four nines (99.99%), five nines (99.999%), and so on. Each level translates to the hours, minutes, or seconds of downtime that would be experienced per year. For example, 99.999% translates to a downtime of about 5 minutes a year.
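The arithmetic behind the table of nines can be sketched in a few lines of Python. This is a minimal illustration, assuming a 365-day year; the function name `downtime_minutes_per_year` is ours, not something taken from AWS.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years are ignored


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the yearly downtime budget, in minutes, for an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    return (100 - availability_percent) / 100 * MINUTES_PER_YEAR


# Print a small table of nines.
for label, pct in [("two nines", 99.0), ("three nines", 99.9),
                   ("four nines", 99.99), ("five nines", 99.999)]:
    print(f"{label:12s} {pct:>7}%  {downtime_minutes_per_year(pct):10.2f} min/year")
```

With these assumptions, four nines allows a little under 53 minutes of downtime a year and five nines a little over 5 minutes.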
Availability must also be determined in relationship to or between dependent systems. When calculating a system’s dependency, it must be factored with the level of availability of the systems that it relies on. AWS gives the following example in its Reliability pillar document:
If a system with a reliability of 99.99% is dependent on two other independent systems, both with 99.99% reliability, then the first system is at 99.97% reliability: 99.99% x 99.99% x 99.99% = 99.97%.
In a different example provided by AWS, if a system depends on two fully redundant, independent components, each with 99.9% availability, the effective availability of that dependency is 99.9999%. This is calculated as the maximum availability minus the product of the components’ failure rates: Available max - ((100% - Available dependency) x (100% - Available dependency)), or 100% - ((100% - 99.9%) x (100% - 99.9%)) = 99.9999%.
As you can see, availability is improved when the dependent systems are made redundant, but it will cost you more. The highest level of availability is not always the best answer. The cost of the system built must be balanced against the needs of the business. The trick is to understand the business requirements, then select and build the right systems to fulfill them.
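Both kinds of calculation can be checked with a short Python sketch. It assumes that failures are independent, which real systems only approximate, and the helper names `serial_availability` and `redundant_availability` are hypothetical. Availabilities are passed as fractions rather than percentages.

```python
from math import prod


def serial_availability(*availabilities: float) -> float:
    """Hard dependencies: every part must be up, so the availabilities multiply."""
    return prod(availabilities)


def redundant_availability(*availabilities: float) -> float:
    """Redundant parts: the group is down only when every part is down at once."""
    return 1 - prod(1 - a for a in availabilities)


# A 99.99% system that depends on two other 99.99% systems.
chain = serial_availability(0.9999, 0.9999, 0.9999)

# A redundant pair of hypothetical 99.9% components.
pair = redundant_availability(0.999, 0.999)

print(f"serial chain:   {chain:.4%}")
print(f"redundant pair: {pair:.4%}")
```

The serial chain comes out at roughly 99.97%, while the redundant pair reaches roughly 99.9999%, which is why redundancy is the usual lever for raising availability.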
The concept of availability levels is central to disaster recovery (DR) calculations. In DR, you have the recovery time objective (RTO) and the recovery point objective (RPO). These terms vary slightly in definition, depending on the standard that is being followed.
Disaster Recovery Definitions
AWS defines these terms in the following manner:
- RTO is the maximum time that a service can be offline. In other words, it is the longest acceptable time from failure to restored function.
- RPO is the maximum amount of data that can be lost, expressed as a period of time. In other words, it is the window of time from the last backup to the point of failure. Since data from that window would not be backed up, it would be irretrievably lost.
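As a rough illustration of how these two definitions turn into checks, the sketch below compares a backup schedule and a measured restore time against stated objectives. The class and function names are hypothetical, and it assumes the worst-case data loss equals the full interval between backups.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class RecoveryObjectives:
    rto: timedelta  # maximum acceptable time offline
    rpo: timedelta  # maximum acceptable window of lost data


def check_recovery_plan(objectives: RecoveryObjectives,
                        backup_interval: timedelta,
                        measured_restore_time: timedelta) -> dict:
    """Compare a backup schedule and a tested restore time with the objectives.

    Assumption: a failure just before the next backup loses one full interval,
    so the backup interval is treated as the worst-case data loss.
    """
    return {
        "rpo_met": backup_interval <= objectives.rpo,
        "rto_met": measured_restore_time <= objectives.rto,
    }


objectives = RecoveryObjectives(rto=timedelta(hours=4), rpo=timedelta(hours=1))
result = check_recovery_plan(
    objectives,
    backup_interval=timedelta(hours=6),        # backups every six hours
    measured_restore_time=timedelta(hours=2),  # from the last restore test
)
print(result)
```

In this example the restore time fits within the RTO, but six-hourly backups cannot satisfy a one-hour RPO, so the backup frequency would need to change.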
Do not forget to identify the sub-components and analyze the downtime and data loss requirements for each part when calculating the overall system requirements. When analyzing services, there are two distinct categories: the data plane and the control plane. The data plane is where the users send and receive data, while the control plane is more administrative and handles requests such as creating a new database or starting a new instance of a virtual machine.
The foundations of a reliable Well-Architected Framework
Foundations are the core requirements that extend beyond, or you could say under, any workload, such as ensuring there is enough bandwidth in and out of your data center to meet the requirements of the business. AWS has broken this down into two things you must consider:
- The management of service quotas and constraints
- The provisioning of your network topology
Service quotas and constraints are also known as service limits within the cloud. Service limits are controls placed on the account to prevent services from being provisioned beyond the desires of the business. For example, services like Amazon Elastic Compute Cloud (Amazon EC2) have their own specific dashboards for managing quotas. Quotas include input/output operations per second (IOPS), rate limits, storage allocations, concurrent user limits, and the list goes on. It is critical to remember that you must manage quotas in every region where your services exist, independently of one another.
It is also critical to monitor usage through metrics and alerts. AWS has Service Quotas and Trusted Advisor for this purpose. However, you still need to ensure correct configuration of these monitors, which can be a time-consuming task and something you’d want to automate. Trend Micro Cloud One™ – Conformity’s Knowledge Base has rules to apply to Trusted Advisor, which can be run and applied automatically using the Trend Micro Cloud One™ – Conformity service.
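The kind of check you would automate can be sketched without any cloud SDK. The example below is only an illustration: the usage figures, quota names, and the 80% threshold are made up, and in practice the numbers would come from your monitoring tooling rather than a hard-coded dictionary. Note that each region is tracked separately.

```python
def quotas_near_limit(usage, threshold=0.8):
    """Return (region, quota, utilization) for quotas at or above the threshold.

    `usage` maps (region, quota_name) -> (used, limit). All values here are
    hypothetical sample data, not real AWS quota names or figures.
    """
    flagged = []
    for (region, quota), (used, limit) in usage.items():
        if limit <= 0:
            continue  # skip malformed entries rather than divide by zero
        utilization = used / limit
        if utilization >= threshold:
            flagged.append((region, quota, round(utilization, 2)))
    # Highest utilization first, so the most urgent quota leads the alert.
    return sorted(flagged, key=lambda item: item[2], reverse=True)


sample_usage = {
    ("us-east-1", "running-instances"): (45, 50),
    ("us-east-1", "elastic-ips"): (2, 5),
    ("eu-west-1", "running-instances"): (16, 20),
}

for region, quota, utilization in quotas_near_limit(sample_usage):
    print(f"{region} {quota}: {utilization:.0%} of quota used")
```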
Four considerations for planning your reliable network
Provisioning and planning your network topology is crucial to the reliability and safe expansion of your network. The first thing you should consider is having highly-available public endpoints provisioned through the network connectivity. This is done with load balancing, redundant DNS, content delivery networks (CDNs) etc. Conformity has many rules for AWS products like the Elastic Load Balancer to ensure, for example, that HTTP/HTTPS services are using Application Load Balancer (ALB) rather than the Classic Load Balancer (CLB).
The second thing to consider is provisioning redundancy in the connectivity between private data centers and the cloud. When you maintain a private data center and connect it to services within the cloud, it is important to know the business requirements for access between those two networks.
Moving right along, the third consideration is the configuration of IP subnets. When joining virtual private clouds (VPCs) together, it is essential to ensure that there will not be an addressing conflict. VPCs are created with private addressing, as defined in RFC 1918. If two VPCs are utilizing the same address structure, it would cause a conflict if they were connected. It is necessary to allocate unique subnets per VPC, leaving room for more to be added per region.
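A quick way to catch addressing conflicts before connecting VPCs is to compare their CIDR blocks. The sketch below uses Python's standard `ipaddress` module; the VPC names and address ranges are invented for illustration.

```python
import ipaddress
from itertools import combinations


def find_overlaps(vpc_cidrs):
    """Return every pair of VPC names whose CIDR blocks overlap."""
    networks = {name: ipaddress.ip_network(cidr) for name, cidr in vpc_cidrs.items()}
    return [
        (name_a, name_b)
        for (name_a, net_a), (name_b, net_b) in combinations(networks.items(), 2)
        if net_a.overlaps(net_b)
    ]


# Hypothetical address plan using RFC 1918 private ranges.
vpcs = {
    "vpc-a": "10.0.0.0/16",
    "vpc-b": "10.1.0.0/16",
    "vpc-c": "10.0.128.0/17",  # falls inside vpc-a's range
}

for name_a, name_b in find_overlaps(vpcs):
    print(f"conflict: {name_a} overlaps {name_b}")
```

Here `vpc-a` and `vpc-c` would conflict if they were connected, while `vpc-b` is safe to peer with either.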
The fourth and final consideration is designing the environment as hub and spoke, versus many-to-many connectivity. As your cloud environment grows, a many-to-many configuration becomes untenable. You need to figure out the flow of data through the environments, what flows could take an extra hop along the way to its destination, and then connect high-usage paths directly.
Cloud workload architecture design decisions
Cloud workload architecture design decisions have a large impact on the reliability of software and infrastructure. When designing your workload service architecture, building highly-scalable and highly-reliable architectures is critical. It is best practice to use common communications standards, like service-oriented architecture (SOA) or microservices architecture, to enable quick incorporation into existing cloud workloads as your design progresses.
Designing software for a distributed system is very important for failure prevention. If you do a threat analysis with the assumption that failure will occur, then you can look at how to design your systems to best prevent failure. In order to determine the type of distributed system you need to design, you will need to determine the reliability it needs. A hard real-time system has the highest demand for reliability, as opposed to a soft real-time system. If you choose a hard real-time system, then implement loosely coupled dependencies, so that a change to one component does not force changes to the others that depend on it.
When designing your workload architecture, make operations idempotent so that repeating a request has the same effect as making it once. A client can then retry a request many times, but the system processes it only once and returns the same result for each retry, which keeps duplicate work from overwhelming the system.
Always do constant work. For example, if a health check system reports on 200 servers, it should report on all 200 servers every time, rather than only on those with errors or issues. If it is not designed for constant work, the normal report might include only around 10 issues; when 180 servers suddenly report issues, the health check becomes 18 times busier than normal and could overwhelm the system, causing a plethora of problems in a variety of ways.
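One common way to get this behavior is an idempotency key: the client attaches a unique token to each logical request, and the service remembers the result for that token. The sketch below is a minimal in-memory illustration; the `PaymentService` class and its fields are hypothetical, and a real service would keep the keys in durable shared storage.

```python
class PaymentService:
    """Toy service that performs each logical request at most once."""

    def __init__(self):
        self._results = {}     # idempotency key -> stored response
        self.charges_made = 0  # counts how often the side effect really ran

    def charge(self, idempotency_key: str, amount_cents: int) -> dict:
        # A retry with a known key returns the stored response unchanged.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        # First time this key is seen: perform the side effect once.
        self.charges_made += 1
        response = {"charge_id": self.charges_made, "amount_cents": amount_cents}
        self._results[idempotency_key] = response
        return response


service = PaymentService()
first = service.charge("order-1001", 2500)
retry = service.charge("order-1001", 2500)  # client retried after a timeout

print(first == retry, service.charges_made)
```

The retry receives the same response as the first call, and the charge itself runs only once.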
Six ways to mitigate cloud workload failure
Now, it’s time to take your workloads to the next level by designing your software in a distributed system to mitigate failures or to withstand them. If a system is designed to withstand stress, then the mean time to recovery (MTTR) would be shortened and if failures are prevented, then the mean time between failures (MTBF) is lengthened. Here are six best practices from AWS to help you achieve this level of design:
- Implement graceful degradation into the systems. Turn the hard dependencies into soft dependencies, so the system can use a predetermined response if a live response is not available.
- Throttle requests. If the system receives more requests than it can reliably handle, some will be denied. A response will be sent to those denied, notifying them that they are being throttled.
- Control and limit retry calls using an exponential backoff. Randomize the intervals between retries and set a maximum number of retries.
- Fail fast and limit queues when dealing with a workload. If the workload is not able to respond successfully, then it should fail so that it can recover. If it is successfully responding, but too many requests are coming in, then you can use a queue, just do not allow long queues.
- Set timeouts on the client side. Most default values for timeouts are set too long. So, determine the appropriate values to use for both connections and requests.
- Where possible, make services stateless. If that is not possible, offload the state so that local memory is not utilized. This assists with the maintenance of the virtual environment, allowing for servers to be replaced if necessary, but not disrupt the client’s session.
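The retry practice in the list above can be sketched as follows: a capped exponential backoff with randomized intervals and a maximum number of retries. The default values and function names are illustrative assumptions, and the "full jitter" approach shown here is one of several reasonable ways to randomize the delay.

```python
import random
import time


def backoff_delays(max_retries=5, base=0.1, cap=5.0, rng=random.random):
    """Yield one randomized delay (in seconds) per allowed retry.

    Each delay is drawn uniformly between 0 and an exponentially growing,
    capped ceiling, so clients do not all retry at the same moment.
    """
    for attempt in range(max_retries):
        yield rng() * min(cap, base * 2 ** attempt)


def call_with_retries(operation, max_retries=5, base=0.1, cap=5.0, sleep=time.sleep):
    """Call `operation`; on failure, retry up to `max_retries` times.

    For brevity this catches every Exception. Real code should retry only
    errors known to be transient, such as timeouts or throttling responses.
    """
    delays = backoff_delays(max_retries, base, cap)
    while True:
        try:
            return operation()
        except Exception:
            delay = next(delays, None)
            if delay is None:
                raise  # retries exhausted: surface the last error
            sleep(delay)
```

A caller would wrap a network request, for example `call_with_retries(lambda: fetch_profile(user_id))`, where `fetch_profile` stands in for whatever client call you need to protect. Passing a custom `sleep` makes the function easy to test without waiting.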
Cloud workload and environment change management
Change management should be applied not only to the environment, but also to the workload itself. An example of a change in an environment could be a spike in demand, whereas changes to the workload can involve security patches or new features that are being deployed to a production system.
When managing changes to the environment, you’ll want to first monitor the resources. AWS has outlined the four phases of monitoring:
- The first phase is generation. This involves monitoring every component that would generate logs.
- Those logs must then be aggregated someplace, like a syslog server. Then, filters can be applied based on calculated metrics.
- Based on the metrics that are applied, real-time alerts can be generated automatically and sent to the appropriate people.
- Logs should be stored over time to allow for historical metrics to be applied. Analysis of logs over a broader period of time allows broader trends to be seen and greater insights to be developed about the workloads.
In terms of scalability, having the ability to scale down is crucial in managing costs appropriately, which is another pillar within the AWS Well-Architected Framework called Cost Optimization.
Testing and managing changes to the deployment of new functions or patches, especially security patches, is crucial. If a change has been well tested, you can follow the runbook to deploy.
Deploying onto an immutable infrastructure is best, as it provides a more reliable and predictable outcome for the change being made. For example, if a server needs a patch, a new virtual image is built with the patch applied. A replacement server is then launched from the new image and the old server is shut down; the running server itself is never altered.
Cloud workload protection for failure management
Hardware and software will endure failures, so it is best to plan for them. Cloud providers already have redundancy built into a lot of their systems to help protect customers from as many failures as they can, but you will endure one eventually. Amazon Simple Storage Service (Amazon S3) objects are stored redundantly across multiple Availability Zones, and the service is designed for 99.999999999% durability. That is 11 nines. Yet it is still possible for a failure that causes data loss to occur. So, to be on the safe side, you should still back up your data and test the restoration of that backup. As discussed above, the acceptable amount of data loss is specified by the RPO.
Now that you have prepared your workload for failures, you need to ensure your workload is protected from faults. To do so, you’ll need to distribute the workload across multiple Availability Zones to reduce or eliminate single points of failure. However, if that isn’t possible, then you’ll have to find a way to recover within a single zone, such as implementing an automated redeployment feature for when a failure is detected.
A workload should also be designed to withstand component failures. To do this, you need to monitor the system’s health and as a decline is noted, it should fail over to healthy systems. Then, you can automate the healing of resources. If a system is in a degraded state, it is always good to have the ability to restart.
Next, you want to shift your focus back to testing, but this time you’ll be testing reliability in relation to failure. Test the systems, machines, applications, and networks so that you can pre-emptively find the problems before they become just that, a problem. However, when a failure does occur, it needs to be investigated, and there should be a playbook that guides the team through the process. With a carefully crafted process, the source of the failure can reliably be uncovered so that the system that failed can be brought back to a normal working condition.
After a failure is over and everything is back to normal operating conditions, there should be an analysis of the incident. The goal is to uncover where things can be improved so that, if or when you experience the same or a similar incident, the response is better. When improvements are identified, the playbook should be updated.
The testing continues. Test resiliency using chaos engineering. Insert failures into pre-production and production—yes, production—environments on a regular basis. Netflix created a chaos monkey that runs in their AWS cloud environment. The chaos monkey regularly causes failures within the cloud environment to allow Netflix to see and improve their responses, as necessary. They have made the code available on GitHub. There are others available as well, such as Shopify’s Toxiproxy, if you want to explore them.
Finally, it is time to talk about having Disaster Recovery (DR) plans. Using the information regarding RPO and RTO we reviewed earlier, the correct choices can be selected to ensure you are ready when disaster strikes.
Disaster is not when a single virtual machine fails, rather it is when the failure could cause a significant loss to the business, possibly even the loss of the business itself. AWS recommends having multiple availability zones within a region.
Four disaster recovery levels
If you need multi-region recovery capabilities, AWS has defined four recovery levels:
- Backup and restore: the RPO is in hours and the RTO is 24 hours or less.
- Pilot light: the RPO is in minutes and the RTO is in hours.
- Warm standby: the RPO is in seconds and the RTO is in minutes.
- Multi-region active-active: the RPO is zero to seconds and the RTO is in seconds.
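To show how the RPO and RTO requirements drive the choice between these levels, here is a small Python sketch that picks the first level able to meet a given requirement. The numeric cut-offs are placeholders we chose for illustration, since the levels above are described only by order of magnitude, and the ordering assumes that each successive level costs more to run.

```python
from datetime import timedelta

# (level, best RPO assumed achievable, best RTO assumed achievable).
# Ordered from least to most expensive. The numbers are illustrative
# assumptions, not figures published by AWS.
RECOVERY_LEVELS = [
    ("Backup and restore", timedelta(hours=1), timedelta(hours=24)),
    ("Pilot light", timedelta(minutes=1), timedelta(hours=1)),
    ("Warm standby", timedelta(seconds=1), timedelta(minutes=1)),
    ("Multi-region active-active", timedelta(0), timedelta(seconds=1)),
]


def cheapest_level(required_rpo: timedelta, required_rto: timedelta):
    """Return the first (cheapest) level that meets both objectives, or None."""
    for name, achievable_rpo, achievable_rto in RECOVERY_LEVELS:
        if achievable_rpo <= required_rpo and achievable_rto <= required_rto:
            return name
    return None


print(cheapest_level(timedelta(hours=4), timedelta(hours=48)))
print(cheapest_level(timedelta(minutes=15), timedelta(hours=4)))
print(cheapest_level(timedelta(seconds=5), timedelta(minutes=10)))
```

The point of the sketch is the selection logic: start from the business requirement and choose the least costly level that satisfies it, rather than defaulting to the most resilient option.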
Without sounding like a broken record…whatever your plans are, they need to be tested. There are many distinct levels of tests within the field of DR, but whatever recovery systems you are testing, there needs to be a planned failover path. The path not tested is the one that will not work.
As your production environment evolves, it is critical to update and change the backup systems and sites as well. Recovery sites need to be able to support the needs of the business, so there should be regular intervals at which the DR systems are analyzed and updated.
As the growth of the cloud continues, it is becoming more and more important for teams to ensure they are building reliable cloud environments. Conformity can help you stay compliant with the AWS and Microsoft® Azure® Well-Architected Frameworks with its 750+ best practice checks. If you are interested in knowing how well-architected you are, you can use our self-guided cloud risk assessment to get a personalized view of your risk posture in 15 minutes. Learn more by reading the other articles in the series: 1) Overview of All 5 Pillars 2) Security 3) Operational Excellence 4) Performance Efficiency 5) Cost Optimization.
Alternatively, you can browse through some of our Knowledge Base articles, which include instructions on how to manually audit your environment and step-by-step instructions on how to remediate high-risk misconfigurations related to the Reliability pillar:
- Find any Amazon EC2 instances that appear to be overutilized, and upgrade (resize) them to help your Amazon EC2-hosted applications handle the workload better and improve the response time.
- Identify RDS instances with low free storage space and scale them for optimal performance
- Identify overutilized RDS instances and upgrade them to optimize database workload and response time
- Ensure the availability zones in ASG and ELB are the same
- Identify AWS Redshift clusters with high disk usage and scale them to increase their storage capacity
- Identify any Amazon ElasticSearch clusters that appear to be running low on disk space and scale them up
- Ensure that Amazon ElasticSearch (ES) clusters are healthy