Achieve Operational Excellence in Your Cloud Workload
Explore the Operational Excellence pillar of the AWS and Azure Well-Architected Framework and examine best practices and design principles for cloud-based security operations, including CI/CD and risk management.
Related articles in the Well-Architected series:
- Overview of All 5 Pillars
- Security Pillar
- Reliability Pillar
- Performance Efficiency Pillar
- Cost Optimization Pillar
In today’s landscape, achieving operational excellence can be difficult, but not impossible. Operations is often viewed as distinct from the rest of the business, so it isn’t always integrated into the overall workflow the way other departments are.
We have seen the industry recognize this divide with the creation of DevOps—combining development and IT operations into one process to enable more streamlined creation and implementation of software throughout the software development life cycle (SDLC).
Microsoft® Azure® and Amazon Web Services (AWS) continue to publish design principles for building applications that adhere to their well-architected frameworks. The best practices of the AWS Well-Architected Framework are based on five pillars: operational excellence, security, reliability, performance efficiency, and cost optimization. This article focuses on the operational excellence pillar, in which AWS defines five design principles spread across four areas: “organization”, “prepare”, “operate”, and “evolve”. Let’s take a look.
5 Operational excellence design principles
- Perform operations as code—the beauty of the cloud is that you can apply the same scripting skills you use to code applications to your entire environment, including operations. This means you can reduce the need for human intervention by writing code that automates operations and triggers appropriate responses to events or incidents.
- Make frequent, small, reversible changes—when multiple large changes are made at once, it becomes exceedingly difficult to troubleshoot when things don’t work in production. When designing your workloads, allow for small, frequent deployments that are easily reversible, making it quick and easy to identify the source of a problem when something isn’t running as intended.
- Refine operations procedures frequently—there is always room for improvement. Continually analyzing and poking holes in your processes and procedures helps you to constantly increase the efficiency of how you serve your customer needs.
- Anticipate failure—it is always better to expect failure, rather than assuming that what you’ve created is flawless. If you don’t anticipate errors, how can you catch them before deployment? This is effectively the process of threat modeling and risk assessment.
- Learn from all operational failures—the point of going back and analyzing a failure is to learn from it. It is important to set up structures and processes that enable the sharing of learnings across teams and the business.
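The first principle, operations as code, can be sketched as a Lambda-style event handler that maps incoming operational events to automated responses. The event shape and the action names below are illustrative assumptions, not a real AWS schema:

```python
# A minimal "operations as code" sketch: a Lambda-style handler that reacts
# to a CloudWatch-style event by choosing a remediation action. The event
# fields and the action names are hypothetical, for illustration only.

def handle_event(event: dict) -> dict:
    """Map an incoming operational event to an automated response."""
    detail_type = event.get("detail-type", "")
    if detail_type == "EC2 Instance State-change Notification":
        state = event.get("detail", {}).get("state")
        if state == "stopped":
            # An unexpected stop is worth a human-visible notification
            return {"action": "notify", "reason": "instance stopped"}
    # Default: record the event for later analysis instead of paging a human
    return {"action": "log", "reason": "unhandled event type"}

if __name__ == "__main__":
    sample = {
        "detail-type": "EC2 Instance State-change Notification",
        "detail": {"state": "stopped"},
    }
    print(handle_event(sample))
```

Because the response logic lives in code, it can be version-controlled, reviewed, and tested like any other application change.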
Embedding operational excellence into your organization
The area of “organization” is the first up for discussion. The way your business assigns responsibility across your engineering and operations departments is critical to your success. Who is responsible for the platform? Who is responsible for applications? How do the different departments communicate? At the end of the day, you need to be organized in a way that enables you to build software and applications that fulfill your business's strategy.
In order to make any decisions about organization, the priorities of the business must first be reviewed and determined.
High-level organization priorities:
- Evaluate your customer needs, both internal and external
- Evaluate the corporate requirements to comply with different laws and regulations
- Evaluate the current threat landscape
- Determine the tradeoffs you would have to make when supporting competing interests or choosing between different approaches
DevOps risk management
It is critical that businesses manage risk. You can determine your business's risk by looking at the possible attacks that could occur, as well as the likelihood of each coming to fruition. While the cloud has been around for a while, it is still a relatively new ecosystem that we are all learning to manage, so we need to pay close attention to the risks it can introduce. How you deploy software and manage patches and updates has a direct impact on your business's threat landscape.
Cloud operating models
In its Operational Excellence Pillar whitepaper, AWS outlines four operating models in the context of engineering and operations. AWS looks at engineering as the process of developing and testing applications and infrastructure, while operations is responsible for deploying and maintaining those applications and infrastructure in production. But it isn’t always this straightforward, and every business has its own processes, which is why the whitepaper discusses four operating models that businesses can use:
- Fully Separated Operating Model
- Separated Application Engineering and Operations (AEO) and Infrastructure Engineering and Operations (IEO) with Centralized Governance
- Separated AEO and IEO with Centralized Governance and a Service Provider
- Separated AEO and IEO with Decentralized Governance
Note: it may be necessary to alter your business culture to conform to any one of these models.
Prepare for operational excellence
The next area up is “prepare”, which is where you start to get into work software developers are more familiar with. However, just because it is more familiar doesn’t mean it is more important than organization. Without proper organization in your business and processes, it would be very difficult to address the other three areas required to fulfill your business's strategy.
AWS has broken prepare into four things that we need to do:
- Design telemetry
- Improve flow
- Mitigate deployment risks
- Understand operational readiness
Design telemetry into your cloud workloads
Telemetry provides you with information on the current health and risk level of your applications and infrastructure, giving you the ability to better manage and respond effectively to events or incidents. This is done predominantly with logs and metrics. Trend Micro and its Trend Micro Cloud One™ Conformity Knowledge Base provide steps that you can take to confirm AWS CloudTrail is enabled or Amazon CloudWatch Logs are encrypted with instructions on how to remediate according to best practice. It is also good to ensure that you have metrics configured to monitor things like the functional status of your APIs.
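As a rough illustration of a telemetry check like the CloudTrail rule mentioned above, the sketch below evaluates data in the shape returned by CloudTrail's DescribeTrails and GetTrailStatus APIs. The health criteria are a simplified assumption, and the boto3 calls are left commented so the snippet runs without AWS credentials:

```python
# Hedged sketch: deciding whether a CloudTrail trail looks healthy based on
# fields that DescribeTrails and GetTrailStatus return. Fetching the data is
# left commented; the helper works on already-retrieved dictionaries.

# import boto3
# client = boto3.client("cloudtrail")
# trails = client.describe_trails()["trailList"]
# status = client.get_trail_status(Name=trails[0]["Name"])

def trail_is_healthy(trail: dict, status: dict) -> bool:
    """Simplified criteria: the trail is multi-region and actively logging."""
    return bool(trail.get("IsMultiRegionTrail")) and bool(status.get("IsLogging"))

if __name__ == "__main__":
    print(trail_is_healthy({"IsMultiRegionTrail": True}, {"IsLogging": True}))
```

Separating "fetch" from "evaluate" like this also makes the check itself easy to unit test.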
You can audit your environment manually against 750+ industry best practice articles, or give our free trial a shot and have your entire environment audited automatically and continuously in real time.
Improve your cloud workload flow
AWS says we need to adopt approaches that “enable refactoring, fast feedback on quality, and bug fixing”. Improving the way changes flow into production is what AWS is pointing to here. So, it is essential to have version control and ensure that you test and validate any changes before they reach production.
As a result, configuration management is a crucial topic. This relates back to one of the design principles: making small, frequent, and reversible changes. It is good to set up services such as Amazon Simple Notification Service (Amazon SNS) to receive messages from services like AWS CloudFormation. Receiving a notification when stack events occur, such as create, update, and delete, allows for a faster response to unauthorized actions.
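CloudFormation's stack event notifications arrive through Amazon SNS as plain-text Key='Value' lines, so a small parser is enough to pick out the stack status a handler should react to. This is a minimal sketch; a production handler would also validate the message source:

```python
# Parse the Key='Value' lines of a CloudFormation SNS stack event
# notification into a dictionary. Minimal sketch, no validation.

def parse_stack_event(message: str) -> dict:
    fields = {}
    for line in message.strip().splitlines():
        key, _, value = line.partition("=")
        fields[key] = value.strip("'")
    return fields

if __name__ == "__main__":
    sample = "StackName='demo'\nResourceStatus='UPDATE_COMPLETE'"
    print(parse_stack_event(sample)["ResourceStatus"])
```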
Deployment risk mitigation processes
There are many steps you can take to mitigate deployment risks, but before taking any of them, it is crucial to adopt the attitude that changes pushed to production don’t always work. This mindset helps you stay prepared. Before pushing to production, always look for what could cause a failure:
- Use deployment management systems
- Deploy small changes
- Know how to reverse your changes before you make them
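The third point can be sketched as a toy deployment manager: knowing how to reverse a change before you make it simply means recording the previous version at deploy time. This illustrates the idea only; it is not a real deployment system:

```python
# Toy deployment manager: every deploy records the previous version, so a
# rollback is always a known, single-step operation.

class Deployer:
    def __init__(self):
        self.history = []   # stack of previously deployed versions
        self.current = None

    def deploy(self, version: str) -> None:
        """Deploy a small change, remembering what it replaced."""
        if self.current is not None:
            self.history.append(self.current)
        self.current = version

    def rollback(self) -> str:
        """Reverse the most recent change."""
        if not self.history:
            raise RuntimeError("nothing to roll back to")
        self.current = self.history.pop()
        return self.current

if __name__ == "__main__":
    d = Deployer()
    d.deploy("v1.0")
    d.deploy("v1.1")
    print(d.rollback())
```

Because each deployment is small, rolling back one step is also a small, low-risk change.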
Understand your operational readiness
Once you understand what operational readiness is, the next step is to verify that your personnel are just as knowledgeable, so they can provide operational support. From there, you’ll want to determine whether you’ve automated everything you can.
Operate your cloud workloads
The third area is “operate”, which includes three key understandings required to successfully manage cloud operations and ensure you achieve your business outcomes. AWS says it is critical to:
- Understand workload health
- Understand operational health
- Respond to events
Understanding the health of your workloads or operations comes down to metrics. To know how to improve, you must be able to show how things are functioning and how your customers are interacting with your sites. Enabling logging in Amazon CloudWatch Logs and then aggregating those logs for analysis is very important. These logs can help generate the information needed to produce the metrics you need to improve operations, and operational events can be delivered through AWS Health events on the AWS Personal Health Dashboard. The Conformity Knowledge Base also has rules to assist in the creation of logs and health events. You can apply these rules manually, or use an automated tool like Trend Micro Cloud One™ – Conformity, which is always looking for misconfigurations.
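For example, a CloudWatch Logs metric filter can turn aggregated log lines into the metrics described above. The log group, metric names, and filter pattern below are hypothetical placeholders; the boto3 call is commented so the snippet runs offline:

```python
# Sketch of a CloudWatch Logs metric filter that counts "ERROR" lines as a
# custom metric. All names here are illustrative placeholders.

METRIC_FILTER = {
    "logGroupName": "/example/app",        # hypothetical log group
    "filterName": "ApiErrorCount",
    "filterPattern": '"ERROR"',            # match log lines containing ERROR
    "metricTransformations": [
        {
            "metricName": "ApiErrors",
            "metricNamespace": "Example/App",
            "metricValue": "1",            # each matching line counts as 1
        }
    ],
}

# To apply it for real (requires AWS credentials):
# import boto3
# boto3.client("logs").put_metric_filter(**METRIC_FILTER)

if __name__ == "__main__":
    print(METRIC_FILTER["metricTransformations"][0]["metricName"])
```

The resulting metric can then drive alarms and dashboards, closing the loop from raw logs to operational insight.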
Optimize your AWS Systems Manager OpsCenter
Once the logs are created, delivered, and analyzed, it is possible to respond to an event. In ITIL® language, an event is a change of state. Events may be planned and monitored, or unplanned and problematic. With the latter, we need to ensure that we are able to respond effectively.
AWS Systems Manager OpsCenter is a central place to manage issues. You can view, investigate, and resolve issues within this tool, while ensuring that information is kept confidential. There is a Conformity rule for this: SSM Parameter Encryption. And as with all the rules, it is included in the Conformity automated tool. When beginning the path to operational excellence, having an automated tool that analyzes your cloud for missing configurations is essential.
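As a sketch of the SSM Parameter Encryption practice, storing a parameter as a SecureString tells Systems Manager to encrypt its value with a KMS key. The parameter name and value below are placeholders, and the boto3 call is commented so the snippet runs offline:

```python
# Sketch of storing an encrypted SSM parameter. The name and value are
# placeholders; Type="SecureString" is what triggers KMS encryption.

PARAMETER = {
    "Name": "/example/db-password",    # hypothetical parameter path
    "Value": "replace-me",             # placeholder, never commit real secrets
    "Type": "SecureString",            # encrypt with a KMS key
    # "KeyId": "alias/example-key",    # optional customer-managed key
}

# To store it for real (requires AWS credentials):
# import boto3
# boto3.client("ssm").put_parameter(**PARAMETER, Overwrite=True)

if __name__ == "__main__":
    print(PARAMETER["Type"])
```

Omitting `KeyId` falls back to the account's default AWS-managed key, which is often enough to satisfy the encryption rule.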
Automate event detection and response
The next step is to automate responses to detected events. You can use Amazon CloudWatch Events to create rules that respond to specific triggers; otherwise, alarms might get missed. For example, the Conformity Knowledge Base and the Conformity tool include alarms to alert you when costs approach a threshold you have defined.
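A CloudWatch Events (EventBridge) rule matches incoming events against a declarative pattern like the one below. The pattern follows the real event-pattern shape, but the matcher is a simplified stand-in written for illustration; the actual service supports much richer matching:

```python
# Illustrative CloudWatch Events (EventBridge) rule pattern, plus a
# simplified matcher for demonstration. The real service evaluates
# patterns server-side with more matching options than shown here.

EVENT_PATTERN = {
    "source": ["aws.ec2"],
    "detail-type": ["EC2 Instance State-change Notification"],
    "detail": {"state": ["stopped"]},
}

def pattern_matches(pattern: dict, event: dict) -> bool:
    """Simplified semantics: every pattern key must list the event's value."""
    for key, allowed in pattern.items():
        value = event.get(key)
        if isinstance(allowed, dict):
            if not isinstance(value, dict) or not pattern_matches(allowed, value):
                return False
        elif value not in allowed:
            return False
    return True

if __name__ == "__main__":
    event = {
        "source": "aws.ec2",
        "detail-type": "EC2 Instance State-change Notification",
        "detail": {"state": "stopped"},
    }
    print(pattern_matches(EVENT_PATTERN, event))
```

In practice the rule's target would be an SNS topic or Lambda function, so a matching event triggers a response automatically rather than waiting for someone to notice an alarm.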
Evolve to operational excellence
The final area is “evolve”. AWS believes that, in the context of the cloud, to properly evolve you must learn, share, and improve. For example, use your post-incident meetings to learn from what has occurred and make improvements for the future. There needs to be a process to manage and promote continuous improvement in an effort to change behaviors that are not working.
As more security breaches hit the news and data protection becomes a key focus, ensuring your organization adheres to the well-architected framework’s design principles is crucial. Conformity can help you stay compliant with the well-architected framework through its 750+ best practice rules. As mentioned above, if you are interested in knowing how well-architected you are, see your own security posture in 15 minutes or less. Learn more by reading the other articles in the series:
- Overview of All 5 Pillars
- Security Pillar
- Performance Efficiency Pillar
- Reliability Pillar
- Cost Optimization Pillar