Last Thursday, April 21, 2011, Amazon Web Services' Elastic Compute Cloud (EC2) had an outage that impacted multiple Availability Zones. That morning, Amazon issued a status update attributing the outage to problems with the re-mirroring of Elastic Block Store (EBS) volumes: “This re-mirroring created a shortage of capacity in one of the US-EAST-1 Availability Zones, which impacted new EBS volume creation as well as the pace with which we could re-mirror and recover affected EBS volumes. Additionally, one of our internal control planes for EBS has become inundated such that it's difficult to create new EBS volumes and EBS backed instances.”
A certain amount of service outage is to be expected. However, this incident raises a couple of different concerns. One is that the Amazon Availability Zones did not work as represented. Amazon provides computing resources from different geographic regions. In addition, each region offers multiple Availability Zones, which are supposed to be engineered to be insulated from failures in other Availability Zones. However, in the recent outage, multiple Availability Zones were impacted, showing that they are not acting as advertised.
In addition, this incident went beyond just an availability issue. Amazon was not able to recover all of the volumes affected. On April 25, Amazon issued the following in an update: “We've determined that a small number of volumes (0.07% of the volumes in our US-East Region) will not be fully recoverable.” This seems like a small amount, but 0.07% of what? Depending on how many volumes Amazon hosts in that region, the absolute number could be considerable. And if you’re one of the customers impacted, you don’t care how small the number is overall.
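To make the "0.07% of what?" question concrete, here is a minimal Python sketch that applies the 0.07% figure from Amazon's update to a few fleet sizes. The fleet sizes are purely hypothetical and chosen only for illustration; Amazon did not publish the number of volumes in the region, and the function name is my own.

```python
def unrecoverable_volumes(total_volumes, per_ten_thousand=7):
    """Estimate lost volumes for a given fleet size.

    0.07% is the same as 7 per 10,000; integer arithmetic avoids
    floating-point rounding noise.
    """
    return total_volumes * per_ten_thousand // 10_000


if __name__ == "__main__":
    # Hypothetical fleet sizes, NOT figures reported by Amazon.
    for fleet in (100_000, 1_000_000, 10_000_000):
        lost = unrecoverable_volumes(fleet)
        print(f"{fleet:>10,} volumes -> {lost:,} not fully recoverable")
```

The same tiny percentage turns into tens, hundreds, or thousands of lost volumes depending on a base number that customers cannot see.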
Should this Amazon outage give cause for concern? Should it make businesses limit their cloud adoption? These are two separate questions. Yes, it should give cause for concern, but this should be a cautionary tale that influences how companies approach the cloud, not whether they approach the cloud.
This was certainly a significant outage. However, Amazon has generally done a good job of maintaining service availability. And Amazon has built out its infrastructure with failover and load balancing beyond what most businesses are able to deploy in an on-premise data center. Although using an on-premise data center may give companies a feeling of more control, especially when an incident like this occurs, in truth they will most likely get better availability through a provider that is dedicated to offering on-demand computing services.
Service Level Agreements (SLAs) can provide some assurances. But they are more an indication of what the service provider expects to deliver than an adequate remedy when it falls short. In this case, the Amazon SLA offers 99.95 percent availability in each region if the customer uses multiple Availability Zones. However, SLAs are going to be limited in applicability and liability, generally offering only a fraction of what a company can lose during downtime (or with actual data loss). Also, the assurances in an SLA are not an absolute guarantee. They are a calculated business risk by the provider, balancing what it feels it can provide against what it would be willing to pay out if it cannot deliver at that level. Therefore, SLAs should be considered general guidance when selecting a cloud-computing service provider.
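It can help to translate an availability percentage into the downtime it still permits. The sketch below does that arithmetic for the 99.95 percent figure mentioned above, alongside a few other common targets for comparison. It assumes availability is measured over a 365-day year; the actual measurement window and exclusions are defined by the provider's SLA terms, and the function name is my own.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: a non-leap year as the measurement window


def allowed_downtime_hours(availability_percent, period_hours=HOURS_PER_YEAR):
    """Return the downtime (in hours) that still meets the availability target."""
    unavailable_fraction = (100.0 - availability_percent) / 100.0
    return unavailable_fraction * period_hours


if __name__ == "__main__":
    # 99.95 is the figure cited in the text; the others are for comparison only.
    for target in (99.0, 99.9, 99.95, 99.99):
        hours = allowed_downtime_hours(target)
        print(f"{target}% availability allows about {hours:.2f} hours of downtime per year")
```

Under these assumptions, a 99.95 percent target still leaves room for a little over four hours of downtime per year, which is why the percentage alone says little about how a business would fare during an extended outage.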
This incident certainly shows that companies need to set up proper redundancies. Amazon customers will undoubtedly want to know why the Availability Zones did not work as advertised. There are benefits to deploying redundancies in the same region, but they obviously have to work correctly. Alternatively, companies can use services in separate regions, but this additional level of protection comes at a higher cost and can add latency. There is also the option of using multiple cloud-computing service providers or creating a hybrid cloud that uses both third-party services and an on-site data center. However, each of these options is progressively more costly and more burdensome to manage.
Ultimately, companies must choose a deployment model that fits their comfort level, based on how critical their data is and the level of resources they have. Cloud computing can provide significant benefits to business agility and cost savings, and can often be maintained more efficiently than on-premise resources. Businesses just need to ensure they build a structure that supports their needs, including redundancies as well as security. Throughout this recent incident, Amazon has repeatedly promised to issue a post mortem that explains the root causes and what was learned. Hopefully this information will help guide Amazon customers on how to structure their services: multiple Availability Zones, different regions, multiple providers, or more of a hybrid solution.
[Ed. note: Trend Micro would like to know what you think about this. We enthusiastically invite your comments and we will read every one of them. For very detailed information: