The myth of the always-on cloud
One of the big promises of cloud computing is the idea of always-on. The cloud as a whole – infrastructure, platform and software – is supposed to be available at all times. Service providers at every layer offer availability guarantees of 99% and above, and everyone claims that their services are resilient and failure-tolerant. While the track record of different providers can vary wildly, the real problem is that clients often forget that even 99.99% availability does not mean that a service will be accessible all of the time.
Let’s disregard for a moment the large differences between what each provider describes and considers as being available, and look exclusively at the numbers. 99.99% uptime of a service over the course of a year means that the service can be offline for about 52.56 minutes per year, or roughly one minute per week. This could account, for instance, for a server reboot every other week or so. As the uptime guarantee decreases, the downtime numbers obviously grow: for 99.9% uptime, the service can be offline for 525.6 minutes, or 8.76 hours, in a year; for 99% uptime, it would be 5256 minutes, or 87.6 hours, which is more than an hour and a half every week. While some of these numbers may seem small, they can severely impact systems and processes that aren’t ready for them.
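To make the arithmetic easy to repeat for other guarantees, here is a minimal sketch in Python. The function name and the 365-day year are assumptions made for illustration; a real SLA may measure availability over a month or a billing period instead.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Yearly downtime, in minutes, that an uptime guarantee still allows."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for pct in (99.99, 99.9, 99.0):
        per_year = downtime_minutes_per_year(pct)
        per_week = per_year / 52  # rough weekly share of the yearly budget
        print(f"{pct}% uptime: {per_year:.2f} min/year, {per_week:.2f} min/week")
```

Running it for 99.99%, 99.9% and 99% gives the same yearly figures quoted above.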
The first thing that anyone who is looking to use cloud-based services of any kind must consider is the possibility of failure: what happens when I try to call the service and get back an error response or, even worse, get no response back? Retrying a request is an obvious answer, but also a problematic one. If a service goes offline for a significant amount of time, a “retry loop” can trap an application or create unexpected situations from which it can’t recover.
Issuing many retries can also create a bottleneck at the receiving service and make the original problem harder to fix. An interesting example of this was last year’s October outage of AWS, when it stopped accepting requests for creating EBS volumes and EC2 instances due to an excessive number of errors. In this sense, retrying requests can create a cascade of failures in interdependent services that becomes harder and harder to recover from.
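One common way to keep retries from turning into a trap or a stampede is to cap the number of attempts and to wait a randomized, growing delay between them. The sketch below is only an illustration of that idea, not a prescription: ServiceUnavailable, call_with_retries and all the default values are hypothetical names and numbers chosen for the example.

```python
import random
import time


class ServiceUnavailable(Exception):
    """Hypothetical error raised when the remote service fails or times out."""


def call_with_retries(operation, max_attempts=5, base_delay=0.5,
                      max_delay=30.0, sleep=time.sleep):
    """Call operation(), retrying a bounded number of times with backoff.

    The delay ceiling doubles after each failure (capped at max_delay) and
    the actual wait is picked at random below that ceiling, so that many
    clients do not all retry at the same instant.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except ServiceUnavailable:
            if attempt == max_attempts - 1:
                raise  # retry budget exhausted: let the caller decide what to do
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, ceiling))
```

The important design choice is that the loop always ends: after max_attempts failures the error reaches the caller, which can then fall back to a degraded mode, queue the work for later, or alert an operator instead of retrying forever.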
Handling failures means having contingencies in place for unforeseen and unexpected situations. This means not only reacting to and dealing with obvious failures, but also handling situations that don’t clearly represent a failure. Let’s take a system that automatically launches virtual machine instances to do some processing: it must be prepared to handle the situation where the request for a new virtual machine is denied, but also the situation where it receives a normal response to the request and the virtual machine is never launched.
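The second case, an accepted request that never produces a running machine, can only be detected by checking the outcome against a deadline. Here is a minimal sketch, assuming a hypothetical get_state callable that asks the provider for the instance state and returns a string such as "pending", "running" or "failed"; the state names and timings are illustrative and will differ between providers.

```python
import time


class LaunchTimeout(Exception):
    """Raised when an accepted launch request never yields a running instance."""


def wait_until_running(get_state, timeout=300.0, poll_interval=5.0,
                       clock=time.monotonic, sleep=time.sleep):
    """Poll get_state() until it reports "running" or the deadline passes."""
    deadline = clock() + timeout
    while True:
        state = get_state()
        if state == "running":
            return
        if state in ("failed", "terminated"):
            # An explicit failure: no point in waiting any longer.
            raise RuntimeError(f"instance entered state {state!r}")
        if clock() >= deadline:
            # The "silent" failure: the request was accepted, nothing happened.
            raise LaunchTimeout(f"instance not running after {timeout} seconds")
        sleep(poll_interval)
```

A caller that catches LaunchTimeout can then clean up the stuck request and try another zone or provider, rather than waiting indefinitely for a machine that will never appear.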
Availability issues become even more pronounced when we consider the track record of service providers. While almost everyone will promise aggressive SLAs (99%+ uptime), the fact is that many of the top-tier providers routinely fail to deliver the promised levels of availability. If the systems that make use of these providers aren’t ready to handle a provider being unavailable for long periods of time, they will fail spectacularly in real life.
Another important point to take into consideration is that a system can only be as available as its underlying components. Many cloud software-as-a-service providers offer uptime guarantees that they can’t hope to match in real life, because these guarantees surpass what they are getting from their own infrastructure providers. If you’re looking for cloud-based software, always beware excessive promises. At the same time, take into consideration what it means to only have 99% availability: if the software were to stop working, can your business survive?
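A rough way to see why a service can’t be more available than what it runs on is to multiply the availabilities of the components it depends on. The sketch below assumes that every component is required for the service to work and that failures are independent, which is a simplification; the function name and the example figures are illustrative only.

```python
from math import prod


def composite_availability(*availabilities: float) -> float:
    """Availability of a chain in which every component must be up.

    Each argument is a fraction between 0 and 1 (0.999 means 99.9%).
    Assumes independent failures, so the result is the plain product.
    """
    return prod(availabilities)


if __name__ == "__main__":
    # Hypothetical SaaS application (99.99%) on infrastructure offering 99.9%.
    combined = composite_availability(0.9999, 0.999)
    print(f"Best-case combined availability: {combined:.4%}")
```

Under these assumptions the combined figure is always lower than the weakest component, so a software provider promising 99.99% on top of 99.9% infrastructure is promising more than the chain can deliver unless it adds redundancy.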
None of the issues discussed here are new. Most of them have been around since the advent of client-server architectures some decades ago, but sometimes users and developers forget the lessons of the past just because they are dealing with new technology. By remembering that cloud-based services are just like any other IT system and that eventual failures are expected, we can avoid many headaches when moving to the cloud.