On September 20th, AWS suffered an outage of its DynamoDB database service that impacted around 20 other AWS services and affected major customers such as Airbnb, Docker, Netflix and Reddit. An outage at the world’s leading cloud provider is certainly news in the cloud world, but to its credit Amazon took just two hours to nail down the root cause and a little over five hours to get back to full speed. Amazon’s detailed and surprisingly open statement can be checked here.
What exactly happened?
The sequence of events follows a familiar pattern: several incidents, each minor on its own, coincide and suddenly produce a disastrous outcome:
- To start with, there was a brief network disruption, impacting a portion of DynamoDB’s storage servers.
- DynamoDB’s servers would normally have handled such situations seamlessly, but a new feature called Global Secondary Indexes (“GSIs”) resulted in unpredictable resource demand.
- Amazon, for its part, did not anticipate this surge in demand and had allocated less capacity than required.
- Finally, the cascading impact of failed requests led to an astounding error rate of 55% in customer requests to DynamoDB, blocking the service even further.
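The last step, failed requests making the situation worse, can be illustrated with a toy model. The sketch below is not a model of DynamoDB's internals. It simply assumes a service with a fixed capacity per time step and clients that retry every failed request on the next step. The function name and all numbers are hypothetical.

```python
def simulate_retries(capacity, new_requests, ticks):
    """Return the error rate per tick for a service whose failed requests are retried.

    Toy assumption: every request that fails in one tick is sent again
    in the next tick, on top of the regular new requests.
    """
    pending_retries = 0
    error_rates = []
    for _ in range(ticks):
        offered = new_requests + pending_retries   # total load in this tick
        served = min(offered, capacity)            # the service cannot exceed its capacity
        failed = offered - served
        error_rates.append(failed / offered if offered else 0.0)
        pending_retries = failed                   # failures come back as retries
    return error_rates


# Demand below capacity: nothing fails and nothing piles up.
healthy = simulate_retries(capacity=100, new_requests=90, ticks=5)

# Demand slightly above capacity: retries accumulate and the error rate keeps rising.
overloaded = simulate_retries(capacity=100, new_requests=110, ticks=5)
```

In this simplified model a shortfall of only 10% in capacity is enough for the error rate to climb with every step, which is why under-allocated capacity and retried requests are such an unfortunate combination.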
What do we learn?
Many cloud sceptics will use this incident to point out the risks of cloud computing. But are they right? We take three major lessons from this event:
- Firstly, the outage was restricted to Platform as a Service (PaaS). The load on PaaS is less predictable than on Infrastructure as a Service (IaaS) because a platform reacts to workload in an application-specific way. While IaaS has only a few simple hardware parameters, which can easily be monitored from the outside to identify machines for relocation and so balance the system, PaaS requires more insight into what is actually happening inside the container, making management more difficult. And – as happened in this case – the management of nodes runs at the same priority level as the workload itself. This may result in exactly the behaviour that was observed: unmanageability.
This should motivate PaaS providers to focus on additional management capabilities to add more value than just managing images and clusters.
- IaaS proved to work (the re-routing of Netflix services worked as expected) and remained fully controllable. But using IaaS is no insurance by itself; emergency plans and failover strategies are still essential. Despite this outage, Netflix was back in operation quite fast thanks to its failover concept (see the link to Netflix concepts). Multi-location and/or multi-vendor strategies remain relevant.
- Lastly, in the case of PaaS, the onus of planning for recovery from failure lies with both the provider and the user:
  - Amazon, as a lesson from this outage, now monitors performance dimensions more strictly, which will help it plan proactively for the right capacity,
  - and users like Netflix deliberately carry out disruptive exercises from time to time that create havoc in their own systems. By constantly inducing failures in its systems, the firm is able to shore itself up against problems like those that affected AWS.
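The combination of failover and deliberately induced failures described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Netflix's actual tooling. The classes, functions and region labels are all hypothetical, and a real drill would act on live infrastructure rather than on in-memory objects.

```python
import random


class RegionDown(Exception):
    """Raised by a simulated region that has been switched off."""


class Region:
    """Hypothetical stand-in for one deployment location."""

    def __init__(self, name):
        self.name = name
        self.healthy = True

    def handle(self, request):
        if not self.healthy:
            raise RegionDown(self.name)
        return f"{request} served by {self.name}"


def call_with_failover(regions, request):
    """Try each region in order and return the first successful response."""
    for region in regions:
        try:
            return region.handle(request)
        except RegionDown:
            continue  # this location is out, move on to the next one
    raise RuntimeError("all regions are down")


def inject_failure(regions, rng):
    """Switch off one randomly chosen region, as a failure drill would."""
    victim = rng.choice(regions)
    victim.healthy = False
    return victim


# A drill: knock out one location at random, then check that requests still succeed.
regions = [Region("location-a"), Region("location-b")]
victim = inject_failure(regions, random.Random())
response = call_with_failover(regions, "GET /title")
assert victim.name not in response
```

The point of running such a drill regularly is that the failover path gets exercised before a real outage does it for you.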
So none of our lessons is to stop investing in the cloud! We at ASCAMSO are firm believers in anti-fragility, and in our view this outage is not a downfall but a step towards cloud offerings that are more solid and more consistent in delivering quality of service. The financial case for IaaS remains undisputed, and the right combination of IaaS, PaaS and SaaS opens new dimensions of agility and independence to corporate IT. Those who use this or similar events to prove the need for on-premise IT will eventually find themselves stuck with limited budgets, trying to cope with ever-accelerating technology and a lack of skills to manage it. To relate this to a more established industry: almost no car manufacturer knows how to build brakes anymore, but they all know where to purchase brake systems.
If you seek support in finding the right partner for your cloud endeavour, connect with us at ASCAMSO. We are deeply engaged in simplifying this complex matrix of quality and performance evaluation: we have conducted thousands of tests on providers, studied their capabilities and can provide extensive insight. Beyond capabilities and performance, we also keep an eye on pricing, which allows us to deliver not only a unique Cloud Service Provider Rating but also clear cross-provider price comparisons.