Every time there’s a major IT failure (particularly when it involves large-scale data centers and public services) the self-proclaimed "experts" and pundits are quick to point out how such a failure would never happen (sometimes adding “if only they used whatever solution I prefer”).
Engineers with long-enough real-life operational experience take a more realistic view:
1) Failures are inevitable;
2) Redundancy doesn’t prevent failures, but reduces their likelihood (the proof is left as an exercise for readers with at least minimal interest in statistics and probability);
3) Too much redundancy or complexity can cause failures, or exacerbate them.
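The probability "exercise" from the list above takes only a few lines to work through: if two components fail independently with probability p, both fail at the same time with probability p². A quick sketch (the numbers are illustrative):

```python
# Back-of-the-envelope math for the redundancy claim above: two
# independent replicas that each fail with probability p both fail
# simultaneously with probability p**n -- less likely, never impossible.

def redundant_failure_probability(p: float, n: int = 2) -> float:
    """Probability that all n independent replicas fail at once."""
    return p ** n

# A component that is down 1% of the time (99% availability):
p = 0.01
print(redundant_failure_probability(p, 2))  # 0.0001 -> 99.99% availability
print(redundant_failure_probability(p, 3))  # ~1e-06 -> 99.9999%
```

Note the crucial assumption: the failures must be independent. That assumption is exactly what the tightly-coupled designs discussed below violate.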
Note: for an interesting overview of large-scale failures and their root causes, read this ACM article.
However, we often see two extreme ways of dealing with this unpleasant reality (with plenty of gray area between the two extremes):
Architects of truly robust solutions know that the only thing that matters is how you deal with the inevitable failure. Companies using this approach develop failure-resilient applications that also happen to be highly scalable as a result of correct architectural choices. These choices include:
1) Having a gracefully degrading architecture that can continue providing at least minimal service even when encountering failures;
2) Using eventual consistency instead of strict transactional consistency whenever possible;
3) Building scale-out architectures with isolated swimlanes.
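Graceful degradation (point 1 above) can be as simple as wrapping a dependency call in a fallback, so a broken downstream service produces generic content instead of an error page. A minimal sketch; `fetch_recommendations` and `DEFAULT_RECOMMENDATIONS` are hypothetical names, not any particular product's API:

```python
# Minimal graceful-degradation sketch: serve a precomputed fallback when
# a downstream dependency fails instead of failing the whole request.
# All names are illustrative.

DEFAULT_RECOMMENDATIONS = ["top-10 list"]  # static, always available

def fetch_recommendations(user_id: str) -> list[str]:
    # Stand-in for a call to a personalization service; here it
    # simulates an outage.
    raise TimeoutError("personalization service unreachable")

def recommendations(user_id: str) -> list[str]:
    try:
        return fetch_recommendations(user_id)
    except Exception:
        # Degrade gracefully: generic content beats an error page.
        return DEFAULT_RECOMMENDATIONS

print(recommendations("alice"))  # falls back to the default list
```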
The ultimate example of this approach is Netflix with their Simian Army – automated mechanisms that continuously kill virtual machines, hinder server performance, or introduce latency into the application stack to stress its components and identify potential weak or non-resilient points.
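The core idea behind such chaos tools can be sketched in a few lines: pick a random victim from the fleet, terminate it, and verify the service still responds. A toy model only – the class and function names are illustrative, not Netflix's actual tooling:

```python
import random

# Toy chaos-monkey loop: kill a random instance, then verify the
# service still answers. All names here are illustrative.

class Instance:
    def __init__(self, name: str):
        self.name = name
        self.alive = True

    def terminate(self) -> None:
        self.alive = False

def service_healthy(fleet: list["Instance"]) -> bool:
    # The service survives as long as at least one instance is up.
    return any(i.alive for i in fleet)

fleet = [Instance(f"web-{n}") for n in range(3)]
victim = random.choice(fleet)
victim.terminate()
assert service_healthy(fleet), "no redundancy -- is this a single-instance fleet?"
```

Running this continuously in production (rather than once in a lab) is what separates chaos engineering from ordinary failover testing.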
If you’re interested in learning more about robust application architectures, you might find the Scalability Rules book, webinar and workshop useful.
The alternate view tries to avoid failures altogether by relying on high-availability Magic deep within the infrastructure. Believers in this approach commonly use:
1) Stretched high-availability clusters spanning multiple data centers;
2) Active/active storage arrays across multiple data centers;
Most of these solutions are highly complex and tightly coupled, resulting in a single huge failure domain. You could compare them to electronic 4-wheel-drive mechanisms: great as long as they work correctly, but spectacular when they fail.
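The "single huge failure domain" problem is easy to quantify: once the redundant halves share a common failure mode (the tight coupling described above), that shared mode dominates the outage probability no matter how reliable the individual components are. A rough model with illustrative numbers:

```python
# Two replicas, each failing independently with probability
# p_independent, plus a shared (common-mode) failure with probability
# p_common that takes out both at once. Numbers are illustrative.

def outage_probability(p_independent: float, p_common: float) -> float:
    both_independent = p_independent ** 2
    # Outage if the common mode fires, or if both replicas happen to
    # fail on their own.
    return p_common + (1 - p_common) * both_independent

print(outage_probability(0.01, 0.0))    # 0.0001 -> redundancy pays off
print(outage_probability(0.01, 0.005))  # ~0.0051 -> common mode dominates
```

Even a half-percent common-mode failure probability wipes out most of the benefit of the redundant pair, which is the mathematical version of the 4-wheel-drive analogy.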
Note: It’s not impossible to build well-architected high-availability infrastructure (like the Tandem NonStop OS), but such infrastructure usually provides high availability only within the compute and storage layers.
Now imagine you have to build a mission-critical application or supporting infrastructure. What should YOU do?
If you’re a small shop, deploy your workload on a VMware High Availability cluster (but please don’t stretch it across multiple locations), or in a public cloud with an automatic restart feature, and expect somewhere between 99% and 99.9% availability.
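To make those availability figures concrete, translate them into an annual downtime budget – 99% allows roughly 3.65 days of downtime per year, 99.9% roughly 8.8 hours:

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years

def allowed_downtime_hours(availability: float) -> float:
    """Annual downtime budget implied by an availability target."""
    return (1 - availability) * HOURS_PER_YEAR

print(allowed_downtime_hours(0.99))    # 87.6 hours (~3.65 days) per year
print(allowed_downtime_hours(0.999))   # ~8.76 hours per year
print(allowed_downtime_hours(0.9999))  # ~0.88 hours (~53 minutes) per year
```

Each additional "nine" cuts the budget by a factor of ten, which is why the jump past 99.9% forces the architectural changes described below.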
If you want to go beyond that, you should:
1) Stop believing in silver bullets marketed by infrastructure vendors;
2) Start with the business requirements, including realistic Recovery Time Objectives (RTO) and Recovery Point Objectives (RPO);
3) Realize that not everything is equally important: figure out what MUST be always available, what SHOULD be always reachable, and what COULD be unavailable for a while;
4) Applications that require more than 99.9% availability MUST have failure-resilient application architecture. Don’t try to solve that problem within the infrastructure;
5) Based on realistic requirements, design the least-complex infrastructure that can meet the requirements considering all infrastructure aspects, from databases and storage to compute, virtualization, networking and security.
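Steps 2 and 3 above often end up as a simple service-classification table mapping each workload to a tier with its recovery targets. A sketch with made-up service names and made-up RTO/RPO values:

```python
# Hypothetical service tiering derived from business requirements.
# Service names and RTO/RPO targets are made-up examples, not
# recommendations.

tiers = {
    "MUST":   {"rto_minutes": 5,    "rpo_minutes": 0,   "services": ["payments"]},
    "SHOULD": {"rto_minutes": 60,   "rpo_minutes": 15,  "services": ["catalog"]},
    "COULD":  {"rto_minutes": 1440, "rpo_minutes": 240, "services": ["reporting"]},
}

def requirements_for(service: str) -> tuple[str, int, int]:
    """Look up the tier and recovery targets for a given service."""
    for tier, spec in tiers.items():
        if service in spec["services"]:
            return tier, spec["rto_minutes"], spec["rpo_minutes"]
    raise KeyError(service)

print(requirements_for("payments"))  # ('MUST', 5, 0)
```

A table like this keeps the infrastructure discussion honest: only the MUST tier justifies expensive high-availability machinery, and everything else gets the least-complex design that meets its targets.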
Finally, you don’t have a highly available solution until you’ve proven it recovers from failures. Test, test, test… and then test some more. Netflix’s Simian Army performs continuous testing of their infrastructure. You might not be able to get anywhere close, but that doesn’t mean there’s no need to test application or infrastructure redundancy and resilience, or your disaster recovery plans. (Ivan Pepelnjak)
Ivan Pepelnjak, CCIE#1354 Emeritus, is an independent network architect, book author, blogger and regular speaker at industry events like Interop, RIPE and regional NOG meetings. He has been designing and implementing large-scale service provider and enterprise networks since 1990, and is currently using his expertise to help multinational enterprises and large cloud- and service providers design next-generation data center and cloud infrastructure using Software-Defined Networking (SDN) and Network Function Virtualization (NFV) approaches and technologies. Ivan is author of several Cisco Press books, and a series of highly successful webinars.