Post and Postfinance down: input from a network expert

4 May 2017, 11:57

Networking is in the Swiss spotlight. Ivan Pepelnjak, an internationally renowned network architect, explains in his English-language guest article: "Failure is inevitable, get used to it".

Every time there’s a major IT failure (particularly when it involves large-scale data centers and public services) the self-proclaimed "experts" and pundits are quick to point out how such a failure would never happen (sometimes adding “if only they used whatever solution I prefer”).
Engineers with long-enough real-life operational experience take a more realistic view:
1) Failures are inevitable;
2) Redundancy doesn’t prevent failures, but reduces their likelihood (the proof is left as an exercise for readers with at least minimal interest in statistics and probability);
3) Too much redundancy or complexity can cause failures, or exacerbate them.
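For readers who would rather not do the exercise in point 2 by hand, here is a minimal sketch of the arithmetic. It assumes that component failures are statistically independent, which real infrastructure rarely guarantees (shared power, shared software bugs), so the parallel result is best read as an optimistic bound. The function names are illustrative and not taken from the article.

```python
from typing import Iterable


def parallel_availability(availability: float, copies: int) -> float:
    """Availability of `copies` redundant components when one survivor is enough.

    Assumes independent failures: the system is down only if all copies are down.
    """
    return 1.0 - (1.0 - availability) ** copies


def series_availability(availabilities: Iterable[float]) -> float:
    """Availability of a chain in which every component must work."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


if __name__ == "__main__":
    print(f"two redundant 99% components: {parallel_availability(0.99, 2):.4%}")
    print(f"two chained 99% components:   {series_availability([0.99, 0.99]):.4%}")
```

Two redundant 99% components come out at about 99.99%: better, but never 100%. Chaining the same two components so that both must work gives about 98.01%, which is one way to see why extra moving parts (point 3) can make things worse.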
Note: for an interesting overview of large-scale failures and root causes read this ACM article.
However, we often see two extreme ways of dealing with this unpleasant reality (and lots of gray area in between the opposites):
Architects of truly robust solutions know the only thing that matters is how you deal with the inevitable failure. Companies using this approach develop failure-resilient applications that also happen to be highly scalable as a result of correct architectural choices. These choices include:
1) Having a gracefully degrading architecture that can continue providing at least minimal service even when encountering failures;
2) Using eventual consistency instead of strict transactional consistency whenever possible;
3) Building scale-out architectures with isolated swimlanes.
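The first choice in the list, graceful degradation, can be sketched in a few lines. This is only an illustration under stated assumptions: `fetch_with_fallback`, `live_recommendations` and `default_recommendations` are hypothetical names, and a real service would also need timeouts, logging and limits on how stale a fallback may be.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


def fetch_with_fallback(primary: Callable[[], T], fallback: Callable[[], T]) -> T:
    """Try the full-quality path first; on any failure serve a degraded answer."""
    try:
        return primary()
    except Exception:
        # Degrade instead of failing the whole request.
        return fallback()


def live_recommendations() -> List[str]:
    # Hypothetical backend call; here it simulates an outage.
    raise ConnectionError("recommendation service unreachable")


def default_recommendations() -> List[str]:
    # Static, precomputed content that needs no backend.
    return ["most popular item", "second most popular item"]


if __name__ == "__main__":
    print(fetch_with_fallback(live_recommendations, default_recommendations))
```

The user gets a less personalized answer rather than an error page, which is the "at least minimal service" idea from point 1.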
The ultimate example of this approach is Netflix with their Simian Army: automated mechanisms that continuously kill virtual machines, hinder server performance, or introduce latency into the application stack to stress its components and identify potential weak or non-resilient points.
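To make the fault-injection idea concrete, here is a toy sketch. It is not Netflix's tooling and does not use any of its APIs; `chaotic` and `InjectedFault` are hypothetical names, and the wrapper only injects faults inside a single Python process rather than killing machines.

```python
import random
import time
from typing import Any, Callable, Optional


class InjectedFault(Exception):
    """Raised when the wrapper deliberately fails a call."""


def chaotic(
    func: Callable[..., Any],
    failure_rate: float,
    max_delay_s: float = 0.0,
    rng: Optional[random.Random] = None,
) -> Callable[..., Any]:
    """Wrap `func` so that calls are randomly delayed or failed."""
    rng = rng or random.Random()

    def wrapper(*args: Any, **kwargs: Any) -> Any:
        if max_delay_s > 0:
            # Simulate added latency in the application stack.
            time.sleep(rng.uniform(0.0, max_delay_s))
        if rng.random() < failure_rate:
            # Simulate a dependency that has just died.
            raise InjectedFault(f"injected failure in {func.__name__}")
        return func(*args, **kwargs)

    return wrapper


def lookup_balance() -> int:
    # Hypothetical dependency being stressed.
    return 42


if __name__ == "__main__":
    flaky = chaotic(lookup_balance, failure_rate=0.3, rng=random.Random(1))
    outcomes = []
    for _ in range(10):
        try:
            outcomes.append(flaky())
        except InjectedFault:
            outcomes.append("fault")
    print(outcomes)
```

Wrapping dependencies like this in a test environment shows whether callers cope with failures and latency, or whether one slow component takes everything else down with it.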
If you’re interested in learning more about robust application architectures, you may find the Scalability Rules book, webinar and workshop useful.
The alternate view tries to avoid the failures by using high-availability Magic deep within the infrastructure. Believers in this approach commonly use:
1) Hypervisor-based high availability and fault tolerance features;
2) Active/active storage arrays across multiple data centers;
3) Data center fabrics or VLANs stretched across multiple data centers.
Most of these solutions are highly complex and tightly coupled, resulting in a single huge failure domain. You could compare them to the electronic 4-wheel-drive mechanisms: they are great as long as they work correctly, but result in spectacular failures when they fail.
Note: It’s not impossible to build well-architected high-availability infrastructure (like the Tandem NonStop OS), but it usually provides high availability only within the compute and storage layer.
Now imagine you have to build a mission-critical application or supporting infrastructure. What should YOU do?
If you’re a small shop, deploy your workload on a VMware High Availability cluster (but please don’t stretch it across multiple locations), or in a public cloud with an automatic restart feature, and expect to have somewhere between 99% and 99.9% availability.
Note: Do keep in mind that you could be down three days per year and still have better than 99% availability.
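The arithmetic behind these percentages is easy to check. The sketch below assumes a 365-day year and counts every hour as service time (no planned maintenance windows); the function names are illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: non-leap year, 24x7 service


def downtime_hours_per_year(availability: float) -> float:
    """Hours of downtime per year permitted by an availability target (0..1)."""
    return (1.0 - availability) * HOURS_PER_YEAR


def availability_from_downtime(hours_down: float) -> float:
    """Availability (0..1) achieved with the given yearly downtime in hours."""
    return 1.0 - hours_down / HOURS_PER_YEAR


if __name__ == "__main__":
    for target in (0.99, 0.999):
        print(f"{target:.1%} allows {downtime_hours_per_year(target):.1f} h/year")
    three_days = availability_from_downtime(3 * 24)
    print(f"3 days down per year is {three_days:.2%} availability")
```

A 99% target permits roughly 87.6 hours (about 3.65 days) of downtime per year and 99.9% roughly 8.8 hours, while three full days of downtime still works out to about 99.18% availability.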
If you want to go beyond that, you should:
1) Stop believing in silver bullets marketed by infrastructure vendors;
2) Start with the business requirements, including realistic Recovery Time Objective and Recovery Point Objective;
3) Realize that not everything is most important. Figure out what MUST be always available, what SHOULD be always reachable, and what COULD be unavailable for a while.
4) Applications that require more than 99.9% availability MUST have failure-resilient application architecture. Don’t try to solve that problem within the infrastructure;
5) Based on realistic requirements, design the least-complex infrastructure that can meet the requirements considering all infrastructure aspects, from databases and storage to compute, virtualization, networking and security.
Finally, you don’t have a highly available solution until you’ve proven it recovers from failures. Test, test, test… and then test some more. Netflix’s Simian Army performs continuous testing of their infrastructure. You might not be able to get anywhere close, but that doesn’t mean there’s no need to test application or infrastructure redundancy and resilience, or your disaster recovery plans. (Ivan Pepelnjak)
Ivan Pepelnjak, CCIE#1354 Emeritus, is an independent network architect, book author, blogger and regular speaker at industry events like Interop, RIPE and regional NOG meetings. He has been designing and implementing large-scale service provider and enterprise networks since 1990, and is currently using his expertise to help multinational enterprises and large cloud and service providers design next-generation data center and cloud infrastructure using Software-Defined Networking (SDN) and Network Function Virtualization (NFV) approaches and technologies. Ivan is the author of several Cisco Press books and a series of highly successful webinars.
Ivan Pepelnjak is a keynote speaker at the SIGS Technology Conference from 16 to 18 May.

