Thinkers360

Systematic and Chaotic Testing: A Way to Achieve Cloud Resilience

May



In today’s digital era, where downtime translates to lost business, it is imperative to build resilient cloud infrastructure. During the pandemic, for example, IT maintenance teams could no longer be on premises to reboot a server in the data center. If on-premises hardware goes down, access to data and software is blocked, productivity stalls, and the business suffers overall losses. The solution is to move IT operations to cloud infrastructure backed by 24/7, round-the-clock support from remote teams. The cloud essentially poses as a savior here.

Recently, companies have been using the cloud to its full potential, and so the observability and resilience of cloud operations become imperative, as downtime now equates to disconnection and business loss.

Imagining a cloud failure in today’s technology-driven business economy would be disastrous. Any fault or disruption can trigger a domino effect, hampering the company’s system performance. Hence, it becomes essential for organizations to build resilience into their cloud structures through chaotic and systematic testing. In this blog, I will take you through what resilience and observability mean, and why resilience and chaos testing are vital to avoid downtime.

To avoid cloud failure, enterprises must build resilience into their cloud architecture by testing it in continuous and chaotic ways.

1.) Observability

Observability can be understood through two lenses. One comes from control theory, which defines observability as the process of understanding the state of a system by inference from its external outputs. The other treats observability as a discipline and an approach built to gauge uncertainties and unknowns.

Observability is a property of a system or application. For cloud computing, it is a prerequisite that leverages end-to-end monitoring across various domains, scales, and services. Observability shouldn’t be confused with monitoring: monitoring watches predefined metrics and alerts on known failure conditions. Monitoring tells you when something goes wrong, whereas observability helps you understand why it went wrong. They each serve a different purpose but certainly complement one another.
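To make the distinction concrete, here is a minimal Python sketch (all names are hypothetical): a monitoring check fires when a metric crosses a threshold, while an observability event carries enough context to ask why something went wrong.

```python
import json
import time

# Monitoring: a threshold check that tells you *when* something is wrong.
def check_latency(latency_ms: float, threshold_ms: float = 500) -> bool:
    """Return True if the latency breaches the alert threshold."""
    return latency_ms > threshold_ms

# Observability: a structured event rich enough to ask *why* it went wrong.
def emit_event(service: str, operation: str, latency_ms: float, **context) -> str:
    """Serialize a structured event with arbitrary extra context."""
    event = {
        "ts": time.time(),
        "service": service,
        "operation": operation,
        "latency_ms": latency_ms,
        **context,  # e.g. trace IDs, upstream status codes
    }
    return json.dumps(event)

alert = check_latency(750)  # the monitor fires: 750 ms > 500 ms
event = emit_event("checkout", "charge_card", 750,
                   trace_id="abc123", upstream_status=503)
```

The event tells an engineer not just that latency spiked, but which operation, which trace, and that an upstream dependency returned 503.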

Observability along with resilience are needed for cloud systems to ensure less downtime, faster velocity of applications, and more.  

2.) Resilience

  • Stability: Is it on and reachable?
  • Reliability: Will it work the way it should, consistently and when we need it to?
  • Availability: Is it reliably accessible from anywhere, at any time?
  • Resilience: How does the system respond to challenges so that it’s available reliably?

Every enterprise migrating to cloud infrastructure should test its systems for stability, reliability, availability, and resilience, with resilience at the top of the hierarchy. Stability ensures that systems and servers do not crash often; reliability ensures that cloud systems function efficiently; availability ensures uptime by distributing applications across different locations to spread the workload. But if the enterprise wants to tackle unforeseen problems, constantly testing resilience becomes indispensable.

Resilience is the expectation that something will go wrong, and that the system is tested in a way that lets it address and maneuver around that problem. The resilience of a system isn’t automatically achieved. A resilient system acknowledges complex systems and problems and progressively takes steps to counter errors. It requires constant testing to reduce the impact of a problem or failure. Continuous testing avoids cloud failure, assuring higher performance and efficiency.
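This "expect failure and plan the response" mindset can be sketched as a simple retry-with-backoff wrapper in Python. The flaky dependency and delay values below are illustrative, not a prescribed implementation.

```python
import time

def call_with_retries(operation, max_attempts=3, base_delay=0.01):
    """Retry a flaky operation with exponential backoff.

    A resilient caller assumes `operation` *will* fail sometimes and
    plans its response, rather than crashing on the first error.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts; surface the failure
            time.sleep(base_delay * (2 ** (attempt - 1)))

# Simulate a dependency that fails twice, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return "ok"

result = call_with_retries(flaky)
```

The caller survives two transient failures that would have crashed a naive, one-shot invocation.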

Resilience can be achieved through site-resilient design and systematic testing approaches such as chaos testing.

Conventional Testing and Why It Is Not Enough

Conventional testing ensures a seamless setup and migration of applications into cloud systems and additionally monitors that they perform and work efficiently. This is adequate to ensure that the cloud system does not change application performance and functions in accordance with design considerations.

Conventional testing doesn’t suffice as it is inefficient in uncovering underlying hidden architectural issues and anomalies. Some of the faults appear dormant as they only become visible when specific conditions are triggered.

High Availability Promises of Cloud

“We see a faster rate of evolution in the digital space. Cloud lets us scale up at the pace of Moore’s Law, but also scale out rapidly and use less infrastructure,” says Scott Guthrie on the future and high promises of the cloud. With the pandemic forcing everyone to work from home, there has been a surge in cloud investments. Due to this unprecedented demand, all hyperscalers had to bring in throttling and prioritization controls, which goes against the on-demand elasticity principle of the public cloud.

The public cloud isn’t invincible when it comes to outages and downtime. For example, the recent Google outage that halted multiple Google services such as Gmail and YouTube shows that the public cloud isn’t necessarily free of system downtime either. Hence, I would say the pandemic has added a couple of additional perspectives to resilient cloud systems:

  1. The system must operate smoothly and remain unaltered even when it receives an unexpected surge in online traffic.
  2. The system must look for alternate ways to manage functionality and the resource pool in case additional resource-allocation requests are declined or throttled by the cloud provider.
  3. The system should remain accessible and secure while handling unknown locations and the shift to hybrid work environments (potentially with many endpoints connecting from outside the network firewall).
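The second principle, falling back gracefully when the provider declines a scale-up request, might look like the minimal Python sketch below. The provider API and fallback pool are hypothetical stand-ins for a real cloud SDK and a reserved local resource pool.

```python
class ProviderThrottled(Exception):
    """Raised when the cloud provider declines a scale-up request."""

def request_capacity(provider_scale_up, needed, fallback_pool):
    """Try to scale out; if throttled, serve from a degraded fallback pool.

    `provider_scale_up` is a hypothetical stand-in for a provider API
    call; `fallback_pool` is capacity the system has reserved locally.
    """
    try:
        return provider_scale_up(needed)
    except ProviderThrottled:
        # Alternate path: manage the resource pool ourselves when
        # extra allocation is declined or throttled by the provider.
        granted = min(needed, fallback_pool)
        return granted

def always_throttled(n):
    # Simulate a hyperscaler under prioritization controls.
    raise ProviderThrottled()

granted = request_capacity(always_throttled, needed=10, fallback_pool=4)
```

Instead of failing outright, the system degrades to whatever capacity it controls directly.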

The pandemic has highlighted the value of continuous and chaotic testing of even resilient cloud systems. A resilient and thoroughly tested system will be able to manage that extra congested traffic in a secure, seamless, and stable way. In order to detect the unknowns, chaos testing and chaos engineering are needed.

Cloud-Native Application Design Alone Is Not Sufficient for Resiliency

In the public cloud world, architecting for application resiliency is more critical due to the gaps in base capabilities provided by cloud providers, the multi-tier/multiple technology infrastructure, and the distributed nature of cloud systems. This can cause cloud applications to fail in unpredictable ways even though the underlying infrastructure availability and resiliency are provided by the cloud provider.

To establish a good base for application resiliency, during design the cloud engineers should adopt the following strategies to test, evaluate and characterize application layer resilience:

  1. Leverage a well-architected framework for the overall solution architecture and adopt cloud-native capabilities for availability and disaster recovery.
  2. Collaborate with cloud architects and technology architects to define availability goals and derive application- and database-layer resilience attributes.
  3. Along with threat modeling, define hypothetical failure models based on expected or observed usage patterns, and establish a testing plan for these failure modes based on business impact.

By adopting an architecture-driven testing approach, organizations can gain insight into the base level of cloud application resiliency well before going live, and they can allot sufficient time for remediation activities. But you would still need to test the application for unknown failures and for the multiple failure points inherent in cloud-native application design.

Chaos Testing and Engineering

Chaos testing is an approach that intentionally induces stress and anomalies into the cloud structure to systematically test the resilience of the system.

First, let me make it clear that chaos testing is not a replacement for conventional testing; it’s another way to gauge errors. By introducing degradations into the system, IT teams can see what happens and how it reacts. Most importantly, it helps them find the gaps in the observability and resilience of the system, the things that went under the radar initially.
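One common degradation is injected latency and random failure around a dependency. The sketch below shows the basic wrapper idea in Python; the failure rates and names are illustrative assumptions, not any particular chaos tool's API.

```python
import random
import time

def with_chaos(func, failure_rate=0.2, max_delay=0.05):
    """Wrap a function so that calls randomly slow down or fail.

    This is the core move of chaos testing: deliberately degrade a
    dependency and observe how the rest of the system reacts.
    """
    rng = random.Random()
    def wrapper(*args, **kwargs):
        time.sleep(rng.uniform(0, max_delay))   # injected latency
        if rng.random() < failure_rate:         # injected failure
            raise TimeoutError("chaos: injected fault")
        return func(*args, **kwargs)
    return wrapper

# For the demo, make the fault deterministic: always fail, no delay.
chaotic_fetch = with_chaos(lambda: "payload", failure_rate=1.0, max_delay=0.0)

try:
    chaotic_fetch()
    survived = True
except TimeoutError:
    survived = False
```

In a real experiment, the wrapper would sit around a network or storage call, and the failure rate would start low and increase gradually.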

This robust testing approach was pioneered by Netflix during its migration to the cloud back in 2011, and the method has since become well established. Chaos testing brings inefficiencies to light, pushes the development team to change, measure, and improve resilience, and helps cloud architects better understand and refine their designs.

Constant, systematic, and chaotic testing increases the resilience of cloud infrastructure and ultimately boosts the confidence of managerial and operational teams in the systems they’re building.

A resilient enterprise must create resilient IT systems partly or entirely on cloud infrastructure.

Using chaos and site reliability engineering helps enterprises to be resilient across:

  • Cloud and infrastructure resilience
  • Data resilience via continuous monitoring
  • User and customer experience resilience by ensuring user interfaces hold up under high-stress conditions
  • Resilient cybersecurity by integrating security with governance and control mechanisms
  • Resilient support for infrastructure, applications, and data

To establish complete application resiliency, in addition to the cloud application design aspects mentioned earlier, the solution architect needs to adopt architecture patterns that allow injecting specific faults to trigger internal errors, simulating failures during the development and testing phase.

Some of the common examples of fault triggers are delay in response, resource-hogging, network outages, transient conditions, extreme actions by users, and many more.

  1. Plan for continuous monitoring and management, and automate incident response for common identified scenarios
  2. Establish chaos testing framework and environment
  3. Inject faults with varying severity and combination and monitor application-layer behavior
  4. Identify anomalous behavior and iterate the above steps to confirm criticality
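Steps 3 and 4 above, injecting faults with varying severity and flagging anomalous behavior, can be sketched as follows. The fault names and the demo handler are hypothetical; a real harness would drive actual infrastructure faults.

```python
FAULTS = ["latency", "resource_hog", "network_outage", "transient_error"]

def inject_faults(handler, faults, severities=(1, 2, 3)):
    """Run `handler` under each fault/severity pair and record the outcome.

    `handler` is a hypothetical application entry point that takes the
    active fault name and severity, and either copes or raises.
    """
    report = {}
    for severity in severities:            # vary severity and combination
        for fault in faults:
            try:
                handler(fault, severity)
                report[(fault, severity)] = "survived"
            except Exception as exc:       # anomalous behavior surfaces here
                report[(fault, severity)] = f"anomaly: {exc}"
    return report

def demo_handler(fault, severity):
    # Pretend the app tolerates everything except a severe network outage.
    if fault == "network_outage" and severity >= 2:
        raise ConnectionError("no route to host")

report = inject_faults(demo_handler, FAULTS)
```

The resulting report is exactly the input for step 4: each "anomaly" entry is a candidate weakness to iterate on and confirm for criticality.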

How to Perform the Chaos Test

Chaos testing can be done by introducing an anomaly into any of the seven layers of the cloud structure, which helps you assess the impact on resilience.

When Netflix announced its resiliency tool, Chaos Monkey, in 2011, many development teams adopted it for chaos engineering of their systems. Gremlin is another tool, built by software engineers, that does essentially the same thing. But if you’re looking to perform a chaos test in the current context of COVID-19, you can do so using a GameDay exercise. This simulates an anomaly in which there’s a sudden increase in traffic, for example, many customers accessing a mobile application at the same time. The goal of a GameDay is not just to test resilience but also to enhance the reliability of the system.
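A GameDay-style traffic surge can be approximated with a thread pool hammering a service entry point. In this Python sketch the endpoint is a stand-in callable; a real exercise would target a staging deployment of the actual service.

```python
from concurrent.futures import ThreadPoolExecutor

def game_day_surge(endpoint, normal_load=5, surge_factor=10):
    """Fire a burst of concurrent requests at a service entry point.

    `endpoint` is a hypothetical callable standing in for, e.g., the
    backend API behind a mobile application. Returns (successes, total).
    """
    total = normal_load * surge_factor      # simulate 10x normal traffic
    with ThreadPoolExecutor(max_workers=20) as pool:
        results = list(pool.map(lambda i: endpoint(i), range(total)))
    ok = sum(1 for r in results if r == "ok")
    return ok, total

# A toy endpoint that always answers; a real one may throttle or error.
served = game_day_surge(lambda i: "ok")
```

Comparing successes to total under the surge is the measurement the GameDay is built around: how much of the spike did the system actually absorb?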

The steps you need to take to ensure a successful chaos testing are the following:

  1. Identify: Identify key weaknesses within your system and create a hypothesis along with an expected outcome. Engineers need to identify and assess what kind of failures to inject within the hypothesis framework.
  2. Simulate: Inject anomalies during production based on real-life events. This ensures that you include situations that may happen within your systems. This could entail an application or network disruption or node failure.
  3. Automate: You need to automate these experiments, running them every hour, week, etc. This ensures continuity, a determining factor in chaos engineering.
  4. Continuous feedback and refinement: There are two outcomes to your experiment. It could either assure resilience or detect a problem that needs to be solved. Both are good results from which you can take feedback to refine your system.
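The four steps above can be condensed into a single experiment iteration. Every callable in this Python sketch is a hypothetical placeholder for real tooling: the hypothesis states the expected steady state, the injection applies the anomaly, and the observation measures what actually happened.

```python
def run_experiment(hypothesis, inject, observe):
    """One iteration of the identify / simulate / feedback loop.

    Returns either a confirmation of resilience or a description of the
    weakness found; both outcomes feed back into refining the system.
    """
    expected = hypothesis()   # step 1: state the expected outcome
    inject()                  # step 2: simulate the anomaly
    actual = observe()        # steps 3-4: measure and compare
    if actual == expected:
        return "resilience confirmed"
    return f"weakness found: expected {expected}, got {actual}"

# Toy system: three replicas, no self-healing after one is killed.
state = {"replicas": 3}
outcome = run_experiment(
    hypothesis=lambda: 3,                     # expect 3 healthy replicas
    inject=lambda: state.update(replicas=2),  # kill one replica
    observe=lambda: state["replicas"],        # nothing restored it
)
```

Here the experiment correctly reports a weakness: the system never restored the killed replica, which is exactly the kind of finding that drives the next round of refinement.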

Other specific ways to induce a faulty attack and sequence on the system could be:

  1. Adding network latency
  2. Cutting off scheduled tasks
  3. Cutting off microservices
  4. Disconnecting the system from the data center

Summary

In today’s digital age, where cloud transition and cloud usage are surging, enhancing cloud resilience is imperative for the effective performance of your applications. Continuous and systematic testing matters throughout the life cycle of a project, and it is essential for cloud resiliency at a time when even the public cloud is overburdened. By preventing lengthy outages and future disruptions, businesses save significant costs, preserve goodwill, and assure service durability for their customers. Chaos engineering, therefore, becomes a must for large-scale distributed systems.

By Gaurav Agarwaal

Keywords: Climate Change, Cloud, Digital Twins
