Chaos Engineering: AWS Chief Technologist on Preparing for the Unexpected

Practice makes perfect, as the adage goes, even when you’re practicing failure.

That’s a rough approximation of how Amazon Web Services practices operational resilience. Other companies may try and even succeed at proactively stopping business disruption. At AWS, practicing disruption is part of the culture.

“We recommend that our clients safeguard their companies via implementation of robust testing and validation procedures to ensure their effectiveness,” Laurent Domb, chief technologist for worldwide public sector financial services at AWS, told PYMNTS for the “What’s Next in Payments” series. “And this should include methods such as penetration testing for cybersecurity, disaster recovery testing, but then more importantly, run ‘game days’ where they can truly practice the various incidents that happen in a real-world event, via controlled chaos engineering experiments.”

AWS is one of the world’s biggest providers of cloud-based infrastructure to financial services and other verticals, and practicing for disruption ensures readiness for the unpredictable. When Domb mentioned “game day” as part of his job, he wasn’t talking about getting ready to watch Thursday Night Football on Prime Video. He was referring to preparing a team to deal with the kind of disruption that is all too familiar to any executive who has been through the pandemic, the rush of innovations and changing consumer behavior that followed, and even the slew of sophisticated fraud attacks that have marked the 2024 calendar.

The ability to stage business disruption in a controlled way with minimal to no customer impact is key to creating a culture of resilience, Domb said.

Continuous Resilience

First, Domb and AWS advise financial services companies to prioritize building a culture of continuous resilience. To do this, they must understand which user journeys and components are critical for delivering a seamless customer experience. Financial organizations must establish clear service-level objectives, recovery point objectives and recovery time objectives, along with an impact tolerance for each important business service, meaning the maximum level of disruption to that service the organization can tolerate.
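To make those targets concrete, here is a minimal sketch in Python of how a team might record objectives for one service and check a recovery drill against them. The class names, field names and numbers are illustrative assumptions, not anything AWS or Domb described.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResilienceObjectives:
    """Hypothetical targets for one important business service."""
    rto_seconds: float               # recovery time objective: max time to restore service
    rpo_seconds: float               # recovery point objective: max window of lost data
    impact_tolerance_seconds: float  # max tolerable disruption to the service


@dataclass(frozen=True)
class DrillResult:
    """What a recovery drill measured (illustrative fields)."""
    recovery_seconds: float
    data_loss_seconds: float


def breached_objectives(objectives, result):
    """Return the names of the objectives the drill failed to meet."""
    breaches = []
    if result.recovery_seconds > objectives.rto_seconds:
        breaches.append("rto")
    if result.data_loss_seconds > objectives.rpo_seconds:
        breaches.append("rpo")
    if result.recovery_seconds > objectives.impact_tolerance_seconds:
        breaches.append("impact_tolerance")
    return breaches


# Made-up numbers: a 5-minute RTO, 1-minute RPO and 15-minute impact tolerance.
payments = ResilienceObjectives(rto_seconds=300, rpo_seconds=60, impact_tolerance_seconds=900)
drill = DrillResult(recovery_seconds=420, data_loss_seconds=30)
print(breached_objectives(payments, drill))  # ['rto']
```

In this made-up drill, recovery took seven minutes, so the service missed its recovery time objective while staying inside its recovery point objective and impact tolerance.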

To further enhance resilience practices, AWS encourages a shift from traditional disaster recovery exercises to fault-induced disaster recovery via controlled chaos engineering, which gets as close as possible to real events. The company pushes familiarity with two terms that sound threatening but are necessary to any company that wants to achieve operational resilience: chaos engineering and fault injection.

“Purposeful fault injection experiments, also known as chaos engineering, help teams create the real-world conditions needed to uncover hidden bugs, monitor blind spots, and manage bottlenecks that are difficult to find in distributed systems,” according to the AWS website.
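For readers who want to see the idea in miniature, the sketch below shows a toy fault injection experiment in Python. It is an illustration under stated assumptions, not how AWS tooling works: the service call, the fallback and every function name are hypothetical, and the faults are injected in-process rather than into real infrastructure.

```python
import random


class InjectedFault(Exception):
    """Raised by the experiment itself, not by a real dependency."""


def with_fault_injection(call, failure_rate, rng):
    """Wrap `call` so that a fraction of invocations fail on purpose."""
    def wrapped(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFault("simulated dependency failure")
        return call(*args, **kwargs)
    return wrapped


def fetch_balance(account_id):
    """Stand-in for a real downstream call (hypothetical)."""
    return {"account": account_id, "balance": 100, "source": "primary"}


def fetch_balance_with_fallback(fetch, account_id):
    """Serve a degraded answer instead of failing the request."""
    try:
        return fetch(account_id)
    except InjectedFault:
        return {"account": account_id, "balance": None, "source": "fallback"}


def run_experiment(requests=1000, failure_rate=0.2, seed=7):
    """Send requests through the faulty dependency and count the outcomes."""
    rng = random.Random(seed)  # seeded so the experiment is repeatable
    faulty_fetch = with_fault_injection(fetch_balance, failure_rate, rng)
    served = 0
    degraded = 0
    for account_id in range(requests):
        response = fetch_balance_with_fallback(faulty_fetch, account_id)
        served += 1
        if response["source"] == "fallback":
            degraded += 1
    return {"requests": requests, "served": served, "degraded": degraded}


if __name__ == "__main__":
    print(run_experiment())
```

The hypothesis under test is that every request is still served, some of them in degraded form, while the dependency is failing. If the fallback path were missing, the injected exception would surface and stop the run, which is the kind of hidden weakness the experiment is meant to expose.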

So “game day” can include chaos engineering. As Domb said, by replaying real-world disruptions, companies can verify that their systems are truly resilient. AWS supports this transformation with specialized resilience programs, including incident tabletop exercises, well-architected reviews, threat modeling and operational readiness reviews, as well as AWS-led chaos engineering game days. These initiatives help financial services firms build the necessary muscle memory and operational procedures to manage mission-critical systems effectively, fostering a culture of resilience.

Building a culture of resilience is not just about implementing the latest technologies, Domb told PYMNTS. It’s about creating an organizational mindset that prioritizes preparedness and continuous improvement. Resilience must be ingrained in every aspect of a financial institution’s operations, from the top down.

“Fostering a culture of resilience is crucial for companies to successfully navigate failures and outages,” Domb said. “We encourage our customers to embrace practices like continuous resilience and chaos engineering. These practices involve purposely stressing and disrupting systems in a controlled environment to identify weaknesses that may not be apparent during normal operations.”

Another part of the culture of resilience at AWS is a focus on creating an environment of psychological safety where teams can openly discuss mistakes without fear of retribution.

“The goal should be deriving lessons and ultimately preventing the same issues from happening in the future,” Domb said.

This approach not only improves operational resilience but also fosters a learning culture that can adapt to an ever-changing threat landscape.

The Proactive Approach

Proactive risk management is another critical component of operational resilience. Domb outlined a comprehensive approach that financial institutions can take to assess their strengths and vulnerabilities.

“Organizations should conduct risk assessments regularly,” he said. “These assessments should cover all critical areas of the business, including cybersecurity, operational processes, supply chain, regulatory compliance and reputational risk.”

By identifying vulnerabilities, evaluating their potential impact and prioritizing mitigation strategies, financial institutions can build a solid foundation for their risk management efforts.

Domb also highlighted the importance of resilience modeling techniques, which allow organizations to anticipate potential disruptions and their cascading effects across different components of their business. By understanding these interdependencies, companies can develop targeted strategies to mitigate risks and minimize the impact of business disruptions.

AWS’ commitment to operational resilience is exemplified in its work with Amazon Prime Video. The streaming service faces unique challenges, particularly during live events like UEFA Champions League matches streamed in Germany, Italy, the United Kingdom and Ireland, where a service disruption could result in customer dissatisfaction.

“Prime Video understands the importance of operational resilience, service resilience and observability,” Domb said. “They have integrated the use of chaos engineering, machine learning and automation to ensure continuous resilience.”

By using ML to forecast demand and running simulations to test systems under different conditions, Prime Video can identify and address potential bottlenecks before they occur, Domb said.
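The article does not describe Prime Video's models or tooling, so the following Python sketch only illustrates the general pattern of comparing forecast demand with capacity. A naive growth-factor forecast stands in for the ML model, and the component names, capacities and traffic figures are invented for the example.

```python
def naive_forecast(history_rps, growth=1.1):
    """Stand-in for an ML forecast: last observed peak times a growth factor."""
    return max(history_rps) * growth


def find_bottlenecks(capacity_rps, forecast_peak_rps, headroom=0.2):
    """Return components whose capacity is below the forecast peak plus headroom."""
    required = forecast_peak_rps * (1 + headroom)
    return sorted(name for name, cap in capacity_rps.items() if cap < required)


# Invented requests-per-second figures for three hypothetical components.
capacity = {"cdn": 5000, "auth": 1200, "playback": 1500}
history = [800, 950, 1000]

peak = naive_forecast(history)
print(find_bottlenecks(capacity, peak))  # ['auth']
```

With these numbers the forecast peak is roughly 1,100 requests per second, or about 1,320 with 20% headroom, so only the hypothetical auth component would need more capacity before the event.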

Additionally, Prime Video conducts weekly game day exercises, injecting simulated failures and service disruptions into its environments to validate its fallback mechanisms and ensure its systems can handle real-world failures.

“Resilience is nothing without observability,” Domb said. “Prime Video has instrumented their entire ecosystem to gain visibility into performance, errors and anomalies, enabling them to quickly address issues before they impact customers.”

AI as Threat Detector

Artificial intelligence is playing an increasingly important role in enhancing operational resilience, particularly in the areas of threat detection and defense. Domb highlighted several AWS services that integrate AI and ML to bolster security and resilience for financial institutions.

“At AWS, we have pioneered the integration of AI and ML into our suite of services, such as Amazon GuardDuty, which uses machine learning models to continuously analyze data streams and identify potential threats,” Domb said. “Similarly, services like the AWS Firewall employ machine learning techniques to detect and block malicious network traffic in real time, adapting their defenses to evolving attack vectors.”

Beyond threat detection, AI and ML are also being used for predictive maintenance and operational resilience. Services like AWS Security Hub, AWS Resilience Hub and Amazon Inspector help organizations correlate and prioritize security and resilience findings, allowing them to focus on the most critical risks.

“As new threats and challenges emerge, we continuously double down on integrating cutting-edge AI and ML techniques into our services,” Domb said. “This empowers our customers with intelligent threat detection, proactive maintenance and automated incident response, ensuring they are prepared for the evolving threat landscape.”

Domb also pointed to operational resilience as a competitive differentiator. By building a culture of resilience and identifying and mitigating risks, financial institutions can safeguard their operations and maintain customer trust in the face of uncertainty.

“Today’s customers expect a 24/7, always-on service,” Domb said. “To meet these expectations, financial institutions must not only understand the critical components of their business but also continuously practice and refine their resilience strategies. By doing so, they can ensure they are prepared to handle whatever challenges the future may bring.”