AWS for Industries

Engineering Resilient Systems: BMW Group’s Chaos Engineering Journey and Insights

“Using AWS FIS has been a transformative journey for BMW Group, and we are committed to scaling chaos engineering across our organization to ensure the highest levels of reliability and resilience for our digital services.” – Dr. Céline Laurent-Winter, BMW Group Vice President Connected Vehicle Platforms

Introduction

Ensuring resilience and reliability in mission-critical software systems is paramount, especially for companies like BMW Group that deliver advanced connected vehicle experiences. As BMW Group migrated its next-generation connected driving experience, the ConnectedDrive Platform, to the cloud, it recognized the need for a systematic approach to help validate and strengthen the resilience of its cloud infrastructure. Enter chaos engineering – a discipline of intentionally introducing controlled disruptions to proactively identify weaknesses. In this blog post, we explore BMW Group’s transformative journey in adopting chaos engineering. We highlight the company’s strategic cloud migration, the execution of its first Cloud IT Emergency Exercise (Cloud ITEE), and the scaling of chaos engineering practices across the organization. Through real-world examples and lessons learned, we explain how BMW Group used AWS Fault Injection Service (FIS) to help conduct large-scale chaos experiments in production, uncover issues, and foster a culture of greater resilience and continuous improvement.

BMW Group’s Cloud Migration

In 2020, BMW Group and AWS embarked on a strategic collaboration, initially focused on accelerating BMW Group’s data-driven innovations. Soon after, BMW Group made the decision to modernize their ConnectedDrive Platform by migrating it to the AWS cloud. ConnectedDrive is a suite of digital services and applications that are designed to enhance the driving experience and vehicle connectivity. With ConnectedDrive, BMW Group provides their customers with features like navigation with real-time traffic and map updates, remote control of vehicle functions via mobile app, over-the-air software updates, driver assistance systems for adaptive cruise control and lane-keeping, as well as entertainment options like Spotify, Apple CarPlay, and Android Auto. The migration of the ConnectedDrive platform to AWS was driven by BMW Group’s need to accommodate rapidly increasing future demand, help enhance resilience, and drive greater efficiency through automation. The migration was successfully completed in May 2023, solidifying ConnectedDrive as the backbone of the world’s largest connected vehicle fleet, comprising over 22 million connected vehicles.

This migration and further optimizations involved continuously improving the ConnectedDrive Platform software architecture, upskilling BMW Group’s workforce, and expanding agile development methodologies to help roll out updates and enhancements across BMW Group’s entire product portfolio at an unprecedented pace. Achieving scalability and resilience has become paramount as BMW Group aims to efficiently deploy its innovations across different regions and vehicle platforms globally, while ensuring a more robust and secure software infrastructure that is designed to withstand dynamic conditions and cyber threats. At the heart of that digital experience lies a complex software system composed of over 1,100 microservices, serving 12 billion service requests and processing 165 terabytes (TB) of data daily. Today, BMW Group delivers these services with a remarkable reliability of 99.95% and isn’t stopping there.

The First Cloud ITEE Experiment

To gain confidence in the resilience of its software infrastructure, BMW Group and AWS teamed up to organize and run a series of enablement sessions and hands-on workshops focused on resilience and chaos engineering. Following these events, BMW Group fully embraced chaos engineering methodologies, using AWS FIS to conduct Cloud ITEEs – a term coined by BMW Group to describe their centrally organized large-scale chaos experiments in production environments. In the past, BMW Group conducted “IT Emergency Exercises” (ITEEs), a concept involving the testing of infrastructure in their on-premises data centers. As their migration to the AWS cloud accelerated, the on-premises ITEEs became obsolete and required a new approach. This prompted BMW Group to adapt their resilience engineering approach and start Cloud ITEEs, using the cloud’s inherent scalability and flexibility, and tools like AWS FIS to simulate and test a wide range of failure scenarios.

The team responsible for managing the vehicle connectivity and telemetry part of the ConnectedDrive Platform was selected as the pilot for the first Cloud ITEE. They had already conducted smaller-scale experiments using AWS FIS in pre-production environments on the central, self-managed Message Queuing Telemetry Transport (MQTT) broker cluster. MQTT is a lightweight, publish-subscribe messaging protocol that BMW Group uses to process more than 12 billion messages per day from their connected vehicle fleet. Therefore, the resilience of the MQTT broker cluster is critical. The cluster is deployed across three Availability Zones (AZs) in a highly available configuration, utilizing Amazon Elastic Compute Cloud (EC2) instances and Amazon Elastic Block Store (EBS) volumes. In the first Cloud ITEE experiment, the team used AWS FIS to simulate an AZ Availability: Power Interruption scenario and test the ability of the MQTT broker cluster to automatically recover from that disruption by starting new nodes in the remaining AZs. The experiment helped BMW Group uncover previously unknown issues that led to the formation of a second cluster, a situation known as a “split-brain” scenario. As a result, vehicles randomly connected to either of the two MQTT clusters, resulting in undeliverable messages. Because the experiment was conducted under close observation, the issue was quickly mitigated by the operations team before customers could be impacted. The experiment provided invaluable insights, allowing BMW Group to proactively identify and address a critical bug that could have potentially caused customer impact.
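The blog does not reproduce BMW Group’s actual experiment templates, but a minimal sketch can illustrate the mechanics. The following Python (boto3) example, based on assumptions rather than BMW Group’s real setup, creates a simplified AWS FIS experiment template that stops tagged EC2 instances in a single AZ, guards the run with a CloudWatch alarm stop condition, and then starts the experiment. The tag, alarm ARN, IAM role ARN, region, and AZ name are hypothetical placeholders, and the managed AZ Availability: Power Interruption scenario in the FIS scenario library covers more resource types than this stripped-down version.

```python
# Illustrative sketch only: a simplified stand-in for the AZ power-interruption
# scenario, limited to stopping tagged EC2 instances in one Availability Zone.
# All identifiers (tags, alarm ARN, role ARN, region, AZ name) are hypothetical.
import uuid
import boto3

fis = boto3.client("fis", region_name="eu-central-1")

template = fis.create_experiment_template(
    clientToken=str(uuid.uuid4()),
    description="Simplified AZ outage drill: stop broker nodes in one AZ",
    roleArn="arn:aws:iam::123456789012:role/fis-experiment-role",  # placeholder
    stopConditions=[
        {   # abort the experiment if the platform's health alarm fires
            "source": "aws:cloudwatch:alarm",
            "value": "arn:aws:cloudwatch:eu-central-1:123456789012:alarm:mqtt-health",
        }
    ],
    targets={
        "broker-nodes-in-az": {
            "resourceType": "aws:ec2:instance",
            "resourceTags": {"service": "mqtt-broker"},  # placeholder tag
            "filters": [
                {"path": "Placement.AvailabilityZone", "values": ["eu-central-1a"]}
            ],
            "selectionMode": "ALL",
        }
    },
    actions={
        "stop-broker-nodes": {
            "actionId": "aws:ec2:stop-instances",
            "targets": {"Instances": "broker-nodes-in-az"},
            "parameters": {"startInstancesAfterDuration": "PT30M"},
        }
    },
)

# Start the experiment and observe whether the cluster re-forms in the remaining AZs.
experiment = fis.start_experiment(
    clientToken=str(uuid.uuid4()),
    experimentTemplateId=template["experimentTemplate"]["id"],
)
print(experiment["experiment"]["state"]["status"])
```

The stop condition is what keeps such an experiment safe to run in production: if the health alarm fires, FIS halts the experiment so the team can intervene before customers are affected.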

Scaling Chaos Engineering Across the BMW Group Organization

The early success of these Cloud ITEE experiments built growing confidence, leading BMW Group to introduce chaos engineering practices across the entire organization and to increase their frequency from annually to monthly. Along with conducting chaos experiments on other critical systems, BMW Group integrated chaos engineering capabilities into the ConnectedDrive developer platform. This step helped empower BMW Group teams building applications for the ConnectedDrive Platform to run their own chaos experiments against their applications. The developer platform is crucial for creating and deploying ConnectedDrive applications, serving thousands of developers. It provides essential development tools, including over 35,000 daily ephemeral runners for continuous integration and deployment. The platform centrally manages binaries using Kubernetes, leveraging Amazon Elastic Kubernetes Service (EKS).

The diagram below illustrates a typical AWS account setup for a single application within the BMW Group’s ConnectedDrive Platform. Traffic ingress is facilitated through AWS PrivateLink, designed to ensure secure and scalable connectivity. The compute resources, including Amazon EC2 instances and Amazon EKS clusters, are auto-scaled across three AZs, providing high availability and fault tolerance. The application’s database layer, powered by Amazon Aurora, is deployed in separate private subnets and uses the multi-AZ feature for enhanced durability and redundancy.

Typical AWS account setup for a given BMW Group application within the ConnectedDrive Platform, running the Availability Zone Power Interruption Scenario
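As a small, hedged illustration of what this layout implies for experiment preparation, the sketch below uses Python (boto3) with hypothetical resource names to confirm that an application’s Auto Scaling group and Aurora cluster really do span multiple AZs, the kind of precondition check a team might automate before running an AZ-level experiment.

```python
# Illustrative pre-experiment check (hypothetical resource names): confirm the
# compute and database layers really span multiple Availability Zones.
import boto3

REGION = "eu-central-1"                       # placeholder region
asg_name = "connected-drive-app-asg"          # hypothetical Auto Scaling group name
db_cluster_id = "connected-drive-app-aurora"  # hypothetical Aurora cluster id

autoscaling = boto3.client("autoscaling", region_name=REGION)
rds = boto3.client("rds", region_name=REGION)

asg = autoscaling.describe_auto_scaling_groups(
    AutoScalingGroupNames=[asg_name]
)["AutoScalingGroups"][0]
print(f"ASG spans AZs: {sorted(asg['AvailabilityZones'])}")

cluster = rds.describe_db_clusters(
    DBClusterIdentifier=db_cluster_id
)["DBClusters"][0]
print(f"Aurora AZs: {sorted(cluster['AvailabilityZones'])}, MultiAZ: {cluster['MultiAZ']}")

# A team might gate the chaos experiment on preconditions like these:
assert len(asg["AvailabilityZones"]) >= 3, "Compute layer is not spread across 3 AZs"
assert cluster["MultiAZ"], "Aurora cluster is not configured as Multi-AZ"
```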

To gain even more experience with complex chaos engineering experiments, BMW Group decided to start experimenting on the developer platform itself and simulate a real-world disruption scenario. The BMW Group developer platform team also used the AZ Availability: Power Interruption scenario to simulate a complete outage of a single AZ (AZ a in the diagram above) for all components of the platform. This time, the experiment revealed an unexpected limitation: the application’s configuration and the Amazon EBS volumes were bound to specific AZs. When Amazon EKS nodes were automatically started in a different AZ by the auto-scaling mechanism, the pods launched on those new nodes were still bound to the original EBS volumes in the initial AZ, causing accessibility issues. This led to a 15–20-minute period of suboptimal performance as the application operated without the EBS volumes, which serve as a cache. Eventually, the pods attached to new EBS volumes in their new AZs, and the cache began populating. This finding highlighted the importance of considering AZ constraints when designing resilience and recovery strategies. The AZ Availability: Power Interruption scenario, in particular, has proven to be a powerful tool, becoming an integral part of BMW Group’s chaos engineering program, where it continuously validates and strengthens the resilience of their systems.
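The underlying constraint is that Amazon EBS volumes are zonal resources: a PersistentVolume backed by EBS carries node-affinity rules that pin the pods consuming it to the volume’s AZ. As a hedged illustration rather than BMW Group’s actual tooling, the following Python sketch uses the official Kubernetes client to list which PersistentVolumes are pinned to a given AZ, a useful inventory to review before simulating the loss of that AZ.

```python
# Illustrative diagnostic (not BMW Group's actual tooling): list PersistentVolumes
# whose node affinity pins them to a specific Availability Zone, e.g. the zone an
# AZ power-interruption experiment is about to impair.
from kubernetes import client, config

IMPAIRED_AZ = "eu-central-1a"  # hypothetical zone targeted by the experiment
ZONE_KEYS = {
    "topology.ebs.csi.aws.com/zone",   # label set by the EBS CSI driver
    "topology.kubernetes.io/zone",     # standard Kubernetes topology label
}

config.load_kube_config()  # or config.load_incluster_config() inside the cluster
v1 = client.CoreV1Api()

for pv in v1.list_persistent_volume().items:
    affinity = pv.spec.node_affinity
    if not affinity or not affinity.required:
        continue
    for term in affinity.required.node_selector_terms:
        for expr in term.match_expressions or []:
            if expr.key in ZONE_KEYS and IMPAIRED_AZ in (expr.values or []):
                claim = pv.spec.claim_ref
                owner = f"{claim.namespace}/{claim.name}" if claim else "unbound"
                print(f"{pv.metadata.name} (claim: {owner}) is pinned to {IMPAIRED_AZ}")
```

Treating such volumes as disposable caches, or provisioning replacements in the surviving AZs, is effectively how the platform recovered after the 15–20-minute window described above.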

Scaling chaos engineering practices organization-wide demanded extensive coordination and collaboration across multiple teams and stakeholders at BMW Group. This cross-functional alignment and mutual buy-in became a cornerstone, enabling the company to successfully implement chaos engineering at a broader scale. Through seamless teamwork and a shared commitment to resilience, BMW Group achieved a significant milestone by strategically adopting chaos engineering as a driving force behind building highly resilient and reliable connected vehicle applications.

Lessons Learned

Here are a few lessons that BMW Group learned along the way:

1. Encourage cross-team collaboration: BMW Group found success by bringing together teams from different domains within their organization. Enterprises should foster a collaborative environment that allows teams working on different systems to share learnings and coordinate their initiatives.

2. Establish a structured program and secure leadership buy-in: BMW Group’s Cloud ITEE program provided a consistent, leadership-approved framework for planning, executing, and analyzing chaos engineering experiments. Enterprises should consider implementing a similar centralized program to coordinate chaos engineering efforts across the organization.

3. Use shared tooling and knowledge: BMW Group teams were able to build upon each other’s work, standardizing tooling on AWS FIS, and sharing best practices. Teams that had already participated in exercises could support other teams through mentoring and guidance. Enterprises should encourage the sharing of chaos engineering frameworks, scripts, and learnings to accelerate adoption and reduce duplication of effort. A mechanism such as the Resilience Lifecycle Framework can help incorporate learnings into a continuous improvement process, enabling more streamlined planning and execution of future experiments.

4. Empower teams: While BMW Group had a centralized Cloud ITEE program, individual teams maintained a high degree of autonomy in designing and executing their own experiments. Organizations should empower their teams to take ownership of chaos engineering, rather than imposing a top-down approach. These teams have the critical knowledge required to identify which experiments they need to perform.

5. Start small and build confidence: BMW Group’s experience showcased the importance of starting with smaller, controlled chaos experiments to build confidence and buy-in among teams. By gradually increasing the scope and complexity of experiments, teams were able to learn, iterate, and refine their chaos engineering practices, while also gaining leadership support and organizational momentum. This incremental approach allowed BMW teams to develop the necessary skills, tooling, and processes to successfully execute more ambitious experiments over time.

6. Recognize the human benefits of chaos engineering: Chaos engineering is not just about identifying software issues – it also has profound human benefits central to its success. By creating an environment where failure is not stigmatized and potential disruptions are captured through proactive execution of chaos experimentation, organizations empower their teams to be bold, innovative, and focused on delivering exceptional experiences. This environment helps enable creativity without the worry of being blamed. Striking the right balance between reliable systems and psychological safety transforms chaos engineering into a powerful driver of business success.

Conclusion

In conclusion, BMW Group’s journey showcases the transformative power of chaos engineering in building more resilient and reliable cloud-based systems. By implementing a structured chaos engineering program and integrating it into their development processes, BMW Group successfully scaled this practice across the organization. Their experience highlights several key lessons that can guide other enterprises on a similar path, including fostering collaboration, securing leadership support, empowering teams, and starting small before tackling more complex experiments. Ultimately, chaos engineering helped enable BMW Group to build more reliable systems while promoting a culture of resilience. To learn more about how AWS FIS can transform your organization, visit the AWS FIS page.

Christian Mueller

Christian Mueller is a Principal Solutions Architect at AWS. Christian helps BMW Group leverage the full potential of the AWS cloud to become even more successful. He is interested in all things serverless and enjoys being part of the large transformation underway in the automotive industry, ranging from connected and autonomous to software-defined vehicles.

Adrian Hornsby

Adrian Hornsby is a Principal System Dev Engineer within AWS Reliability Services. Adrian is interested in all aspects of operational resilience, particularly the intersection of tools, mechanisms, and culture that provides the technology we love.

Hrvoje Lukavski

Hrvoje Lukavski is a Cloud Solution Architect and Technical Product Owner at BMW Group. He leads the development of the company's Cloud Developer Platform, ensuring its top-notch quality and future-readiness. Based in Munich, Hrvoje is a tech lover who enjoys challenges both in his career and in the local mountains.

Jan Schäfer

Jan Schäfer is a lead reliability engineer and software architect at BMW Group, where he currently drives an initiative to strengthen the resiliency of ConnectedDrive services. Jan is interested in all facets of software development, with a particular passion for programming languages. In his free time, he enjoys exploring the beautiful nature of southern Germany with his wife and his two kids.

Julian Frielinghaus

Julian Frielinghaus is a product owner at BMW Group, working on a software-defined vehicle platform. With a background in site reliability engineering, he focuses on building resilient software products and enabling others to do the same. He has extensive experience in developing large-scale distributed systems and applying chaos engineering methodologies.

Satoshi Kimura

Satoshi Kimura is a Technical Account Manager at AWS, working within the AWS automotive team. He helps customers operate their platforms resiliently and cost-efficiently on AWS.