AWS for Industries

Scaling Automated Driving data processing and data management with BMW Group on AWS

Autonomous driving (AD) and highly automated driving are key technologies in the automotive industry. Fully realized, they have the potential to fundamentally transform the automotive and mobility industries through improved comfort and safety and new business models for automobile manufacturers. The BMW Group is one of the leading OEMs in the automotive industry. This year, the BMW Group will launch a highly automated self-driving system on the 7 Series. In a 2020 AWS re:Invent session, we learned how the BMW Group collects over a billion kilometers of anonymized perception data from its worldwide connected fleet of customer vehicles to help develop safer and more capable automated driving systems. In general, BMW Group does not process vehicle data without informed consent, unless the data processing is legally required. Informed consent requires transparency about the data being processed and the processing purpose. This blog post describes how BMW Group collects, analyzes, and visualizes customer vehicle data on AWS.

Purpose of customer vehicle data

In general, there are two types of vehicle data for AD: test data and customer vehicle data. Test vehicles are internal development cars used by automakers, equipped with additional sensors and logging systems not found in standard customer vehicles. This additional equipment enables automakers to collect more detailed vehicle data. In this post, we focus on customer vehicle data, which captures real-world conditions and real driving behavior. Customer vehicle data, which may be based on direct customer behavior, is important because it allows automakers to understand how to better design, improve, verify, validate, and homologate the AD functions of their vehicles. For example, data from customer vehicles reveals when and where customers use a specific AD function the most, and where the function needs improvement. This helps the AD function developers at the automaker focus strategically on areas with high impact and value for customers. Customer vehicle data is also used for validation and verification of AD functions. For example, the data may be used to compute the frequency of a driving scenario, and to estimate the criticality of a function failure in that scenario.
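As a minimal sketch of the scenario-frequency computation mentioned above, the snippet below counts occurrences of a scenario and normalizes by the distance driven. The record layout, the field names `distance_km` and `scenarios`, the `cut_in` label, and the input values are all hypothetical; the post does not describe how BMW Group actually computes this metric.

```python
from collections import Counter


def scenario_frequency_per_1000_km(drives, scenario):
    """Return occurrences of `scenario` per 1,000 km across all drives.

    `drives` is a list of dicts with hypothetical keys:
      - "distance_km": distance covered during the drive
      - "scenarios": list of scenario labels observed during the drive
    """
    total_km = sum(d["distance_km"] for d in drives)
    if total_km <= 0:
        raise ValueError("total distance must be positive")
    counts = Counter(label for d in drives for label in d["scenarios"])
    return 1000.0 * counts[scenario] / total_km


# Made-up input for illustration only
drives = [
    {"distance_km": 120.0, "scenarios": ["cut_in", "emergency_braking"]},
    {"distance_km": 80.0, "scenarios": ["cut_in"]},
    {"distance_km": 300.0, "scenarios": []},
]
freq = scenario_frequency_per_1000_km(drives, "cut_in")
```

Normalizing by distance rather than by number of drives keeps the figure comparable between regions or time periods with very different mileage.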

Collection of customer vehicle data

Customer data can only be collected and used by the automaker if the customer provides affirmative consent for such collection and use. If consent is provided to the BMW Group, the AD function data of the vehicle is collected via the BMW Group’s custom data collection platform. Two types of data are collected:

  • Time series data collected continuously during a drive, which is useful for most analytics use cases; and
  • Event-based data recorded in pre-defined scenarios like emergency braking, which is used for specific analytics tasks.

Both types of vehicle data are sent over-the-air by the BMW Group to its data storage on AWS. Collected data include:

  • Data from the vehicle, such as speed, mileage, position, and the operational state of the AD functions.
  • Data from the environment, such as lane boundaries, traffic participants, and weather conditions.

The next step of data collection will be for the BMW Group to collect raw sensor data in specific situations. The BMW Group collects customer data from the following regions: EMEA, US, and China, across various operational design domains (ODDs) and subject to customer consent. More than a billion km of data are collected, corresponding to more than 50 TB in each region per year.
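To put these volumes in perspective, the following back-of-the-envelope calculation derives an approximate data density per kilometer. It is only a rough sketch: it assumes that both figures refer to the same scope (one region, one year), treats the stated lower bounds as round numbers, and uses decimal terabytes.

```python
# Rough figures quoted in the text (both are stated as lower bounds).
KM_PER_REGION_PER_YEAR = 1_000_000_000    # "more than a billion km"
BYTES_PER_REGION_PER_YEAR = 50 * 10**12   # "more than 50 TB" (decimal TB assumed)

bytes_per_km = BYTES_PER_REGION_PER_YEAR / KM_PER_REGION_PER_YEAR
kilobytes_per_km = bytes_per_km / 1000

print(f"roughly {kilobytes_per_km:.0f} kB per km")
```

Under these assumptions the collected data amounts to an order of magnitude of tens of kilobytes per kilometer, which is consistent with signal-level data rather than raw sensor recordings.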

The collected data is then processed by a big data toolchain built by the BMW Group running on AWS. This toolchain is designed to help empower AD function developers and data analysts at BMW Group to work with such data more efficiently and build AD functions that better reflect their customers’ needs. As the customer fleet grows and sends more data, the amount of collected data grows exponentially. Below we describe the BMW Group’s solution on AWS to help cope with the growth of such vehicle data.

Data privacy and governance

Protecting customer data is one of the main goals of the data protection program at BMW Group. The BMW Group has therefore established a mandatory process for collecting customer data. Unless the collection is required by law, the customer’s explicit consent is required. Multi-layered onboard (in the vehicle) and offboard (in the cloud) anonymization steps are in place to make sure that individuals cannot be identified, and only data needed for the BMW Group’s AD function development is collected and stored, in accordance with the appropriate retention policy.

Solution overview for data processing and data management

Figure 1: Solution Overview for data processing and data management

The proposed BMW Group solution in Figure 1 has six key components:

1. Data ingestion over-the-air from customer vehicles to the BMW Cloud Data Hub (CDH): To better manage this data, the BMW Group introduced the notion of “data providers” and “data consumers” to increase both the autonomy and agility of its software engineering teams. In this case, the aforementioned data collection platform acts as a data provider, which is designed to ingest, transform, and store the data as Parquet files (unsorted events received from the cars) in CDH.

2. Data pipeline with AWS Step Functions: This data pipeline has five major Amazon EMR jobs; all results are stored in Amazon Simple Storage Service (Amazon S3) buckets.

a. Checks the completeness of each drive (start and end of each drive), anonymizes the drives, and partitions and sorts the data, storing it in optimized Apache Parquet format for further analysis via the analytics stack (step 4). Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk.

b. Extracts and enriches the input data to produce flattened tables (a de-normalized table is much simpler for analysts to use).

c. Extracts metadata for dashboards, reports, and data search.

d. Synchronizes and interpolates input signals with different sample rates (simpler for analysts to use). As the various sensors in the vehicle sample at different frequencies, their signals are synchronized to a uniform sampling interval, e.g., 100 ms per sensor.

e. Automatically labels the intervals in drives with a custom-made labeler for quicker search and statistics, e.g., an emergency braking or a cut-in event on the highway.

There are a few key reasons to choose Amazon EMR as the Spark job runtime: application code is testable on local systems, and the same codebase can be shared with on-premises systems. It also supports a wide variety of Amazon EC2 instance types, including AWS Graviton3, as well as Amazon EC2 Spot Instances, to help optimize costs.

3. Data storage and access management: All data except raw sensor data is stored by BMW in Amazon S3 in Apache Parquet format. The Amazon S3 Intelligent-Tiering storage class is enabled. AWS Glue Data Catalog provides the table definitions and contains references to data that is used as sources and targets of the extract, transform, and load (ETL) jobs in AWS Glue. AWS Lake Formation provides fine-grained access control at row-level or column-level to AD function developers and data analysts.

4. Analytics stack based on Amazon EMR: Data analysts at BMW Group can spin up their own clusters on-demand. They use JupyterLab or Zeppelin for the big-data analytics with Spark, Amazon Athena, or AWS Glue.

5. Drive scene visualization: Drive scenes can be visualized with Foxglove and a custom-made animation generator. A similar setup with Foxglove and remote access via NICE DCV is implemented as a module of the Autonomous Driving Data Framework (ADDF) open-source project.

6. KPI dashboard with Amazon QuickSight: The dashboard for AD functions shows when and where a specific AD function like Active Cruise Control is used, so that BMW Group can prioritize the development of the AD functions. The dashboard for the platform allows the BMW Group platform manager to control the cost and usage of certain AWS services.
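To illustrate pipeline steps 2d and 2e above, here is a minimal plain-Python sketch that resamples an irregularly sampled signal onto a uniform 100 ms grid by linear interpolation and then labels the intervals in which a condition holds. It is not the BMW Group implementation, which runs as Spark jobs on Amazon EMR: the function names, the acceleration signal, the sample values, and the -6.0 m/s² threshold are hypothetical.

```python
from bisect import bisect_right


def resample_linear(samples, step_ms=100):
    """Linearly interpolate (timestamp_ms, value) samples onto a uniform grid.

    `samples` must be sorted by strictly increasing integer timestamps and
    contain at least two points.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples to interpolate")
    times = [t for t, _ in samples]
    out = []
    for t in range(times[0], times[-1] + 1, step_ms):
        # Index of the first sample strictly after t, clamped to the last one
        i = min(bisect_right(times, t), len(times) - 1)
        t0, v0 = samples[i - 1]
        t1, v1 = samples[i]
        out.append((t, v0 + (v1 - v0) * (t - t0) / (t1 - t0)))
    return out


def label_intervals(series, predicate, label):
    """Group consecutive samples that satisfy `predicate` into intervals.

    Returns a list of (start_ms, end_ms, label) tuples.
    """
    intervals, start, prev = [], None, None
    for t, v in series:
        if predicate(v):
            if start is None:
                start = t
            prev = t
        elif start is not None:
            intervals.append((start, prev, label))
            start = None
    if start is not None:
        intervals.append((start, prev, label))
    return intervals


# Made-up longitudinal acceleration in m/s^2, sampled at irregular times
accel = [(0, -1.0), (150, -2.0), (300, -8.0), (450, -9.0), (700, -1.0)]
uniform = resample_linear(accel, step_ms=100)

# Hypothetical rule: strong deceleration counts as emergency braking
events = label_intervals(uniform, lambda a: a <= -6.0, "emergency_braking")
```

Once every signal shares the same time grid, a labeler can combine several signals sample by sample, and the resulting intervals can be indexed for search and statistics.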

Best practice 1: Use of parallel development environments in large teams.

Each AD developer in a large team wants to deploy and test the entire stack without interfering with a colleague’s work in the same AWS account. Terraform workspaces help separate all AWS resources and enable development on multiple instances of the entire stack in the same AWS account.

Best practice 2: Spark performance optimization.

The BMW Group team uses custom-made Spark partitioners and Catalyst strategies to lower the memory footprint and improve performance. To be able to use all Amazon EMR instance types, the BMW team decided to optimize memory usage by introducing a custom re-partition strategy, which is designed to help reduce the memory footprint by up to 50% in our use case. This re-partition strategy is integrated directly as a Catalyst strategy, so there is no need to serialize and deserialize the rows in Catalyst, which reduces processing time by 30%.

Best practice 3: Test automation.

The development team at the BMW Group insists on a high bar for code and data quality and defines test cases on all levels, from unit tests and integration tests to end-to-end tests. The use of the standard Apache Spark API allows for local testing without spinning up an extra Amazon EMR cluster.
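The idea of testing transformation logic locally can be sketched as follows. The example is a plain-Python stand-in rather than a Spark job, and it reuses the drive completeness check from pipeline step 2a as the function under test. The event type names `drive_start` and `drive_end`, the function name, and the test cases are hypothetical; the post does not show BMW Group's actual tests.

```python
import unittest


def is_complete_drive(events):
    """Return True if a drive starts with one start marker and ends with one end marker.

    `events` is a list of (timestamp_ms, event_type) tuples; the event type
    names used here are hypothetical.
    """
    ordered = sorted(events)  # incoming events may arrive unsorted
    types = [kind for _, kind in ordered]
    return (
        types.count("drive_start") == 1
        and types.count("drive_end") == 1
        and types[0] == "drive_start"
        and types[-1] == "drive_end"
    )


class IsCompleteDriveTest(unittest.TestCase):
    def test_complete_drive_with_unsorted_events(self):
        events = [(300, "drive_end"), (0, "drive_start"), (100, "speed_sample")]
        self.assertTrue(is_complete_drive(events))

    def test_missing_end_marker(self):
        events = [(0, "drive_start"), (100, "speed_sample")]
        self.assertFalse(is_complete_drive(events))

    def test_empty_drive(self):
        self.assertFalse(is_complete_drive([]))


# Run the suite programmatically so the snippet also works inside a notebook
suite = unittest.defaultTestLoader.loadTestsFromTestCase(IsCompleteDriveTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Keeping the business logic in small functions with plain inputs and outputs is what makes this kind of test cheap: the same function can be exercised on a laptop and later applied to distributed data.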


In this blog post, we have shown how the BMW Group collects, analyzes, and visualizes vehicle data on AWS to help enable its AD function developers and data analysts to develop and improve their AD functions. We described how the data is prepared and pre-processed in the cloud using a wide range of AWS services, including Amazon EMR, Amazon S3, Amazon Athena, AWS Glue, AWS CodeBuild, and Amazon QuickSight, to help scale according to the fast growth of vehicle data while taking the privacy of the BMW Group customers into account.

Learn more about BMW’s Cloud Data Hub in this blog post, ADDF in this blog post, AWS offerings at the AWS for automotive page, or contact your AWS team today.


Thomas Atz

Thomas is the Cluster Product Owner for customer vehicle data at the autonomous driving department of BMW Group. He leads the strategic enablement of the in-vehicle and cloud-based off-board data collection infrastructure, as well as the applied data collection and analysis campaigns. The data is used at BMW with a focus on data-driven product design, data-driven development, and legal purposes.

Jan-Stefan Fischer

Jan-Stefan Fischer is the Chief Engineer for customer vehicle data at the autonomous driving department of BMW Group. He oversees the development of a globally distributed AD data platform, which facilitates the collection, processing, and analysis of customer vehicle data. His ultimate objective is to furnish BMW's autonomous driving developers with scalable and resilient solutions that promote data-driven development.

Junjie Tang

As a Principal Consultant at AWS Professional Services, Junjie leads the design of data-driven solutions for global clients, bringing over 10 years of experience in cloud computing, big data, AI, and IoT. Junjie heads the Autonomous Driving Data Framework (ADDF) open-source project designed to enable scalable data processing, model training, and simulation for automated driving systems. Junjie is passionate about creating innovative solutions that improve quality of life and the environment.

Tae Won Ha

Tae Won Ha is an Engagement Manager at AWS with a software developer background. He leads AWS ProServe teams in multiple engagements. He has been helping customers deliver tailored solutions on AWS to achieve their business goals. In his free time, Tae Won is an active open-source developer.