WirelessCar builds automotive data lake solution using Amazon S3, Amazon Redshift

WirelessCar worked alongside Amazon Web Services (AWS) to commoditize connected vehicle services and turn its data into insights, digital services, and revenue. Connected vehicle services in every vehicle, not just premium vehicles, are becoming the new normal. Vehicles produce huge amounts of sensor and user pattern data, and buyers choosing a vehicle are shifting their focus from vehicle hardware to software-defined experiences. Therefore, original equipment manufacturers (OEMs) need new business models that make connected vehicles profitable, and they need to turn data into insights that support personalized new services for better user experience, safety, and loyalty. A first step in this direction is to build a data lake that is compliant with the European Union’s General Data Protection Regulation (GDPR).

Alongside AWS, WirelessCar built a data lake solution using numerous AWS services to bring relevant data out of team silos while respecting consumer privacy and OEM data separation. These services include Amazon Simple Storage Service (Amazon S3), object storage built to retrieve any amount of data from anywhere; Amazon Redshift, which helps companies accelerate their time to insights with fast, easy, and secure cloud data warehousing at scale; Amazon Kinesis Data Firehose, which lets users load real-time streams into data lakes, warehouses, and analytics services; and streams from Amazon DynamoDB, a fast, flexible NoSQL database service for single-digit millisecond performance at any scale. The solution collects OEM data from millions of vehicles in per-tenant Amazon S3 buckets and processes it using AWS Lambda, a serverless, event-driven compute service that lets users run code for virtually any type of application or backend service without provisioning or managing servers, and AWS Fargate, a serverless compute service for containers. User data access, OEM data separation, and GDPR compliance are managed with the per-tenant Amazon S3 buckets; AWS Identity and Access Management (AWS IAM), which provides fine-grained access control across all of AWS; and Active Directory Connector (AD Connector), a directory gateway that lets users redirect directory requests to their on-premises Microsoft Active Directory without caching any information in the cloud.

One major challenge is establishing data collection processes and pipelines from OEM programs, because each OEM provides different services that are based on a different technology stack and that WirelessCar has developed over 20 years. The data lake is used for the analysis of global services provided by WirelessCar to OEMs, connected vehicle analytics on an aggregated level, domain- and region-specific insights, and metrics such as the cost of WirelessCar services per vehicle per year, vehicle diagnostic errors, and forecasting, which were previously tracked in ad hoc Excel spreadsheets. Based on this data lake, we built intelligent dashboards for exploratory analysis, and we are building services based on artificial intelligence (AI) and machine learning (ML) that provide advanced analytics to OEMs and solution users. These insights and new data-driven services will create value for OEMs and solution users and open up new revenue sources and business models based on data collected from vehicles.

This blog post shows an example setup for a data lake solution for OEMs and automotive data. It highlights architecture, use cases, and a solution to build an automotive data lake. Further, it paves a path to build a dashboard and AI/ML–based services for OEMs and solution users.

Overview of solution

WirelessCar has multiple solutions developed during its 20 years of experience in the industry. Teams working for different OEMs perform continuous refactoring such as switching from SQL databases to NoSQL alternatives like Amazon DynamoDB. WirelessCar has hundreds of AWS accounts throughout its organization, with different teams operating their services in separated accounts to apply least-privilege access and limit the scope of impact of issues. The data lake solution had to provide significant freedom for teams to refactor and control their own accounts and, at the same time, provide central access to datasets. This solution also needed to find processes and pipelines to collect data from a diverse set of solutions.

WirelessCar opted for Amazon S3 buckets with write-only permissions for teams. These buckets provide a high-capacity, virtually infinite input buffer for the data lake. The raw data is typically in JSON Lines files, comma-separated values (CSV) files, or compressed formats. By granting write access to AWS IAM roles in the source accounts, we give teams the ability to stream data into Amazon S3 in a single destination data lake account with the method of their choice (for example, Amazon Kinesis Data Firehose, Amazon S3 replication, direct Amazon S3 PUTs, or Amazon DynamoDB streams). Independent teams are provided per-tenant buckets into which they can stream data. Once the data is ingested, automatic jobs move the input files to a separate archive bucket with suitable Amazon S3 storage class transitions. AWS Fargate tasks, initiated hourly, and AWS Lambda cleanse data in Amazon S3, create datasets, and load them into Amazon Redshift tables suitable for efficient queries. Once in Amazon Redshift, datasets are normally transformed to a structured columnar format. Because WirelessCar operates a multitenant environment, we have chosen to have multiple Amazon Redshift clusters to separate costs and backups for different OEMs. Datasets in Amazon Redshift are queried by admins (the data management team), data scientists, and dashboard readers using Amazon QuickSight, a popular cloud-native, serverless business intelligence service.

In this way, raw vehicle data is collected from many OEM programs, cleansed, and archived, and datasets are loaded into Amazon Redshift, where they can be queried under data security measures.

Solution

Data lake provisioning

The WirelessCar data management team set up the data lake using AWS CloudFormation templates. AWS CloudFormation speeds up cloud provisioning with infrastructure as code. The templates cover Amazon Redshift clusters, Amazon S3 input and archive buckets, and AWS Lambda. For each tenant, WirelessCar OEM program teams request a data ingestion Amazon S3 bucket and the Amazon Resource Name (ARN) of a role with write-only access to that bucket. Each OEM has a unique tenant for each of its brands.

Breaking data silos with data ingestion pipelines

Each OEM team inside WirelessCar pushes data from its own account, using the role identified by the provided ARN, into a provisioned Amazon S3 bucket in the central data lake account. Depending on the data source, different methods of writing data to Amazon S3 are used. A small Amazon DynamoDB table can be exported in its entirety with the Amazon DynamoDB–to–Amazon S3 export feature. A larger Amazon DynamoDB table has its change stream continually written by Amazon Kinesis Data Firehose as compressed, partitioned chunks in Amazon S3. The method of writing data to an Amazon S3 bucket in the data lake account is up to the producers, and the data management team provides guidance and template solutions. Amazon S3 works alongside AWS Managed Services, which helps users operate their AWS infrastructure more efficiently and securely, to write data. This helped WirelessCar to break data silos and gather data from multiple sources in its data lake account.

Cross-account AWS IAM roles raise one particular issue. An Amazon S3 bucket policy that references an AWS IAM role by its ARN as the principal does not survive the deletion and re-creation of that role, even under an identical ARN, because the saved policy is bound to the role's unique internal ID rather than to the ARN string. While this design choice makes sense in the general case, it places an operational constraint between the two parties that is not justified in this case. This is solved without compromising security by using a StringEquals condition on the aws:PrincipalArn condition key.
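The condition described above can be sketched as a bucket policy built in Python. This is only an illustration under stated assumptions, not WirelessCar's actual policy: the bucket name, account ID, role name, and the helper `write_only_bucket_policy` are hypothetical, and using the producer account root as the principal is one common way to pair with an `aws:PrincipalArn` condition.

```python
import json


def write_only_bucket_policy(bucket_name, producer_account_id, producer_role_arn):
    """Build a bucket policy that lets one producer role write objects only.

    The Principal is the producer account as a whole; the specific role is
    matched by string comparison on aws:PrincipalArn, so the grant keeps
    working if the role is deleted and re-created under the same ARN.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowTenantWriteOnly",
                "Effect": "Allow",
                "Principal": {"AWS": f"arn:aws:iam::{producer_account_id}:root"},
                "Action": "s3:PutObject",  # write-only: no read, list, or delete
                "Resource": f"arn:aws:s3:::{bucket_name}/*",
                "Condition": {
                    "StringEquals": {"aws:PrincipalArn": producer_role_arn}
                },
            }
        ],
    }


# Hypothetical values, for illustration only.
policy = write_only_bucket_policy(
    bucket_name="example-datalake-input-tenant-a",
    producer_account_id="111122223333",
    producer_role_arn="arn:aws:iam::111122223333:role/example-datalake-writer",
)
print(json.dumps(policy, indent=2))
```

For cross-account writes, the producer role also needs an identity policy in its own account that allows `s3:PutObject` on the same bucket; the bucket policy alone is not sufficient.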

The intermediate storage in Amazon S3 decouples data producers from the data lake. Input data is generally relatively structured, because it has been sent from vehicles and processed in an OEM tenant account. The data is placed in Amazon S3 in JSON, CSV, or a batched and compressed format. Depending on the type of data, some anonymization or pseudonymization is already applied by the source OEM programs using an AWS Lambda transformation in Amazon Kinesis Data Firehose. Most of the data is delivered incrementally, because Amazon Kinesis Data Firehose automatically splits it into suitable chunks.
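A pseudonymizing transformation of the kind mentioned above could look roughly like the following AWS Lambda handler for Amazon Kinesis Data Firehose. It is a minimal sketch: the `vin` field name, the keyed-hash approach, and the hard-coded `SECRET_KEY` are assumptions made for illustration, and a real deployment would load the key from a managed secret store.

```python
import base64
import hashlib
import hmac
import json

# Hypothetical key; in practice this would come from a secrets service.
SECRET_KEY = b"replace-with-a-managed-secret"


def pseudonymize(value):
    """Return a stable keyed hash so records can still be joined per vehicle."""
    return hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()


def lambda_handler(event, context):
    """Firehose data transformation: decode, pseudonymize, re-encode each record."""
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        if "vin" in payload:  # assumed field name for the vehicle identifier
            payload["vin"] = pseudonymize(payload["vin"])
        # Newline-terminate so the delivered objects are valid JSON Lines.
        data = (json.dumps(payload) + "\n").encode("utf-8")
        output.append(
            {
                "recordId": record["recordId"],  # must echo the incoming ID
                "result": "Ok",
                "data": base64.b64encode(data).decode("utf-8"),
            }
        )
    return {"records": output}
```

A keyed hash is used here rather than a plain hash so that identifiers cannot be recovered by hashing a list of known values without the key; whether that is adequate for a given dataset is a compliance decision outside the scope of this sketch.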

Once the data is ingested, the source files are transferred to a separate archival Amazon S3 bucket. The archival Amazon S3 bucket makes it simple to replay data deliveries during testing or refer to the unmodified source data for troubleshooting purposes. It also means that the input Amazon S3 bucket will always remain empty, except for files that are just about to be picked up for processing.

Data cleansing

AWS Lambda and AWS Fargate are used to cleanse data and load it into Amazon Redshift for querying. Amazon S3 initiates AWS Lambda for data processing, and AWS Fargate batch job processing is initiated every 15 minutes. In the data processing step, input data files from Amazon S3 are processed and loaded into Amazon Redshift. The Amazon Redshift COPY command efficiently ingests even large files directly from Amazon S3 into Amazon Redshift.
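As an illustration of the load step, the sketch below builds a COPY statement for gzip-compressed JSON files in Amazon S3. The table name, bucket path, and role ARN are hypothetical, and the helper `build_copy_statement` is an assumption of this example rather than part of the WirelessCar solution.

```python
import re

# Accepts "table" or "schema.table" made of plain identifier characters.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*)?")


def build_copy_statement(table, s3_uri, iam_role_arn):
    """Return a Redshift COPY statement for gzip-compressed JSON input."""
    if not _IDENTIFIER.fullmatch(table):
        raise ValueError(f"unexpected table name: {table!r}")
    if "'" in s3_uri or "'" in iam_role_arn:
        raise ValueError("quotes are not allowed in the S3 URI or role ARN")
    return (
        f"COPY {table}\n"
        f"FROM '{s3_uri}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        "FORMAT AS JSON 'auto'\n"
        "GZIP;"
    )


# Hypothetical values, for illustration only.
print(
    build_copy_statement(
        table="staging.trips",
        s3_uri="s3://example-datalake-input-tenant-a/trips/",
        iam_role_arn="arn:aws:iam::123456789012:role/example-redshift-copy",
    )
)
```

Pointing COPY at a prefix rather than a single object lets Amazon Redshift load the files under that prefix in parallel, which is one reason the chunked delivery from the ingestion step suits this command.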

AWS Glue (a simple, scalable, and serverless data integration service) or even direct analytical queries against production databases would lower the latency of data access. However, following best practices, WirelessCar decided not to query live production databases, because doing so can hamper workload performance. Amazon S3 Object Lambda helps WirelessCar reduce the latency until data is ready for consumption in Amazon Redshift to the order of minutes, which is reasonable for WirelessCar use cases.

DBT, a data build tool, gives analytics engineers the ability to transform data in their warehouses by simply writing select statements; DBT handles turning these select statements into tables and views. It runs on an automatic schedule in AWS Fargate, performing certain tasks at frequent intervals and larger updates nightly. The transformations include simple filtering views that remove bad data and aggregating views or tables that reduce the number of rows. In certain cases, sensitive source data such as geospatial information is masked or reduced in precision using DBT views. The final layer exposes the datasets in a structured columnar format. Amazon Redshift user-defined functions (UDFs) let Amazon Redshift invoke AWS Lambda from SQL queries. This gives WirelessCar the ability to implement certain business logic functions in any preferred language (for example, Python, which is well adapted for data science purposes) and to enrich geospatial data with, for example, geogrid identifiers or lookups using Amazon Location Service, which lets users securely and easily add location data to applications. Together, AWS Lambda, AWS Fargate, DBT, and UDFs help WirelessCar process data ingested into Amazon S3 buckets and create columnar datasets in Amazon Redshift for data consumption.
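To make the Lambda-backed UDF idea concrete, here is a minimal sketch of a handler that turns latitude and longitude pairs into grid cell identifiers. The grid scheme, the cell size, and the function names are hypothetical; only the batch request and response shape (an `arguments` list in, a JSON string with `success` and `results` out) follows the Amazon Redshift Lambda UDF interface.

```python
import json
import math

CELL_SIZE_DEG = 0.01  # hypothetical cell size in degrees


def grid_id(lat, lon, cell=CELL_SIZE_DEG):
    """Map a coordinate to a coarse grid cell ID; NULL inputs stay NULL."""
    if lat is None or lon is None:
        return None
    return f"{math.floor(lat / cell)}_{math.floor(lon / cell)}"


def lambda_handler(event, context):
    """Handle one batch: each entry in event['arguments'] is one SQL row."""
    response = {}
    try:
        results = [grid_id(lat, lon) for lat, lon in event["arguments"]]
        response["success"] = True
        response["num_records"] = len(results)
        response["results"] = results
    except Exception as exc:  # report the failure back to the query
        response["success"] = False
        response["error_msg"] = str(exc)
    return json.dumps(response)
```

Once such a function is registered in Amazon Redshift with CREATE EXTERNAL FUNCTION, it can be called from SQL like any scalar function. Because rows arrive in batches, per-row work should stay cheap, and the coarse cell ID doubles as a way to reduce the precision of location data.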

Data protection

Because WirelessCar operates cloud solutions for multiple OEMs, it is important to separate storage costs and backups per OEM. Therefore, multiple Amazon Redshift clusters are used, separated by OEM. Schemas are used to separate the different tenants (car brands) within an OEM. Because some of the Amazon Redshift clusters do not need to be running continuously, WirelessCar plans to use Amazon Redshift Serverless, which helps users get insights from data in seconds without having to manage data warehouse infrastructure.

The GDPR and the California Consumer Privacy Act impose regulatory compliance requirements for dealing with user data. Anonymization of data helps facilitate compliance. When personal data is processed, it must be deleted when required. To avoid manual processes, the WirelessCar data management team consumes the Amazon DynamoDB stream from the source database and replays the operations in Amazon Redshift, so that any data deleted in the source system is removed from Amazon Redshift too. This approach also covers user-level actions, like explicitly deleting a trip; batch job actions, like removing all trips for a vehicle; and time to live (TTL) events, like removing trips that should no longer be stored.
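The replay idea can be sketched as a function that translates one Amazon DynamoDB stream record into parameterized SQL statements. This is an assumption-laden illustration: the `trips` table, the `trip_id` key, the `%s` placeholder style, and the delete-then-insert handling of updates are hypothetical choices, and the stream is assumed to include new images.

```python
def deserialize(attr):
    """Unwrap a DynamoDB-typed attribute such as {"S": "abc"} or {"N": "42"}."""
    type_tag, value = next(iter(attr.items()))
    if type_tag == "NULL":
        return None
    if type_tag == "N":
        try:
            return int(value)
        except ValueError:
            return float(value)
    return value  # S and BOOL pass through; nested types are not handled here


def replay_statements(record, table="trips", key_column="trip_id"):
    """Return (sql, params) pairs that mirror one stream record in the warehouse."""
    key_value = deserialize(record["dynamodb"]["Keys"][key_column])
    delete = (f"DELETE FROM {table} WHERE {key_column} = %s", (key_value,))
    if record["eventName"] == "REMOVE":
        return [delete]
    # INSERT and MODIFY: replace the row with the new image.
    image = record["dynamodb"]["NewImage"]
    columns = sorted(image)
    params = tuple(deserialize(image[column]) for column in columns)
    placeholders = ", ".join(["%s"] * len(columns))
    insert = (
        f"INSERT INTO {table} ({', '.join(columns)}) VALUES ({placeholders})",
        params,
    )
    return [delete, insert]


def is_ttl_delete(record):
    """TTL expirations are REMOVE events performed by the DynamoDB service."""
    identity = record.get("userIdentity") or {}
    return (
        record.get("eventName") == "REMOVE"
        and identity.get("type") == "Service"
        and identity.get("principalId") == "dynamodb.amazonaws.com"
    )
```

Because user deletions, batch deletions, and TTL expirations all surface as REMOVE events on the stream, a single code path keeps the warehouse in step with the source; `is_ttl_delete` is only needed if TTL removals should be logged or handled differently.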

Data access

Data in Amazon Redshift is accessed through Amazon QuickSight, which is used to build visualization dashboards that turn data into insights. WirelessCar uses Active Directory integration to manage access permissions to datasets, which helps in complying with data regulations. Data is exported periodically from Amazon Redshift to Amazon S3 for short-term analysis by a limited number of data scientists. To facilitate compliance with regulation, this Amazon S3 data is automatically deleted using Amazon S3 lifecycle rules.
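The automatic deletion of exported data can be expressed as an Amazon S3 lifecycle configuration with an expiration rule. The sketch below only builds the configuration; the prefix, the 30-day retention period, and the helper name are hypothetical, since the source does not state the values WirelessCar uses.

```python
def expiration_lifecycle(prefix, days):
    """Build a lifecycle configuration that expires objects under a prefix."""
    if days < 1:
        raise ValueError("days must be a positive integer")
    return {
        "Rules": [
            {
                "ID": f"expire-after-{days}-days",
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Expiration": {"Days": days},
            }
        ]
    }


# Hypothetical values, for illustration only.
config = expiration_lifecycle(prefix="exports/", days=30)

# With boto3 installed and credentials configured, the configuration could
# be applied to a bucket like this (bucket name is hypothetical):
#   import boto3
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="example-datalake-exports",
#       LifecycleConfiguration=config,
#   )
print(config)
```

Letting Amazon S3 expire the objects means no scheduled cleanup job has to run, and the retention period is visible in one place in the bucket configuration.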

To avoid central bottlenecks, the data management team is intentionally kept small, with three or four people actively working on the data lake. In contrast, WirelessCar has dozens of teams serving multiple car makers. The data lake solution was set up in 2021, and all OEM solutions now have one or more data streams ingesting data into the data lake. We intend to continue this work in 2022 and gather an even greater number of datasets to permit innovation across previously existing data silos.

For the future, we plan to make use of Amazon Redshift Serverless because it helps us to scale up the number of clusters used for cost and backup separation without increasing our fixed costs. As our data volumes grow, it is our intention to shift data out of Amazon Redshift storage and seamlessly query Amazon S3 using Amazon Redshift Spectrum, which lets users query data directly from files on Amazon S3, for longer time series. This data lake is used for creating exploratory dashboard visualizations and developing AI/ML–based services for OEMs and solution users.

Conclusion

WirelessCar is collecting data across all OEM programs and creating a data lake that complies with regulations. This data lake is used for exploratory dashboard analysis and for creating new AI/ML–based services for connected mobility. Please reach out to us with your questions or adopt the WirelessCar data lake solution for your workloads. We will share more about building dashboard services for connected mobility in our next blog post.

AWS for Industries