AWS Database Blog

Cox Automotive scales digital personalization using an identity graph powered by Amazon Neptune

Cox Automotive Inc. makes buying, selling, owning and using cars easier for everyone. The global company’s 34,000-plus team members and family of brands, including Autotrader®, Clutch Technologies, Dealer.com®, Dealertrack®, Kelley Blue Book®, Manheim®, NextGear Capital®, VinSolutions®, vAuto® and Xtime®, are passionate about helping millions of car shoppers, 40,000 auto dealer clients across five continents and many others throughout the automotive industry thrive for generations to come.

Auto dealers hosting their e-commerce websites on platforms like Dealer.com need innovative ways to target website visitors with relevant and personalized content. To deliver personalized content, dealers need tools to segment shoppers, display relevant advertisements, and trigger personalized email drip campaigns for different shopper segments.

Historically, website visitors were tracked across multiple domains using third-party cookies. Several browsers have already phased out third-party cookies, and the remaining ones are expected to phase them out by 2022. This change will heavily impact the way Cox Automotive delivers personalized content to its online shoppers.

Cox Automotive’s disparate business units were brought together via acquisitions and have evolved in silos. They have a growing need to combine cross-business unit data to create a holistic 360-degree view of the consumer household.

The Consumer Insights team is focused on providing shopper personalization services across brands. Their software serves consumers, car dealerships, and automotive OEMs.

In October 2019, the Consumer Insights team decided to use an identity graph approach to be less reliant on third-party cookies while also addressing the growing need for building a 360-degree view of households that can be utilized across business units. The team decided to use Amazon Neptune to address their identity graph needs.

Neptune is a fully managed graph database service that makes it easy to build and run applications using highly connected datasets. Neptune is a purpose-built, high-performance graph database engine optimized for storing billions of relationships and querying the graph with millisecond latency. Neptune supports both the Property Graph and the Resource Description Framework (RDF) standards.

The Consumer Insights team uses several data sources in the identity graph for building personalization capabilities. These include:

  • Cox Automotive’s proprietary data, referred to as Pixall Data
  • Vehicle inventory data
  • Consumer browsing history
  • Vehicle transactions
  • Vehicle leads

As phase one, the Consumer Insights team built an identity graph that ties together consumer browsing history data with CRM data (leads and transactions). The following visualization outlines the team’s approach to addressing current challenges while crafting a vision for an identity graph driving all aspects of multi-channel marketing personalization.

Reasons for choosing Neptune

The Consumer Insights team ran experiments to store the connected datasets in their identity graph with managed relational databases, key-value stores, in-memory databases, and graph databases. The experimentation focused on performance and total cost of ownership (TCO). Two key technical aspects stood out in favor of Neptune:

  1. Simplified data modelling – Traditional relational databases, with their rigid schemas and relationships, posed a challenge in building a data model for a graph problem. The Consumer Insights team took several passes at building the right data model. As new vertices and edges were identified, the relational model became hard to scale, both in query performance and in the difficulty of writing the desired queries in SQL. Modeling the data as a graph mimicked the business use case. This was the first indication that Neptune was a natural fit for the use case.
  2. Query performance – Neptune offered out-of-the-box query performance that met the team’s needs and saved them the time they would otherwise have spent on optimizations. Query performance scaled well as additional edges and vertices were added.

The identity graph

To demonstrate the data model simplification offered by Neptune, we provide the following visualization of the Consumer Insights team’s identity graph, which shows the relationship between three entities (vertices) – cookie, IP Address and CRM ID.

This diagram shows the actual Gremlin query used in the identity graph. The query resolves the identity of a visitor in a few simple steps: it traverses the graph from a given visitor (request.uid), finds the CRM ID vertices with edges to that visitor, and then finds all visitor IDs with edges to those CRM IDs.

At this point, two visitor IDs have been identified in the preceding diagram. From those two visitors, all IP addresses with edges to them are identified, and from those IP addresses, all visitor IDs with connected edges are found. In the preceding diagram, one visitor ID becomes six visitor IDs after walking the graph.
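
The walk described above can be sketched in a few lines of Python. This is a minimal in-memory illustration, not the team's Gremlin query, which this text does not reproduce. The sample edges, the vertex IDs, and the `resolve` and `hop` helpers are hypothetical. The sketch assumes the `hasCRMid` and `hasIPAddress` edge labels that appear in the bulk load files later in this post, and it keeps the starting visitor in the result, which is also an assumption.

```python
from collections import defaultdict

# Hypothetical sample edges as (from, label, to); visitors point at CRM IDs and IPs.
EDGES = [
    ("v1", "hasCRMid", "crm1"),
    ("v2", "hasCRMid", "crm1"),
    ("v1", "hasIPAddress", "ip1"),
    ("v2", "hasIPAddress", "ip2"),
    ("v3", "hasIPAddress", "ip1"),
    ("v4", "hasIPAddress", "ip1"),
    ("v5", "hasIPAddress", "ip2"),
    ("v6", "hasIPAddress", "ip2"),
    ("v7", "hasIPAddress", "ip3"),  # unrelated visitor on another IP
]


def build_index(edges):
    """Index edges in both directions, keyed by (vertex, label)."""
    out_edges = defaultdict(set)  # (visitor, label) -> CRM IDs or IPs
    in_edges = defaultdict(set)   # (CRM ID or IP, label) -> visitors
    for src, label, dst in edges:
        out_edges[(src, label)].add(dst)
        in_edges[(dst, label)].add(src)
    return out_edges, in_edges


def hop(visitors, label, out_edges, in_edges):
    """Go from visitors to the vertices they share via `label`, then back to visitors."""
    shared = set()
    for visitor in visitors:
        shared |= out_edges.get((visitor, label), set())
    result = set(visitors)  # assumption: keep the visitors we started from
    for vertex in shared:
        result |= in_edges.get((vertex, label), set())
    return result


def resolve(uid, edges):
    """Visitor -> CRM IDs -> visitors -> IP addresses -> visitors."""
    out_edges, in_edges = build_index(edges)
    via_crm = hop({uid}, "hasCRMid", out_edges, in_edges)
    return hop(via_crm, "hasIPAddress", out_edges, in_edges)


household = resolve("v1", EDGES)
```

With this sample data, the first hop widens `v1` to two visitors through the shared CRM ID, and the second hop widens those two to six through their IP addresses. For a visitor that has a CRM ID, a Gremlin traversal along the lines of `g.V(uid).out('hasCRMid').in('hasCRMid').out('hasIPAddress').in('hasIPAddress').dedup()` would express a similar walk under the same assumptions.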

Business processes powering the identity graph

After choosing Neptune as the preferred database, the Consumer Insights team embarked on the next step of actually building the identity graph.

The identity graph is the heart of personalized marketing offering identity resolution. It’s built on top of two distinct steps to create a 360-degree view of the consumer and the household: householding and lead mapping.

Householding

Householding is the process of identifying and combining data of shoppers or prospects of a “household”. Householding can combine anonymous consumer activity (cookies) across different websites, devices, and internet connections.

The data science team at Cox Automotive runs several AI/ML based probabilistic processes (processes based on the theory of probability) to enrich and map the data together. The output of the probabilistic processes creates one-to-many relationships between an IP address (identity graph vertex) and visitor IDs (cookie).

The enriched householding data creates three files for bulk loading into Neptune:

  • IP address (vertex)
  • Visitor ID (vertex)
  • Relationships (edges)

The following table shows the file containing IP address vertices.

~id                ~label     createTime:String(single)  Name:String(single)
0.0.0.0_ipAddress  ipAddress  7/10/2020 2:00 PM          0.0.0.0
1.1.1.1_ipAddress  ipAddress  7/11/2020 2:00 PM          1.1.1.1
2.2.2.2_ipAddress  ipAddress  7/12/2020 2:00 PM          2.2.2.2

The following table shows the file containing visitor ID vertices.

~id          ~label   createTime:String(single)  Name:String(single)
abc_visitor  visitor  7/10/2020 2:00 PM          abc
def_visitor  visitor  7/11/2020 2:00 PM          def
ghi_visitor  visitor  7/12/2020 2:00 PM          ghi

The following table shows the file containing the edges from visitor to IP address.

~id                            ~from        ~to                ~label
abc_visitor_0.0.0.0_ipAddress  abc_visitor  0.0.0.0_ipAddress  hasIPAddress
def_visitor_1.1.1.1_ipAddress  def_visitor  1.1.1.1_ipAddress  hasIPAddress
ghi_visitor_2.2.2.2_ipAddress  ghi_visitor  2.2.2.2_ipAddress  hasIPAddress
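
As an illustration of how files in this shape could be produced, the following sketch turns visitor-to-IP pairs into vertex and edge rows and serializes them as CSV. It is a minimal sketch under stated assumptions, not the team's pipeline: the `pairs` input and the `vertex_row`, `edge_row`, and `to_csv` helpers are hypothetical, and only the column headers, ID conventions, and labels are taken from the tables above.

```python
import csv
import io

# Column headers copied from the tables above.
VERTEX_HEADER = ["~id", "~label", "createTime:String(single)", "Name:String(single)"]
EDGE_HEADER = ["~id", "~from", "~to", "~label"]


def vertex_row(name, label, create_time):
    """Build one vertex row; the ID is '<name>_<label>' as in the sample files."""
    return [f"{name}_{label}", label, create_time, name]


def edge_row(visitor, ip):
    """Build one hasIPAddress edge row from a visitor to an IP address."""
    src = f"{visitor}_visitor"
    dst = f"{ip}_ipAddress"
    return [f"{src}_{dst}", src, dst, "hasIPAddress"]


def to_csv(header, rows):
    """Serialize a header and rows to CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical householding output: (visitor, IP address, createTime).
pairs = [
    ("abc", "0.0.0.0", "7/10/2020 2:00 PM"),
    ("def", "1.1.1.1", "7/11/2020 2:00 PM"),
    ("ghi", "2.2.2.2", "7/12/2020 2:00 PM"),
]

ip_csv = to_csv(VERTEX_HEADER, [vertex_row(ip, "ipAddress", ts) for _, ip, ts in pairs])
visitor_csv = to_csv(VERTEX_HEADER, [vertex_row(v, "visitor", ts) for v, _, ts in pairs])
edge_csv = to_csv(EDGE_HEADER, [edge_row(v, ip) for v, ip, _ in pairs])
```

Deriving the edge ID from the two vertex IDs, as the sample files do, makes each relationship's ID deterministic, so the same visitor-to-IP pair always maps to the same edge.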

Lead mapping

The lead mapping process collects data from the lead forms filled out at the Cox Automotive branded sites. The lead form is mapped to an internally generated crm_id (identity graph vertex) and visitor ID (car shopper cookie).

The lead mapping process generates three files for bulk loading into Neptune:

  • CRM ID (vertex)
  • Visitor ID (vertex)
  • Relationships (edges)

The following table shows the file containing the CRM ID vertices.

~id        ~label  createTime:String(single)  Name:String(single)
123_crmid  crmid   7/10/2020 2:00 PM          123
456_crmid  crmid   7/11/2020 2:00 PM          456

The following table shows the file containing the lead visitor ID vertices.

~id          ~label   createTime:String(single)  Name:String(single)
abc_visitor  visitor  7/10/2020 2:00 PM          abc
def_visitor  visitor  7/11/2020 2:00 PM          def
ghi_visitor  visitor  7/12/2020 2:00 PM          ghi

The following table shows the file containing the edges from visitor to CRM ID.

~id                    ~from        ~to        ~label
abc_visitor_123_crmid  abc_visitor  123_crmid  hasCRMid
ghi_visitor_456_crmid  ghi_visitor  456_crmid  hasCRMid

Incrementally refreshing the identity graph 

As of this writing, the identity graph consists of approximately 0.5 billion edges and 0.4 billion vertices running on a Neptune cluster of db.r5.2xlarge instance types.

The householding process creates a full data set every day and stores it in an Amazon Simple Storage Service (Amazon S3) bucket. To optimize Neptune performance, the Consumer Insights team starts an Amazon EMR job that creates a daily difference file between today’s full data and the prior day’s data set. Any net new additions or modifications are scheduled for bulk upload into Neptune using the Neptune bulk load API. When the bulk load is complete, a separate EMR job kicks off to process net deletions using the Gremlin API. Finally, an EMR job deletes any orphan vertices using the Gremlin API.
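
The logic of the daily difference step can be sketched as follows. This is plain Python for illustration only: the team runs this as EMR jobs, and the snapshot layout, the sample data, and the `diff_snapshots` and `orphan_vertices` helpers are hypothetical.

```python
def diff_snapshots(previous, current):
    """Compare two daily snapshots keyed by row ID.

    Returns (upserts, deletes): rows that are new or changed in `current`,
    and IDs that were in `previous` but are missing from `current`.
    """
    upserts = {rid: row for rid, row in current.items() if previous.get(rid) != row}
    deletes = sorted(set(previous) - set(current))
    return upserts, deletes


def orphan_vertices(vertex_ids, edges):
    """Return the vertex IDs that no remaining edge touches."""
    connected = set()
    for src, dst in edges:
        connected.update((src, dst))
    return sorted(set(vertex_ids) - connected)


# Hypothetical edge snapshots: edge ID -> (from, to, label).
yesterday = {
    "abc_visitor_0.0.0.0_ipAddress": ("abc_visitor", "0.0.0.0_ipAddress", "hasIPAddress"),
    "def_visitor_1.1.1.1_ipAddress": ("def_visitor", "1.1.1.1_ipAddress", "hasIPAddress"),
}
today = {
    "abc_visitor_0.0.0.0_ipAddress": ("abc_visitor", "0.0.0.0_ipAddress", "hasIPAddress"),
    "ghi_visitor_2.2.2.2_ipAddress": ("ghi_visitor", "2.2.2.2_ipAddress", "hasIPAddress"),
}

# Additions and modifications go to the bulk loader; deletions go through Gremlin.
upserts, deletes = diff_snapshots(yesterday, today)

# After deletions, vertices that no longer have any edge are candidates for cleanup.
all_vertices = {
    vertex
    for src, dst, _ in list(yesterday.values()) + list(today.values())
    for vertex in (src, dst)
}
remaining_edges = [(src, dst) for src, dst, _ in today.values()]
orphans = orphan_vertices(all_vertices, remaining_edges)
```

Splitting the work this way keeps the bulk load limited to rows that actually changed, while the smaller set of deletions and orphans is handled separately.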

The lead mapping data process creates incremental datasets and stores them in an S3 bucket. An EMR job starts bulk loading the lead mapping data set to Neptune on a nightly basis.
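
For readers unfamiliar with the bulk loader, a load is started by sending an HTTP POST to the cluster's `/loader` endpoint with a JSON body that points at the files in Amazon S3. The following sketch only builds such a request and does not send it. The endpoint host, bucket prefix, IAM role ARN, and Region are placeholders, port 8182 is assumed as the default Neptune port, and `build_loader_request` is a hypothetical helper rather than the team's code.

```python
import json
import urllib.request


def build_loader_request(endpoint, s3_prefix, iam_role_arn, region):
    """Build (but do not send) a bulk load request for Gremlin CSV files."""
    body = {
        "source": s3_prefix,         # S3 prefix holding the vertex and edge files
        "format": "csv",             # Gremlin CSV load format
        "iamRoleArn": iam_role_arn,  # role that lets Neptune read from the bucket
        "region": region,
        "failOnError": "FALSE",      # assumption: keep loading past bad rows
    }
    return urllib.request.Request(
        url=f"https://{endpoint}:8182/loader",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder values for illustration only.
request = build_loader_request(
    endpoint="example-cluster.cluster-abc123.us-east-1.neptune.amazonaws.com",
    s3_prefix="s3://example-bucket/lead-mapping/2020-07-10/",
    iam_role_arn="arn:aws:iam::123456789012:role/ExampleNeptuneLoadRole",
    region="us-east-1",
)
```

Passing the request to `urllib.request.urlopen` from a host with network access to the cluster would start the load. If IAM database authentication is enabled on the cluster, the request would also need to be signed.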

Overall solution architecture

The following block diagram outlines how the Consumer Insights team plans to offer complete personalization capabilities to business units. They combine household and lead matching from Neptune along with browsing history from DynamoDB, with shopper segmentation and vehicle recommendations. Downstream processes utilized across business units like Autotrader®, Dealer.com®, Kelley Blue Book®, and VinSolutions® facilitate content personalization, ad targeting, remarketing, and in-store personalization based on the combined household view. The yellow highlighted blocks are the foundational elements for the identity graph. In this section, we also dive deeper to outline the architecture used for the identity graph.

Identity resolution powered by Neptune provides real-time query-response to downstream processes. As a result, it is the mission-critical component of delivering downstream personalization capabilities. Therefore, fault tolerance, high availability, and the ability to scale operations are critical for the Consumer Insights team. To facilitate these technical goals, the Neptune writer node is provisioned on a db.r5.2xlarge instance with two read replicas spread across multiple Availability Zones. Both read replicas use the same instance type as the writer node. All REST API data lookups are serviced by both read replicas. To enable horizontal scale-out, client applications use Neptune’s built-in load balanced reader endpoint rather than directly accessing Neptune reader nodes.
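
The split between the writer and the load-balanced reader endpoint can be captured in client configuration. The following is a minimal sketch, assuming placeholder host names, the default port 8182, and a hypothetical `NeptuneEndpoints` helper: lookups resolve to the reader endpoint, and everything else goes to the cluster (writer) endpoint.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NeptuneEndpoints:
    """Holds the two cluster-level endpoints a client needs."""

    writer: str  # cluster endpoint, which points at the writer node
    reader: str  # reader endpoint, which balances across the read replicas
    port: int = 8182  # assumption: default Neptune port

    def gremlin_url(self, read_only: bool) -> str:
        """Return the Gremlin WebSocket URL for a read or a write workload."""
        host = self.reader if read_only else self.writer
        return f"wss://{host}:{self.port}/gremlin"


# Placeholder host names for illustration only.
endpoints = NeptuneEndpoints(
    writer="example-cluster.cluster-abc123.us-east-1.neptune.amazonaws.com",
    reader="example-cluster.cluster-ro-abc123.us-east-1.neptune.amazonaws.com",
)

lookup_url = endpoints.gremlin_url(read_only=True)   # identity resolution lookups
refresh_url = endpoints.gremlin_url(read_only=False)  # deletions during refresh
```

Because clients only ever see the reader endpoint, read replicas can be added or replaced without changing client configuration.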

The following diagram illustrates the solution architecture of the identity graph and shows the steps for incrementally refreshing the graph (using Amazon S3, Amazon EMR and bulk loader), Multi-AZ Neptune read replicas and the built-in load balancer.

Results

On average, the Consumer Insights team’s roll-out of the Neptune-based identity resolution has yielded twice as much browsing history per household compared to using individual cookies. The team believes this will have a direct impact on creating highly personalized content for car shoppers, resulting in higher engagement, better click-through rates, and higher email open rates.

At the time of writing, the immediate goals of reducing dependence on third-party cookies and building a 360-degree view of the consumer household have been addressed. Downstream systems such as advertisement segmentation and vehicle recommendation are being integrated with the new identity graph. The business impact of these downstream processes will be quantified after these integrations are complete.

Lessons learned and recommendations

Reflecting on the 9-month journey of building and integrating the identity graph, the Consumer Insights team has compiled the following lessons and recommendations to help other enterprises, software architects or software engineers embarking on a graph use case:

  1. Investing in training – The team didn’t have any in-house graph database or Neptune skillset. At the onset, the team collaborated with AWS to plan a 1-day training session for their 12 SDEs. The training focused on graph data modelling, querying with Gremlin and Neptune best practices. The training was a very valuable step in the team’s graph journey.
  2. Understanding best practices – Understanding the business use case and learning from existing graph implementations is one of the best ways to avoid any unforeseen errors. The team felt that Best Practices: Getting the most out of Neptune was one of the best resources that helped them with an optimal technical design.
  3. Right-sizing Neptune instances for PoC – Choosing the right instance type for a proof of concept (PoC) can greatly impact the total cost of the PoC and the engineering time invested in building a demo application. As a rule of thumb, if the PoC involves bulk loading existing data to Neptune, getting the largest instance reduces the overall time to bulk load the data, thereby reducing overall engineering time and compute cost. When the bulk load is complete, switching to a smaller instance type could save valuable resources.

About the Authors

Carlos Rendon is a Principal Technical Architect at Cox Automotive. He has been working for the last 8 years on providing real-time consumer personalization, vehicle recommendations, and ad-targeting for KBB.com, Autotrader.com, Dealer.com, and VinSolutions.

Niraj Jetly is a Software Development Manager for Neptune at Amazon Web Services. Prior to AWS, Niraj has led several product and engineering teams as CTO, VP-Engineering, and Head of Product Management for over 15 years. Niraj is a recipient of over 15 innovation awards including being named as CIO of the year in 2014 and top 100 CIO in 2013 and 2016. A frequent speaker at several conferences, he has been quoted in NPR, WSJ, and The Boston Globe.