AWS for Industries

How Audi improved their chat experience with Generative AI on Amazon SageMaker

Audi AG is a German automotive manufacturer and part of the Volkswagen Group. It has production facilities in several countries, including Germany, Hungary, Belgium, Mexico and India. With a strong focus on quality, design and engineering excellence, Audi has established itself as a leading brand in the global luxury car industry, designing, engineering, producing, marketing and distributing luxury vehicles.

Reply specializes in the design and implementation of solutions based on new communication channels and digital media. As a network of highly specialized companies, Reply defines and develops business models enabled by the new models of AI, big data, cloud computing, digital media and the internet of things. Reply delivers consulting, system integration and digital services to organizations across the telecom and media; industry and services; banking and insurance; and public sectors.

Audi and Reply worked with Amazon Web Services (AWS) on a project to help improve their enterprise search experience through a Generative AI chatbot. The solution is based on a technique named Retrieval Augmented Generation (RAG), which uses AWS services such as Amazon SageMaker and Amazon OpenSearch Service. Ancillary capabilities are offered by other AWS services, such as Amazon Simple Storage Service (Amazon S3), AWS Lambda, Amazon CloudFront, Amazon API Gateway, and Amazon Cognito.

In this post, we discuss how Audi improved their chat experience by using a Generative AI solution on Amazon SageMaker, and we dive deeper into the essential components of their chatbot by showcasing how to deploy and consume two state-of-the-art Large Language Models (LLMs): Falcon 7B-Instruct, designed for Natural Language Processing (NLP) tasks in specific domains where the model follows user instructions and produces the desired output, and Llama 2 13B-Chat, designed for conversational contexts where the model responds to the user's messages in a natural and engaging way.

How Audi and Reply arrived at this solution

For over three years, Reply has been helping Audi transition to the cloud. As Audi's internal knowledge base grew rapidly, internal documentation became difficult to navigate at times: pages were hard to keep up to date, topics were scattered across multiple documents in different locations, and some information was redundant or outdated. These issues posed a significant challenge to education and training activities. In addition, the Audi internal ticketing system receives diverse queries from developers, and it often takes them hours to navigate the documentation and fully grasp a topic. This situation resulted in productivity losses and less-than-optimal service response times.

The following paragraphs provide an overview of the solution, discuss its features, and report the results of the pilot project.

Solution overview

The high-level architecture of the Generative AI chatbot is illustrated in Figure 1 below:

Figure 1: High-Level Architecture of Generative AI Chatbot

The solution workflow can be described in two steps: data ingestion and chatbot inference. These two steps are part of the RAG technique used to power the chatbot solution.

Data Ingestion

In this process, the data to be ingested consists of documents from the Confluence space. An external data ingestion component accesses Confluence using an API key and converts the documents into a readable text format. The text is then split into smaller chunks using a recursive character splitter and tokenized with the tokenizer of the selected LLM.

The chunk size can vary according to the number of tokens the LLM accepts, known as its context window. For Falcon 7B-Instruct, which has a context window of 2048 tokens, we used a chunk size of 200 with an overlap of 20. In contrast, we used a chunk size of 1000 with an overlap of 200 for the Llama 2 13B-Chat model, as it has a longer context window of 4096 tokens.
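
As an illustration, the following is a minimal sketch of this chunking step using LangChain's RecursiveCharacterTextSplitter with the Falcon 7B-Instruct values above; the placeholder page text and the tokenizer loading are assumptions made for the example, not the production ingestion code.

```python
# Sketch of the chunking step with LangChain, assuming the Confluence pages
# have already been exported to plain text. Values match the Falcon
# 7B-Instruct configuration described above (chunk size 200, overlap 20).
from langchain.text_splitter import RecursiveCharacterTextSplitter
from transformers import AutoTokenizer

# Measure chunk length in tokens of the selected LLM rather than in characters.
tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-7b-instruct")

splitter = RecursiveCharacterTextSplitter(
    chunk_size=200,
    chunk_overlap=20,
    length_function=lambda text: len(tokenizer.encode(text)),
)

page_text = "...plain text extracted from a Confluence page..."  # placeholder input
chunks = splitter.split_text(page_text)
```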

These chunks are then fed to the embeddings model and converted into embedding vectors, which are stored in a vector database where they can subsequently be queried. The querying of the data uses semantic search techniques, described in the next section.
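
As a minimal sketch of this indexing step (reusing the chunks from the snippet above), the embeddings can be generated and stored with LangChain's OpenSearch integration. For simplicity, the Instructor-XL model is loaded locally here, whereas the actual solution serves it from a SageMaker endpoint; the domain URL and index name are assumptions, and authentication is omitted.

```python
# Sketch of embedding the chunks and storing them in Amazon OpenSearch Service.
# The embeddings model is loaded locally for illustration only; the domain URL,
# index name, and authentication (omitted here) are deployment-specific.
from langchain.embeddings import HuggingFaceInstructEmbeddings
from langchain.vectorstores import OpenSearchVectorSearch

embeddings = HuggingFaceInstructEmbeddings(model_name="hkunlp/instructor-xl")

vector_store = OpenSearchVectorSearch.from_texts(
    texts=chunks,                      # chunks produced by the splitter above
    embedding=embeddings,
    opensearch_url="https://my-domain.eu-central-1.es.amazonaws.com",  # hypothetical
    index_name="audi-docs",            # hypothetical index name
)
```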

Chatbot Inference

The chatbot workflow, shown in Figure 2 below, consists of the following steps:

  1. Authentication: Audi users first log into the User Interface (UI) by authenticating themselves. After successful completion, users are directed to the chatbot UI.
  2. Querying: The user can then start querying the chatbot. The queries are transmitted through API Gateway to the Lambda function, where they are then converted into embeddings by the hosted embeddings model.
  3. LLM Query extraction: The hosted LLM extracts the most relevant parts of the query.
  4. Vector Retrieval: The LLM output is passed to the vector database through the LangChain-based Lambda function. Based on this query, the database retrieves the k most relevant vectors using semantic search techniques, to be used as context.
  5. Response Generation: The context, along with the query, is combined into a prompt that instructs the LLM to answer only if the similarity scores are above a predefined threshold. The LLM then evaluates the generated prompt and produces a response accordingly, as sketched below.

Figure 2: The Audi Chatbot Interface
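
A minimal sketch of the response-generation logic (step 5 above) is shown below; the similarity threshold, prompt wording, and helper name are illustrative assumptions, not the production prompt.

```python
# Sketch of step 5: build a prompt from the retrieved context and answer only
# when retrieval confidence is sufficient. Threshold and wording are assumptions.
SIMILARITY_THRESHOLD = 0.7

PROMPT_TEMPLATE = """Answer the question using only the context below.
If the context does not contain the answer, say that no answer can be provided.

Context:
{context}

Question: {question}
Answer:"""


def build_prompt(question, hits):
    """hits: list of (chunk_text, similarity_score) pairs from the vector search."""
    relevant = [text for text, score in hits if score >= SIMILARITY_THRESHOLD]
    if not relevant:
        return None  # the caller returns a polite "no answer available" message
    return PROMPT_TEMPLATE.format(context="\n\n".join(relevant), question=question)
```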

Semantic Search

Semantic search is a technique that tries to find the most relevant results based on the meaning and context of the query. The main difference between semantic search and traditional keyword search is that the latter tries to match the exact words or phrases in the query to the retrieved results. In contrast, with the help of a Deep Neural Network (DNN) engine, semantic search retrieves results based on the context of the search and answers questions in a human-like manner.

To enable semantic search, the documentation to be queried is split into chunks, tokenized, and converted into embeddings. The embeddings are then stored as vectors in a vector database, which is capable of storing data as high-dimensional vectors. The dimensions of the vectors depend on the complexity and granularity of the data. Vector databases enable quick and highly accurate similarity search and retrieval of relevant data based on the input query. Similarity search works by calculating the distance between two vectors, be it through cosine similarity, Euclidean distance, or Hamming distance.
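
For example, the cosine similarity between two embedding vectors can be computed as follows:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: close to 1.0 for vectors pointing in the same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```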

Furthermore, the vector database utilizes the approximate K-Nearest Neighbors (KNN) algorithm to cluster the vectors closest in distance. Approximate KNN can be computed using different algorithms, such as Hierarchical Navigable Small Worlds (HNSW), which are implemented by engines such as Non-Metric Space Library (NMSLIB) and Apache Lucene.

To summarize how semantic search works, the query is first converted into an embedding vector using the same function that was used to process the documentation. Then, the approximate KNN algorithm retrieves the k most relevant elements from the vector database, where k represents how many “neighbors” we are looking for, by calculating the distance between the query and the elements in the dataset using their embedding vectors.
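
As a sketch, an approximate k-NN query against Amazon OpenSearch Service could look like the following with the opensearch-py client, reusing the embeddings object from the ingestion sketch; the domain endpoint, index name, and field names are assumptions, and authentication is omitted for brevity.

```python
# Sketch of an approximate k-NN query, assuming the "audi-docs" index stores the
# embeddings in a knn_vector field named "embedding" and the chunk text in "text".
from opensearchpy import OpenSearch

client = OpenSearch(hosts=["https://my-domain.eu-central-1.es.amazonaws.com"])

query_vector = embeddings.embed_query("How do I request a new AWS account?")

response = client.search(
    index="audi-docs",
    body={
        "size": 3,  # k: how many "neighbors" to return
        "query": {"knn": {"embedding": {"vector": query_vector, "k": 3}}},
    },
)
top_chunks = [hit["_source"]["text"] for hit in response["hits"]["hits"]]
```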

Solution details

The key features of the chatbot are shown in Figure 3 and described below:

Figure 3: Key Features of the Chatbot

  • Security: The chatbot solution is designed to consist of two tiers of security. Isolating the solution with a VPC helps ensure safety and privacy from external actors. The in-house approach is also designed to enable employees and customers to extract insights from their proprietary data, thus helping to reduce the possibility of information leakage to third-party applications.
  • Improved Hallucination Resistance: Modern Generative AI chatbots are often prone to hallucinations if they are asked a question from outside their training corpus. Our implementation relies on RAG and efficient prompt engineering to help reduce the degree of hallucination as much as possible. Moreover, the chatbot kindly advises the user that no answer can be provided if an adequate confidence level is not met.
  • Low Latency: Employees often spend a significant amount of time searching for information in a vast knowledge base, leading to productivity loss. Our chatbot not only reduces the search time from hours to a few seconds, but is also able to reason and generate unique insights from the data.
  • Multiple Sources Integration: From Confluence to PDF, from Notion to customized websites, our AI chatbot solution is capable of ingesting data from a variety of sources. The integration flexibility opens up a plethora of opportunities to leverage information from multiple unstructured data sources and generate actionable insights.

SageMaker Endpoints

The LLMs are hosted on Amazon SageMaker endpoints: Falcon 7B-Instruct runs on an ml.g5.4xlarge instance, while Llama 2 13B-Chat is powered by an ml.g5.12xlarge instance. The deployment is carried out manually by means of an AWS SDK script. This approach provides flexibility for fine-tuning and a faster development cycle. Comparison details are described in the next paragraphs.
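
The following is a minimal sketch of such a deployment for Falcon 7B-Instruct using the SageMaker Python SDK and the Hugging Face LLM container; the model ID and environment values are common settings for this model, shown here as assumptions rather than the exact production script.

```python
# Sketch of deploying Falcon 7B-Instruct to an ml.g5.4xlarge SageMaker endpoint
# using the Hugging Face LLM (TGI) container.
import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

role = sagemaker.get_execution_role()  # assumes an existing SageMaker execution role

llm_model = HuggingFaceModel(
    role=role,
    image_uri=get_huggingface_llm_image_uri("huggingface"),
    env={
        "HF_MODEL_ID": "tiiuae/falcon-7b-instruct",
        "SM_NUM_GPUS": "1",
        "MAX_INPUT_LENGTH": "1024",
        "MAX_TOTAL_TOKENS": "2048",  # Falcon's 2048-token context window
    },
)

predictor = llm_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.4xlarge",
)

# Example invocation
print(predictor.predict({"inputs": "What is Retrieval Augmented Generation?"}))
```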

Furthermore, we deployed the HuggingFace Instructor-XL embeddings model, selected by analyzing the MTEB Public Leaderboard, to convert the documentation as well as the queries into embedding vectors. In this case, the endpoint runs on an ml.r5.xlarge instance, a memory-optimized instance type, to ensure that results are readily available with low latency and in a cost-optimized manner.

Amazon OpenSearch

Amazon OpenSearch Service is the vector database of choice: a managed service with advanced support for vectors that relieves users of the burden of daily operations.
Amazon OpenSearch Service was chosen for its greater flexibility in optimizing the chunk size and selecting the search algorithm and embeddings model, as well as for its lower operational costs for this use case.

Lambda Function

The Lambda function acts as an orchestrator, connecting the different components and organizing the flow of information. It uses a Lambda layer that provides LangChain, a framework that simplifies and speeds up the creation of Generative AI applications.
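
To illustrate, the orchestration inside the Lambda function could be sketched with LangChain as follows; the endpoint names, index name, and request/response payload formats in the content handlers are assumptions that depend on how the endpoints were deployed.

```python
# Sketch of the Lambda orchestration with LangChain. Endpoint names, index name,
# and the JSON payload formats in the content handlers are assumptions.
import json

from langchain.chains import RetrievalQA
from langchain.embeddings import SagemakerEndpointEmbeddings
from langchain.embeddings.sagemaker_endpoint import EmbeddingsContentHandler
from langchain.llms import SagemakerEndpoint
from langchain.llms.sagemaker_endpoint import LLMContentHandler
from langchain.vectorstores import OpenSearchVectorSearch


class EmbeddingsHandler(EmbeddingsContentHandler):
    # The JSON keys depend on the inference script of the embeddings endpoint.
    content_type = "application/json"
    accepts = "application/json"

    def transform_input(self, inputs, model_kwargs):
        return json.dumps({"inputs": inputs, **model_kwargs}).encode("utf-8")

    def transform_output(self, output):
        return json.loads(output.read().decode("utf-8"))["vectors"]


class FalconHandler(LLMContentHandler):
    # Matches the request/response format of the Hugging Face LLM container.
    content_type = "application/json"
    accepts = "application/json"

    def transform_input(self, prompt, model_kwargs):
        return json.dumps({"inputs": prompt, "parameters": model_kwargs}).encode("utf-8")

    def transform_output(self, output):
        return json.loads(output.read().decode("utf-8"))[0]["generated_text"]


embeddings = SagemakerEndpointEmbeddings(
    endpoint_name="instructor-xl-endpoint",       # hypothetical endpoint name
    region_name="eu-central-1",
    content_handler=EmbeddingsHandler(),
)
llm = SagemakerEndpoint(
    endpoint_name="falcon-7b-instruct-endpoint",  # hypothetical endpoint name
    region_name="eu-central-1",
    content_handler=FalconHandler(),
    model_kwargs={"max_new_tokens": 256, "temperature": 0.1},
)
vector_store = OpenSearchVectorSearch(
    opensearch_url="https://my-domain.eu-central-1.es.amazonaws.com",
    index_name="audi-docs",
    embedding_function=embeddings,
)
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=vector_store.as_retriever(search_kwargs={"k": 3}),
)


def handler(event, context):
    """Lambda entry point: run the RAG chain on the user's question."""
    question = json.loads(event["body"])["question"]
    answer = qa_chain.run(question)
    return {"statusCode": 200, "body": json.dumps({"answer": answer})}
```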

CloudFront, API Gateway, and Cognito

The chatbot’s frontend is hosted on a serverless architecture using Amazon S3 and CloudFront. User connections are encrypted with TLS and authenticated via Cognito to restrict access.

The API Gateway provides a REST API that connects the frontend to the Lambda function. It facilitates communication between these services while governing usage through throttling and access controls. This infrastructure ensures secure and reliable connectivity that can scale on demand.

Virtual Private Cloud (VPC)

The Lambda function, LLM endpoint, embeddings endpoint, and vector database are all hosted inside an Amazon Virtual Private Cloud (Amazon VPC). Architecting these inside an Amazon VPC ensures high security of the solution, as each component with network exposure is private with restricted access to the internet. The application’s security and privacy aspects are strengthened since the capability of external bad actors to steal data from the vector database or tamper with the endpoints is greatly reduced.

Results
Comparison of the LLMs

A comparison was made between the two LLMs, Falcon 7B-Instruct and Llama 2 13B-Chat, with the results documented in the following table. Since Falcon 7B-Instruct has a lower inference time and is cheaper to host than Llama 2 13B-Chat, the Falcon 7B-Instruct model will be used for further experiments, despite having fewer capabilities than the Llama 2 13B-Chat model.

[Table: Comparison of Falcon 7B-Instruct and Llama 2 13B-Chat]

Results of experimentation on Audi Documentation

We used the Falcon 7B-Instruct model for conducting the experiments on the Audi documentation, and the results are detailed in the following table.

[Table: Results of experimentation on Audi documentation]

While the model performs well on most queries, it sometimes struggles to answer longer queries, for which we can instead use the Llama 2 family of models. Furthermore, the models are unable to answer specific questions involving names, which can be solved using a hybrid search approach, to combine semantic search techniques with keyword search techniques. An average latency of 6 seconds may be considered high for real-time applications, so the solution could be further optimized to get faster responses.

Conclusion

This blog described how the Audi enterprise search experience was improved, thanks to the innovative Generative AI chatbot solution on AWS. The Audi chatbot is designed to help reduce search time from hours to a few seconds while providing high fidelity and accuracy in the generated responses.

With the rapid developments in the Generative AI landscape, Reply, Audi, and AWS will continue to enhance this solution by adding features for security, performance, and cost optimization, and by scaling it further through new use cases across multiple verticals within Audi.

To learn more about running your AI/ML and Generative AI experimentation and development workloads on AWS, visit Amazon SageMaker.

Fabrizio Siciliano

Fabrizio Siciliano is a Full Stack Solutions Architect for AWS based in Munich. He works with customers in the automotive industry, helping them to deeply understand their technical needs, overcome their technical challenges, and implement cloud-based applications for different kinds of end users. His expertise includes full-stack development, cloud-based applications, and cloud architecting. He loves food, traveling, exploring new horizons and destinations, and living life to its fullest.

Bruno Pistone

Bruno Pistone is an AI/ML Specialist Solutions Architect for AWS based in Milan. He works with large customers, helping them to deeply understand their technical needs and design AI and Machine Learning solutions that make the best use of the AWS Cloud and the Amazon Machine Learning stack. His expertise includes end-to-end Machine Learning, Machine Learning industrialization, and Generative AI. He enjoys spending time with his friends and exploring new places, as well as travelling to new destinations.

Domenico Capano

Domenico Capano is a DevOps Engineer and Scrum Master for Reply. He is part of the Audi Cloud Foundation Services Platform Team, which provides a basic framework for using AWS in a secure and compliant way inside Audi, spanning more than 250 individual customer AWS accounts, more than 3,000 federated users, and 60 successfully hosted projects. Domenico is also part of the Community of Practice for AI-Powered Software Development at Reply. His expertise includes Requirements Engineering, Customer Management, Solution Architecture, and Generative AI. He has a high customer focus and enjoys spending time traveling and learning about new technologies.

Farooq Khan

Farooq Khan is a Customer Solutions Management Leader for Global Automotive OEMs at Amazon Web Services. He and his team operate as the voice of the customer within AWS by supporting the most strategic automotive customers on their cloud journey and in their digital transformation. He has an industry background in connected vehicles, embedded/connected navigation, and embedded software development. Prior to AWS, Farooq held various roles at Harman, Volkswagen Infotainment, and BlackBerry across software and product development.

Francesco Ongaro

Francesco Ongaro is a Senior Manager at Storm Reply and is based in Munich, Germany. He has more than a decade of experience on AWS, supporting Italian and German enterprises during their cloud journey. His technical background and vision for emerging technologies help him lead and expand his business unit. He likes snowboarding, travelling with his wife, and spending time with good friends.

Matteo Lanati

Matteo Lanati is a Senior Consultant at Storm Reply Germany in Munich. He has an academic background in telecommunications and over ten years of experience as a system administrator / DevOps. He contributed to multiple projects on topics such as Infrastructure as Code, automation, migration to Kubernetes and architecture design based on AWS services. In his free time, he likes reading and climbing.

Michael Pawelke

Michael Pawelke is the Product Owner for the Audi AWS Cloud Foundation team in Ingolstadt, Germany. With expertise in agile methodologies, he holds certifications as a Product Owner and Scrum Master, facilitating seamless project management. His technical vision extends to AWS cloud solutions, which is reinforced by his AWS Solutions Architect Associate certification. Michael’s impact is felt across the company as he streamlines AWS account provisioning for various projects. Outside of work, he’s a sports enthusiast who enjoys cycling and skiing, complementing his professional dedication with an active lifestyle.

Timo Schmidt

Timo Schmidt is a Manager and Principal Solutions Architect at Storm Reply, specializing in driving the widespread adoption of Amazon Web Services (AWS). With a broad general knowledge of many different AWS services, he excels at advising customers on the most appropriate solution based on best practices and needs. Working with project owners and central cloud teams, Timo helps turn their cloud visions into reality by providing architectural guidance and expertise throughout the implementation of strategic cloud solutions.

Toaha Umar

Toaha Umar is an ML Consultant and AI Task Force Lead at Storm Reply in Munich, Germany. He graduated from TU Munich with a master's degree in communications engineering, with a focus on machine learning applications in multimedia communications and the automotive industry. He helps customers in their AI cloud journey by architecting and engineering end-to-end AI solutions and through project management, specializing in Generative AI and MLOps. Toaha has more than six years of experience in leadership positions in non-profit organizations, including IEEE and TUM.ai, Europe's leading AI student initiative. In his free time, he enjoys reading, cooking and travelling.