Getting your generative AI pilot ready for production

In this article, you’ll learn about the requirements of production-ready RAG for your generative AI and LLM implementation.

Your LLM pilot succeeded – now how do you scale?

Large language models (LLMs) and generative AI have taken the world by storm, which is not surprising considering the awe-inspiring feats these models are capable of. This wave of interest has driven a surge of pilots and prototypes in organizations of all types, aided by the rapid evolution of the tooling, frameworks, and services that enable them.

And guess what, users of those prototypes generally loved them. That means development, operations, platform, and all other teams involved in actually building and running these applications now need to get them to a state of production readiness.

In this article we’ll talk about what that means, and how tools like Amazon Bedrock and Pinecone can help.

Introduction to generative AI scaling on AWS

What is “production ready” for your generative AI application?

Getting your application ready to handle the demands of a production environment requires you to look at many different areas that may not have been top of mind while you were building your pilot:

  • Scalability:
    Can your machine learning (ML) infrastructure handle the demands of users at scale?
  • Cost:
    Are the technologies used during pilot development cost-effective at production scale?
  • Privacy and Security:
    How can you ensure the output generated by your pilot does not expose user data?
    How can you protect the new surface of attack that your ML solution exposes?
  • Tenancy:
    Are you ready to offer your new set of capabilities to different groups within your organization or different sets of users?
  • Day 2 Operations:
    How will you continuously deploy new versions of your ML-driven service? What about observability and reliability?

These are just a handful of the most critical areas that must be considered when looking to promote pilots and prototypes to production readiness. Let’s dig a little deeper into scalability and cost:

Scalability

Of course, when something hits production, the scale it is exposed to will increase considerably, maybe dramatically. That means all systems involved in its operation, from the compute running your model to any external systems storing data, both raw and prepared as context, must scale alongside it.

And scalability, particularly for the components of the infrastructure that directly handle user interactions, must be aligned with the responsiveness users expect. Even though the nature of these applications is very different from anything we’ve built before, user expectations have not changed, and “immediate” responses are what users are really looking for.

Cost

But as your scale explodes, you don’t want your cloud invoices to explode as well. Design choices around cost are often secondary during the prototype and pilot stages, leaning more toward velocity than fiscal responsibility, but once you hit production scale, many of those choices may need to be revisited to ensure the application has a positive impact on your bottom line and doesn’t break budgets.

Another critical aspect of the finances of a production-ready system revolves around metadata. Requests to an ML system usually involve a more complex orchestration of systems than traditional APIs, which means tracking the various components that participate in serving users must rely on solid data.

Understanding outputs of a probabilistic system

Generative AI, LLMs, and machine learning in general are not deterministic: given the same set of inputs, the output will not always be the same. This quality poses unique challenges when you are looking to provide reliable value to users who now rely on these new features as a core part of your product.

Evaluating, testing, and tracing the output generated by your models is critical once those outputs are in front of a broad user base.
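One practical approach is to assert on the facts an answer must contain rather than on exact strings, so tests stay stable even as wording varies. Below is a minimal sketch of such a regression-style check using the Bedrock Converse API; the model ID, prompt, and expected fact are illustrative placeholders rather than values from this article.

```python
# Minimal sketch of a regression-style output check against Amazon Bedrock.
# Assumes boto3 credentials are configured and you have access to the
# (illustrative) model ID below; prompt and expected fact are placeholders.
import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

def generate(prompt: str) -> str:
    # The Converse API gives a model-agnostic request/response shape.
    response = bedrock_runtime.converse(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",  # example model ID
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"temperature": 0.0, "maxTokens": 512},
    )
    return response["output"]["message"]["content"][0]["text"]

def test_refund_policy_answer():
    # Because outputs are probabilistic, check for required facts instead of
    # an exact string match.
    answer = generate("What is our refund window for annual plans?")
    assert "30 days" in answer  # expected fact from your own source data
```

Running a small suite of checks like this on every prompt or model change gives you a repeatable signal, even though individual responses vary.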

Different users, different data, different contexts

Data sets used to provide context to LLMs, for example through retrieval-augmented generation (RAG), will also grow, and may require segmenting that data and introducing multi-tenancy concepts into your generative AI application.
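Pinecone namespaces are one way to implement that segmentation: each tenant’s vectors are written to, and queried from, their own namespace. The sketch below assumes an existing index and uses illustrative names and a stand-in vector rather than a real embedding.

```python
# Sketch: scoping Pinecone reads and writes by tenant using namespaces.
# Assumes an index named "product-docs" already exists; the API key, index
# name, tenant IDs, and vector values are illustrative placeholders.
from pinecone import Pinecone

pc = Pinecone(api_key="YOUR_API_KEY")
index = pc.Index("product-docs")

vector = [0.1] * 1024  # stand-in for a real embedding with the index's dimension

# Writes land in the tenant's own namespace...
index.upsert(
    vectors=[{"id": "doc-1", "values": vector}],
    namespace="tenant-a",
)

# ...and reads are scoped to the same namespace, so one tenant's queries
# never surface another tenant's context.
results = index.query(vector=vector, top_k=3, namespace="tenant-a")
```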

Choosing the right services

Getting the right services into your solution architecture lies at the heart of a successful production deployment of generative AI applications, one that addresses the various concerns outlined above.

We’re going to look at how Pinecone and Amazon Bedrock working together can help you easily achieve those objectives and pave the way for a growing number of machine learning capabilities in your organization.

Production-ready generative AI stack

Your generative AI application will always require a set of services targeted at the different demands of a typical LLM architecture:

  • You’ll need compute to run your front end and interfaces between users and the underlying models.
  • You'll need optimized and specialized compute to run the models themselves.
  • You’ll need repositories to store data, both in raw format as well as transformed to representations that can be used for fast, at-scale similarity searches when you are using RAG.
  • And depending on how you approach things, you may end up with data pipelines that handle the transformation of raw data into its vector representations.
Generative AI Stack
Source: community.aws

A baseline architecture

Let’s look at a common architecture when building machine learning solutions:

Baseline Architecture: Pinecone

First, you need a user-facing application that directly interacts with users. If you’re using a chat interface, it will likely hold the WebSocket connections for the various clients and orchestrate communication with one or more backend APIs, including that of your LLM implementation. Most likely you will run that service containerized in a managed container orchestrator like Amazon Elastic Kubernetes Service (Amazon EKS).

Then you’ll need compute capacity optimized to run your machine learning model, and of course you’ll need a model in the first place. For most practical purposes, and considering the rapid evolution and maturity of open-source and foundation models, you will likely use one of the many readily available pre-trained models.

And lastly, considering the very likely scenario of utilizing RAG to provide custom contextual data for your LLM to incorporate in its responses, you’ll need some form of vector storage that can be used to query for similar data.
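To make that shape concrete, here is a minimal sketch of the request path in Python: embed the question with a Bedrock embedding model, retrieve similar context from Pinecone, and pass both to a Bedrock-hosted foundation model. The index name, model IDs, region, and the assumption that each stored vector carries its source text as metadata are all illustrative.

```python
# Minimal sketch of the RAG request path in the baseline architecture.
# Assumes a populated Pinecone index whose vectors store their source text
# under a "text" metadata key; names, IDs, and region are placeholders.
import json

import boto3
from pinecone import Pinecone

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")
index = Pinecone(api_key="YOUR_API_KEY").Index("product-docs")

def embed(text: str) -> list[float]:
    # Titan Text Embeddings V2 is one option; any embedding model works as
    # long as its dimension matches the Pinecone index.
    response = bedrock_runtime.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",
        body=json.dumps({"inputText": text}),
    )
    return json.loads(response["body"].read())["embedding"]

def answer(question: str) -> str:
    # 1. Retrieve the most similar chunks of context from the vector store.
    results = index.query(vector=embed(question), top_k=3, include_metadata=True)
    context = "\n".join(match.metadata["text"] for match in results.matches)

    # 2. Ask the foundation model to answer using only that context.
    response = bedrock_runtime.converse(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",  # example model ID
        messages=[{
            "role": "user",
            "content": [{"text": f"Answer using this context:\n{context}\n\nQuestion: {question}"}],
        }],
    )
    return response["output"]["message"]["content"][0]["text"]
```

The Knowledge Bases integration described later removes the need to write and operate this retrieval plumbing yourself.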

Production-ready compute and inference

Amazon Bedrock handles all things related to provisioning compute and offers a choice of foundation models you can simply consume, along with tools for fine-tuning, testing, and observing the behavior of your model.

Broad Choice of Models
Source: AWS

Amazon Bedrock also offers Agents and other powerful capabilities such as Knowledge Bases, which allow effortless, secure, production-ready integration with data sources and third-party service providers, for example to store vector representations of that data, without having to manage any pipelines or support complicated processes.

Using Agents and Knowledge Bases, you can connect to Pinecone, which offers query performance in line with the expectations of interactive users and lets you structure your application’s data around the tenants or groups of users you want to serve, while helping you meet GDPR, HIPAA, and other requirements that, once in production, you’ll very likely need to satisfy.

Production-ready vector storage

Pinecone is designed from the ground up for production readiness, with roughly 50ms latency for datasets that span hundreds of millions of embeddings (51ms p95 query latency, for those interested in the technicalities).

Security and reliability, the other two elements we discussed above for ensuring your LLM implementation is ready for production exposure, are also built into Pinecone, with SOC 2 and HIPAA compliance as well as SLAs and plenty of observability tooling.

Pinecone also generates time-series data to help you understand the performance of your vector indices, data you can pull into Prometheus- and OpenMetrics-compatible tools.

Integration using leading-edge capabilities

Amazon Bedrock not only automates many of the capabilities required to operate an ML workload in production, but also provides integrations that make connecting storage with compute effortless.

Knowledge Bases for Amazon Bedrock provides “one click” integration with Pinecone, fully automating the ingestion, embedding, and querying of your data as part of your LLM generation process.

When integrating, you simply have to create a data source pointing to the Amazon Simple Storage Service (Amazon S3) bucket where you hold your raw context data:

Data source setup
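If you prefer to script this step rather than use the console, the equivalent calls through the bedrock-agent API look roughly like the sketch below; it assumes the Knowledge Base itself already exists, and the knowledge base ID, data source name, and bucket ARN are placeholders.

```python
# Sketch: creating an S3 data source for an existing Knowledge Base and
# starting an ingestion job. IDs, names, and the bucket ARN are placeholders.
import boto3

bedrock_agent = boto3.client("bedrock-agent", region_name="us-east-1")

data_source = bedrock_agent.create_data_source(
    knowledgeBaseId="YOUR_KB_ID",
    name="raw-context-docs",
    dataSourceConfiguration={
        "type": "S3",
        "s3Configuration": {"bucketArn": "arn:aws:s3:::your-raw-context-bucket"},
    },
)

# Once the vector store connection below is in place, an ingestion job chunks
# the documents, embeds them, and writes the vectors to Pinecone.
bedrock_agent.start_ingestion_job(
    knowledgeBaseId="YOUR_KB_ID",
    dataSourceId=data_source["dataSource"]["dataSourceId"],
)
```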

Then connect Pinecone to the Knowledge Base:

Vector Database

Note that you will need to manually create the Pinecone index before setting up the connection, since you will need to specify which index to use when defining your connection endpoint!
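Here is a minimal sketch of creating that index with the Pinecone Python SDK; the API key, index name, cloud, and region are illustrative, and the dimension must match the embedding model you select for the Knowledge Base (1024 for Titan Text Embeddings V2, for example).

```python
# Sketch: creating the Pinecone index that the Knowledge Base will use.
# Name, dimension, cloud, and region are illustrative placeholders.
from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key="YOUR_API_KEY")
pc.create_index(
    name="bedrock-kb-index",
    dimension=1024,  # must match the embedding model chosen for the Knowledge Base
    metric="cosine",
    spec=ServerlessSpec(cloud="aws", region="us-east-1"),
)
```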

Finally, create an agent that will use the knowledge base as part of your retrieval and generation process:

Add Knowledge base

Your Amazon Bedrock implementation now has access to your stored vector data for context without a single pipeline and in a handful of clicks.
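From application code, querying this setup usually comes down to a single call to the bedrock-agent-runtime RetrieveAndGenerate API, which performs retrieval from Pinecone and generation in one round trip. The knowledge base ID, model ARN, region, and question below are placeholders.

```python
# Sketch: querying the Knowledge Base-backed setup from application code.
# Knowledge base ID, model ARN, region, and question are placeholders.
import boto3

agent_runtime = boto3.client("bedrock-agent-runtime", region_name="us-east-1")

response = agent_runtime.retrieve_and_generate(
    input={"text": "What is our refund window for annual plans?"},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": "YOUR_KB_ID",
            "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-haiku-20240307-v1:0",
        },
    },
)

print(response["output"]["text"])  # generated answer
print(response["citations"])       # the retrieved chunks backing that answer
```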

What do we get with this combination?

Amazon Bedrock will likely be able to run your application with little to no changes, particularly if it is built around a readily available foundation model, and give you the capabilities to handle production-scale load without the operational headaches.

When integrated with Pinecone, something you can do in a handful of clicks thanks to Knowledge Bases or Agents, you get fast, user-interactive response times, fully automated data processing and updates, and clear visibility into your LLM context. Pinecone is available to try in AWS Marketplace—pay as you go using your AWS account.

Be on the lookout for our upcoming lab, where we show you, step by step, how to build a production-ready solution like the one we glanced at above.

Why AWS Marketplace?

Try SaaS products free with your AWS account to establish your proof of concept, then pay as you go in production with AWS Billing.

Quickly go from POC to production - access free trials using your AWS account, then pay as you go.

Add capabilities to your tech stack using fast procurement and deployment, with flexible pricing and standardized licensing.

Consolidate and optimize costs for your cloud infrastructure and third-party software, all centrally managed with AWS.