Building generative AI applications for your startup, part 2

This two-part blog series discusses how to build artificial intelligence (AI) systems that can generate new content. The first part gives an introduction, explains various approaches to building generative AI applications, and reviews their key components. The second part maps these components to the right AWS services, which can help startups quickly develop and launch generative AI products or solutions by avoiding time and money spent on undifferentiated heavy lifting.

Let’s continue this blog series and dive deep into the Amazon Web Services (AWS) capabilities your startup can leverage to get your product to market as quickly as possible, while keeping cost efficiency and performance as key goals.

Which AWS services should I use to build my generative AI application?

This is best explained through my illustration of generative AI components from part 1 of this blog series. The following diagram, Figure 1, maps each component to the corresponding AWS service(s). Note that this is a curated set of AWS services I see startups benefiting from; other AWS services are also available.

Figure 1: Mapping AWS services to generative AI components.

To elaborate, I will start with mapping AWS services to the common components of a generative AI application. Then I will explain the AWS services that map to the remaining components in Figure 1, based on the approaches you use to implement your application.

Common components

The common components of a generative AI application are the foundation model (FM), its interface, and optionally the machine learning (ML) platform and accelerated computing. These can be met using managed offerings available from AWS:

Amazon Bedrock (foundation model and its interface components)

Amazon Bedrock is a fully managed service that makes foundation models from leading AI startups (AI21’s Jurassic, Anthropic’s Claude, Cohere’s Command and Embedding, Stability’s SDXL models) and Amazon (Titan Text and Embeddings models) available via API, so you can choose from a wide range of FMs to find the model best suited to your use case. Amazon Bedrock provides serverless API access to a set of foundation models with three capabilities: text embedding, prompt/response, and fine-tuning (on select models).

Figure 2: Amazon Bedrock workflow

Amazon Bedrock is well-suited for application or model consumer startups who are building value-added services (prompt engineering, retrieval-augmented generation, and more) around a foundation model of their choice. Its pricing model is pay-by-use, typically in the unit of millions of tokens processed. Amazon Bedrock is generally available; however, some of the features discussed in this blog are in private preview. Learn more here.
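To make the prompt/response capability concrete, here is a minimal sketch of calling a Bedrock-hosted model from Python. It assumes the Amazon Titan Text request and response shapes and the model ID `amazon.titan-text-express-v1`; other models use different payloads, so treat those details as assumptions to verify against the model you choose. The `client` argument is expected to behave like `boto3.client("bedrock-runtime")`.

```python
import json


def build_titan_text_request(prompt, max_tokens=256, temperature=0.2):
    """Build a JSON request body in the Amazon Titan Text format (assumed here)."""
    return json.dumps({
        "inputText": prompt,
        "textGenerationConfig": {
            "maxTokenCount": max_tokens,
            "temperature": temperature,
        },
    })


def generate_text(client, prompt, model_id="amazon.titan-text-express-v1"):
    """Send a prompt to a Bedrock-hosted model and return the generated text.

    `client` is expected to behave like boto3.client("bedrock-runtime").
    """
    response = client.invoke_model(
        modelId=model_id,
        body=build_titan_text_request(prompt),
        contentType="application/json",
        accept="application/json",
    )
    # The response body is a stream; read it and parse the JSON payload.
    payload = json.loads(response["body"].read())
    # Titan Text returns a list of results; take the first completion.
    return payload["results"][0]["outputText"]
```

In an application you would create the client once, for example with `boto3.client("bedrock-runtime", region_name="us-east-1")` (the region is only an example), and reuse it across calls.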

Amazon SageMaker JumpStart (foundation model and its interface components)

AWS offers generative AI capabilities through Amazon SageMaker JumpStart: a foundation model hub containing both publicly available and proprietary models, quick start solutions, and example notebooks to deploy and fine-tune models. When you deploy one of these models, SageMaker creates a real-time inference endpoint that you can access directly using the SageMaker SDK/API. Alternatively, you can front the SageMaker foundation model endpoint with Amazon API Gateway and lightweight compute logic in an AWS Lambda function. You can also leverage some of these models for text embedding.

Figure 3: Amazon SageMaker JumpStart workflow

Both the inference endpoint and the fine-tuning training jobs run on your choice of managed ML instances (see “Accelerated Computing” in Figure 1) using SageMaker as the ML platform (see “ML Platform” in Figure 1). SageMaker JumpStart is well-suited for application or model consumer startups who want more control over their infrastructure, and who have moderate ML skills and infrastructure knowledge. Its pricing model is pay-by-use, typically in the unit of instance-hours. All the models and solutions in this offering are generally available.
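As an illustration of the API Gateway plus Lambda front end mentioned above, the following sketch shows what the Lambda logic could look like. The endpoint name is hypothetical, and the `{"inputs": ...}` payload is an assumption: the request and response schema depends on the model you deploy. `runtime_client` is expected to behave like `boto3.client("sagemaker-runtime")`.

```python
import json

ENDPOINT_NAME = "my-jumpstart-endpoint"  # hypothetical endpoint name


def make_handler(runtime_client, endpoint_name=ENDPOINT_NAME):
    """Return a Lambda-style handler bound to a SageMaker runtime client.

    `runtime_client` is expected to behave like
    boto3.client("sagemaker-runtime").
    """
    def handler(event, context=None):
        # API Gateway proxy integration delivers the request body as a JSON string.
        request = json.loads(event.get("body") or "{}")
        prompt = request.get("prompt", "")
        if not prompt:
            return {
                "statusCode": 400,
                "body": json.dumps({"error": "prompt is required"}),
            }

        # The payload schema is model-specific; {"inputs": ...} is an assumption.
        response = runtime_client.invoke_endpoint(
            EndpointName=endpoint_name,
            ContentType="application/json",
            Body=json.dumps({"inputs": prompt}),
        )
        model_output = json.loads(response["Body"].read())
        return {"statusCode": 200, "body": json.dumps({"output": model_output})}

    return handler
```

Passing the client in, rather than creating it inside the handler, keeps the function easy to test locally with a stub and lets the Lambda runtime reuse one client across invocations.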

Amazon SageMaker training and inference (ML platform)

Startups can leverage Amazon SageMaker’s training and inference features for advanced capabilities like distributed training, distributed inference, multi-model endpoints, and more. You can bring the foundation models from the model hub of your choice – whether that’s SageMaker JumpStart or Hugging Face or AWS Marketplace, or you can build your own foundation model from scratch.

Figure 4: Amazon SageMaker training and inference workflow

SageMaker is well-suited for full-stack generative AI application builders (from model providers to model consumers), or for model providers with teams who have advanced ML and data pre-processing skills. SageMaker also offers a pay-by-use pricing model, typically in the unit of instance-hours.

AWS Trainium and AWS Inferentia (accelerated computing)

In April 2023, AWS announced general availability of Amazon EC2 Trn1n Instances powered by AWS Trainium, and Amazon EC2 Inf2 Instances powered by AWS Inferentia2. You can leverage AWS purpose-built accelerators (AWS Trainium and AWS Inferentia) using SageMaker as the ML platform.

Benchmark testing for inference workloads reports that Inf2 instances deliver 52% lower cost than a comparable inference-optimized Amazon EC2 instance. I suggest keeping an eye on the fast development cycle of the AWS Neuron SDK: approximately every month, AWS adds new model architectures to its support matrix for both training and inference.

Approaches for building generative AI applications

Now, let’s discuss each of the components in Figure 1 from an implementation perspective.

The zero-shot or few-shot learning inference approach

As discussed in part 1, zero-shot or few-shot learning is the simplest approach for building a generative AI application. To build applications based on this approach, all you need are the services for the four common components (foundation model, its interface, ML platform, and compute), your custom code to generate prompts, and a front-end web/mobile app.

Figure 5: Components of the zero-shot learning approach

To learn more about selecting a foundation model through Amazon Bedrock or Amazon SageMaker JumpStart, refer to the model selection guidelines here.

The custom code can leverage developer tools like LangChain for prompt templates and generation. The LangChain community has already added support for Amazon Bedrock, Amazon API Gateway, and SageMaker endpoints. As a reminder, you can also leverage Amazon CodeWhisperer, a coding companion tool, to help improve developers’ efficiency.
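To show what the prompt-generation code amounts to, here is a small sketch in plain Python. It is a stand-in for a LangChain prompt template, not LangChain itself, and the `Input:`/`Output:` layout is just one possible convention. With no examples it produces a zero-shot prompt; with a few examples it produces a few-shot prompt.

```python
def build_prompt(instruction, query, examples=()):
    """Render a zero-shot or few-shot prompt.

    `examples` is a sequence of (input, output) pairs. When it is empty,
    the result is a zero-shot prompt containing only the instruction and query.
    """
    parts = [instruction.strip()]
    for source, target in examples:
        parts.append(f"Input: {source}\nOutput: {target}")
    # Leave the final output open so the model completes it.
    parts.append(f"Input: {query}\nOutput:")
    return "\n\n".join(parts)


zero_shot = build_prompt("Classify the sentiment.", "Great product!")
few_shot = build_prompt(
    "Classify the sentiment.",
    "Great product!",
    examples=[("I love it", "positive"), ("Not worth the price", "negative")],
)
```

The resulting string is what you would send to the foundation model endpoint as the prompt.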

Startups building a front-end web app or mobile app can easily start and scale by using AWS Amplify, and host these web apps in a fast, secure, and reliable way using AWS Amplify Hosting.

Check out this example of zero-shot learning built with SageMaker JumpStart.

The information retrieval approach

As discussed in part 1, one of the ways your startup can customize foundation models is through augmenting with an information retrieval system, most commonly known as retrieval-augmented generation (RAG). This approach involves all of the components mentioned in zero-shot and few-shot learning, as well as the text embeddings endpoint and vector database.

Figure 6: Components of the information retrieval approach

Options for the text embeddings endpoint vary depending on which AWS managed service you’ve selected:

Amazon Bedrock offers an embeddings large language model (LLM) that translates text inputs (words, phrases, or possibly large units of text) into numerical representations (known as embeddings) that contain the semantic meaning of the text.
If using SageMaker JumpStart, you can host an embeddings model like GPT-J 6B or any other LLM of your choice from the model hub. The SageMaker endpoint can be invoked by the SageMaker SDK or Boto3 to translate text inputs into embeddings.
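For the Amazon Bedrock option, an embeddings call can be sketched as follows. The model ID `amazon.titan-embed-text-v1` and the `inputText`/`embedding` field names reflect the Titan Embeddings format as I understand it; treat them as assumptions and check them against the model you select. `client` is expected to behave like `boto3.client("bedrock-runtime")`.

```python
import json


def embed_text(client, text, model_id="amazon.titan-embed-text-v1"):
    """Return the embedding vector (a list of floats) for one piece of text.

    `client` is expected to behave like boto3.client("bedrock-runtime").
    """
    response = client.invoke_model(
        modelId=model_id,
        body=json.dumps({"inputText": text}),
        contentType="application/json",
        accept="application/json",
    )
    payload = json.loads(response["body"].read())
    return payload["embedding"]


def embed_many(client, texts):
    """Embed several texts one request at a time, preserving input order."""
    return [embed_text(client, text) for text in texts]
```

The SageMaker JumpStart path follows the same pattern, with the SageMaker runtime's `invoke_endpoint` call and a payload that depends on the embeddings model you host.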

The embeddings can then be stored in a vector datastore for semantic search, using either the pgvector extension for Amazon RDS for PostgreSQL or the k-NN plugin for Amazon OpenSearch Service. Startups tend to prefer whichever service they are most comfortable using. In some cases, startups use AI-native vector databases from AWS partners or from open source. For guidance on vector datastore selection, I recommend referring to The role of vector datastores in generative AI applications.
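As a rough sketch of the pgvector option, the snippet below holds the SQL you might run against PostgreSQL, plus a helper that formats a Python list as a pgvector literal. The table name `documents`, the 1536 dimension, and the `%s` placeholder style (as used by psycopg-style drivers) are assumptions for illustration; the dimension must match your embeddings model.

```python
def to_pgvector_literal(embedding):
    """Format a sequence of numbers as a pgvector text literal, e.g. '[0.1,2.0]'."""
    return "[" + ",".join(repr(float(x)) for x in embedding) + "]"


# One-time setup: enable the extension and create a table with a vector column.
CREATE_SQL = """
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE IF NOT EXISTS documents (
    id bigserial PRIMARY KEY,
    content text NOT NULL,
    embedding vector(1536)
);
"""

INSERT_SQL = "INSERT INTO documents (content, embedding) VALUES (%s, %s::vector);"

# <=> is pgvector's cosine distance operator; a smaller distance means more similar.
SEARCH_SQL = """
SELECT id, content, embedding <=> %s::vector AS distance
FROM documents
ORDER BY distance
LIMIT %s;
"""


def search_params(query_embedding, k=5):
    """Build the parameter tuple for SEARCH_SQL."""
    return (to_pgvector_literal(query_embedding), k)
```

With a database cursor, a search would then look like `cursor.execute(SEARCH_SQL, search_params(query_embedding, k=5))`.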

Developer tools play a pivotal role in this approach too. They provide an easy plug-and-play framework, prompt templates, and wide support for integrations.

Going forward, you can also leverage agents for Amazon Bedrock, a new capability for developers that can manage API calls to your company systems.

Check out this example of using retrieval-augmented generation with foundation models in Amazon SageMaker JumpStart.

The fine-tuning or further pre-training approach

Now, let’s map the components to the AWS services needed for the last approach to implementing a generative AI application: fine-tuning or further pre-training a foundation model. This approach involves all of the components discussed in zero-shot or few-shot learning, as well as data pre-processing and model training.

Figure 7: Components of the fine-tuning or further pre-training approach

Data preparation (sometimes called preprocessing or annotation) is particularly important during fine-tuning, where you need smaller, labeled datasets. Startups can easily get started using Amazon SageMaker Data Wrangler. This service helps reduce the time it takes to aggregate and prepare tabular and image data for machine learning from weeks to minutes. You may also leverage this service’s inference pipeline feature to chain the preprocessing workflow to training or fine-tuning jobs.

If your startup needs to preprocess a huge corpus of unstructured and unlabeled datasets in your data lake on Amazon S3, you have a few options:

If you’re using Python and popular Python libraries, it is useful to leverage AWS Glue for Ray. AWS Glue uses Ray, an open source unified compute framework used to scale Python workloads.
Alternatively, Amazon EMR can help process vast amounts of data using open source tools such as Apache Spark, Apache Hive, Apache HBase, Apache Flink, Apache Hudi, and Presto.

For the model training component of this approach, Amazon Bedrock allows you to privately customize FMs with your own data. It manages your FMs at scale without you having to manage any infrastructure (this is the API way to fine-tune). Alternatively, SageMaker JumpStart provides a quick-start solution to privately fine-tune (on select models) for instruction or domain adaptation using your own data. You can modify the training script bundled with SageMaker JumpStart for your needs, or you can bring your own training scripts for open-source models and submit them as SageMaker training jobs. If you have to further pre-train the model (typically for open-source models), you can leverage SageMaker’s distributed training libraries to speed up training and efficiently utilize all of the GPUs of an ML instance.
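Whichever fine-tuning path you take, the labeled data usually has to be serialized into the format the training job expects. The sketch below converts labeled records into JSON Lines with one prompt/completion pair per line. The output key names `prompt` and `completion`, and the input field names `question` and `answer`, are assumptions for illustration; check the format required by the model you are tuning.

```python
import json


def to_jsonl(records, prompt_key="prompt", completion_key="completion"):
    """Serialize labeled records as JSON Lines, one training example per line.

    Each record is a dict with hypothetical "question" and "answer" fields.
    Records with an empty question or answer are skipped.
    """
    lines = []
    for record in records:
        prompt = record["question"].strip()
        completion = record["answer"].strip()
        if not prompt or not completion:
            continue  # skip incomplete pairs rather than train on them
        lines.append(json.dumps({prompt_key: prompt, completion_key: completion}))
    return "".join(line + "\n" for line in lines)


sample = [
    {"question": " What is Amazon S3? ", "answer": "Object storage."},
    {"question": "", "answer": "An answer without a question."},
]
training_file_contents = to_jsonl(sample)
```

The resulting text would typically be written to a file and uploaded to Amazon S3, where the fine-tuning job reads it.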

In addition, you may consider fully managed data generation, data annotation services, and model development with the Reinforcement Learning from Human Feedback (RLHF) technique using Amazon SageMaker Ground Truth Plus.

An example architecture

So, how do all of these components look when realizing a generative AI use case? While every startup has a different use case, and unique approaches to solving real-world problems, one common theme or starting point I have seen in building generative AI applications is the retrieval-augmented generation approach. After plugging in the AWS services discussed above, the architecture looks like this:

Ingestion pipeline – The domain-specific or proprietary data is preprocessed as text data. As it is created or updated, the data is either batch processed (stored in Amazon S3) or streamed (using Amazon Kinesis) through the embedding process, and stored in dense vector representation.

Figure 8: An example ingestion pipeline for a generative AI application.
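The ingestion steps can be sketched in a few lines of plain Python. This is only an illustration under simple assumptions: documents are split into overlapping word-based chunks, `embed` is any function that maps text to a vector (for example, a call to your embeddings endpoint), and `store` is an in-memory list standing in for the vector datastore.

```python
def chunk_text(text, chunk_size=200, overlap=40):
    """Split text into overlapping chunks of at most `chunk_size` words."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


def ingest(documents, embed, store, chunk_size=200, overlap=40):
    """Chunk each document, embed each chunk, and append the records to `store`.

    `documents` maps a document ID to its text.
    """
    for doc_id, text in documents.items():
        for index, chunk in enumerate(chunk_text(text, chunk_size, overlap)):
            store.append({
                "doc_id": doc_id,
                "chunk": index,
                "text": chunk,
                "embedding": embed(chunk),
            })
    return store
```

Storing the chunk text next to its embedding is what later allows the retrieval step to return readable passages.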

Retrieval pipeline – When a user queries the proprietary data stored in vector representation, the pipeline retrieves the related documents using k-nearest neighbor (k-NN) or semantic search and returns them as clear text. The output serves as rich and dense context for the prompt.

Figure 9: An example retrieval pipeline for a generative AI application.
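To illustrate what the k-NN search does conceptually, here is a brute-force sketch in plain Python that ranks stored records by cosine similarity to the query embedding. A managed vector datastore performs this search for you, usually with approximate indexes; the record layout (dicts with `text` and `embedding` keys) is an assumption for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_embedding, store, k=3):
    """Return the k stored records most similar to the query embedding."""
    ranked = sorted(
        store,
        key=lambda record: cosine_similarity(query_embedding, record["embedding"]),
        reverse=True,
    )
    return ranked[:k]


store = [
    {"text": "pricing overview", "embedding": [1.0, 0.0]},
    {"text": "security guide", "embedding": [0.0, 1.0]},
    {"text": "pricing and security", "embedding": [1.0, 1.0]},
]
top_two = retrieve([1.0, 0.1], store, k=2)
```

The `text` fields of the returned records are what gets passed on as context for the prompt.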

Summarization generation pipeline – The context is added to the prompt along with the original user query to get an insight or a summary from the retrieved documents.

Figure 10: An example summarization generation pipeline for a generative AI application.
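Combining the retrieved context with the user query can be sketched as a simple prompt-assembly function. The instruction wording and the character budget are assumptions for illustration; in practice you would tune both for your model and its context window.

```python
def build_rag_prompt(question, retrieved_chunks, max_context_chars=4000):
    """Assemble a prompt from retrieved text chunks and the user's question.

    Chunks are added in order until the character budget would be exceeded.
    """
    context_parts = []
    used = 0
    for number, chunk in enumerate(retrieved_chunks, start=1):
        entry = f"[{number}] {chunk}"
        if used + len(entry) > max_context_chars:
            break  # stop before exceeding the context budget
        context_parts.append(entry)
        used += len(entry)
    context = "\n".join(context_parts)
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

The returned string is then sent to the foundation model in the same way as any other prompt.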

All of these layers can be built with a few lines of code by using developer tools like LangChain.

Conclusion

This is one way to build an end-to-end generative AI application using AWS services. The AWS services you select will vary based on the use case or customization approach you take. Stay up to date on the latest AWS releases, solutions, and blogs in generative AI by bookmarking this link.

Let’s go build generative AI applications on AWS! Kickstart your generative AI journey with AWS Activate, a free program specifically designed for startups and early stage entrepreneurs that offers the resources needed to get started on AWS.

AWS Startups Blog