AWS Public Sector Blog
Unlocking the power of generative AI: The advantages of a flexible architecture for foundation model fine-tuning
A large language model (LLM) is a general-purpose language model that can understand and respond to human input in a conversational manner. It is trained on a massive dataset of text from the internet and can generate humanlike responses to a wide range of topics and questions. There are two primary approaches to adapting a model to a specific task or domain: in-context learning and fine-tuning the foundation model (FM). In-context learning is the ability of an artificial intelligence (AI) model to generate responses or make predictions based on the specific context provided to it, usually in the form of preceding text or a prompt. It adapts the FM to a specific task or subdomain without retraining the model.
With fine-tuning, the FM is further trained on a task-specific dataset, using additional data and computational resources. The model’s weights are updated during training to optimize its performance on the task. Fine-tuning is a widely used technique in the field of generative AI because it allows researchers and practitioners to adapt powerful, pre-trained models to a wide range of applications and use cases. Generative AI is an area of active research that frequently produces new findings, leading to increased capabilities in new models as well as new methods to run and maintain existing models more efficiently.
A flexible architecture is therefore a crucial factor in unlocking the full potential of generative AI solutions: the same production workload becomes cheaper and faster to run over time, and it can be updated to achieve greater accuracy and new capabilities. In this post, we cover an Amazon Web Services (AWS) Cloud infrastructure with a modular architecture that enables you to explore and take advantage of different open source FMs in a flexible way. This solution provides the following benefits, along with faster time-to-market and a shorter development cycle:
- Prompt catalog – A prompt catalog acts as a centralized repository of effective prompts, saving the time and effort of crafting prompts from scratch. By providing tested and refined prompt templates, a catalog minimizes trial and error, enabling faster and more consistent results. A well-structured catalog also fosters collaboration and knowledge sharing, helping users discover new approaches and optimize their LLM prompts, and it standardizes the evaluation of specific tasks such as retrieval or text generation. Parameter Store, a capability of AWS Systems Manager, lets prompts be tested, cataloged, and applied systematically, so that prompts are stored securely and hierarchically and remain accessible throughout the entire generative AI solution stack (see the Parameter Store sketch after this list).
- Data pipeline – A data pipeline that builds fine-tuning datasets for LLMs gives you a repeatable and reproducible process that keeps fine-tuned models current with your organization’s evolving domain knowledge. Each data pipeline should have a defined task for the output dataset (such as Q&A, code generation, or entity extraction) and a defined data type and format for the dataset it generates. The raw data should represent the collection of knowledge in which you want the LLM to gain domain proficiency. The processing steps can involve data cleaning, normalization, reformatting, chunking, and tokenization; tools such as Amazon SageMaker Data Wrangler and AWS Glue are well suited for building, connecting, and orchestrating these modular steps into different permutations of workflows, and both offer scalable, configurable data processing pipelines with direct integration with SageMaker. For a highly specialized data processing workflow, a SageMaker Processing job is one option that can scale your custom scripts for a batch workload (a minimal chunking step is sketched after this list). Each resulting dataset is also versioned, so that the resulting fine-tuned models have a traceable data source.
- Model evaluation – Repeatable model evaluation is a key step in achieving rapid iteration and experimentation. Amazon Bedrock model evaluation can quickly assess fine-tuned model performance against public benchmark datasets. AWS has also released an open source library called FMEval that can apply both public benchmarks and private datasets for rapid model evaluation (a simplified evaluation loop is sketched after this list). Easy and immediate evaluation feedback helps determine whether a fine-tuned model is production-ready.
- Containers and serverless – Adopt an extensible platform so you can easily swap AI services as new innovations emerge. By decoupling the model components from the supporting data and operations stacks, a flexible design allows you to seamlessly integrate newer, more efficient chips and optimized models. This translates to faster processing speeds, reduced latency, and lower operational costs without needing to completely overhaul your existing system. As generative AI solutions evolve to handle more complex workflows that might contain multiple steps and tasks, AWS Step Functions can orchestrate and manage the workflow.
- Containers offer immense value in LLM ops by ensuring reproducibility, scalability, and streamlined deployment. They package a model’s code, dependencies, and runtime environment into a self-contained unit. This eliminates inconsistencies across development, testing, and production environments, guaranteeing identical behavior. Containers can easily scale up or down based on demand, which optimizes resource usage. Furthermore, container orchestration tools like Kubernetes simplify the process of deploying and managing complex LLM applications across diverse cloud or on-premises infrastructures. Amazon Elastic Container Registry (Amazon ECR) can catalog and host the containers for training, fine-tuning, and serving LLM models in SageMaker, as well as the containers that run on Amazon Elastic Kubernetes Service (Amazon EKS) and in AWS Lambda applications.
- Serverless architecture plays a pivotal role in streamlining LLM ops by offering inherent scalability, cost-efficiency, and reduced operational overhead. With serverless functions, developers can focus on core LLM development, fine-tuning, and deployment rather than managing underlying infrastructure. This leads to faster model iterations and the ability to handle unpredictable traffic surges without manual provisioning. Serverless also promotes cost-effectiveness by charging only for the resources used during execution, eliminating the expense of idle servers. Lambda provides elastic functions that scale with the workload and stay modular, so that multiple tasks can share the same architecture (see the Lambda handler sketch after this list). Amazon EKS provides serverless support for the heavy workloads of model pre-training and fine-tuning, and AWS Fargate is a great fit for running the other components of the solution, such as batch data processing or a front-end application.
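To make the prompt catalog concrete, the following is a minimal sketch that stores and retrieves a versioned prompt template in Parameter Store using the AWS SDK for Python (Boto3). The parameter path and template text are hypothetical placeholders.

```python
import boto3

ssm = boto3.client("ssm")

# Store a prompt template under a hierarchical (hypothetical) path.
# Parameter Store keeps a version history automatically on each overwrite.
ssm.put_parameter(
    Name="/genai/prompt-catalog/qa-v1",
    Description="Q&A prompt template used for retrieval evaluation",
    Value=(
        "Answer the question using only the context below.\n\n"
        "Context: {context}\n\nQuestion: {question}"
    ),
    Type="String",
    Overwrite=True,
)

# Retrieve the cataloged prompt at inference or evaluation time.
template = ssm.get_parameter(Name="/genai/prompt-catalog/qa-v1")["Parameter"]["Value"]
prompt = template.format(context="...", question="What is fine-tuning?")
```

Because Parameter Store versions each overwrite, a pipeline can pin a specific prompt version (for example, Name="/genai/prompt-catalog/qa-v1:2") so that evaluation runs stay reproducible.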
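As a sketch of one modular step in the data pipeline, the following Python code cleans raw documents and splits them into overlapping chunks ready for tokenization. The function names, chunk sizes, and record schema are illustrative assumptions; the same logic could run in a SageMaker Processing job, an AWS Glue job, or a Lambda function.

```python
import json
import re
from typing import Iterator


def clean(text: str) -> str:
    """Normalize whitespace and strip control characters."""
    text = re.sub(r"[\x00-\x1f]+", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def chunk(text: str, max_words: int = 512, overlap: int = 64) -> Iterator[str]:
    """Split cleaned text into overlapping word-window chunks."""
    words = text.split()
    step = max_words - overlap
    for start in range(0, max(len(words) - overlap, 1), step):
        yield " ".join(words[start:start + max_words])


def build_records(doc_id: str, raw_text: str) -> list[dict]:
    """Emit JSONL-ready records for a versioned fine-tuning dataset."""
    cleaned = clean(raw_text)
    return [
        {"doc_id": doc_id, "chunk_id": i, "text": c}
        for i, c in enumerate(chunk(cleaned))
    ]


if __name__ == "__main__":
    records = build_records("doc-001", "Example raw document text ... " * 200)
    print(json.dumps(records[0]))
```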
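The evaluation loop itself can be sketched without any particular library: score a file of model outputs against a versioned benchmark and report task metrics. This is a simplified stand-in for what FMEval and Amazon Bedrock model evaluation automate, not their actual APIs, and the file names and JSONL schemas below are assumptions.

```python
import json


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized prediction matches the reference exactly, else 0.0."""
    return float(prediction.strip().lower() == reference.strip().lower())


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1, a common proxy metric for Q&A tasks."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    common = sum(min(pred.count(t), ref.count(t)) for t in set(pred) & set(ref))
    if not common:
        return 0.0
    precision, recall = common / len(pred), common / len(ref)
    return 2 * precision * recall / (precision + recall)


def evaluate(predictions_path: str, benchmark_path: str) -> dict:
    """Score JSONL predictions ({"id", "prediction"}) against a JSONL benchmark ({"id", "reference"})."""
    with open(benchmark_path) as f:
        references = {r["id"]: r["reference"] for r in map(json.loads, f)}
    em, f1 = [], []
    with open(predictions_path) as f:
        for line in f:
            record = json.loads(line)
            reference = references[record["id"]]
            em.append(exact_match(record["prediction"], reference))
            f1.append(token_f1(record["prediction"], reference))
    n = max(len(em), 1)
    return {"exact_match": sum(em) / n, "f1": sum(f1) / n}


if __name__ == "__main__":
    print(evaluate("predictions.jsonl", "benchmark_v3.jsonl"))
```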
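Tying these pieces together, a minimal Lambda handler might look like the sketch below: it loads a cataloged prompt from Parameter Store, fills in the request context, and invokes a fine-tuned model hosted on a SageMaker endpoint. The environment variable defaults and the request and response payload shapes are assumptions that depend on your serving container.

```python
import json
import os

import boto3

ssm = boto3.client("ssm")
runtime = boto3.client("sagemaker-runtime")

# Hypothetical names, typically injected through Lambda environment variables.
PROMPT_PARAM = os.environ.get("PROMPT_PARAM", "/genai/prompt-catalog/qa-v1")
ENDPOINT_NAME = os.environ.get("ENDPOINT_NAME", "llm-finetuned-endpoint")


def handler(event, context):
    """Format a cataloged prompt and call the fine-tuned LLM endpoint."""
    template = ssm.get_parameter(Name=PROMPT_PARAM)["Parameter"]["Value"]
    prompt = template.format(
        context=event.get("context", ""),
        question=event["question"],
    )

    # Payload shape depends on the serving container (here, a text-generation-style schema).
    response = runtime.invoke_endpoint(
        EndpointName=ENDPOINT_NAME,
        ContentType="application/json",
        Body=json.dumps({"inputs": prompt, "parameters": {"max_new_tokens": 256}}),
    )
    output = json.loads(response["Body"].read())

    return {"statusCode": 200, "body": json.dumps(output)}
```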
Flexible architecture for FM fine-tuning
The following diagrams illustrate a modular cloud architecture based on the AWS services mentioned above, bringing flexibility to generative AI solution development and deployment.
In the data preparation step, this modular architecture allows annotation of data from various sources and formats, using either human feedback or automated extraction pipelines. The architecture also gives you the choice and flexibility to inject custom logic into data preprocessing using serverless Lambda functions. Alternatively, AWS Glue can run Spark jobs to import, prepare, transform, featurize, and analyze data for machine learning (ML) model building. You can also use Data Wrangler to build an end-to-end data processing pipeline interactively. The reproducible nature of this architecture gives you the flexibility to choose the tool that matches your workload, team capabilities, and raw data formats. The datasets generated by this modular architecture are versioned, engineered feature sets ready for ML tasks. The feature store serves as a pipeline checkpoint that reduces compute cost by eliminating the need to reprocess training or validation datasets from raw sources each time a new model is created.
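For the feature store checkpoint, a minimal sketch with the Boto3 SageMaker Feature Store runtime client is shown below. It assumes a feature group has already been created with the (hypothetical) name and feature definitions used here; each engineered record is written once and reused by later training runs instead of being reprocessed from raw sources.

```python
import time

import boto3

featurestore = boto3.client("sagemaker-featurestore-runtime")

FEATURE_GROUP = "llm-finetune-chunks"  # hypothetical, pre-created feature group


def checkpoint_record(doc_id: str, chunk_id: int, text: str, dataset_version: str) -> None:
    """Persist one engineered chunk so later training runs can reuse it without reprocessing."""
    featurestore.put_record(
        FeatureGroupName=FEATURE_GROUP,
        Record=[
            {"FeatureName": "record_id", "ValueAsString": f"{doc_id}-{chunk_id}"},
            {"FeatureName": "text", "ValueAsString": text},
            {"FeatureName": "dataset_version", "ValueAsString": dataset_version},
            {"FeatureName": "event_time", "ValueAsString": str(time.time())},
        ],
    )
```

The offline copy of the feature group in Amazon S3 can then be queried, for example with Amazon Athena, to assemble versioned training and validation datasets.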
The modular architecture empowers the development team during the build phase with a streamlined workflow. The team can use collections of prompts to evaluate pre-trained models for the generative AI solution. Additionally, this modularity facilitates the creation of standardized containers for both training and serving the LLM. This, in turn, ensures consistency in LLM parameters throughout the entire development and deployment process.
The modular training design accelerates experimentation during model fine-tuning, allowing researchers to easily try out different datasets, fine-tuning parameters, and evaluation criteria. This flexibility helps find the optimal model configuration for the LLM’s intended tasks. Registering the model makes it available to be governed and observed using tools such as Amazon SageMaker Clarify, SageMaker Model Cards, and SageMaker Model Dashboard.
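As an illustration of this experimentation loop, the sketch below launches a fine-tuning job with the SageMaker Hugging Face estimator and registers the resulting model in the SageMaker Model Registry, where it can be governed through Model Cards and the Model Dashboard. The training script, S3 paths, hyperparameters, instance type, framework versions, and model package group name are all assumptions to adapt to your environment.

```python
import sagemaker
from sagemaker.huggingface import HuggingFace

role = sagemaker.get_execution_role()

# Each experiment varies only the dataset version, hyperparameters, or base model.
estimator = HuggingFace(
    entry_point="train.py",          # your fine-tuning script (for example, PEFT/LoRA)
    source_dir="scripts",
    instance_type="ml.g5.2xlarge",
    instance_count=1,
    role=role,
    transformers_version="4.28",     # pick a framework combination supported in your Region
    pytorch_version="2.0",
    py_version="py310",
    hyperparameters={"epochs": 3, "learning_rate": 2e-4, "dataset_version": "v3"},
)

estimator.fit({
    "train": "s3://my-bucket/datasets/v3/train/",
    "validation": "s3://my-bucket/datasets/v3/validation/",
})

# Register the candidate so it can be reviewed, approved, and deployed from the registry.
model = estimator.create_model()
model.register(
    content_types=["application/json"],
    response_types=["application/json"],
    inference_instances=["ml.g5.2xlarge"],
    transform_instances=["ml.g5.2xlarge"],
    model_package_group_name="llm-finetune-candidates",
    approval_status="PendingManualApproval",
)
```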
The diagram in Figure 4 illustrates how the three modules work together.
This flexible architecture for FM fine-tuning brings you the following benefits:
- Improved performance and accuracy – With a flexible architecture, you can fine-tune the generative model with domain-specific data and adjust its parameters to enhance its performance and accuracy. This targeted fine-tuning ensures that the model’s outputs are more aligned with the intended use case, leading to higher-quality results and better user experiences.
- Adaptability and customization – A flexible architecture in generative AI allows for easy fine-tuning and customization of the model to meet the unique requirements of different applications, such as generating personalized content, adapting to industry-specific language patterns, or tailoring the output to specific user preferences. It provides the agility to adapt the model to the task at hand and lets you deploy new technologies and services more quickly without having to completely revamp your cloud architecture framework. This makes you more agile in responding to changing market demands and customer needs, so you can quickly test new services and solutions.
- Ready for future trends – For generative models, the emergence of mixture-of-experts architectures, 1-bit model weights, and improved chipsets means that more capable models will likely run faster with higher accuracy, efficiency, and throughput in the future. A modular architecture allows for quick experimentation with each module to evaluate the performance of new features while holding all other variables constant. It also encourages strong versioning and reproducibility, improving the transparency of both the development and production processes.
- Faster iteration and experimentation – The modular nature of a flexible architecture enables faster iteration and experimentation during the fine-tuning process. You can easily swap out or modify individual components of the model, test different configurations, and quickly assess the impact on the model’s performance. This iterative approach accelerates the development cycle and allows for rapid improvements, ultimately leading to more robust and effective generative AI solutions. It also ensures that the model can be fine-tuned easily to keep up with the rapidly changing landscape, allowing it to adapt to emerging trends, incorporate new data sources, and expand its functionality over time.
- Enhanced transferability and cross-domain applications – A flexible architecture in generative AI facilitates the transfer of learned knowledge and capabilities across different domains. By using transfer learning techniques, the fine-tuned model can be applied to a wide range of use cases, from content generation and language modeling to image synthesis and beyond. This cross-domain transferability maximizes the value and versatility of the generative AI solution.
Adopting a modular architecture during fine-tuning has a positive impact on AI model portfolio lifecycle management. Organizations benefit from increased scalability, maintainability, reusability, collaboration, reliability, and deployment agility, all of which are essential for effectively managing the complex lifecycle of AI models in a production environment.
Conclusion
Generative AI is evolving at an accelerated rate, with new FMs and revisions of existing FMs emerging every week. Enterprises are eager to select the best generative AI configuration and to evaluate the different options based on factors including accuracy and reliability, explanation and source attribution, integration requirements, response time, deployment speed, management effort, and cost. With the benefits of the AWS modular architecture described in this post, you can optimize your generative AI solutions using the best fit-for-purpose FMs fine-tuned for your business use case, while promoting reuse across different solutions within your organization.
For cloud-based workloads with strict compliance requirements, this modular AWS Cloud architecture offers a range of service options that can satisfy rigorous standards, such as those in the AWS GovCloud (US) Regions or in IL4, IL5, and IL6 networks. This empowers public sector organizations to harness the power of generative AI with flexibility and agility, while ensuring the required levels of security and compliance. Refer to Services in AWS GovCloud (US) Regions for the complete list of AWS services currently available in AWS GovCloud (US) to build your flexible generative AI solution in environments with higher security and compliance requirements.