What are generative AI models?
Generative artificial intelligence models can create original and meaningful text, images, audio, and video content based on natural language input from users. Organizations use them for everything from powering chatbots to creating design templates and solving complex problems in biology. Thousands of proprietary and open-source AI models exist, and new models and improved versions are released daily.
Generative AI models have strengths and limitations. Depending on the complexity, performance, privacy, and cost requirements of your use case, some models may be a better choice than others. This guide explores the factors to consider and best practices for selecting a generative AI model.
Despite their flexibility and versatility, generative AI models are not a catch-all solution for every use case. AI teams must carefully select and evaluate the model that best balances cost and performance. Evaluating models is complex. Popular benchmarks like HELM and the Hugging Face Open LLM Leaderboard provide only a general view of how a particular AI model performs on common natural language tasks. AI teams must adopt different strategies to evaluate model output on their own data and then select the model that best fits their requirements.
How are generative AI models evaluated for different use cases?
Here are some factors to consider when choosing an appropriate AI model for your use case.
Modality
Modality refers to the data type the model processes: text, images (vision), or embeddings. Some models are unimodal and can efficiently process a single data type. Others are multimodal and can integrate multiple data types but may be better suited to one type than others. For example, models like Claude, Llama 3.1, or Titan Text G1 are suitable for text-based tasks, while Stable Diffusion XL and Titan Image Generator v2 are better suited to vision tasks. Similarly, the Titan Multimodal Embeddings G1 model is well suited for converting an input image or text into an embedding that captures the semantic meaning of both images and text in the same semantic space.
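As a rough illustration, the sketch below calls a multimodal embedding model through Amazon Bedrock to embed an image and a caption into the same vector space. The model ID and request/response field names are assumptions about the Titan Multimodal Embeddings G1 schema, so verify them against the current Bedrock API reference before relying on this code.

```python
import base64
import json

import boto3

# Bedrock runtime client; the region is an assumption for this sketch.
bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

def embed_image_and_text(image_path: str, caption: str) -> list[float]:
    """Return a single embedding combining an image and a text caption.

    The model ID and body fields below are assumptions; check the Titan
    Multimodal Embeddings G1 documentation for the exact schema.
    """
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    body = json.dumps({"inputText": caption, "inputImage": image_b64})
    response = bedrock.invoke_model(
        modelId="amazon.titan-embed-image-v1",  # assumed model ID
        body=body,
    )
    payload = json.loads(response["body"].read())
    return payload["embedding"]
```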
Model size
Model size is the number of parameters, or internal configuration variables, in the model. It can range from several million to hundreds of billions, with many widely used models having between 10 and 100 billion parameters. Model size strongly influences the model's capability to learn from data. Models with more parameters generally perform better because they can capture more complex patterns, but they are also more expensive to customize and operate.
Inference latency
Inference latency is generally a concern in real-time scenarios where your AI application users expect immediate responses. It is the total time a model takes to process an input and return an output. Generative AI models with complex architectures may have slower inference speeds than smaller models. However, inference latency varies depending on both your expected prompts and the model's performance. An increased number of tokens (the word and sub-word chunks of text a model processes) in end-user input may also increase latency.
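One practical way to compare candidates is to time the same prompt against each model. The sketch below uses the Amazon Bedrock Converse API through boto3; the model IDs and prompt are placeholders, and the response structure should be confirmed against the current API reference.

```python
import time

import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")  # region is an assumption

# Placeholder model IDs; substitute your shortlisted candidates.
CANDIDATE_MODELS = ["model-id-a", "model-id-b"]
PROMPT = "Summarize our refund policy in two sentences."

for model_id in CANDIDATE_MODELS:
    start = time.perf_counter()
    response = bedrock.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": PROMPT}]}],
    )
    elapsed = time.perf_counter() - start
    text = response["output"]["message"]["content"][0]["text"]
    print(f"{model_id}: {elapsed:.2f}s, {len(text)} characters returned")
```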
Context window
The generative AI model's context window is the number of tokens it can "remember" for context at any one time. A model with a larger context window retains more of the previous conversation and provides more relevant responses. Thus, larger context windows are preferred for complex tasks such as summarizing long documents or powering multi-turn conversations.
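Before committing to a model, you can sanity-check whether your typical inputs fit its context window. The sketch below uses a rough four-characters-per-token heuristic instead of a real tokenizer, so treat the counts as estimates, and note that the window sizes shown are illustrative placeholders rather than official limits.

```python
# Rough check of whether a long document fits a model's context window.
# The 4-characters-per-token ratio is a common rule of thumb, not an exact count;
# use the model's own tokenizer for precise numbers.

CONTEXT_WINDOWS = {           # illustrative placeholder values, not official limits
    "model-small": 8_000,
    "model-large": 200_000,
}

def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)

def fits_context(text: str, model_id: str, reserved_for_output: int = 1_000) -> bool:
    """Return True if the prompt plus reserved output tokens fit the window."""
    return estimate_tokens(text) + reserved_for_output <= CONTEXT_WINDOWS[model_id]

document = "..." * 10_000  # stand-in for a long document to summarize
for model_id in CONTEXT_WINDOWS:
    print(model_id, "fits" if fits_context(document, model_id) else "does not fit")
```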
Pricing considerations
Model running costs include usage-based fees for proprietary models as well as the computation and memory costs of hosting models yourself. Operational expenses can vary from model to model based on workloads. Weighing costs against benefits ensures you get the best value for your investment. For example, running Claude 2 or Command R+ incurs usage-based fees since they are proprietary models, whereas self-hosting a smaller open-source model like Llama 2 7B mainly incurs infrastructure costs, which can be lower. However, if proprietary models provide significantly better accuracy or efficiency for your task, their added cost might be justified.
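A quick back-of-the-envelope estimate helps make this trade-off concrete. The per-token prices below are hypothetical placeholders, not actual list prices; substitute the current pricing of the models you are comparing.

```python
# Back-of-the-envelope monthly cost estimate for a usage-priced model.
# All prices below are hypothetical placeholders, not published rates.

PRICE_PER_1K_INPUT_TOKENS = 0.003   # USD, hypothetical
PRICE_PER_1K_OUTPUT_TOKENS = 0.015  # USD, hypothetical

requests_per_day = 50_000
avg_input_tokens = 800
avg_output_tokens = 300

daily_cost = requests_per_day * (
    avg_input_tokens / 1_000 * PRICE_PER_1K_INPUT_TOKENS
    + avg_output_tokens / 1_000 * PRICE_PER_1K_OUTPUT_TOKENS
)
print(f"Estimated usage cost: ${daily_cost:,.2f}/day, ${daily_cost * 30:,.2f}/month")
# Compare this figure with the fixed cost of hosting an open-source model
# (instance hours, storage, and operations) at the same request volume.
```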
Quality of response
You can evaluate the quality of an AI model's responses using several metrics, such as:
- Accuracy—how often the model's responses are correct.
- Relevance—how appropriate the responses are to the given input.
- Robustness—how well the model handles intentionally misleading inputs designed to confuse it.
- Toxicity—the percentage of inappropriate content or biases in the model's outputs.
The metrics are typically measured against a pre-configured baseline. It is a best practice to evaluate the response quality of a few different models on the same input dataset and select the one that provides the highest quality.
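As a minimal sketch of that comparison, assuming you have already collected each model's responses for the same inputs, you could score them against reference answers as shown below. Exact matching is only one of the metrics listed above; a real evaluation would add relevance, robustness, and toxicity checks.

```python
# Compare response quality of several models over the same input dataset.
# "outputs" maps a model name to its responses, aligned with the references.
references = ["Paris", "42", "Yes"]
outputs = {
    "model-a": ["Paris", "41", "Yes"],
    "model-b": ["paris", "42", "No"],
}

def accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions that exactly match the reference (case-insensitive)."""
    matches = sum(p.strip().lower() == r.strip().lower() for p, r in zip(predictions, references))
    return matches / len(references)

for model, preds in outputs.items():
    print(f"{model}: accuracy = {accuracy(preds, references):.2f}")
```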
What is the generative AI model selection process?
Generative AI model selection first requires you to determine the specific requirements of your AI application. Ensure you understand user expectations, data processing requirements, deployment considerations, and other subtleties within your business and industry. Then, you can progressively eliminate AI models by conducting quality tests until you find the model that best fits your requirements.
Step 1 - Shortlist initial model selection
Start the process by shortlisting around 20 models, from the thousands available, that fit your requirements. Deciding between open-source and proprietary models gets you halfway there. Once you've made that decision, you can further shortlist by assessing models against key criteria such as modality, model size, and context window, as described in the previous section.
Open-source vs. proprietary generative AI models
Open-source models offer flexibility and allow teams to fine-tune or fully retrain the model on proprietary data. This can be particularly valuable in specialized industries where general-purpose models don’t perform well on niche use cases. For instance, a large insurance company may prefer to train an open-source model on custom data instead of using proprietary models aimed at the financial sector that don't quite meet their specific requirements.
However, open-source models require additional considerations. They may introduce security and legal risks, requiring organizations to enforce their own compliance measures and thoroughly vet licensing terms. Proprietary models, on the other hand, typically offer built-in security features, indemnification for training data and outputs, and compliance assurances—reducing the operational overhead for businesses prioritizing risk mitigation.
Step 2 - Inspect output and narrow the list further
In this step, your goal is to identify the top 3 generative AI models best suited for your use case. First, identify a subset of test prompts that match your use case. Then, visually inspect each model's output for those prompts. Look for detailed outputs that best match the intent of your input. Select the top 3 models that generate the most relevant, detailed, and accurate outputs.
Amazon SageMaker Clarify is well suited for this stage. It automatically evaluates foundation models (FMs) for your generative AI use case using metrics such as accuracy, robustness, and toxicity to support your responsible AI initiative.
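A lightweight way to do this inspection yourself is to run your test prompts through each shortlisted model and read the outputs side by side. The sketch below uses the Amazon Bedrock Converse API via boto3; the model IDs and prompts are placeholders, and the response shape should be confirmed against the current API reference.

```python
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")  # region is an assumption

SHORTLIST = ["model-id-a", "model-id-b", "model-id-c"]  # placeholder model IDs
TEST_PROMPTS = [
    "Explain our premium support tier to a non-technical customer.",
    "List three risks of the attached contract clause.",
]

for prompt in TEST_PROMPTS:
    print(f"\n=== Prompt: {prompt}")
    for model_id in SHORTLIST:
        response = bedrock.converse(
            modelId=model_id,
            messages=[{"role": "user", "content": [{"text": prompt}]}],
            inferenceConfig={"maxTokens": 300},
        )
        text = response["output"]["message"]["content"][0]["text"]
        print(f"\n--- {model_id}\n{text}")
```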
Step 3 - Use case-based benchmarking
Now, you can evaluate the top-selected AI models in more detail based on predefined prompts and outputs for your specific test data set. The key factor here is to have a comprehensive test data set that covers all aspects of your use case with several variations. You should also have corresponding ideal outputs so you can statistically assess which model's output is closest to the ideal.
Amazon Bedrock provides Model Evaluation tools to evaluate, compare, and select the AI model for your use case.
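Whatever tooling you use, the test set itself can be as simple as a JSONL file pairing each prompt with an ideal output. The field names in the sketch below are illustrative only; they are not a required schema for Bedrock or any other evaluation tool.

```python
import json

# Illustrative test set pairing prompts with ideal (reference) outputs.
# Field names are examples only, not a required schema for any specific tool.
test_cases = [
    {
        "prompt": "Summarize the attached claims policy for a new customer.",
        "reference": "New customers can file claims online within 30 days...",
        "category": "summarization",
    },
    {
        "prompt": "Draft a polite denial letter for an out-of-coverage claim.",
        "reference": "Dear customer, after reviewing your claim...",
        "category": "drafting",
    },
]

with open("test_set.jsonl", "w", encoding="utf-8") as f:
    for case in test_cases:
        f.write(json.dumps(case) + "\n")
```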
There are three evaluation approaches you can take.
Programmatic
Evaluate model outputs using traditional natural language algorithms and metrics such as BERTScore and F1, along with exact-match techniques. Amazon Bedrock lets you achieve this using built-in prompt datasets, or you can bring your own.
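For instance, a simple token-overlap F1 score (in the spirit of SQuAD-style matching, not Bedrock's built-in implementation) can be computed as follows, assuming the model predictions and reference answers have already been collected.

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a model output and a reference answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("claims must be filed within 30 days", "file claims within 30 days"))
```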
Human in the loop
Get human evaluators — your team members, a sample set of end users, or professional AI evaluators — to assess the output of all three models based on pre-determined model metrics. They can manually compare outputs with ideal outputs, or if the use case is too broad, they can assess and mark output based on their best judgment.
With Amazon Bedrock, you can evaluate model outputs with your own workforce or have AWS manage the evaluation. Responses to custom prompt datasets can be judged on built-in metrics or metrics such as relevance, style, and alignment to brand voice.
Another AI model as an evaluator
In this approach, another AI model evaluates the output of your shortlisted models in an unbiased manner. This works best for use cases where outputs are well-defined and their similarity to the ideal output is statistically measurable. Amazon Bedrock lets you evaluate model outputs using another AI model in LLM-as-a-judge mode. You can use your custom prompt datasets with metrics such as correctness and completeness, as well as responsible AI metrics such as answer refusal and harmfulness.
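A minimal sketch of this pattern, assuming a placeholder judge model ID and a very simple grading prompt, might look like the following; production judge prompts and score parsing would be considerably more robust.

```python
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")  # region is an assumption

JUDGE_MODEL = "judge-model-id"  # placeholder for the evaluator model

def judge_score(prompt: str, candidate: str, reference: str) -> str:
    """Ask the judge model to rate a candidate answer against the reference (1-5)."""
    grading_prompt = (
        "You are grading an AI assistant's answer.\n"
        f"Question: {prompt}\n"
        f"Reference answer: {reference}\n"
        f"Candidate answer: {candidate}\n"
        "Rate correctness and completeness on a scale of 1-5 and reply with only the number."
    )
    response = bedrock.converse(
        modelId=JUDGE_MODEL,
        messages=[{"role": "user", "content": [{"text": grading_prompt}]}],
    )
    return response["output"]["message"]["content"][0]["text"].strip()

print(judge_score("What is our refund window?", "60 days", "Refunds are accepted within 30 days."))
```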
Step 4 - Final selection
Use the evaluation data along with cost and performance analysis to choose the final model. With Amazon Bedrock, you can use the compare feature in evaluations to see the results of any changes you made to your prompts and the models being evaluated. View all your analytics in one place and select the model that provides the best balance of performance, cost, and associated risks while using resources efficiently.
Choosing the right generative AI model for your use case requires a structured approach that balances technical capabilities, business needs, and operational constraints. The key is to align your decision with your use case's specific requirements. Carefully evaluate models based on factors such as modality, size, data processing capabilities, and deployment considerations. Ultimately, the right model enhances efficiency and innovation and provides a scalable foundation for future AI-driven advancements in your organization.