AWS Public Sector Blog

Going beyond vibes: Evaluating your Amazon Bedrock workloads for production


As I talk to organizations about their use of generative AI, I ask about their decision criteria for selecting a specific foundation model (FM) within Amazon Bedrock. In many cases, the organization has been testing with different model providers, but usually hasn’t made the decision based on anything quantitative. In other cases, a model might be selected based on the provider’s reputation, an organization’s success with a previous model version, or other nonquantitative criteria.

Additionally, when we talk about the organization’s plan for getting the workload to production, I find that many organizations are stuck in the testing of their generative AI solution. Many don’t have good criteria for moving out of testing—quite frankly, they’re testing based on vibes. In other words, I’ve seen many organizations validate their generative AI–based workloads based on their emotional reaction to an FM’s output instead of real testing criteria. In this post, I show you how you can move away from vibe testing and effectively evaluate your Amazon Bedrock workloads for production.

Moving away from vibe testing is important so the workload doesn’t get stuck in a never-ending development cycle. But it’s important for another practical reason—FMs that are state of the art today will likely be replaced with faster, cheaper, and better models tomorrow. Organizations that can’t quantitatively test how their workload performs with new models will be stuck with older models. These organizations will be unable to reduce their generative AI inference costs, they’ll be unable to take advantage of advances in generative AI prediction power, and they’ll find their workloads going back to the never-ending development cycle while more vibe testing is done to validate newer FMs.

Model evaluation jobs

Amazon Bedrock has introduced Amazon Bedrock model evaluation jobs, which you can use to automatically evaluate the performance of a specific model on your prompts. The evaluation job can compare the result of the model against your reference response. This gives you a quantitative indicator of how close the generated model responses are to your ideal standard. Using this quantitative indicator, you can try different models and change your prompts to see how the results change.

The first step in all of this is to build a ground truth dataset. This dataset is in a .jsonl format that contains a prompt, a referenceResponse, and an optional category. The prompt should be the prompt you want sent to the FM, the referenceResponse should be the ground truth response against which the FM is evaluated, and category is an optional grouping you can use.

I’ve created a custom prompt dataset for model evaluation. This dataset contains prompts for my fictitious nonprofit organization, AnyCompany Nonprofit. I’ll use this dataset to demonstrate using Amazon Bedrock evaluations to evaluate the performance of various FMs against my prompts.

Amazon Bedrock evaluations currently offer several options for automatic evaluation. The two I will be discussing are programmatic and model as judge.

  • With programmatic model evaluation, you can evaluate a model’s ability to perform a task (like general text generation, text summarization, question and answer, or text classification). The model evaluation returns three metrics: accuracy, robustness, and toxicity. Accuracy measures how closely the generated results are to your ground truth. Robustness measures how much the model’s response is affected by changes to the prompt (like adding or deleting whitespace, converting text to all lower case, typos, or converting numbers to words). Toxicity measures the amount of toxic content generated by the model. These metrics are calculated based on the task type selected.
  • Model as judge uses a judge model (or an evaluator model) to judge the generator model’s effectiveness. The evaluator model scores the responses from the generator model and provides an explanation for each response. The model as judge evaluation can return up to nine quality metrics: helpfulness, faithfulness, completeness, relevance, readability, correctness, professional style and tone, coherence, and following instructions. It can return up to three responsible AI metrics: harmfulness, stereotyping, and refusal.

Regardless of which automatic evaluation option you select, your prompt dataset must be stored in Amazon Simple Storage Service (Amazon S3). After adding your dataset to an S3 bucket, enable cross-origin resource sharing (CORS) permissions on the bucket. The CORS permissions are required for console-based model evaluation jobs.
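If you manage your bucket with code, the following sketch applies a permissive CORS rule using the boto3 put_bucket_cors API. The bucket name is a placeholder, and the rule values shown are an assumption; confirm the exact CORS configuration required for console-based evaluation jobs in the Amazon Bedrock documentation.

```python
import boto3

s3 = boto3.client("s3")

# Placeholder bucket name; replace with the bucket that holds your prompt dataset.
bucket_name = "amzn-s3-demo-bucket"

# A permissive CORS rule so the Amazon Bedrock console can access the dataset.
# Confirm the exact values against the current Amazon Bedrock documentation.
s3.put_bucket_cors(
    Bucket=bucket_name,
    CORSConfiguration={
        "CORSRules": [
            {
                "AllowedHeaders": ["*"],
                "AllowedMethods": ["GET", "PUT", "POST", "DELETE"],
                "AllowedOrigins": ["*"],
                "ExposeHeaders": ["Access-Control-Allow-Origin"],
            }
        ]
    },
)
```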

Programmatic model evaluation

To begin a programmatic model evaluation job, follow these steps:

  1. On the Amazon Bedrock console, in the left navigation pane, select Evaluations. Under Model evaluations, choose Create. In the dropdown menu, choose Automatic: Programmatic, as shown in the following screenshot.

    Figure 1. Selecting programmatic model evaluation within the Amazon Bedrock console.

  2. On the automatic evaluation screen, enter a name for your evaluation.
  3. Under Model selector, select your model. For this walkthrough, I selected Claude 3.5 Haiku.
  4. Under Task type, select Question and answer.
  5. In Metrics and datasets, for Metric choose Accuracy. Select Use your own prompt dataset. For S3 URI select the S3 location of the prompt dataset.
  6. Repeat step 5 for the Toxicity and Robustness metrics.
  7. Select the S3 location for the evaluation results.
  8. Select or create an AWS Identity and Access Management (IAM) role that grants Amazon Bedrock permissions to your S3 input and output locations.
  9. Choose Create.

These steps are shown in the following screenshot.

Figure 2. Configuring programmatic model evaluation for a set of custom prompts.
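The console isn’t the only way to start an evaluation. If you prefer to script the job, the Amazon Bedrock API exposes a CreateEvaluationJob operation; the boto3 sketch below shows roughly how the job configured in Figure 2 could be expressed. The role ARN, S3 URIs, dataset name, and model identifier are placeholders, and the exact request shape (including the built-in metric names) should be verified against the current boto3 documentation.

```python
import boto3

bedrock = boto3.client("bedrock")

# Placeholders: replace with your role ARN, S3 locations, and model ID.
role_arn = "arn:aws:iam::111122223333:role/BedrockEvalRole"
dataset_s3_uri = "s3://amzn-s3-demo-bucket/datasets/prompt-dataset.jsonl"
output_s3_uri = "s3://amzn-s3-demo-bucket/eval-results/"

response = bedrock.create_evaluation_job(
    jobName="programmatic-eval-claude-3-5-haiku",
    roleArn=role_arn,
    evaluationConfig={
        "automated": {
            "datasetMetricConfigs": [
                {
                    "taskType": "QuestionAndAnswer",
                    "dataset": {
                        "name": "AnyCompanyNonprofitPrompts",
                        "datasetLocation": {"s3Uri": dataset_s3_uri},
                    },
                    # Built-in metric names; confirm against current documentation.
                    "metricNames": [
                        "Builtin.Accuracy",
                        "Builtin.Toxicity",
                        "Builtin.Robustness",
                    ],
                }
            ]
        }
    },
    inferenceConfig={
        "models": [
            {"bedrockModel": {"modelIdentifier": "anthropic.claude-3-5-haiku-20241022-v1:0"}}
        ]
    },
    outputDataConfig={"s3Uri": output_s3_uri},
)

print(response["jobArn"])
```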

When the evaluation is complete, you can view the evaluation summary using the Amazon Bedrock console, or you can see the full results in the S3 evaluation results location.
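If you started the job through the API, you can also poll its status and locate the output from code. The following is a minimal sketch using the boto3 get_evaluation_job operation; the job ARN is a placeholder for whatever CreateEvaluationJob returned, and the response field names should be checked against the current documentation.

```python
import time
import boto3

bedrock = boto3.client("bedrock")

# Placeholder: the ARN returned by create_evaluation_job.
job_arn = "arn:aws:bedrock:us-east-1:111122223333:evaluation-job/example"

# Poll until the job reaches a terminal state, then report the results location.
while True:
    job = bedrock.get_evaluation_job(jobIdentifier=job_arn)
    status = job["status"]
    if status in ("Completed", "Failed", "Stopped"):
        break
    time.sleep(60)

print(status)
print(job["outputDataConfig"]["s3Uri"])
```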

I performed model evaluations using several different models on my sample prompts dataset. The results are shown in the following table.

| Model | Accuracy (higher is better) | Toxicity (lower is better) | Robustness (lower is better) |
|-------|-----------------------------|----------------------------|------------------------------|
| Amazon Nova Lite | 0.414 | 0.000413 | 15.4 |
| Amazon Nova Pro | 0.388 | 0.000444 | 21.8 |
| Anthropic’s Claude 3.5 Haiku | 0.503 | 0.000407 | 13.4 |
| Mistral Large (24.02) | 0.463 | 0.000401 | 15.7 |
| DeepSeek-R1 | 0.469 | 0.000420 | 14.0 |

The accuracy metric ranges from 0 to 1, with higher numbers indicating better performance. The toxicity metric is calculated using the detoxify algorithm, with values closer to 0 indicating the selected model is not producing toxic content. In the example in the table, all models have virtually the same toxicity level, which is close to 0, indicating there is no toxic content. For the robustness metric, a lower score indicates the selected model is more robust for the prompt dataset. Because these models are nondeterministic, you might see slightly different values for the same dataset.

In this example, Anthropic’s Claude 3.5 Haiku has the best accuracy and robustness for my specific prompt dataset. However, with a different prompt dataset, a different model could perform better. Although I can use these metrics to aid in selecting the best performing model, I can also use these metrics as I continually evolve my prompts. As I modify and improve the prompts for this workload, I can see if the prompt change has an impact on any of these three metrics. By tracking these metrics for my generative AI prompt dataset, I can select a model and make changes to my prompts with confidence.

Model-as-judge evaluation

Model-as-judge evaluation uses two models: an evaluator model and a generator model. The generator model generates responses based on the prompts dataset you provide, and the evaluator model evaluates the generator model’s responses based on the metrics you select. Each evaluator model uses a set of evaluator prompts to score the evaluation job. As a best practice, you should select an evaluator model from a model family that is different from the generator model.

When performing a model-as-judge evaluation, you can use the same prompt dataset format used for programmatic model evaluation. However, the ground truth response in the referenceResponse field is optional. If the prompt dataset does contain a referenceResponse, it’s used to calculate the completeness and correctness metrics.

To create a model-as-judge evaluation job, follow these steps:

  1. On the Amazon Bedrock Model evaluations page, choose Create. In the dropdown menu, select Automatic: Model as judge, as shown in the following figure.

    Figure 3. Selecting model-as-judge model evaluation within the Amazon Bedrock console.

  2. In the page that appears, select the evaluator model, the generator model, and the set of metrics you want evaluated. In my example, I selected Mistral Large (24.02) as the evaluator model, Claude 3.5 Haiku as the generator model, three quality metrics (Helpfulness, Correctness, and Completeness), and one responsible AI metric (Harmfulness).
  3. Select the input dataset, the evaluation results location, and the IAM role.
  4. Choose Create to begin the evaluation.
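As with programmatic evaluation, this job can also be created from code through the same CreateEvaluationJob operation. The sketch below shows roughly how the evaluator model, generator model, and selected metrics might be supplied; the evaluatorModelConfig structure, task type, and metric names are assumptions based on the current API and should be verified against the boto3 documentation, and the ARNs, S3 URIs, and model identifiers are placeholders.

```python
import boto3

bedrock = boto3.client("bedrock")

# Placeholders: replace with your role ARN and S3 locations.
role_arn = "arn:aws:iam::111122223333:role/BedrockEvalRole"
dataset_s3_uri = "s3://amzn-s3-demo-bucket/datasets/prompt-dataset.jsonl"
output_s3_uri = "s3://amzn-s3-demo-bucket/eval-results/"

response = bedrock.create_evaluation_job(
    jobName="model-as-judge-eval",
    roleArn=role_arn,
    evaluationConfig={
        "automated": {
            "datasetMetricConfigs": [
                {
                    "taskType": "General",
                    "dataset": {
                        "name": "AnyCompanyNonprofitPrompts",
                        "datasetLocation": {"s3Uri": dataset_s3_uri},
                    },
                    # Quality and responsible AI metrics selected in this walkthrough;
                    # confirm the exact metric names in the documentation.
                    "metricNames": [
                        "Builtin.Helpfulness",
                        "Builtin.Correctness",
                        "Builtin.Completeness",
                        "Builtin.Harmfulness",
                    ],
                }
            ],
            # Mistral Large (24.02) as the evaluator (judge) model.
            "evaluatorModelConfig": {
                "bedrockEvaluatorModels": [
                    {"modelIdentifier": "mistral.mistral-large-2402-v1:0"}
                ]
            },
        }
    },
    # Claude 3.5 Haiku as the generator model.
    inferenceConfig={
        "models": [
            {"bedrockModel": {"modelIdentifier": "anthropic.claude-3-5-haiku-20241022-v1:0"}}
        ]
    },
    outputDataConfig={"s3Uri": output_s3_uri},
)

print(response["jobArn"])
```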

When the evaluation is complete, the model-as-judge evaluation provides a dashboard of the metrics you selected. These metrics are normalized between 0 and 1, and the closer the values are to 1, the more they exhibit the metric (in other words, higher is better). As with programmatic model evaluation, because these models are nondeterministic, you might see slightly different scores with the same dataset. The Metrics Summary screen shows the results, as shown in the following figure.

Figure 4. The metrics summary for the model-as-judge evaluation of the sample prompts dataset.

The evaluation results page includes histograms of the various metric scores. To view the Generation output, your original Ground truth, and the Score given by the evaluator model, expand the Prompt details section underneath each histogram. If you select the score, you can see the reasoning for it, as shown in the following screenshot.

Figure 5. The evaluator model provides a reason for why the generation output was given a score of 0.5 compared to the ground truth data.

The following example shows the reasoning for the lower correctness score of 0.5 that was given to one of the responses. For more details on how scoring is computed, refer to the prompts Mistral Large used to generate these responses.

The candidate response correctly identifies the three sponsorship levels and their respective prices as detailed in the ground truth response. However, the candidate response provides additional information about the benefits and hierarchy of the sponsorship levels, which is not present in the ground truth response. Despite this, the extra information is accurate and derived from the provided context, so it does not make the response incorrect.

Conclusion

In this post, I demonstrated how you can move beyond vibe testing and generate quantitative results to determine the best FM to use for your generative AI workload. I demonstrated how to use programmatic model and model-as-judge evaluations, which you can use to find the best FM for your workload. Additionally, as new FMs are released, you can validate new models against your workload, giving you confidence to upgrade. Finally, you can use these methods to measure the effect of modifying your prompts.

In this post, I only demonstrated programmatic and model-as-judge evaluations, but Amazon Bedrock evaluations also support Retrieval Augmented Generation (RAG) evaluations and human worker evaluations. RAG evaluations are useful if you have a RAG workflow and want to test not only the text generation using a foundation model, but also the retrieval of content from your knowledge base. Human worker evaluations are useful if you have an evaluation that requires specialized knowledge or skill to determine the validity of the response.

For Amazon Bedrock programmatic and model-as-judge evaluations, you only pay for the inference costs of the models you select (both input and output tokens) based on standard Amazon Bedrock pricing. There is no additional cost for model evaluations.

As a next step, create a ground truth prompt dataset for your workload and begin evaluating the performance of your Amazon Bedrock workloads today.

Mike George

Mike is a principal solutions architect at Amazon Web Services (AWS) based in Salt Lake City, Utah. He enjoys helping customers solve their technology problems. His interests include software engineering, security, artificial intelligence (AI), and machine learning (ML).