Amazon Bedrock LLMs: A practical guide for chatbot development
Simplifying LLM selection and testing for AWS with Datasaur LLM Labs
The landscape of applications today
Generative AI is revolutionizing how we build our products. Almost every industry is using generative AI and building applications that provide unique experiences to their customers. From automating customer service to enhancing marketing campaigns, organizations are leveraging gen AI to drive efficiency and innovation. However, the success of these applications depends on selecting the best large language model for the use case. Choosing the wrong model can lead to poor user experience, budget overruns, and loss of trust. Performing in-depth LLM evaluations and choosing the right one can help you provide the best possible experience to your customers. Evaluating an LLM can be a lengthy and complex process; however, with some strategy and modern tools, you can make this process quick and smooth.
In this article, you will learn how to use Datasaur LLM Labs to automate and parallelize the LLM selection process.
Let’s dig in!
Imagine you are a machine learning engineer who recently joined an education startup that has seen good success catering to private schools in New York. However, the application's group collaboration feature has quite a high attrition rate: students use it once and never return. During a teacher focus group discussion, the product team learned that students don't find the feature useful because it can't answer their questions. And guess who the product team wants to fix it? Your team! The task is simple: build a generative-AI-backed chatbot that answers students' questions and brings them back into the collaboration tool.
Your leadership has granted a $5,000 budget and expects a working proof of concept (POC) in four weeks. The technical architect has chosen Amazon Bedrock as the service to implement the generative AI application and is relying on you to confirm which large language model (LLM) to integrate. You have two weeks to finalize the LLM so the implementation can begin.
Defining parameters for your use case
Before you start evaluating models, it’s important to get a clear understanding of the acceptance criteria defined by the business. Here is what you got from your product manager:
- Latency: The customer is OK with a maximum response time of 1.5 seconds as the service level agreement (SLA).
- Cost: Must not exceed the POC budget of $5,000. Traffic will be low, as the new chatbot feature will be offered exclusively to 10 schools as part of the POC.
- Quality: Answer accuracy is most important; the chatbot must answer more than 80% of questions accurately. The response tone should be educational, factual, and casual.
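These numbers will come up again at every evaluation step, so it can help to pin them down in code. Below is a minimal sketch, in Python, of the acceptance criteria as plain constants with a trivial pass/fail check; the names and structure are illustrative and not tied to any particular framework.

```python
# Acceptance criteria from the product manager, captured as plain constants
# so later evaluation scripts can check candidate models against them.
ACCEPTANCE_CRITERIA = {
    "max_latency_seconds": 1.5,   # SLA: respond within 1.5 seconds
    "max_poc_budget_usd": 5000,   # total POC spend must stay within $5,000
    "min_answer_accuracy": 0.80,  # more than 80% of questions answered correctly
    "target_tone": ["educational", "factual", "casual"],  # checked qualitatively
}

def meets_criteria(latency_s: float, spend_usd: float, accuracy: float) -> bool:
    """Return True if a candidate model satisfies the hard requirements."""
    return (
        latency_s <= ACCEPTANCE_CRITERIA["max_latency_seconds"]
        and spend_usd <= ACCEPTANCE_CRITERIA["max_poc_budget_usd"]
        and accuracy >= ACCEPTANCE_CRITERIA["min_answer_accuracy"]
    )
```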
Checking cost, latency, and general model information is straightforward, so no issues there. However, the quality requirement will eat up most of your time, making you wonder whether two weeks is enough time to finalize the LLM.
Let me explain why.
Challenges of choosing the right LLM from the list of candidates
The LLM you choose must meet your performance requirements, fit your budget, and satisfy your quality bar, and quality evaluation is an Olympian task. You must identify scenarios and check whether the model's response meets expectations from different perspectives. For example, is the answer 1) correct, 2) accurate, 3) comprehensive, and 4) informative? I'm thinking of that famous Facebook meme showing a woman with a campfire inside the tent, saying “Don’t take camping advice from generative AI!”
On another level, you also have to answer: 5) Is the LLM able to reason? 6) Is the model output coherent and factual? 7) Does it respond in a tone that’s appropriate for your audience? 8) Does it show any biases offending your customers? 9) What are its weaknesses? 10) Does it do the task well? 11) Is the output consistently good? and 12) Does it work well in the context of your data?
You get the idea—there are a lot of ways for your LLM to go wrong, and you don’t want yours to end up on a listicle of generative AI blunders.
On top of that, you must thoroughly evaluate not just one but all of your candidate LLMs. To start, you must identify high-quality candidates.
Step 1: Identify candidate models
Identifying candidate models is easier than it sounds. You apply obvious criteria to the models offered on Amazon Bedrock and filter out old, irrelevant, expensive, or slow models:
- Criterion 1: Since you want to use the LLM for your chatbot use case, choose text/chat models only
  - LLMs to evaluate: 114
- Criterion 2: Since you have ultra-low traffic and a limited budget, choose serverless/PAYG models only
  - LLMs to evaluate: 32
- Criterion 3: Recent LLMs only
  - LLMs to evaluate: 19
- Criterion 4: Since you need an LLM that understands context and dialogue flow and has a grasp of conversational tone and etiquette, choose chat-optimized LLMs only
  - LLMs to evaluate: 6
- Criterion 5: Fits the budget and has latency under 1.5 seconds
  - LLMs to evaluate: 4
Here is the final list of candidate LLMs you decide to evaluate:
OK, that was easy! All your candidate LLMs fit your cost and latency criteria. They are recent, chat-optimized, text-based LLMs available on a pay-as-you-go (PAYG) basis. Now, it's time to do a deeper qualitative evaluation to see which of the four is best for your use case.
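If you prefer to script the first rounds of filtering rather than click through the console, the Bedrock control-plane API can help. The sketch below uses boto3's list_foundation_models call to keep only text-output, on-demand (serverless/PAYG) models; the remaining criteria, such as recency, chat optimization, price, and latency, still need a manual pass or your own lookup tables.

```python
import boto3

# Bedrock control-plane client (not bedrock-runtime, which is used for inference).
bedrock = boto3.client("bedrock", region_name="us-east-1")

# Criteria 1 and 2: text-output models that support on-demand (serverless/PAYG) inference.
response = bedrock.list_foundation_models(
    byOutputModality="TEXT",
    byInferenceType="ON_DEMAND",
)

candidates = [
    m["modelId"]
    for m in response["modelSummaries"]
    # Keep only models Amazon still marks as ACTIVE (drops legacy versions).
    if m.get("modelLifecycle", {}).get("status") == "ACTIVE"
]

print(f"{len(candidates)} candidate models after the first filters:")
for model_id in sorted(candidates):
    print(" -", model_id)
```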
Perform deep qualitative analysis
Let's talk about what's absolutely crucial to your use case: answer accuracy. You simply can't afford an LLM that gives wrong answers, as this can completely destroy students' trust and your application's reputation. But here's the thing about evaluation: it's not as straightforward as it might seem.
With 2,000 questions and four different LLMs, you're suddenly looking at reviewing 8,000 responses. This brings up some important questions you need to think about: Who's going to do all this evaluation? Will you invoke each model yourself, or will you need to bring in others to help? What about checking the answer accuracy? Will you check it yourself?
Since you don't have the luxury of manually invoking each model and checking each response yourself, your best way to proceed is with an automated evaluation. There are multiple industry-standard evaluators you can use to evaluate an LLM's responses. After doing your research, you land on the following four:
- LangChain - Answer Correctness evaluator: This metric measures the accuracy of the LLM's response compared to the ground truth. It returns scores between 1 (low) and 10 (high), along with chain-of-thought reasoning.
- DeepEval's Answer Relevancy: This metric evaluates how relevant the LLM's responses are to the given questions.
- DeepEval's Bias: Assesses the presence of bias in the LLM's outputs based on predefined criteria.
- DeepEval's Toxicity: Detects and quantifies toxic language or harmful content in the LLM's responses.
An automated evaluation around these four metrics would take care of answer accuracy, relevance, bias, and toxicity.
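To make the metrics concrete, here is a rough sketch of computing them directly with the LangChain and DeepEval libraries, assuming an OpenAI API key is configured so GPT-4 can act as the judge. Datasaur LLM Labs runs equivalent evaluators for you; this only illustrates what each metric consumes and returns, and the sample question and answer are made up.

```python
# Illustrative only: Datasaur LLM Labs runs these evaluators for you.
# Assumes OPENAI_API_KEY is set so GPT-4 can act as the judge model.
from langchain.evaluation import load_evaluator
from langchain_openai import ChatOpenAI
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric, BiasMetric, ToxicityMetric

question = "What causes the seasons on Earth?"
ground_truth = "The tilt of Earth's axis relative to its orbit around the Sun."
llm_answer = "Seasons are caused by the 23.5-degree tilt of Earth's rotational axis."

# 1) Answer correctness (LangChain): a 1-10 score plus the judge's reasoning.
judge = ChatOpenAI(model="gpt-4", temperature=0)
correctness = load_evaluator("labeled_score_string", criteria="correctness", llm=judge)
result = correctness.evaluate_strings(
    input=question, prediction=llm_answer, reference=ground_truth
)
print("correctness:", result["score"], result["reasoning"])

# 2-4) Answer relevancy, bias, and toxicity (DeepEval): scores between 0 and 1.
test_case = LLMTestCase(input=question, actual_output=llm_answer)
for metric in (
    AnswerRelevancyMetric(threshold=0.7, model="gpt-4"),
    BiasMetric(threshold=0.5, model="gpt-4"),
    ToxicityMetric(threshold=0.5, model="gpt-4"),
):
    metric.measure(test_case)
    print(type(metric).__name__, metric.score, metric.reason)
```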
Need for an automated evaluation engine
Automating LLM evaluation is a complex task. To perform an evaluation, you need to send prompts to each model, write code to handle throttling, timeouts, and retries, and perhaps build a UI to compare answers side by side. Even with the right engineering team, it would take weeks to build a system that runs automated evaluations against multiple LLMs using your datasets, computes multiple evaluation metrics with custom scoring functions, and displays results in a UI that lets you dive deep into individual responses. A system that lets you evaluate multiple LLMs is necessary because this is not a one-time activity: down the line, when newer LLMs are released, you will want to migrate to newer, better, cheaper models. The tool you use should let you test all the LLMs available in Amazon Bedrock, plus other LLMs in case you later consider models that are exclusively available on another platform. It should not carry an expensive license cost; in fact, a pay-as-you-go SaaS product would be perfect.
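To give a sense of the plumbing involved, here is a rough sketch of just the invocation layer using the Bedrock Converse API with basic throttling retries; scoring, dataset handling, and a comparison UI would all still sit on top of this. The model IDs and retry settings are illustrative.

```python
import time

import boto3
from botocore.exceptions import ClientError

# Runtime client for inference; the model IDs below are examples, not recommendations.
bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")
CANDIDATE_MODELS = [
    "anthropic.claude-3-haiku-20240307-v1:0",
    "amazon.titan-text-express-v1",
]

def ask(model_id: str, question: str, max_retries: int = 5) -> str:
    """Send one question to one model, backing off when Bedrock throttles."""
    for attempt in range(max_retries):
        try:
            response = bedrock_runtime.converse(
                modelId=model_id,
                messages=[{"role": "user", "content": [{"text": question}]}],
                inferenceConfig={"maxTokens": 512, "temperature": 0.2},
            )
            return response["output"]["message"]["content"][0]["text"]
        except ClientError as err:
            if err.response["Error"]["Code"] == "ThrottlingException":
                time.sleep(2 ** attempt)  # exponential backoff, then retry
            else:
                raise
    raise RuntimeError(f"{model_id} kept throttling after {max_retries} attempts")

# Fan the same question out to every candidate so answers can be compared side by side.
for model_id in CANDIDATE_MODELS:
    print(model_id, "->", ask(model_id, "Why is the sky blue?"))
```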
Exploring Datasaur LLM Labs
You start exploring AWS Marketplace and come across Datasaur LLM Labs, a platform that can evaluate the inference quality, speed, and cost of multiple LLMs. What stands out is its support for 200+ models and its ability to run automated evaluations against them. The platform also has a UI from which your human evaluators can manually perform inference on multiple LLMs at a time. Furthermore, it promises that your non-public data remains private. Based on your findings, you decide to do a quick POC to evaluate Datasaur.ai's evaluation capabilities for the question-answering use case.
Datasaur LLM Labs comes with a free trial when you subscribe through AWS Marketplace. You go ahead and subscribe to the product and start a manual evaluation.
Perform a manual evaluation
Within the Datasaur console, you first create a sandbox, add all your candidate models, and configure them as individual applications.

For each application, you configure custom hyperparameters and a knowledge base along with custom system and user instructions.

Next, you configure a knowledge base, upload the question-and-answer data to it, and associate the knowledge base with the sandbox and all four applications.

Next, you return to the sandbox, associate the knowledge base with each application, and write a prompt. The Datasaur LLM Labs UI lets you send a single prompt to all four applications (LLMs) at the click of a button. If you want to run multiple prompts, you simply click the Add prompt button and send the prompts to the corresponding LLM applications.

With this manual invocation ability, human evaluators can assess multiple LLMs on characteristics such as 1) coherence, 2) relevance, 3) crispness, 4) tone, 5) context retention, and the ability to ask more focused follow-up questions. They can further tune hyperparameters and re-invoke the prompts they have written to see how each LLM behaves.
OK, this is great for manual evaluation, but what about answer accuracy? Answer accuracy is trickier to evaluate because thousands of questions need to be asked and the answers reviewed. This is where Datasaur LLM Labs' automated evaluation functionality helps.
Perform an automated evaluation
You can save the applications you configured in the sandbox to the library and then use them for an automated evaluation.

The first step of the evaluation is to choose applications and provide a Q&A dataset. For a quick test, I built a tiny but moderately complex question-and-answer dataset.
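If you want to assemble a similar test set, a few lines of Python are enough to write question and ground-truth pairs to a CSV. The column names below are illustrative, and the exact upload format Datasaur LLM Labs expects may differ, so check its documentation before importing.

```python
import csv

# A handful of question/ground-truth pairs; a real run would use the full question set.
qa_pairs = [
    {
        "question": "What is photosynthesis?",
        "expected_answer": "The process by which plants convert light, water, and CO2 into glucose and oxygen.",
    },
    {
        "question": "Who wrote the play Romeo and Juliet?",
        "expected_answer": "William Shakespeare.",
    },
    {
        "question": "If a train travels 120 miles in 2 hours, what is its average speed?",
        "expected_answer": "60 miles per hour.",
    },
]

# Write the pairs to a CSV file for upload.
with open("qa_dataset.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["question", "expected_answer"])
    writer.writeheader()
    writer.writerows(qa_pairs)
```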

For the chatbot use case, you can choose the LangChain Answer Correctness metric and use GPT-4 to evaluate the answers. After the evaluation is complete, you can open it and explore the results. As you can see in the following screenshot, you get an Answer Correctness score for each LLM, which helps you understand how correct its answers are.

You can further drill down into each question's answer, see its score (1-10, 10 being the highest), and read the reasoning behind why that specific score was assigned.

Next, you can trigger evaluation jobs for bias, toxicity, and answer relevancy, and analyze their outputs in the same way.
As you analyze the automated evaluation results along with input from your human evaluators, you get clarity on which LLM does the job really well. Next, you can spend time tuning the prompts to further improve the responses while documenting the limitations and weaknesses associated with the model. If those weaknesses affect a large number of use cases, you can consider an alternative approach: splitting the problem statement and addressing it with more than one solution.
Key Takeaways
You just learned how to use Datasaur LLM Labs to perform manual and automated evaluations on multiple LLMs simultaneously and identify the right LLM for your use case. Here are some key takeaways to help you get started on your next project:
- Spend time understanding the use case, requirements, and parameters, such as allowed latency and cost for the project.
- Identify a model hub, apply mandatory filtering criteria, and review general metrics on leaderboards such as HELM to arrive at a list of candidate models to evaluate.
- Explore tools such as Datasaur LLM Labs to expedite the evaluation process. Remember, evaluation is not a one-time activity; you will do it again down the line as you explore new LLMs.
- Tune your prompts and integrate with the Amazon Bedrock Converse API (a minimal integration sketch follows).
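As a starting point for that last step, here is a minimal sketch of calling the Bedrock Converse API with a system prompt aimed at the educational, factual, casual tone required earlier; the model ID and prompt wording are placeholders for whatever your evaluation selects.

```python
import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

# Placeholder: swap in the model ID your evaluation selected.
MODEL_ID = "anthropic.claude-3-haiku-20240307-v1:0"

# System prompt tuned toward the required tone; refine it during evaluation.
SYSTEM_PROMPT = (
    "You are a friendly study assistant for school students. "
    "Answer questions accurately and factually, explain concepts in an "
    "educational but casual tone, and say so when you are not sure."
)

def answer_student(question: str) -> str:
    """Send a student question to the chosen model and return its reply."""
    response = bedrock_runtime.converse(
        modelId=MODEL_ID,
        system=[{"text": SYSTEM_PROMPT}],
        messages=[{"role": "user", "content": [{"text": question}]}],
        inferenceConfig={"maxTokens": 512, "temperature": 0.3},
    )
    return response["output"]["message"]["content"][0]["text"]

print(answer_student("Can you explain Newton's third law with an example?"))
```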
If you haven’t started yet, give Datasaur LLM Labs a try for free in AWS Marketplace using your AWS account.
Get hands-on
About AWS Marketplace
AWS Marketplace makes it easy to find and add new tools from across the AWS partner community to your tech stack with the ability to try for free and pay-as-you-go using your AWS account.

- Easily add new category-leading third-party solution capabilities into your AWS environment.
- Avoid upfront license fees and pay only for what you use, consolidating billing with your AWS account.
