Llama 2 foundation models from Meta are now available in Amazon SageMaker JumpStart

October 2023: This post was reviewed and updated with support for finetuning.

Today, we are excited to announce that Llama 2 foundation models developed by Meta are available for customers through Amazon SageMaker JumpStart to fine-tune and deploy. The Llama 2 family of large language models (LLMs) is a collection of pre-trained and fine-tuned generative text models ranging in scale from 7 billion to 70 billion parameters. Fine-tuned LLMs, called Llama-2-chat, are optimized for dialogue use cases. You can easily try out these models and use them with SageMaker JumpStart, which is a machine learning (ML) hub that provides access to algorithms, models, and ML solutions so you can quickly get started with ML.

In this post, we walk through how to discover, deploy, and fine-tune Llama 2 models via SageMaker JumpStart.

What is Llama 2

Llama 2 is an auto-regressive language model that uses an optimized transformer architecture. Llama 2 is intended for commercial and research use in English. It comes in a range of parameter sizes—7 billion, 13 billion, and 70 billion—as well as pre-trained and fine-tuned variations. According to Meta, the tuned versions use supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to align to human preferences for helpfulness and safety. Llama 2 was pre-trained on 2 trillion tokens of data from publicly available sources. The tuned models are intended for assistant-like chat, whereas pre-trained models can be adapted for a variety of natural language generation tasks. Regardless of which version of the model a developer uses, the responsible use guide from Meta can assist in guiding additional fine-tuning that may be necessary to customize and optimize the models with appropriate safety mitigations.

What is SageMaker JumpStart

With SageMaker JumpStart, ML practitioners can choose from a broad selection of publicly available foundation models. ML practitioners can deploy foundation models to dedicated Amazon SageMaker instances from a network isolated environment and customize models using SageMaker for model training and deployment.

You can now discover and deploy Llama 2 with a few clicks in Amazon SageMaker Studio or programmatically through the SageMaker Python SDK, enabling you to derive model performance and MLOps controls with SageMaker features such as Amazon SageMaker Pipelines, Amazon SageMaker Debugger, or container logs. The model is deployed in an AWS secure environment and under your VPC controls, helping ensure data security. Llama 2 models are available today in Amazon SageMaker Studio in us-east-1 (fine-tunable), us-east-2 (inference only), us-west 2 (fine-tunable), eu-west-1 (fine-tunable), and ap-southeast-1 (inference only) Regions.

Discover models

You can access the foundation models through SageMaker JumpStart in the SageMaker Studio UI and the SageMaker Python SDK. In this section, we go over how to discover the models in SageMaker Studio.

SageMaker Studio is an integrated development environment (IDE) that provides a single web-based visual interface where you can access purpose-built tools to perform all ML development steps, from preparing data to building, training, and deploying your ML models. For more details on how to get started and set up SageMaker Studio, refer to Amazon SageMaker Studio.

Once you’re on the SageMaker Studio, you can access SageMaker JumpStart, which contains pre-trained models, notebooks, and prebuilt solutions, under Prebuilt and automated solutions.

From the SageMaker JumpStart landing page, you can browse for solutions, models, notebooks, and other resources. You can find two flagship Llama 2 models in the Foundation Models: Text Generation carousel.

If you don’t see Llama 2 models, update your SageMaker Studio version by shutting down and restarting. For more information about version updates, refer to Shut down and Update Studio Apps.

You can also find other four model variants by choosing Explore all Text Generation Models or searching for llama in the search box.

You can choose the model card to view details about the model such as license, data used to train, and how to use. You can also find two buttons, Deploy and Open Notebook, which help you use the model.

When you choose either button, a pop-up will show the end-user license agreement and acceptable use policy for you to acknowledge.

Upon acknowledging, you will proceed to the next step to use the model.

Deploy a model

When you choose Deploy and acknowledge the terms, model deployment will start. Alternatively, you can deploy through the example notebook that shows up by choosing Open Notebook. The example notebook provides end-to-end guidance on how to deploy the model for inference and clean up resources.

To deploy using a notebook, we start by selecting an appropriate model, specified by the model_id. You can deploy any of the selected models on SageMaker with the following code:

from sagemaker.jumpstart.model import JumpStartModel
my_model = JumpStartModel(model_id = "meta-textgeneration-llama-2-70b-f")
predictor = my_model.deploy()

This deploys the model on SageMaker with default configurations, including default instance type and default VPC configurations. You can change these configurations by specifying non-default values in JumpStartModel. After it’s deployed, you can run inference against the deployed endpoint through the SageMaker predictor:

payload = {
    “inputs”:  
      [
        [
         {"role": "system", "content": "Always answer with Haiku"},
         {"role": "user", "content": "I am going to Paris, what should I see?"},
        ]   
      ],
   "parameters":{"max_new_tokens":256, "top_p":0.9, "temperature":0.6}
}

Fine-tuned chat models (Llama-2-7b-chat, Llama-2-13b-chat, Llama-2-70b-chat) accept a history of chat between the user and the chat assistant, and generate the subsequent chat. The pre-trained models (Llama-2-7b, Llama-2-13b, Llama-2-70b) requires a string prompt and perform text completion on the provided prompt. See the following code:

predictor.predict(payload, custom_attributes="accept_eula=true")

Note that by default, accept_eula is set to false. You need to set accept_eula=true to invoke the endpoint successfully. By doing so, you accept the user license agreement and acceptable use policy as mentioned earlier. You can also download the license agreement.

Custom_attributes used to pass EULA are key/value pairs. The key and value are separated by = and pairs are separated by ;. If the user passes the same key more than once, the last value is kept and passed to the script handler (i.e., in this case, used for conditional logic). For example, if accept_eula=false; accept_eula=true is passed to the server, then accept_eula=true is kept and passed to the script handler.

Inference parameters control the text generation process at the endpoint. The maximum new tokens control refers to the size of the output generated by the model. Note that this is not the same as the number of words because the vocabulary of the model is not the same as the English language vocabulary, and each token may not be an English language word. Temperature controls the randomness in the output. Higher temperature results in more creative and hallucinated outputs. All the inference parameters are optional.

The following table lists all the Llama models available in SageMaker JumpStart along with the model_ids, default instance types, and the maximum number of total tokens (sum of number of input tokens and number of generated tokens) supported for each of these models.

Model Name	Model ID	Max Total Tokens	Default Instance Type
Llama-2-7b	meta-textgeneration-llama-2-7b	4096	ml.g5.2xlarge
Llama-2-7b-chat	meta-textgeneration-llama-2-7b-f	4096	ml.g5.2xlarge
Llama-2-13b	meta-textgeneration-llama-2-13b	4096	ml.g5.12xlarge
Llama-2-13b-chat	meta-textgeneration-llama-2-13b-f	4096	ml.g5.12xlarge
Llama-2-70b	meta-textgeneration-llama-2-70b	4096	ml.g5.48xlarge
Llama-2-70b-chat	meta-textgeneration-llama-2-70b-f	4096	ml.g5.48xlarge

Note that SageMaker endpoints have a timeout limit of 60s. Thus, even though the model may be able to generate 4096 tokens, if text generation takes more than 60s, request will fail. For 7B, 13B, and 70B models, we recommend to set max_new_tokens no greater than 1500, 1000, and 500 respectively, while keeping the total number of tokens less than 4K.

Inference and example prompts for Llama-2-70b

You can use Llama models for text completion for any piece of text. Through text generation, you can perform a variety of tasks, such as answering questions, language translation, sentiment analysis, and many more. Input payload to the endpoint looks like the following code:

payload = {
    “inputs”: str,
   (optional)"parameters":{"max_new_tokens":int, "top_p":float, "temperature":float}
}

The following are some sample example prompts and the text generated by the model. All outputs are generated with inference parameters {"max_new_tokens":256, "top_p":0.9, "temperature":0.6}.

Input: Simply put, the theory of relativity states that  

 Output: 
1) the laws of physics are the same for all observers in uniform motion relative to one another, and 2) the speed of light is the same for all observers, regardless of their relative motion or of the motion of the source of the light.
In the late 1800s, scientists believed that light travelled through a medium they called the luminiferous ether. But in 1887, Albert Michelson and Edward Morley designed an experiment to measure the speed of light in the ether, and found that the speed of light was the same no matter how they oriented their equipment. This result seemed to contradict the idea of the ether, and Michelson and Morley’s experiment became one of the most famous failed experiments in history.
In 1905, Albert Einstein published a paper that used the results of the Michelson-Morley experiment to develop the theory of special relativity. In 1915, Einstein published a paper on general relativity, which is a theory of gravitation.
In physics, the speed of light is considered to be a constant, and its value is 299,792,458 metres

Input: I believe the meaning of life is 

 Output: 
to be happy, to find joy in life, to love and be loved. To have the courage to be yourself and to be kind to others. To be able to express yourself, to create and to be inspired. To be able to give and receive. To be able to learn and grow. To be able to find beauty in the world and to be able to appreciate it. To be able to find peace and contentment. To be able to find meaning and purpose. To be able to make a difference in the world. To be able to find joy in the simple things. To be able to find joy in the difficult things. To be able to find joy in the journey. To be able to find joy in the destination. To be able to find joy in the present moment. To be able to find joy in the future. To be able to find joy in the past. To be able to find joy in the present. To be able to find joy in the future. To be able to find joy in the past. To be able to find joy in the present. To be able to find joy in the future. To be able to find joy in the past. To be able to find joy in the present. To be able to find joy

Input: A brief message congratulating the team on the launch:

        Hi everyone,
        
        I just  

 Output: 
wanted to send a quick note to congratulate everyone on the launch of our new website. I know it’s been a long time coming and I’m sure we’re all relieved that it’s finally out there.
        
        Thanks to everyone for their hard work and dedication over the last few months. I know it wasn’t always easy, but I think we can all be proud of the final product.
        
        I’m looking forward to seeing how our users respond to the new site and I’m confident that it will be a success.
        
        Thanks again, and enjoy the rest of your day!
        
        Best,
        
        [Your Name]
        
        [Your Title]
        
        [Your Company]
        
        [Your Email]
        
        [Your Phone Number]
        
        [Your Website]
        
        [Your Social Media Handles]
        
        [Your Address]
        
        [Your City, State, Zip Code]
        
        [Your Country]
        
        [Your Postal Code]
        
        [Your Website]
        
        [

In the next example, we show how to use Llama models with few-shot in-context learning, where we provide training samples available to the model. Note that we only make inference on the deployed model and during this process, model weights don’t change.

Input: Translate English to French:
        
        sea otter => loutre de mer
        peppermint => menthe poivrée
        plush girafe => girafe peluche
        cheese => 

 Output: 
fromage
        chocolate => chocolat
        car => voiture
        hamster => hamster
        beer => bière
        bicycle => bicyclette
        banana => banane
        strawberry => fraise
        pizza => pizza
        cheeseburger => cheeseburger
        pineapple => ananas
        carrot => carotte
        watermelon => pastèque
        whale => baleine
        eggplant => aubergine
        cat => chat
        penguin => manchot
        lion => lion
        elephant => éléphant
        panda => panda
        tiger => tigre
        salmon => saumon
        squirrel => écureuil
        kangaroo => kangourou
        turtle => tortue
        octopus => poulpe
        frog => grenouille
        pig => cochon
        goldfish => poisson rouge
        tiger shark => requin tigre
        gorilla => gorille
        dol

Inference and example prompts for Llama-2-70b-chat

With Llama-2-Chat models, which are optimized for dialogue use cases, the input to the chat model endpoints is the previous history between the chat assistant and the user. You can ask questions contextual to the conversation that has happened so far. You can also provide the system configuration, such as personas that define the chat assistant’s behavior. The input payload to the endpoint looks like the following code:

payload = {
    “inputs”:  [
     [
      (optional){"role": "system", "content": str},
      (repeated k times) {"role": "user", "content": str}, {"role": "assistant", "content": str},
      {"role": "user", "content": str},
     ]
    ],
   (optional)"parameters":{"max_new_tokens":int, "top_p":float, "temperature":float}
}

The following are some sample example prompts and the text generated by the model. All outputs are generated with the inference parameters {"max_new_tokens": 512, "top_p": 0.9, "temperature": 0.6}.

Input:
User: what is the recipe of mayonnaise?

Output:
Assistant:  Sure, I can help you with that! Mayonnaise is a thick, creamy condiment made from a mixture of egg yolks, oil, vinegar or lemon juice, and seasonings. Here's a basic recipe for homemade mayonnaise:

Ingredients:

* 2 egg yolks
* 1/2 cup (120 ml) neutral-tasting oil, such as canola or grapeseed
* 1 tablespoon (15 ml) vinegar or lemon juice
* 1/2 teaspoon salt
* 1/4 teaspoon black pepper

Instructions:

1. In a medium-sized bowl, whisk together the egg yolks, salt, and black pepper until well combined.
2. Slowly pour the oil into the bowl while continuously whisking the mixture. It's important to add the oil very slowly, otherwise the mixture can break and you'll end up with a separated, curdled mess.
3. Continue whisking until the mixture thickens and emulsifies, which should take about 5-7 minutes. You'll know it's ready when the mixture has doubled in volume and has a thick, creamy consistency.
4. Add the vinegar or lemon juice and continue whisking for another minute or two, until the flavors are fully incorporated.
5. Taste the mayonnaise and adjust the seasoning as needed. You can add more salt, pepper, or acid if desired.
6. Cover the bowl with plastic wrap and refrigerate the mayonnaise for at least 30 minutes before serving. This will allow the flavors to meld together and the mayonnaise to thicken further.

That's it! This recipe makes about 1 1/2 cups (360 ml) of mayonnaise, which should be enough for a small to medium-sized batch of sandwiches or salads. You can also customize the recipe by adding different flavorings, such as garlic, mustard, or herbs, to create a unique taste. Enjoy!

In the following example, the user has had a conversation with the assistant about tourist sites in Paris. Next, the user is inquiring about the first option recommended by the chat assistant.

Input:
User: I am going to Paris, what should I see?
Assistant: Paris, the capital of France, is known for its stunning architecture, art museums, historical landmarks, and romantic atmosphere. Here are some of the top attractions to see in Paris:

1. The Eiffel Tower: The iconic Eiffel Tower is one of the most recognizable landmarks in the world and offers breathtaking views of the city.
2. The Louvre Museum: The Louvre is one of the world's largest and most famous museums, housing an impressive collection of art and artifacts, including the Mona Lisa.
3. Notre-Dame Cathedral: This beautiful cathedral is one of the most famous landmarks in Paris and is known for its Gothic architecture and stunning stained glass windows.

These are just a few of the many attractions that Paris has to offer. With so much to see and do, it's no wonder that Paris is one of the most popular tourist destinations in the world.
User: What is so great about #1?

Output:
Assistant:  The Eiffel Tower is a world-renowned iconic landmark that symbolizes Paris and France. It was built for the World's Fair in 1889 and was initially intended to be a temporary structure, but it has become a permanent part of the Parisian skyline. Here are some reasons why the Eiffel Tower is so great:

1. Engineering marvel: The Eiffel Tower was a groundbreaking feat of engineering at the time of its construction. It stands at 324 meters (1,063 feet) tall and was the tallest structure in the world for over 40 years.
2. Iconic design: The Eiffel Tower's distinctive shape and design make it instantly recognizable. It has been the subject of countless works of art, photographs, and films, and has become a symbol of Paris and France.
3. Panoramic views: The Eiffel Tower offers breathtaking views of the city from its observation decks. Visitors can enjoy a 360-degree view of Paris, including famous landmarks such as the Seine River, the Arc de Triomphe, and the Montmartre neighborhood.
4. Historical significance: The Eiffel Tower has played a significant role in Paris's history, having been the site of numerous important events, including the famous "Battle of the Eiffel Tower" during World War II.
5. Cultural impact: The Eiffel Tower has had a significant impact on popular culture, inspiring countless works of art, literature, and music. It has also been the subject of numerous documentaries and films, including the classic movie "An American in Paris."
6. Romantic setting: The Eiffel Tower is known for its romantic atmosphere, making it a popular destination for couples and newlyweds. It has been the site of countless marriage proposals and has been featured in numerous romantic films and literature.
7. Accessibility: The Eiffel Tower is easily accessible by public transportation, making it a convenient destination for visitors. It is also wheelchair-friendly, with elevators and ramps providing access to all levels.

Overall, the Eiffel Tower is a must-see destination for anyone visiting Paris, offering a unique and unforgettable experience that combines history, culture, engineering, and rom

In the following examples, we set the system’s configuration:

Input:
System: Always answer with Haiku
User: I am going to Paris, what should I see?

Output:
Assistant:  Eiffel Tower high
Art in the Louvre Museum
River Seine flows

Input:
System: Always answer with emojis
User: How to go from Beijing to NY?

Output:
Assistant:  Here are the emojis that represent the steps to go from Beijing to New York:

🇨🇳🛫🛬🇺🇸🗽🏙️🚕💨🛩️🛬🛫😴💤🛩️🛬🛫😍

Fine-Tuning

In ML, the ability to transfer the knowledge learned in one domain to another is called transfer learning. You can use transfer learning to produce accurate models on your smaller datasets, with much lower training costs than the ones involved in training the original model. Using transfer learning, you can fine-tune the Llama 2 model and adapt on your own dataset in a matter of 1-2 hours. You can use both domain adaptation and instruction tuning datasets to perform fine-tuning of the base model. Currently, you can train Llama 2 7B and 13B model on SageMaker JumpStart. Fine-tuning scripts are based on the scripts provided by this repo. They utilize Fully Sharded Data Parallel (FSDP) library as well as Low Rank Adaptation (LoRA) method fine-tuning the models efficiently. You can tune any of the 14 hyper-parameters to adapt fine-tuning for your application. To learn more, please see the Fine-tune LLaMA 2 models on SageMaker JumpStart notebook and Fine-tune Llama 2 for text generation on Amazon SageMaker JumpStart blog post.

To train the model via studio, you can click on the train button in the model card. It will train the model on SageMaker with the default dataset and default parameters.

Clean up

After you’re done running the notebook, make sure to delete all resources so that all the resources that you created in the process are deleted and your billing is stopped:

predictor.delete_model()
predictor.delete_endpoint()

Conclusion

In this post, we showed you how to get started with Llama 2 models in SageMaker Studio. With this, you have access to six Llama 2 foundation models that contain billions of parameters. Because foundation models are pre-trained, they can also help lower training and infrastructure costs and enable customization for your use case. To get started with SageMaker JumpStart, visit the following resources:

About the authors

June Won is a product manager with SageMaker JumpStart. He focuses on making foundation models easily discoverable and usable to help customers build generative AI applications. His experience at Amazon also includes mobile shopping application and last mile delivery.

Dr. Vivek Madan is an Applied Scientist with the Amazon SageMaker JumpStart team. He got his PhD from University of Illinois at Urbana-Champaign and was a Post Doctoral Researcher at Georgia Tech. He is an active researcher in machine learning and algorithm design and has published papers in EMNLP, ICLR, COLT, FOCS, and SODA conferences.

Dr. Kyle Ulrich is an Applied Scientist with the Amazon SageMaker JumpStart team. His research interests include scalable machine learning algorithms, computer vision, time series, Bayesian non-parametrics, and Gaussian processes. His PhD is from Duke University and he has published papers in NeurIPS, Cell, and Neuron.

Dr. Ashish Khetan is a Senior Applied Scientist with Amazon SageMaker JumpStart and helps develop machine learning algorithms. He got his PhD from University of Illinois Urbana-Champaign. He is an active researcher in machine learning and statistical inference, and has published many papers in NeurIPS, ICML, ICLR, JMLR, ACL, and EMNLP conferences.

Sundar Ranganathan is the Global Head of GenAI/Frameworks GTM Specialists at AWS. He focuses on developing GTM strategy for large language models, GenAI, and large-scale ML workloads across AWS services like Amazon EC2, EKS, EFA, AWS Batch, and Amazon SageMaker. His experience includes leadership roles in product management and product development at NetApp, Micron Technology, Qualcomm, and Mentor Graphics.

AWS Machine Learning Blog