AWS Machine Learning Blog
Code Llama code generation models from Meta are now available via Amazon SageMaker JumpStart
Today, we are excited to announce Code Llama foundation models, developed by Meta, are available for customers through Amazon SageMaker JumpStart to deploy with one click for running inference. Code Llama is a state-of-the-art large language model (LLM) capable of generating code and natural language about code from both code and natural language prompts. Code Llama is free for research and commercial use. You can try out this model with SageMaker JumpStart, a machine learning (ML) hub that provides access to algorithms, models, and ML solutions so you can quickly get started with ML. In this post, we walk through how to discover and deploy the Code Llama model via SageMaker JumpStart.
What is Code Llama
Code Llama is a model released by Meta that is built on top of Llama 2 and is a state-of-the-art model designed to improve productivity for programming tasks for developers by helping them create high quality, well-documented code. The models show state-of-the-art performance in Python, C++, Java, PHP, C#, TypeScript, and Bash, and have the potential to save developers’ time and make software workflows more efficient. It comes in three variants, engineered to cover a wide variety of applications: the foundational model (Code Llama), a Python specialized model (Code Llama-Python), and an instruction-following model for understanding natural language instructions (Code Llama-Instruct). All Code Llama variants come in three sizes: 7B, 13B, and 34B parameters. The 7B and 13B base and instruct variants support infilling based on surrounding content, making them ideal for code assistant applications.
The models were designed using Llama 2 as the base and then trained on 500 billion tokens of code data, with the Python specialized version trained on an incremental 100 billion tokens. The Code Llama models provide stable generations with up to 100,000 tokens of context. All models are trained on sequences of 16,000 tokens and show improvements on inputs with up to 100,000 tokens.
The model is made available under the same community license as Llama 2.
What is SageMaker JumpStart
With SageMaker JumpStart, ML practitioners can choose from a growing list of best-performing foundation models. ML practitioners can deploy foundation models to dedicated Amazon SageMaker instances within a network isolated environment and customize models using SageMaker for model training and deployment.
You can now discover and deploy Code Llama models with a few clicks in Amazon SageMaker Studio or programmatically through the SageMaker Python SDK, enabling you to derive model performance and MLOps controls with SageMaker features such as Amazon SageMaker Pipelines, Amazon SageMaker Debugger, or container logs. The model is deployed in an AWS secure environment and under your VPC controls, helping ensure data security. Code Llama models are discoverable and can be deployed in in US East (N. Virginia), US West (Oregon) and Europe (Ireland) regions.
Customers must accept the EULA to deploy model visa SageMaker SDK.
Discover models
You can access Code Llama foundation models through SageMaker JumpStart in the SageMaker Studio UI and the SageMaker Python SDK. In this section, we go over how to discover the models in SageMaker Studio.
SageMaker Studio is an integrated development environment (IDE) that provides a single web-based visual interface where you can access purpose-built tools to perform all ML development steps, from preparing data to building, training, and deploying your ML models. For more details on how to get started and set up SageMaker Studio, refer to Amazon SageMaker Studio.
In SageMaker Studio, you can access SageMaker JumpStart, which contains pre-trained models, notebooks, and prebuilt solutions, under Prebuilt and automated solutions.
On the SageMaker JumpStart landing page, you can browse for solutions, models, notebooks, and other resources. You can find Code Llama models in the Foundation Models: Text Generation carousel.
You can also find other model variants by choosing Explore all Text Generation Models or searching for Code Llama.
You can choose the model card to view details about the model such as license, data used to train, and how to use. You will also find two buttons, Deploy and Open Notebook, which will help you use the model.
Deploy
When you choose Deploy and acknowledge the terms, deployment will start. Alternatively, you can deploy through the example notebook by choosing Open Notebook. The example notebook that provides end-to-end guidance on how to deploy the model for inference and clean up resources.
To deploy using notebook, we start by selecting an appropriate model, specified by the model_id
. You can deploy any of the selected models on SageMaker with the following code:
This deploys the model on SageMaker with default configurations, including default instance type and default VPC configurations. You can change these configurations by specifying non-default values in JumpStartModel. After it’s deployed, you can run inference against the deployed endpoint through the SageMaker predictor:
Note that by default, accept_eula
is set to false
. You need to set accept_eula=true
to invoke the endpoint successfully. By doing so, you accept the user license agreement and acceptable use policy as mentioned earlier. You can also download the license agreement.
Custom_attributes
used to pass EULA are key/value pairs. The key and value are separated by = and pairs are separated by ;. If the user passes the same key more than once, the last value is kept and passed to the script handler (in this case, used for conditional logic). For example, if accept_eula=false; accept_eula=true
is passed to the server, then accept_eula=true
is kept and passed to the script handler.
Inference parameters control the text generation process at the endpoint. The maximum new tokens control refers to the size of the output generated by the model. Note that this is not the same as the number of words because the vocabulary of the model is not the same as the English language vocabulary, and each token may not be an English language word. Temperature controls the randomness in the output. Higher temperature results in more creative and hallucinated outputs. All the inference parameters are optional.
The following table lists all the Code Llama models available in SageMaker JumpStart along with the model IDs, default instance types, and the maximum supported tokens (sum of the number of input tokens and number of generated tokens for all concurrent requests) supported for each of these models.
Model Name | Model ID | Default Instance Type | Max Supported Tokens |
CodeLlama-7b | meta-textgeneration-llama-codellama-7b | ml.g5.2xlarge | 10000 |
CodeLlama-7b-Instruct | meta-textgeneration-llama-codellama-7b-instruct | ml.g5.2xlarge | 10000 |
CodeLlama-7b-Python | meta-textgeneration-llama-codellama-7b-python | ml.g5.2xlarge | 10000 |
CodeLlama-13b | meta-textgeneration-llama-codellama-13b | ml.g5.12xlarge | 32000 |
CodeLlama-13b-Instruct | meta-textgeneration-llama-codellama-13b-instruct | ml.g5.12xlarge | 32000 |
CodeLlama-13b-Python | meta-textgeneration-llama-codellama-13b-python | ml.g5.12xlarge | 32000 |
CodeLlama-34b | meta-textgeneration-llama-codellama-34b | ml.g5.48xlarge | 48000 |
CodeLlama-34b-Instruct | meta-textgeneration-llama-codellama-34b-instruct | ml.g5.48xlarge | 48000 |
CodeLlama-34b-Python | meta-textgeneration-llama-codellama-34b-python | ml.g5.48xlarge | 48000 |
While the Code Llama models were trained on a context length of 16,000 tokens, the models have reported good performance on even larger context windows. The maximum supported tokens column in the preceding table is the upper limit on the supported context window on the default instance type. Since the Code Llama 7B model can only support 10,000 tokens on an ml.g5.2xlarge instance, we recommend deploying a 13B or 34B model version if larger contexts are required for your application.
By default, all models work for code generation tasks. The base and instruct models both respond to infilling tasks, though the base model had better quality output for the majority of sample queries. Finally, only instruct models work on instruct tasks. The following table illustrates which models had good performance (Good) and moderate performance (Moderate) on example queries in the demo notebooks.
. | Code Generation | Code Infilling | Code instructions |
CodeLlama-7b | Good | Good | N/A |
CodeLlama-7b-Instruct | Good | Moderate | Good |
CodeLlama-7b-Python | Good | N/A | N/A |
CodeLlama-13b | Good | Good | N/A |
CodeLlama-13b-Instruct | Good | Moderate | Good |
CodeLlama-13b-Python | Good | N/A | N/A |
CodeLlama-34b | Good | N/A | N/A |
CodeLlama-34b-Instruct | Good | N/A | Good |
CodeLlama-34b-Python | Good | N/A | N/A |
Code generation
The following examples were run on the CodeLlama-34b-Instruct model with payload parameters "parameters": {"max_new_tokens": 256, "temperature": 0.2, "top_p": 0.9}
:
Code infilling
Code infilling involves returning generated code given surrounding context. This differs from the code generation task because, in addition to a prefix code segment, the model is also provided with a code segment suffix. Special tokens were used during fine-tuning to mark the beginning of the prefix (<PRE>
), the beginning of the suffix (<SUF>
), and the beginning of the middle (<MID>
). Input sequences to the model should be in one of the following formats:
- prefix-suffix-middle –
<PRE> {prefix} <SUF>{suffix} <MID>
- suffix-prefix-middle –
<PRE> <SUF>{suffix} <MID> {prefix}
The following examples use the prefix-suffix-middle format on the CodeLlama-7b model with payload parameters {"max_new_tokens": 256, "temperature": 0.05, "top_p": 0.9}
:
Code instructions
Meta also provided an instruction-tuned variant of Code Llama. Example queries in this section can only be applied to these instruction-tuned Code Llama models, which are the models with a model ID instruct suffix. The Code Llama format for instructions is the same as the Llama-2-chat prompt format, which we detail in Llama 2 foundation models are now available in SageMaker JumpStart
A simple user prompt may look like the following:
You may also add a system prompt with the following syntax:
Finally, you can have a conversational interaction with the model by including all previous user prompts and assistant responses in the input:
These examples were run on the CodeLlama-13b-Instruct model with payload parameters “parameters”: {"max_new_tokens": 512, "temperature": 0.2, "top_p": 0.9}
:
Clean up
After you’re done running the notebook, make sure to delete all resources that you created in the process so your billing is stopped. Use the following code:
Conclusion
In this post, we showed you how to get started with Code Llama models in SageMaker Studio and deploy the model for generating code and natural language about code from both code and natural language prompts. Because foundation models are pre-trained, they can help lower training and infrastructure costs and enable customization for your use case. Visit SageMaker JumpStart in SageMaker Studio now to get started.
Resources
- SageMaker JumpStart documentation
- SageMaker JumpStart Foundation Models documentation
- SageMaker JumpStart product detail page
- SageMaker JumpStart model catalog
About the authors
Gabriel Synnaeve is a Research Director on the Facebook AI Research (FAIR) team at Meta. Prior to Meta, Gabriel was a postdoctoral fellow in Emmanuel Dupoux’s team at École Normale Supérieure in Paris, working on reverse-engineering the acquisition of language in babies. Gabriel received his PhD in Bayesian modeling applied to real-time strategy games AI from the University of Grenoble.
Eissa Jamil is a Partner Engineer RL, Generative AI at Meta.
Dr. Kyle Ulrich is an Applied Scientist with the Amazon SageMaker JumpStart team. His research interests include scalable machine learning algorithms, computer vision, time series, Bayesian non-parametrics, and Gaussian processes. His PhD is from Duke University and he has published papers in NeurIPS, Cell, and Neuron.
Dr. Ashish Khetan is a Senior Applied Scientist with Amazon SageMaker JumpStart and helps develop machine learning algorithms. He got his PhD from University of Illinois Urbana-Champaign. He is an active researcher in machine learning and statistical inference, and has published many papers in NeurIPS, ICML, ICLR, JMLR, ACL, and EMNLP conferences.
Vivek Singh is a product manager with SageMaker JumpStart. He focuses on enabling customers to onboard SageMaker JumpStart to simplify and accelerate their ML journey to build Generative AI applications.