AWS Marketplace

Experimenting with GPT-2 XL machine learning model package on Amazon SageMaker

New deep-learning model architectures push what’s possible in the field of natural language processing (NLP). NLP is the study of methods for processing and analyzing human language data. In machine learning (ML), transfer learning takes model parameters learned on one task and uses them as a basis for another task with some additional fine-tuning. When applied to NLP, transfer learning enables ML practitioners to use large, pre-trained language models. You can fine-tune models for your specific downstream tasks at a fraction of the cost required to train a similar model from scratch.

One such NLP use case is to generate coherent paragraphs or even whole corpora of text based on input text. This process is called text generation.

In this post, Alex and I show you how to subscribe to and deploy a pretrained GPT-2 XL model, available in AWS Marketplace, to Amazon SageMaker to process natural language. We focus primarily on text generation and show you how to run experiments that generate first prose and then poetry. We also dive deep into text generation parameters to understand how changing them affects the output.

Solution overview

At a high level, the following architecture shows how the model is deployed on SageMaker and used to make predictions. The solution consists of four steps.

  1. Authenticate into your AWS account and subscribe to the pretrained ML model available in AWS Marketplace.
  2. Create an Amazon SageMaker notebook instance and deploy the model to Amazon SageMaker using this sample Jupyter notebook. Using the sample code in the notebook, you can create a SageMaker inference endpoint for single predictions. You can also use SageMaker in batch transform mode for batch predictions. Both modes facilitate interactive experimentation with the language model.
  3. Use the SageMaker endpoint to experiment with the GPT-2 XL model for various NLP use cases.
  4. Perform the same experiments in batch transform mode. Refer to the following diagram.

Step 1: Subscribe to the GPT-2 XL model

To subscribe to the model in AWS Marketplace, follow these steps.

  1. Log in to your AWS account.
  2. Open the GPT-2 XL listing in AWS Marketplace.
  3. Read Highlights, Product Overview, Usage information, and Additional resources. Review the supported instance types.
  4. Choose Continue to Subscribe.
  5. Review the End User License Agreement, Support Terms, and Pricing Information.
  6. Review the pricing list, which is based on instance type.
  7. When you are ready, choose Accept Offer.

Step 2: Create the SageMaker notebook instance and deploy the model

Step 2.1 Set up the notebook instance and the model

  1. Create a classic notebook instance using these instructions.
  2. Open the notebook instance. You should have access to all SageMaker examples.
  3. In the AWS Marketplace section, use creative-writing-using-gpt-2-text-generation.ipynb to follow along with the rest of this post.

Step 2.2 Experiment with the sample code

To get the most out of this post, read the use cases and experiment with the sample code. To get set up, do the following.

  1. In the notebook, execute the code cells up to step 2.
  2. After you create an endpoint, in step 3 of the notebook, run the code cells for each use case.
  3. To ensure that you delete the SageMaker endpoint after you’re done experimenting, execute steps 4 and 5 of the notebook.

Step 3: Explore text generation use cases and experiments

In this section, we walk through text generation use cases and experiments. We include generating a sonnet based on Shakespearean input and generating prose based on sample input. We dive deep into the parameters for that experiment, using a histogram to evaluate how different text parameters affect the model’s output.

You can use either a SageMaker endpoint to make distinct API calls or batch transform to make bulk inferences.

Step 3.1 Text generation parameters

To explore the use cases, first look at some model parameters that influence text generation behavior. Understanding what these parameters are and how they work provides insight into the experiments and helps you define your own. Here is the list of parameters you can modify when you call the model package to generate text:

| Model parameter | Type and value | Description |
| --- | --- | --- |
| input | String; required | The input text. |
| length | Int; default = 50 | The number of words to generate. |
| num_return_sequences | Int; default = 1 | The number of different sequences to generate. All sequences start from the same input. |
| temperature | Float; default = 1.0 | The softmax temperature. Higher values increase creativity and decrease output coherence. |
| k | Int; default = 50 | Top-k sampling. The model chooses from the k most probable words. Lower values eliminate less coherent candidate words. |
| p | Float; default = 1.0 | Top-p (nucleus) sampling. Set between 0 and 1 to activate. As an alternative to top-k, the model selects from the smallest set of candidate words whose cumulative probability exceeds p. Values closer to 1.0 generally produce more coherent output. |
| repetition_penalty | Float; default = 1.0 | A higher value discourages the model from repeating the same token. |
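To make the table concrete, here is a sketch of a request payload that combines these parameters; the input text and parameter values below are illustrative, not recommendations:

```python
import json

# Illustrative payload; every key comes from the parameter table above
payload = json.dumps({
    "input": "Shall I compare thee to a summer's day?",
    "length": 50,               # generate 50 words
    "num_return_sequences": 2,  # two independent continuations of the same input
    "temperature": 0.8,         # slightly more conservative than the default 1.0
    "k": 40,                    # sample from the 40 most probable words
    "p": 0.9,                   # nucleus: smallest set with cumulative probability over 0.9
    "repetition_penalty": 1.2,  # discourage repeated tokens
})
print(payload)
```

Parameters you omit fall back to the defaults listed in the table.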

Step 3.2 Text generation use cases

The primary use case for GPT-2 XL is to predict text based on contextual input. To demonstrate this, we set up experiments to have the model generate first prose and then poetry. For the poetry experiment, we chose a recognizable style: Shakespearean.

Step 3.2.a Experiment 1: Writing prose with model assistance

In the following code sample, we specify an input prompt on a specific topic to mimic a prose author’s writing style. To try this, execute the code cell in your Jupyter notebook and examine your own unique output.

Example payload:

payload = '{"input": "Machine learning is great for humanity. It helps", "length": 40, "repetition_penalty": 10, "num_return_sequences": 1}'

response = client.invoke_endpoint(
    EndpointName=endpoint_name,       # endpoint created in step 2 of the notebook
    ContentType="application/json",
    Body=payload,
)

output = response["Body"].read()


Example output:

{'output': [' us predict our future – not just in business environments like financial markets and medical diagnoses, but also as a society building better structures to support more human-centric outcomes (eSport? Better …']}

This API call generates text based on the given input context. With the input ‘Machine learning is great for humanity. It helps…’, this experiment generated the phrase completion ‘…us predict our future – not just in business environments like financial markets and medical diagnoses, but also as a society building better structures to support more human-centric outcomes.’
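The `response["Body"].read()` call returns raw bytes; decoding the completion is a small parsing step. The sample body below is abbreviated from the output above, assuming the endpoint returns a JSON document of this shape:

```python
import json

# Abbreviated response body in the shape shown above (assumed JSON encoding)
body = b'{"output": [" us predict our future, not just in business environments"]}'

result = json.loads(body)          # parse the JSON bytes
completion = result["output"][0]   # first generated sequence
print(completion)
```

With num_return_sequences greater than 1, the output list holds one entry per generated sequence.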

Step 3.2.b Experiment 2: Authoring a poem and experimenting with different text-generation parameters

This experiment shows how to use GPT-2 XL to generate text based on the style of an input prompt, which in this case is a Shakespearean sonnet.

The sample prompt is a fragment of William Shakespeare’s Sonnet 1. Each of the three responses below continues that same prompt and was generated with a different combination of text generation parameters.

Sample prompt:

From fairest creatures we desire increase,

That thereby beauty’s rose might never die,

But as the riper should by time decrease,

His tender heir might bear his memory:

Response 1:

The first was of heaven: the other of earth.

And how sweet these were that were not to be,

The first of heaven hath to us been named the sun,

And from him hath received his due recompense.

Twas to him that first God gave, and is with us still,

Response 2:

O fairest, fairest son of earth! thou’rt thy parent.

I’ll follow thee, thy noble heart shall bear me,

The fairest of thy kin, with sweet affection:

And in the shade of thine eyes let us lie,

And while thou art sleeping, while my hear

Response 3:

And as the rose, if it be but once,

Should be lost, so the fair of his life should be lost.

But as the rose, if it be but once,

Should be lost, so the fair of his life should be lost.
The model output in the Response column mimics the style of the sample prompt. It is passable as a Shakespearean-style sonnet.

Step 3.2.c Understanding how text generation parameters influence model output

In the experiment in step 3.2.b, the text generation parameters we chose for each prompt affected the response. To further visualize how different combinations of text generation parameters influence model output, we plotted a histogram of the distribution of common syntactic categories of words for outputs of length 70.

The following histogram shows the outcome of eleven experiments that generate a continuation of the Shakespeare Sonnet 1 fragment using different combinations of text generation parameters. Each bar corresponds to the frequency of a specific word category in the generated text. The following categories were evaluated: conjunctions, prepositions, adjectives, modals, nouns, pronouns, adverbs, and verbs. Each bar color corresponds to a distinct experiment; see the graph legend for more detail.

For each parameter, we selected several values and tested their combined influence on the output.

  • The temperature parameter. This has the greatest influence on the generated text across all categories of words. The default value is 1, which is a good starting point in most cases. Large values of temp dramatically shift the probability distribution for the majority of tokens and might hurt output quality. You might notice disproportionate spikes for adjectives on the graph when temp is set to 5. On the other hand, temperatures closer to zero force the model to gravitate toward mostly memorized text.
  • The sampling parameters, p and k. These are good for fine-tuning the text that the model generates. You can use them independently or in combination. Both parameters influence the sampling pool of tokens to choose the next generated word. The k parameter determines a fixed number of most probable tokens to choose from. A higher value of p allows for more candidate tokens to select from when the majority have similar probabilities.
  • Tuning the quality of a text. Both sampling parameters can affect the frequency of prepositions and modals in the generated text. Unless you set these parameters to extreme values, they enable you to alter the style and tune the quality of the text that the model produces, without impacting its overall coherence as much as temp does.
  • Influencing the quality of an output. The combination of text generation parameters such as temp, p, k, and length can substantially influence quality of an output produced by the model, as well as its style.
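To make these interactions concrete, here is a minimal NumPy sketch of how temperature, k, and p reshape a toy next-token distribution. This is not the model’s actual implementation, and the five-token vocabulary and logits are invented for illustration:

```python
import numpy as np

def sample_distribution(logits, temperature=1.0, k=0, p=1.0):
    """Return the next-token sampling distribution after temperature scaling,
    top-k filtering, and top-p (nucleus) filtering."""
    probs = np.exp(logits / temperature)
    probs /= probs.sum()                     # temperature-scaled softmax
    order = np.argsort(probs)[::-1]          # token indices, most probable first
    sorted_probs = probs[order]
    keep = np.ones_like(sorted_probs, dtype=bool)
    if k > 0:
        keep[k:] = False                     # top-k: keep only the k most probable tokens
    if p < 1.0:
        cumulative = np.cumsum(sorted_probs)
        # top-p: keep the smallest set whose cumulative probability exceeds p
        cutoff = int(np.searchsorted(cumulative, p)) + 1
        keep[cutoff:] = False
    filtered = np.where(keep, sorted_probs, 0.0)
    filtered /= filtered.sum()               # renormalize over surviving tokens
    out = np.zeros_like(probs)
    out[order] = filtered                    # map back to original token order
    return out

logits = np.array([3.0, 2.5, 1.0, 0.5, -1.0])        # toy 5-token vocabulary
print(sample_distribution(logits))                   # default parameters
print(sample_distribution(logits, temperature=5.0))  # flatter, more "creative"
print(sample_distribution(logits, k=2))              # only the 2 most probable remain
print(sample_distribution(logits, p=0.7))            # nucleus of most probable tokens
```

Running the sketch shows k=2 and p=0.7 collapsing the distribution onto the two most probable tokens, while temperature=5.0 flattens it, mirroring the spikes and drift described above.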

Step 3.2.d Experiment 3: Analyzing the distribution of words in the response

Next, you can use the same Shakespeare fragment to quantify and compare the linguistic properties of responses.

Start with the fragment of Sonnet 1 and compare the subsequent lines of the original to the lines that GPT-2 XL generates. For each continuation, both the original Shakespeare text and the model-generated text, plot the distribution of words by syntactic category (parts of speech).

From fairest creatures we desire increase,

That thereby beauty’s rose might never die,

But as the riper should by time decrease,

His tender heir might bear his memory:

Original Shakespeare text:

But thou, contracted to thine own bright eyes,

Feed’st thy light’st flame with self-substantial fuel,

Making a famine where abundance lies,

Thyself thy foe, to thy sweet self too cruel.

Model-generated text:

O’er his brow should glory the light

Of eternal suns, from whose rays

There may still the world renew and live,

If he who shall possess the sun’s mantle,

Should yet by the sun’s dying star be destroyed.

In the distribution of words by categories for both texts, nouns prevail over pronouns and verbs over adverbs. In the model-generated text, the distribution of syntactic categories of words is different, but still similar to the distribution in the original Shakespeare sonnet. The following bar charts show the distribution of conjunctions, prepositions, adjectives, nouns, pronouns, adverbs, and verbs for the input and for the model-generated sonnet.

Chart 1 Frequency of parts of speech, original poem

Chart 2 Frequency of parts of speech, model-generated poem
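Counts like those in Charts 1 and 2 would normally come from a part-of-speech tagger such as NLTK’s `pos_tag`. The following self-contained sketch hand-labels a tiny lexicon instead, just to show the counting step; every lexicon entry is illustrative:

```python
from collections import Counter

# Tiny hand-labeled lexicon for illustration only; a real analysis would use
# a POS tagger such as nltk.pos_tag instead of a fixed word list.
LEXICON = {
    "but": "conjunction",
    "to": "preposition", "with": "preposition",
    "bright": "adjective", "sweet": "adjective", "cruel": "adjective",
    "eyes": "noun", "flame": "noun", "fuel": "noun", "famine": "noun",
    "abundance": "noun", "foe": "noun", "self": "noun",
    "thou": "pronoun", "thy": "pronoun", "thine": "pronoun", "thyself": "pronoun",
    "too": "adverb",
    "contracted": "verb", "making": "verb", "lies": "verb",
}

def pos_counts(text):
    """Count words per syntactic category, ignoring words outside the lexicon."""
    words = [w.strip(",.;:!?").lower() for w in text.split()]
    return Counter(LEXICON[w] for w in words if w in LEXICON)

stanza = ("But thou, contracted to thine own bright eyes, "
          "Feed'st thy light'st flame with self-substantial fuel, "
          "Making a famine where abundance lies, "
          "Thyself thy foe, to thy sweet self too cruel.")
print(pos_counts(stanza))
```

Even with this toy lexicon, the sketch reproduces the tendency noted above for the original quatrain: nouns narrowly outnumber pronouns, and verbs outnumber adverbs.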

In contrast, if we set the model’s text generation parameters to be different from the default values (for example, temp=2, p=0.1, and k=2), the model response might become inconsistent with the input, as shown in the following image. Now the generated text is highly repetitive, which changes the syntactic distribution significantly: prepositions and adjectives are almost the same, as are nouns and pronouns, and modals and verbs are much lower.

The following chart shows the model-generated text on the left and a bar chart of the parts of speech in it on the right. The bar chart shows a high number of nouns, pronouns, and adjectives that does not align to the distribution in the original text.

And thus, in the world’s end, the fairest flower
Should be the last to die.
The last of the fair, the last of the fair,
The last of the fair, the last of the fair,
The last of the fair, the last of the fair,

Chart 3 Frequency of parts of speech, model-generated text with temp=2, p=0.1, and k=2

By experimenting with the model’s text-generation parameters, you might find optimal values that produce consistent and more meaningful output for your input prompt. Also, by analyzing the relative distribution of words by categories, you can implement automatic filtration of incoherent output.
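As one sketch of such automatic filtration, a distinct-token ratio cheaply flags repetitive output like the sample above; the 0-to-1 score and any threshold you pick are illustrative:

```python
def repetition_score(text):
    """Fraction of distinct tokens in the text; values near 0 indicate heavy repetition."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

repetitive = ("The last of the fair, the last of the fair, "
              "The last of the fair, the last of the fair,")
coherent = ("O'er his brow should glory the light "
            "Of eternal suns, from whose rays")

print(repetition_score(repetitive))  # low: mostly repeated tokens
print(repetition_score(coherent))    # high: mostly distinct tokens
```

A generation whose score falls below a tuned threshold could be discarded and resampled with different parameters.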


Conclusion

In this blog post and our notebook, we showed you how to run experiments with the GPT-2 XL model, available in AWS Marketplace. We showed how to subscribe to the model, create the SageMaker notebook instance, deploy the model, and experiment with the sample code.

We focused on exploring text generation use cases and experiments, starting with explaining model parameters and their effect on text generation. We ran experiments generating first prose and then poetry, creating our own Shakespeare-like sonnet, and we dove deep into showing how the text generation parameters temp, p, and k influence the output of the model. We also analyzed how the distribution of parts of speech shifted when those parameters changed.

Next steps

Try experimenting with question answering, reading comprehension, and even language translation.

About the Authors

Alex Ignatov is a Senior Data Scientist at AWS WWPS Professional Services, where he works with customers across different industries to facilitate adoption of AWS machine learning and AI services. He is passionate about cutting-edge deep learning techniques.



Mrudhula Balasubramanyan is a senior solutions architect with AWS WWPS Solutions Architecture. She specializes in AI/ML and enjoys innovating on behalf of her mission-driven nonprofit customers. When not obsessing over them, she can be seen hiking and biking the trails of the great Pacific Northwest.