AWS for M&E Blog

Democratize documentation summarization with Hugging Face on Amazon SageMaker


As the media and entertainment (M&E) industry evolves, companies within the space are finding more opportunities to use artificial intelligence (AI) and machine learning (ML) to deliver a better customer experience. Customer acquisition, retention, and engagement are key areas where AI/ML technologies are poised to deliver significant value. With increasingly complex content consumption patterns, AI/ML solutions help M&E customers effectively engage with their consumers.

Document summarization is an example of a key use case within the M&E industry. From generating article headlines to creating short-form summaries for push notifications, an automated document summarization solution can help content providers more effectively engage with their consumers. Traditionally, summarization tasks are performed manually. But that is not always practical, particularly with the recent explosion in user-generated content. New, state-of-the-art natural language generation (NLG) models can automatically generate high-quality document summaries. And solutions from Amazon Web Services (AWS) Partners make using these models easier. Using the Transformers library from Hugging Face, an AWS Partner, alongside Amazon SageMaker—ML for every data scientist and developer—deploying these models is simpler than ever. In this blog post, we explore how Amazon SageMaker can be used to deploy a state-of-the-art document summarization solution using the Hugging Face library.

Overview of document summarization approaches

Before delving into the solution details, it is worthwhile to cover two high-level approaches to document summarization: extractive and abstractive models. With extractive summarization, the ML model extracts key sentences from a large body of text verbatim, which might not always produce the highest quality summary. Abstractive methods, in contrast, generate entirely new summaries capturing the crux of the source document. Thus, these methods generate output similar to what a human summarizer could produce by paraphrasing the document.
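
To make the distinction concrete, the following is a minimal sketch of an extractive summarizer written with only the Python standard library. It is not the approach deployed later in this post, and the word-frequency scoring and the small STOPWORDS set are illustrative assumptions rather than a recommended method. It shows the defining property of extractive summarization: every sentence in the output is copied verbatim from the source.

```python
import re
from collections import Counter

# Small, illustrative stopword list (an assumption, not a standard resource)
STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "it",
             "that", "for", "on", "with", "as", "are", "was"}


def _tokens(text):
    # Lowercased word tokens with stopwords removed
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]


def extractive_summary(text, num_sentences=2):
    """Return the highest-scoring sentences verbatim, in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(_tokens(text))

    def score(sentence):
        # Average document-level frequency of the words in this sentence
        words = _tokens(sentence)
        return sum(freq[w] for w in words) / max(len(words), 1)

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:num_sentences])
    return " ".join(sentences[i] for i in keep)
```

An abstractive model, by contrast, generates new text token by token, so its output cannot be produced by selecting sentences in this way.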

Hugging Face supports both methods, and both have applications within the M&E industry. Traditional publishers have a need to summarize long-form articles for faster discoverability by readers. Video providers can use summarization to create blurbs for electronic program guides (EPGs) as well as media asset management (MAM). Further, content producers can create synopses of dailies by using automatic speech recognition (ASR) and summarization in the same workflow. Figure 1 provides an illustration of both approaches.

Figure 1: Extractive summarization extracts key terms verbatim. Abstractive summarization paraphrases the document.

While it is possible to fine-tune or even train summarization models from scratch, in this blog we deploy a pretrained model from the Hugging Face Hub. Although a pretrained model might not work in every domain, such as summarizing medical journals, it might suffice as a starting point for more general summarization use cases. Additionally, we use endpoints from the recently introduced Amazon SageMaker Asynchronous Inference, which queues incoming requests and processes them asynchronously.

An asynchronous endpoint is optimal for use cases that do not require real-time, low-latency predictions. Document summarization might fall into this category in cases where it is adequate to receive a generated summary within seconds or minutes as opposed to under a second. Additionally, this mode of inference facilitates the automatic scaling of the endpoint instances down to zero when the endpoint becomes idle for a set period of time. Also, by removing latency constraints, we can use lower-priced CPU instances in lieu of more expensive GPUs. Figure 2 is an illustrative example of what an asynchronous document summarization architecture could look like.
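
Scaling to zero is configured through Application Auto Scaling rather than on the endpoint itself. The following sketch builds the request parameters for register_scalable_target and put_scaling_policy. The policy name, target backlog value, cooldowns, and maximum capacity are assumptions chosen for illustration, and the variant name matches the one used later in this post.

```python
def scale_to_zero_requests(endpoint_name, variant_name="variant1",
                           max_capacity=2, backlog_target=5.0):
    """Build Application Auto Scaling requests for an asynchronous endpoint."""
    resource_id = f"endpoint/{endpoint_name}/variant/{variant_name}"
    dimension = "sagemaker:variant:DesiredInstanceCount"

    target = {
        "ServiceNamespace": "sagemaker",
        "ResourceId": resource_id,
        "ScalableDimension": dimension,
        "MinCapacity": 0,  # allows the endpoint to scale in to zero instances
        "MaxCapacity": max_capacity,
    }
    policy = {
        "PolicyName": f"{endpoint_name}-backlog-scaling",  # hypothetical name
        "ServiceNamespace": "sagemaker",
        "ResourceId": resource_id,
        "ScalableDimension": dimension,
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": backlog_target,  # assumed queued requests per instance
            "CustomizedMetricSpecification": {
                "MetricName": "ApproximateBacklogSizePerInstance",
                "Namespace": "AWS/SageMaker",
                "Dimensions": [{"Name": "EndpointName", "Value": endpoint_name}],
                "Statistic": "Average",
            },
            "ScaleInCooldown": 600,   # assumed values, in seconds
            "ScaleOutCooldown": 300,
        },
    }
    return target, policy


# Usage (needs AWS credentials, so it is left commented out):
# import boto3
# aas = boto3.client("application-autoscaling")
# target, policy = scale_to_zero_requests("document-summarization-endpoint")
# aas.register_scalable_target(**target)
# aas.put_scaling_policy(**policy)
```

Depending on your workload, scaling back out from zero instances might need an additional policy, so treat this as a starting point and check the Amazon SageMaker Asynchronous Inference documentation.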

With this approach, a document lands on Amazon Simple Storage Service (Amazon S3)—an object storage service offering industry-leading scalability, data availability, security, and performance—initiating a function from AWS Lambda, which lets users run code without thinking about servers or clusters. The function then invokes the Amazon SageMaker endpoint. The endpoint saves the summary output to Amazon S3 and sends a notification on Amazon Simple Notification Service (Amazon SNS), a fully managed messaging service for both application-to-application and application-to-person communication. Optionally, a queue on Amazon Simple Queue Service (Amazon SQS)—a fully managed message queuing service that lets users decouple and scale microservices, distributed systems, and serverless applications—can subscribe to the Amazon SNS notifications to fan out the summary results to downstream consumers.

Figure 2: Example of asynchronous document summarization architecture where a document summarization is initiated when a document is uploaded to Amazon S3
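
The AWS Lambda function in this architecture can be small. The following is a minimal sketch, assuming the function is triggered by an Amazon S3 event notification and that the endpoint name is supplied through a hypothetical ENDPOINT_NAME environment variable. The helper build_async_requests is our own illustrative name. The "text/csv" content type is the one that the inference script in the walkthrough expects.

```python
import os
from urllib.parse import unquote_plus


def build_async_requests(event, endpoint_name):
    """Turn an S3 event notification into invoke_endpoint_async keyword arguments."""
    requests = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications
        key = unquote_plus(record["s3"]["object"]["key"])
        requests.append({
            "EndpointName": endpoint_name,
            "InputLocation": f"s3://{bucket}/{key}",
            "ContentType": "text/csv",
        })
    return requests


def lambda_handler(event, context):
    import boto3  # imported lazily so the helper above can be tested offline

    runtime = boto3.client("sagemaker-runtime")
    endpoint_name = os.environ["ENDPOINT_NAME"]  # hypothetical variable name
    responses = [
        runtime.invoke_endpoint_async(**request)
        for request in build_async_requests(event, endpoint_name)
    ]
    # Each response carries the S3 location where the summary will be written
    return {"OutputLocations": [r["OutputLocation"] for r in responses]}
```

Keeping the event parsing separate from the AWS call makes the function easier to test without credentials.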

Deployment walkthrough

Here, we walk through an example to deploy an asynchronous endpoint that summarizes documents stored on Amazon S3. All of the code is available in a notebook that you can clone from this GitHub repository.

To follow along with this example, you need access to Amazon SageMaker Studio, a fully integrated development environment for ML, or an Amazon SageMaker notebook instance, an ML compute instance running the Jupyter Notebook App. Instructions for onboarding to Amazon SageMaker can be found in this documentation. There are several ways to load pretrained models from the Hugging Face Hub, as outlined in this blog announcing Amazon SageMaker Hugging Face inference capabilities. We use a slightly different approach here. Namely, we manually download the model from the Hugging Face Hub and package it into an Amazon SageMaker model artifact. In doing so, we can host the model artifact on our own Amazon S3 bucket instead of loading it from the Hugging Face Hub every time the endpoint scales out or is updated.

Running the shell commands that follow installs Git Large File Storage (LFS) so that we can clone the model repo using Git. Here we download a distilbart-cnn-12-6 model that can be used for abstractive summarization.

# Assumes the git-lfs package is already available on the notebook instance
! git lfs install
! git clone https://huggingface.co/sshleifer/distilbart-cnn-12-6

Next, we combine the tokenizer and the model into a single model.tar artifact.

! cd distilbart-cnn-12-6 && tar --exclude=".*" -cvf  model.tar * && mv model.tar ../model.tar
! rm -r distilbart-cnn-12-6/

Next, we create a custom entry point Python script that handles model loading and input processing at inference time. Here, the model_fn is responsible for loading the pretrained model. The contents of the model.tar.gz artifact are extracted into the model_dir, which is passed as the argument to the model_fn. From there, the Hugging Face pipeline construct can be used to create a summarization pipeline. The transform_fn is responsible for processing the input data with which the endpoint is invoked. The following example expects a text payload, which is then passed into the summarization pipeline.

%%writefile src/
from transformers import BartTokenizer, BartForConditionalGeneration
from transformers import pipeline
import json


def model_fn(model_dir):
    # Load the tokenizer and model from the extracted model artifact
    tokenizer = BartTokenizer.from_pretrained(model_dir)
    model = BartForConditionalGeneration.from_pretrained(model_dir)
    nlp = pipeline("summarization", model=model, tokenizer=tokenizer)
    return nlp


def transform_fn(nlp, request_body, input_content_type, output_content_type="application/json"):
    # Summarize text payloads; reject any other content type
    if input_content_type == "text/csv":
        result = nlp(request_body, truncation=True)[0]
    else:
        raise Exception(f"Content {input_content_type} type not supported")
    return json.dumps(result)
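
Before packaging the script, it can be useful to exercise the handler contract locally without downloading the model. The following sketch restates a transform_fn along the lines described above and calls it with fake_pipeline, a hypothetical stand-in that mimics the shape of the Hugging Face pipeline output (a list containing one dictionary). It checks only the content-type handling and JSON serialization, not summary quality.

```python
import json


def transform_fn(nlp, request_body, input_content_type, output_content_type="application/json"):
    # Same contract as the entry point script: text in, JSON string out
    if input_content_type == "text/csv":
        result = nlp(request_body, truncation=True)[0]
    else:
        raise Exception(f"Content {input_content_type} type not supported")
    return json.dumps(result)


def fake_pipeline(text, truncation=True):
    # Hypothetical stand-in for the summarization pipeline: it "summarizes"
    # by returning the first 20 characters of the input
    return [{"summary_text": text[:20]}]


if __name__ == "__main__":
    print(transform_fn(fake_pipeline, "A long document about media workflows.", "text/csv"))
```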

To deploy the endpoint with a custom entry point script, the script needs to be packaged with the model artifact. The following code adds the entry point script to the model.tar archive and compresses it with gzip to create the final model.tar.gz artifact.

! tar rvf model.tar code/
! gzip -c model.tar > model.tar.gz
! rm model.tar
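
If you prefer to stay in Python, the same packaging can be sketched with the standard library tarfile module. This is an alternative to the shell commands, not the method used in the accompanying notebook. The function name package_model and its arguments are illustrative assumptions. The layout mirrors the one described above: model and tokenizer files at the archive root, and the entry point script under code/.

```python
import tarfile
from pathlib import Path


def package_model(model_dir, code_dir, output_path="model.tar.gz"):
    """Bundle model files at the archive root and the entry point under code/."""
    with tarfile.open(output_path, "w:gz") as archive:
        for path in sorted(Path(model_dir).iterdir()):
            # Skip .git, .gitattributes, and other dotfiles from the cloned repo
            if not path.name.startswith("."):
                archive.add(str(path), arcname=path.name)
        # Add the directory holding the inference script as code/
        archive.add(str(code_dir), arcname="code")
    return output_path
```

Calling package_model("distilbart-cnn-12-6", "code") would write model.tar.gz in the current directory, assuming both directories exist.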

Next, we need to upload the model artifact to Amazon S3 and look up the appropriate managed inference container image. Amazon SageMaker Python SDK, an open-source library for training and deploying ML models on Amazon SageMaker, provides a helper function that we can use to look up the URI of the inference image on Amazon Elastic Container Registry (Amazon ECR), a fully managed container registry offering high-performance hosting. The following code gets the URI for the Hugging Face Transformers version 4.6.1 inference image with PyTorch 1.8.1 as the underlying deep learning framework.

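s3_model_data = sess.upload_data("model.tar.gz", bucket, key_prefix)
inference_image_uri = image_uris.retrieve(
    "huggingface", region=region, version="4.6.1", base_framework_version="pytorch1.8.1",
    py_version="py36", image_scope="inference", instance_type="ml.m5.xlarge",
)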

With the Amazon ECR image URI and the combined artifact, we have all of the requisite parts to create an Amazon SageMaker model resource.

sm_client = boto3.client("sagemaker")
smr_client = boto3.client("sagemaker-runtime")
model_name = "document-summarization"
create_model_response = sm_client.create_model(
    ModelName=model_name,
    ExecutionRoleArn=role,  # role is assumed to hold the SageMaker execution role ARN
    PrimaryContainer={
        "Image": inference_image_uri,
        "ModelDataUrl": s3_model_data,
        "Environment": {
            "SAGEMAKER_PROGRAM": "",
            "SAGEMAKER_REGION": region,
        },
    },
)
model_arn = create_model_response["ModelArn"]

Next, we deploy an endpoint configuration with the previously created model.

endpoint_config_name = f"{model_name}-config"
endpoint_name = f"{model_name}-endpoint"

endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "VariantName": "variant1",
            "ModelName": model_name,
            "InstanceType": "ml.m5.xlarge",
            "InitialInstanceCount": 1,
        }
    ],
    AsyncInferenceConfig={
        "OutputConfig": {
            "S3OutputPath": f"s3://{bucket}/{key_prefix}/async-output",
            # Optionally specify Amazon SNS topics
            # "NotificationConfig": {
            #   "SuccessTopic": "arn:aws:sns:us-east-2:123456789012:MyTopic",
            #   "ErrorTopic": "arn:aws:sns:us-east-2:123456789012:MyTopic",
            # }
        },
        "ClientConfig": {"MaxConcurrentInvocationsPerInstance": 4},
    },
)

Finally, the following code deploys the endpoint.

create_endpoint_response = sm_client.create_endpoint(
    EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name
)
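
Endpoint creation takes several minutes, and invocations fail until the endpoint is in service. One way to block until it is ready is the boto3 endpoint_in_service waiter, sketched below. The helper name wait_until_in_service and the delay and attempt values are assumptions for illustration.

```python
def wait_until_in_service(sm_client, endpoint_name, delay=30, max_attempts=60):
    """Block until the endpoint is InService, then return its status."""
    waiter = sm_client.get_waiter("endpoint_in_service")
    # Polls DescribeEndpoint every `delay` seconds, up to `max_attempts` times,
    # and raises a waiter error if the endpoint fails or the attempts run out
    waiter.wait(
        EndpointName=endpoint_name,
        WaiterConfig={"Delay": delay, "MaxAttempts": max_attempts},
    )
    return sm_client.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]


# Usage (needs AWS credentials, so it is left commented out):
# import boto3
# status = wait_until_in_service(boto3.client("sagemaker"), "document-summarization-endpoint")
```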

The endpoint can now be invoked by passing in the Amazon S3 location of a text document that you wish to summarize.

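# input_location is assumed to hold the S3 URI of the document to summarize
response = smr_client.invoke_endpoint_async(
    EndpointName=endpoint_name, InputLocation=input_location, ContentType="text/csv"
)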

After a few seconds, you can check the output of the invocation.

result = sess.read_s3_file(bucket=bucket, key_prefix="/".join(response["OutputLocation"].split("/")[3:]))
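
Because the result is written asynchronously, the output object might not exist yet when you first look for it. The following is a minimal polling sketch, assuming a boto3 S3 client and the OutputLocation value returned by the invocation. The helper names, delay, and attempt count are illustrative assumptions.

```python
import time
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split s3://bucket/key into (bucket, key)."""
    parsed = urlparse(uri)
    return parsed.netloc, parsed.path.lstrip("/")


def wait_for_output(s3_client, output_location, delay=5, max_attempts=60):
    """Poll S3 until the async result object exists, then return it as text."""
    bucket, key = split_s3_uri(output_location)
    for _ in range(max_attempts):
        try:
            body = s3_client.get_object(Bucket=bucket, Key=key)["Body"].read()
            return body.decode("utf-8")
        except s3_client.exceptions.NoSuchKey:
            # Result not written yet; wait and try again
            time.sleep(delay)
    raise TimeoutError(f"No output at {output_location} after {max_attempts} attempts")


# Usage (needs AWS credentials, so it is left commented out):
# import boto3
# print(wait_for_output(boto3.client("s3"), response["OutputLocation"]))
```

In a production workflow, the Amazon SNS notifications described earlier are a better fit than polling.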

Running the model on the text from the previously mentioned blog post announcing the Amazon SageMaker Hugging Face inference capabilities yields the following result.

{"summary_text": " Hugging Face is the technology startup, with an active open-source
 community, that drove the worldwide adoption of transformer-based models thanks to
 its eponymous Transformers library . We discuss different methods to create a
 SageMaker endpoint for a Hugging face model . We also discuss how to train and
 deploy models on Amazon SageMaker . Earlier this year, Huging Face and AWS
 collaborated to enable you to train 10,000 pre-trained models ."}


Conclusion

In this blog post, we covered different approaches for document summarization and how they can benefit M&E organizations. We walked through an approach for deploying a pretrained summarization model available on the Hugging Face Hub on an Amazon SageMaker asynchronous endpoint. We then tested the endpoint with sample documents to demonstrate the functionality. With this capability, M&E organizations can deploy state-of-the-art natural language models to perform document summarization.

About Hugging Face

At Hugging Face, our mission is to democratize good, state-of-the-art ML. We do this through our open source, our open science, and our products and services. Currently, Hugging Face has over 20,000 state-of-the-art transformer models and over 1,600 free and available datasets on our open-source and ML solution. Click here to learn more about Hugging Face.

James Yi

James Yi is a Sr. AI/ML Partner Solutions Architect in the Emerging Technologies team at Amazon Web Services. He is passionate about working with enterprise customers and partners to design, deploy, and scale AI/ML applications that deliver business value. Outside of work, he enjoys playing soccer, traveling, and spending time with his family.

Simon Zamarin

Simon Zamarin is an AI/ML Solutions Architect whose main focus is helping customers extract value from their data assets. In his spare time, Simon enjoys spending time with family, reading sci-fi, and working on various DIY house projects.