AWS for M&E Blog

Part 1: Build powerful news stories with historical context using machine learning


The expansive and ongoing market penetration of broadband services and mobile devices has triggered exponential demand for content across the globe. News content is no exception, with steady growth in the number of players in the news business. User-generated content (UGC) on social media platforms is also a strong disruptor. At times, breaking news appears on social media platforms before large news networks report it.

In an evolving content space, it’s imperative to stay ahead and establish new market trends through innovation. But how can a news network create differentiators and unique content propositions? How can a news network align its content quality with its fundamental principles of journalism? In this post, we will explore how Amazon Web Services (AWS) machine learning services such as Amazon SageMaker, Amazon Transcribe, Amazon Rekognition, and graph-based content storage such as Amazon Neptune can help to address such challenges.


Today, news content is in high demand. What can support high-quality journalism and effective newsroom research? Historical data and context are strong candidates. News content is more impactful when substantiated with deeper political, economic, and environmental context and supplementary illustrations.

Solution approach

News archives are typically preserved as unstructured content, be it audio, video, or text transcript. Established news broadcasters may have decades of such archives available. A reliable and scalable mechanism is needed to harness the power of this unstructured content. The conversion of unstructured content to a structured format is essential to blend and enrich new editorial content. A solution for conversion of content should be able to extract entities, establish relationships and context between entities, and align to a taxonomy for a given news category. Let’s explore how different services from AWS can be used to build such a solution.

An AWS architecture diagram showing the services used for news archive ingestion, machine learning based analysis, storage, and access. The archive is uploaded to an Amazon S3 bucket, and the upload event triggers an AWS Lambda function. Content processing uses Amazon Transcribe, Amazon Rekognition, Amazon Comprehend, and Amazon SageMaker for custom NLP models. The content graph is persisted in an Amazon Neptune graph database. The content is accessed through an AWS Lambda function and Amazon API Gateway, over REST APIs.

Figure 1: Technical architecture for news archive analysis

The previous architecture diagram has three logical modules:

  1. Archive Ingestion: Archived content relevant for analysis needs to be pushed to a designated Amazon Simple Storage Service (Amazon S3) bucket. The content upload event on the Amazon S3 bucket triggers an AWS Lambda function, which in turn triggers an AWS Batch job. The purpose of the batch job is to analyze the archived content and convert it to a structured format.
  2. Content Analysis: Archived content such as images, audio, and video is analyzed using machine learning.
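
The Archive Ingestion step in item 1 could be sketched roughly as follows. This is a minimal illustration, not the implementation behind this post: the job queue name, job definition name, and environment variable names are hypothetical placeholders, and the Lambda handler assumes boto3 and suitable IAM permissions are available at runtime.

```python
import re
from urllib.parse import unquote_plus

# Hypothetical names: replace with the queue and job definition in your account.
JOB_QUEUE = "news-archive-analysis-queue"
JOB_DEFINITION = "news-archive-analysis-job"


def build_batch_requests(event):
    """Turn an S3 upload notification into AWS Batch submit_job arguments."""
    requests = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        # Batch job names allow letters, digits, hyphens, and underscores.
        job_name = re.sub(r"[^A-Za-z0-9_-]", "-", key)[:128]
        requests.append({
            "jobName": job_name,
            "jobQueue": JOB_QUEUE,
            "jobDefinition": JOB_DEFINITION,
            "containerOverrides": {
                "environment": [
                    {"name": "ARCHIVE_BUCKET", "value": bucket},
                    {"name": "ARCHIVE_KEY", "value": key},
                ]
            },
        })
    return requests


def lambda_handler(event, context):
    import boto3  # imported lazily so the helper above can be tried offline

    batch = boto3.client("batch")
    job_ids = [
        batch.submit_job(**request)["jobId"]
        for request in build_batch_requests(event)
    ]
    return {"submittedJobs": job_ids}
```

Keeping the request-building logic separate from the AWS call makes it possible to check the mapping from upload event to batch job without touching an AWS account.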

Amazon Rekognition is used for video analysis to detect celebrities, human expressions and emotions, places of interest, breaking news overlay text, sponsor logos, and visuals depicting violence or firearms. Amazon Rekognition is a managed machine learning service that provides out-of-the-box access to such content metadata without any requirement for custom model training.
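
As a rough illustration of how such video metadata might be consumed, the sketch below condenses a label detection result into a per-label timeline. It assumes a dictionary in the shape returned by the Amazon Rekognition GetLabelDetection API; the sample response is hand-written for illustration and is not real service output, and the confidence threshold is an arbitrary assumption.

```python
from collections import defaultdict


def summarize_labels(response, min_confidence=80.0):
    """Collect, per label name, the timestamps (ms) at which it was detected."""
    timeline = defaultdict(list)
    for item in response.get("Labels", []):
        label = item["Label"]
        # Drop low-confidence detections; the threshold is an assumption.
        if label["Confidence"] >= min_confidence:
            timeline[label["Name"]].append(item["Timestamp"])
    return {name: sorted(stamps) for name, stamps in timeline.items()}


# Hand-written sample in the shape of a GetLabelDetection response.
sample_response = {
    "Labels": [
        {"Timestamp": 0, "Label": {"Name": "Podium", "Confidence": 97.1}},
        {"Timestamp": 500, "Label": {"Name": "Crowd", "Confidence": 62.4}},
        {"Timestamp": 1000, "Label": {"Name": "Podium", "Confidence": 91.8}},
    ]
}

label_timeline = summarize_labels(sample_response)
print(label_timeline)
```

A timeline of this kind can later be attached to the content graph as supplementary metadata for the corresponding archive asset.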

Amazon Transcribe is used for audio analysis of archived assets and generates transcribed text. Amazon Transcribe is a managed machine learning service that produces a text transcript out-of-the-box. Moreover, if the news content is specific to a domain such as finance or sports, a custom language model can be trained within Amazon Transcribe to improve overall transcription accuracy. Once the transcript is generated, it runs through a series of machine learning services to obtain structured metadata, with labeled entities, associated relationships, context, and additional enrichment information.
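
A transcription job for an archived asset might be started along these lines. This is a sketch under stated assumptions: the job name, bucket names, and custom language model name are hypothetical, and the actual call requires boto3 and AWS credentials.

```python
def build_transcription_request(job_name, media_uri, output_bucket,
                                language_code="en-US", language_model=None):
    """Build keyword arguments for the Transcribe start_transcription_job call."""
    request = {
        "TranscriptionJobName": job_name,
        "Media": {"MediaFileUri": media_uri},
        "LanguageCode": language_code,
        "OutputBucketName": output_bucket,
    }
    if language_model:
        # Optional custom language model trained for a domain such as finance.
        request["ModelSettings"] = {"LanguageModelName": language_model}
    return request


def start_transcription(request):
    import boto3  # imported lazily so the builder above can be tried offline

    transcribe = boto3.client("transcribe")
    return transcribe.start_transcription_job(**request)


# Hypothetical asset and model names, for illustration only.
finance_request = build_transcription_request(
    job_name="evening-news-2001-03-14",
    media_uri="s3://example-news-archive/2001/evening-news.mp4",
    output_bucket="example-news-transcripts",
    language_model="example-finance-news-model",
)
```

When no custom language model is supplied, the request omits ModelSettings and the general-purpose model is used.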

A flowchart explaining how a text transcript goes through topic modeling and sentence clustering. After segregation, the sentences are split and rephrased into smaller sentences for better analysis. These reduced sentences are used for entity recognition and parts-of-speech detection. The tuples generated are enriched with URLs to visual assets and finally persisted in a graph database.

Figure 2: Workflow of audio transcript analysis with machine learning

The previous diagram outlines the different machine learning services that generate structured information from the original transcript and video. Following is a synopsis of key steps:

  • Amazon Comprehend is a managed machine learning service used to extract important entities such as organization, place, person, etc. It is also used to capture the overall sentiment of the text, be it positive, negative, neutral, or mixed. Optionally, custom entity recognition may be used to detect entities for a niche domain, if that is of relevance to the business.
  • Amazon Comprehend is again used for topic modeling on the corpus of audio transcripts. News programs may address multiple topics in a single show. Machine learning is used to approximately split the content into distinct topics. Amazon SageMaker is used to host custom model inferences for sentence-level clustering. Topic modeling and sentence clustering need a further human-in-the-loop step to review, validate, and meaningfully label the topics and sentences.
  • Sentences spoken by journalists are often long and complex. It can be difficult to determine the subject, object, and connecting verbs of such complex sentences using natural language processing (NLP) models. Split-and-Rephrase models exposed through Amazon SageMaker inferencing services are used to split large and complex sentences into smaller ones. Such smaller sentences provide better results during NLP parts-of-speech detection. Let us understand this concept with an example.

Original sentence – “A and B are stunned by the expenses involved with a wedding. They decide they will get married in City Hall and give the money to charity.”

Rephrased sentence – “A and B are stunned. They are stunned by the expenses involved with a wedding. They decide they will get married in City Hall. They will then give the money to charity.”

Once we analyze the parts of speech for the sentence “They decide they will get married in City Hall”, we get the following output:

{“subject”: “They”}, {“object”: “City Hall”}, {“relationship”: “will get married”}.
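
To make the shape of this output concrete, the sketch below folds the parts-of-speech result into a single subject-relationship-object tuple. The Triple type and the to_triple helper are hypothetical names introduced for illustration; they are not part of any AWS service.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    relationship: str
    obj: str


def to_triple(parts):
    """Merge a list of single-key dictionaries into one Triple."""
    merged = {}
    for part in parts:
        merged.update(part)
    return Triple(merged["subject"], merged["relationship"], merged["object"])


# The parts-of-speech output for the example sentence above.
parts = [
    {"subject": "They"},
    {"object": "City Hall"},
    {"relationship": "will get married"},
]

triple = to_triple(parts)
print(triple)
```

In practice, a pronoun such as "They" would still need to be resolved to the entities it refers to before the tuple is useful as a graph edge.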

  • Once the unstructured news text is analyzed into a subject-relationship-object tuple, it is stored in a graph database. Amazon Neptune is used for this storage. Amazon DynamoDB is used to store additional metadata or annotation associated with the content, such as visuals and sentiment analysis. The following diagram provides a fictitious example of how news archive content is analyzed and stored in a graph structure using machine learning.

A sample content graph, indicating how the result looks after analysis. In particular, it outlines a person's political career, spanning association with different political parties, being in the news as a senator, governor, or Vice President, associated controversies, tweets, etc.

Figure 3: Representative news content graph
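
The idea behind such a content graph can be tried out with a small in-memory stand-in. The sketch below is not Amazon Neptune or Amazon DynamoDB code: a dictionary of edges plays the role of the graph, a second dictionary plays the role of the metadata table, and all entity names and the clip URL are fictitious.

```python
from collections import defaultdict


class ContentGraph:
    """In-memory stand-in for a graph database plus a metadata table."""

    def __init__(self):
        self._edges = defaultdict(list)  # subject -> [(relationship, object)]
        self.metadata = {}               # (subject, relationship, object) -> annotations

    def add(self, subject, relationship, obj, **annotations):
        self._edges[subject].append((relationship, obj))
        if annotations:
            self.metadata[(subject, relationship, obj)] = annotations

    def neighbors(self, subject, relationship=None):
        """Return objects linked to subject, optionally filtered by relationship."""
        return [
            obj
            for rel, obj in self._edges.get(subject, [])
            if relationship is None or rel == relationship
        ]


graph = ContentGraph()
graph.add("Person A", "member of", "Party X", sentiment="NEUTRAL")
graph.add("Person A", "elected as", "Senator",
          video_url="s3://example-bucket/clip-001.mp4")
graph.add("Person A", "elected as", "Governor")

print(graph.neighbors("Person A", "elected as"))
```

A graph database adds what this stand-in lacks: persistence, indexing, and multi-hop traversal queries across a large number of entities.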

  3. User Interface: The user interface is implemented with Amazon API Gateway, which communicates with a set of AWS Lambda functions. The AWS Lambda functions query and traverse the graph database based on the searched entities or relationships of interest. The query response is further enriched by supplementary metadata stored in Amazon DynamoDB.
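
A Lambda function behind an Amazon API Gateway proxy integration might look roughly like this. The sketch assumes query parameters named entity and relationship (hypothetical), and it uses small in-memory dictionaries as stand-ins for the graph database and the metadata table so that it can be run without an AWS account.

```python
import json

# Stand-ins for the graph and metadata stores (fictitious sample data).
GRAPH = {"Person A": [("member of", "Party X"), ("elected as", "Senator")]}
METADATA = {("Person A", "elected as", "Senator"): {"sentiment": "POSITIVE"}}


def query_graph(entity, relationship=None):
    """Return (subject, relationship, object) tuples for the given entity."""
    return [
        (entity, rel, obj)
        for rel, obj in GRAPH.get(entity, [])
        if relationship is None or rel == relationship
    ]


def lambda_handler(event, context):
    # queryStringParameters is None when the request has no query string.
    params = event.get("queryStringParameters") or {}
    entity = params.get("entity")
    if not entity:
        return {"statusCode": 400,
                "body": json.dumps({"error": "entity is required"})}

    results = []
    for subject, rel, obj in query_graph(entity, params.get("relationship")):
        results.append({
            "subject": subject,
            "relationship": rel,
            "object": obj,
            # Enrich each edge with supplementary metadata, if any exists.
            "metadata": METADATA.get((subject, rel, obj), {}),
        })
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(results),
    }
```

In a real deployment, query_graph would issue a traversal against the graph database and the metadata lookup would read from the metadata table; both are left as stand-ins here.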

The art of the possible and additional use cases

Graph-style storage of news archives is a foundational investment for any news business. It can be extended to serve multiple use cases as the depth and breadth of the graph increase with time. The following outlines potential scenarios:

  1. While news archives are analyzed and stored in a graph database, a domain-specific content model develops. Each genre of news, such as politics or sports, has its own content model definition. The content model may also vary depending on the style of journalism and news production for a given business. This content model becomes a baseline that can be optionally used to measure drifts in the quality of journalism. For example: over a period of time, has the amount of onsite video feeds increased drastically within news programs? Has the frequency of seeking opinions from experts significantly fallen? Similarly, has the tonality of the journalism been neutral, positive, or negative over the time frame?
  2. Measure the network’s news coverage perspective on a “T-Scale”. If we consider social media platforms or news wires as master sources, how efficiently did a network touch upon all the important topics and issues prevailing in the world today? Similarly, is the desired level of in-depth coverage being achieved for the topics that channel stakeholders care about?
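
One possible reading of the coverage scenario in item 2 is sketched below: breadth is the share of master-source topics that the network covered at all, and depth is checked only for the topics stakeholders prioritize. The metric definitions, the depth threshold, and the sample topics are all assumptions made for illustration.

```python
def coverage_report(master_topics, network_story_counts, priority_topics,
                    min_depth=3):
    """Summarize breadth and depth of coverage against a master topic list."""
    covered = {t for t in master_topics if network_story_counts.get(t, 0) > 0}
    breadth = len(covered) / len(master_topics) if master_topics else 0.0
    # Priority topics with fewer stories than the (assumed) depth threshold.
    shallow = sorted(
        t for t in priority_topics
        if network_story_counts.get(t, 0) < min_depth
    )
    return {
        "breadth": breadth,
        "missed": sorted(set(master_topics) - covered),
        "shallow_priority_topics": shallow,
    }


# Fictitious sample data.
master = {"elections", "floods", "budget", "sports final"}
story_counts = {"elections": 5, "budget": 1, "sports final": 2}
priority = {"elections", "budget"}

report = coverage_report(master, story_counts, priority)
print(report)
```

Here the story counts per topic would come from the content graph, for example by counting the archive assets linked to each topic node.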


This post outlines a ‘reimagination’ for today’s news business and its stakeholders. It covers how archive content can be processed using machine learning and stored in graph databases, for easy retrieval and compelling use cases for next-gen news stories. This reduces the time and effort newsroom editors spend as they research and build news stories of interest for their audience. In the next post, we will deep dive into the implementation of such use cases, with sample code snippets. We believe this will speed up enterprise adoption. Stay tuned!

Neha Garg

Neha Garg is an Enterprise Solutions Architect focused on helping customers build scalable, modern, and cost-effective solutions on AWS.

Punyabrota Dasgupta

Punyabrota Dasgupta is a principal solutions architect at AWS. His areas of expertise include machine learning applications for the media and entertainment business. Beyond work, he loves tinkering with and restoring antique electronic appliances.