AWS Database Blog
Build a knowledge graph on Amazon Neptune with AI-powered video analysis using Media2Cloud
A knowledge graph allows us to combine data from different sources to gain a better understanding of a specific problem domain. In this post, we use Amazon Neptune (a managed graph database service) to create a knowledge graph about technology products. In addition to the data we already have in the graph, we add the content of technology news videos. We use the Media2Cloud on AWS solution to perform AI-powered analysis of the videos, detecting faces, celebrities, entities, and key phrases observed in the videos, and we add those findings to the graph to discover new and interesting relationships.
Overview of solution
The following diagram describes our solution architecture.
The workflow steps are as follows:
- We create a Media2Cloud (M2C) application and upload technology videos to it. M2C uses the AI services Amazon Rekognition, Amazon Transcribe, and Amazon Comprehend to detect faces, celebrities, labels, text, entities, and key phrases in the videos. M2C saves a summary of its results to tables in Amazon DynamoDB. M2C also saves detailed results to a bucket in Amazon Simple Storage Service (Amazon S3). M2C uses additional services, described in Media2Cloud on AWS.
- We create a Neptune cluster and seed it with initial technology product data.
- Using a notebook in Neptune Workbench, we transform the M2C results to Resource Description Framework (RDF). RDF is a graph data representation supported by Neptune.
- Using the notebook, we ingest the RDF data into the Neptune database. We query the Neptune database to discover new relationships. Many of the M2C objects, notably celebrities, have RDF data in public sources, such as Wikidata. In this post, we show how to run federated queries to combine data from both Neptune and Wikidata.
Let’s now step through the solution, provisioning a Neptune cluster, an M2C application, and a notebook in your AWS account.
Prerequisites
To run this example, you need an AWS account with permission to create resources such as a Neptune cluster and M2C.
Running this example incurs charges. M2C has a one-time cost for ingest and analysis processing, plus recurring costs for S3 storage, portal use, and search engine access. Refer to the M2C cost guide. Neptune costs include on-demand instances, storage and I/O, and the workbench. S3 costs apply to the staging bucket that we create for this demo.
You will be provisioning resources in one AWS Region. You must choose a Region supported by both Neptune and M2C.
Set up M2C
Create M2C by launching its AWS CloudFormation stack. For instructions, refer to Launch the stack. When it’s complete, you receive an email (to the address you specified as a parameter to the stack) with instructions to log in to the M2C portal. Follow the email instructions to confirm that you can access the portal.
Additionally, obtain the name of the S3 analysis bucket that M2C creates. On the AWS CloudFormation console, find the stack whose name contains CoreStack. On the Outputs tab of that stack, copy the value of the ProxyBucket output.
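If you prefer to look up that output programmatically, the following is a minimal sketch using the AWS SDK for Python (Boto3); the stack name shown is hypothetical, so substitute the name of your stack that contains CoreStack.

```python
# A minimal sketch: read the ProxyBucket output from the M2C core stack with
# Boto3. The stack name below is hypothetical -- use your own CoreStack name.
import boto3

cfn = boto3.client("cloudformation")

def get_stack_output(stack_name, output_key):
    """Return the value of a named output from a CloudFormation stack."""
    stack = cfn.describe_stacks(StackName=stack_name)["Stacks"][0]
    for output in stack.get("Outputs", []):
        if output["OutputKey"] == output_key:
            return output["OutputValue"]
    raise KeyError(f"Output {output_key} not found on stack {stack_name}")

analysis_bucket = get_stack_output("m2c-CoreStack-EXAMPLE", "ProxyBucket")
print(analysis_bucket)
```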
Upload videos to M2C
Your first task is to find videos about technology products. If you have your own videos, you may use them. You may also obtain videos from a third party, complying with their usage rules.
- To download the videos used in this post, run the following script on the machine from which you access the M2C portal:
If you don’t have wget, you can open each link in your browser and save or download a copy of the MP4 file to your local drive.
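The download script isn't reproduced here; as a stand-in, here is a minimal Python sketch that does the same job. The URLs are placeholders that you should replace with the actual links to the videos you chose.

```python
# A minimal download sketch using only the Python standard library.
# The URLs below are placeholders -- replace them with the actual video links.
import urllib.request

video_urls = [
    "https://example.com/videos/tech-news-1.mp4",  # placeholder URL
    "https://example.com/videos/tech-news-2.mp4",  # placeholder URL
]

for url in video_urls:
    filename = url.rsplit("/", 1)[-1]
    print(f"Downloading {url} -> {filename}")
    urllib.request.urlretrieve(url, filename)
```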
Now upload these files to M2C and wait for the analysis to complete.
- On the Upload tab of the portal, drag the files in as directed and then choose Quick upload.
- When prompted, start the upload by choosing Start now.
- When the uploads are complete, choose Done.
Next, M2C runs its analysis, which takes several minutes. Monitor the Processing tab for progress.
When processing is complete, choose the Stats tab to review a summary of findings. The Top known faces chart indicates that celebrities such as Steve Jobs and Elon Musk were recognized in the videos. Notice also the Top labels (accessories and airport, for example) and Top key phrases (Apple and Newton, for example) charts.
Set up a Neptune cluster and notebook
In the same Region in which you installed M2C, create Neptune resources using AWS CloudFormation. First, download a copy of the CloudFormation template. Then complete the following steps:
- On the AWS CloudFormation console, choose Create stack.
- Choose With new resources (standard).
- Select Upload a template file.
- Select Choose file to upload the local copy of the template that you downloaded. The name of the file is NepM2C.yaml.
- Choose Next.
- Enter a stack name of your choosing.
- In the Parameters section, enter a value for M2CAnalysisBucket. Use the value collected after setting up M2C. Use defaults for the remaining parameters.
- Choose Next.
- Continue through the remaining sections.
- Read and select the check boxes in the Capabilities section.
- Choose Create stack.
- When the stack is complete, navigate to the Outputs section and follow the link for the output NeptuneSagemakerNotebook. This opens the Jupyter files view in your browser.
- In Jupyter, select M2CForKnowledgeGraph.ipynb to open the notebook that you use in the remaining steps.
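As an alternative to the console steps above, you can create the stack programmatically. The following is a minimal Boto3 sketch; the stack name is arbitrary, and the capabilities list assumes the template creates IAM resources (as the Capabilities check boxes suggest).

```python
# A sketch of creating the demo stack with Boto3 instead of the console.
# The stack name is arbitrary; the parameter value is the analysis bucket
# name you collected after setting up M2C.
import boto3

cfn = boto3.client("cloudformation")

with open("NepM2C.yaml") as f:
    template_body = f.read()

cfn.create_stack(
    StackName="neptune-m2c-demo",  # any name you choose
    TemplateBody=template_body,
    Parameters=[{
        "ParameterKey": "M2CAnalysisBucket",
        "ParameterValue": "YOUR-M2C-ANALYSIS-BUCKET",  # ProxyBucket value from M2C
    }],
    Capabilities=["CAPABILITY_IAM", "CAPABILITY_NAMED_IAM"],  # assumes IAM resources
)

# Block until the stack finishes creating, then read its outputs.
cfn.get_waiter("stack_create_complete").wait(StackName="neptune-m2c-demo")
```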
In addition to creating a Neptune cluster and notebook instance, our CloudFormation stack also creates a staging bucket to hold RDF data to be bulk-loaded to the Neptune database.
We encourage you to review the stack with your security team prior to using it in a production environment.
Seed the cluster
In M2CForKnowledgeGraph.ipynb, complete the following steps (shown in the following screenshot):
- Run the cell under the label Extract names of S3 buckets from environment to extract from environment variables the names of the M2C analysis bucket and a staging bucket that we use for data to be bulk-loaded to the Neptune database.
- Run the cell under the label Create local folder for analysis results to create a folder on the notebook instance to build RDF analysis data prior to uploading it to the Neptune database.
- Run the cell under the label Bulk-load to Neptune from S3 to load seed data from the staging bucket to the Neptune database. Wait for a status of LOAD COMPLETE. (The loader call behind this cell is sketched after this list.)
- Under Check status of load, run the code cell to check the load status and verify that there were no errors or duplicates.
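For reference, the bulk load that these cells perform uses the Neptune bulk loader HTTP API. The following is a minimal sketch of that call; the endpoint, staging bucket, IAM role ARN, and Region are placeholders, and the request must originate from somewhere with network access to the cluster (such as the notebook instance).

```python
# A sketch of starting a Neptune bulk load by calling the loader endpoint
# directly. All values below are placeholders; the notebook derives the real
# ones from its environment.
import json
import urllib.request

NEPTUNE_ENDPOINT = "https://your-cluster.cluster-xxxx.us-east-1.neptune.amazonaws.com:8182"

payload = {
    "source": "s3://YOUR-STAGING-BUCKET/seed/",  # staging bucket holding RDF files
    "format": "turtle",                          # the seed data is Turtle
    "iamRoleArn": "arn:aws:iam::123456789012:role/NeptuneLoadFromS3",
    "region": "us-east-1",
}

req = urllib.request.Request(
    f"{NEPTUNE_ENDPOINT}/loader",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    load_id = json.loads(resp.read())["payload"]["loadId"]
print("loadId:", load_id)  # keep this to check status later
```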
Explore seed data
In the notebook, run the cells under the heading Query the seed data. There are three SPARQL queries:
- The first query, under Persons and role in org, finds people and their roles in their respective organizations.
- The second query, under Federated query of person, queries the Neptune database for the Wikidata URI of a specific person, then queries Wikidata itself to bring back RDF data of that person from the Wikidata SPARQL endpoint. This query requires that the Neptune cluster have internet connectivity to Wikidata’s public endpoint. If you use an existing cluster that does not have internet connectivity, you may skip this query. Refer to Neptune’s documentation on SPARQL federation. (A sketch of this query shape follows this list.)
- The third query, under Orgs and products, queries the Neptune database for organizations as well as products that are produced by that organization.
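The federated query uses the SPARQL 1.1 SERVICE keyword to reach out to Wikidata. The following Python snippet holds a sketch of that query shape; the ex: namespace is an illustrative placeholder rather than the notebook's exact ontology, and in the workbench you would run the query in a %%sparql cell.

```python
# A sketch of a federated SPARQL query: find a person's Wikidata URI in
# Neptune, then fetch the English label from the Wikidata SPARQL endpoint.
# The ex: namespace is a placeholder, not the notebook's actual ontology.
federated_query = """
PREFIX ex: <http://example.org/kg/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?person ?wikidataRef ?label WHERE {
    ?person a ex:Person ;
            ex:hasWikidataRef ?wikidataRef .
    SERVICE <https://query.wikidata.org/sparql> {
        ?wikidataRef rdfs:label ?label .
        FILTER (lang(?label) = "en")
    }
}
LIMIT 10
"""
print(federated_query)
```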
Convert M2C analysis to RDF
After you uploaded videos to M2C, you observed a summary view of the analysis results in the M2C portal. The full analysis results are written to the M2C analysis bucket in Amazon S3. The notebook has code to read that data and transform it into RDF form so that it can be combined with the seed data.
Under Now add in the M2C analysis in the notebook, run the first four code cells:
- Run the code cell under Boto3 helpers to bring the video files from S3. This cell defines Python functions to retrieve data from the analysis bucket. It uses the AWS SDK for Python (Boto3).
- Run the code cell under Install RDFLib. RDFLib is a Python library to programmatically build RDF data.
- Run the code cell under RDFLib helpers to build the triples for M2C video analysis. This cell defines Python functions to build RDF triples representing video analysis.
- Run the code cell under For each video in analysis bucket, build triples and write to RDF file on the notebook instance. This cell calls functions defined previously to find the analysis of each uploaded video and save the results in RDF form. (The triple-building pattern is sketched after this list.)
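The following is a minimal sketch of the triple-building pattern these helper cells follow with RDFLib; the namespace and property names mirror the data model described next, but the notebook's exact URIs may differ.

```python
# A sketch of building RDF triples for one video analysis with RDFLib and
# serializing them to Turtle for bulk load. The ex: namespace is a placeholder.
from rdflib import Graph, Literal, Namespace, RDF
from rdflib.namespace import XSD

EX = Namespace("http://example.org/m2c/")

g = Graph()
g.bind("ex", EX)

video = EX["video-1234"]  # one VideoAnalysis instance
g.add((video, RDF.type, EX.VideoAnalysis))
g.add((video, EX.filename, Literal("apple_newton.mp4")))
g.add((video, EX.extractedKeyphrase, Literal("Apple")))
g.add((video, EX.personCount, Literal(3, datatype=XSD.integer)))

# Write Turtle to the local folder created earlier, ready for upload to S3.
g.serialize(destination="video-1234.ttl", format="turtle")
```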
The following figure shows the data model for our RDF data.
Seed data is shown on the left in the shaded boxes. There are classes for Person, Role, Organization, Product, and RoleType. The first four classes have Wikidata references via the hasWikidataRef object property (shown as an arrow). There are object properties among the classes. For example, hasRole relates Person to Role, hasRoleOrg relates Role to Organization, and producedBy relates Product to Organization.
Analysis results are shown on the right side in the unshaded boxes. The main class is VideoAnalysis, which collects the M2C analysis results. Among its properties are the M2C analysis id and the MP4 filename, as well as several findings from Amazon Comprehend (indicated by (C) in the figure) and Amazon Rekognition (indicated by (R)). Amazon Comprehend findings are the following:
- Keyphrases, as the data type property extractedKeyphrase of VideoAnalysis. There can be many extracted keyphrases.
- Sentiment, as the sentimentCount data type properties of VideoAnalysis. There are four counts, for positive, negative, neutral, and mixed sentiments. The values indicate the frequency of each sentiment over the course of the video.
- Extracted entities, which are broken into a separate class, Entity, linked to VideoAnalysis via the hasExtractedEntity object property. There can be many extracted entities.
Amazon Rekognition findings are the following:
- Observed text, as the data type property observedText in VideoAnalysis. There can be many values.
- A count of observed persons, as the personCount data type property in VideoAnalysis. The count is sufficient for graph purposes. Full Amazon Rekognition analysis of each person is available in the analysis bucket if needed.
- Face analysis, expressed in the Emotion class, which relates to VideoAnalysis via the hasEmotion object property. An emotion has a subtype (for example, HAPPY or SAD) and a count, indicating how frequently that emotion was observed. Additionally, the emotion is associated with a gender via hasGender. Therefore, counts of emotions are broken down by gender.
- Celebrity analysis, which is modeled as follows. First, the Celebrity class captures the details of the celebrity, including a hasWikidataRef object property indicating the celebrity’s URI in Wikidata. Second, VideoAnalysis links to the celebrity via a hasCelebrityAppearance relationship to Appearance. Appearance tracks the duration the celebrity appears in the video (the appearance data type property). It links to the celebrity via the hasAppearanceSubject object property.
- Label analysis. There is a Label class. VideoAnalysis links to it via a hasLabelAppearance object property to Appearance. Amazon Rekognition has a label taxonomy. If it finds a specific label, it also provides parent terms in the taxonomy. For example, if it finds the label Mobile Phone, it also reports the parent term Electronics. In our RDF representation, we model the label as a SKOS concept and the parent relationship using the SKOS broader relationship type. (A sketch of this SKOS modeling follows this list.)
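As a concrete illustration of the label modeling, here is a short RDFLib sketch (the ex: namespace is a placeholder; the notebook's actual URIs may differ):

```python
# A sketch of modeling Rekognition labels as SKOS concepts, with the taxonomy
# parent expressed as skos:broader. The ex: namespace is a placeholder.
from rdflib import Graph, Literal, Namespace, RDF
from rdflib.namespace import SKOS

EX = Namespace("http://example.org/m2c/")

g = Graph()
g.bind("ex", EX)
g.bind("skos", SKOS)

mobile_phone = EX["label/MobilePhone"]
electronics = EX["label/Electronics"]

# Each label Rekognition reports becomes a SKOS concept ...
g.add((mobile_phone, RDF.type, SKOS.Concept))
g.add((mobile_phone, SKOS.prefLabel, Literal("Mobile Phone")))
g.add((electronics, RDF.type, SKOS.Concept))
g.add((electronics, SKOS.prefLabel, Literal("Electronics")))

# ... and the taxonomy parent is linked with skos:broader.
g.add((mobile_phone, SKOS.broader, electronics))

print(g.serialize(format="turtle"))
```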
View the full ontology, as well as seed data, in seeddata.ttl from our GitHub repo.
Ingest RDF
In the notebook, continue from where you left off by running the next three code cells.
Specifically, complete the following steps:
- Run Upload RDF files to S3 to add the M2C RDF data to the staging bucket so that we may bulk load it to the Neptune database.
- Run Bulk-load these to Neptune to perform the load.
- Run Check load status to verify that the load has no errors. (The underlying status call is sketched below.)
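For reference, the status check polls the Neptune loader endpoint with the loadId returned when the load started. A minimal sketch follows; the endpoint and loadId are placeholders.

```python
# A sketch of checking a Neptune bulk load's status via the loader endpoint.
# Endpoint and loadId are placeholders.
import json
import urllib.request

NEPTUNE_ENDPOINT = "https://your-cluster.cluster-xxxx.us-east-1.neptune.amazonaws.com:8182"
load_id = "YOUR-LOAD-ID"

url = f"{NEPTUNE_ENDPOINT}/loader/{load_id}?details=true&errors=true"
with urllib.request.urlopen(url) as resp:
    status = json.loads(resp.read())

# LOAD_COMPLETED with no reported errors means the ingest succeeded.
print(status["payload"]["overallStatus"]["status"])
```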
Explore video analysis RDF data
The data is now loaded into the Neptune database. To begin querying the video analysis RDF data, run the cell under Summary of videos. There are five videos with varying person counts and sentiments.
Run the cell under Show celeb appearances to show names and Wikidata URIs of celebrities in the videos.
Find links between seed and video analysis data
To find links in the data, complete the following steps:
- Run the cell under LINK: Tie celebs in videos to persons in seed, matching on Wikidata ref. The query joins people and celebrities on the Wikidata URI. For our demonstration, it finds three celebrities from the videos who are also people from our seed data. (The shape of this query is sketched after this list.)
- Run the cell under LINK: Tie extracted entities to persons, org, products. This query compares Amazon Comprehend extracted entities from video analysis to the names of people, organizations, and products in the seed data. It finds instances of all three.
- Run the cell under LINK: Tie any text in video analysis to seed object. This broadens the previous search by comparing extracted entities, extracted keyphrases, observed text, and labels to seed names. There are numerous matches.
- Finally, run the cell under LINK: Match on extracted entity and show what else is there. This query matches a person, product, or organization from an extracted entity in a video, and then lists other extracted entities in that video. The first result, for example, shows that the video matching Jony Ive also mentions John Sculley, who is also a person from the seed data. The video relates the two people.
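To make the join concrete, the following is a sketch of the shape of the first LINK query, matching seed persons to video celebrities on their shared Wikidata URI. The ex: namespace and the name property are illustrative placeholders; the class and object property names follow the data model described earlier.

```python
# A sketch of joining seed persons and video celebrities on hasWikidataRef.
# The ex: namespace and ex:name are placeholders; hasWikidataRef,
# hasCelebrityAppearance, and hasAppearanceSubject follow the data model above.
link_query = """
PREFIX ex: <http://example.org/kg/>

SELECT ?personName ?video WHERE {
    ?person a ex:Person ;
            ex:name ?personName ;
            ex:hasWikidataRef ?ref .
    ?celeb a ex:Celebrity ;
           ex:hasWikidataRef ?ref .
    ?video ex:hasCelebrityAppearance ?appearance .
    ?appearance ex:hasAppearanceSubject ?celeb .
}
"""
print(link_query)  # run in a %%sparql cell in the workbench
```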
Clean up
If you’re done with the solution and wish to avoid future charges, delete the M2C and Neptune stacks. The S3 staging bucket is deleted as well. The M2C buckets are retained intentionally but can be removed manually.
Conclusion
In this post, we showed how to build an RDF knowledge graph with video analysis from Media2Cloud. We combined video analysis with our own product dataset, plus additional data about products and celebrities federated from Wikidata. We demonstrated SPARQL queries to bring this data together.
To learn more about how to model AI analysis of unstructured content for a Neptune database, refer to Building a knowledge graph in Amazon Neptune using Amazon Comprehend Events and Supercharge your knowledge graph using Amazon Neptune, Amazon Comprehend, and Amazon Lex.
About the Author
Mike Havey is a Senior Solutions Architect for AWS with over 25 years of experience building enterprise applications. Mike is the author of two books and numerous articles – read them on his author page.