AWS Database Blog

Build a knowledge graph on Amazon Neptune with AI-powered video analysis using Media2Cloud

A knowledge graph allows us to combine data from different sources to gain a better understanding of a specific problem domain. In this post, we use Amazon Neptune (a managed graph database service) to create a knowledge graph about technology products. In addition to the data we already have in the graph, we add the content of technology news videos. We use the Media2Cloud on AWS solution to perform AI-powered analysis of the videos to report faces, celebrities, entities, and key phrases observed in the video, and add those findings into the graph to discover new and interesting relationships.

Overview of solution

The following diagram describes our solution architecture.

Overview of solution

The workflow steps are as follows:

  1. We create a Media2Cloud (M2C) application and upload technology videos to it. M2C uses the AI services Amazon Rekognition, Amazon Transcribe, and Amazon Comprehend to detect faces, celebrities, labels, entities, and key phrases in the videos. M2C saves a summary of its results to tables in Amazon DynamoDB. M2C also saves detailed results to a bucket in Amazon Simple Storage Service (Amazon S3). M2C uses additional services, described in Media2Cloud on AWS.
  2. We create a Neptune cluster and seed it with initial technology product data.
  3. Using a notebook in Neptune Workbench, we transform the M2C results to Resource Description Framework (RDF). RDF is a graph data representation supported by Neptune.
  4. Using the notebook, we ingest the RDF data into the Neptune database. We query the Neptune database to discover new relationships. Many of the M2C objects, notably celebrities, have RDF data in public sources, such as Wikidata. In this post, we show how to run federated queries to combine data from both Neptune and Wikidata.

Let’s now step through the solution, provisioning a Neptune cluster, an M2C application, and a notebook in your AWS account.

Prerequisites

To run this example, you need an AWS account with permission to create resources such as a Neptune cluster and M2C.

Running this example incurs charges. M2C has a one-time cost for ingest and analysis processing, plus recurring costs for S3 storage, portal use, and search engine access. Refer to the M2C cost guide. Neptune costs include on-demand instances, storage and I/O, and the workbench. S3 costs apply to the staging bucket that we create for this demo.

You will be provisioning resources in one AWS Region. You must choose a Region supported by both Neptune and M2C.

Set up M2C

Create M2C by launching its AWS CloudFormation stack. For instructions, refer to Launch the stack. When it’s complete, you receive an email (to the address you specified as a parameter to the stack) with instructions to log in to the M2C portal. Follow the email instructions to confirm that you can access the portal.

Additionally, obtain the name of the S3 analysis bucket that M2C creates. On the AWS CloudFormation console, find the stack whose name contains CoreStack. On the Outputs tab of that stack, copy the value of the output ProxyBucket.
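If you prefer to script this lookup, you can also read the stack output with the AWS SDK for Python (Boto3). The following is a minimal sketch; matching on CoreStack in the stack name follows the naming described above and may need adjusting to your deployment.

    import boto3

    # Find the M2C stack whose name contains CoreStack and read its
    # ProxyBucket output. Assumes credentials and the Region where
    # Media2Cloud is deployed are already configured.
    cfn = boto3.client("cloudformation")

    analysis_bucket = None
    for page in cfn.get_paginator("describe_stacks").paginate():
        for stack in page["Stacks"]:
            if "CoreStack" not in stack["StackName"]:
                continue
            for output in stack.get("Outputs", []):
                if output["OutputKey"] == "ProxyBucket":
                    analysis_bucket = output["OutputValue"]

    print("M2C analysis bucket:", analysis_bucket)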

Upload videos to M2C

Your first task is to find videos about technology products. If you have your own videos, you may use them. You may also obtain videos from a third party, complying with their usage rules.

  1. To download the videos used in this post, run the following script on the machine from which you access the M2C portal:
    wget https://cnet.redvideo.io/2023/01/05/addd2ee6-7f76-46e1-a79b-87410dc50f68/ring-car-cam-final_360h700k.mp4
    wget https://cnet.redvideo.io/2023/03/09/0083c301-d1d5-4c36-ae68-36e5de995249/omt-ep12-yellow-finalcnet_720h3200k.mp4
    wget https://cnet.redvideo.io/2022/08/17/a6e0564f-c4a4-4821-b909-9dca57ba3d1c/motoedge-fl-1_720h3200k.mp4
    wget https://cnet.redvideo.io/2023/03/03/ad8423f6-4a6c-4fa1-9591-6f0b8aab641b/omt-ep11-newton-finalcnet_720h3200k.mp4
    wget https://cnet.redvideo.io/2023/03/01/b5af4e52-424c-4c54-bc15-ec15a10498b0/230301-teslaelectricboatsplanes_720h3200k.mp4

    If you don’t have wget, you can open each link in your browser and save or download a copy of the MP4 file to your local drive.

    Now upload these files to M2C and wait for the analysis to complete.

  2. On the Upload tab of the portal, drag the files in as directed and then choose Quick upload.
    M2C upload
  3. When prompted, start the upload by choosing Start now.
    Start M2C
  4. When the uploads are complete, choose Done.
    M2C Done

Next, M2C runs its analysis, which takes several minutes. Monitor the Processing tab for progress.

M2C processing

When processing is complete, choose the Stats tab to review a summary of findings. The Top known faces chart indicates that celebrities such as Steve Jobs and Elon Musk were recognized in the videos. Notice also the Top labels (accessories and airport, for example) and Top key phrases (Apple and Newton, for example) charts.

M2C summary

Set up a Neptune cluster and notebook

In the same Region in which you installed M2C, create Neptune resources using AWS CloudFormation. First, download a copy of the CloudFormation template. Then complete the following steps (a scripted alternative is sketched after the list):

  1. On the AWS CloudFormation console, choose Create stack.
  2. Choose With new resources (standard).
  3. Select Upload a template file.
  4. Select Choose file to upload the local copy of the template that you downloaded. The name of the file is NepM2C.yaml.
  5. Choose Next.
  6. Enter a stack name of your choosing.
  7. In the Parameters section, enter a value for M2CAnalysisBucket. Use the value collected after setting up M2C. Use defaults for the remaining parameters.
  8. Choose Next.
  9. Continue through the remaining sections.
  10. Read and select the check boxes in the Capabilities section.
  11. Choose Create stack.
  12. When the stack is complete, navigate to the Outputs section and follow the link for the output NeptuneSagemakerNotebook. This opens the Jupyter files view in your browser.
  13. In Jupyter, select M2CForKnowledgeGraph.ipynb to open the notebook that you use in the remaining steps.

Notebook select
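If you’d rather script the stack creation than use the console, a rough Boto3 equivalent looks like the following. The stack name and parameter value are illustrative placeholders; the capabilities correspond to the check boxes in step 10.

    import boto3

    cfn = boto3.client("cloudformation")

    # Read the local copy of the template downloaded earlier.
    with open("NepM2C.yaml") as f:
        template_body = f.read()

    # Create the stack with the M2C analysis bucket as a parameter.
    cfn.create_stack(
        StackName="neptune-m2c-demo",  # any name of your choosing
        TemplateBody=template_body,
        Parameters=[{
            "ParameterKey": "M2CAnalysisBucket",
            "ParameterValue": "<your-m2c-analysis-bucket>",
        }],
        Capabilities=["CAPABILITY_IAM", "CAPABILITY_NAMED_IAM"],
    )

    # Wait for creation to finish, then print the outputs, including
    # NeptuneSagemakerNotebook.
    cfn.get_waiter("stack_create_complete").wait(StackName="neptune-m2c-demo")
    stack = cfn.describe_stacks(StackName="neptune-m2c-demo")["Stacks"][0]
    print({o["OutputKey"]: o["OutputValue"] for o in stack["Outputs"]})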

In addition to creating a Neptune cluster and notebook instance, our CloudFormation stack also creates a staging bucket to hold RDF data to be bulk-loaded to the Neptune database.

We encourage you to review the stack with your security team prior to using it in a production environment.

Seed the cluster

In M2CForKnowledgeGraph.ipynb, complete the following steps, shown in the following screenshot. (A sketch of the bulk loader call behind step 3 follows the screenshot.)

  1. Run the cell under the label Extract names of S3 buckets from environment to extract from environment variables the names of the M2C analysis bucket and a staging bucket that we use for data to be bulk-loaded to the Neptune database.
  2. Run the cell under the label Create local folder for analysis results to create a folder on the notebook instance to build RDF analysis data prior to uploading it to the Neptune database.
  3. Run the cell under the label Bulk-load to Neptune from S3 to load seed data from the staging bucket to the Neptune database. Wait for a status of LOAD COMPLETE.
  4. Under Check status of load, run the code cell to check the load status and verify that there were no errors or duplicates.

Seed Neptune cluster
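Behind the scenes, the notebook’s load cell calls Neptune’s bulk loader HTTP endpoint. The following sketch shows an equivalent call made directly with Python; the endpoint, staging prefix, role ARN, and Region are placeholders, and it assumes IAM database authentication is not enabled on the cluster.

    import requests

    NEPTUNE = "https://<cluster-endpoint>:8182"
    SOURCE = "s3://<staging-bucket>/seed/"
    ROLE_ARN = "arn:aws:iam::<account-id>:role/<neptune-load-role>"

    # Start a bulk load of Turtle-formatted RDF from the staging bucket.
    resp = requests.post(f"{NEPTUNE}/loader", json={
        "source": SOURCE,
        "format": "turtle",
        "iamRoleArn": ROLE_ARN,
        "region": "<your-region>",
        "failOnError": "TRUE",
    })
    load_id = resp.json()["payload"]["loadId"]

    # Poll the loader; a finished load reports LOAD_COMPLETED.
    status = requests.get(f"{NEPTUNE}/loader/{load_id}").json()
    print(status["payload"]["overallStatus"]["status"])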

Explore seed data

In the notebook, run the cells under the heading Query the seed data. There are three SPARQL queries:

  1. The first query, under Persons and role in org, finds people and their roles in their respective organizations.
    Neptune query - persons and roles
  2. The second query, under Federated query of person, queries the Neptune database for the Wikidata URI of a specific person, then queries Wikidata itself to bring back RDF data for that person from the Wikidata SPARQL endpoint. This query requires that the Neptune cluster have internet connectivity to Wikidata’s public endpoint. If you use an existing cluster that does not have internet connectivity, you may skip this query. Refer to Neptune’s documentation on SPARQL federation. (A minimal federated query of this shape is sketched after this list.)
    Neptune query - federated query of persons
  3. The third query, under Orgs and products, queries the Neptune database for organizations and the products produced by each organization.
    Neptune query - orgs and products
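As promised, here is a minimal sketch of a federated query in the spirit of step 2, posted to the cluster’s SPARQL endpoint from Python. The example.org prefix and property name are hypothetical stand-ins for the ontology in seeddata.ttl, and the cluster must be able to reach Wikidata’s public endpoint.

    import requests

    NEPTUNE_SPARQL = "https://<cluster-endpoint>:8182/sparql"

    # Find Wikidata URIs of people in the Neptune database, then fetch
    # each person's English label from Wikidata via a SERVICE clause.
    query = """
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX : <http://example.org/m2c/>   # hypothetical; match your seed data

    SELECT ?person ?ref ?label WHERE {
        ?person :hasWikidataRef ?ref .
        SERVICE <https://query.wikidata.org/sparql> {
            ?ref rdfs:label ?label .
            FILTER(lang(?label) = "en")
        }
    } LIMIT 10
    """

    resp = requests.post(NEPTUNE_SPARQL, data={"query": query},
                         headers={"Accept": "application/sparql-results+json"})
    for row in resp.json()["results"]["bindings"]:
        print(row["person"]["value"], "->", row["label"]["value"])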

Convert M2C analysis to RDF

After you uploaded videos to M2C, you observed a summary view of the analysis results in the M2C portal. The full analysis results are saved to the M2C analysis bucket in Amazon S3. The notebook has code to read that data and transform it into RDF form so that it can be combined with the seed data.

Under Now add in the M2C analysis in the notebook, run the first four code cells:

  1. Run the code cell under Boto3 helpers to bring the video files from S3. This cell defines Python functions to retrieve data from the analysis bucket. It uses the AWS SDK for Python (Boto3). (A sketch of this kind of helper follows the list.)
  2. Run the code cell under Install RDFLib. RDFLib is a Python library to programmatically build RDF data.
  3. Run the code cell under RDFLib helpers to build the triples for M2C video analysis. This cell defines Python functions to build RDF triples representing video analysis.
  4. Run the code cell under For each video in analysis bucket, build triples and write to RDF file on the notebook instance. This cell calls functions defined previously to find analysis of each uploaded video and save results in RDF form.
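To give a feel for the kind of helper defined in step 1, here is a minimal Boto3 sketch. The exact key layout under the analysis bucket is determined by M2C, so treat the prefix and filtering here as illustrative.

    import json
    import boto3

    s3 = boto3.client("s3")

    def list_analysis_keys(bucket, prefix=""):
        """Yield object keys under the M2C analysis bucket."""
        paginator = s3.get_paginator("list_objects_v2")
        for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
            for obj in page.get("Contents", []):
                yield obj["Key"]

    def get_analysis_json(bucket, key):
        """Read one JSON analysis result from Amazon S3."""
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        return json.loads(body)

    # Example: print the key of every JSON result in the bucket.
    for key in list_analysis_keys("<your-m2c-analysis-bucket>"):
        if key.endswith(".json"):
            print(key)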

The following figure shows the data model for our RDF data.

Data model

Seed data is shown on the left in the shaded boxes. There are classes for Person, Role, Organization, Product, and RoleType. The first four classes have Wikidata references via the hasWikidataRef object property (shown as an arrow). There are object properties among the classes. For example, hasRole relates Person to Role, hasRoleOrg relates Role to Organization, and producedBy relates Product to Organization.

Analysis results are shown on the right side in the unshaded boxes. The main class is VideoAnalysis, which collects the M2C analysis results. Among its properties are the M2C analysis ID and the MP4 file name, as well as several findings from Amazon Comprehend (indicated by (C) in the figure) and Amazon Rekognition (indicated by (R)). The Amazon Comprehend findings are the following:

  • Keyphrases, as the data type property extractedKeyphrase of VideoAnalysis. There can be many extracted keyphrases.
  • Sentiment, as the sentimentCount data type properties of VideoAnalysis. There are four counts, for positive, negative, neutral, and mixed sentiments. The values indicate the frequency of each sentiment over the course of the video.
  • Extracted entities, which are broken into a separate class, Entity, linked to VideoAnalysis via the hasExtractedEntity object property. There can be many extracted entities.

Amazon Rekognition findings are the following:

  • Observed text, as the data type property observedText in VideoAnalysis. There can be many values.
  • A count of observed persons, as the personCount data type property in VideoAnalysis. The count is sufficient for graph purposes. Full Amazon Rekognition analysis of each person is available in the analysis bucket if needed.
  • Face analysis, which is expressed in the Emotion class, which relates to VideoAnalysis via the hasEmotion object property. An emotion has a subtype (for example, HAPPY, SAD) and a count, indicating how frequently that emotion was observed. Additionally, the emotion is associated with a gender via hasGender. Therefore, counts of emotions are broken down by gender.
  • Celebrity analysis, which is modeled as follows. First, the Celebrity class captures the details of the celebrity, including a hasWikidataRef object property indicating the celebrity’s URI in Wikidata. Second, VideoAnalysis links to the celebrity via a hasCelebrityAppearance relationship to Appearance. Appearance tracks the duration the celebrity appears in the video (appearance data type property). It links to the celebrity via the hasAppearanceSubject object property.
  • Label analysis. There is a Label class. VideoAnalysis links to it via a hasLabelAppearance object property to Appearance. Amazon Rekognition has a label taxonomy. If it finds a specific label, it also provides parent terms in the taxonomy. For example, if it finds the label Mobile Phone, it also reports the parent term Electronics. In our RDF representation, we model the label as a SKOS concept and the parent relationship using the SKOS broader relationship type.

View the full ontology, as well as the seed data, in seeddata.ttl from our GitHub repo. A short sketch of building triples in this shape follows.
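To make the model concrete, the following RDFLib sketch builds a handful of triples in the shape of the figure: a VideoAnalysis with one Comprehend keyphrase and one Rekognition celebrity appearance. The namespace and identifiers are hypothetical; the authoritative names are in seeddata.ttl.

    from rdflib import Graph, Literal, Namespace
    from rdflib.namespace import RDF, XSD

    # Hypothetical namespace for illustration only.
    M2C = Namespace("http://example.org/m2c/")
    WD = Namespace("http://www.wikidata.org/entity/")

    g = Graph()
    g.bind("m2c", M2C)

    analysis = M2C["video-ring-car-cam"]
    appearance = M2C["appearance-1"]
    celeb = M2C["celeb-elon-musk"]

    g.add((analysis, RDF.type, M2C.VideoAnalysis))
    g.add((analysis, M2C.extractedKeyphrase, Literal("car camera")))
    g.add((analysis, M2C.hasCelebrityAppearance, appearance))
    g.add((appearance, M2C.hasAppearanceSubject, celeb))
    g.add((appearance, M2C.appearance, Literal(12.5, datatype=XSD.decimal)))
    g.add((celeb, RDF.type, M2C.Celebrity))
    g.add((celeb, M2C.hasWikidataRef, WD["Q317521"]))  # Elon Musk

    # Serialize to Turtle, ready for upload to the staging bucket.
    print(g.serialize(format="turtle"))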

Ingest RDF

In the notebook, continue from where you left off by running the next three code cells.

Ingest RDF

Specifically, complete the following steps:

  1. Run Upload RDF files to S3 to add the M2C RDF data to the staging bucket so that we may bulk load it to the Neptune database.
  2. Run Bulk-load these to Neptune to perform the load.
  3. Run Check load status to check that the load has no errors. (A sketch of this status call follows the list.)
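The status check can also be made directly against the loader API, reusing the placeholder endpoint from the earlier loader sketch; the fields printed below are the ones that reveal errors and duplicates.

    import requests

    NEPTUNE = "https://<cluster-endpoint>:8182"
    load_id = "<load-id-returned-by-the-loader>"

    # Ask the loader for details on one load, including error reports.
    payload = requests.get(
        f"{NEPTUNE}/loader/{load_id}",
        params={"details": "true", "errors": "true"},
    ).json()["payload"]

    overall = payload["overallStatus"]
    print(overall["status"])           # e.g. LOAD_COMPLETED
    print(overall["totalRecords"])     # records processed
    print(overall["totalDuplicates"])  # should be 0
    print(overall["insertErrors"])     # should be 0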

Explore video analysis RDF data

The data is now loaded into the Neptune database. To begin querying the video analysis RDF data, run the cell under Summary of videos. There are five videos with varying person counts and sentiments.

Neptune query - summary of videos

Run the cell under Show celeb appearances to show names and Wikidata URIs of celebrities in the videos.

Neptune query - celebrity appearances
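The celebrity query has roughly the following shape when posted from Python; the property names (for example, celebrityName) are illustrative guesses at the ontology, so compare them against the notebook’s actual query.

    import requests

    NEPTUNE_SPARQL = "https://<cluster-endpoint>:8182/sparql"

    # Celebrities observed in the videos, with their Wikidata URIs,
    # ordered by how long they appear on screen.
    query = """
    PREFIX : <http://example.org/m2c/>   # hypothetical prefix

    SELECT ?video ?name ?ref ?seconds WHERE {
        ?video :hasCelebrityAppearance ?app .
        ?app :hasAppearanceSubject ?celeb ;
             :appearance ?seconds .
        ?celeb :celebrityName ?name ;
               :hasWikidataRef ?ref .
    } ORDER BY DESC(?seconds)
    """

    resp = requests.post(NEPTUNE_SPARQL, data={"query": query},
                         headers={"Accept": "application/sparql-results+json"})
    for row in resp.json()["results"]["bindings"]:
        print(row["name"]["value"], row["ref"]["value"])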

Find links between seed and video analysis data

To find links in the data, complete the following steps:

  1. Run the cell under LINK: Tie celebs in videos to persons in seed, matching on Wikidata ref. The query joins people and celebrities on the Wikidata URI. For our demonstration, it finds three celebrities from the video who are also people from our seed data. (A sketch of this query appears after this list.)
    Neptune query - link celebs in video to persons from seed
  2. Run the cell under LINK: Tie extracted entities to persons, org, products. This query compares Amazon Comprehend extracted entities from video analysis to the names of people, organizations, and products in the seed data. It finds instances of all three.
    Neptune query - Tie entities extracted from M2C to persons/orgs in seed data
  3. Run the cell under LINK: Tie any text in video analysis to seed object. This broadens the previous search by comparing extracted entities, extracted keyphrases, observed text, and labels to seed names. There are numerous matches.
    Neptune query - Tie text in M2C to resource in seed
  4. Finally, run the cell under LINK: Match on extracted entity and show what else is there. This query matches a person, product, or organization to an extracted entity in a video, and then lists the other extracted entities in that video. The first result, for example, shows that the video matching Jony Ive also mentioned John Sculley, who is also a person from the seed data. The video thus relates the two people.
    Neptune query - match extracted entities, show other entities in same video
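The core idea of the first linking query in this list can be sketched as a simple join on the shared Wikidata URI; as before, the class and property names are illustrative.

    import requests

    NEPTUNE_SPARQL = "https://<cluster-endpoint>:8182/sparql"

    # Join celebrities found in videos to people in the seed data by
    # matching their hasWikidataRef values.
    query = """
    PREFIX : <http://example.org/m2c/>   # hypothetical prefix

    SELECT DISTINCT ?person ?celeb ?ref WHERE {
        ?person a :Person ;
                :hasWikidataRef ?ref .
        ?celeb a :Celebrity ;
               :hasWikidataRef ?ref .
    }
    """

    resp = requests.post(NEPTUNE_SPARQL, data={"query": query},
                         headers={"Accept": "application/sparql-results+json"})
    for row in resp.json()["results"]["bindings"]:
        print(row["person"]["value"], "<->", row["celeb"]["value"])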

Clean up

If you’re done with the solution and wish to avoid future charges, delete the M2C and Neptune stacks. The S3 staging bucket is deleted as well. The M2C buckets are retained intentionally but can be removed manually.

Conclusion

In this post, we showed how to build an RDF knowledge graph with video analysis from Media2Cloud. We combined video analysis with our own product dataset, plus additional data about products and celebrities federated from Wikidata. We demonstrated SPARQL queries to bring this data together.

To learn more about how to model AI analysis of unstructured content for a Neptune database, refer to Building a knowledge graph in Amazon Neptune using Amazon Comprehend Events and Supercharge your knowledge graph using Amazon Neptune, Amazon Comprehend, and Amazon Lex.


About the Author

Mike Havey is a Senior Solutions Architect for AWS with over 25 years of experience building enterprise applications. Mike is the author of two books and numerous articles – read them on his author page.