Exploring Knowledge Graphs on Amazon Neptune Using Metaphactory

By Artem Kozlov, Software Engineer at metaphacts GmbH
By Kunal Sengupta, Software Dev Engineer at AWS

How does Thomson Reuters help customers navigate a complex web of global tax policies and regulations? What kind of technology is Siemens considering for applications ranging from semantic master data management and production monitoring, to finance and risk management?

The answer to both questions is simple—they use knowledge graphs.

A knowledge graph allows you to store information in a graph model and use graph queries to enable users to easily navigate highly-connected datasets.

Using a knowledge graph, you can add topical information to product catalogs, build and query complex models of regulatory rules, or model general information.

Suppose a user is interested in the Mona Lisa. As shown in Figure 1, you can also help them discover other works of art by Leonardo da Vinci, or other works of art located in The Louvre.

Knowledge graphs are gaining prominence in enterprise data management because they offer advantages for data integration. They also help build smarter applications that use machine learning and artificial intelligence (AI) methods.

In this post, we’ll show you how to get started with knowledge graphs using the metaphactory platform backed by Amazon Neptune. Offered by metaphacts GmbH, an AWS Partner Network (APN) Select Technology Partner, metaphactory helps you build knowledge graphs and smart applications.

Amazon Neptune supports open-source and open-standard API operations and allows you to use existing information resources to build your knowledge graphs and host them on a fully managed service. We’ll load a knowledge graph with a subset of Wikidata, publicly available information about corporations, and GeoNames.

After it’s loaded, we’ll show you how querying the knowledge graph can answer questions like, “Where is Netflix incorporated?” or “Who are the company’s independent directors?” If this isn’t your domain, you can reuse these same concepts for many different applications and subject areas.

Figure 1 – An example knowledge graph.

Goals for this Post

After going through this post, you will be familiar with the following:

Creating an Amazon Neptune stack with metaphactory as the client application using a one-step deployment AWS CloudFormation automation script.
Bulk loading RDF datasets from Amazon Simple Storage Service (Amazon S3) to Neptune.
Making simple SPARQL queries to Neptune’s SPARQL endpoint.
Building graphical visualization using metaphactory.
Using the query catalog to save and organize SPARQL queries for later use.

This post is designed for anyone who wants to get familiar with knowledge graphs. You don’t need to have any prior knowledge of the RDF data model or SPARQL query language to try it out.

Amazon Neptune and Metaphactory

Amazon Neptune is a purpose-built graph database service that efficiently stores and navigates highly connected data, allowing developers to create sophisticated, interactive graph applications that can query billions of relationships with millisecond latency. Customers use Neptune’s standards-based API operations, including RDF, SPARQL, and Apache TinkerPop Gremlin (a de facto standard), to build social networks, recommendation engines, fraud detection, knowledge graphs, drug discovery applications, and more.

Metaphactory is an end-to-end platform for creating and utilizing enterprise knowledge graphs built using the RDF/SPARQL stack—from semantic graph data management to data-driven application development. It offers capabilities and features to support the entire lifecycle of dealing with knowledge graphs.

Metaphactory’s generic approach offers great flexibility in different usage scenarios and for various application areas. Operating on top of the Amazon Neptune graph database using standard SPARQL 1.1 queries for communication, it offers rich knowledge management functionality that expert users, developers, and data scientists use for administration, creation, and analysis of knowledge graphs.

More general users can quickly find the answers they need using interactive interfaces for data exploration, navigation, visualization, and search. Metaphactory provides customizable user interface (UI) components and backend services that developers can use to build declarative applications.

Figure 2 – Metaphactory platform architecture.

Architecture

The overall architecture of our setup is sketched in Figure 3. The Neptune graph database service deploys so-called “clusters” into a virtual private cloud (VPC) based on Amazon VPC. A Neptune cluster is a collection of database instances, where one instance is the writer and up to 15 are reader instances (also called read replicas). You can add read replicas dynamically at any time. They provide high availability (HA) through fast failover, and they increase the number of queries that can be processed in parallel.

Each Neptune cluster comes with a cluster endpoint delegating requests to the writer instance, and a reader endpoint distributing (read-only) queries to the read replicas. A Neptune cluster can consist of a single instance, in which case the writer and reader endpoints point to the same (writer) instance.
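
The split between the two endpoints can be sketched in a few lines of Python. This is only an illustration, not part of the setup: the host names are hypothetical placeholders, and the sketch assumes Neptune’s default port 8182 and the /sparql path.

```python
# Hypothetical host names; copy the real ones from your own Neptune cluster.
CLUSTER_ENDPOINT = "my-cluster.cluster-abc123.us-east-1.neptune.amazonaws.com"
READER_ENDPOINT = "my-cluster.cluster-ro-abc123.us-east-1.neptune.amazonaws.com"
NEPTUNE_PORT = 8182  # assumed: Neptune's default port, left unchanged


def sparql_url(read_only: bool) -> str:
    """Return the SPARQL URL to use for a request.

    Read-only queries can go to the reader endpoint, which distributes them
    across the read replicas. Updates must go to the cluster endpoint, which
    delegates to the writer instance.
    """
    host = READER_ENDPOINT if read_only else CLUSTER_ENDPOINT
    return f"https://{host}:{NEPTUNE_PORT}/sparql"


if __name__ == "__main__":
    print(sparql_url(read_only=True))
    print(sparql_url(read_only=False))
```

In a single-instance cluster, both URLs end up at the same (writer) instance, so the distinction only matters once you add read replicas.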

Figure 3 – Overall architecture of the setup.

Note that the CloudFormation script used for setting up the infrastructure discussed in this post provides an input parameter to set up read replicas. By default, it provisions a single (read-write) instance only, which is sufficient to walk through the examples in this post.

To interact with the Neptune cluster, we deploy an Amazon Elastic Compute Cloud (Amazon EC2) client instance into the same VPC, where security groups are used to configure permissions. In our setup, the metaphactory application runs on this instance, acts as a client, and is connected to Neptune through its cluster endpoint. The application starts a web server that accepts incoming traffic on port 80. You can configure the IP range from which metaphactory is accessible as part of the setup. Use this IP range to restrict access to metaphactory, for example to a company’s internal network.

Neptune also offers fast bulk loading functionality from files available in Amazon S3. This feature requires a properly-configured AWS Identity and Access Management (IAM) role that enables access to the S3 bucket from Neptune and VPC endpoints so that Neptune can access the S3 resources from its VPC. Although this setup is automated by the CloudFormation script, you can find more details on the IAM and S3 configuration in the Neptune bulk load documentation.

Setup Using AWS CloudFormation

The metaphactory platform is available on AWS Marketplace as a CloudFormation template you can use to set it up together with Amazon Neptune in just one click. You can use the CloudFormation stack in your own infrastructure to follow along with the hands-on exercises presented throughout the remainder of this post.

After you subscribe to metaphactory for Amazon Neptune, you can choose the CloudFormation template. For this post, you can use a db.r4.large instance for Amazon Neptune and a t2.small for the metaphactory platform. These are selected by default in the “metaphactory and Neptune – Minimum” CloudFormation template.

For pricing information, see Amazon Neptune Pricing and Amazon EC2 Pricing. You can delete the stack once you are finished with it to stop any charges.

After choosing the CloudFormation template, do the following:

Choose Continue to Launch.
For Action, choose Launch CloudFormation, and choose Launch.
The CloudFormation console appears with the selected template already inserted for Specify an Amazon S3 template URL.
Choose Next.

Figure 4 – AWS CloudFormation configuration stack.

In the CloudFormation script, the mandatory parameters are these:

Stack Name – Choose a name for your stack.
Instance type for Amazon Neptune.
Amazon S3 Access Policy – Provide the Amazon Resource Name (ARN) of the policies allowing access to certain S3 buckets. If you’re unsure what to put here, then you can use *, but keep in mind that doing this grants Neptune read access to any S3 bucket.
Instance type for metaphactory.
CIDR for HTTP access – A range of IPv4 addresses for the HTTP access to the metaphactory instance in the form of a Classless Inter-Domain Routing (CIDR) block. If you aren’t sure what to put for this parameter, you can use 0.0.0.0/0, but keep in mind this makes metaphactory accessible from any IP address.
CIDR for SSH access – The same as the previous parameter, but for Secure Shell (SSH) access to the Amazon EC2 instance with metaphactory.
EC2SSHKeyPairName – The key pair for SSH access to the metaphactory instance.
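
To get a feeling for what a CIDR block admits before you enter it, you can check a few addresses locally. The following is a small sketch using Python’s standard ipaddress module. The addresses and the /24 range are made-up documentation examples, not values from the template.

```python
import ipaddress


def is_allowed(client_ip: str, cidr: str) -> bool:
    """Return True if client_ip falls inside the given CIDR block."""
    return ipaddress.ip_address(client_ip) in ipaddress.ip_network(cidr)


if __name__ == "__main__":
    # 0.0.0.0/0 matches every IPv4 address, so any client is admitted.
    print(is_allowed("198.51.100.7", "0.0.0.0/0"))
    # A narrower block (hypothetical office range) admits only its own addresses.
    print(is_allowed("203.0.113.25", "203.0.113.0/24"))
    print(is_allowed("198.51.100.7", "203.0.113.0/24"))
```

The same check applies to both the HTTP and the SSH parameter, because both are plain IPv4 CIDR blocks.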

After you make your choices, choose Next.

On the Options screen, keep the default parameters and choose Next.

On the Review screen, you will see all parameter details of your stack:

Check the box that you acknowledge that IAM resources will be created (the policies to access S3 buckets by Amazon Neptune).
Choose Create.

You will then be redirected to the CloudFormation stacks overview. Your new stack appears after a few moments, or you can refresh the page to see it immediately.

After the integrated stack is created, choose the Outputs tab, as shown in Figure 5, to find the access details of your stack. You can access the running metaphactory instance in a web browser by opening the metaphactoryURL stack output (essentially connecting on port 80 against the DNS name of the provisioned instance). This opens the metaphactory web UI, where you can log in as the user admin, using the value of the MetaphactoryPassword output as the password.

Figure 5 – AWS CloudFormation output.

The start page of metaphactory serves as an entry point to various platform functionality. You can go back to the start page by clicking on the logo in the page header.

Figure 6 – Metaphactory start page.

Load the Data

To illustrate the idea of knowledge graphs, this post uses public RDF-based data from the Open PermID Basic Dataset (using portions of the dataset licensed under CC-BY 4.0), extensions from Wikidata (CC0 1.0), and GeoNames (CC-BY 4.0).

The data describes organizations, related persons, and places (called geographical features). As illustrated in Figure 7, the different datasets cover different aspects of the data.

The PermID dataset provides information about the organizations and their relations to persons, who hold positions in these organizations. Wikidata provides some additional background about organizations (such as number of employees) and is interlinked with organizations from PermID through shared Legal Entity Identifiers (LEIs) of these organizations. An organization is also linked to a location (called “Features”), where additional location data such as latitude and longitude of places are available through the GeoNames dataset.

Figure 7 – Conceptual graph of the combined dataset.

To get started quickly, we have prepared a small representative subset of the full dataset. The full PermID dataset is available on demand as an RDF download from the official PermID source. Our sample dataset is available from public S3 buckets in the us-east-1, us-east-2, us-west-2, and eu-west-1 Regions. For example, to access the data from the us-east-1 Region (N. Virginia), use the following S3 URI:

s3://aws-neptune-customer-samples-us-east-1/bulkload-datasets/nquads/permid-sample/v01/

To access the data from the other available AWS Regions, replace the AWS Region name at the end of the bucket name (us-east-1 in this example) with the one you need.
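
If you script your setup, the Region substitution can be expressed as a small helper. This sketch assumes that the buckets in the other Regions follow exactly the naming pattern of the us-east-1 example. The function name is our own.

```python
# Regions named in this post as hosting the sample dataset.
SAMPLE_REGIONS = ("us-east-1", "us-east-2", "us-west-2", "eu-west-1")


def sample_dataset_uri(region: str) -> str:
    """Build the S3 URI of the PermID sample dataset for a given Region."""
    if region not in SAMPLE_REGIONS:
        raise ValueError(f"No sample bucket listed for Region {region!r}")
    # Assumption: only the Region suffix of the bucket name changes.
    return (
        f"s3://aws-neptune-customer-samples-{region}"
        "/bulkload-datasets/nquads/permid-sample/v01/"
    )


if __name__ == "__main__":
    print(sample_dataset_uri("eu-west-1"))
```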

Neptune provides HTTP-based API operations for bulk data loading that you can use to load the data from an accessible S3 bucket. You can use these API operations directly from the metaphactory UI through its Data Import & Export functionality.

To load the data, navigate to the Data Import & Export page and fill in the form as shown in Figure 8:

Enter the S3 URI of the folder, or of the specific file, that you want to load in the Load New Data form. Choose a bucket residing in the same AWS Region as your Neptune instance. If your S3 URI points to a folder, then all files in that folder, including its subfolders, will be loaded. If the S3 URI points to a file, only this file will be loaded.
Choose the RDF format of the files you want to load. In our case, we choose nquads, which is the format of all our sample datasets.
Enter the AWS Region of the S3 bucket.
Enter the ARN of the IAM role that allows the Neptune cluster to access the S3 bucket. Use the value of the NeptuneLoadFromS3RoleArn output from the Neptune CloudFormation stack.
Choose Start Load.
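
Behind the form, the load is started with an HTTP request against Neptune’s loader endpoint. The following sketch only builds such a request and does not send it. The endpoint host and the role ARN are placeholders, the port 8182 and the /loader path are assumed defaults, and the optional failOnError setting is our own choice for illustration.

```python
import json
import urllib.request


def build_load_request(endpoint: str, source: str, region: str,
                       role_arn: str, rdf_format: str = "nquads"):
    """Build (but do not send) a bulk load request for the Neptune loader."""
    body = {
        "source": source,         # S3 URI of a folder or a single file
        "format": rdf_format,     # nquads for the sample dataset
        "iamRoleArn": role_arn,   # role that lets Neptune read the bucket
        "region": region,         # Region of the S3 bucket
        "failOnError": "FALSE",   # optional; chosen here for illustration
    }
    return urllib.request.Request(
        url=f"https://{endpoint}:8182/loader",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    request = build_load_request(
        endpoint="neptune.example.com",  # hypothetical cluster endpoint
        source="s3://aws-neptune-customer-samples-us-east-1"
               "/bulkload-datasets/nquads/permid-sample/v01/",
        region="us-east-1",
        role_arn="arn:aws:iam::123456789012:role/ExampleRole",  # placeholder
    )
    print(request.full_url)
    print(request.data.decode("utf-8"))
```

To actually submit the request, you would pass it to urllib.request.urlopen from a machine inside the VPC, because the Neptune endpoint is not reachable from outside.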

Figure 8 – Bulk data loading form.

After a few minutes, the data load status (obtained through simple HTTP calls against Neptune’s data load status endpoint) will switch to Complete, indicating a successful load. After it succeeds, it should report that a total of 53,543 records have been loaded.
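
If you prefer to check the status yourself, the loader status endpoint returns a JSON document. The sketch below parses such a document. The response shown is a shortened, illustrative sample, the field names reflect our understanding of the loader API rather than anything shown in the metaphactory UI, and the record count is the one reported above.

```python
import json

# Illustrative, shortened response body (assumed shape).
SAMPLE_RESPONSE = """
{
  "status": "200 OK",
  "payload": {
    "overallStatus": {
      "status": "LOAD_COMPLETED",
      "totalRecords": 53543
    }
  }
}
"""


def status_url(endpoint: str, load_id: str) -> str:
    """URL to poll for one load job (assumed default port and path)."""
    return f"https://{endpoint}:8182/loader/{load_id}"


def summarize(response_text: str):
    """Return (status, total_records) from a loader status response."""
    overall = json.loads(response_text)["payload"]["overallStatus"]
    return overall["status"], overall["totalRecords"]


if __name__ == "__main__":
    print(summarize(SAMPLE_RESPONSE))
```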

Figure 9 – Data loading status.

While data loading progresses, we can look into how graphs are represented in the RDF data model. This knowledge will help us better understand the dataset and later query it with the SPARQL query language.

Work with RDF Graphs

Conceptually, RDF graphs consist of nodes and directed, labeled edges. For instance, Figure 10 shows a tiny RDF sample graph centered around a single resource. The ID https://permid.org/1-4295902158 represents the organization with the name “Netflix Inc.” The graph shows its name, its phone number, and its incorporation location (organization:isIncorporatedIn), which is another resource identified by http://sws.geonames.org/6252001/, with country code US.

Figure 10 – Visualization of the simple RDF graph.

One interesting aspect in RDF is that resource identifiers look like URLs. In fact, IDs in RDF are Internationalized Resource Identifiers (IRIs). At its core, an IRI consists of three parts: scheme, authority, and path (the suffix). For example, the IRI https://permid.org/1-4295902158 has scheme https, authority permid.org, and path /1-4295902158.
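
You can reproduce this decomposition with Python’s standard urllib.parse module, as in the following sketch. Note that urlsplit is a URL parser. It is sufficient for ASCII-only IRIs like the ones in this post, and it calls the authority part netloc.

```python
from urllib.parse import urlsplit


def iri_parts(iri: str):
    """Split an IRI into (scheme, authority, path)."""
    parts = urlsplit(iri)
    return parts.scheme, parts.netloc, parts.path


if __name__ == "__main__":
    print(iri_parts("https://permid.org/1-4295902158"))
    # → ('https', 'permid.org', '/1-4295902158')
```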

Using IRIs as identifiers clears the way for globally unique identifiers and is an important aspect in data publishing, interlinking, reuse, and interchange of datasets. Edge labels are represented as IRIs. For readability, we often abbreviate them with prefixes. For example, in Figure 10 we use the prefix “organization” as a shortcut for http://permid.org/ontology/organization/.
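
Prefix handling is plain string substitution, which the following sketch illustrates. The two namespaces are the ones used in this post, and the helper names expand and abbreviate are our own.

```python
# Prefix table; both namespaces appear in this post.
PREFIXES = {
    "organization": "http://permid.org/ontology/organization/",
    "vcard": "http://www.w3.org/2006/vcard/ns#",
}


def expand(prefixed_name: str) -> str:
    """Turn a prefixed name such as organization:isIncorporatedIn into a full IRI."""
    prefix, _, local_name = prefixed_name.partition(":")
    return PREFIXES[prefix] + local_name


def abbreviate(iri: str) -> str:
    """Turn a full IRI into a prefixed name if a known namespace matches."""
    for prefix, namespace in PREFIXES.items():
        if iri.startswith(namespace):
            return f"{prefix}:{iri[len(namespace):]}"
    return iri  # no matching prefix: keep the full IRI


if __name__ == "__main__":
    print(expand("organization:isIncorporatedIn"))
    print(abbreviate("http://www.w3.org/2006/vcard/ns#organization-name"))
```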

You can serialize RDF graphs as a set of triples (subject, predicate, object), where each triple represents a directed edge from subject to object, labeled with the given predicate. Our sample graph consists of four such triples:

<https://permid.org/1-4295902158> organization:isIncorporatedIn <http://sws.geonames.org/6252001/>

<https://permid.org/1-4295902158> vcard:organization-name "Netflix Inc"

<https://permid.org/1-4295902158> organization:hasRegisteredPhoneNumber "13026587581"

<http://sws.geonames.org/6252001/> iso:countryCode "US"

The first triple represents the outgoing “isIncorporatedIn” edge for our subject with IRI <https://permid.org/1-4295902158>, representing the company Netflix. This IRI is connected by the predicate organization:isIncorporatedIn to the object with IRI <http://sws.geonames.org/6252001/>, representing the country USA.

Informally speaking, this triple represents a single fact saying that Netflix is incorporated in the USA. The first triple’s object IRI, <http://sws.geonames.org/6252001/>, is also used in subject position in the last triple, which describes the country code for USA. In that triple, the object “US” is what we call a literal. Literals are sinks in the data graph and can’t have outgoing edges.
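
To make the triple model tangible, here is a minimal in-memory sketch of the four sample triples in Python. It is not how Neptune stores data. The Triple type and the flag that marks literals are simplifications introduced for this illustration.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str
    obj_is_literal: bool  # simplification: real RDF terms carry their own type


NETFLIX = "https://permid.org/1-4295902158"
USA = "http://sws.geonames.org/6252001/"

GRAPH = [
    Triple(NETFLIX, "organization:isIncorporatedIn", USA, False),
    Triple(NETFLIX, "vcard:organization-name", "Netflix Inc", True),
    Triple(NETFLIX, "organization:hasRegisteredPhoneNumber", "13026587581", True),
    Triple(USA, "iso:countryCode", "US", True),
]


def outgoing(node: str):
    """Edges that start at the given resource."""
    return [t for t in GRAPH if t.subject == node]


def incoming(node: str):
    """Edges that end at the given resource (literals are never resources)."""
    return [t for t in GRAPH if not t.obj_is_literal and t.obj == node]


if __name__ == "__main__":
    print(len(outgoing(NETFLIX)))    # three edges leave the Netflix node
    print(incoming(USA)[0].subject)  # the USA node is reached from Netflix
    print(outgoing("US"))            # a literal is a sink: no outgoing edges
```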

Explore the Knowledge Graph

Now that we have familiarized ourselves with the RDF data model, we can start exploring the loaded dataset graph. Before we move on, double-check the data loading job has finished successfully.

You can view every RDF resource stored in the Neptune database as a web page in metaphactory. To navigate to the corresponding page, you can use the Navigate to Resource menu from the quick links in the application header.

Figure 11 – Navigate to resource from quick links.

Now, let’s navigate to the Netflix, Inc. resource. Copy and paste https://permid.org/1-4295902158 into the input form and choose Navigate.

Figure 12 – Navigate to the Netflix Inc. resource.

As a result, you’ll navigate to the page of the Netflix, Inc. resource. The page consists of various visualization components that are powered by SPARQL queries parameterized with the current resource IRI, issued directly against the backing Neptune endpoint.

Figure 13 – Netflix, Inc. resource page.

Metaphactory offers three graph-specific views on the data:

Incoming and Outgoing Edges View for the current resource in a simple tabular format.
Default Graph View with an unstyled graph showing incoming and outgoing edges for the current resource.
Diagram View for interactive exploration of the current resource.

Before reading on, we encourage you to explore the Incoming and Outgoing Edges View and Default Graph View on your own. You can also choose the nodes, links, or both to browse through the graph.

Next, let’s take a closer look at the Diagram View, which you can use to interactively explore the graph. Assume, for instance, that we want to look into people involved in Netflix, as well as how they are connected to other organizations.

Open the Diagram View for the Netflix resource by choosing the diagram icon. By default, Diagram View shows the current resource together with its type. Clicking the small icon on the right side of Netflix, Inc. shows all outgoing and incoming edges.

Figure 14 – Diagram exploration.

Figure 15 – Incoming and outgoing edges for the Netflix, Inc. node.

When you select the incoming “is position in” edge, you see a list of all “position” nodes that are connected to the Netflix, Inc. node. Uncheck Select All, and then select only the first one. Choose Add Selected to add the selected node to the diagram.

Figure 16 – Select only one node for “is position in” edge.

When the selected node is added to the diagram, you can continue unfolding the graph by adding nodes that are connected with “has holder” and “has position type” edges to this new node. Keep in mind you can use the zoom-in and zoom-out icons in the diagram view toolbar to change the zoom. You can also move the nodes around as you want to.

In addition, you can grab the white background area with the mouse and move the whole canvas around to focus on the relevant part of the diagram.

Figure 17 – Position info for Netflix, Inc.

Below, Figure 18 shows an example in which the graph has been expanded to show the qualifications of Mr. Richard Barton and the positions he holds in other companies.

Figure 18 – Expanded graph for Mr. Richard Barton node.

You can also save the diagram to use it later for data documentation or to communicate interesting patterns found in the data.

Query the Knowledge Graph with SPARQL

Having shown how to explore the data model visually, we continue our exploration of the knowledge graph by issuing queries directly. For RDF data, Neptune supports the SPARQL graph query language, which has been standardized by the W3C.

If you’re not familiar with the SPARQL query language, you might want to try our interactive “Neptune SPARQL Tutorial” available through the Quick Links menu in your metaphactory instance. However, we will provide simple queries and explanations as part of this post as well.

Issue SPARQL Queries Using the Metaphactory Platform Interface

Metaphactory has a built-in SPARQL query editor that provides syntax highlighting and query validation. It’s accessible through the SPARQL link in the page header or from the start page.

Let’s issue some queries and, along the way, save them to the query catalog. In the previous section, we visually explored the resource Netflix, Inc., which is uniquely identified by the IRI <https://permid.org/1-4295902158>. Now, let’s explore this resource with SPARQL. As a first step, let’s query for all outgoing edges and nodes from the Netflix, Inc. node by using the following query:

SELECT ?property ?node WHERE { <https://permid.org/1-4295902158> ?property ?node. }

The preceding query consists of two parts: a SELECT part listing all the variables whose values should be returned (variables are identified by a leading “?”), and a WHERE part that specifies a graph pattern. As you can see, the graph pattern of the query is similar to the triple representation we’ve discussed for RDF edges before. The only difference is that instead of concrete values in the predicate and object positions, we use variables.

The idea here is that by fixing the subject to the Netflix IRI and using variables for the predicate and object, the query extracts all predicate labels (?property) and object nodes (?node) for triples having the subject https://permid.org/1-4295902158. This corresponds exactly to the outgoing edge labels and the nodes pointed to by the Netflix node.
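
Outside the metaphactory editor, the same query can be sent over HTTP using the standard SPARQL protocol. The sketch below builds the request without sending it and parses a result document in the W3C SPARQL JSON results format. The endpoint host is a placeholder, the port 8182 and /sparql path are assumed defaults, and the sample result contains only one row for illustration.

```python
import json
import urllib.parse
import urllib.request

QUERY = ("SELECT ?property ?node WHERE "
         "{ <https://permid.org/1-4295902158> ?property ?node . }")

# Illustrative one-row result in the SPARQL JSON results format.
SAMPLE_RESULT = json.dumps({
    "head": {"vars": ["property", "node"]},
    "results": {"bindings": [{
        "property": {"type": "uri",
                     "value": "http://www.w3.org/2006/vcard/ns#organization-name"},
        "node": {"type": "literal", "value": "Netflix Inc"},
    }]},
})


def build_sparql_request(endpoint: str, query: str):
    """Build (but do not send) a form-encoded SPARQL query request."""
    return urllib.request.Request(
        url=f"https://{endpoint}:8182/sparql",
        data=urllib.parse.urlencode({"query": query}).encode("utf-8"),
        headers={
            "Content-Type": "application/x-www-form-urlencoded",
            "Accept": "application/sparql-results+json",
        },
        method="POST",
    )


def rows(results_json: str):
    """Flatten result bindings into a list of {variable: value} dicts."""
    bindings = json.loads(results_json)["results"]["bindings"]
    return [{name: term["value"] for name, term in row.items()}
            for row in bindings]


if __name__ == "__main__":
    request = build_sparql_request("neptune.example.com", QUERY)
    print(request.full_url)
    print(rows(SAMPLE_RESULT))
```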

Figure 19 – Results of the simple query for outgoing edges of the Netflix, Inc. node.

Use the Query Catalog to Save and Reuse Queries

The metaphactory platform provides a query catalog you can use to maintain and organize queries for later reference. We can now save the query to the Query Catalog by choosing Save and filling in the form metadata using the dialog box in Figure 20. In our example, we can use “Netflix Inc. outgoing edges” as a label for the query.

Figure 20 – Save query form.

To see how you can use the Query Catalog to manage queries, let’s save one more query. Assume we want to select the name, the phone number, the place of incorporation, and the location of Netflix, Inc., together with the country code, latitude, and longitude of that location.

Figure 21 – Graph for information about Netflix, Inc.

Here is the SPARQL query you can use to get the data from the example:

PREFIX wgs84: <http://www.w3.org/2003/01/geo/wgs84_pos#>
PREFIX fibo: <http://www.omg.org/spec/EDMC-FIBO/BE/LegalEntities/CorporateBodies/>
PREFIX organization: <http://permid.org/ontology/organization/>
PREFIX vcard: <http://www.w3.org/2006/vcard/ns#>
PREFIX geo: <http://www.geonames.org/ontology#>

SELECT * WHERE {
  <https://permid.org/1-4295902158> vcard:organization-name ?organizationName .
  <https://permid.org/1-4295902158> organization:hasRegisteredPhoneNumber ?phoneNumber .
  <https://permid.org/1-4295902158> organization:isIncorporatedIn ?isIncorporatedIn .
  <https://permid.org/1-4295902158> fibo:isDomiciledIn ?location .
  ?location geo:countryCode ?countryCode .
  ?location wgs84:lat ?lat .
  ?location wgs84:long ?long .
}

You can see one pattern for each property in the WHERE part of the query. Each pattern is completed with a final dot. In this example, we use the ?location variable first in the object position, to get the value for the outgoing edge of the Netflix, Inc. location. But just the node about the location of Netflix alone is not very meaningful.

We want to get more information about the location and therefore need to explore the properties of the ?location itself. So, we need to get the location and all its outgoing properties. For this, we use the same ?location variable in the subject position to get values for outgoing edges of the location, such as latitude and longitude.
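
The effect of reusing ?location across patterns can be illustrated with a toy pattern matcher in Python. This is a deliberately naive sketch, not how Neptune evaluates queries. The location IRIs and coordinate values are made-up placeholders, and prefixed names are treated as opaque strings.

```python
NETFLIX = "https://permid.org/1-4295902158"
PLACE_A = "http://example.org/place/A"  # hypothetical location IRI
PLACE_B = "http://example.org/place/B"  # hypothetical, unrelated location

TRIPLES = [
    (NETFLIX, "fibo:isDomiciledIn", PLACE_A),
    (PLACE_A, "wgs84:lat", "37.0"),     # placeholder coordinates
    (PLACE_A, "wgs84:long", "-122.0"),
    (PLACE_B, "wgs84:lat", "48.0"),     # never joined: Netflix does not point here
]


def match(pattern, triple, binding):
    """Extend binding so that pattern matches triple, or return None."""
    extended = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # A variable must keep the value it was bound to earlier.
            if extended.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return extended


def evaluate(patterns, triples):
    """Match all patterns in order, carrying variable bindings along."""
    bindings = [{}]
    for pattern in patterns:
        next_bindings = []
        for binding in bindings:
            for triple in triples:
                extended = match(pattern, triple, binding)
                if extended is not None:
                    next_bindings.append(extended)
        bindings = next_bindings
    return bindings


if __name__ == "__main__":
    query = [
        (NETFLIX, "fibo:isDomiciledIn", "?location"),  # ?location as object
        ("?location", "wgs84:lat", "?lat"),            # ?location as subject
        ("?location", "wgs84:long", "?long"),
    ]
    print(evaluate(query, TRIPLES))
```

Because ?location is bound by the first pattern, the latitude of the unrelated place is filtered out, and only one result row remains.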

Let’s also save this second query in the query catalog in the same way as the previous one, using “Information about Netflix, Inc.” as a label. You can find more examples in the “Neptune SPARQL Tutorial” available through the Quick Links menu.

Now that you know how to issue SPARQL queries, it’s time to use them to visualize the graph data.

Visualize the Graph

As we’ve already seen on the Netflix resource page, you can use metaphactory to visualize the data with graphs and other visualization components like charts, tree views, timelines, and more. The web page associated with the resource can apply all these visualizations based on an underlying template mechanism. For example, the Netflix resource page is based on the custom template for the PermID Organization type.

Every resource page in metaphactory is a simple HTML page, where various visualizations are available as custom HTML5 web components that you can use in a similar way to built-in HTML tags like <div> and <h1>. For example, the HTML tag for rendering a SPARQL query output into a table is called <semantic-table>, and for a graph it’s <semantic-graph>. The metaphactory documentation contains more details on available visualizations and various interactive components for search, data authoring, and exploration.

As an example of how to use custom visualizations, we show how to create a simple graph visualization displaying up to 25 persons holding positions in companies that are reference customers of AWS. So far, we looked only into SPARQL SELECT queries that return data in a tabular form. To visualize the data as a graph, we need to retrieve the data using SPARQL CONSTRUCT queries. You can use CONSTRUCT to extract (and transform) subgraphs from RDF data.

To visually show the graph, we use the semantic-graph component, which takes a SPARQL CONSTRUCT query as a parameter and renders the resulting graph.

Let’s first create a fresh page in the metaphactory application. To do so, navigate to the metaphactory administration UI.

Figure 22 – Metaphactory administration UI link.

Go to Template & Application Pages management. Because every page in metaphactory is identified by its own IRI, let’s use https://permid.org/GraphView as the IRI for our new page. Copy and paste it into the form and choose Navigate to go to the page.

Figure 23 – Navigate to new page form.

Edit the page and paste the following HTML code snippet into it:

<h3>Graph</h3>

<semantic-graph height='800px' width='800px' user-zooming-enabled='true'
  query='
    PREFIX person: <http://permid.org/ontology/person/>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX vcard: <http://www.w3.org/2006/vcard/ns#>
    PREFIX aws: <http://aws.amazon.com/vocabulary/>
    CONSTRUCT WHERE {
      <https://permid.org/1-5038056692> aws:hasReferenceCustomer ?referenceCustomer .
      ?referenceCustomer vcard:organization-name ?referenceCustomerName .
      ?tenure person:isTenureIn ?referenceCustomer .
      ?tenure person:hasHolder ?holder .
      ?holder rdfs:label ?holderName .
      ?holder person:holdsPosition ?holderPosition .
      ?holderPosition person:hasReportedTitle ?holderTitle .
    }
    LIMIT 25
  '
>
</semantic-graph>

Save the page, and you should get a graph view similar to the one shown in Figure 24. You can use your mouse wheel to zoom in and out of the graph.

Figure 24 – Custom graph visualization.

Because all web components in metaphactory are fully customizable, you can further adjust the graph visualization with custom styling for nodes and edges, and also different layout algorithms. You can find more details about possible customization options in the metaphactory documentation.

Figure 25 – Customized graph visualization.

Summary

In this post, we showed you how to set up Amazon Neptune together with the metaphactory platform and use this software stack to explore and query RDF-based knowledge graphs. We have demonstrated the setup using an open data knowledge graph based on the PermID dataset.

We encourage you to try to load your own RDF data into Amazon Neptune and explore it with metaphactory. You can use all mentioned exploration techniques to view and query any RDF graph stored in the Neptune database, without any additional configuration.

We are always open to feedback, so don’t hesitate to get in touch with us to share your experience with Amazon Neptune and metaphactory.

metaphacts GmbH – APN Partner Spotlight

metaphacts is an APN Technology Partner. Its metaphactory platform helps you build knowledge graphs and smart applications by making authoring, curating, editing, linking, searching, and visualizing graph data easy, fast, and affordable.
