Analyzing social media feeds using Amazon Neptune

Data generated from various media applications helps organizations understand customer sentiment and product feedback. It can also be used to recommend new products and services to customers. This type of data is highly connected and is challenging to ingest and store in a relational database. A relational database does not store relationships directly, but rather as foreign keys between tables. Querying connected data in a relational database requires a large number of join operations, which can be difficult to write and inefficient to execute, and which must be updated as the data model changes. Using a graph database, you can express and query the relationships directly, which allows you to build applications on highly connected data more quickly. The purpose of this blog post is to show how to use a purpose-built graph database to analyze social media feeds.

This post walks through using a sample social media dataset with Amazon Neptune. Amazon Neptune is a fully managed graph database service that can store billions of relationships within highly connected datasets such as social media feeds. Neptune supports popular graph models such as Property Graph and W3C’s RDF. Their respective query languages, Apache TinkerPop Gremlin and SPARQL, allow you to build queries that efficiently navigate highly connected datasets.

We also share the utility to generate synthetic social media feeds and show how this data can be analyzed in Neptune. The utility is available in the amazon-neptune-samples GitHub repo. You can use this utility to generate millions of vertices (nodes) and edges (relations) and load them into your Neptune cluster. The utility code included in this post shows how to generate a synthetic social media dataset, use Neptune APIs, ingest data, and query a social media graph. The utility code from the GitHub repo is fully extensible; developers can change the number and type of vertices and edges by modifying the JSON configuration files.

The following diagram shows the graph data model that results from the relationships within the dataset that the utility generates. Vertices are represented by circles and edges are represented by directional arrows. Vertices have labels and properties, which are represented by the boxes. Edges also have labels, which are represented by the words on the arrows, and may also have one or more properties.

This example dataset is similar to a real-world social media source. For example, in a Twitter dataset, a user entity becomes a vertex and their relationship with followers becomes an edge. A User vertex can have properties such as Username, City, and Birth Date, and a Follows relation can also have properties such as Follows since and Weight. Posts and Tweets are another vertex, and they have a Tweets relation with the User vertex. Organizations can get insight on various activities and use those to identify behavior, popularity, and provide recommendations for a specific product or a service by navigating the vertices and edges throughout the graph.
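The property graph model described above can be sketched in a few lines of Python. This is only an in-memory illustration, not how Neptune stores data; the class names, IDs, and sample values are hypothetical and chosen to mirror the User, Tweet, Follows, and Tweets elements mentioned in the text.

```python
from dataclasses import dataclass, field


@dataclass
class Vertex:
    id: str
    label: str                      # for example "User" or "Tweet"
    properties: dict = field(default_factory=dict)


@dataclass
class Edge:
    id: str
    label: str                      # for example "Follows" or "Tweets"
    from_id: str                    # vertex the arrow starts from
    to_id: str                      # vertex the arrow points to
    properties: dict = field(default_factory=dict)


# Hypothetical sample data: two users and one tweet.
vertices = [
    Vertex("u1", "User", {"name": "Alice Example", "city": "Seattle"}),
    Vertex("u2", "User", {"name": "Bob Example", "city": "Austin"}),
    Vertex("t1", "Tweet", {"text": "Hello, graph!"}),
]
edges = [
    Edge("e1", "Follows", "u2", "u1", {"weight": 0.8}),  # u2 follows u1
    Edge("e2", "Tweets", "u1", "t1"),                    # u1 wrote t1
]


def followers_of(user_id):
    """Return IDs of users that have a Follows edge pointing at user_id."""
    return [e.from_id for e in edges if e.label == "Follows" and e.to_id == user_id]


print(followers_of("u1"))  # ['u2']
```

Because the Follows relationship is stored as an edge with its own properties, finding followers is a direct lookup over edges rather than a join across tables.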

Solution overview

The solution contains the following steps to generate, load, and query the data:

Clone the GitHub repo.
Generate social media example data using the Neptune Java utility in the repo (the utility uploads the graph data to Amazon S3).
Load this data into Neptune using Neptune Loader.
Issue search queries and update the graph using the Apache TinkerPop Gremlin client.

The following architecture diagram illustrates the components in use and the four steps.

Launching the AWS CloudFormation stacks

Before getting started on the solution, launch the two AWS CloudFormation stacks using the AWS CloudFormation templates provided with this post. The following infrastructure and application components are configured automatically after you create the stacks from these templates:

Single-node Neptune cluster (the default DB instance type is r4.large, which you can change as part of the stack creation)
S3 bucket to store data in CSV format
IAM role attached to the Neptune cluster for read-only access to S3
S3 VPC endpoint created and attached to the VPC of the Neptune cluster
Amazon EC2 instance with instance profile that allows you to read and write to S3
Java and Maven installed and configured on the EC2 instance used to access the Neptune cluster
Apache TinkerPop Gremlin client installed and configured to query graph data stored in the Neptune cluster

The following screenshot shows the Specify stack details section while provisioning the VPC through the first AWS CloudFormation template link. This post names the first stack neptune-util-vpc.

The following screenshot shows the Specify stack details while launching the second AWS CloudFormation template to create resources inside the VPC.

The following screenshot shows additional details of the Specify stack details page. This post names the S3 bucket for the CSV files nep-ej-n500.

The following screenshot shows additional details of the Specify stack details page. The VPCStack field has the value of neptune-util-vpc, from the previous AWS CloudFormation template.

After the status of both AWS CloudFormation stacks shows as Complete, connect to the EC2 instance provisioned by the AWS CloudFormation template and run the following steps on it.

Cloning the GitHub repo

This step clones the amazon-neptune-samples GitHub repo on the EC2 instance, which makes the Neptune Java utility available to generate and load Twitter-like data. Use the EC2 key pair from the AWS CloudFormation execution to connect to the EC2 instance.

You may need to connect over SSH to your EC2 instance. For more information, see Connecting to Your Linux Instance Using SSH.

To clone the repo, complete the following steps:

On the AWS CloudFormation console, choose Outputs.
Copy the value of EC2BastionHostName.
In the following code, replace host-name with the value of EC2BastionHostName:
```
ssh -i <keypair-name.pem> ec2-user@<host-name> 
```

Enter the following code to clone the GitHub repository:

sudo yum install git
git clone https://github.com/aws-samples/amazon-neptune-samples.git
cd amazon-neptune-samples/gremlin/neptune-social-media-utils
mvn package

Generating the example graph

In the neptune-social-media-utils folder, enter the following code. Replace the values for s3-bucket with the S3 bucket name you provided while launching the neptune-social-media-util-template AWS CloudFormation template. bucket-folder can be any folder name that you created under the S3 bucket.

./run.sh gencsv csv-conf/twitter-like-w-date.json <s3-bucket> <bucket-folder>

This command generates Twitter-like synthetic social media data in the /tmp folder on the local filesystem of the EC2 instance and then automatically uploads it to the S3 bucket you specified as an argument.
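For orientation, the following Python sketch writes a tiny vertex file and edge file in the Gremlin CSV layout that the Neptune bulk loader expects: system columns such as `~id`, `~label`, `~from`, and `~to`, plus typed property columns such as `name:String`. The rows and the specific property columns are hypothetical; the post does not show the exact columns the utility emits, so treat this as an assumption about the general shape rather than the utility's actual output.

```python
import csv
import io


def to_csv(header, rows):
    """Serialize a header and rows to CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()


# Vertex file: ~id and ~label are system columns; the rest are properties.
vertex_csv = to_csv(
    ["~id", "~label", "name:String"],
    [["u1", "User", "Alice Example"], ["u2", "User", "Bob Example"]],
)

# Edge file: ~from and ~to reference vertex IDs.
edge_csv = to_csv(
    ["~id", "~from", "~to", "~label", "weight:Double"],
    [["e1", "u2", "u1", "Follows", 0.8]],
)

print(vertex_csv)
print(edge_csv)
```

Vertices and edges go in separate files because their system columns differ.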

Loading the graph data into Neptune

Neptune provides a utility for loading data from external files directly into a Neptune DB instance. You can use this utility instead of executing a large number of INSERT statements, addVertex and addEdge steps, or other API calls. This utility is called the Neptune Loader. For more information, see Loading Data into Amazon Neptune.

When you launched the previous AWS CloudFormation templates, you already created an S3 VPC endpoint, so the Neptune cluster has access to S3 over a private network.

To use the Neptune Loader, first create and attach an IAM role, which allows the cluster to issue GET requests to S3. Complete the following steps:

On the AWS CloudFormation console, note the value of the NeptuneIAMRole key on the Outputs tab.
On the Neptune console, choose Clusters.
Choose the Neptune cluster you created.
From the Actions menu, choose Manage IAM roles.
From Add IAM roles to this cluster, select the same role you noted in step 1.
Choose Add role.

When the status changes from Active to Done, return to the Clusters dashboard.

The following screenshot shows the Manage IAM roles page. The name of the IAM role attached to the Neptune cluster is dev-neptune-iam-role-us-west-2.

You are now ready to load the data from the S3 bucket into the Neptune cluster. Complete the following steps:

On the AWS CloudFormation console, choose Outputs.
Copy the values of the NeptuneEndpointAddress, NeptuneEndpointPort, and NeptuneIAMRole.
You use these values in place of neptune-cluster-endpoint, port, and iam-role-arn in the next step.

In the neptune-social-media-utils folder, enter the following code, replacing s3-bucket-name/folder-name with the bucket and folder you used in the previous ./run.sh command:

./run.sh import <neptune-cluster-endpoint> <port> <iam-role-arn> <s3-bucket-name>/<folder-name> <aws-region-code>

For example, your code may resemble the following:

./run.sh import mytwitterclst.cluster-crhihlsciw0e.us-east-2.neptune.amazonaws.com 8182 arn:aws:iam::213930781331:role/s3-from-neptune-2 neptune-s3-bucket/twitterlikeapp us-east-2

This utility uses Neptune Loader internally to bulk load data into Amazon Neptune.
Alternatively, to load data into Neptune using Neptune Loader, use the following example curl command:

curl -X POST -H 'Content-Type: application/json' https://<amazon-neptune-cluster-endpoint>:8182/loader -d '
{
"source": "s3://bucket-name/bucket-folder/",
"format": "csv",
"iamRoleArn": "arn:aws:iam::<account-number>:role/<role-name>",
"region": "us-east-1",
"failOnError": "FALSE"
}'
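The same loader request can be assembled from Python. The sketch below only builds the request and the status URL without sending anything, so it runs without a cluster. The hostname, bucket, role ARN, and load ID are hypothetical placeholders, and the helper names are introduced here for illustration. It assumes the loader responds with a load ID that you can then check with a GET request to `/loader/<load-id>`.

```python
import json
import urllib.request


def build_load_request(endpoint, port, source, iam_role_arn, region):
    """Build (but do not send) a POST request for the Neptune loader endpoint."""
    body = {
        "source": source,
        "format": "csv",
        "iamRoleArn": iam_role_arn,
        "region": region,
        "failOnError": "FALSE",
    }
    return urllib.request.Request(
        f"https://{endpoint}:{port}/loader",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def load_status_url(endpoint, port, load_id):
    """URL to poll for the status of a previously submitted load job."""
    return f"https://{endpoint}:{port}/loader/{load_id}"


# Hypothetical values; substitute the outputs of your own stack.
req = build_load_request(
    "my-cluster.example.com",
    8182,
    "s3://bucket-name/bucket-folder/",
    "arn:aws:iam::123456789012:role/example-role",
    "us-east-1",
)
print(req.get_method(), req.full_url)
print(load_status_url("my-cluster.example.com", 8182, "example-load-id"))
```

To actually submit the request, you would pass `req` to `urllib.request.urlopen` from a host inside the cluster's VPC, such as the EC2 instance created earlier.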

You are now ready to run queries on the dataset you loaded into Neptune.

Running interactive remote queries

The AWS CloudFormation template already installed and configured the Apache TinkerPop Gremlin client on the EC2 instance.

To run interactive remote queries, first configure the Apache Gremlin Console to connect to the Neptune database instance.

Start the console and connect to the cluster as shown in the following session:

[ec2-user@ip-172-31-59-189 bin]$ cd /home/ec2-user/apache-tinkerpop-gremlin-console-3.4.1
/home/ec2-user/apache-tinkerpop-gremlin-console-3.4.1/bin
[ec2-user@ip-172-31-59-189 bin]$ bin/gremlin.sh 

         \,,,/
         (o o)
-----oOOo-(3)-oOOo-----
plugin activated: tinkerpop.server
plugin activated: tinkerpop.utilities
plugin activated: tinkerpop.tinkergraph
gremlin> :remote connect tinkerpop.server conf/neptune-remote.yaml
==>[neptune-cluster-endpoint]/172.31.23.188:8182
gremlin> :remote console
==>All scripts will now be sent to Gremlin Server - [neptune-cluster-endpoint]/172.31.23.188:8182] - type ':remote console' to return to local mode
gremlin>

You can find an example user from User vertices and use it for your queries. See the following code example:

gremlin> g.V().has('User','~id','1').valueMap()
==>{name=[Brenden Johnson]}

To discern who follows this user, enter the following code example:

gremlin> g.V().has('name', 'Brenden Johnson').in('Follows').values('name')
==>Jameson Kreiger
==>Yasmeen Casper
==>Maverick Altenwerth
==>Isabel Gibson
...

To find this user’s followers who retweeted their Tweets, enter the following code example:

gremlin> g.V().has('name', 'Brenden Johnson').in('Follows').as('a').out('Retweets').in('Tweets').has('name', 'Brenden Johnson').select('a').values('name').dedup()
==>Quentin Watsica
==>Miss Vivianne Gleichner
==>Mr. Janet Ratke
...
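To make the step-by-step logic of that traversal easier to follow, here is a rough in-memory Python equivalent. It is a sketch over a hypothetical edge list, with user IDs standing in for names, and it does not call Neptune or Gremlin; the comments map each loop to the corresponding Gremlin step.

```python
# Hypothetical edges as (label, from_vertex, to_vertex) tuples.
edges = [
    ("Follows", "carol", "alice"),
    ("Follows", "bob", "alice"),
    ("Tweets", "alice", "t1"),
    ("Tweets", "bob", "t2"),
    ("Retweets", "bob", "t1"),    # bob retweeted alice's tweet
    ("Retweets", "carol", "t2"),  # carol retweeted bob's tweet, not alice's
]


def out(vertex, label):
    """Vertices reached by following outgoing edges with this label."""
    return [t for (l, f, t) in edges if l == label and f == vertex]


def in_(vertex, label):
    """Vertices reached by following incoming edges with this label."""
    return [f for (l, f, t) in edges if l == label and t == vertex]


def followers_who_retweeted(user):
    result = []
    for follower in in_(user, "Follows"):           # .in('Follows').as('a')
        for tweet in out(follower, "Retweets"):     # .out('Retweets')
            authors = in_(tweet, "Tweets")          # .in('Tweets')
            if user in authors and follower not in result:  # .has(...) and .dedup()
                result.append(follower)             # .select('a')
    return result


print(followers_who_retweeted("alice"))  # ['bob']
```

The `as('a')` and `select('a')` pair in the Gremlin query plays the role of the `follower` loop variable here: it remembers which follower the traversal started from so that the follower, not the tweet author, is returned.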

For more example search and insert queries, see twitter-like-queries.txt in the GitHub repo.

Modifying your JSON file

This post demonstrated how to generate and load a simple dataset into Neptune. To change the number of generated edges or vertices and their properties, modify the twitter-like-w-date.json file (under the csv-conf/ folder). The repo also provides sample JSON configuration files for tiny, small, medium, and large datasets. You can use these configuration files to generate a dataset and run interactive queries on the data.

Conclusion

Social media datasets are increasingly common and of high value to organizations for analyzing customer sentiments, identifying relationships, and providing recommendations. This post shows how Amazon Neptune can be used to analyze synthetic social media feeds using Apache TinkerPop Gremlin. You can also use this approach to load test your applications on a Neptune cluster and benchmark the query performance on large datasets. Please let us know your feedback in the comments section.

About the Authors

Ejaz Sayyed is a Partner Solutions Architect with the Global System Integrator (GSI) team at Amazon Web Services. His focus areas include AWS database services as well as database and data warehouse migrations on AWS. Recently, he has also been supporting GSIs building data lakes on AWS for our customers.

Bala Ravilla is a Partner Solutions Architect with the Global System Integrator (GSI) team at Amazon Web Services. He provides guidance to GSIs in building scalable, highly available, and secure solutions on the AWS Cloud. His focus areas include IoT, serverless, and migrations.