Graph relationships with Amazon Neptune
In this lesson, you build a fraud-detection service for your restaurant-rating application. When a rating comes in to your application, you add the data to your fraud-detection service. You then analyze the rating to see whether it should be flagged for manual review or for removal. This service uses Amazon Neptune, a fully managed graph database, for its data storage.
This lesson teaches you how to use a fully managed Neptune database in an application. First, you learn why you would want to use a graph database such as Neptune. Then you walk through the steps to create a Neptune database, design your data model, and use the database in your application. At the end of this lesson, you should feel confident in your ability to use Neptune in your application.
Time to complete: 30–45 minutes
Neptune is a fully managed graph database provided by AWS. A graph database is good for highly connected data with a rich variety of relationships. Many companies use graph databases for the following use cases:
- Recommendation engines: You can use a graph database to map users and followers in a social network or to map customers to item purchases in an ecommerce application. By analyzing the connections between similar users or customers, you can provide accurate recommendations of friends to follow or additional items to purchase.
- Fraud detection: Payment companies use graph databases to identify rings of fraudulent transactions. By analyzing the relationships between email addresses, IP addresses, and other shared information, it is easier to flag suspicious activity.
- Knowledge graphs: You can use graph databases to connect related pieces of information to show connections between people, places, and concepts. This can enable rich context around entities in your storefront or knowledge hub.
With Neptune, you get a fully managed graph database experience. This means you don't need to focus on instance failover, database backups and recovery, or software upgrades. You can focus on building your application and delivering value to your customers.
In this lesson, you learn how to build a fraud-detection service that uses Neptune for data storage. This lesson has five steps.
1. Create an AWS Cloud9 environment
In this module, you create and prepare an AWS Cloud9 environment. AWS Cloud9 is a cloud-based integrated development environment (IDE). It gives you a fast, consistent development environment from which you can quickly build AWS applications.
To get started, navigate to the AWS Cloud9 console. Choose Create environment to start the AWS Cloud9 environment creation wizard.
On the first page of the wizard, give your environment a name and a description. Then choose Next step.
The next step allows you to configure environment settings, such as the instance type for your environment, the platform, and network settings.
The default settings work for this lesson, so scroll to the bottom and choose Next step.
The last step shows your settings for review. Scroll to the bottom and choose Create environment.
Your AWS Cloud9 environment should take a few minutes to provision. As it is being created, the following screen is displayed.
After a few minutes, you should see your AWS Cloud9 environment. There are three areas of the AWS Cloud9 console to know, as illustrated in the following screenshot:
- File explorer: On the left side of the IDE, the file explorer shows a list of the files in your directory.
- File editor: In the upper right area of the IDE, the file editor is where you view and edit files that you’ve chosen in the file explorer.
- Terminal: In the lower right area of the IDE, the terminal is where you run commands to execute code samples.
In this lesson, you use Python to interact with your Neptune database. Run the following commands in your AWS Cloud9 terminal to download and unpack the module code.
cd ~/environment
curl -sL https://s3.amazonaws.com/aws-data-labs/fraud-detection.tar | tar -xv
Run the following command in your AWS Cloud9 terminal to view the contents of your directory.
ls
You should see two directories in your AWS Cloud9 terminal:
- scripts/: The scripts directory includes files necessary for configuring and preparing your database. Use this to test your database connection and load sample data into your database.
- application/: The application directory contains files that are similar to what you have in your application. They show how to query your graph database to satisfy your data access patterns.
Run the following command in your terminal to install the dependencies for your application.
sudo pip install -r requirements.txt
pip install gremlinpython
In this module, you configured an AWS Cloud9 instance to use for development. In the next module, you create a Neptune database.
2. Create a Neptune database
In this module, you create a Neptune database. This database is used to power the fraud-detection service in your application.
To get started, navigate to the Neptune console. Choose Create database to begin the database creation wizard.
In the Engine options section, use the default Neptune version. Then in the Settings section, give your database the identifier fraud-detection.
The database creation wizard includes templates to make the creation process easier. For this lesson, choose the Development and Testing template, which applies defaults that work well for this tutorial.
In the DB instance size section, keep the default of a db.t3.medium instance.
In the Availability & durability section, keep the default of not using a Multi-AZ deployment. In a production deployment, you would likely want a Multi-AZ deployment for better availability in the event of a failure.
In the Connectivity section, in the Additional connectivity configuration subsection, for the VPC security group choose Create new to create a new group. Then give your new security group the name fraud-detection.
You can configure tags or update additional configuration options, but the defaults work for this tutorial.
Choose Create database to create your Neptune database.
AWS begins provisioning your Neptune database. As your database is being provisioned, it shows a Status of Creating.
When your database is ready, it shows a Status of Available.
After your database is created, you need to configure its security group to allow access from your AWS Cloud9 environment.
To do that, navigate to the Security Groups page of the Amazon EC2 console. You should see security groups for both your AWS Cloud9 environment and your Neptune database.
Choose the Security group ID for your Neptune database.
The subsequent page displays the details and networking rules for your security group. Choose Edit inbound rules to edit the inbound networking rules for your security group.
There should be an existing rule that allows TCP traffic on port 8182 from the IP address you used to create the Neptune database. Edit the Source value so that it uses the AWS Cloud9 security group instead.
Then choose Save rules to save your inbound rules.
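The same inbound rule can also be written down as data. The following is a minimal sketch that only builds the rule as a Python dictionary, in the shape that the EC2 authorize_security_group_ingress API accepts. The helper name build_neptune_ingress_rule and the security group ID are hypothetical, and the boto3 call is left as a comment because it requires AWS credentials.

```python
NEPTUNE_PORT = 8182  # port used by the Neptune Gremlin endpoint in this lesson


def build_neptune_ingress_rule(cloud9_security_group_id):
    """Return an inbound rule allowing TCP 8182 from the given security group."""
    return {
        "IpProtocol": "tcp",
        "FromPort": NEPTUNE_PORT,
        "ToPort": NEPTUNE_PORT,
        # Referencing a security group (instead of an IP range) admits any
        # instance that belongs to that group.
        "UserIdGroupPairs": [{"GroupId": cloud9_security_group_id}],
    }


# "sg-0123456789abcdef0" is a made-up ID used only for illustration.
rule = build_neptune_ingress_rule("sg-0123456789abcdef0")
print(rule)

# With boto3 installed and credentials configured, the rule could be applied
# roughly like this (not executed here):
#   import boto3
#   boto3.client("ec2").authorize_security_group_ingress(
#       GroupId="<Neptune security group ID>", IpPermissions=[rule]
#   )
```

Using a security group as the source means the rule keeps working if the AWS Cloud9 instance gets a new IP address.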
Finally, you should check to ensure you have configured everything correctly and can connect to your Neptune database from your AWS Cloud9 environment.
Return to the Neptune console and find your Neptune database. Choose the DB identifier with a Role of Cluster to see information about your cluster.
On the cluster details page is a Cluster endpoint value. Copy this value.
In your AWS Cloud9 environment, run the following command in the terminal to set your cluster endpoint as an environment variable.
export NEPTUNE_ENDPOINT=<yourClusterEndpoint>
Be sure to replace <yourClusterEndpoint> with the value you copied from the Neptune console before running the command.
There is a file in the scripts/ directory called test_connection.py. Open the file in your file editor. The contents should look as follows.
import os

from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

endpoint = os.environ["NEPTUNE_ENDPOINT"]

g = traversal().withRemote(
    DriverRemoteConnection(f"wss://{endpoint}:8182/gremlin", "g")
)

results = g.V().count().next()

print(f"Connected to Neptune! There are {results} vertices in the database")
This script attempts to connect to Neptune to ensure your configuration is correct. First, it imports the classes it needs from the gremlinpython library. Gremlin is a popular graph traversal language from the Apache TinkerPop project, and gremlinpython lets you use it from Python applications.
After the imports, the script creates a Gremlin graph traversal source, g, that connects to the Neptune endpoint you set in your environment. Finally, the script runs a simple count operation to test the connection.
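The connection string is the part of this script that most often goes wrong, so it can help to look at it in isolation. The following is a minimal sketch that builds the same wss:// URL from the NEPTUNE_ENDPOINT variable without opening a connection. The helper name neptune_gremlin_url and the fallback host name are assumptions made for illustration.

```python
import os


def neptune_gremlin_url(endpoint, port=8182):
    """Build the secure WebSocket URL for a Neptune Gremlin endpoint."""
    endpoint = endpoint.strip()
    if not endpoint:
        raise ValueError("NEPTUNE_ENDPOINT is empty")
    if "://" in endpoint:
        raise ValueError("Use the bare host name, without a scheme such as wss://")
    return f"wss://{endpoint}:{port}/gremlin"


# The fallback host below is a made-up example value, not a real cluster.
endpoint = os.environ.get(
    "NEPTUNE_ENDPOINT",
    "example-cluster.cluster-abc123.us-east-1.neptune.amazonaws.com",
)
print(neptune_gremlin_url(endpoint))
```

Checking the value before connecting turns a confusing network error into a clear message when the variable was exported with a placeholder or a full URL.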
Run the following command in your terminal to test the connection to your Neptune database.
python scripts/test_connection.py
You should see a message that you were able to connect to your Neptune database and that there are no vertices in your database.
In this module, you created a graph database by using Neptune. Neptune provides a fully managed graph database that is compatible with open-source graph languages such as Gremlin. After creating your database, you configured your security group to allow inbound traffic from your AWS Cloud9 environment. Finally, you saw how to connect to Neptune and ran a script to test your connection.
In the next module, you design your data model for the fraud-detection service and load your database with sample data.
3. Design your graph data model and load sample data
In this module, you learn the basics of data modeling with a graph database. Then you design a data model for your fraud-detection service and load your database with sample data.
Graph databases might be different than databases you have used in the past, such as relational databases. There are a few key terms to know about graph databases:
- Graph: This refers to the database as a whole. It is similar to a table in other databases.
- Vertex: A vertex (also called a node) represents an item in the graph. It is generally used to represent nouns or concepts such as people, places, and terms. The plural of vertex is vertices, a term that is used in this lesson.
- Edge: A connection between two vertices. Edges often represent relationships between entities. For example, two people who work together might be connected by a WorksWith edge.
- Label: Can be used to indicate the type of vertex or edge being added. For example, you might have vertices with the label User to indicate users in your application, as well as vertices with the label Interest to indicate an interest that people can follow.
- Property: You can add key-value pairs to your vertices and edges. These are known as properties. For example, your user vertices have a username property.
When querying a graph, you often start at a vertex and traverse the edges to find relationships to that original vertex. In your fraud-detection use case, you might start with a User vertex and traverse the Reviewed edges to find the Restaurant vertices that the user has reviewed.
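To make these terms concrete, the following is a small in-memory sketch written in plain Python. It is not how Neptune stores data. The Vertex and Edge classes, the out helper, and the sample names are all hypothetical and exist only to show labels, properties, and a one-hop traversal.

```python
from dataclasses import dataclass, field


@dataclass
class Vertex:
    label: str
    properties: dict = field(default_factory=dict)


@dataclass
class Edge:
    label: str
    source: Vertex
    target: Vertex
    properties: dict = field(default_factory=dict)


# Illustrative data only; these names are not from the lesson's sample data.
alice = Vertex("User", {"username": "alice"})
cafe = Vertex("Restaurant", {"name": "Corner Cafe"})
diner = Vertex("Restaurant", {"name": "Main Street Diner"})
edges = [
    Edge("Reviewed", alice, cafe, {"rating": 5}),
    Edge("Reviewed", alice, diner, {"rating": 1}),
]


def out(vertex, edge_label, all_edges):
    """Follow outgoing edges with the given label and return the target vertices."""
    return [e.target for e in all_edges if e.source is vertex and e.label == edge_label]


reviewed = [v.properties["name"] for v in out(alice, "Reviewed", edges)]
print(reviewed)  # ['Corner Cafe', 'Main Street Diner']
```

Starting at the User vertex and following Reviewed edges mirrors the traversal described above, with the rating kept as a property on the edge.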
When building your graph data model, you should think about the entities in your application and how they relate to each other. The information that you model in your graph database may be different than what you store in your primary data store.
For example, your primary database might include information about the date of birth for each of your users. Though you might want to use that in your primary database for actions such as sending your users special offers on their birthdays, it's unlikely to be useful in the fraud-detection service. Because date of birth is unlikely to affect fraud, you can leave it out of that database entirely.
Conversely, there may be information that you don't store in your primary database that would be useful in your graph database. When a user leaves a review for a restaurant, you may not care to store the IP address used when the review was left. However, the IP address could be very useful in the fraud-detection service as you look for clusters of fraudulent activity from bots using the same IP address.
With that in mind, a rough example of the data model for the fraud-detection service is shown in the following diagram.
In the preceding data model, the vertices are shown as ovals. There are four vertices in this example of three different types:
- User: Represents a user in your application. A User vertex has a User label and a username property.
- Restaurant: Represents a restaurant in your application. A Restaurant vertex has a Restaurant label and a name property.
- IPAddress: Represents an IP address that was used by a User when reviewing a Restaurant. An IPAddress vertex has an IPAddress label and an address property.
Additionally, there are three edges of two types in the diagram:
- Reviewed: Indicates a review submitted by a User for a Restaurant. A Reviewed edge has a Reviewed label and a rating property.
- Used: Indicates an IPAddress was used by a User for a review. A Used edge has a Used label.
Now load some example data into the graph database to test the access patterns.
In the scripts/ directory, there is a file called bulk_load_database.py. Open the file in your file editor. You should see the following contents.
import json
import os

from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

endpoint = os.environ["NEPTUNE_ENDPOINT"]

g = traversal().withRemote(
    DriverRemoteConnection(f"wss://{endpoint}:8182/gremlin", "g")
)

with open("scripts/vertices.json", "r") as f:
    for row in f:
        data = json.loads(row)
        if data["label"] == "User":
            g.addV("User").property("username", data["username"]).next()
        elif data["label"] == "Restaurant":
            g.addV("Restaurant").property("name", data["name"]).next()
        elif data["label"] == "IPAddress":
            g.addV("IPAddress").property("address", data["address"]).next()

with open("scripts/edges.json", "r") as f:
    for row in f:
        data = json.loads(row)
        if data["label"] == "Used":
            g.V().has("User", "username", data["username"]).as_("user").V().has(
                "IPAddress", "address", data["ip_address"]
            ).as_("ip_address").addE("Used").from_("user").to("ip_address").next()
        elif data["label"] == "Reviewed":
            g.V().has("User", "username", data["username"]).as_("user").V().has(
                "Restaurant", "name", data["restaurant"]
            ).as_("restaurant").addE("Reviewed").from_("user").to(
                "restaurant"
            ).property(
                "rating", data["rating"]
            ).property(
                "username", data["username"]
            ).property(
                "restaurant", data["restaurant"]
            ).next()

print("Loaded data successfully!")
This script reads two files, scripts/vertices.json and scripts/edges.json, and inserts their contents into your Neptune database. Look at the code to see how you create vertices and edges in your application code. The script uses the addV() and addE() steps from the gremlinpython library to add vertices and edges.
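The loader reads each input file one line at a time and parses every line as a separate JSON object. The following sketch shows rows in that shape and checks them before any database call is made. The key names are the ones the loader script reads; the values, the REQUIRED_KEYS table, and the parse_rows helper are made up for illustration.

```python
import json

# Made-up rows in the shape the loader expects; the real sample files ship
# with the lesson code and contain different values.
vertex_lines = [
    '{"label": "User", "username": "alice"}',
    '{"label": "Restaurant", "name": "Corner Cafe"}',
    '{"label": "IPAddress", "address": "192.0.2.10"}',
]
edge_lines = [
    '{"label": "Used", "username": "alice", "ip_address": "192.0.2.10"}',
    '{"label": "Reviewed", "username": "alice", "restaurant": "Corner Cafe", "rating": 1}',
]

# Keys that bulk_load_database.py reads for each label.
REQUIRED_KEYS = {
    "User": {"username"},
    "Restaurant": {"name"},
    "IPAddress": {"address"},
    "Used": {"username", "ip_address"},
    "Reviewed": {"username", "restaurant", "rating"},
}


def parse_rows(lines):
    """Parse one JSON object per line and check the keys needed for its label."""
    rows = []
    for number, line in enumerate(lines, start=1):
        data = json.loads(line)
        missing = REQUIRED_KEYS[data["label"]] - set(data)
        if missing:
            raise ValueError(f"row {number} is missing {sorted(missing)}")
        rows.append(data)
    return rows


vertices = parse_rows(vertex_lines)
edges = parse_rows(edge_lines)
print(len(vertices), "vertices and", len(edges), "edges parsed")
```

Validating rows first is a design choice rather than part of the lesson's script. It avoids a half-loaded graph when one line is malformed.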
To execute the bulk load script and insert the records into your database, run the following command in your terminal.
python scripts/bulk_load_database.py
It takes a few moments to load all the records. You should see output indicating that the data was loaded successfully.
Execute the scripts/test_connection.py script again to see the number of vertices in your database after the load.
$ python scripts/test_connection.py
Connected to Neptune! There are 118 vertices in the database
Success! You have loaded items into your Neptune database.
In this module, you learned the basic terminology for working with graph databases. Then you designed your data model for your fraud-detection service. Finally, you loaded some sample data into your database.
In the next module, you run some queries against your graph database to help identify fraudulent activity.
4. Use a graph database in your application
In this module, you learn how to use a graph database in your application. You execute some Gremlin graph traversal queries to identify clusters of potentially fraudulent activity in your restaurant reviews application.
Graph databases are efficient at traversing relationships across your application. They can be used to find connections between entities that are difficult to discover with other databases. For this reason, they are often used in fraud-detection and social-networking applications.
Let's start with the first use case. Imagine you have a problem with automated bots that are leaving large numbers of one-star ratings for restaurants. Restaurant owners are upset with your service because users are damaging restaurants’ reputations by leaving low ratings in bulk. You want to discover the bot traffic and remove the offending reviews.
Let’s say you have an IP address that you have flagged as suspicious. You want to find which users have used this IP address and whether those users have given a high number of one-star reviews.
In the application/ directory, there is a file called find_users_of_suspicious_ip_addresses.py. Open that file in your file editor. The contents should look as follows.
import os

from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
from gremlin_python import statics

statics.load_statics(globals())

endpoint = os.environ["NEPTUNE_ENDPOINT"]

g = traversal().withRemote(
    DriverRemoteConnection(f"wss://{endpoint}:8182/gremlin", "g")
)


def find_users_of_suspicious_ip_addresses(ip_address):
    results = (
        g.V()
        .has("IPAddress", "address", ip_address)
        .as_("ip_address")
        .in_("Used")
        .aggregate("ip_address_users")
        .outE("Reviewed")
        .has("rating", 1)
        .values("username")
        .groupCount()
        .order(local)
        .by(values, desc)
        .limit(local, 10)
        .toList()
    )

    return [
        {"username": k, "1-star reviews": v}
        for result in results
        for k, v in result.items()
    ]


suspicious_users = find_users_of_suspicious_ip_addresses("173.153.51.29")

for user in suspicious_users:
    print(
        f"User {user['username']} has written {user['1-star reviews']} one-star reviews."
    )
After the library imports and database connection logic, there is a function called find_users_of_suspicious_ip_addresses. This function takes an IP address and returns up to 10 users who have used that address, ranked by the number of one-star reviews they have left. This is similar to a function you might have in your fraud-detection service.
Look at the traversal used to query your Neptune database. It starts with the graph traversal source and uses the V().has() syntax to find the vertex that represents the given IP address. Next, it follows the incoming Used edges from the IP address vertex to find the Users that have used that IP address. Then it follows the outgoing Reviewed edges from those Users and keeps only the edges with a rating of 1. Finally, it counts those edges per username, sorts the counts in descending order, and returns the top 10 Users found.
At the bottom of the file is an example of how to call the function with a given IP address. It then prints out the results.
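To see what this traversal computes without a database, the same steps can be approximated in plain Python over lists of edges. This is a sketch under simplifying assumptions: each user is considered once per IP address, the data is made up, and one_star_counts_for_ip is a hypothetical helper, not part of the lesson code.

```python
from collections import Counter

# Made-up edges for illustration: (username, ip_address) pairs for Used edges
# and (username, restaurant, rating) triples for Reviewed edges.
used = [("bot1", "192.0.2.10"), ("bot2", "192.0.2.10"), ("carol", "198.51.100.7")]
reviewed = [
    ("bot1", "Corner Cafe", 1),
    ("bot1", "Main Street Diner", 1),
    ("bot2", "Corner Cafe", 1),
    ("bot2", "Main Street Diner", 4),
    ("carol", "Corner Cafe", 1),
]


def one_star_counts_for_ip(ip_address, used_edges, reviewed_edges, top=10):
    """Rank the users of an IP address by how many one-star reviews they wrote."""
    # Step 1: users connected to the IP address by a Used edge.
    users = {user for user, ip in used_edges if ip == ip_address}
    # Step 2: count Reviewed edges with rating 1 written by those users.
    counts = Counter(
        user
        for user, _restaurant, rating in reviewed_edges
        if user in users and rating == 1
    )
    # Step 3: highest counts first, limited to the top results.
    return counts.most_common(top)


print(one_star_counts_for_ip("192.0.2.10", used, reviewed))
# [('bot1', 2), ('bot2', 1)]
```

The one-star review from carol is not counted because carol never used the flagged IP address.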
To test the function, run the following command in your terminal.
python application/find_users_of_suspicious_ip_addresses.py
You should see the following results in your terminal.
$ python application/find_users_of_suspicious_ip_addresses.py
User clester has written 10 one-star reviews.
User hhouston has written 5 one-star reviews.
Nice! You were able to discover two Users that could be fraudulent.
Though this approach could find bot users that are using the same IP address, your antagonists may get smarter and spread their actions across multiple IP addresses. You could use your graph database to find clusters of compromised IP addresses as well.
In the application/ directory, there is a file called find_related_suspicious_ip_addresses.py. Open this file in your file editor. The contents should look as follows.
import os

from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
from gremlin_python import statics

statics.load_statics(globals())

endpoint = os.environ["NEPTUNE_ENDPOINT"]

g = traversal().withRemote(
    DriverRemoteConnection(f"wss://{endpoint}:8182/gremlin", "g")
)


def find_related_suspicious_ip_addresses(ip_address):
    results = (
        g.V()
        .has("IPAddress", "address", ip_address)
        .as_("ip_address")
        .in_("Used")
        .aggregate("ip_address_users")
        .out("Used")
        .where(neq("ip_address"))
        .values("address")
        .groupCount()
        .order(local)
        .by(values, desc)
        .limit(local, 10)
        .toList()
    )

    return [
        {"ip_address": k, "user_overlap": v}
        for result in results
        for k, v in result.items()
    ]


suspicious_ip_addresses = find_related_suspicious_ip_addresses("173.153.51.29")

for user in suspicious_ip_addresses:
    print(
        f"IP address {user['ip_address']} has {user['user_overlap']} overlapping users."
    )
The contents of this file are similar to the previous one. There is a function called find_related_suspicious_ip_addresses that is similar to a function you would have in your service. This function takes an IP address, finds the Users that used it, and returns up to 10 other IP addresses used by those same Users, ranked by the number of overlapping users. This could be used to identify clusters of bad actors.
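The overlap count can be illustrated in plain Python as well. The following sketch uses made-up Used edges and a hypothetical related_ip_addresses helper. It approximates the traversal by finding the users of the flagged address and counting how many of them appear on every other address.

```python
from collections import Counter

# Made-up (username, ip_address) pairs standing in for Used edges.
used = [
    ("bot1", "192.0.2.10"),
    ("bot1", "203.0.113.5"),
    ("bot2", "192.0.2.10"),
    ("bot2", "203.0.113.5"),
    ("bot2", "198.51.100.7"),
    ("carol", "198.51.100.7"),
]


def related_ip_addresses(ip_address, used_edges, top=10):
    """Rank other IP addresses by how many users they share with ip_address."""
    # Users of the flagged address.
    users = {user for user, ip in used_edges if ip == ip_address}
    # Other addresses those users also used, excluding the starting address.
    overlap = Counter(
        ip for user, ip in used_edges if user in users and ip != ip_address
    )
    return overlap.most_common(top)


print(related_ip_addresses("192.0.2.10", used))
# [('203.0.113.5', 2), ('198.51.100.7', 1)]
```

Excluding the starting address plays the same role as the where(neq("ip_address")) step in the traversal.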
At the bottom of the file is a statement to execute the function with the suspicious IP address. Test the script by running the following command in your terminal.
python application/find_related_suspicious_ip_addresses.py
You should see the following output in your terminal.
$ python application/find_related_suspicious_ip_addresses.py
IP address 174.70.217.249 has 1 overlapping user.
Success! You found another suspicious IP address. You can use this IP address in your previous function to see if there are suspicious users on that IP address as well.
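Repeating this lookup for each newly found address amounts to a breadth-first search over users and IP addresses. The following is a minimal plain-Python sketch of that idea, using made-up Used edges and a hypothetical ip_cluster helper. A real service would run a Gremlin traversal for each hop instead of scanning a list.

```python
from collections import deque

# Made-up (username, ip_address) pairs standing in for Used edges.
used = [
    ("bot1", "192.0.2.10"),
    ("bot1", "203.0.113.5"),
    ("bot2", "203.0.113.5"),
    ("bot2", "198.51.100.7"),
    ("carol", "198.51.100.99"),
]


def ip_cluster(seed_ip, used_edges):
    """Return every IP address reachable from seed_ip through shared users."""
    seen = {seed_ip}
    queue = deque([seed_ip])
    while queue:
        current = queue.popleft()
        # Users of the address being expanded.
        users = {user for user, ip in used_edges if ip == current}
        # Any other address those users touched joins the cluster.
        for user, ip in used_edges:
            if user in users and ip not in seen:
                seen.add(ip)
                queue.append(ip)
    return seen


print(sorted(ip_cluster("192.0.2.10", used)))
# ['192.0.2.10', '198.51.100.7', '203.0.113.5']
```

The seen set keeps the search from revisiting addresses, so it terminates even when users and addresses form cycles.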
In this module, you saw how to use a graph database in your application. You used the Gremlin query language to traverse your graph to find related entities and identify fraudulent activity.
In the next module, you clean up the resources you created in this lesson.
5. Clean up the resources you created
In this lesson, you created a graph database by using Neptune that serves as the database for a fraud-detection service. Graph databases are a great fit for traversing highly connected data to discover hidden relationships between entities in your application. With Neptune, you get a fully managed graph database that allows you to focus on building features that delight your users.
In this module, you clean up the resources you created in this lesson to avoid incurring additional charges.
First, delete your Neptune database. To do so, navigate to the Neptune console. Choose the Writer instance for the database you created, and then choose Delete in the Actions dropdown.
A confirmation page is displayed before you delete the database. For this lesson, you can decline to keep a final snapshot of your database. Choose Delete to confirm the deletion.
The Neptune page shows that your Writer instance is being deleted.
After your Writer instance is deleted, Neptune deletes your Cluster instance as well.
Additionally, you need to delete your AWS Cloud9 development environment.
To do so, navigate to the AWS Cloud9 console. Choose the environment you created for this lesson, and then choose Delete.
In this module, you learned how to clean up the Neptune database and the AWS Cloud9 environment that you created in this lesson.
In this lesson, you learned how to create and use a Neptune database in your application. First, you created a Neptune database and configured network access so that you could connect to the database. Then you learned about data modeling with a graph database and loaded your database with sample data. Finally, you saw how to query a graph database in your application to traverse relationships in your data. You can use these patterns when building applications with Neptune.