In this module, you design the data model for your recommendation engine.

Graph databases may differ from databases you have used in the past, such as relational databases. There are a few key terms you should know when working with a graph database:

  • Graph: This refers to the entire database as a whole. It can be similar to a ‘table’ in other databases.
  • Vertex: A vertex (also called a node) represents an item in the graph. It is generally used to represent nouns or concepts -- people, places, terms, etc. The plural of vertex is vertices and you will see that term used below.
  • Edge: An edge is a connection between two vertices. Edges often represent relationships between entities. For example, two people who work together might be connected by a WorksWith edge.
  • Label: A label can be used to indicate the type of vertex or edge being added. For example, you may have vertices with the label User to indicate users in your application, as well as vertices with the label Interest to indicate a particular interest that people can follow.
  • Property: You can add key-value pairs to your vertices and edges. These are known as properties. For example, your user vertices will have a username property.

When querying a graph, you often start at a particular vertex and traverse the edges to find relationships to that original vertex. In your use case, to find all the people that a particular user follows, you start at the given user and traverse all edges with the label Follows out from that user.
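The terms above can be made concrete with a small in-memory sketch. This is plain JavaScript, not Gremlin or Neptune code, and every id, username, and helper name in it (such as out) is made up for illustration. Only the labels User, Interest, Follows, and InterestedIn match the ones used by the scripts in this module.

```javascript
// Toy in-memory graph; all ids and values below are illustrative only.
// Each vertex has a label and a set of key-value properties.
const vertices = [
  { id: 1, label: 'User', properties: { username: 'alice' } },
  { id: 2, label: 'User', properties: { username: 'bob' } },
  { id: 3, label: 'Interest', properties: { interest: 'Cooking' } },
];

// Each edge has a label and connects one vertex to another.
const edges = [
  { label: 'Follows', from: 1, to: 2 },      // alice follows bob
  { label: 'InterestedIn', from: 1, to: 3 }, // alice is interested in Cooking
  { label: 'InterestedIn', from: 2, to: 3 }, // bob is interested in Cooking
];

// Start at one vertex and walk its outgoing edges that carry the given label.
const out = (vertex, edgeLabel) =>
  edges
    .filter((e) => e.from === vertex.id && e.label === edgeLabel)
    .map((e) => vertices.find((v) => v.id === e.to));

// Find the starting vertex, then traverse its Follows edges.
const alice = vertices.find(
  (v) => v.label === 'User' && v.properties.username === 'alice'
);
const followed = out(alice, 'Follows').map((v) => v.properties.username);
console.log(followed);
```

A graph database performs this kind of lookup for you; the sketch only shows how vertices, edges, labels, and properties relate to each other.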

In the following steps, you complete some basic graph queries. First, you load some sample data into your cluster. Then, you see how to query to find a user’s current interests. Finally, you see a query to generate recommendations for a particular user.

Time to Complete Module: 30 Minutes


  • Step 1. Load sample data

    First, you load some sample data into your Neptune database.

    In your data model, there are two vertex types: User and Interest. There are also two edge types: Follows (which represents one User following another User), and InterestedIn (which represents a User indicating interest in one of the defined Interests).

    In the scripts/ directory, there is a vertices.json file that includes fifty sample Users and six sample Interests. There is also a script called insertVertices.js that reads the sample vertices and loads them into Neptune.

    The contents of the insertVertices.js script are as follows:

    const fs = require('fs');
    const path = require('path');
    
    const gremlin = require('gremlin');
    const DriverRemoteConnection = gremlin.driver.DriverRemoteConnection;
    const Graph = gremlin.structure.Graph;
    
    const connection = new DriverRemoteConnection(`wss://${process.env.NEPTUNE_ENDPOINT}:8182/gremlin`,{});
    
    const graph = new Graph();
    const g = graph.traversal().withRemote(connection);
    
    const createUser = async (username) => {
      return g.addV('User').property('username', username).next()
    }
    
    const createInterest = async (interest) => {
      return g.addV('Interest').property('interest', interest).next()
    }
    
    const raw = fs.readFileSync(path.resolve( __dirname, 'vertices.json'));
    const vertices = JSON.parse(raw)
    
    const vertexPromises = vertices.map((vertex) => {
      if (vertex.label === 'User') {
        return createUser(vertex.username)
      } else if (vertex.label === 'Interest') {
        return createInterest(vertex.name)
      }
    })
    
    Promise.all(vertexPromises).then(() => {
      console.log('Loaded vertices successfully!')
      connection.close()
    })

    The script imports the needed libraries and initializes the Neptune connection, as in the test database script. It then defines two functions -- createUser and createInterest -- that create the User and Interest vertices, respectively. Finally, the script reads the sample vertices from vertices.json and iterates over them, creating a User or an Interest vertex depending on each record's label.
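The script implies a shape for the records in vertices.json: each record carries a label, User records carry a username, and Interest records carry a name that is stored under the interest property. The sketch below shows that mapping without connecting to Neptune. The sample records and the toVertexSpec helper are hypothetical, and the real file may contain additional fields.

```javascript
// Hypothetical records; the real vertices.json holds fifty Users and six Interests.
const sampleVertices = [
  { label: 'User', username: 'example_user' },
  { label: 'Interest', name: 'Cooking' },
  { label: 'Unknown', name: 'ignored' },
];

// Map a record to the vertex label and property that the loader would write.
// Returns null for labels that the loader does not handle.
const toVertexSpec = (vertex) => {
  if (vertex.label === 'User') {
    return { label: 'User', key: 'username', value: vertex.username };
  }
  if (vertex.label === 'Interest') {
    // Note the rename: the record's `name` field becomes the `interest` property.
    return { label: 'Interest', key: 'interest', value: vertex.name };
  }
  return null;
};

// Drop unrecognized records instead of passing undefined promises along.
const specs = sampleVertices.map(toVertexSpec).filter((spec) => spec !== null);
console.log(specs);
```

Filtering out unrecognized labels is a design choice of this sketch; the script shown above simply returns nothing for them.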

    You can run the script with the following command:

    node scripts/insertVertices.js

    You should see the following output in your terminal:

    Loaded vertices successfully!

    There is also a script called insertEdges.js in the scripts folder. It loads some sample edges from the edges.json file and inserts them into your Neptune database.

    You can run the script with the following command:

    node scripts/insertEdges.js

    You should see the following output in your terminal:

    Edges loaded successfully!

    You have now loaded your sample vertices and edges. If you run the testDatabase.js script from the last module again, you see that your graph now contains some vertices:

    { value: 56, done: false }

    Success! You found 56 vertices -- fifty Users and six Interests.

    In the next step, you learn how to query all Interests for a given User.

  • Step 2. Fetch interests for a user

    Now that you have some data loaded into Neptune, you can query it to answer some questions.

    Gremlin’s query language can be difficult to understand at first, so let’s walk through an example. Imagine that your application wants to fetch and return all of the Interests for a particular User.

    In the scripts/ directory, there is a findUserInterests.js file with the following contents:

    const gremlin = require('gremlin');
    const DriverRemoteConnection = gremlin.driver.DriverRemoteConnection;
    const Graph = gremlin.structure.Graph;
    
    const connection = new DriverRemoteConnection(`wss://${process.env.NEPTUNE_ENDPOINT}:8182/gremlin`,{});
    
    const graph = new Graph();
    const g = graph.traversal().withRemote(connection);
    
    const findUserInterests = async (username) => {
      return g.V()
        .has('User', 'username', username)
        .out('InterestedIn')
        .values('interest')
        .toList()
    }
    
    findUserInterests('amy81').then((resp) => {
      console.log(resp)
      connection.close()
    })

    The top of the file again contains the imports and initialization code for connecting to a graph database. The key part is the findUserInterests function. This is similar to an internal function that you would have in your application. It takes a username argument and returns the Interests for that user.

    Let’s walk through the actual query in your function. There are five lines. Let’s see what each of them is doing.

    First, it starts with g.V(). The g variable is the traversal source for your graph. Using the V() operator after it indicates that you are operating on vertices (rather than edges).

    The next line is .has('User', 'username', username). This part narrows down the query to a specific vertex rather than all of them in the graph. It specifies you want a vertex with a label of User (the first argument in the has() operator). Then it says you also want the username property to be the value of the given username (the second and third arguments in the has() operator).

    Once you have selected your User vertex, then you want to find the User’s interests. You do this with the out() operator. The out() operator says to traverse the edges that go from your given vertex to another vertex. In this query, you narrowed it down to edges that have the InterestedIn label.

    At this point, you have an array of vertices that are the object of an InterestedIn relationship from the given User. You want to return the user-friendly name of these vertices, so you use the values() operator. You tell it that you want to return the interest property.

    Finally, you call the toList() operator to execute the traversal operation and gather the results as an array.
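To make the five steps easier to follow, here is a rough in-memory equivalent in plain JavaScript. It does not use Gremlin or Neptune. The four interests for amy81 are taken from the sample output in this step, while the ids, the second user, the Gardening interest, and the findUserInterestsLocal name are assumptions made for this sketch.

```javascript
// Toy data: ids, other_user, and Gardening are made up for this sketch.
const vertices = [
  { id: 'u1', label: 'User', username: 'amy81' },
  { id: 'u2', label: 'User', username: 'other_user' },
  { id: 'i1', label: 'Interest', interest: 'Nature' },
  { id: 'i2', label: 'Interest', interest: 'Sports' },
  { id: 'i3', label: 'Interest', interest: 'Woodworking' },
  { id: 'i4', label: 'Interest', interest: 'Cooking' },
  { id: 'i5', label: 'Interest', interest: 'Gardening' },
];

const edges = [
  { label: 'InterestedIn', from: 'u1', to: 'i1' },
  { label: 'InterestedIn', from: 'u1', to: 'i2' },
  { label: 'InterestedIn', from: 'u1', to: 'i3' },
  { label: 'InterestedIn', from: 'u1', to: 'i4' },
  { label: 'InterestedIn', from: 'u2', to: 'i5' },
];

const findUserInterestsLocal = (username) => {
  // g.V(): start from all vertices.
  const interests = vertices
    // .has('User', 'username', username): keep the matching User vertex.
    .filter((v) => v.label === 'User' && v.username === username)
    // .out('InterestedIn'): follow outgoing InterestedIn edges to their targets.
    .flatMap((v) =>
      edges
        .filter((e) => e.label === 'InterestedIn' && e.from === v.id)
        .map((e) => vertices.find((target) => target.id === e.to))
    )
    // .values('interest'): read the interest property of each target vertex.
    .map((v) => v.interest);
  // .toList(): the array already holds the collected results.
  return interests;
};

const amyInterests = findUserInterestsLocal('amy81');
console.log(amyInterests);
```

Each array method stands in for one Gremlin step, which is why the comments repeat the steps of the original query.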

    At the bottom of the script, it calls your findUserInterests function using one of the sample users. You can execute the script with the following command:

    node scripts/findUserInterests.js

    You should see the following output in your terminal:

    [ 'Nature', 'Sports', 'Woodworking', 'Cooking' ]

    Success! It shows that the user amy81 is interested in four interests -- Nature, Sports, Woodworking, and Cooking.

    In the next step, you learn how to generate friendship recommendations in your application.

  • Step 3. Generate recommendations for a user

    Now that you know some basic graph traversals, let’s try something more difficult.

    In your application, you want to generate user-specific recommendations of other users they should follow. A common way to generate these recommendations is to look for other users that are following similar people as you. If there are additional people that are commonly followed by these similar users, it is a good indication that you may want to follow them as well.

    For example, look at the following diagram.

    [Diagram: friend-rec-diagram]

    This diagram shows Users, as represented by ovals, as well as Follows relationships, as indicated by arrows from one oval to another. In this example, the User on the far left, MyUser, is following the User at the top, PopularPolly. You can see that PopularPolly is also followed by two other Users -- SimilarSam and MirrorMax. Both of these Users are also following a different User named InterestingIngrid. Because SimilarSam and MirrorMax are following some of the same people as MyUser, it's more likely that MyUser will be interested in following InterestingIngrid as well.

    In the scripts/ folder, there is a file called findFriendsOfFriends.js. The contents of that file are as follows:

    const gremlin = require('gremlin');
    const DriverRemoteConnection = gremlin.driver.DriverRemoteConnection;
    const Graph = gremlin.structure.Graph;
    const neq = gremlin.process.P.neq
    const without = gremlin.process.P.without
    const order = gremlin.process.order
    const local = gremlin.process.scope.local
    const values = gremlin.process.column.values
    const desc = gremlin.process.order.desc
    
    const connection = new DriverRemoteConnection(`wss://${process.env.NEPTUNE_ENDPOINT}:8182/gremlin`,{});
    
    const graph = new Graph();
    const g = graph.traversal().withRemote(connection);
    
    const findFriendsOfFriends = async (username) => {
      return g.V()
        .has('User', 'username', username).as('user')
        .out('Follows').aggregate('friends')
        .in_('Follows')
        .out('Follows').where(without('friends'))
        .where(neq('user'))
        .values('username')
        .groupCount()
        .order(local)
        .by(values, desc)
        .limit(local, 10)
        .next()
    }
    
    findFriendsOfFriends('davidmiller').then((resp) => {
      console.log(resp.value)
      connection.close()
    })

    As in the other scripts, there is a fair amount of import and initialization code to get started. The interesting part is the findFriendsOfFriends function defined in the file. This is similar to a function in your application that looks for ‘friends of friends’ to generate useful recommendations.

    Let’s walk through this complex graph query step-by-step again.

    First, you find the relevant User in the graph with the g.V().has('User', 'username', username).as('user') portion. This part uses the given username to find the proper User vertex. You use the as() step to save that vertex as ‘user’ so that you can refer to that vertex later in the query.

    Next, you want to find everyone that this user currently follows. You do that with .out('Follows').aggregate('friends'), which traverses the Follows edges going out from your User vertex to find the followed Users. You then aggregate these into a variable called friends that you can refer to later in the query.

    Next, you want to find the other users that are following these same users. You can do that with .in_('Follows'), which finds all vertices that have a Follows edge pointing to these users.

    You now have users that are similar to the requested user. The next step is to find the other users that these users are following, as they are likely to be interesting to your original user. You can find these with .out('Follows').where(without('friends')). This follows an outward edge with the Follows label from your users. Notice the where clause -- it uses the saved friends variable from your original user to filter out people that your original user already follows. You don’t want to include existing friends in their recommendations!

    Next, you use the .where(neq('user')) clause to filter out your original user, so you don't recommend the user follow themselves. Then, you fetch the username for each discovered node by using the .values('username') clause.

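    At this point, your traversal holds one username for each Follows edge from a similar user to another user. This means there are duplicates in your results -- if two similar users both follow the same user, the followed user shows up twice. This is really useful, as you can group the followed users by the number of times they have been followed by similar users. The users that have been followed by more similar users are more likely to be relevant to the original user. The remaining steps do exactly that: .groupCount() builds a map of each username to the number of times it appears, .order(local).by(values, desc) sorts that map by count in descending order, .limit(local, 10) keeps the ten highest entries, and .next() executes the traversal and returns the resulting map.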
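The walkthrough above can be replayed on the five users from the diagram with a plain-JavaScript sketch. This is not the Gremlin query itself, only an approximation of what each step computes. The helper names followedBy, followersOf, and recommend are hypothetical, and the edge list is transcribed from the diagram description.

```javascript
// Follows edges from the diagram, written as [follower, followed].
const follows = [
  ['MyUser', 'PopularPolly'],
  ['SimilarSam', 'PopularPolly'],
  ['MirrorMax', 'PopularPolly'],
  ['SimilarSam', 'InterestingIngrid'],
  ['MirrorMax', 'InterestingIngrid'],
];

// Users that `name` follows (outgoing Follows edges).
const followedBy = (name) =>
  follows.filter(([from]) => from === name).map(([, to]) => to);

// Users that follow `name` (incoming Follows edges).
const followersOf = (name) =>
  follows.filter(([, to]) => to === name).map(([from]) => from);

const recommend = (user, limit = 10) => {
  // .out('Follows').aggregate('friends')
  const friends = followedBy(user);
  // .in_('Follows'): similar users; this still includes the original user.
  const similar = friends.flatMap((friend) => followersOf(friend));
  const candidates = similar
    // .out('Follows')
    .flatMap((name) => followedBy(name))
    // .where(without('friends')): skip people the user already follows.
    .filter((name) => !friends.includes(name))
    // .where(neq('user')): never recommend the user to themselves.
    .filter((name) => name !== user);

  // .groupCount(): count how often each candidate appears.
  const counts = new Map();
  for (const name of candidates) {
    counts.set(name, (counts.get(name) || 0) + 1);
  }
  // .order(local).by(values, desc).limit(local, 10)
  return new Map([...counts].sort((a, b) => b[1] - a[1]).slice(0, limit));
};

const recommendations = recommend('MyUser');
console.log(recommendations);
```

InterestingIngrid appears once for SimilarSam and once for MirrorMax, so she ends up with a count of 2, while PopularPolly is filtered out because MyUser already follows her.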

    You can execute the script by running the following command in your terminal:

    node scripts/findFriendsOfFriends.js

    You should see the following output in your terminal:

    Map {
      'paullaurie' => 23,
      'thardy' => 20,
      'ocarrillo' => 18,
      'toddjones' => 18,
      'michaelunderwood' => 17,
      'ihensley' => 17,
      'paulacruz' => 17,
      'annette32' => 17,
      'morenojason' => 16,
      'bergjames' => 16 }

    Great! These are the top ten recommended users for the given user, with a count of the number of similar users that followed them.
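The printed output suggests that resp.value is a JavaScript Map. If your application needs to return these recommendations as JSON, one possible approach is to convert the Map into an array of objects first. The sketch below rebuilds the first three entries of the sample output by hand; the toRecommendations helper and the score field name are assumptions made for illustration.

```javascript
// First three entries of the sample output above, rebuilt as a Map.
const result = new Map([
  ['paullaurie', 23],
  ['thardy', 20],
  ['ocarrillo', 18],
]);

// Convert the Map into plain objects; a Map keeps insertion order,
// so the descending order produced by the query is preserved.
const toRecommendations = (map) =>
  [...map].map(([username, score]) => ({ username, score }));

const recommendationList = toRecommendations(result);
console.log(JSON.stringify(recommendationList));
```

Passing a Map directly to JSON.stringify produces an empty object, which is why the conversion step is useful before sending results to a client.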

  • Step 4. Generate recommendations for new users

    In the previous step, you learned how to generate recommendations for users that already have some people they’re following. However, one of the difficulties with recommendation engines is bootstrapping new users -- how do you generate recommendations for users that aren’t following any users yet?

    The query in the previous step would return no recommendations for users that aren’t following anyone. To help generate some recommendations for these users, you can fall back to looking at the interests that users have specified.

    In the scripts/ directory, there is a findFriendsWithInterests.js script. The contents of that script are as follows:

    const gremlin = require('gremlin');
    const DriverRemoteConnection = gremlin.driver.DriverRemoteConnection;
    const Graph = gremlin.structure.Graph;
    const neq = gremlin.process.P.neq
    const order = gremlin.process.order
    const local = gremlin.process.scope.local
    const values = gremlin.process.column.values
    const desc = gremlin.process.order.desc
    
    const connection = new DriverRemoteConnection(`wss://${process.env.NEPTUNE_ENDPOINT}:8182/gremlin`,{});
    
    const graph = new Graph();
    const g = graph.traversal().withRemote(connection);
    
    const findFriendsWithInterests = async (username) => {
      return g.V()
        .has('User', 'username', username).as('user')
        .out('InterestedIn')
        .in_('InterestedIn')
        .out('Follows')
        .where(neq('user'))
        .values('username')
        .groupCount()
        .order(local)
        .by(values, desc)
        .limit(local, 10)
        .next()
    }
    
    findFriendsWithInterests('alistephanie').then((resp) => {
      console.log(resp.value)
      connection.close()
    })

    This is similar to the previous step. There is a findFriendsWithInterests function that is similar to an internal function in your application. For a given user, it searches for other users that have the same interests. Then, it returns the users that are most followed by these similar users.

    At the bottom of the file, the function is called with a user that is not currently following any users.
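The interest-based fallback can be illustrated the same way with a plain-JavaScript sketch that runs without Neptune. All usernames, interests, and helper names below are hypothetical, and the code only approximates what each Gremlin step computes.

```javascript
// Hypothetical InterestedIn edges, written as [user, interest].
const interestedIn = [
  ['newbie', 'Cooking'],
  ['newbie', 'Sports'],
  ['chef_carl', 'Cooking'],
  ['sporty_sue', 'Sports'],
  ['sporty_sue', 'Cooking'],
  ['loner_lee', 'Woodworking'],
];

// Hypothetical Follows edges, written as [follower, followed].
// Note that newbie follows nobody yet.
const follows = [
  ['chef_carl', 'foodie_fay'],
  ['sporty_sue', 'foodie_fay'],
  ['sporty_sue', 'coach_kim'],
  ['loner_lee', 'wood_will'],
];

const recommendByInterests = (user, limit = 10) => {
  const candidates = interestedIn
    // .out('InterestedIn'): the interests of the given user.
    .filter(([name]) => name === user)
    .map(([, interest]) => interest)
    // .in_('InterestedIn'): everyone who shares each interest.
    .flatMap((interest) =>
      interestedIn.filter(([, i]) => i === interest).map(([name]) => name)
    )
    // .out('Follows'): the people those users follow.
    .flatMap((name) =>
      follows.filter(([from]) => from === name).map(([, to]) => to)
    )
    // .where(neq('user')): never recommend the user to themselves.
    .filter((name) => name !== user);

  // .groupCount(), then .order(local).by(values, desc).limit(local, 10)
  const counts = new Map();
  for (const name of candidates) {
    counts.set(name, (counts.get(name) || 0) + 1);
  }
  return new Map([...counts].sort((a, b) => b[1] - a[1]).slice(0, limit));
};

const suggestions = recommendByInterests('newbie');
console.log(suggestions);
```

In this sketch sporty_sue shares two interests with newbie, so the people she follows are counted twice. Users with more overlapping interests therefore carry more weight in the ranking.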

    You can execute the script using the following command:

    node scripts/findFriendsWithInterests.js

    You should see the following output:

    Map {
      'thardy' => 27,
      'paullaurie' => 26,
      'michaelunderwood' => 26,
      'paulacruz' => 20,
      'petersonchristina' => 19,
      'annette32' => 19,
      'ocarrillo' => 18,
      'evanewing' => 18,
      'hortonamy' => 18,
      'rodriguezjoseph' => 18 }
    

    Success! Even though the user isn’t following any users, you are able to generate some relevant recommendations for the user to follow.


    In this module, you learned graph database terminology and query mechanics. You applied these concepts by loading data into your database and running a basic query. Then you saw how to generate recommendations for users by traversing the graph and looking at similar users.

    In the next module, you configure authentication for your application using Amazon Cognito.