Diagram-as-code using generative AI to build a data model for Amazon Neptune

To be successful with a graph database—such as Amazon Neptune, a managed graph database service—you need a graph data model that captures the data you need and can answer your questions efficiently. Building that model is an iterative process. The earliest stage of the process, in which you are merely getting initial elements on paper or whiteboard, is often the hardest. Modeling is demanding, but graph modeling in particular is unfamiliar to many data architects. Getting past that initial hurdle is essential to the success of your project.

In this post, we show you how, in stage one of the process, you can use generative AI and diagram-as-code tools to discover that initial model as simply and quickly as possible while meeting specific exit criteria:

The model is researched. You looked at the corpus to learn how others in this problem domain have built similar models. Generative AI helps meet this requirement because it draws on the same corpus you would consider if you were to research on your own.
The model captures the essential objects and relationships you require.
The model answers your questions. You can test-drive it to confirm.
The model avoids implementation lock-in. In particular, it can be mapped to either of the two dominant graph representations: labeled property graph (LPG) and Resource Description Framework (RDF).
Delivering the model does not require elaborate tooling or superior graph database skill. Besides generative AI, diagram-as-code tools allow you to show the model as a picture without the need for sophisticated modeling tools.

The outcome is a design the team can rally around.

We also give tips on how to grow the initial model through the remainder of the process.

Overview of solution

To demonstrate the early stage, we use the following example: In the role of data architect for a firm that contracts construction companies, you are given the requirement to build a database of construction companies, their relationships, and whether a company appears on a legal exclusion list.

We show not only the data model for this, but also how to arrive at that model. To get started, you need only a browser. As the following diagram shows, you use a generative AI chat playground in Amazon Bedrock plus a diagram-as-code drawing tool called PlantUML using the PlantText editor.

Tools for data model discovery

There are three steps:

Chat with a large language model (LLM) in Bedrock’s playground to discover an initial model. Bedrock is a managed service offering foundation models from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Stability AI, and Amazon with a single API to build generative AI applications.
Enter that model into PlantText to see it rendered visually. You might decide to ask follow-up questions in the playground. You might also adjust the model manually in PlantText.
When you are happy with the model, ask for test data in the playground.

You can go further by loading the sample data into a Neptune database and running queries to test-drive the model. For this, you need a Neptune cluster and a Neptune Workbench notebook.

The following diagram illustrates the solution workflow.

Discovery with test drive on Neptune

You run through the steps shown in the diagram using a notebook. The first three steps are the same as what we just discussed, except they are run programmatically as cells. We have included them in the notebook in case you would like to run the entire flow solely from the notebook. The steps are as follows:

Chat with an LLM to discover an initial model.
Enter that model into PlantUML to see it rendered visually. You can ask follow-up questions and make any adjustments.
Ask the LLM for test data.
Modify the test data and then load it into the Neptune database.
Query the test data using natural language queries. The LangChain library makes this possible by augmenting the LLM with Neptune. We discuss how this works in greater depth later in this post.

Refer to the notebook in our GitHub repository for setup instructions.

This solution incurs a cost. In particular, it requires a Neptune cluster, a Neptune notebook instance, and the use of a Bedrock LLM. Refer to the pricing pages for Neptune and Bedrock for more detail.

Discover a model

Start discovering a data model using generative AI. In your browser, navigate to Bedrock and select the Chat playground from the menu. For model category, select Anthropic. For model, select Claude V2.

Playground selections

Enter your initial question under the heading Add instructions. Ask the question:

Describe a data model for a company exclusion list.

Here is an excerpt of the response we received. Yours may differ. We added emphasis to highlight noteworthy terms:

A knowledge graph representing a company exclusion list could potentially model entities like companies, sectors, regions, and reasons for exclusion. It might link together companies with attributes like industry, location, business practices, and controversy level. The graph could illustrate why certain companies are excluded by connecting them to concepts like human rights violations, environmental damage, or corruption. Relationships like “company X is excluded for reason Y” could be modeled.
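The same question can also be sent programmatically, which is roughly what running this step from a notebook cell involves. The following is a minimal sketch, not the notebook's actual code. It assumes the boto3 SDK, the bedrock-runtime client, the model ID anthropic.claude-v2, and Claude V2's text-completion request format. The helper names build_claude_v2_body and ask_claude, and the default Region, are hypothetical.

```python
import json


def build_claude_v2_body(question: str, max_tokens: int = 1000) -> str:
    """Build the JSON request body for Claude V2's text-completion format."""
    return json.dumps({
        # Claude V2 text completions expect a Human/Assistant framed prompt.
        "prompt": f"\n\nHuman: {question}\n\nAssistant:",
        "max_tokens_to_sample": max_tokens,
        "temperature": 0.0,
    })


def ask_claude(question: str, region: str = "us-east-1") -> str:
    """Send one question to Claude V2 on Bedrock and return the completion text.

    Requires AWS credentials and Bedrock model access. The Region is an assumption.
    """
    import boto3  # imported lazily so the body builder works without the SDK

    client = boto3.client("bedrock-runtime", region_name=region)
    response = client.invoke_model(
        modelId="anthropic.claude-v2",
        body=build_claude_v2_body(question),
        contentType="application/json",
        accept="application/json",
    )
    return json.loads(response["body"].read())["completion"]


# Example call (needs credentials, so it is left commented out):
# print(ask_claude("Describe a data model for a company exclusion list."))
```

Each call in this sketch is stateless. To ask a follow-up question in the same conversation, as the playground does, the earlier turns would have to be included in the prompt.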

Visualize the model

The response makes a helpful recommendation, but it’s more helpful to see a picture. Continue in the playground. In the same chat, ask a follow-up question by entering under Add instructions:

Show as a PlantUML class diagram.

The response includes the following code snippet.

@startuml

class Company {
  - name
  - sector
  - region
  - controversy_level 
}

class ExclusionReason {
  - name
  - description  
}

class ExclusionList {
  - name
  - owner 
}

Company "1" -- "0..*" ExclusionReason : is_excluded_for > 
ExclusionList "1" -- "0..*" Company : contains >

class Sector {
  - name  
}

Company "1" -- "1" Sector : belongs_to >

class Region {
  - name
}

Company "1" -- "1" Region : located_in >


class Controversy {
  - category
  - description  
}

Company "1" -- "0..*" Controversy : involved_in >

@enduml

We asked to see a Unified Modeling Language (UML) class diagram because it is an effective visualization of a data model. UML is a complex modeling language, but emerging tools like PlantUML make it straightforward to use. PlantUML is a diagram-as-code tool. You provide a few lines of text as input and PlantUML provides an image as output. The LLM can’t render a UML diagram as an image, but it can render PlantUML-compliant text.
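Because the LLM usually wraps the PlantUML text in explanatory prose, it can help to pull out just the diagram code before rendering it. The following is a small sketch of one way to do that. The function name extract_plantuml and the sample response are hypothetical and only illustrate the idea.

```python
import re


def extract_plantuml(response_text: str) -> str:
    """Return the first @startuml ... @enduml block found in an LLM response."""
    match = re.search(r"@startuml.*?@enduml", response_text, flags=re.DOTALL)
    if match is None:
        raise ValueError("no PlantUML block found in the response")
    return match.group(0)


# A made-up response used only to exercise the helper.
sample_response = """Here is the diagram:

@startuml
class Company {
  - name
}
@enduml

Let me know if you want changes."""

print(extract_plantuml(sample_response))
```

The extracted text is what you would paste into the editor or pass to a rendering cell.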

Enter the code block from your response into PlantText or, in the notebook, run the cell under the heading Show PlantUML diagram. For our example, PlantUML rendered the following diagram.

Initial plant model

The main class is Company. It can belong to an ExclusionList. It is associated with ExclusionReason, Sector, Region, and Controversy. This is a great start, but we would like to make a few enhancements. For one, we would like the diagram to show how companies are related to each other.

Ask and visualize model enhancements

In the playground, as part of the same chat, enter the following under Add instructions:

Include how companies are related to each other.

The LLM suggests additions to the model. It modifies the PlantUML code block to add a related_to association between companies.

Company "0..*" -- "0..*" Company : related_to >

Paste that into PlantText to see a visualization.

Plant model with changes

The diagram shows the related_to relationship connecting Company to itself.

Adjust the model

You can adjust the model manually either in PlantText or under the notebook heading Let’s tweak a bit. In PlantText, enter the following modified model:

@startuml
class Company {
+Name: String
+Industry: String
+Location: String
}

class ExclusionCriteria {
+Type: String
+Description: String
+SeverityLevel: String
}

class Exclusion {
+ExclusionStartDate: Date
+ExclusionEndDate: Date
+ExclusionType: String
}

Company "1" --> "*" Exclusion : has exclusion
Company "1" --> "0..1" Exclusion : has current exclusion
Exclusion "1" --> "1" ExclusionCriteria : has criteria
Company "1" --> "*" Company: related to

note on link : reltype: String, established date: Date

@enduml

The next figure shows the rendered image.

Model with architect's changes

Here is a summary of the changes:

Company remains, but you removed the Sector and Region classes and replaced them with Industry and Location attributes of Company.
You enhanced the company relationship by adding reltype and established date attributes to the related to association.
You restructured the representation of exclusions, removing ExclusionList, ExclusionReason, and Controversy. You added an Exclusion class. Over time, a company can have zero or more exclusions, represented by the has exclusion association. The has current exclusion association indicates that the company currently has an exclusion. An exclusion can have a start date and end date. ExclusionCriteria enumerates the types of exclusions and their severity. Every Exclusion is linked to one ExclusionCriteria.

UML is a powerful modeling language, but we use its features sparingly. All of our relationships are associations. UML also offers dependencies, aggregations, and compositions, but we wanted to keep it simple. The related to association has attributes. We indicate these in a note. A more advanced approach would be to use an association class.

Generate test data

Ask the LLM to create sample data for the updated model. Back in the playground, in the same chat, ask these two questions under Add instructions:

Generate OpenCypher queries to create sample data based on [paste PlantUML code here]
Generate RDF sample based on [paste PlantUML code here]

Significantly, these prompts generate both LPG data (using CREATE commands in OpenCypher, one of the main query languages for LPG) and RDF data. Neptune supports both representations, and your model is flexible enough to permit mock data generation in each. We focus on the OpenCypher/LPG data in this post.

We saved the responses in the notebook. The data is promising, but you may decide to make a few minor adjustments to set up specific test cases.

Load test data into a Neptune database

The remaining steps use the Neptune database. We demonstrate using the OpenCypher test data. In the notebook, run the cell under the heading Load data from OC into Neptune. This runs OpenCypher CREATE statements to load an updated version of the generated test data. The data contains five companies: A, B, C, D, and E. A is related to B and C. C is related to D. A has a current exclusion and a past exclusion. B has a current exclusion. C has a past exclusion.

Here is an excerpt of the OpenCypher code, showing the creation of companies A and B. The full code can be found in the notebook.

// Create company nodes
CREATE (:Company {`~id`: 'compA', Name: 'Company A', Industry: 'Tech', Location: 'California', Tag: 'demo'})
CREATE (:Company {`~id`: 'compB', Name: 'Company B', Industry: 'Finance', Location: 'New York', Tag: 'demo'})

// Create exclusion nodes
CREATE (:Exclusion {`~id`: 'xa2', ExclusionStartDate: datetime('2022-01-01T00:00:01'), ExclusionType: 'Financial', Tag: 'demo'})
CREATE (:Exclusion {`~id`: 'xa1', ExclusionStartDate: datetime('2020-01-01T00:00:01'), ExclusionEndDate: datetime('2020-06-11T00:00:01'), ExclusionType: 'Financial', Tag: 'demo'})
CREATE (:Exclusion {`~id`: 'xb1', ExclusionStartDate: datetime('1985-01-01T00:00:01'), ExclusionType: 'Legal', Tag: 'demo'})

// Create edges

// First, locate the source/target nodes
MATCH (companyA:Company {`~id`: 'compA'})
MATCH (companyB:Company {`~id`: 'compB'})
MATCH (xa2:Exclusion {`~id`: 'xa2'})
MATCH (xa1:Exclusion {`~id`: 'xa1'})
MATCH (xb1:Exclusion {`~id`: 'xb1'})

// create company relationship edge
CREATE (companyA)-[r1:relatedTo {reltype: 'Partner', establishedDate: datetime('2020-01-01T00:00:01')}]->(companyB)

// create exclusion edges
CREATE (companyA)-[x1:hasCurrentExclusion]->(xa2)
CREATE (companyA)-[x2:hasExclusion]->(xa2)
CREATE (companyA)-[x3:hasExclusion]->(xa1)
CREATE (companyB)-[x4:hasCurrentExclusion]->(xb1)
CREATE (companyB)-[x5:hasExclusion]->(xb1)

Ask questions

One of the goals of this stage was to test-drive the model. Now that you have test data loaded into the Neptune database, you can run queries against the model. Generative AI enables you to do this without having to write the OpenCypher queries by hand. You can ask a natural language question and see the OpenCypher query it generates. It even responds back in natural language.

The notebook uses LangChain, a Python library that provides an interface to LLMs. LangChain supports the Retrieval Augmented Generation (RAG) pattern, which allows the LLM to get help from a database to answer a question. The LLM in this case gets help from the Neptune database:

The Neptune database provides the graph schema: node labels, properties, and edge labels. The LLM generates an OpenCypher query, but it uses this schema to know the labels and properties to use.
The query is run in the Neptune database.
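The flow described above can be sketched without any particular library. The following is an illustration of the pattern only, not LangChain's implementation. The function answer_with_graph, the prompt template, and the stand-in LLM and graph functions are all hypothetical.

```python
from typing import Callable, Dict, List

# Hypothetical prompt wording; a real chain would use its own template.
PROMPT_TEMPLATE = (
    "Graph schema:\n{schema}\n\n"
    "Write an OpenCypher query that answers: {question}\n"
    "Use only the labels and properties in the schema."
)


def answer_with_graph(
    question: str,
    schema: str,
    generate_query: Callable[[str], str],
    run_query: Callable[[str], List[Dict]],
) -> Dict:
    """Answer a question by letting an LLM write a query grounded in the schema."""
    # 1. Give the LLM the graph schema alongside the question.
    prompt = PROMPT_TEMPLATE.format(schema=schema, question=question)
    # 2. The LLM turns the prompt into a query.
    query = generate_query(prompt)
    # 3. The database runs the query and the rows go back to the caller.
    rows = run_query(query)
    return {"query": query, "rows": rows}


# Stand-ins so the sketch runs without an LLM or a Neptune cluster.
def fake_llm(prompt: str) -> str:
    return "MATCH (c:Company) RETURN c.Name AS name"


def fake_graph(query: str) -> List[Dict]:
    return [{"name": "Company A"}]


schema = "Node labels: Company, Exclusion. Edge labels: hasExclusion, hasCurrentExclusion, relatedTo."
result = answer_with_graph("List the companies.", schema, fake_llm, fake_graph)
print(result["query"])
print(result["rows"])
```

A final step, omitted here, would pass the rows back to the LLM so that it can phrase the answer in natural language.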

Run the cells under the heading Ask questions. There are six questions. Let’s investigate the one under subheading Q2 – Which companies have current exclusions.

The code cell is the following:

chain.run('''Which companies have current exclusions. Hint: no need to check ExclusionEndDate''')

The LLM, having the Neptune schema at hand, generates the following query:

MATCH (c:Company)-[:hasCurrentExclusion]->(e:Exclusion) RETURN c

It looks for node c labeled Company with a hasCurrentExclusion edge to node e labeled Exclusion. Following the hint, it does not filter on ExclusionEndDate.

The query is run against the Neptune database and returns the result as JSON:

[
{'c': {'entityType': 'node', 'properties': {'Tag': 'demo', 'Location': 'California', 'Industry': 'Tech', 'Name': 'Company A'}}},
{'c': {'entityType': 'node', 'properties': {'Tag': 'demo', 'Location': 'New York', 'Industry': 'Finance', 'Name': 'Company B'}}}
]

The LLM provides a natural language response.

Based on the provided information, Company A and Company B have current exclusions.

Check this result against the test data and verify that it is correct.
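One way to make that check concrete is to recompute the expected answer directly from the test data. The sketch below is an assumption-laden illustration: it copies only the exclusion edges from the excerpt shown earlier (companies A and B) into a Python list, and the helper name companies_with_edge is hypothetical. The full data set in the notebook contains more edges.

```python
# Edges from the test data excerpt: (source company, edge label, target exclusion).
edges = [
    ("Company A", "hasCurrentExclusion", "xa2"),
    ("Company A", "hasExclusion", "xa2"),
    ("Company A", "hasExclusion", "xa1"),
    ("Company B", "hasCurrentExclusion", "xb1"),
    ("Company B", "hasExclusion", "xb1"),
]


def companies_with_edge(edge_list, label):
    """Return the sorted, de-duplicated source companies of edges with this label."""
    return sorted({source for source, edge_label, _ in edge_list if edge_label == label})


expected = companies_with_edge(edges, "hasCurrentExclusion")
print(expected)  # ['Company A', 'Company B']
```

According to the description of the test data, companies A and B are the only ones with a current exclusion, so a correct answer to this question should name exactly those two.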

The next stage

The initial stage is a success. In the next stage, you evolve the model. Part of this involves adding more classes, attributes, and relationships to address requirements. There are additional considerations that we now discuss.

Mapping initial model to LPG or RDF

The initial model avoids locking in to LPG or RDF. In fact, it is flexible enough to generate mock data in both representations. Neptune supports both, and the two forms can coexist in the same Neptune cluster. Further along in the process you must decide on the physical structure of your data as LPG or RDF.

It helped that generative AI could generate mock data, but you also need a mapping to populate instances into your graph on your own. For LPG, these instances are nodes and edges. In this section, we focus on LPG. Refer to Mapping initial model to RDF for details on mapping to RDF.

Here are rules of thumb to guide the mapping.

A class specifies the structure of a node, its label, and its properties.

Class - company

To represent a company named Octank in the banking industry and located in New York, we create a node with label Company and properties Name (value Octank), Industry (value Banking), Location (value New York).
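To populate such nodes on your own, the rule can be turned into a small generator of OpenCypher CREATE statements. This is a minimal sketch under stated assumptions: the helper names cypher_literal and create_node_statement are hypothetical, and only string and numeric property values are handled.

```python
def cypher_literal(value) -> str:
    """Render a Python string or number as an OpenCypher literal."""
    if isinstance(value, str):
        # Escape backslashes first, then single quotes.
        escaped = value.replace("\\", "\\\\").replace("'", "\\'")
        return f"'{escaped}'"
    return str(value)


def create_node_statement(label: str, properties: dict) -> str:
    """Build a CREATE statement for one node from a class name and attribute values."""
    props = ", ".join(f"{key}: {cypher_literal(val)}" for key, val in properties.items())
    return f"CREATE (:{label} {{{props}}})"


print(create_node_statement(
    "Company",
    {"Name": "Octank", "Industry": "Banking", "Location": "New York"},
))
# CREATE (:Company {Name: 'Octank', Industry: 'Banking', Location: 'New York'})
```

String building keeps the mapping visible for illustration. For real loads, query parameters are generally a safer choice than concatenating values into the statement.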

An association specifies an edge, its label, and the nodes it connects.

Association - has exclusion

Assuming the exclusion applies to Octank, create an edge with label hasExclusion whose source node is the Octank node described above. The target node is another node labeled Exclusion.

Attributes of the association are represented as edge properties. Assuming the following model, how would you represent that Octank has a partnership relationship with the company Havey Enterprises established on Jan 1, 2000?

Attributes of association

If you have nodes labeled Company representing Octank and Havey Enterprises, create an edge connecting the former to the latter. Label the edge relatedTo and assign edge properties reltype (value Partnership) and establishedDate (value 2000-01-01).
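As a sketch of that mapping, the following helper builds an OpenCypher statement that finds the two company nodes and creates the edge with its properties. The function name relate_companies_statement is hypothetical. Matching nodes by Name rather than by ID, and storing the date with the datetime function, are assumptions made for this example.

```python
def relate_companies_statement(source: str, target: str, reltype: str, established: str) -> str:
    """Build a MATCH ... CREATE statement for a relatedTo edge with two properties.

    `established` is an ISO date such as '2000-01-01'. Names are assumed to
    contain no quote characters; a real loader should use query parameters.
    """
    return (
        f"MATCH (a:Company {{Name: '{source}'}}), (b:Company {{Name: '{target}'}}) "
        f"CREATE (a)-[:relatedTo {{reltype: '{reltype}', "
        f"establishedDate: datetime('{established}T00:00:00')}}]->(b)"
    )


print(relate_companies_statement("Octank", "Havey Enterprises", "Partnership", "2000-01-01"))
```

The edge direction runs from the first company to the second, matching the direction of the association in the model.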

Reshape for performance

You might need to reshape the model to improve query performance. Considerations such as edge direction, generic versus specific edge labels, whether to model something as a node or an edge, and how to represent edge properties (if RDF) may necessitate model changes. We recommend the AWS Skill Builder course Data Modeling for Amazon Neptune for a deeper understanding of these issues.

Being a good knowledge graph citizen

Your mandate for this use case is not just to build company exclusions, but to build a knowledge graph. The essential nature of a knowledge graph is to combine multiple types of entities from several sources. There will be much more than company exclusions in the graph. With this in mind, consider the following:

Ensure your exclusion model integrates well with other data. It should use common conventions for IDs and naming, and it should use well-known properties and predicates. Can other entities link to the data in your model?
Think big. Treat the exclusion model as a data product. A consumer of a data mesh marketplace should easily understand your product and put it to use. But more importantly, think of your knowledge graph as a central place to bring data products together. Not only are you producing a data product, you’re building a mesh-scale meeting place for numerous products.

These considerations are often overlooked in the initial stage.

Mapping initial model to RDF

We have focused on LPG. For completeness, here are rules of thumb for mapping the initial model to RDF.

A class specifies the structure of a resource, its type, and its datatype properties. Suppose we have the class shown.

Class - company

To represent in RDF a company named Octank in the banking industry and located in New York, build a set of triples describing a resource known as https://myorg.com/company/octank. We can shorten the resource name to :octank by specifying that “:” is a prefix for https://myorg.com/company/. Each line has three terms: subject, predicate, object.

@prefix : <https://myorg.com/company/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .

# each line: Subject Predicate Object

:octank rdf:type :Company . # the resource has the specified type
:octank :Name "Octank" . # the resource has the specified name
:octank :Industry "Banking" . # the resource has the specified industry
:octank :Location "New York" . # the resource has the specified location
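The class-to-resource rule can also be applied mechanically. The sketch below generates Turtle for one instance from a class name and its attribute values. The function name turtle_for_instance is hypothetical, attribute values are assumed to be plain strings, and the prefix follows the example namespace used above.

```python
def turtle_for_instance(local_name: str, class_name: str, attributes: dict) -> str:
    """Emit Turtle triples for one resource: its type plus one triple per attribute."""
    lines = [
        "@prefix : <https://myorg.com/company/> .",
        "@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .",
        "",
        f":{local_name} rdf:type :{class_name} .",
    ]
    for predicate, value in attributes.items():
        # Escape backslashes and double quotes inside the string literal.
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f':{local_name} :{predicate} "{escaped}" .')
    return "\n".join(lines)


print(turtle_for_instance(
    "octank",
    "Company",
    {"Name": "Octank", "Industry": "Banking", "Location": "New York"},
))
```

Each attribute of the class becomes one triple whose subject is the resource and whose object is a literal value.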

An association represents a predicate, or object property, that connects resources. Let’s represent the has exclusion association shown in the following figure.

Association - has exclusion

In RDF, assuming the exclusion is a resource known as :octank_exclusion_1, specify the relationship using a triple:

:octank :hasExclusion :octank_exclusion_1 .

Attributes of the association are represented as statements about the relationship. Assuming the following model, how would you represent that the company Octank has a partnership relationship with company Havey Enterprises established on Jan 1, 2000?

Attributes of association

There are several ways to express this in RDF, and one of your tasks is to evaluate them to select the most suitable. One way is to define the relationship as its own resource :octank_havey_rel.

:octank :hasCompanyRel :octank_havey_rel . # octank has this rel

:octank_havey_rel :reltype "Partnership" .
:octank_havey_rel :estDate "2000-01-01" .
:octank_havey_rel :relatedCompany :haveyEnterprises . # this rel relates to havey ent

Clean up

If you set up a Neptune cluster and notebook instance to follow along, delete those resources if you wish to avoid further costs.

Conclusion

In this post, we showed how, in stage one of your graph initiative, to discover a graph data model using generative AI and diagram-as-code tools. This allows you to quickly build a starting point model and test-drive it to ensure it meets your requirements.

Generative AI accelerates discovery. The alternative is to research on your own, but the LLM uses the same corpus and provides compelling answers instantly.

Diagram-as-code tools, notably PlantUML, quickly generate useful data model diagrams that the team can rally around. Even better, in our test the LLM generated valid PlantUML code from its previous text model!

We showed how to test-drive the model. You generated test data using generative AI, loaded that data into a Neptune database, and used the LLM augmented by Neptune to run common queries. The results showed that the model supports these queries and demonstrated how to write them. Notably, you didn’t need deep OpenCypher expertise to succeed in this stage, because the LLM generated the queries.

Wrapping up, we looked ahead to the next stage. You need to assess whether to represent instances as LPG, RDF or both. Despite its simplicity and informality, the initial model maps to both LPG and RDF representations. For performance reasons, you might need to reshape the initial model. You can rise to the challenge of knowledge graphs by designing exclusions to fit with other data in the graph. Think big! Not only are you offering a data product, your knowledge graph is the meeting place of many data products.

About the Author

Mike Havey is a Senior Solutions Architect for AWS with over 25 years of experience building enterprise applications. Mike is the author of two books and numerous articles. Visit his Amazon author page.