Simplifying Graph Queries with Amazon Neptune and LangChain: Harnessing AI for Intuitive Data Exploration
(icons popping) (techno music) Hello everyone. My name's Kelvin Lawrence. I'm a Senior Principal Graph Architect at Amazon Web Services. I'm gonna be talking today about some work we've done to allow easy integration between Amazon Neptune and large language models using a framework called LangChain. And in this session, if you are not familiar with those terms, we'll introduce them all as we go through. Basically what I'm going to go through today is a quick introduction to Neptune if you've not seen it before, show you how I loaded the data for the demo today, show you one quick graph query so you get a feel for what working with Neptune is like, and then we're gonna do a deep dive into the integration between Neptune and a large language model using that LangChain framework. So imagine we've created our cluster, we've created a notebook, we wanna make sure our cluster is running so we can ask for the status. You can see there it is. If you're not familiar with Jupyter notebooks, things with a percent or a double percent in front of 'em are called magics, in Jupyter terminology. And basically it says, I'm gonna run a command. And we've added special commands relevant to Neptune to all of these notebooks. So if I wanted to load data using our bulk loader, which is loading CSV files from S3 or loading RDF files from S3, I could use this %load command to do that. But what I did earlier was I actually used a %seed command. With these notebooks we provide a set of sample data for both property graphs and RDF. Neptune, again, if you're not familiar with it, supports two different data models, RDF, the Resource Description Framework, and property graphs, and actually three query languages: SPARQL, Gremlin and OpenCypher. Today we're gonna be working with a property graph, and with OpenCypher or Gremlin you can actually write queries over the same data.
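The magics mentioned above come with the Neptune workbench notebooks. As a rough sketch (run each bare, without arguments, and the notebook presents an interactive form; exact argument syntax is in the workbench docs):

```
%status        # check that the cluster is up and reachable
%load          # opens a form to bulk-load CSV or RDF files from S3
%seed          # opens a picker for the bundled sample datasets
```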
So I actually used some data that was in Gremlin format and I loaded that earlier using the %seed command. If you hit submit here, it would load it. Again, these are free sample data sets that come with the notebook. And then once we've loaded that data, I just wanted to show you one query so you can see, when we go and look at LangChain, the data set we're actually working with. So this is an example of an OpenCypher query. These hints at the top here just control the visualization we're going to see in a moment, giving hints on how to label things. But basically what this query says, if you're not familiar, is find me an airport that has a route to another airport and also is in a country which is connected to that airport by a contains edge. So these are the nodes, these are the edges, and we want the place we're starting from to be LHR, which is London Heathrow. And we want the airports we're connecting to to be Amsterdam, JFK in New York, Tokyo Narita, or Perth in Western Australia. And then we're going to return the routes that we found. So if we run that query, we get the raw results back. But we can also look at a visualization and we can see that there's London Heathrow in the middle. You can see the airports that we're connecting to. And here are the countries. So we can see that it's 9,009 miles from London Heathrow to Perth. Perth is in Australia. And if we click on one of these nodes and bring up the details view, we can actually see information about the properties. So we can see London Heathrow, we can see the elevation, the ICAO code for the airport, things like that. So that's just a very quick introduction if you haven't seen Neptune before, and an introduction to the data we're going to be using for the first half of our LangChain exploration. So let's flip back over to the other notebook. I do a lot of my work in Jupyter notebooks. I find it a really convenient way to work.
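The query described above might look roughly like this, assuming the label and property names used by the air-routes sample dataset (`airport` and `country` nodes, `route` and `contains` edges, a `code` property); the exact query shown in the video may differ:

```cypher
MATCH (a:airport {code: 'LHR'})-[r:route]->(d:airport),
      (c:country)-[:contains]->(d)
WHERE d.code IN ['AMS', 'JFK', 'NRT', 'PER']
RETURN a, r, d, c
```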
LangChain is an open-source framework that helps you build generative AI applications. And this is a phrase I'm sure many of you, if not all of you, have now heard in the news, read about online, heard colleagues talking about. And it really means using a large language model, a foundation model, to help us generate content. So you might have played around and said, write me a poem in the style of John Lennon or something like that. And the model will try and do something like that for you. However, the models only know the state of the world at the point they were trained. So imagine I trained my model three years ago. It might even know about all the airline routes that were in existence three years ago. But then COVID happened, lots of airline routes changed, airline routes are now starting to come back, and the world is different from when those models might have been trained. So what we need is a way that we can embellish what the models can do with a source of truth, so another database that knows the absolute current information. And what we're going to look at is how, using LangChain, you can combine a graph database, in this case Amazon Neptune, with the knowledge a model has to produce a kind of chat experience, but seeded with the latest data we have available, which is gonna be coming from Neptune. So the way the LangChain integration works, very simple picture here, is a user asks a question. So I might say, how far is it from Perth to London Heathrow? Like we saw in our query just now. And what we want the model to do is not to actually answer the question, but to give us back a graph query that we can pass on to Neptune that will answer the question. And why are we interested in this? Well, we want to build an experience where the user perhaps just uses their own natural language. They use, in my case English, to ask the question how far is it from somewhere to somewhere else?
Imagine as a user, I might not know how to write that query, might not be familiar with the query language. I might not know how the data's organized. So we saw in the earlier example that there were countries and that there were airports and there were edges. Everything has labels and properties, but I might not actually know that schema, if you will, of how it's all laid out. So we can hide all that from the user using LangChain as an intermediary between the database and our large language model. So the flow you're going to see demonstrated is: I ask a question, LangChain builds up information to pass the model that essentially says, I want you to give me a graph query, in this case an OpenCypher query, that would answer that question. So I go to the model, get my query back, we then run that query against Neptune. Neptune hopefully finds the query is a good one, and can run it, and returns the result back to LangChain. LangChain then again goes back to the model and says, okay, the answer has come back as so many miles, but it's come back perhaps as a JSON result set. I want you to turn that into English so I can read it back or present it on the screen back to my human user. And that's fundamentally how the LangChain integration works. Sometimes you'll hear the acronym RAG, which stands for Retrieval Augmented Generation. And that's what we're doing here. So if you hear somebody say RAG application, which is another new buzzword you'll hear going around a lot, that is essentially what we're building here. And that integration with Neptune and LangChain is available at these links. The first link is the LangChain project itself. And then these other two have some introductory documentation and also the specific documentation for the Neptune integration. So let's see how to actually use LangChain with Neptune. Right now we're gonna be using the Python version. So all you would have to do is install it.
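The question-to-query-to-answer loop described above can be sketched end to end with the model and Neptune both stubbed out. This is a toy illustration of the flow only: the function names, canned query, and canned answer are all hypothetical stand-ins; a real application would use LangChain's `NeptuneGraph` and `NeptuneOpenCypherQAChain` against a live cluster.

```python
def fake_llm(prompt: str) -> str:
    # Stand-in for the large language model: returns a canned response
    # depending on which of the two tasks it is being asked to do.
    if "Generate an OpenCypher query" in prompt:
        return ("MATCH (a:airport {city: 'London'})-[r:route]->"
                "(b:airport {city: 'Perth'}) RETURN r.dist")
    return "The distance from London Heathrow to Perth is 9009 miles."

def fake_neptune(query: str) -> list:
    # Stand-in for Neptune: pretend we executed the generated query.
    return [{"r.dist": 9009}]

def answer(question: str, schema: str) -> str:
    # 1. Ask the model for a query (question + schema + rules in the prompt).
    query = fake_llm(
        f"Generate an OpenCypher query.\nSchema: {schema}\nQuestion: {question}"
    )
    # 2. Run the generated query against the graph database.
    rows = fake_neptune(query)
    # 3. Ask the model to phrase the raw JSON rows as natural language.
    return fake_llm(f"Turn this result into English: {rows}")

print(answer("How far is it from LHR to PER?",
             "(:airport)-[:route]->(:airport)"))
```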
So pip install langchain, and that installs everything you need to run the integration. And then all you do to get started is you give the integration the name of a Neptune cluster and the port it's available on. My Neptune cluster name is coming from an environment variable called Neptune. So behind the scenes I've loaded the name of my cluster, and then we create a graph object in LangChain that represents that Neptune graph that we're gonna be talking to. Behind the scenes, when that executes, the graph class that we are creating there goes up to Neptune and says, I need to learn all about the data you've got. And it puts together a schema that represents the data in the graph. So the sort of thing that will get sent to the model as part of the prompting we're going to do will have things like: the node properties are the following. So there's a city, it's a string; there's the number of runways each airport has, it's an integer, et cetera. And then the same for the edges, or the relationships, if there are any properties there. We will pass that information to the model. And lastly, we actually give the model the node-edge-node schema, with the edge label in the middle. So countries contain airports, airports have routes to other airports, and we pass that in as part of the prompting we're going to do with the model. Because remember, the model may have information about airports and air routes, but we don't actually want the model to use its own knowledge. We just want it to write a query given the information that we are providing. Having created our graph object, we then just create the chain from LangChain. And as part of that, you pass in a large language model. LangChain has support for a large number of different models. For our first example we're just gonna use GPT-4, and we create a QA chain, the question-answer chain, passing in the name of the model and then some additional parameters, as well as our graph.
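A minimal sketch of the schema summary described above. The LangChain setup in the comments assumes the `langchain_community` module layout; the `format_schema` helper and its rendering are illustrative, not the exact text the integration produces.

```python
# In a live notebook the setup is roughly (assuming the langchain_community API):
#   from langchain_community.graphs import NeptuneGraph
#   from langchain_community.chains.graph_qa.neptune_cypher import NeptuneOpenCypherQAChain
#   graph = NeptuneGraph(host=os.environ["NEPTUNE"], port=8182)
#   chain = NeptuneOpenCypherQAChain.from_llm(llm, graph=graph, verbose=True)
# Behind the scenes the graph object samples the data and builds a schema
# summary along these lines:

def format_schema(node_props, rel_props, patterns):
    """Render graph schema pieces into a prompt-friendly text block."""
    lines = ["Node properties are the following:"]
    for label, props in node_props.items():
        lines.append(f"  {label}: " + ", ".join(f"{p} ({t})" for p, t in props))
    lines.append("Relationship properties are the following:")
    for label, props in rel_props.items():
        lines.append(f"  {label}: " + ", ".join(f"{p} ({t})" for p, t in props))
    lines.append("The relationships are the following:")
    for src, rel, dst in patterns:
        lines.append(f"  (:{src})-[:{rel}]->(:{dst})")
    return "\n".join(lines)

schema_text = format_schema(
    node_props={"airport": [("city", "STRING"), ("runways", "INTEGER")],
                "country": [("desc", "STRING")]},
    rel_props={"route": [("dist", "INTEGER")]},
    patterns=[("country", "contains", "airport"),
              ("airport", "route", "airport")],
)
print(schema_text)
```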
Basically what I passed in here is the right parameters to give us the most verbose output. So in the examples you are going to see, we'll actually see the query and we'll see the work that the query did. You can actually hide that from the end user just by turning off some of these flags. As well as giving the model the schema, what we do is we give the model some fairly strict guidance as to what it is allowed and what it is not allowed to do. For example, we say generate the query in OpenCypher format and follow these rules. And then we have a number of rules I didn't bother to show here, such as you're allowed to use these clauses in the language but you're not allowed to use these, because OpenCypher is a subset of the full Cypher query language that's part of the Neo4j graph database. And we don't want the model trying to use things that aren't in the OpenCypher specification. So we've got a number of prompts in there that tell it you can do this, but you can't do that. We also have additional prompts that say things like, only use what you've been given. So we're giving you the relationship types and properties, we're giving you a schema, you are not allowed to use any knowledge that you yourself might have of what an airport graph or an airline graph might look like. And then the very important other prompt that we include is, don't give me any explanations or apologies. The models actually come across as very polite. They'll sometimes say things like, I'm sorry, this is all I could find. Or they might say, here's the query you were looking for, which is all very nice and friendly, but we don't actually want that, because we're gonna pass whatever comes back from the model straight to Neptune. So just a reminder of what we saw on the other screen: we've got airports connected to other airports with routes that have distances, and we've got countries that contain airports.
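Put together, the rules paraphrased above might read something like the following. This is an illustrative reconstruction, not the exact prompt text shipped with the LangChain integration:

```
Task: Generate an OpenCypher statement to query a graph database.
Instructions:
- Use only the provided relationship types and properties from the schema.
- Do not use any relationship types or properties that are not provided.
- OpenCypher is a subset of Cypher; do not use clauses outside that subset.
- Do not include any explanations or apologies in your responses.
- Do not respond with anything except the generated OpenCypher statement.
```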
So now what we can do, having fed all that to our model, we've given the model additional information in the form of a prompt, is we can actually start to interact using LangChain. So for example: what is the distance from London Heathrow to Perth nonstop? So what's happened here is that, as you saw in the diagram, the question along with the prompt was sent to the model. The model has come back and generated the Cypher. The green is what the model actually has generated. So the things in gray and in black bold type are actually just debug output, so that wouldn't actually be passed to Neptune. So what has come back from the model is this match query, which is quite good. It's saying find me an airport where the city is London, find me an airport where the city is Perth, and then we're going to find the path between A and B, which is London and Perth. And the model has figured out from the prompting we gave it that the edges, or the relationships, which is what this r is, contain a distance property, or dist, and that's what it needs to check to get that distance. So the result comes back, the distance is 9,009 miles. And the model has read that back to us. Note that even with the prompting we did, the model still did use a little bit of its own knowledge, because we didn't tell it that this was actually miles, we just gave it information that said it's an integer. So even now the model has slightly embellished the answer. And a key point there is, I think we're all still learning as we work with these models: which prompts work the best, how to prompt the model, things to tell it to do, things to sort of reiterate to do. And each model is still somewhat different. So the same set of prompts don't always work exactly as well with all of the models. So there is still quite a bit of learning going on here in what makes a good prompt. You may hear the term prompt engineering being used to describe the generation of those prompts.
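In the spirit of the description above, the model's generated query probably looked something like this sketch (the actual output in the video may differ in shape and variable names):

```cypher
MATCH (a:airport {city: 'London'})-[r:route]->(b:airport {city: 'Perth'})
RETURN r.dist
```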
And also it's important to note that the models don't always get the query right. I've seen, for example, they'll put the arrow the wrong way, they might miss a bit of the query, or they might still use a bit of knowledge they have and inject that into the query when they shouldn't. You'll actually notice here that we didn't prompt on this, but the model added this extra bit of information to the query. But overall, pretty good. If you were a user who didn't know how to write a Cypher query, an OpenCypher query, or you didn't know what the schema of the database was, everything here is a nice query that answers the question. We'll just look at a few more examples. So can I fly from Austin, where I currently live, to Sydney with no stops? There is no direct flight in my graph. And indeed there is no direct flight from Austin to Sydney. And the answer comes back to the user, having run the query: no, there are no direct flights available from Austin to Sydney. We could do the same thing with Houston to Auckland, and there are indeed routes from Houston to Auckland. So we get the answer back: yes, you can. We could say how many routes are there from Austin, and at the time I put my sample data together there were 98 in there, and the model has correctly come back with a query to count the number of routes, which is 98. And then a slightly more complicated question: find the top 10 airports sorted by the number of outgoing routes, and just give me back the airport code and the number of routes. And it had quite a bit more complicated work to do there. It had to figure out the path, and then had to sort, and then had to limit to the top 10. And as you can see, it came back with the top 10, which is then converted back into natural language. So thank you very much for spending time with us today. I hope this was interesting, and I wish you luck as you explore the world of LangChain, large language models, and hopefully Amazon Neptune.
(tranquil music) (tranquil music continues) (tranquil music continues)


Nov 22, 2024

Amazon Web Services

In this informative video, Kelvin Lawrence, Senior Principal Graph Architect at AWS, demonstrates how to integrate Amazon Neptune with large language models using LangChain. He explains the concept of Retrieval Augmented Generation (RAG) and shows how to use natural language queries to interact with graph data. Lawrence walks through creating a Neptune cluster, loading sample data, and using LangChain to generate OpenCypher queries from English questions. He showcases examples of querying airport and route data, highlighting how LangChain can bridge the gap between users unfamiliar with graph query languages and complex graph databases. This integration allows for more intuitive data exploration and showcases the power of combining graph databases with AI language models.

product-information
skills-and-how-to
generative-ai
ai-ml
databases
