AWS Database Blog

Running AWS Lambda-based applications with Amazon DocumentDB

Microservices-based application architectures are the norm for building scalable applications. AWS makes creating these types of applications easier with Amazon DocumentDB (with MongoDB compatibility). Just bring your code and deploy an application with this fast, scalable, highly available, and fully managed document database service that supports MongoDB workloads.

You can use the same MongoDB application code, drivers, and tools you do now to run, manage, and scale workloads on Amazon DocumentDB. Enjoy improved performance, scalability, and availability—without having to worry about managing the underlying infrastructure.

This post shows how to build an application that reports the top events and emotions in the news on April 26, 2019, the day the movie Avengers Endgame was released. You learn best practices for configuring and connecting an AWS Lambda application to run queries against Amazon DocumentDB, and you also use AWS Secrets Manager and Amazon API Gateway.

Overview

Shopping sites and online publications rely on content and catalog management systems to serve their customers. These systems need fast and reliable access to user reviews, images, ratings, product information, and comments. The flexible document model, data types, indexing, and ability to run powerful and complex queries offered by Amazon DocumentDB help you store and find content quickly and intuitively.

The use case for this post uses a sample dataset from the Global Database of Events, Language and Tone (GDELT) public dataset. The GDELT Project “monitors the world’s broadcast, print, and web news from nearly every corner of every country in over 100 languages and identifies the people, locations, organizations, themes, sources, emotions, counts, quotes, images, and events.”

Use the following AWS services to build your application:

  • Lambda — This service lets you run code without provisioning or managing servers. You pay only for the compute time you consume—there is no charge unless your code is running.
  • API Gateway — This fully managed service makes it easy for developers to create, publish, maintain, monitor, and secure APIs at any scale. With a few clicks in the AWS Management Console, you can create REST and WebSocket APIs. These act as a “front door” for applications to access data, business logic, or functionality from your backend services, including:
      • Workloads running on Amazon EC2
      • Code running on Lambda
      • Web applications
      • Real-time communication applications
  • Secrets Manager — This service helps you protect secrets for accessing your applications, services, and IT resources. You can easily rotate, manage, and retrieve database credentials, API keys, and other secrets throughout their lifecycle. Users and applications retrieve secrets with a call to Secrets Manager APIs, eliminating the need to hardcode sensitive information in plaintext.

Several other services make building and running serverless applications in the cloud easier. For more information, see serverless architecture.

Walkthrough

To get started, use an AWS CloudFormation template that provisions all the required resources and the AWS SAM templates used in this post. You can find the code in the amazon-documentdb-serverless-samples GitHub repo.

The solution described in this post includes the following tasks:

  1. Launch AWS CloudFormation to create resources in a VPC, which include Amazon DocumentDB, Amazon VPC, and AWS Cloud9:
    Amazon VPC lets you launch AWS resources into a virtual network that you define.
    AWS Cloud9 is a cloud-based integrated development environment (IDE) that you use to write, run, and debug your code with just a browser. This service comes with most of the tools and environments necessary to run this exercise.
  2. Set up the Secrets Manager integration with Amazon DocumentDB, as described in How to rotate Amazon DocumentDB and Amazon Redshift credentials in AWS Secrets Manager.
  3. Log in to the AWS Cloud9 environment and download the code packages and libraries from GitHub.
  4. Download the PyMongo library and set up a Lambda layer.
  5. Load the sample GDELT dataset from the AWS public data registry to Amazon DocumentDB.
  6. Deploy API Gateway and the Lambda AWS SAM template to provision AWS resources.
  7. Access API Gateway to run sample queries against Amazon DocumentDB.

AWS CloudFormation creates the basic environment necessary to load data into Amazon DocumentDB, then review and deploy the Lambda function and the API Gateway code. There are two major steps:

  1. Processing and loading the GDELT data into the document database.
  2. Creating the Lambda function and API Gateway API to run queries against the document database.

Processing and loading the GDELT data into Amazon DocumentDB

This step includes the actions shown in the following diagram:

  1. The Python program pulls the event data from the GDELT site. The dataset is in compressed CSV format at the following URL:
    http://data.gdeltproject.org/events/{yyyymmdd}.export.CSV.zip
  2. The data is uncompressed and parsed line by line, and each row is converted into a JSON document that is stored in an Amazon DocumentDB collection.
  3. The data is submitted in batches to Amazon DocumentDB. After conversion, the sample JSON document looks like the following code:
{  
   'Actor2KnownGroupCode':'',
   'DATEADDED':'20190426',
   'Actor1Geo_FeatureID':'1659564',
   'Actor2Geo_FeatureID':'',
   'GoldsteinScale':'3.4',
   'Actor1Type2Code':'',
   'Actor1CountryCode':'USA',
   'Actor2Geo_Type':'0',
   'NumArticles':10,
   'IsRootEvent':'0',
   'ActionGeo_CountryCode':'US',
   'Actor1KnownGroupCode':'',
   'Actor2Geo_Long':'',
   'Actor1Geo_ADM1Code':'USCA',
   'QuadClass':'1',
   'Actor1Geo_CountryCode':'US',
   'AvgTone':3.90707497360085,
   'Actor1Religion2Code':'',
   'FractionDate':'2009.3233',
   'Actor2Geo_CountryCode':'',
   'Actor1EthnicCode':'',
   'SQLDATE':'20090428',
   'ActionGeo_Long':'-121.494',
   'Actor2Type3Code':'',
   'Actor2Geo_FullName':'',
   'Actor1Type1Code':'',
   'Actor1Code':'USA',
   'SOURCEURL':'https://thenextweb.com/podium/2019/04/25/an-entrepreneurs-guide-to-sacramentos-startup-scene/',
   'MonthYear':'200904',
   'NumSources':1,
   'ActionGeo_Lat':'38.5816',
   'Actor1Type3Code':'',
   'Actor2Name':'',
   'Actor2Type2Code':'',
   'ActionGeo_ADM1Code':'USCA',
   'Actor2Religion1Code':'',
   'Actor1Geo_Lat':'38.5816',
   'Actor2Geo_Lat':'',
   'NumMentions':10,
   'Actor2EthnicCode':'',
   'EventRootCode':'05',
   'Actor1Name':'SACRAMENTO',
   'ActionGeo_FullName':'Sacramento, California, United States',
   'GLOBALEVENTID':'840976753',
   'Actor2CountryCode':'',
   'EventCode':'051',
   'Actor2Code':'',
   'Actor2Type1Code':'',
   'EventBaseCode':'051',
   'Actor1Geo_Type':'3',
   'ActionGeo_Type':'3',
   'Actor1Geo_Long':'-121.494',
   'ActionGeo_FeatureID':'1659564',
   'Actor1Religion1Code':'',
   'Actor2Religion2Code':'',
   'Actor1Geo_FullName':'Sacramento, California, United States',
   'Year':'2009',
   'Actor2Geo_ADM1Code':'',
   '_id':ObjectId('5cd24827ca0e26e6107da9dc')
}  
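The three steps above can be sketched in Python as follows. This is a minimal illustration, not the repository's GDELTDataParser.py: the function names, the three-column header, and the batch size are hypothetical, and the sketch assumes the export file is tab-delimited with no header row.

```python
import csv
import io
from itertools import islice

GDELT_URL = "http://data.gdeltproject.org/events/{date}.export.CSV.zip"


def export_url(load_date):
    """Build the download URL for one day of events (date as YYYYMMDD)."""
    return GDELT_URL.format(date=load_date)


def row_to_document(columns, row, numeric_fields):
    """Convert one parsed row into a dict, casting the listed numeric fields."""
    doc = dict(zip(columns, row))
    for field in numeric_fields:
        value = doc.get(field, "")
        if value != "":
            number = float(value)
            # Keep whole numbers as int (NumMentions), others as float (AvgTone).
            doc[field] = int(number) if number.is_integer() else number
    return doc


def batches(items, size):
    """Yield lists of at most `size` items, suitable for a bulk insert."""
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


# Hypothetical three-column subset of the GDELT header, for illustration only.
columns = ["GLOBALEVENTID", "NumMentions", "AvgTone"]
sample = "840976753\t10\t3.90707497360085\n840976754\t4\t-1.5\n"
rows = csv.reader(io.StringIO(sample), delimiter="\t")
docs = [row_to_document(columns, r, ["NumMentions", "AvgTone"]) for r in rows]

for chunk in batches(docs, 1):
    # With PyMongo, each chunk could be written with collection.insert_many(chunk).
    print(chunk)
```

Casting only the fields listed in gdelt_numeric_fields keeps the other columns as strings, which matches the sample document shown earlier.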

Creating the Lambda function and API Gateway to run queries against Amazon DocumentDB

To query the GDELT data stored in Amazon DocumentDB, this post uses API Gateway with Lambda proxy integration. You pass the query string in a GET or POST request, and the Lambda function processes it. This way, you can search Amazon DocumentDB to serve a variety of use cases.

The Lambda function is deployed in a VPC so that it can access Amazon DocumentDB. Store the credentials in Secrets Manager, which stores and retrieves sensitive information such as passwords for various services (including Amazon DocumentDB). The Lambda function uses a VPC interface endpoint to fetch the Amazon DocumentDB credentials from Secrets Manager.
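As a rough sketch of that flow, the following Python code fetches a secret and turns it into a connection string. It is not the code from the sample repository. The function names are hypothetical, and the sketch assumes that the secret's JSON payload contains username, password, host, and port keys. The ssl_ca_certs option name applies to PyMongo 3.x; PyMongo 4 uses tlsCAFile instead.

```python
import json
from urllib.parse import quote_plus


def fetch_secret_string(secret_name, region_name):
    """Return the raw secret JSON from Secrets Manager (needs boto3 and AWS access)."""
    import boto3  # imported lazily so the URI helper below works without boto3

    client = boto3.client("secretsmanager", region_name=region_name)
    return client.get_secret_value(SecretId=secret_name)["SecretString"]


def build_docdb_uri(secret_string, ca_file="rds-combined-ca-bundle.pem"):
    """Build a TLS-enabled MongoDB connection URI from the secret's JSON payload."""
    secret = json.loads(secret_string)
    # Percent-encode the credentials so characters such as @ or / stay valid in a URI.
    user = quote_plus(secret["username"])
    password = quote_plus(secret["password"])
    return (
        f"mongodb://{user}:{password}@{secret['host']}:{secret['port']}"
        f"/?ssl=true&ssl_ca_certs={ca_file}"
    )


# Made-up secret payload for illustration; real values come from fetch_secret_string().
example = json.dumps(
    {
        "username": "docdbadmin",
        "password": "p@ss/word",
        "host": "docdb.example.internal",
        "port": 27017,
    }
)
uri = build_docdb_uri(example)
print(uri)
```

The resulting string can be passed to pymongo.MongoClient. Creating the client outside the handler function lets warm Lambda invocations reuse the connection.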

Prerequisites

To complete this solution, you should have the following:

  • An online publication website that shows the latest trends for events and movies. You can store this information as a JSON document in Amazon DocumentDB and run a rich set of queries from applications hosted in Lambda.
  • An AWS account that provides access to the services needed in this post. The steps are performed in the US West (Oregon) Region. Before you start, make sure that the services used in this post are available in your AWS Region.
  • IAM permissions to provision and manage the resources in the post.
  • A working knowledge of Amazon DocumentDB, Amazon VPC, Lambda, and AWS CloudFormation.

Steps

The following sections walk you through the steps required to create and configure the solution.

Step 1: Launching the AWS CloudFormation template

Provision the VPC with two private and public subnets, the AWS Cloud9 environment, the VPC endpoint for Secrets Manager, and an Amazon DocumentDB cluster with one instance in the VPC:

  1. Copy or download the AWS CloudFormation template from the GitHub repo.
  2. In the AWS CloudFormation console, choose Create Stack.
  3. Launch a template by uploading the JSON file created earlier.
  4. Specify the mandatory parameter values: a stack name (for example, DocDBstack), DBClusterName, DBInstanceName, Master User, MasterPassword, and DBInstanceClass.
  5. The rest of the parameter values are optional. You can leave them as the default settings.
  6. Choose Next, Create.

Note

The AWS CloudFormation stack creation takes about 10–15 minutes. Note the password for the Amazon DocumentDB master user and the cluster endpoint, Vpcid, SecurityGroupId, and SubnetId values in the AWS CloudFormation output section. You use them in subsequent steps.

Step 2: Setting up Secrets Manager integration with Amazon DocumentDB

Create a new secret for the Amazon DocumentDB master user and password. Follow the steps until “Phase 1” in How to rotate Amazon DocumentDB and Amazon Redshift credentials in AWS Secrets Manager.

The following screenshot shows the screen after creating the secret. Note the Secret name, as you use it later to configure the serverless application.

Amazon DocumentDB requires that the applications run in the same VPC, which means that the Lambda function deploys with VPC settings. As a best practice, use the VPC interface endpoint to connect to the Secrets Manager to retrieve Amazon DocumentDB credentials. For more information about setting up a VPC endpoint for Secrets Manager, see How to connect to AWS Secrets Manager service within a Virtual Private Cloud. In this post, the VPC endpoint is preconfigured as part of the AWS CloudFormation stack.

Step 3: Logging in to the AWS Cloud9 environment and setting up the environment

Use the AWS Cloud9 environment for configuring and deploying the solution.

In the AWS CloudFormation console, under Outputs, get the AWS Cloud9 URL. Then log in and switch to the root user:

$ sudo su -  #login to root

Clone the packages from the amazon-documentdb-serverless-samples GitHub repository using the following command:

# git clone https://github.com/aws-samples/amazon-documentdb-serverless-samples.git

# cd amazon-documentdb-serverless-samples

To encrypt data in transit, download the public key for Amazon DocumentDB as follows. Place this in the root folder (/root/amazon-documentdb-serverless-samples).

# wget https://s3.amazonaws.com/rds-downloads/rds-combined-ca-bundle.pem

The repo file structure is shown in the following screenshot.

The root folder is your environment root folder. It may be different depending on the location from which you cloned the git repository.

The Gdelt_data folder is created when you run the GDELTDataParser file.

The sam-app folder holds the application code for the Lambda function and the template that creates the resources needed to deploy it.

The Tests folder is created by default when you use the sam init command.

Step 4: Downloading the PyMongo library and setting up a Lambda layer

Use the PyMongo driver to connect to Amazon DocumentDB from a Lambda function. PyMongo is licensed under the Apache License 2.0. Upload the driver as a Lambda layer and reference it from your Lambda function, which helps you manage dependencies more effectively.

Use the following steps in your AWS Cloud9 environment:

  1. Install the PyMongo driver in the AWS Cloud9 environment, and then package it and upload it as a Lambda layer:

$ sudo su -

# pip install pymongo

# cd /tmp/

# mkdir pymongolayer

# cd pymongolayer/

  2. Create the folder structure required for the Python layer as follows (an exported variable makes the path reusable):

# export LAYER=python/lib/python3.6/site-packages

# pip install -t $LAYER pymongo==3.6

# zip -r pymongolayer.zip *

  3. Copy the pymongolayer.zip file to your Amazon S3 bucket, and then publish it as a Lambda layer by following the instructions in AWS Lambda Layers. Use the aws lambda publish-layer-version command that follows, or use the console.

Before running the command, configure your CLI environment with aws configure:

# aws lambda publish-layer-version --layer-name pymongolayer --description "DocumentDB Python connectivity" --license-info "Apache-2.0" \
    --content S3Bucket=<bucketname>,S3Key=pymongolayer.zip --compatible-runtimes python2.7 python3.6

Note the ARN of the Lambda layer, either from the console or by using the Lambda list-layers command. You need it in Step 6 when you configure the template.yaml file for creating the serverless application.
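Lambda only finds a Python layer's packages if the archive uses the expected folder layout, so a quick check of the zip file can save a failed deployment. The following sketch is one way to do that. The function name is hypothetical, and the prefix is the LAYER path used earlier in this step.

```python
import io
import zipfile

LAYER_PREFIX = "python/lib/python3.6/site-packages/"


def layer_layout_ok(archive_file, prefix=LAYER_PREFIX):
    """Return True if the archive has files and all of them sit under the prefix."""
    with zipfile.ZipFile(archive_file) as archive:
        # Skip directory entries; only file paths matter for the check.
        files = [name for name in archive.namelist() if not name.endswith("/")]
    return bool(files) and all(name.startswith(prefix) for name in files)


def make_archive(paths):
    """Build a small in-memory zip with empty files, for demonstration only."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for path in paths:
            archive.writestr(path, "")
    buffer.seek(0)
    return buffer


good = layer_layout_ok(make_archive([LAYER_PREFIX + "pymongo/__init__.py"]))
bad = layer_layout_ok(make_archive(["pymongo/__init__.py"]))
print(good, bad)
```

To check the real archive, pass the file path instead of an in-memory buffer, for example layer_layout_ok("/tmp/pymongolayer/pymongolayer.zip").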

Step 5: Loading the sample dataset from GDELT to Amazon DocumentDB

Load the GDELT data as a JSON document in Amazon DocumentDB. Make sure that you enter the configuration details correctly.

From the downloaded source code, edit the following file to update the properties before you run the scripts to load:

/gdelt_parse_config.properties

This file contains the properties required to pull and load the data into the document database. This is a config file and needs to be updated as follows:

#Start with the Default config declaration.
[DEFAULT] 
#Config parameters.
#Enter the date in YYYYMMDD. This is the date file used to pull the file from the GDELT data store. The format has to match yyyymmdd as follows:
gdelt_load_date=20190426
#Enter the Region name. This demo uses us-west-2.
region_name=us-west-2
#For the port number, leave it as-is, if you have set default options while creating the document database.
document_db_port=27017
#Copy the cluster endpoint (DB host) and paste it here. Refer to the CloudFormation Outputs section.
docdb_host=docdb-xxxxxxxxxxxxxxxxxxx.cluster-xxxxxxxxxxxx.us-west-2.docdb.amazonaws.com
#The credentials that you have created for the document database
docdb_username=xxxxxxxx
docdb_password=xxxxxxxx
#The pem file is used for secure communication with DocumentDB. Leave it as is unless you changed the pem key file name after downloading the repo.
pem_locator=rds-combined-ca-bundle.pem
#Enter the fields that need to be parsed / entered as numeric fields. Default to the following ones for this demo. 
gdelt_numeric_fields=["NumSources","NumArticles","AvgTone","NumMentions"]
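A file in this format can be read with Python's standard configparser module. The following is a minimal sketch under that assumption, and the repository's loader may read the file differently. The load_settings function and the keys of the dictionary it returns are hypothetical names.

```python
import configparser
import json

SAMPLE = """\
[DEFAULT]
# Date of the GDELT export to load, as YYYYMMDD.
gdelt_load_date=20190426
region_name=us-west-2
document_db_port=27017
pem_locator=rds-combined-ca-bundle.pem
gdelt_numeric_fields=["NumSources","NumArticles","AvgTone","NumMentions"]
"""


def load_settings(text):
    """Parse the properties text and return typed settings."""
    # Disable interpolation so a % character in a password is read literally.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    defaults = parser["DEFAULT"]
    return {
        "load_date": defaults["gdelt_load_date"],
        "region": defaults["region_name"],
        "port": defaults.getint("document_db_port"),
        "pem": defaults["pem_locator"],
        # The field list is stored as a JSON array, so decode it into a Python list.
        "numeric_fields": json.loads(defaults["gdelt_numeric_fields"]),
    }


settings = load_settings(SAMPLE)
print(settings)
```

Because values are read verbatim, stray characters such as a trailing period after the Region name become part of the value, so it is worth checking each entry before running the load.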

For illustrative purposes, download the GDELT data for 4/26/2019. You can modify the date to load the data based on your needs.

Load the sample dataset from GDELT to Amazon DocumentDB by executing the following Python file GDELTDataParser.py. This action creates a new database with the name GDELT_DB and a collection GDELT_COLL and loads the data into the collection.

#python GDELTDataParser.py

Verify that the data load completed successfully. You can use the Mongo shell to connect and run queries against the document database. For more information on setting up the Mongo shell, see Step 3: Access and Use Your Amazon DocumentDB Cluster Using the mongo Shell. For a quick reference, use the following sample commands:

rs0:PRIMARY> show dbs

rs0:PRIMARY> use GDELT_DB

rs0:PRIMARY> show collections

rs0:PRIMARY> db.GDELT_COLL.count();

Alternatively, use the DocumentDBActions.py file and uncomment the queryTest() call, and see the data as it prints on your console. The data depends on which day you chose to pull the GDELT data.

If you already ran the data load script and want to start from the beginning, first delete the data in the document database. To do so, use Amazon DocumentDB DML commands or the supplied DocumentDBActions.py file as follows:

  1. Go to the end of the file, uncomment the cleanupDb() call, and comment out the others.
  2. Save and run this Python file, using the following command:

#python DocumentDBActions.py

Step 6: Deploying the AWS SAM template to provision a serverless application

In this step, you deploy a serverless application using an AWS SAM template. The AWS SAM template file is a YAML or JSON configuration file that adheres to the open-source AWS SAM specification. Use the template to declare all of the AWS resources that comprise your serverless application.

Deploy the AWS SAM template for the serverless application:

  1. In your AWS Cloud9 environment, go to the sam-app folder (/root/amazon-documentdb-serverless-samples/sam-app/).
  2. Enter the following command:

# cd document_db_app

To encrypt data in transit, download the following public key for Amazon DocumentDB. Place it in the document_db_app folder (/root/amazon-documentdb-serverless-samples/sam-app/document_db_app). This pem file is packaged and deployed as part of the serverless application.

# wget https://s3.amazonaws.com/rds-downloads/rds-combined-ca-bundle.pem

Now go back to the ../sam-app folder. Open the template.yaml file and make the following edits (secret, VPC, and layer information):

Parameters:
  StageName:
    Type: String
    Default: demo    # Name this something to represent a stage.
    Description: The Lambda function and API Gateway stage.
  FunctionName:
    Type: String
    Default: DocumentDBLambdaExample    # The name for the function.
    Description: The Lambda function name
  LambdaLayerArn:
    Type: String
    Default: arn:aws:lambda:us-west-2:XXXXXXXXXXXX:layer:<LAYER NAME>:<version>
    Description: Copy the ARN of the PyMongo Lambda layer from Step 4.
  SecretManagerName:
    Type: String
    Default: docdbcreds
    Description: Enter the name you have given for the DocumentDB secret in Secrets Manager.

      VpcConfig:
        SecurityGroupIds:    # Use the Security Group ID, SubnetId details from the CloudFormation Outputs section in Step 1.
          - XXXXXXXXx
        SubnetIds:
          - subnet-xxxxxxxxx
          - subnet-xxxxxxxxx

After you make these changes, invoke the following command in the AWS Cloud9 shell:

# sam validate

Because YAML is sensitive to indentation, make sure that the template is valid. After the template validates, run the following commands to package and deploy the serverless app:

# sam package --s3-bucket <s3 bucketname> --output-template-file packaged.yaml

# sam deploy --template-file packaged.yaml --stack-name docdb-serverlessapp-v1 --capabilities CAPABILITY_IAM

The S3 bucket should be in the same Region as the AWS SAM template deployment Region. If you don’t have an existing S3 bucket, create a new one before running the previous commands.

The AWS SAM deployment provisions an AWS CloudFormation stack with the Lambda function, IAM roles, and API Gateway with Lambda proxy integration. Verify that the stack creates successfully, and then proceed to the next step.

Step 7: Accessing API Gateway to run sample queries against Amazon DocumentDB

For illustrative purposes, your Lambda function is designed to query the document database to provide information on categories such as the greatest number of mentions, total number of events, and the most-talked-about event. Pass the query arguments using the API Gateway endpoint, which invokes the Lambda function to retrieve the information.
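One way to map these query phrases to database operations is sketched below. The phrases come from the use cases in this step, and the field names (NumMentions, SOURCEURL) come from the sample document. Everything else is an assumption: the function names, the choice of aggregation stages, and matching the keyword against SOURCEURL are illustrative and may differ from the repository's Lambda code.

```python
import re

# Matches phrases such as "number of mentions {avengers}" and captures the keyword.
MENTIONS = re.compile(r"number of mentions \{(.+)\}", re.IGNORECASE)


def most_mentioned_pipeline():
    """Aggregation stages that return the single document with the most mentions."""
    return [
        {"$sort": {"NumMentions": -1}},
        {"$limit": 1},
        {"$project": {"_id": 0, "SOURCEURL": 1, "NumMentions": 1}},
    ]


def mentions_pipeline(keyword):
    """Aggregation stages that sum NumMentions for URLs containing the keyword."""
    return [
        {"$match": {"SOURCEURL": {"$regex": re.escape(keyword), "$options": "i"}}},
        {"$group": {"_id": None, "total": {"$sum": "$NumMentions"}}},
    ]


def build_query(dbquery):
    """Translate a query phrase into an (operation, pipeline) pair."""
    text = dbquery.strip()
    if text.lower() == "total number of events":
        return ("count", None)
    if text.lower() == "most talked event":
        return ("aggregate", most_mentioned_pipeline())
    match = MENTIONS.fullmatch(text)
    if match:
        return ("aggregate", mentions_pipeline(match.group(1)))
    raise ValueError("Unsupported query: " + dbquery)


print(build_query("number of mentions {avengers}"))
```

With PyMongo, an "aggregate" result can be run with collection.aggregate(pipeline), and the "count" case maps to a document count on the collection.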

To do this, use the simple, built-in test functionality of API Gateway or use the following steps:

  1. Open the API Gateway console and find the DocumentDBQueryExample API created as part of this deployment.
  2. Go to Stages, Demo stage (ignore the default stage).
  3. Choose the Test

This demo supports both the GET and POST methods. With the GET method, send the query as a query string parameter. With the POST method, send the same value in the request body. Use the same values as those shown in the use cases that follow. Using the POST method may make it easier to implement complex queries to the document database.
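With Lambda proxy integration, API Gateway passes the HTTP method, query string parameters, and body to the function in the event, and expects a response with a statusCode and a body. The following handler sketch shows how both methods can feed the same dbquery value. The helper names and response messages are hypothetical, and the DocumentDB call is left as a placeholder.

```python
import json


def extract_dbquery(event):
    """Read dbquery from the POST body or, for other methods, the query string."""
    if event.get("httpMethod") == "POST":
        body = json.loads(event.get("body") or "{}")
        return body.get("dbquery")
    # queryStringParameters is None when the request has no query string.
    params = event.get("queryStringParameters") or {}
    return params.get("dbquery")


def make_response(message, status=200):
    """Build the response shape that Lambda proxy integration expects."""
    return {
        "statusCode": status,
        "headers": {"Content-Type": "text/plain"},
        "body": message,
    }


def lambda_handler(event, context):
    dbquery = extract_dbquery(event)
    if not dbquery:
        return make_response("Missing dbquery parameter", 400)
    # Placeholder: the real function would run the query against DocumentDB here.
    return make_response("Received query: " + dbquery)


get_event = {"httpMethod": "GET", "queryStringParameters": {"dbquery": "most talked event"}}
post_event = {"httpMethod": "POST", "body": json.dumps({"dbquery": "total number of events"})}
print(lambda_handler(get_event, None))
print(lambda_handler(post_event, None))
```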

Use case 1: Query to find the most-mentioned article on that day (Method: GET)

To test this function for the most-mentioned article on that day, copy the stage URL for the GET method. Open a browser and invoke the URL along with the query parameters similar to the following:

https://<API URL from the stage Demo>/demo?dbquery=most talked event

This should return the following response on the browser:

Most mentioned article (4398 Times) was https://www.washtimesherald.com/news/national_news/quarantines-at-la-universities-amid-us-measles-outbreak/article_86fa321a-b81c-58f1-9d55-91c67f70bde5.html
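Browsers encode the spaces in the query parameter automatically, but a script that calls the API has to do this itself. The following sketch builds the encoded URL with the standard library. The host name is a made-up placeholder for your stage URL, and build_query_url is a hypothetical helper.

```python
from urllib.parse import quote, urlencode


def build_query_url(stage_url, dbquery):
    """Append a percent-encoded dbquery parameter to the stage URL."""
    # quote_via=quote encodes spaces as %20 rather than the default +.
    query = urlencode({"dbquery": dbquery}, quote_via=quote)
    return stage_url.rstrip("/") + "?" + query


# Placeholder stage URL; replace it with the URL of your demo stage.
url = build_query_url(
    "https://example.execute-api.us-west-2.amazonaws.com/demo",
    "most talked event",
)
print(url)
```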

Use case 2: Query to find the total number of events or most talked about event (Method: POST)

To test this function for the total number of events, send the following JSON in the request body:

{
  "dbquery" : "total number of events"
}

Response:

Total number of events for the day reported 178000

To test this function for the most talked about event, send the following JSON in the request body:

{
  "dbquery" : "Most Talked Event"
}

Response:

Most mentioned article (4398 times) was https://www.washtimesherald.com/news/national_news/quarantines-at-la-universities-amid-us-measles-outbreak/article_86fa321a-b81c-58f1-9d55-91c67f70bde5.html

Use case 3: Query to find the total number of mentions on Avengers Endgame (Method: POST)

To test this function for the number of times an event was mentioned, send the following JSON in the request body:

{
  "dbquery" : "number of mentions {avengers}"
}

Response:

Avengers Endgame was mentioned 300 times

Cleaning up

To avoid incurring future charges, delete the following AWS resources:

  • The AWS CloudFormation stack created in Step 1
  • The secret created in Step 2
  • The AWS CloudFormation stack created by the AWS SAM template in Step 6

Conclusion

In this post, you saw how to build and run a microservices-based application using API Gateway and Lambda to connect to Amazon DocumentDB. You also used the GDELT dataset and the code to download and convert the CSV dataset into a JSON document and store it in Amazon DocumentDB.

You also saw best practices for:

  • Setting up Amazon DocumentDB connectivity
  • Integrating with Secrets Manager for credentials management
  • Deploying a serverless application using an AWS SAM template

 


About the Authors

 

Raj Chilakapati is a Sr. Solutions Architect helping AWS customers build their infrastructure and applications on the cloud. When not at work, music and singing keep him busy.

Gowri Balasubramanian is a Principal Database Solutions Architect at Amazon Web Services.