Using the Amazon Redshift Data API to interact from an Amazon SageMaker Jupyter notebook
The Amazon Redshift Data API makes it easy for any application written in Python, Go, Java, Node.js, PHP, Ruby, or C++ to interact with Amazon Redshift. Traditionally, these applications use JDBC connectors to connect, send a query to run, and retrieve results from the Amazon Redshift cluster. This requires extra steps like managing the cluster credentials and configuring the VPC subnet and security group.
In some use cases, you don’t want to manage connections or pass credentials on the wire. The Data API simplifies these steps so you can focus on data consumption instead of managing resources such as the cluster credentials, VPCs, and security groups.
This post demonstrates how you can connect an Amazon SageMaker Jupyter notebook to the Amazon Redshift cluster and run Data API commands in Python. The in-place analysis is an effective way to pull data directly into a Jupyter notebook object. We provide sample code to demonstrate in-place analysis by fetching Data API results into a Pandas DataFrame for quick analysis. For more information about the Data API, see Using the Amazon Redshift Data API to interact with Amazon Redshift clusters.
After exploring the mechanics of the Data API in a Jupyter notebook, we demonstrate how to implement a machine learning (ML) model in Amazon SageMaker, using data stored in the Amazon Redshift cluster. We use sample data to build, train, and test an ML algorithm in Amazon SageMaker. Finally, we deploy the model in an Amazon SageMaker instance and draw inferences.
Using the Data API in a Jupyter notebook
Jupyter Notebook is a popular data exploration tool primarily used for ML. To work with ML-based analysis, data scientists pull data from sources like websites, Amazon Simple Storage Service (Amazon S3), and databases using Jupyter notebooks. Many Jupyter Notebook users prefer Amazon Redshift as the primary source of truth for their organization's data warehouse, combined with event data stored in an Amazon S3 data lake.
When you use Amazon Redshift as a data source in Jupyter Notebook, the aggregated data is visualized first for preliminary analysis, followed by extensive ML model building, training, and deployment. Jupyter Notebook connects and runs SQL queries on Amazon Redshift using a Python-based JDBC driver. Data extraction via JDBC drivers poses the following challenges:
- Dealing with driver installations, credentials and network security management, connection pooling, and caching the result set
- Additional administrative overhead to bundle the drivers into the Jupyter notebook before sharing the notebook with others
The Data API simplifies these management steps. Jupyter Notebook is pre-loaded with libraries needed to access the Data API, which you import when you use data from Amazon Redshift.
To provision the resources for this post, you launch the following AWS CloudFormation stack:
The CloudFormation template is tested in the us-east-2 Region. It launches a 2-node DC2.large Amazon Redshift cluster to work on for this post. It also launches an AWS Secrets Manager secret and an Amazon SageMaker Jupyter notebook instance.
The following screenshot shows the Outputs tab for the stack on the AWS CloudFormation console.
The Secrets Manager secret is updated with cluster details required to work with the Data API. An AWS Lambda function is spun up and run during the launch of the CloudFormation template to update the secret (it receives input from the launched Amazon Redshift cluster). The following code updates the secret:
Working with the Data API in Jupyter Notebook
In this section, we walk through the details of working with the Data API in a Jupyter notebook.
- On the Amazon SageMaker console, under Notebook, choose Notebook instances.
- Locate the notebook you created with the CloudFormation template.
- Choose Open Jupyter.
This opens up an empty Amazon SageMaker notebook page.
- Download the file RedshiftDeepAR-DataAPI.ipynb to your local storage.
- Choose Upload.
Importing Python packages
We first import the necessary boto3 package. A few other packages are also relevant for the analysis, which we import in the first cell. See the following code:
The Data API calls an HTTPS endpoint. Because ExecuteStatement Data API calls are asynchronous, we need a custom waiter. See the following code:
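Because ExecuteStatement only queues the query and returns a statement Id, the notebook needs a helper that polls DescribeStatement until the statement reaches a terminal status. The following polling function is a sketch of what such a custom waiter does; the default delay and attempt count are illustrative and can be tuned per statement type:

```python
import time


def wait_for_statement(client, statement_id, delay=1, max_attempts=30):
    """Poll DescribeStatement until the statement reaches a terminal state."""
    for _ in range(max_attempts):
        desc = client.describe_statement(Id=statement_id)
        status = desc["Status"]
        if status == "FINISHED":
            return desc
        if status in ("FAILED", "ABORTED"):
            raise RuntimeError(
                f"Statement {statement_id} ended with status {status}: "
                f"{desc.get('Error', 'no error message')}"
            )
        time.sleep(delay)
    raise TimeoutError(
        f"Statement {statement_id} did not finish within "
        f"{delay * max_attempts} seconds"
    )
```

A longer-running statement, such as a COPY, would be called with a larger `delay` or `max_attempts`.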
Retrieving information from Secrets Manager
We need to retrieve the following information from Secrets Manager for the Data API to use:
- Cluster identifier
- Secrets ARN
- Database name
Retrieve the above information using the following code:
We now create the Data API client. For the rest of the notebook, we use the Data API client client_redshift. See the following code:
Listing the schema and tables
To list the schema, enter the following code:
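The Data API exposes a dedicated ListSchemas operation, so no SQL is needed. A sketch, assuming `cluster_id`, `db`, and `secret_arn` hold the values read from Secrets Manager:

```python
def list_schemas(client, cluster_id, database, secret_arn):
    """Return all schema names in the database via the Data API."""
    resp = client.list_schemas(
        ClusterIdentifier=cluster_id,
        Database=database,
        SecretArn=secret_arn,
    )
    return resp["Schemas"]

# Usage (values come from the Secrets Manager step):
# print(list_schemas(client_redshift, cluster_id, db, secret_arn))
```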
The following screenshot shows the output.
To list the tables, enter the following code:
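Likewise, the ListTables operation returns table metadata directly, optionally filtered by a schema pattern:

```python
def list_tables(client, cluster_id, database, secret_arn, schema_pattern="%"):
    """Return table metadata via the Data API's ListTables operation."""
    resp = client.list_tables(
        ClusterIdentifier=cluster_id,
        Database=database,
        SecretArn=secret_arn,
        SchemaPattern=schema_pattern,
    )
    return resp["Tables"]

# Usage:
# print(list_tables(client_redshift, cluster_id, db, secret_arn))
```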
The following screenshot shows the output.
Creating the schema and table
Before you issue any SQL statement to the Data API, we instantiate the custom waiter. See the following code:
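The DDL below is illustrative: the notebook's actual schema, table name, and column list may differ. Each statement is submitted with execute_statement, and the statement Id it returns is what the custom waiter polls to completion before the next statement runs:

```python
# Hypothetical schema/table for the NYC Green taxi data.
create_schema_sql = "CREATE SCHEMA IF NOT EXISTS taxischema;"

create_table_sql = """
CREATE TABLE IF NOT EXISTS taxischema.nyc_greentaxi (
    vendorid               INT,
    lpep_pickup_datetime   TIMESTAMP,
    lpep_dropoff_datetime  TIMESTAMP,
    passenger_count        INT,
    trip_distance          FLOAT,
    total_amount           FLOAT
);
"""


def submit(client, sql, cluster_id, database, secret_arn):
    """Submit one statement and return its Id for the waiter to poll."""
    return client.execute_statement(
        ClusterIdentifier=cluster_id,
        Database=database,
        SecretArn=secret_arn,
        Sql=sql,
    )["Id"]
```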
Loading data into the cluster
After we create the table, we’re ready to load some data into it. The following code loads Green taxi cab data from two different Amazon S3 locations using individual COPY statements that run in parallel:
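The COPY statements look roughly like the following. The S3 prefixes point at the public NYC TLC trip-record dataset, but the exact prefixes, the IAM role ARN, and the COPY options here are placeholders for the values your own setup uses. Submitting both statements back to back (without waiting in between) lets the loads run in parallel:

```python
# Hypothetical COPY template; replace the role ARN with the role attached
# to your cluster, and adjust the S3 prefix and options as needed.
copy_template = """
COPY taxischema.nyc_greentaxi
FROM 's3://nyc-tlc/trip data/green_tripdata_{year}'
IAM_ROLE '<redshift-iam-role-arn>'
CSV IGNOREHEADER 1 DATEFORMAT 'auto' REGION 'us-east-1';
"""

copy_2019 = copy_template.format(year="2019")
copy_2020 = copy_template.format(year="2020")
```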
Performing in-place analysis
We can run the Data API to fetch the query result into a Pandas DataFrame. This simplifies the in-place analysis of the Amazon Redshift cluster data because we bypass unloading the data first into Amazon S3 and then loading it into a Pandas DataFrame.
The following query lists records loaded in the table nyc_greentaxi by year and month:
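A sketch of that query, plus a small helper that turns a GetStatementResult payload into a Pandas DataFrame (the result's `ColumnMetadata` carries the column names, and each cell in `Records` is a single-key dict such as `{"longValue": 2019}`):

```python
import pandas as pd

# Aggregate loaded records by pickup year and month.
sql_counts = """
SELECT EXTRACT(year FROM lpep_pickup_datetime)  AS year,
       EXTRACT(month FROM lpep_pickup_datetime) AS month,
       COUNT(*) AS record_count
FROM taxischema.nyc_greentaxi
GROUP BY 1, 2
ORDER BY 1, 2;
"""


def result_to_dataframe(result):
    """Convert a GetStatementResult payload into a Pandas DataFrame."""
    columns = [col["name"] for col in result["ColumnMetadata"]]
    rows = [
        [list(cell.values())[0] for cell in record]
        for record in result["Records"]
    ]
    return pd.DataFrame(rows, columns=columns)

# Usage, after running sql_counts and waiting for it to finish:
# result = client_redshift.get_statement_result(Id=statement_id)
# df = result_to_dataframe(result)
```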
The following screenshot shows the output.
Now that you’re familiar with the Data API in Jupyter Notebook, let’s proceed with ML model building, training, and deployment in Amazon SageMaker.
ML models with Amazon Redshift
The following diagram shows the ML model building, training, and deployment process. The source of data for ML training and testing is Amazon Redshift.
The workflow includes the following steps:
- Launch a Jupyter notebook instance in Amazon SageMaker. You make the Data API call from the notebook instance that runs a query in Amazon Redshift.
- The query result is unloaded into an S3 bucket. The output data is formatted as CSV, GZIP, or Parquet.
- Read the query result from Amazon S3 into a Pandas DataFrame within the Jupyter notebook. This DataFrame is split between train and test data accordingly.
- Build the model using the DataFrame, then train and test the model.
- Deploy the model into a dedicated instance in Amazon SageMaker. End-users and other systems can call this instance to directly infer by providing the input data.
Building and training the ML model using data from Amazon Redshift
In this section, we review the steps to build and train an Amazon SageMaker model from data in Amazon Redshift. For this post, we use the Amazon SageMaker built-in forecasting algorithm DeepAR and the DeepAR example code on GitHub.
The source data is in an Amazon Redshift table. We build a forecasting ML model to predict the number of Green taxi rides in New York City.
Before building the model using Amazon SageMaker DeepAR, we need to use SQL to transform the raw table data into the format the algorithm expects. The following screenshot shows the original format.
The following screenshot shows the converted format.
We convert the raw table data into the preceding format by running the following SQL query. We run the UNLOAD statement using this SQL to unload the transformed data into Amazon S3.
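A sketch of that UNLOAD, assuming a 15-minute bucketing with one count column per vendor (the bucket name, IAM role ARN, and exact column choices are placeholders; note the doubled single quotes required inside an UNLOAD's quoted SELECT):

```python
# Hypothetical transform-and-unload statement.
unload_sql = """
UNLOAD ('
    SELECT DATE_TRUNC(''hour'', lpep_pickup_datetime)
           + (DATE_PART(''minute'', lpep_pickup_datetime)::int / 15)
             * INTERVAL ''15 minutes''                    AS pickup_interval,
           SUM(CASE WHEN vendorid = 1 THEN 1 ELSE 0 END)  AS vendor_1,
           SUM(CASE WHEN vendorid = 2 THEN 1 ELSE 0 END)  AS vendor_2
    FROM taxischema.nyc_greentaxi
    GROUP BY 1
    ORDER BY 1
')
TO 's3://<your-bucket>/greentaxi/rides_'
IAM_ROLE '<redshift-iam-role-arn>'
FORMAT AS CSV HEADER ALLOWOVERWRITE PARALLEL OFF;
"""
```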
After we unload the data into Amazon S3, we load the CSV data into a Pandas DataFrame and visualize the dataset. The following plots show the number of rides aggregated per 15 minutes for each of the vendors.
We now train our model using this time series data to forecast the number of rides.
The attached Jupyter notebook contains three steps:
- Split the train and test data. Unlike classification and regression ML tasks, where the train/test split is made by randomly dividing the entire dataset, in this forecasting algorithm we split the data based on time:
- Start date of training data – 2019-01-01
- End date of training data – 2020-07-31
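The time-based split can be sketched as follows; the timestamp column name is an assumption, while the cutoff dates come from the list above:

```python
import pandas as pd


def split_by_time(df, timestamp_col,
                  train_start="2019-01-01", train_end="2020-07-31"):
    """Split a time series DataFrame into train and test sets by date.

    Rows between train_start and train_end (inclusive) train the model;
    everything after train_end becomes the test set.
    """
    start, end = pd.Timestamp(train_start), pd.Timestamp(train_end)
    train = df[(df[timestamp_col] >= start) & (df[timestamp_col] <= end)]
    test = df[df[timestamp_col] > end]
    return train, test
```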
- Train the model by setting values to the mandatory hyperparameters.
The training job takes around 15 minutes, and the training progress is displayed on the screen. When the job is complete, you see code like the following:
- Deploy the trained model in an Amazon SageMaker endpoint.
We use the endpoint to make predictions on the fly. In this post, we create an endpoint on an ml.m4.xlarge instance class. To display prediction results, we provide an interactive time series graph. You can adjust four control values:
- vendor_id – The vendor ID.
- forecast_day – The offset from the training end date. This is the first date of the forecast prediction.
- confidence – The confidence interval.
- history_weeks_plot – The number of weeks in the plot prior to the forecast day.
The prediction plot looks like the following screenshot.
In this post, we walked through the steps to interact with Amazon Redshift from an Amazon SageMaker Jupyter notebook using the Data API. We provided sample code for the notebook to wait for the Data API to finish specific steps, and showed how to configure the wait time for different SQL statements.
The length of the wait time depends on the type of query you submit. A COPY command that loads a large number of Amazon S3 objects usually takes longer than a SELECT query.
You can retrieve query results directly into a Pandas DataFrame by calling the GetStatementResult API. This approach simplifies in-place analysis by delegating complex SQL queries to Amazon Redshift and fetching only the query result into the Jupyter notebook for visualization.
We further explored building and deploying an ML model on Amazon SageMaker using train and test data from Amazon Redshift.
About the Authors
Saunak Chandra is a senior partner solutions architect for Redshift at AWS. Saunak likes to experiment with new products in the technology space, alongside his day-to-day work. He loves exploring nature in the Pacific Northwest. A short hike or bike ride on the trails is his favorite weekend morning routine. He also likes to do yoga when he gets time from his kid.
Debu Panda, a senior product manager at AWS, is an industry leader in analytics, application platform, and database technologies. He has more than 20 years of experience in the IT industry, has published numerous articles on analytics, enterprise Java, and databases, and has presented at multiple conferences. He is the lead author of EJB 3 in Action (Manning Publications, 2007 and 2014) and Middleware Management (Packt).
Chao Duan is a software development manager at Amazon Redshift, where he leads the development team focusing on enabling self-maintenance and self-tuning with comprehensive monitoring for Redshift. Chao is passionate about building high-availability, high-performance, and cost-effective databases to empower customers with data-driven decision making.