AWS Big Data Blog
Use the Amazon Redshift SQLAlchemy dialect to interact with Amazon Redshift
Amazon Redshift is a fast, scalable, secure, and fully managed cloud data warehouse that enables you to analyze your data at scale. You can interact with an Amazon Redshift database in several different ways. One method is to use an object-relational mapping (ORM) framework. ORM is widely used by developers as an abstraction layer over the database, which allows you to write code in your preferred programming language instead of writing SQL. SQLAlchemy is a popular Python ORM framework that enables interaction between Python code and databases.
A SQLAlchemy dialect is the system used to communicate with various types of DBAPI implementations and databases. Previously, the SQLAlchemy dialect for Amazon Redshift used psycopg2 for communication with the database. Because psycopg2 is a PostgreSQL connector, it doesn’t support Amazon Redshift-specific functionality such as AWS Identity and Access Management (IAM) authentication for secure connections, or Amazon Redshift-specific data types such as SUPER and GEOMETRY. The new Amazon Redshift SQLAlchemy dialect instead uses the Amazon Redshift Python driver (redshift_connector) and lets you securely connect to your Amazon Redshift database. It natively supports IAM authentication and single sign-on (SSO), and it supports Amazon Redshift-specific data types such as SUPER, GEOMETRY, TIMESTAMPTZ, and TIMETZ.
In this post, we discuss how you can interact with your Amazon Redshift database using the new Amazon Redshift SQLAlchemy dialect. We demonstrate how you can securely connect using Okta and perform various DDL and DML operations. Because the new dialect uses redshift_connector, users of this package can take full advantage of the connection options it provides, such as authenticating via IAM and identity provider (IdP) plugins. We also demonstrate support for IPython SqlMagic, which simplifies running interactive SQL queries directly from a Jupyter notebook.
Prerequisites
The following are the prerequisites for this post:
- An Amazon Redshift cluster. For more information, see Getting started with Amazon Redshift.
- Okta set up for SSO. For instructions on setting up Okta as an IdP for Amazon Redshift, see Federate Amazon Redshift access with Okta as an identity provider.
Get started with the Amazon Redshift SQLAlchemy dialect
It’s easy to get started with the Amazon Redshift SQLAlchemy dialect for Python. You can install the sqlalchemy-redshift library using pip. To demonstrate this, we start with a Jupyter notebook. Complete the following steps:
- Create a notebook instance (for this post, we call it redshift-sqlalchemy).
- On the Amazon SageMaker console, under Notebook in the navigation pane, choose Notebook instances.
- Find the instance you created and choose Open Jupyter.
- Open your notebook instance and create a new conda_python3 Jupyter notebook.
- Run the following commands to install sqlalchemy-redshift and redshift_connector:
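The original install commands were not carried over; the following is the standard pip invocation for the two packages as published on PyPI (prefix each command with ! when running from a notebook cell):

```shell
pip install sqlalchemy-redshift
pip install redshift_connector
```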
redshift_connector provides many different connection options that help customize how you access your Amazon Redshift cluster. For more information, see Connection Parameters.
Connect to your Amazon Redshift cluster
In this step, we show you how to connect to your Amazon Redshift cluster using two different methods: Okta SSO federation and a direct connection using your database user and password.
Connect with Okta SSO federation
As a prerequisite, set up your Amazon Redshift application in your Okta configuration. For more information, see Federate Amazon Redshift access with Okta as an identity provider.
To establish a connection to the Amazon Redshift cluster, we use the create_engine function. The SQLAlchemy create_engine() function produces an engine object based on a URL. The sqlalchemy-redshift package provides a custom interface for creating an RFC 1738-compliant URL that you can use to establish a connection to an Amazon Redshift cluster.
We build the SQLAlchemy URL as shown in the following code. URL.create() is available in SQLAlchemy version 1.4 and above. When authenticating using IAM, you don’t need to specify the host and port. To connect to Amazon Redshift securely using SSO federation, we use the Okta user name and password in the URL.
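The original code was not carried over; the following sketch shows the general shape of the URL and connection arguments for Okta federation. All endpoint, app, and cluster values are placeholders, and the connection-argument names follow the redshift_connector IdP options:

```python
import sqlalchemy as sa
from sqlalchemy.engine.url import URL  # URL.create() needs SQLAlchemy 1.4+

# Okta credentials go in the URL; host and port are resolved through IAM,
# so they are omitted here. All values below are placeholders.
url = URL.create(
    drivername="redshift+redshift_connector",  # dialect+driver name
    database="dev",
    username="okta.user@example.com",  # Okta user name
    password="example_okta_password",  # Okta password
)

# redshift_connector options for Okta SSO federation (placeholder values)
conn_args = {
    "iam": True,
    "credentials_provider": "OktaCredentialsProvider",
    "idp_host": "example.okta.com",
    "app_id": "exampleAppId",
    "app_name": "amazon_aws_redshift",
    "cluster_identifier": "redshift-cluster-1",
    "region": "us-east-1",
}

# With your own Okta values and a reachable cluster:
# engine = sa.create_engine(url, connect_args=conn_args)
```

With the packages from the install step, sa.create_engine(url, connect_args=conn_args) returns an engine; the actual SSO handshake happens lazily when the engine first connects.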
Connect with an Amazon Redshift database user and password
You can connect to your Amazon Redshift cluster using your database user and password. We construct the URL with the URL.create() constructor, as shown in the following code:
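The original snippet was not carried over; this sketch uses placeholder values for the cluster endpoint, database, and credentials:

```python
import sqlalchemy as sa
from sqlalchemy.engine.url import URL  # URL.create() needs SQLAlchemy 1.4+

# Placeholder values; substitute your cluster endpoint and credentials.
url = URL.create(
    drivername="redshift+redshift_connector",
    host="redshift-cluster-1.abc123.us-east-1.redshift.amazonaws.com",
    port=5439,  # default Amazon Redshift port
    database="dev",
    username="awsuser",
    password="example_password",
)

# engine = sa.create_engine(url)  # connects lazily; requires a reachable cluster
```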
Create a database table using Amazon Redshift data types and insert data
With the new Amazon Redshift SQLAlchemy dialect, you can create tables with Amazon Redshift-specific data types such as SUPER, GEOMETRY, TIMESTAMPTZ, and TIMETZ.
In this step, you create a table with TIMESTAMPTZ, TIMETZ, and SUPER data types.
Optionally, you can define your table’s distribution style, sort key, and compression encoding. See the following code:
Query and fetch results from the table
The SELECT statements generated by the SQLAlchemy ORM are constructed by a query object. You can use several different methods on it, such as all(), first(), count(), order_by(), and join(). For example, all() retrieves every row from the queried table.
Use IPython SqlMagic with the Amazon Redshift SQLAlchemy dialect
The Amazon Redshift SQLAlchemy dialect now supports SqlMagic. To establish a connection, you can build the SQLAlchemy URL with the redshift_connector driver. More information about SqlMagic is available on GitHub.
In the next section, we demonstrate how you can use SqlMagic. Make sure that you have the ipython-sql package installed; if not, install it by running the following command:
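The original install command was not carried over; the package name on PyPI is ipython-sql:

```shell
pip install ipython-sql
```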
Connect to Amazon Redshift and query the data
In this step, you build the SQLAlchemy URL to connect to Amazon Redshift and run a sample SQL query. For this demo, we prepopulated the cluster with TPC-H data from GitHub. See the following code:
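The original notebook cells were not carried over; a rough sketch of the flow follows. The endpoint and credentials are placeholders, and the orders table and its aggregate query are illustrative examples against the TPC-H data:

```
%reload_ext sql
%sql redshift+redshift_connector://awsuser:example_password@redshift-cluster-1.abc123.us-east-1.redshift.amazonaws.com:5439/dev
%sql SELECT o_orderstatus, SUM(o_totalprice) AS total_price FROM orders GROUP BY o_orderstatus;
```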
You can view the data in tabular format by using the pandas.DataFrame() method.
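For instance, rows fetched by the query can be wrapped in a DataFrame (the sample rows and column names below are illustrative, not actual query output):

```python
import pandas as pd

# Sample rows shaped like the aggregate query's result (illustrative values).
rows = [("F", 1_234_567.89), ("O", 987_654.32), ("P", 45_678.10)]
df = pd.DataFrame(rows, columns=["o_orderstatus", "total_price"])
print(df)
```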
If you installed matplotlib, you can use the result set’s .plot(), .pie(), and .bar() methods for quick plotting.
Clean up
Make sure that SQLAlchemy resources are closed and cleaned up when you’re done with them. SQLAlchemy uses a connection pool to provide access to your Amazon Redshift cluster. By default, these connections remain open after use; if they aren’t cleaned up properly, this can lead to connectivity issues with your cluster. Use the following code to clean up your resources:
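A minimal cleanup sketch follows. An in-memory SQLite engine stands in here so the snippet is self-contained; the same two calls apply to the Redshift engine and session created earlier:

```python
import sqlalchemy as sa
from sqlalchemy.orm import Session

engine = sa.create_engine("sqlite://")  # stand-in for the Redshift engine
session = Session(engine)

# ... queries run here ...

session.close()   # release the session's connection back to the pool
engine.dispose()  # close pooled connections and discard the pool
```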
Summary
In this post, we discussed the new Amazon Redshift SQLAlchemy dialect. We demonstrated how it lets you securely connect to your Amazon Redshift database using SSO as well as a direct connection with the SQLAlchemy URL, and how the dialect supports the TIMESTAMPTZ, TIMETZ, and SUPER data types without explicit casting. We also showcased how redshift_connector and the dialect support SqlMagic with Jupyter notebooks, which enables you to run interactive queries against Amazon Redshift.
About the Authors
Sumeet Joshi is an Analytics Specialist Solutions Architect based out of New York. He specializes in building large-scale data warehousing solutions. He has over 16 years of experience in the data warehousing and analytics space.
Brooke White is a Software Development Engineer at AWS. She enables customers to get the most out of their data through her work on Amazon Redshift drivers. Prior to AWS, she built ETL pipelines and analytics APIs at a San Francisco Bay Area startup.