AWS Big Data Blog

Federated API access to Amazon Redshift using an Amazon Redshift connector for Python

July 2023: This post was reviewed for accuracy.

Amazon Redshift is the leading cloud data warehouse, delivering up to 10 times faster performance at one-tenth the cost of traditional data warehouses by using massively parallel query execution, columnar storage on high-performance disks, and results caching. You can confidently run mission-critical workloads, even in highly regulated industries, because Amazon Redshift comes with out-of-the-box security and compliance.

You can use your corporate identity providers (IdPs), for example Azure AD, Active Directory Federation Services, Okta, or Ping Federate, with Amazon Redshift to provide single sign-on (SSO) to your users so they can use their IdP accounts to log in and access Amazon Redshift. With federation, you can centralize management and governance of authentication and permissions. For more information about the federation workflow using AWS Identity and Access Management (IAM) and an identity provider, see Federate Database User Authentication Easily with IAM and Amazon Redshift. For this post, we use Okta as our IdP.

If you have users who run Python on a computer, such as an Amazon Elastic Compute Cloud (Amazon EC2) instance, you generally use an IAM role attached to the EC2 instance for federated access. However, every user who logs in to that EC2 instance assumes the same IAM role to federate. The Amazon Redshift connector for Python enables IdP authentication for each individual user who logs in to the computer. This post shows you how to use the Amazon Redshift connector for Python and Okta to enable federated SSO into Amazon Redshift and query your data warehouse using a Python script.

Solution overview

To implement this solution, you complete the following high-level steps:

  1. Configure your Okta IdP, including users, groups, and SAML application.
  2. Configure your AWS service provider, including IdP, IAM policy, and IAM role.
  3. Set up your Amazon Redshift connector.
  4. Check your Amazon Redshift users.
  5. Configure the Python script and Amazon Redshift connector to use your enterprise credentials and sign in to Amazon Redshift to run queries.

The process flow for federated authentication includes the following steps:

  1. The user logs in from a Python script using an Amazon Redshift connector.
  2. The IdP authenticates using the corporate user name and password, and returns a SAML assertion.
  3. The client uses AWS SDK to call AWS Security Token Service (AWS STS) to assume a role with SAML.
  4. AWS STS returns temporary AWS credentials.
  5. The client uses the temporary AWS credentials to connect to Amazon Redshift.

The following diagram illustrates this process flow.
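
To make the flow concrete, the following is a minimal boto3 sketch of steps 3–5 (for illustration only; the connector makes the equivalent calls for you, and the ARNs, identifiers, and SAML response shown here are placeholders):

import boto3

# Step 2 output: a base64-encoded SAML response obtained from the IdP
saml_assertion = '<Base64EncodedSAMLResponse>'

# Step 3: exchange the SAML assertion for temporary AWS credentials
sts = boto3.client('sts')
assumed = sts.assume_role_with_saml(
    RoleArn='arn:aws:iam::<account>:role/oktasso',
    PrincipalArn='arn:aws:iam::<account>:saml-provider/okta',
    SAMLAssertion=saml_assertion
)
creds = assumed['Credentials']  # Step 4: temporary AWS credentials

# Step 5: use the temporary credentials to request database credentials and connect
redshift = boto3.client(
    'redshift',
    region_name='<Region>',
    aws_access_key_id=creds['AccessKeyId'],
    aws_secret_access_key=creds['SecretAccessKey'],
    aws_session_token=creds['SessionToken']
)
db_creds = redshift.get_cluster_credentials(
    DbUser='sales@example.com',
    DbName='<RedshiftDatabase>',
    ClusterIdentifier='<RedshiftClusterIdentifier>',
    AutoCreate=True
)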

Prerequisites

Before starting this walkthrough, you must have the following:

  • An Amazon Redshift cluster. The US East (N. Virginia) Region is preferred because you need to load data from Amazon Simple Storage Service (Amazon S3) in us-east-1.
  • A database user with superuser privileges.
  • Python 3.5 or above.
  • Pip.

Configuring your IdP (Okta)

For instructions on setting up your IdP, see steps 1–4 in Federate Amazon Redshift access with Okta as an identity provider.

For step 2, create a user called sales@example.com and add it to the sales group. This is the example user that you use in this post. The following screenshot shows your example users.

When the setup is complete, you should see settings similar to the following screenshots. The following screenshot shows your general SAML settings.

The following screenshot shows your attribute statements.

The following screenshot shows the settings for your application embed link, login page, and access error page.

For instructions on setting up advanced IdP configurations, see steps 8–9 in Federate Amazon Redshift access with Okta as an identity provider.

Configuring your service provider (AWS)

To set up the service provider, complete steps 5–7 from Federate Amazon Redshift access with Okta as an identity provider.

For step 7, the Amazon Redshift connector needs the following additional permission, which it uses to look up cluster details from the cluster identifier:

{
    "Sid": "DescribeClusters",
    "Effect": "Allow",
    "Action": "redshift:DescribeClusters",
    "Resource": "*"
}

Create the following custom policy, replacing the region, account, cluster, and databasename placeholders. These permissions allow the role to query data in Amazon Redshift, create database users, and allow those users to join groups. The Condition block enforces that the AWS user ID must match "unique-role-identifier:${redshift:DbUser}", so that individual users can authenticate only as themselves.

Note: Leave unique-role-identifier as is; you will need to revise the policy after the role is created.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "GetClusterCredsStatement",
            "Effect": "Allow",
            "Action": "redshift:GetClusterCredentials",
            "Resource": [
                "arn:aws:redshift:<region>:<account>:cluster:<cluster>",
                "arn:aws:redshift:<region>:<account>:dbname:<cluster>/<databasename>",
                "arn:aws:redshift:<region>:<account>:dbuser:<cluster>/${redshift:DbUser}"
            ],
            "Condition": {
                "StringEquals": {
                    "aws:userid": "unique-role-identifier:${redshift:DbUser}"
                }
            }
        },
        {
            "Sid": "DescribeClusters",
            "Effect": "Allow",
            "Action": "redshift:DescribeClusters",
            "Resource": "*"
        },
        {
            "Sid": "CreateClusterUserStatement",
            "Effect": "Allow",
            "Action": "redshift:CreateClusterUser",
            "Resource": [
                "arn:aws:redshift:<region>:<account>:dbname:<cluster>/<databasename>",
                "arn:aws:redshift:<region>:<account>:dbuser:<cluster>/${redshift:DbUser}"
            ]
        },
        {
            "Sid": "RedshiftJoinGroupStatement",
            "Effect": "Allow",
            "Action": "redshift:JoinGroup",
            "Resource": [
                "arn:aws:redshift:<region>:<account>:dbgroup:<cluster>/sales"
            ]
        }
    ]
}

When the setup is complete, you should see an IAM role setup similar to the following screenshot.

We use the AWS Command Line Interface (AWS CLI) to fetch the unique role identifier for the role you just created.

  • Configure the AWS CLI with the necessary access key ID and secret access key before running the following command:
aws iam get-role --role-name <value>

Replace <value> with the name of the role that you just created. In this example, the command is:

aws iam get-role --role-name oktasso
  • From the output JSON, note the value of RoleId.
  • On the IAM console, open the policy you created earlier.
  • Choose Edit policy.

  • Choose the JSON tab.

  • Replace unique-role-identifier with the RoleId fetched earlier.
  • Choose Review policy and Save changes.
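
If you prefer to fetch the RoleId from Python rather than the AWS CLI, the following is a minimal boto3 sketch (assuming the example role name oktasso and credentials that allow iam:GetRole):

import boto3

# Same lookup as the AWS CLI command above; prints the RoleId to paste
# over unique-role-identifier in the policy
iam = boto3.client('iam')
role = iam.get_role(RoleName='oktasso')
print(role['Role']['RoleId'])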

Setting up your Amazon Redshift connector

The easiest way to get started with redshift_connector is via pip. See the following code:

pip install redshift_connector

After running the command, you see a message indicating redshift_connector was installed successfully:

Successfully installed redshift_connector
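
As a quick sanity check, you can confirm that the package imports cleanly and print its version (a minimal sketch; the version string on your machine will differ):

import redshift_connector

# Verify the installation by printing the installed package version
print(redshift_connector.__version__)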

Checking Amazon Redshift users

To check your users, connect to your Amazon Redshift cluster using your preferred SQL client as a superuser and enter the following code:

select * from pg_user where usename = 'sales@example.com';

The query should return zero records, which shows that the corporate user sales@example.com doesn’t exist yet in Amazon Redshift. See the following code:

dev=# select * from pg_user where usename = 'sales@example.com'; 
 usename | usesysid | usecreatedb | usesuper | usecatupd | passwd | valuntil | useconfig 
---------+----------+-------------+----------+-----------+--------+----------+-----------
(0 rows)

Testing the Python script

The next step is to run an example Python script that uses federation to connect to your Amazon Redshift cluster and run some queries. See the following code:

import redshift_connector
import numpy
 
# Connects to the Redshift cluster using browser-based Okta SAML sign-in
conn = redshift_connector.connect(
    iam=True,
    ssl=True,
    host='<RedshiftClusterEndpoint>',
    port=<RedshiftClusterPortNumber>,
    database='<RedshiftDatabase>',
    db_user='sales@example.com',
    cluster_identifier='<RedshiftClusterIdentifier>',
    region='<Region>',
    login_url='<IdPLoginUrl>',
    credentials_provider='BrowserSamlCredentialsProvider',
    user='',
    password=''
)
 
cursor: redshift_connector.Cursor = conn.cursor()
cursor.execute("select current_user")
result: numpy.ndarray = cursor.fetch_numpy_array()
print(result)
 
cursor.execute("create Temp table book(bookname varchar,author‎ varchar)")
cursor.executemany("insert into book (bookname, author‎) values (%s, %s)", [('One Hundred Years of Solitude', 'Gabriel García Márquez'),('A Brief History of Time', 'Stephen Hawking')])
cursor.execute("select * from book")
 
result: numpy.ndarray = cursor.fetch_numpy_array()
print(result)
cursor.execute("drop table book")
conn.commit()
 
conn.close()

Use the following parameters:

  • RedshiftClusterEndpoint – The Amazon Redshift cluster endpoint without a port and database
  • RedshiftClusterPortNumber – The Amazon Redshift port number
  • RedshiftDatabase – The Amazon Redshift database name
  • RedshiftClusterIdentifier – The Amazon Redshift cluster identifier
  • Region – The Region where the cluster is
  • IdPLoginUrl – The IdP (Okta) sign-in URL, which you can get from the embed link in the Okta SAML application properties

After you run the Python script, the Okta sign-in page opens. Sign in with sales@example.com.

Switch back to your Python environment; it should show the current user and the two book records you inserted:

$ python3 redshift_connector_okta.py
[['sales@example.com']]
[['One Hundred Years of Solitude' 'Gabriel García Márquez']
 ['A Brief History of Time' 'Stephen Hawking']]

The current_user query returns sales@example.com, showing that the Python script ran in the context of this federated user.

You may now close your browser window.
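
The example above uses the browser-based SAML flow. redshift_connector also documents a non-interactive Okta plugin (OktaCredentialsProvider) that authenticates against the Okta API directly, with no browser window. The following is a minimal sketch under that assumption; the Okta host, application ID, and all other placeholder values must be replaced with your own:

import redshift_connector

# Non-interactive Okta authentication; no browser window is opened
conn = redshift_connector.connect(
    iam=True,
    ssl=True,
    host='<RedshiftClusterEndpoint>',
    port=<RedshiftClusterPortNumber>,
    database='<RedshiftDatabase>',
    db_user='sales@example.com',
    cluster_identifier='<RedshiftClusterIdentifier>',
    region='<Region>',
    credentials_provider='OktaCredentialsProvider',
    idp_host='<your-okta-domain>.okta.com',   # Okta host, without https://
    app_id='<OktaApplicationId>',             # typically taken from the Okta embed link
    app_name='<OktaApplicationName>',         # the app name segment of the embed link
    user='sales@example.com',                 # Okta user name
    password='<OktaPassword>'                 # Okta password
)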

To check your users, connect to your Amazon Redshift cluster using your preferred SQL client as a superuser and enter the following code:

select * from pg_user where usename = 'sales@example.com';

The query should return one record, which shows that the database user is automatically created based on the IdP corporate user:

dev=# select * from pg_user where usename = 'sales@example.com';
      usename      | usesysid | usecreatedb | usesuper | usecatupd |  passwd  | valuntil | useconfig 
-------------------+----------+-------------+----------+-----------+----------+----------+-----------
 sales@example.com |      116 | f           | f        | f         | ******** |          | 
(1 row)

Summary

Amazon Redshift supports stringent compliance and security requirements with no extra cost, which makes it ideal for highly regulated industries. With federation, you can centralize management and governance of authentication and permissions by managing users and groups within the enterprise IdP and use them to authenticate to Amazon Redshift. SSO enables users to have a seamless user experience while accessing various applications in the organization.

This post showed you how to set up an Amazon Redshift connector and use Okta as your IdP to enable federated SSO to an Amazon Redshift cluster from a Python script. If you have users such as data scientists that heavily rely on Python scripts, you can follow these steps to set up federated SSO for your organization and manage access privileges to your Amazon Redshift cluster.

If you have any questions or suggestions, please leave a comment or contribute to the project.


About the Authors


BP Yau is a Data Warehouse Specialist Solutions Architect at AWS. His role is to help customers architect big data solutions to process data at scale. Before AWS, he helped Amazon.com Supply Chain Optimization Technologies migrate the Oracle Data Warehouse to Amazon Redshift and built the next generation big data analytics platform using AWS technologies.


Brooke White is a Software Development Engineer at AWS. She enables customers to get the most out of their data through her work on Amazon Redshift drivers. Prior to AWS, she built ETL pipelines and analytics APIs at a San Francisco Bay Area startup.


Ilesh Garish is a Software Development Engineer at AWS. His role is to develop connectors for Amazon Redshift. Prior to AWS, he built database drivers for the Oracle RDBMS, TigerLogic XDMS, and OpenAccess SDK. He worked in the database internal technologies at San Francisco Bay Area startups.