How do I create and query an external table in Amazon Redshift Spectrum?

Last updated: 2020-10-15

I'm trying to create and query an external table in Amazon Redshift Spectrum. How can I do this?

Resolution

With Amazon Redshift Spectrum, you can query data from Amazon Simple Storage Service (Amazon S3) without having to load data into Amazon Redshift tables. Amazon Redshift Spectrum processes any queries while the data remains in your Amazon S3 bucket.

Important: Before you begin, check whether Amazon Redshift is authorized to access your S3 bucket and any external data catalogs. Additionally, your Amazon Redshift cluster and S3 bucket must be in the same AWS Region.

To create an external table in Amazon Redshift Spectrum, perform the following steps:

1.    Create an IAM role for Amazon Redshift.

2.    Attach your AWS Identity and Access Management (IAM) policy:
If you're using AWS Glue Data Catalog, attach the AmazonS3ReadOnlyAccess and AWSGlueConsoleFullAccess IAM policies to your role.
If you're using Amazon Athena Data Catalog, attach the  AmazonAthenaFullAccess IAM policy to your role.

3.    Associate the IAM role to the Amazon Redshift cluster.

4.    Create the external schema. The external schema references a database in the external data catalog. The external schema also provides the IAM role with an Amazon Resource Name (ARN) that authorizes Amazon Redshift access to S3.

In the following example, we use sample data files from S3 (tickitdb.zip). Unzip and load the individual files to an S3 bucket in your AWS Region like this:

s3://<bucket_name>/tickit/spectrum/event/' and 's3://<bucket_name>/tickit/spectrum/sales/

In this example, the external database is created in an AWS Glue Data Catalog:

create external schema spectrum
from data catalog   
database 'spectrumdb'
iam_role 'arn:aws:iam::123456789012:role/mySpectrumRole'
create external database if not exists;

Note: Replace the ARN of the IAM role with the ARN you created. Be sure to specify the name of the external database (such as "spectrumdb") for the database parameter.

5.   Create an external table. For example, you can create an external table for your EVENT data like this:

create external table spectrum.event(
     eventid integer,
    
     venueid smallint,
    
     catid smallint,
    
     dateid smallint,
    
     eventname varchar(200),
    
     starttime timestamp)
row format delimited
fields terminated by '|'
stored as textfile 
location 's3://<bucket_name>/tickit/spectrum/event/';

For more information about external tables, see Creating external tables for Amazon Redshift Spectrum.

To create an external table using AWS Glue, be sure to add table definitions to your AWS Glue Data Catalog. You can add table definitions in your AWS Glue Data Catalog in several ways. For more information about adding table definitions, see Defining tables in the AWS Glue Data Catalog.

To create an external table using Amazon Athena, add table definitions like this:

CREATE EXTERNAL TABLE `spectrumdb.event`(
  `eventid` int, 
    
  `venueid` smallint, 
    
  `catid` smallint, 
    
  `dateid` smallint, 
    
  `eventname` string, 
    
  `starttime` timestamp)
    
ROW FORMAT DELIMITED 
    
  FIELDS TERMINATED BY '|' 
    
STORED AS INPUTFORMAT 
    
  'org.apache.hadoop.mapred.TextInputFormat' 
    
OUTPUTFORMAT 

  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://<bucket_name>/tickit/spectrum/event';

6.    Run the following query for SVV_EXTERNAL_TABLES to view all external tables referenced by your external schema:

select schemaname , tablename , location from svv_external_tables where schemaname = 'spectrum';

schemaname | tablename | location
----------------+---------------------------------+-----------------------------------------------------------------------
spectrum | event | s3://<bucket-name>/<file-location>

7.    Query the external tables (as external Amazon Redshift Spectrum tables) using a SELECT statement:

select top 3 spectrum.sales.eventid, sum(spectrum.sales.pricepaid) from spectrum.sales, spectrum.event
where spectrum.sales.eventid = spectrum.event.eventid
and spectrum.sales.pricepaid > 30
group by spectrum.sales.eventid
order by 2 desc;

 eventid |   sum
---------+----------
     289 | 51846.00
    7895 | 51049.00
    1602 | 50301.00

This example query joins the external SALES table with an external EVENT table.