How do I create and query an external table in Amazon Redshift Spectrum?

3 minute read

I'm trying to create and query an external table in Amazon Redshift Spectrum. How can I do this?

Resolution

With Amazon Redshift Spectrum, you can query data from Amazon Simple Storage Service (Amazon S3) without having to load data into Amazon Redshift tables. Amazon Redshift Spectrum processes any queries while the data remains in your Amazon S3 bucket.

Important: Before you begin, check whether Amazon Redshift is authorized to access your S3 bucket and any external data catalogs. Additionally, your Amazon Redshift cluster and S3 bucket must be in the same AWS Region.

To create an external table in Amazon Redshift Spectrum, perform the following steps:

1. Create an AWS Identity and Access Management (IAM) role for Amazon Redshift.

2. Attach your IAM policy:
If you're using AWS Glue Data Catalog, attach the AmazonS3ReadOnlyAccess and AWSGlueConsoleFullAccess IAM policies to your role.
If you're using Amazon Athena Data Catalog, attach the AmazonAthenaFullAccess IAM policy to your role.

3. Associate the IAM role to the Amazon Redshift cluster.

4. Create the external schema. The external schema references a database in the external data catalog. The external schema also provides the IAM role with an Amazon Resource Name (ARN) that authorizes Amazon Redshift access to S3.

In the following example, we use sample data files from S3 (tickitdb.zip). Unzip and load the individual files to an S3 bucket in your AWS Region like this:

s3://<bucket_name>/tickit/spectrum/event/' and 's3://<bucket_name>/tickit/spectrum/sales/

You can create an external schema using a command similar to the one below:

create external schema spectrum
from data catalog   
database 'spectrumdb'
iam_role 'arn:aws:iam::123456789012:role/mySpectrumRole'
create external database if not exists;

Note: Replace the ARN of the IAM role with the ARN that you created. Be sure to specify the name of the external database (such as "spectrumdb") for the database parameter.

5. Create an external table. For example, you can create an external table for your EVENT data like this:

create external table spectrum.event(     eventid integer,
         venueid smallint,
         catid smallint,
         dateid smallint,
         eventname varchar(200),
         starttime timestamp)row format delimitedfields terminated by '|'stored as textfile location 's3://<bucket_name>/tickit/spectrum/event/';

For more information about external tables, see Creating external tables for Redshift Spectrum.

To create an external table using AWS Glue, be sure to add table definitions to your AWS Glue Data Catalog. You can add table definitions in your AWS Glue Data Catalog in several ways. For more information about adding table definitions, see Working with tables on the AWS Glueconsole.

To create an external table using Amazon Athena, add table definitions like this:

CREATE EXTERNAL TABLE `spectrumdb.event`(  `eventid` int, 
      `venueid` smallint, 
      `catid` smallint, 
      `dateid` smallint, 
      `eventname` string, 
      `starttime` timestamp)
    ROW FORMAT DELIMITED 
      FIELDS TERMINATED BY '|' 
    STORED AS INPUTFORMAT 
      'org.apache.hadoop.mapred.TextInputFormat' 
    OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'LOCATION  's3://<bucket_name>/tickit/spectrum/event';

6. Run the following query for SVV_EXTERNAL_TABLES to view all external tables referenced by your external schema:

select schemaname , tablename , location from svv_external_tables where schemaname = 'spectrum';

schemaname | tablename | location
----------------+---------------------------------+-----------------------------------------------------------------------
spectrum | event | s3://<bucket-name>/<file-location>

7. Query the external tables (as external Amazon Redshift Spectrum tables) using a SELECT statement:

select top 3 spectrum.sales.eventid, sum(spectrum.sales.pricepaid) from spectrum.sales, spectrum.event
where spectrum.sales.eventid = spectrum.event.eventid
and spectrum.sales.pricepaid > 30
group by spectrum.sales.eventid
order by 2 desc;

 eventid |   sum
---------+----------
     289 | 51846.00
    7895 | 51049.00
    1602 | 50301.00

This example query joins the external SALES table with an external EVENT table.

Related information

Querying external data using Amazon Redshift Spectrum

Troubleshooting queries in Amazon Redshift Spectrum

Topics

Analytics

Relevant content

Distinguish Redshift Spectrum from Redshift Costs
Accepted Answer
AWS-User-5967322
asked 6 years ago
Can you use a Redshift Spectrum Table in a CTE?
mmasseth
asked 2 years ago
use Redshift Spectrum to query both Redshift table and s3 file
Accepted Answer
AWS-User-5745147
asked 4 years ago
Redshift Spectrum with hanging select queries about 25 minutes
frecong
asked 3 years ago
Supported S3 Storage Classes for Redshift Spectrum
timosky
asked 9 months ago
How do I resolve access denied errors using Redshift Spectrum with Amazon S3 buckets in the same account as Amazon Redshift?
AWS OFFICIALUpdated a year ago
How can I create Amazon Redshift Spectrum cross-account access to AWS Glue and Amazon S3?
AWS OFFICIALUpdated 2 years ago
How do I calculate the query charges in Amazon Redshift Spectrum?
AWS OFFICIALUpdated 2 years ago
How do I resolve data incompatibility errors in Redshift Spectrum?
AWS OFFICIALUpdated a year ago
Create custom alerts in Amazon Redshift
EXPERT
Rick Fraser
published 6 months ago