AWS Big Data Blog

10 Best Practices for Amazon Redshift Spectrum

by Po Hong and Peter Dalton

Amazon Redshift Spectrum enables you to run Amazon Redshift SQL queries against data that is stored in Amazon S3. With Amazon Redshift Spectrum, you can extend the analytic power of Amazon Redshift beyond the data that is stored on local disks in your data warehouse. You can query vast amounts of data in your Amazon S3 “data lake” without having to go through a tedious and time-consuming extract, transform, and load (ETL) process. Amazon Redshift Spectrum applies sophisticated query optimization and scales processing across thousands of nodes to deliver fast performance.

In this blog post, we have collected 10 important best practices for Amazon Redshift Spectrum, grouped into several functional areas.

These guidelines are the product of many interactions and direct project work with Amazon Redshift customers.

Amazon Redshift vs. Amazon Athena

AWS customers often ask us: Amazon Athena or Amazon Redshift Spectrum? When should I use one over the other?

When to use Amazon Athena

Amazon Athena supports a use case in which you want interactive ad-hoc queries to run against data that is stored in Amazon S3 using SQL. The serverless architecture in Amazon Athena frees you from having to provision a cluster to perform queries. You are charged based on the amount of S3 data scanned by each query. You can get significant cost savings and better performance by compressing, partitioning, or converting your data into a columnar format, which reduces the amount of data that Amazon Athena needs to scan to execute a query. All the major BI tools and SQL clients that use JDBC can be used with Amazon Athena. You can also use Amazon QuickSight for easy visualization.

When to use Amazon Redshift

We recommend using Amazon Redshift on large sets of structured data. Amazon Redshift Spectrum gives you the freedom to store your data where you want, in the format you want, and have it available for processing when you need it. With Amazon Redshift Spectrum, you don’t have to worry about scaling your cluster. It lets you separate storage and compute, allowing you to scale each independently. You can even run multiple Amazon Redshift clusters against the same Amazon S3 data lake, enabling limitless concurrency. Amazon Redshift Spectrum automatically scales out to thousands of instances. So queries run quickly, whether they are processing a terabyte, a petabyte, or even an exabyte.

Set up the test environment

For information about prerequisites and steps to get started in Amazon Redshift Spectrum, see Getting Started with Amazon Redshift Spectrum.

You can use any data set to perform the tests that validate the best practices outlined in this blog post. One important requirement is that the S3 files for the largest table be available in three separate data formats: CSV, non-partitioned Parquet, and partitioned Parquet. Converting from one file format to another is beyond the scope of this blog post. For more information on how this can be done, check out the following resources:

Creating the external schema

Use the Amazon Athena data catalog as the metadata store, and create an external schema named “spectrum” as follows:

create external schema spectrum 
from data catalog 
database 'spectrumdb' 
iam_role 'arn:aws:iam::<AWS_ACCOUNT_ID>:role/aod-redshift-role'
create external database if not exists;

The Redshift cluster and the data files in Amazon S3 must be in the same AWS region.  Your Redshift cluster needs authorization to access your external data catalog in Amazon Athena and your data files in Amazon S3. You provide that authorization by referencing an AWS Identity and Access Management (IAM) role (e.g. aod-redshift-role) that is attached to your cluster. For more information, see Create an IAM Role for Amazon Redshift.
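With the schema in place, you can define external tables over your S3 data. The table name, columns, and S3 paths below are hypothetical placeholders for illustration; adjust them to match your own data set:

create external table spectrum.sales (
  sale_id     bigint,
  customer_id bigint,
  amount      decimal(12,2)
)
partitioned by (sale_date date)
stored as parquet
location 's3://<your-bucket>/sales/';

alter table spectrum.sales
  add partition (sale_date='2017-01-01')
  location 's3://<your-bucket>/sales/sale_date=2017-01-01/';

For a partitioned table, each partition must be registered with an ALTER TABLE … ADD PARTITION statement, as shown, before Redshift Spectrum can scan it.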


Visualize and Monitor Amazon EC2 Events with Amazon CloudWatch Events and Amazon Kinesis Firehose

by Pubali Sen

Monitoring your AWS environment is important for security, performance, and cost control purposes. For example, by monitoring and analyzing API calls made to your Amazon EC2 instances, you can trace security incidents and gain insights into administrative behaviors and access patterns. The kinds of events you might monitor include console logins; Amazon EBS snapshot creation, deletion, and modification; VPC creation, deletion, and modification; and instance reboots.

In this post, I show you how to build a near real-time API monitoring solution for EC2 events using Amazon CloudWatch Events and Amazon Kinesis Firehose. Please be sure to have Amazon CloudTrail enabled in your account.

  • CloudWatch Events offers a near real-time stream of system events that describe changes in AWS resources. CloudWatch Events now supports Kinesis Firehose as a target.
  • Kinesis Firehose is a fully managed service for continuously capturing, transforming, and delivering data in minutes to storage and analytics destinations such as Amazon S3, Amazon Kinesis Analytics, Amazon Redshift, and Amazon Elasticsearch Service.


For this walkthrough, you create a CloudWatch event rule that matches specific EC2 events such as:

  • Starting, stopping, and terminating an instance
  • Creating and deleting VPC route tables
  • Creating and deleting a security group
  • Creating, deleting, and modifying instance volumes and snapshots

Your CloudWatch event target is a Kinesis Firehose delivery stream that delivers this data to an Elasticsearch cluster, where you set up Kibana for visualization. Using this solution, you can easily load and visualize EC2 events in minutes without setting up complicated data pipelines.
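As a sketch, the rule and its Firehose target can be created from the AWS CLI as follows. The rule name, delivery stream name, and IAM role names are placeholders for illustration, and the event pattern shown matches only instance state changes; extend it for the other event types listed above:

$ # Match EC2 instance state-change events (hypothetical rule name)
$ aws events put-rule \
    --name ec2-event-rule \
    --event-pattern '{"source":["aws.ec2"],"detail-type":["EC2 Instance State-change Notification"]}'

$ # Point the rule at an existing Firehose delivery stream
$ aws events put-targets \
    --rule ec2-event-rule \
    --targets '[{"Id":"1","Arn":"arn:aws:firehose:us-east-1:<AWS_ACCOUNT_ID>:deliverystream/<StreamName>","RoleArn":"arn:aws:iam::<AWS_ACCOUNT_ID>:role/<EventsToFirehoseRole>"}]'

The role passed in RoleArn must allow CloudWatch Events to write records to the delivery stream.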

Set up the Elasticsearch cluster

Create the Amazon ES domain in the Amazon ES console or by using the create-elasticsearch-domain command in the AWS CLI.

This example uses the following configuration:

  • Domain Name: esLogSearch
  • Elasticsearch Version: 1
  • Instance Count: 2
  • Instance type: elasticsearch
  • Enable dedicated master: true
  • Enable zone awareness: true
  • Restrict Amazon ES to an IP-based access policy
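The same configuration can be expressed with the AWS CLI. The instance types, Elasticsearch version, and dedicated master settings below are assumptions (note that domain names must be lowercase); substitute the values from your own design, and attach your IP-based access policy separately:

$ # Create the Amazon ES domain (instance types and version are illustrative)
$ aws es create-elasticsearch-domain \
    --domain-name eslogsearch \
    --elasticsearch-version "1.5" \
    --elasticsearch-cluster-config \
        InstanceType=m3.medium.elasticsearch,InstanceCount=2,DedicatedMasterEnabled=true,DedicatedMasterType=m3.medium.elasticsearch,DedicatedMasterCount=3,ZoneAwarenessEnabled=true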


Analyze Database Audit Logs for Security and Compliance Using Amazon Redshift Spectrum

by Sandeep Kariro

With the increased adoption of cloud services, organizations are moving their critical workloads to AWS. Some of these workloads store, process, and analyze sensitive data that must be audited to satisfy security and compliance requirements. The most common questions from auditors are: Who logged in to the system, and when? Who queried which sensitive data, and when? When did a user last modify or update their credentials?

By default, Amazon Redshift logs all information related to user connections, user modifications, and user activity on the database. However, to efficiently manage disk space, log tables are retained for only 2–5 days, depending on log usage and available disk space. To retain the log data for a longer period of time, enable database audit logging. After it’s enabled, Amazon Redshift periodically pushes the data to a configured S3 bucket.

Amazon Redshift Spectrum is a recently released feature that enables querying and joining data stored in Amazon S3 with Amazon Redshift tables. With Redshift Spectrum, you can retrieve the audit data stored in S3 to answer all security and compliance–related questions. Redshift Spectrum can also combine the datasets from the tables in the database with the datasets stored in S3. It supports files in Parquet, textfile (csv, pipe delimited, tsv), sequence file, and RC file format. It also supports different compression types like gzip, snappy, and bz2.

In this post, I demonstrate querying the Amazon Redshift audit data logged in S3 to provide answers to the common use cases described above.


You set up the following resources:

  • Amazon Redshift cluster and parameter group
  • IAM role and policies to give Redshift Spectrum access to Amazon S3 and the Amazon Athena data catalog
  • Redshift Spectrum external tables


Prerequisites

  • Create an AWS account
  • Configure the AWS CLI to access your AWS account
  • Get access to a query tool compatible with Amazon Redshift
  • Create an S3 bucket

Cluster requirements

The Amazon Redshift cluster must:

  • Be in the same region as the S3 bucket storing the audit log files.
  • Be version 1.0.1294 or later.
  • Have read bucket and put object permissions on the S3 bucket to be configured for logging.
  • Have an IAM role attached that includes at least the two built-in policies AmazonS3ReadOnlyAccess and AmazonAthenaFullAccess.

Set up Amazon Redshift

Create a new parameter group to enable user activity logging:

aws redshift create-cluster-parameter-group --parameter-group-name rs10-enable-log --parameter-group-family Redshift-1.0 --description "Enable Audit Logging"
aws redshift modify-cluster-parameter-group --parameter-group-name rs10-enable-log --parameters '[{"ParameterName":"enable_user_activity_logging","ParameterValue":"true"}]'

Create the Amazon Redshift cluster using the new parameter group created:

aws redshift create-cluster --node-type dc1.large --cluster-type single-node --cluster-parameter-group-name rs10-enable-log --master-username <Username> --master-user-password <Password> --cluster-identifier <ClusterName>

Wait for the cluster to become available. When it is, enable audit logging:

aws redshift enable-logging --cluster-identifier <ClusterName> --bucket-name <bucketname>
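You can confirm that audit logging is active, and see which bucket and prefix the logs are delivered to, with describe-logging-status:

$ # Verify that logs are being delivered to the configured bucket
$ aws redshift describe-logging-status --cluster-identifier <ClusterName>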

Set up Redshift Spectrum

To set up Redshift Spectrum, create an IAM role and policies, an external database, and external tables.
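As a sketch of that setup, the external schema and a table over the connection log might look like the following. The schema name, database name, role name, and log path are placeholders, and the column list is an assumption based on the standard Redshift connection-log fields; check the delivered files for the exact layout:

create external schema audit_logs
from data catalog
database 'auditdb'
iam_role 'arn:aws:iam::<AWS_ACCOUNT_ID>:role/<RedshiftSpectrumRole>'
create external database if not exists;

create external table audit_logs.connection_log (
  event       varchar(60),
  recordtime  varchar(60),
  remotehost  varchar(60),
  remoteport  varchar(60),
  pid         int,
  dbname      varchar(60),
  username    varchar(60),
  authmethod  varchar(60),
  duration    bigint,
  sslversion  varchar(60)
)
row format delimited fields terminated by '|'
stored as textfile
location 's3://<bucketname>/AWSLogs/<AWS_ACCOUNT_ID>/redshift/';

With the table in place, an illustrative “who connected, and when” query is a plain SELECT:

select username, remotehost, recordtime
from audit_logs.connection_log
where event like 'authenticated%'
order by recordtime desc;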

Seven Tips for Using S3DistCp on Amazon EMR to Move Data Efficiently Between HDFS and Amazon S3

by Illya Yalovyy

Although it’s common for Amazon EMR customers to process data directly in Amazon S3, there are occasions where you might want to copy data from S3 to the Hadoop Distributed File System (HDFS) on your Amazon EMR cluster. Additionally, you might have a use case that requires moving large amounts of data between buckets or regions. In these use cases, large datasets are too big for a simple copy operation. Amazon EMR can help with this, and offers a utility – S3DistCp – to help with moving data from S3 to other S3 locations or to on-cluster HDFS.

In the Hadoop ecosystem, DistCp is often used to move data. DistCp provides a distributed copy capability built on top of a MapReduce framework. S3DistCp is an extension to DistCp that is optimized to work with S3 and that adds several useful features. In addition to moving data between HDFS and S3, S3DistCp is also a Swiss Army knife of file manipulations. In this post we’ll cover the following tips for using S3DistCp, starting with basic use cases and then moving to more advanced scenarios:

1. Copy or move files without transformation
2. Copy and change file compression on the fly
3. Copy files incrementally
4. Copy multiple folders in one job
5. Aggregate files based on a pattern
6. Upload files larger than 1 TB in size
7. Submit an S3DistCp step to an EMR cluster

1. Copy or move files without transformation

We’ve observed that customers often use S3DistCp to copy data from one storage location to another, whether S3 or HDFS. Syntax for this operation is simple and straightforward:

$ s3-dist-cp --src /data/incoming/hourly_table --dest s3://my-tables/incoming/hourly_table

The source location may contain extra files that we don’t necessarily want to copy. Here, we can use filters based on regular expressions to do things such as copying files with the .log extension only.
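For example, to copy only files with the .log extension, pass a regular expression via the --srcPattern option (the paths here are illustrative):

$ # Copy only objects whose full path matches the pattern
$ s3-dist-cp --src /data/incoming/hourly_table \
             --dest s3://my-tables/incoming/hourly_table_filtered \
             --srcPattern '.*\.log'

Note that --srcPattern matches against the full object path, not just the file name, which is why the pattern starts with `.*`.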


Amazon QuickSight Adds Support for Amazon Redshift Spectrum

by Luis Wang

In April, we announced Amazon Redshift Spectrum in the AWS Blog. Redshift Spectrum is a new feature of Amazon Redshift that allows you to run complex SQL queries against exabytes of data in Amazon S3 without having to load and transform any data.

We’re happy to announce that Amazon QuickSight now supports Redshift Spectrum. Starting today, QuickSight customers can leverage Redshift Spectrum to visualize and analyze vast amounts of unstructured data in their Amazon S3 data lake. With QuickSight and Redshift Spectrum, customers can now visualize combined data sets that include frequently accessed data stored in Amazon Redshift and bulk data sets stored cost effectively in S3 using the familiar SQL syntax of Amazon Redshift.

With Redshift Spectrum, you can start querying your data in Amazon S3 immediately, with no loading or transformation required. You just need to register your Amazon Athena data catalog or Hive Metastore as an external schema. You can then use QuickSight to select the external schema and the Redshift Spectrum tables—just like any other Amazon Redshift tables in your cluster―and start visualizing your S3 data in seconds. You don’t have to worry about scaling your cluster. Redshift Spectrum lets you separate storage and compute, allowing you to scale each independently. You only pay for the queries that you run.

Redshift Spectrum support is now available in these QuickSight regions: US East (N. Virginia), US East (Ohio), and US West (Oregon).

To learn more about these capabilities and start using them in your dashboards, check out the QuickSight User Guide.

If you have questions and suggestions, post them on the QuickSight Discussion Forum.

Not a QuickSight user? Get started for FREE on the QuickSight page.


New AWS Certification Specialty Exam for Big Data

by Sara Snedeker

AWS Certifications validate technical knowledge with an industry-recognized credential. Today, the AWS Certification team released the AWS Certified Big Data – Specialty exam. This new exam validates technical skills and experience in designing and implementing AWS services to derive value from data. The exam requires a current Associate AWS Certification and is intended for individuals who perform complex big data analyses.

Individuals who are interested in sitting for this exam should know how to do the following:

  • Implement core AWS big data services according to basic architectural best practices
  • Design and maintain big data
  • Leverage tools to automate data analysis

To prepare for the exam, we recommend the Big Data on AWS course, plus AWS whitepapers and documentation that are focused on big data.

This credential can help you stand out from the crowd, get recognized, and provide more evidence of your unique technical skills.

The AWS Certification team also released an AWS Certified Advanced Networking – Specialty exam and new AWS Certification Benefits. You can read more about these new releases on the AWS Blog.

Have more questions about AWS Certification? See our AWS Certification FAQ.

Build a Serverless Architecture to Analyze Amazon CloudFront Access Logs Using AWS Lambda, Amazon Athena, and Amazon Kinesis Analytics

by Rajeev Srinivasan and Sai Sriparasa

Nowadays, it’s common for a web server to be fronted by a global content delivery service, like Amazon CloudFront. This type of front end accelerates delivery of websites, APIs, media content, and other web assets to provide a better experience to users across the globe.

The insights gained by analyzing Amazon CloudFront access logs help improve website availability through bot detection and mitigation, optimize web content based on the devices and browsers used to view your webpages, reduce perceived latency by caching popular objects closer to their viewers, and so on. This results in a significant improvement in the overall perceived experience for the user.

This blog post provides a way to build a serverless architecture to generate some of these insights. To do so, we analyze Amazon CloudFront access logs both at rest and in transit through the stream. This serverless architecture uses Amazon Athena to analyze large volumes of CloudFront access logs (on the scale of terabytes per day), and Amazon Kinesis Analytics for streaming analysis.

The analytic queries in this blog post focus on three common use cases:

  1. Detection of common bots using the user agent string
  2. Calculation of current bandwidth usage per Amazon CloudFront distribution per edge location
  3. Determination of the current top 50 viewers

However, you can easily extend the architecture described here to power dashboards for monitoring and reporting, and to trigger alarms based on deeper insights gained by processing and analyzing the logs. Some examples are dashboards for cache performance, usage and viewer patterns, and so on.
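As an example of the first use case, a bot-detection query in Amazon Athena might look like the following. The table name and column names are hypothetical placeholders; they depend on how you define your CloudFront log table in Athena:

select useragent,
       count(*) as request_count
from cloudfront_logs
where lower(useragent) like '%bot%'
   or lower(useragent) like '%crawler%'
   or lower(useragent) like '%spider%'
group by useragent
order by request_count desc
limit 50;

Matching on common substrings in the user agent string is a simple heuristic; production bot detection would typically combine it with request-rate and path-pattern signals.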

The following diagram shows this architecture.


Amazon QuickSight Now Supports Federated Single Sign-On Using SAML 2.0

by Jose Kunnackal

Since launch, Amazon QuickSight has enabled business users to quickly and easily analyze data from a wide variety of data sources with superfast visualization capabilities enabled by SPICE (Superfast, Parallel, In-memory Calculation Engine). When setting up Amazon QuickSight access for business users, administrators have a choice of authentication mechanisms. These include Amazon QuickSight–specific credentials, AWS credentials, or, in the case of Amazon QuickSight Enterprise Edition, existing Microsoft Active Directory credentials. Although each of these mechanisms provides a reliable, secure authentication process, they all require end users to enter their credentials every time they log in to Amazon QuickSight. In addition, the current invitation model for user onboarding requires administrators to add users to Amazon QuickSight accounts either via email invitations or via AD group membership, which can contribute to delays in user provisioning.

Today, we are happy to announce two new features that will make user authentication and provisioning simpler: Federated Single Sign-On (SSO) and just-in-time (JIT) user creation.

Federated Single Sign-On

Federated SSO authentication to web applications (including the AWS Management Console) and Software-as-a-Service products has become increasingly popular, because Federated SSO lets organizations consolidate end-user authentication to external applications.

Traditionally, SSO involves the use of a centralized identity store (such as Active Directory or LDAP) to authenticate the user against applications within a corporate network. The growing popularity of SaaS and web applications created the need to authenticate users outside corporate networks. Federated SSO makes this scenario possible. It provides a mechanism for external applications to direct authentication requests to the centralized identity store and receive an authentication token back with the response and validity. SAML is the most common protocol used as a basis for Federated SSO capabilities today.

With Federated SSO in place, business users sign in to their Identity Provider portals with existing credentials and access QuickSight with a single click, without having to enter any QuickSight-specific passwords or account names. This makes it simple for users to access Amazon QuickSight for data analysis needs.


Build a Visualization and Monitoring Dashboard for IoT Data with Amazon Kinesis Analytics and Amazon QuickSight

by Karan Desai

Customers across the world are increasingly building innovative Internet of Things (IoT) workloads on AWS. With AWS, they can handle the constant stream of data coming from millions of new, internet-connected devices. This data can be a valuable source of information if it can be processed, analyzed, and visualized quickly in a scalable, cost-efficient manner. Engineers and developers can monitor performance and troubleshoot issues, while sales and marketing teams can track usage patterns and statistics to inform business decisions.

In this post, I demonstrate a sample solution to build a quick and easy monitoring and visualization dashboard for your IoT data using AWS serverless and managed services. There’s no need for purchasing any additional software or hardware. If you are already using AWS IoT, you can build this dashboard to tap into your existing device data. If you are new to AWS IoT, you can be up and running in minutes using sample data. Later, you can customize it to your needs, as your business grows to millions of devices and messages.


The following is a high-level architecture diagram showing the serverless setup to configure.


AWS service overview

AWS IoT is a managed cloud platform that lets connected devices interact easily and securely with cloud applications and other devices. AWS IoT can process and route billions of messages to AWS endpoints and to other devices reliably and securely.

Amazon Kinesis Firehose is the easiest way to capture, transform, and load streaming data continuously into AWS from thousands of data sources, such as IoT devices. It is a fully managed service that automatically scales to match the throughput of your data and requires no ongoing administration.

Amazon Kinesis Analytics allows you to process streaming data coming from IoT devices in real time with standard SQL, without having to learn new programming languages or processing frameworks, providing actionable insights promptly.
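A Kinesis Analytics application expresses this processing as SQL over in-application streams. As a minimal sketch, the stream and column names below are placeholders (SOURCE_SQL_STREAM_001 is the default name for the first input stream), and the query counts messages per device over a one-minute tumbling window:

CREATE OR REPLACE STREAM "DESTINATION_SQL_STREAM" (
  device_id    VARCHAR(64),
  msg_count    INTEGER
);

CREATE OR REPLACE PUMP "STREAM_PUMP" AS
  INSERT INTO "DESTINATION_SQL_STREAM"
  SELECT STREAM "device_id", COUNT(*)
  FROM "SOURCE_SQL_STREAM_001"
  GROUP BY "device_id",
           FLOOR(("SOURCE_SQL_STREAM_001".ROWTIME - TIMESTAMP '1970-01-01 00:00:00') MINUTE / 1 TO MINUTE);

The FLOOR expression on ROWTIME is the standard Kinesis Analytics idiom for a tumbling time window; the pump continuously inserts each window’s results into the destination stream.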


Build a Healthcare Data Warehouse Using Amazon EMR, Amazon Redshift, AWS Lambda, and OMOP

by Ryan Hood

In the healthcare field, data comes in all shapes and sizes. Despite efforts to standardize terminology, some concepts (e.g., blood glucose) are still often depicted in different ways. This post demonstrates how to convert an openly available dataset called MIMIC-III, which consists of de-identified medical data for about 40,000 patients, into an open source data model known as the Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM). It describes the architecture and steps for analyzing data across various disconnected sources of health datasets so you can start applying Big Data methods to health research.

Before designing and deploying a healthcare application on AWS, make sure that you read through the AWS HIPAA Compliance whitepaper. This covers the information necessary for processing and storing patient health information (PHI).

Note: If you arrived at this page looking for more info on the movie Mimic 3: Sentinel, you might not enjoy this post.

OMOP overview

The OMOP CDM helps standardize healthcare data and makes it easier to analyze outcomes at a large scale. The CDM is gaining a lot of traction in the health research community, which is deeply involved in developing and adopting a common data model. Community resources are available for converting datasets, and there are software tools to help unlock your data after it’s in the OMOP format. The great advantage of converting data sources into a standard data model like OMOP is that it allows for streamlined, comprehensive analytics and helps remove the variability associated with analyzing health records from different sources.

OMOP ETL with Apache Spark

Observational Health Data Sciences and Informatics (OHDSI) provides the OMOP CDM in a variety of formats, including Apache Impala, Oracle, PostgreSQL, and SQL Server. (See the OHDSI Common Data Model repo in GitHub.) In this scenario, the data is moved to AWS to take advantage of the unbounded scale of Amazon EMR and serverless technologies, and the variety of AWS services that can help make sense of the data in a cost-effective way—including Amazon Machine Learning, Amazon QuickSight, and Amazon Redshift.

This example demonstrates an architecture that can be used to run SQL-based extract, transform, load (ETL) jobs to map any data source to the OMOP CDM. It uses MIMIC ETL code provided by Md. Shamsuzzoha Bayzid. The code was modified to run in Amazon Redshift.

Getting access to the MIMIC-III data

Before you can retrieve the MIMIC-III data, you must request access on the PhysioNet website, which is hosted on Amazon S3 as part of the Amazon Web Services (AWS) Public Dataset Program. However, you don’t need access to the MIMIC-III data to follow along with this post.

Solution architecture and loading process

The following diagram shows the architecture that is used to convert the MIMIC-III dataset to the OMOP CDM.