Category: Amazon Redshift


AWS Database Migration Service – 20,000 Migrations and Counting

by Jeff Barr | in Amazon RDS, Amazon Redshift, AWS Database Migration Service

I first wrote about AWS Database Migration Service just about a year ago in my AWS Database Migration Service post. At that time I noted that over 1,000 AWS customers had already made use of the service as part of their move to AWS.

As a quick recap, AWS Database Migration Service and Schema Conversion Tool (SCT) help our customers migrate their relational data from expensive, proprietary databases and data warehouses (either on premises or in the cloud, but with restrictive licensing terms either way) to more cost-effective cloud-based databases and data warehouses such as Amazon Aurora, Amazon Redshift, MySQL, MariaDB, and PostgreSQL, with minimal downtime along the way. Our customers tell us that they love the flexibility and the cost-effective nature of these moves. For example, moving to Amazon Aurora gives them access to a database that is MySQL and PostgreSQL compatible, at 1/10th the cost of a commercial database. Take a peek at our AWS Database Migration Services Customer Testimonials to see how Expedia, Thomas Publishing, Pega, and Veoci have made use of the service.

20,000 Unique Migrations
I’m pleased to be able to announce that our customers have already used AWS Database Migration Service to migrate 20,000 unique databases to AWS and that the pace continues to accelerate (we reached 10,000 migrations in September of 2016).

We’ve added many new features to DMS and SCT over the past year. Here’s a summary:

Learn More
Here are some resources that will help you to learn more and to get your own migrations underway, starting with some recent webinars:

Migrating From Sharded to Scale-Up – Some of our customers implemented a scale-out strategy in order to deal with their relational workload, sharding their database across multiple independent pieces, each running on a separate host. As part of their migration, these customers often consolidate two or more shards onto a single Aurora instance, reducing complexity, increasing reliability, and saving money along the way. If you’d like to do this, check out the blog post, webinar recording, and presentation.

Migrating From Oracle or SQL Server to Aurora – Other customers migrate from commercial databases such as Oracle or SQL Server to Aurora. If you would like to do this, check out this presentation and the accompanying webinar recording.

We also have plenty of helpful blog posts on the AWS Database Blog:

  • Reduce Resource Consumption by Consolidating Your Sharded System into Aurora – “You might, in fact, save bunches of money by consolidating your sharded system into a single Aurora instance or fewer shards running on Aurora. That is exactly what this blog post is all about.”
  • How to Migrate Your Oracle Database to Amazon Aurora – “This blog post gives you a quick overview of how you can use the AWS Schema Conversion Tool (AWS SCT) and AWS Database Migration Service (AWS DMS) to facilitate and simplify migrating your commercial database to Amazon Aurora. In this case, we focus on migrating from Oracle to the MySQL-compatible Amazon Aurora.”
  • Cross-Engine Database Replication Using AWS Schema Conversion Tool and AWS Database Migration Service – “AWS SCT makes heterogeneous database migrations easier by automatically converting source database schema. AWS SCT also converts the majority of custom code, including views and functions, to a format compatible with the target database.”
  • Database Migration—What Do You Need to Know Before You Start? – “Congratulations! You have convinced your boss or the CIO to move your database to the cloud. Or you are the boss, CIO, or both, and you finally decided to jump on the bandwagon. What you’re trying to do is move your application to the new environment, platform, or technology (aka application modernization), because usually people don’t move databases for fun.”
  • How to Script a Database Migration – “You can use the AWS DMS console or the AWS CLI or the AWS SDK to perform the database migration. In this blog post, I will focus on performing the migration with the AWS CLI.”

The documentation includes five helpful walkthroughs:

There’s also a hands-on lab (you will need to register in order to participate).

See You at a Summit
The DMS team is planning to attend and present at many of our upcoming AWS Summits and would welcome the opportunity to discuss your database migration requirements in person.

Jeff;

 

 

Amazon Kinesis – Setting up a Streaming Data Pipeline

by Ana Visneski | in Amazon Kinesis, Amazon QuickSight, Amazon Redshift

Ray Zhu from the Amazon Kinesis team wrote this great post about how to set up a streaming data pipeline. He carefully shows you step by step how he set it all up and how you can do it too.

-Ana


 

Consumer demand for better experiences keeps increasing. Companies across different industry segments are looking for ways to differentiate their products and services. Data is a key ingredient for providing differentiated products and services, and this is no longer a secret but rather a well-adopted practice. Almost all companies of meaningful size use some sort of data technology, which means that simply being able to collect and use data is no longer enough as a differentiating factor. Then what? How fast you can collect and use your data becomes the key to staying competitive.

Streaming data technologies shorten the time to analyze and use your data from hours and days to minutes and seconds. Let’s walk through an example of using Amazon Kinesis Firehose, Amazon Redshift, and Amazon QuickSight  to set up a streaming data pipeline and visualize Maryland traffic violation data in real time.

Data Flow Overview

(Data flow diagram: streaming data is delivered through Amazon Kinesis Firehose into Amazon Redshift by way of an intermediate S3 bucket, then visualized with Amazon QuickSight.)

Step 1: Set up a Redshift database and table

In this step, I’ll set up a Redshift table for Kinesis Firehose to continuously load streaming data into. I first start a single node Redshift cluster and name it “raystreaming.”

Under “Additional Configuration”, I make the cluster publicly accessible so that Kinesis Firehose and QuickSight can connect to my cluster.

After reviewing all configurations, I click on “Launch Cluster”.

Once the cluster is active, I go to the cluster’s VPC Security Groups to add inbound access for Kinesis Firehose service IPs and outbound access for QuickSight service IPs.

Kinesis Firehose service IPs:

US East (N. Virginia) 52.70.63.192/27
US West (Oregon) 52.89.255.224/27
EU (Ireland) 52.19.239.192/27

QuickSight service IPs:

US East (N. Virginia) 52.23.63.224/27
US West (Oregon) (us-west-2)  54.70.204.128/27
EU (Ireland) (eu-west-1) 52.210.255.224/27

Now the cluster is set up and configured. I’ll use a JDBC tool and the SQL statement below to create a table for storing Maryland traffic violation data.

create table TrafficViolation(
 dateofstop date,
 timeofstop timestamp,
 agency varchar(100),
 subagency varchar(100),
 description varchar(300),
 location varchar(100),
 latitude varchar(100),
 longitude varchar(100),
 accident varchar(100),
 belts varchar(100),
 personalinjury varchar(100),
 propertydamage varchar(100),
 fatal varchar(100),
 commlicense varchar(100),
 hazmat varchar(100),
 commvehicle varchar(100),
 alcohol varchar(100),
 workzone varchar(100),
 state varchar(100),
 vehicletype varchar(100),
 year varchar(100),
 make varchar(100),
 model varchar(100),
 color varchar(100),
 violation varchar(100),
 type varchar(100),
 charge varchar(100),
 article varchar(100),
 contributed varchar(100),
 race varchar(100),
 gender varchar(100),
 drivercity varchar(100),
 driverstate varchar(100),
 dlstate varchar(100),
 arresttype varchar(100),
 geolocation varchar(100));

Step 2: Set up a Kinesis Firehose delivery stream

In this step, I’ll set up a Kinesis Firehose delivery stream to continuously deliver data to the “TrafficViolation” table created above.

I name my Firehose delivery stream “rayredshiftstream”. Under destination configurations, I choose “Amazon Redshift” as the destination and configure an intermediate S3 bucket. Kinesis Firehose will first load my streaming data into this intermediate bucket and then COPY it to Redshift. Loading data from S3 to Redshift is efficient and preserves resources on Redshift for queries. Also, I always have a backup of my data in S3 for other batch processes or in case my Redshift cluster is not accessible (e.g. under maintenance).

Subsequently, I enter the Redshift cluster, database, and table names along with Redshift user name and password. This user needs to have Redshift INSERT permission. I also specify “json ‘auto’” under COPY options to parse JSON formatted sample data.

I set the retry duration to 30 seconds. If a data load to my Redshift cluster fails, Kinesis Firehose retries for 30 seconds. The failed data always remains in the intermediate S3 bucket for backfill. At the bottom, the exact COPY command that Kinesis Firehose will use is generated for testing purposes.
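That generated command follows standard Redshift COPY syntax. As a rough, hand-written sketch (the bucket, prefix, and IAM role below are placeholders, not the exact values the console generates), a COPY from the intermediate bucket looks like this:

COPY TrafficViolation
FROM 's3://<intermediate-bucket>/<prefix>'
CREDENTIALS 'aws_iam_role=arn:aws:iam::<account-id>:role/<delivery-role>'
JSON 'auto';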

On the next page, I specify the buffer size and buffer interval. Kinesis Firehose buffers streaming data up to a certain size or for a certain period of time before loading it to S3. This buffering significantly reduces the number (and cost) of S3 PUT requests and produces larger S3 objects, which load into Redshift more efficiently. I’m using the smallest buffer size (1 MB) and the shortest buffer interval (60 seconds) in this example in order to have data delivered sooner.

You can also optionally configure Kinesis Firehose to compress the data in GZIP format before loading it to S3 and use a KMS key to encrypt the data in S3. In this example, I configure my data to be uncompressed and unencrypted. Please note that if you enable GZIP compression, you’ll also need to add “gzip” under Redshift COPY options.

I also enable error logging so that Kinesis Firehose logs any delivery errors to my CloudWatch Log group. The error messages are viewable from the Kinesis Firehose console as well and are particularly useful for troubleshooting purposes.

Finally, I configure a default IAM role to allow Kinesis Firehose to access the resources I configured in the delivery stream.

After reviewing all configurations, I click on “Create Delivery Stream”.

Step 3: Send data to the Kinesis Firehose delivery stream

Now my Firehose delivery stream is set up and pointing to my Redshift table “TrafficViolation”. In this example, I’m using the Traffic Violations dataset from US Government Open Data. I use the Kinesis Firehose sample from the AWS Java SDK to parse records from a local CSV file and send each record to my delivery stream.

In a real streaming use case, you can imagine each data record being pushed to the delivery stream from police officers’ cellular devices through Firehose’s PutRecord() or PutRecordBatch() APIs as soon as a violation ticket is recorded.
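If you prefer Python to the Java SDK sample, a minimal boto3 sketch of that call looks like the following. The delivery stream name comes from this walkthrough; the record keys are an assumption and need to match the TrafficViolation column names so that the json 'auto' COPY option can map them.

import json
import boto3

firehose = boto3.client('firehose')  # region and credentials come from your environment

# One violation record; keys are assumed to match the TrafficViolation columns
record = {
    'dateofstop': '2014-09-30',
    'timeofstop': '2014-09-30 23:51:00',
    'agency': 'MCP',
    'description': 'DRIVER FAILURE TO STOP AT STEADY CIRCULAR RED SIGNAL',
    'gender': 'M',
}

# PutRecord sends a single record; PutRecordBatch accepts up to 500 records per call
firehose.put_record(
    DeliveryStreamName='rayredshiftstream',
    Record={'Data': json.dumps(record) + '\n'}
)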

A sample of the data looks like the following and includes information such as time of stop, vehicle type, driver gender, and so forth.

09/30/2014,23:51:00,MCP,"1st district, Rockville",\
  DRIVER FAILURE TO STOP AT STEADY CIRCULAR RED SIGNAL,\
  PARK RD AT HUNGERFORD DR,,,No,No,No,No,No,No,No,No,No,No,\
  MD,02 - Automobile,2014,FORD,MUSTANG,BLACK,Citation,21-202(h1),\
  Transportation Article,No,BLACK,M,ROCKVILLE,MD,MD,A - Marked Patrol,

Step 4: Visualize the data in QuickSight

As I continuously push data records to my delivery stream “rayredshiftstream”, I can see this data continuously being loaded into my Redshift table “TrafficViolation”.

Now I’m going to use QuickSight to analyze and visualize the data from my Redshift table “TrafficViolation”. I create a new analysis and a new data set pointing to my Redshift table “TrafficViolation”.

I use “Query” mode to retrieve data directly from my Redshift cluster so that new data is picked up as it is continuously streamed in from Kinesis Firehose.

With a few clicks, I create a bar chart that displays the number of traffic violations by gender and hour of the day. There are a few interesting patterns: 1) Male drivers have significantly more traffic violations than female drivers during morning hours. 2) Noon has the lowest number of violations. 3) From 2 PM to 4 PM, the gap in the number of violations between male and female drivers narrows.
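If you want to sanity-check a chart like this directly against the table, the underlying aggregation is a simple GROUP BY. Here is a sketch using the columns defined in Step 1:

-- Traffic violations by gender and hour of the day
SELECT gender,
       EXTRACT(hour FROM timeofstop) AS hour_of_day,
       COUNT(*) AS violations
FROM TrafficViolation
GROUP BY gender, EXTRACT(hour FROM timeofstop)
ORDER BY hour_of_day, gender;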

With a live dashboard, this graph will keep updating itself throughout the day as new data is continuously streamed from police officers’ devices to Redshift through Kinesis Firehose. Another interesting live dashboard to build would be a map that shows a heat map of traffic violations across the different districts of Maryland over time. I’ll leave that exercise to the readers of this blog; you can use your favorite business intelligence tool to build it.

That’s it!

Hopefully, by reading this blog post and trying it out yourself, you’ve gotten some inspiration about streaming data and a sense of how easy it is to get started with streaming data analytics on AWS. I can’t wait to see what streaming data analytics pipelines and applications you build for your organizations!

-Ray Zhu

Amazon QuickSight Now Generally Available – Fast & Easy to Use Business Analytics for Big Data

by Jeff Barr | in Amazon QuickSight, Amazon RDS, Amazon Redshift, Amazon S3, Launch

After a preview period that included participants from over 1,500 AWS customers ranging from startups to global enterprises, I am happy to be able to announce that Amazon QuickSight is now generally available! When I invited you to join the preview last year, I wrote:

In the past, Business Intelligence required an incredible amount of undifferentiated heavy lifting. You had to pay for, set up and run the infrastructure and the software, manage scale (while users fret), and hire consultants at exorbitant rates to model your data. After all that your users were left to struggle with complex user interfaces for data exploration while simultaneously demanding support for their mobile devices. Access to NoSQL and streaming data? Good luck with that!

Amazon QuickSight provides you with very fast, easy to use, cloud-powered business analytics at 1/10th the cost of traditional on-premises solutions. QuickSight lets you get started in minutes. You log in, point to a data source, and begin to visualize your data. Behind the scenes, the SPICE (Super-fast, Parallel, In-Memory Calculation Engine) will run your queries at lightning speed and provide you with highly polished data visualizations.

Deep Dive into Data
Every customer that I speak with wants to get more value from their stored data. They realize that the potential value locked up within the data is growing by the day, but are sometimes disappointed to learn that finding and unlocking that value can be expensive and difficult. On-premises business analytics tools are expensive to license and can place a heavy load on existing infrastructure. Licensing costs and the complexity of the tools can restrict the user base to just a handful of specialists.  Taken together, all of these factors have led many organizations to conclude that they are not ready to make the investment in a true business analytics function.

QuickSight is here to change that! It runs as a service and makes business analytics available to organizations of all shapes and sizes. It is fast and easy to use, does not impose a load on your existing infrastructure, and is available for a monthly fee that starts at just $9 per user.

As you’ll see in a moment, QuickSight allows you to work on data that’s stored in many different services and locations. You can get to your Amazon Redshift data warehouse, your Amazon Relational Database Service (RDS) relational databases, or your flat files in S3. You can also use a set of connectors to access data stored in on-premises MySQL, PostgreSQL, and SQL Server databases, Microsoft Excel spreadsheets, Salesforce and other services.

QuickSight is designed to scale with you. You can add more users, more data sources, and more data without having to purchase more long-term licenses or roll more hardware into your data center.

Take the Tour
Let’s take a tour through QuickSight. The administrator for my organization has already invited me to use QuickSight, so I am ready to log in and get started. Here’s the main screen:

I’d like to start by getting some data from a Redshift cluster. I click on Manage data and review my existing data sets:

I don’t see what I am looking for, so I click on New data set and review my options:

I click on Redshift (manual connect) and enter the credentials so that I can access my data warehouse (if I had a Redshift cluster running within my AWS account it would be available as an auto-discovered source):

QuickSight queries the data warehouse and shows me the schemas (sets of tables) and the tables that are available to me. I’ll select the public schema and the all_flights table to get started:

Now I have two options. I can pull the table in to SPICE for quick analysis or I can query it directly. I’ll pull it in to SPICE:

Again, I have two options! I can click on Edit/Preview data and select the rows and columns to import, or I can click on Visualize to import all of the data and proceed to the fun part! I’ll go for Edit/Preview. I can see the fields (on the left), and I can select only those that are of interest using the checkboxes:

I can also click on New Filter, select a field from the popup menu, and then create a filter:

Both options (selecting fields and filtering on rows) allow me to control the data that I pull in to SPICE. This allows me to control the data that I want to visualize and also helps me to make more efficient use of memory. Once I am ready to proceed, I click on Prepare data & visualize. At this point the data is loaded in to SPICE and I’m ready to start visualizing it. I simply select a field to get started.  For example,  I can select the origin_state_abbr field and see how many flights originate in each state:

The miniaturized view on the right gives me some additional context. I can scroll up or down or select the range of values to display.  I can also click on a second field to learn more. I’ll click on flights, set the sort order to descending, and scroll to the top. Now I can see how many of the flights in my data originated in each state:

QuickSight’s AutoGraph feature automatically generates an appropriate visualization based on the data selected. For example, if I add the fl_date field, I get a state-by-state line chart over time:

Based on my query, the data types, and properties of the data, QuickSight also proposes alternate visualizations:

I also have my choice of many other visual types including vertical & horizontal bar charts, line charts, pivot tables, tree maps, pie charts, and heat maps:

Once I have created some effective visualizations, I can capture them and use the resulting storyboard to tell a data-driven story:

I can also share my visualizations with my colleagues:

Finally, my visualizations are accessible from my mobile device:

Pricing & SPICE Capacity
QuickSight comes with one free user and 1 GB of SPICE capacity for free, perpetually. This allows every AWS user to analyze their data and to gain business insights at no cost. The Standard Edition of Amazon QuickSight starts at $9 per month and includes 10 GB of SPICE capacity (see the QuickSight Pricing page for more info).

It is easy to manage SPICE capacity. I simply click on Manage QuickSight in the menu (I must have the ADMIN role in order to be able to make changes):

Then I can see where I stand:

I can click on Purchase more capacity to do exactly that:

I can also click on Release unused purchased capacity in order to reduce the amount of SPICE capacity that I own:

Get Started Today
Amazon QuickSight is now available in the US East (Northern Virginia), US West (Oregon), and EU (Ireland) regions and you can start using it today.

Despite the length of this blog post I have barely scratched the surface of QuickSight. Given that you can use it at no charge, I would encourage you to sign up, load some of your data, and take QuickSight for a spin!

We have a webinar coming up on January 16th where you can learn even more! Sign up here.

Jeff;

 

New – Upload AWS Cost & Usage Reports to Redshift and QuickSight

by Jeff Barr | in Amazon QuickSight, Amazon Redshift, Big Data, Launch

Many AWS customers have been asking us for a way to programmatically analyze their Cost and Usage Reports (read New – AWS Cost and Usage Reports for Comprehensive and Customizable Reporting for more info). These customers are often using AWS to run multiple lines of business, making use of a wide variety of services, often spread out across multiple regions. Because we provide very detailed billing and cost information, this is a Big Data problem and one that can be easily addressed using AWS services!

While I was on vacation earlier this month, we launched a new feature that allows you to upload your Cost and Usage reports to Amazon Redshift and Amazon QuickSight. Now that I am caught up, I’d like to tell you about this feature.

Upload to Redshift
I started by creating a new Redshift cluster (if you already have a running cluster, you need not create another one). Here’s my cluster:

Next, I verified that I had enabled the Billing Reports feature:

Then I hopped over to the Cost and Billing Reports and clicked on Create report:

Next, I named my report (MyReportRedshift), made it Hourly, and enabled support for both Redshift and QuickSight:

I wrapped things up by selecting my delivery options:

I confirmed my desire to create a report on the next page, and then clicked on Review and Complete. The report was created and I was informed that the first report would arrive in the bucket within 24 hours:

While I was waiting I installed PostgreSQL on my EC2 instance (sudo yum install postgresql94) and verified that I was signed up for the Amazon QuickSight preview. Also, following the directions in Create an IAM Role, I made a read-only IAM role and captured its ARN:

Back in the Redshift console, I clicked on Manage IAM Roles and associated the ARN with my Redshift cluster:

The next day, I verified that the files were arriving in my bucket as expected, and then returned to the console in order to retrieve a helper file so that I could access Redshift:

I clicked on Redshift file and then copied the SQL command:

I inserted the ARN and the S3 region name into the SQL (I had to add quotes around the region name in order to make the query work as expected):

And then I connected to Redshift using psql (I can use any visual or CLI-based SQL client):

$ psql -h jbcluster.XYZ.us-east-1.redshift.amazonaws.com \
  -U root -p 5439 -d dev

Then I ran the SQL command. It created a pair of tables and imported the billing data from S3.

Querying Data in Redshift
Using some queries supplied by my colleagues as a starting point, I summed up my S3 usage for the month:

And then I looked at my costs on a per-AZ basis:

And on a per-AZ, per-service basis:
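If you’d like to try similar queries yourself, here is a rough sketch. The table and column names below are assumptions based on the helper SQL that the console generates (a month-stamped billing table with lineitem_* columns); substitute the names from your own generated script.

-- Monthly S3 usage and cost
SELECT lineitem_productcode,
       SUM(lineitem_usageamount) AS usage_amount,
       SUM(lineitem_blendedcost) AS total_cost
FROM awsbilling201701
WHERE lineitem_productcode = 'AmazonS3'
GROUP BY lineitem_productcode;

-- Cost per Availability Zone
SELECT lineitem_availabilityzone,
       SUM(lineitem_blendedcost) AS total_cost
FROM awsbilling201701
GROUP BY lineitem_availabilityzone
ORDER BY total_cost DESC;

-- Cost per Availability Zone and service
SELECT lineitem_availabilityzone,
       lineitem_productcode,
       SUM(lineitem_blendedcost) AS total_cost
FROM awsbilling201701
GROUP BY lineitem_availabilityzone, lineitem_productcode
ORDER BY total_cost DESC;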

Just for fun, I spent some time examining the Redshift Console. I was able to see all of my queries:

Analyzing Data with QuickSight
I also spent some time analyzing the cost and billing data using Amazon QuickSight. I signed in and clicked on Connect to another data source or upload a file:

Then I dug in to my S3 bucket (jbarr-bcm) and captured the URL of the manifest file (MyReportRedshift-RedshiftManifest.json):

I selected S3 as my data source and entered the URL:

QuickSight imported the data in a few seconds and the new data source was available. I loaded it into SPICE (QuickSight’s in-memory calculation engine). With three or four  more clicks I focused on the per-AZ data, and excluded the data that was not specific to an AZ:

Another click and I switched to a pie chart view:

I also examined the costs on a per-service basis:

As you can see, the new data and the analytical capabilities of QuickSight allow me (and you) to dive deep into your AWS costs in minutes.

Available Now
This new feature is available now and you can start using it today!

Jeff;

 

Learn about Amazon Redshift in our new Data Warehousing on AWS Class

by Jeff Barr | in Amazon Redshift, Big Data, Training and Certification

As our customers continue to look to use their data to help drive their missions forward, finding a way to simply and cost-effectively make use of analytics is becoming increasingly important. That is why I am happy to announce the upcoming availability of Data Warehousing on AWS, a new course that helps customers leverage the AWS Cloud as a platform for data warehousing solutions.

New Course
Data Warehousing on AWS is a new three-day course that is designed for database architects, database administrators, database developers, and data analysts/scientists. It introduces you to concepts, strategies, and best practices for designing a cloud-based data warehousing solution using Amazon Redshift. This course demonstrates how to collect, store, and prepare data for the data warehouse by using other AWS services such as Amazon DynamoDB, Amazon EMR, Amazon Kinesis, and Amazon S3. Additionally, this course demonstrates how you can use business intelligence tools to perform analysis on your data. Organizations that are looking to get more out of their data by implementing a data warehousing solution, or expanding their current data warehousing practice, are encouraged to sign up.

These classes (and many more) are available through AWS and our Training Partners. Find upcoming classes in our global training schedule or learn more at AWS Training.

Jeff;

 

Amazon Redshift – Up to 2X Throughput and 10X Vacuuming Performance Improvements

by Jeff Barr | in Amazon Redshift

My colleague Maor Kleider wrote today’s guest post!

Jeff;


Amazon Redshift, AWS’s fully managed data warehouse service, makes petabyte-scale data analysis fast, cheap, and simple. Since launch, it has been one of AWS’s fastest growing services, with many thousands of customers across many industries. Enterprises such as NTT DOCOMO, NASDAQ, FINRA, Johnson & Johnson, Hearst, Amgen, and web-scale companies such as Yelp, Foursquare and Yahoo! have made Amazon Redshift a key component of their analytics infrastructure.

In this blog post, we look at performance improvements we’ve made over the last several months to Amazon Redshift, improving throughput by more than 2X and vacuuming performance by 10X.

Column Store
Large scale data warehousing is largely an I/O problem, and Amazon Redshift uses a distributed columnar architecture to minimize and parallelize I/O. In a column-store, each column of a table is stored in its own data block. This reduces data size, since we can choose compression algorithms optimized for each type of column. It also reduces I/O time during queries, because only the columns in the table that are being selected need to be retrieved.

However, while a column-store is very efficient at reading data, it is less efficient than a row-store at loading and committing data, particularly for small data sets. In patch 1.0.1012 (December 17, 2015), we released a significant improvement to our I/O and commit logic. This helped with small data loads and queries using temporary tables. While the improvements are workload-dependent, we estimate the typical customer saw a 35% improvement in overall throughput.

Regarding this feature, Naeem Ali, Director of Software Development, Data Science at Cablevision, told us:

Following the release of the I/O and commit logic enhancement, we saw a 2X performance improvement on a wide variety of workloads. The more complex the queries, the higher the performance improvement.

Improved Query Processing
In addition to enhancing the I/O and commit logic for Amazon Redshift, we released an improvement to the memory allocation for query processing in patch 1.0.1056 (May 17, 2016), increasing overall throughput by up to 60% (as measured on the standard TPC-DS benchmark at 3 TB), depending on the workload and the number of queries that spill from memory to disk. The query throughput improvement increases with the number of concurrent queries, as less data is spilled from memory to disk, reducing the required I/O.

Taken together, these two improvements should double performance for customer workloads where a portion of the workload contains complex queries that spill to disk or cause temporary tables to be created.

Better Vacuuming
Amazon Redshift uses multi-version concurrency control to reduce contention between readers and writers to a table. Like PostgreSQL, it does this by marking old versions of data as deleted and new versions as inserted, using the transaction ID as a marker. This allows readers to build a snapshot of the data they are allowed to see and traverse the table without locking. One issue with this approach is that the system becomes slower over time, requiring a vacuum command to reclaim the space. This command reclaims the space from deleted rows and ensures that new data added to the table is placed in the right sorted order.

We are releasing a significant performance improvement to vacuum in patch 1.0.1056, available starting May 17, 2016. Customers previewing the feature have seen dramatic improvements both in vacuum performance and in overall system throughput, as vacuum requires fewer resources.

Ari Miller, a Principal Software Engineer at TripAdvisor, told me:

We estimate that the vacuum operation on a 15TB table went about 10X faster with the recent patch, ultimately improving overall query performance.

 You can query the VERSION function to verify that you are running at the desired patch level.
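For example, a quick version check and a manual vacuum look like this (the table name is illustrative):

-- Confirm the cluster is at the desired patch level
SELECT version();

-- Reclaim space from deleted rows and re-sort newly loaded data
VACUUM sales;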

Available Now
Unlike with on-premises data warehousing solutions, there are no license or maintenance fees for these improvements, and no work is required on your part to obtain them. They simply show up as part of the automated patching process during your maintenance window.

Maor Kleider, Senior Product Manager, Amazon Redshift

 

User Defined Functions for Amazon Redshift

by Jeff Barr | in Amazon Redshift

The Amazon Redshift team is on a tear. They are listening to customer feedback and rolling out new features all the time! Below you will find an announcement of another powerful and highly anticipated new feature.

— Jeff;


Amazon Redshift makes it easy to launch a petabyte-scale data warehouse. For less than $1,000/Terabyte/year, you can focus on your analytics, while Amazon Redshift manages the infrastructure for you. Amazon Redshift’s price and performance has allowed customers to unlock diverse analytical use cases to help them understand their business. As you can see from blog posts by Yelp, Amplitude and Cake, our customers are constantly pushing the boundaries of what’s possible with data warehousing at scale.

To extend Amazon Redshift’s capabilities even further and make it easier for our customers to drive new insights, I am happy to announce that Amazon Redshift has added scalar user-defined functions (UDFs). Using PostgreSQL syntax, you can now create scalar functions in Python 2.7 custom-built for your use case, and execute them in parallel across your cluster.

Here’s a template that you can use to create your own functions:

CREATE [ OR REPLACE ] FUNCTION f_function_name 
( [ argument_name arg_type, ... ] )
RETURNS data_type
{ VOLATILE | STABLE | IMMUTABLE }
AS $$
  python_program
$$ LANGUAGE plpythonu;

Scalar UDFs return a single result value for each input value, similar to built-in scalar functions such as ROUND and SUBSTRING. Once defined, you can use UDFs in any SQL statement, just as you would use our built-in functions.

In addition to creating your own functions, you can take advantage of thousands of functions available through Python libraries to perform operations not easily expressed in SQL. You can even add custom libraries directly from S3 and the web. Out of the box, Amazon Redshift UDFs come integrated with the Python Standard Library and a number of other libraries, including:

  • NumPy and SciPy, which provide mathematical tools you can use to create multi-dimensional objects, do matrix operations, build optimization algorithms, and run statistical analyses.
  • Pandas, which offers high level data manipulation tools built on top of NumPy and SciPy, and that enables you to perform data analysis or an end-to-end modeling workflow.
  • Dateutil and Pytz, which make it easy to manipulate dates and time zones (such as figuring out how many months are left before the next Easter that occurs in a leap year; a sketch of that appears just after this list).
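As a sketch of that last example, here is what such a function could look like as a UDF; the function name and logic are hypothetical, not something shipped with the release:

CREATE FUNCTION f_months_to_leap_year_easter (d DATE)
RETURNS INTEGER
STABLE AS $$
  # Hypothetical helper: months until the next Easter that falls in a leap year
  import calendar
  from dateutil import easter
  from dateutil.relativedelta import relativedelta
  year = d.year
  while True:
      if calendar.isleap(year):
          e = easter.easter(year)
          if e >= d:
              delta = relativedelta(e, d)
              return delta.years * 12 + delta.months
      year += 1
$$ LANGUAGE plpythonu;

Once created, you call it like any built-in function, for example SELECT f_months_to_leap_year_easter(CURRENT_DATE);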

UDFs can be used to simplify complex operations. For example, if you wanted to extract the hostname out of a URL, you could use a regular expression such as:

SELECT REGEXP_REPLACE(url, '(https?)://([^@]*@)?([^:/]*)([/:].*|$)', '\3') FROM table;

Or, you could import urlparse, a Python URL parsing library, and create a function that extracts hostnames:

CREATE FUNCTION f_hostname(url VARCHAR)
RETURNS varchar
IMMUTABLE AS $$
import urlparse
return urlparse.urlparse(url).hostname
$$ LANGUAGE plpythonu;

Now, in SQL all you have to do is:

SELECT f_hostname(url) 
FROM table;

As our customers know, Amazon Redshift obsesses about security. We run UDFs inside a restricted container that is fully isolated. This means UDFs cannot corrupt your cluster or negatively impact its performance. Also, functions that write files or access the network are not supported. Despite being tightly managed, UDFs leverage Amazon Redshift’s MPP capabilities, including being executed in parallel on each node of your cluster for optimal performance.

To learn more about creating and using UDFs, please see our documentation and a detailed post on the AWS Big Data blog. Also, check out this how-to guide from APN Partner Looker. If you’d like to share the UDFs you’ve created with other Amazon Redshift customers, please reach out to us at redshift-feedback@amazon.com. APN Partner Periscope has already created a number of useful scalar UDFs and published them here.

We will be patching your cluster with UDFs over the next two weeks, depending on your region and maintenance window setting. The new cluster version will be 1.0.991. We know you’ve been asking for UDFs for some time and would like to thank you for your patience. We look forward to hearing from you about your experience at redshift-feedback@amazon.com.

Tina Adams, Senior Product Manager

Amazon Redshift – Now Faster and More Cost-Effective than Ever

by Jeff Barr | in Amazon Redshift

My colleague Tina Adams sent me a guest post to share news of a new instance type and new Reserved Instance offerings for Amazon Redshift.

Jeff;


Amazon Redshift makes analyzing petabyte-scale data fast, cheap, and simple. It delivers advanced technology capabilities, including parallel execution, compressed columnar storage, and end-to-end encryption, as a fully managed service, letting you focus on your data, not your database. All for less than $1,000/TB/YR. When launching a cluster, you can choose between our Dense Compute (SSD) and Dense Storage (HDD) instance families.

Today, we are making our Dense Storage family even faster and more cost-effective with a second-generation instance type, DS2. Moreover, you can now reserve all dense storage and dense compute instance types for one year with No Upfront payment, and receive a 20% discount over On-Demand rates. For steeper discounts, you can pay for your entire reserved instance term with one All Upfront payment.

New DS2 Instances
DS2 instances have twice the memory and compute power of their Dense Storage predecessor, DS1 (previously called DW1), but the same storage. DS2 also supports Enhanced Networking and provides 50% more disk throughput than DS1. On average, DS2 provides 50% better performance than DS1, but is priced exactly the same.

Instance | vCPU | Memory (GiB) | Network | Storage | I/O | Price/TB/Year (On Demand) | Price/TB/Year (3 Year RI)

Dense Storage – Current Generation
ds2.xlarge | 4 | 31 | Enhanced | 2 TB HDD | 0.50 GBps | $3,330 | $999
ds2.8xlarge | 36 | 244 | Enhanced – 10 Gbps | 16 TB HDD | 4.00 GBps | $3,330 | $999

Dense Storage – Previous Generation (formerly DW1)
ds1.xlarge | 2 | 15 | Moderate | 2 TB HDD | 0.30 GBps | $3,330 | $999
ds1.8xlarge | 16 | 120 | 10 Gbps | 16 TB HDD | 2.40 GBps | $3,330 | $999

Dense Compute – Current Generation (formerly DW2)
dc1.large | 2 | 15 | Enhanced | 0.16 TB SSD | 0.20 GBps | $18,327 | $5,498
dc1.8xlarge | 32 | 244 | Enhanced – 10 Gbps | 2.56 TB SSD | 3.70 GBps | $18,327 | $5,498

We expect existing DS1 customers to quickly adopt DS2. There’s really no reason not to. To move from DS1 to DS2, simply restore a DS2 cluster from a snapshot of a DS1 cluster of the same size. Restoring from snapshot is a push-button operation in our Console. Our streaming restore feature allows you to resume querying as soon as your new cluster is created, while data is streamed from S3 in the background. After the restore completes, if you’d like, you can resize your DS2 cluster with a few clicks:

No Upfront and All Upfront Reserved Instances
You now have three Reserved Instance payment options! This gives you the flexibility to determine how much you wish to pay upfront. To purchase these offerings simply visit the Reserved Nodes tab in our Console. Here are your options:

No Upfront – You pay nothing upfront, and commit to pay hourly over the course of one year at a 20% discount over On-Demand. This option is only offered for a one year term.

Partial Upfront – The same as our previous heavy utilization RI offering. You pay a portion of the Reserved Instance upfront, and the remainder over a one or three year term. The discount over On-Demand remains at up to 41% for a one year term and up to 73% for a three year term.

All Upfront – You pay for the entire Reserved Instance term (one or three years) with one upfront payment. This is your cheapest option, with a discount of up to 42% for a one year term and up to 76% for a three year term compared to On-Demand.

Stay Tuned
There’s more to come from Amazon Redshift, so stay tuned. To get the latest feature announcements, log in to the Amazon Redshift Forum and subscribe to the Amazon Redshift Announcements thread. Or use the Developer Guide History and the Management Guide History to track product updates. We’d also appreciate your feedback at redshift-feedback@amazon.com.

Tina Adams, Senior Product Manager

Look Before You Leap – The Coming Leap Second and AWS (Updated)

by Jeff Barr | in Amazon CloudFront, Amazon CloudSearch, Amazon CloudTrail, Amazon EC2, Amazon RDS, Amazon Redshift

My colleague Mingxue Zhao sent me a guest post designed to make sure that you are aware of an important time / clock issue.

Note: This post was first published on May 18, 2015. We made some important additions and corrections on May 25, 2015.

— Jeff;


The International Earth Rotation and Reference Systems (IERS) recently announced that an extra second will be injected into civil time at the end of June 30th, 2015. This means that the last minute of June 30th, 2015 will have 61 seconds. If a clock is synchronized to the standard civil time, it should show an extra second 23:59:60 on that day between 23:59:59 and 00:00:00. This extra second is called a leap second. There have been 25 such leap seconds since 1972. The last one took place on June 30th, 2012.

Clocks in IT systems do not always follow the standard above and can behave in many different ways. For example:

  • Some Linux kernels implement a one-second backwards jump instead of the extra “:60” second, repeating the 59th second (see the article Resolve Leap Second Issues in Red Hat Enterprise Linux for more information).
  • Windows time servers ignore the leap second signal and will sync to the correct time after the leap second (see How the Windows Time Service Treats a Leap Second for more information).
  • Some organizations, including Amazon Web Services, plan to spread the extra second over many hours surrounding the leap second by making every second slightly longer.
  • If a clock doesn’t connect to a time synchronization system, it drifts on its own and will not implement any leap second or an adjustment for it.

If you want to know whether your applications and systems can properly handle the leap second, contact your providers. If you run time-sensitive workloads and need to know how AWS clocks will behave, read this document carefully. In general, there are three affected parts:

  • The AWS Management Console and backend systems
  • Amazon EC2 instances
  • Other AWS managed resources

For more information about comparing AWS clocks to UTC, see the AWS Adjusted Time section of this post.

AWS Management Console and Backend Systems
The AWS Management Console and backend systems will NOT implement the leap second. Instead, we will spread the one extra second over a 24-hour period surrounding the leap second by making each second slightly longer. During these 24 hours, AWS clocks may be up to 0.5 second behind or ahead of the standard civil time (see the AWS Adjusted Time section for more information).

You can see adjusted times in consoles (including resource creation timestamps), metering records, billing records, Amazon CloudFront logs, and AWS CloudTrail logs. You will not see a “:60” second in these places and your usage will be billed according to the adjusted time.

Amazon EC2 Instances
Each EC2 instance has its own clock and is fully under your control; AWS does not manage instance clocks. An instance clock can have any of the behaviors listed at the beginning of this post. Contact your OS provider to understand the expected behavior of your operating system.

If you use the Amazon Linux AMI, your instance will implement the one-second backwards jump and will see “23:59:59” twice.  You may find the following information useful:

If you use SUSE Linux Enterprise Server, take a look at Fixes and Workaround to Avoid Issues Caused by Leap Second 2015.

Other AWS Managed Resources
Other AWS resources may also have their own clocks. Unlike EC2 instances, these resources are fully or partially managed by AWS.

The following resources will implement the one-second backwards jump and will see “23:59:59” twice:

  • Amazon CloudSearch clusters
  • Amazon EC2 Container Service instances
  • Amazon EMR Clusters
  • Amazon RDS instances
  • Amazon Redshift instances

To enable time synchronization on EMR clusters, your VPC has to allow access to NTP. Make sure that your EMR clusters have access to the Internet, and that your security groups and network ACLs allow outbound UDP traffic on port 123.

AWS Adjusted Time
This section provides specific details on how clocks will behave in the AWS Management Console and backend systems.

Starting at 12:00:00 PM on June 30th, 2015, we will slow down AWS clocks by 1/86400. Every second on AWS clocks will take 1+1/86400 seconds of “real” time, until 12:00:00 PM on  July 1st, 2015, when AWS clocks will be behind by a full second. Meanwhile, the standard civil time (UTC) will implement the leap second at the end of June 30th, 2015 and fall behind by a full second, too. Therefore, at 12:00:00 PM July 1st, 2015, AWS clocks will be synchronized to UTC again. The table below illustrates these changes.

UTC | AWS Adjusted Clock | AWS vs. UTC | Notes
11:59:59 AM June 30th, 2015 | 11:59:59 AM June 30th, 2015 | +0 | AWS clocks are synchronized to UTC.
12:00:00 PM | 12:00:00 PM | +0 |
12:00:01 | | | Each second is 1/86400 longer and AWS clocks fall behind UTC. The gap gradually increases to up to 1/2 second.
 | 12:00:01 | +1/86400 |
12:00:02 | | |
 | 12:00:02 | +2/86400 |
… | … | … |
23:59:59 | | |
 | 23:59:59 | +43199/86400 |
23:59:60 | | | Leap second injected to UTC.
00:00:00 AM July 1st, 2015 | | -1/2 | AWS clocks gain 1/2 second ahead of UTC.
 | 00:00:00 AM July 1st, 2015 | | AWS clocks keep falling behind and the gap with UTC shrinks gradually.
00:00:01 | | -43199/86400 |
 | 00:00:01 | |
00:00:02 | | -43198/86400 |
… | … | … |
11:59:59 AM | | -1/86400 |
 | 11:59:59 AM | |
12:00:00 PM July 1st, 2015 | 12:00:00 PM July 1st, 2015 | +0 | The gap shrinks to zero. AWS clocks synchronize to UTC again.
12:00:01 | 12:00:01 | +0 |

If you have any questions about this upcoming event, please contact AWS Support or post in the EC2 Forum.

Mingxue Zhao, Senior Product Manager

Quickly Filter Data in Amazon Redshift Using Interleaved Sorting

by Jeff Barr | in Amazon Redshift

My colleague Tina Adams sent a guest post to introduce another cool and powerful feature for Amazon Redshift.

Jeff;


Amazon Redshift, our fully managed data warehouse service, makes it fast and easy to analyze all your data. You can get started with a 160GB cluster for free for 2 months, and scale to a couple of petabytes for less than $1,000/TB/Year. Amazon Redshift uses a massively parallel processing, scale-out architecture to ensure compute resources grow with your data set size, and columnar, direct-attached storage to dramatically reduce I/O time. Many of our customers, including Pinterest, Euclid, Accordant Media, Asana and Coursera, have seen significant performance improvements with Amazon Redshift.

Reducing I/O is critical to improving data warehouse query performance. One way Amazon Redshift improves I/O performance is by using sorted tables to restrict the range of values that need to be scanned. When you sort a table, we use zone maps (cached in memory) to skip 1 MB blocks without relevant data. For example, if a table has five years of data sorted by date, 98% of the blocks can be eliminated when you query for a particular month.

But what if you need to filter by a column that is not in the sort key? Row-based systems deal with this by creating secondary indices to reduce the number of blocks accessed when filtering on a non-primary key column. Indices are less selective for columnar databases with larger block sizes. Instead, columnar systems often create projections, which are copies of the underlying table sorted in different ways. However, both indices and projections can require significant overhead. You have to maintain not only data you’re loading, but also indices and projections on top of that data. For example, with 3 projections, it may take 2-3 times longer to load your data. What’s more, the number of possible projections increases exponentially with the number of columns. With eight columns, you can have up to 40K possible projections.

New Interleaved Sort Keys
For fast filter queries without the need for indices or projections, Amazon Redshift now supports Interleaved Sort Keys, which will be deployed in every region over the next seven days. A table with interleaved keys arranges your data so each sort key column has equal importance. Interleaved sorts are most effective with highly selective queries that filter on multiple columns. Let’s say you have a table with 100,000 1 MB blocks per column, and you often filter on one or more of four columns (date, customer, product, geography). You can create a compound sort key, sorting first by date, then customer, product, and finally geography. This will work well if your query filters on date, but can require a table scan of all 100,000 blocks if you only filter by geographic region. Interleaved sorting provides fast filtering, no matter which sort key columns you specify in your WHERE clause. If you filter by any one of the four sort key columns, you scan about 5,600 blocks, or 100,000^(3/4). If you filter by two columns, you reduce the scan to about 316 blocks, or 100,000^(2/4). Finally, if you filter on all four sort keys, you scan a single block. (In general, filtering on f of the k interleaved sort key columns scans roughly N^((k-f)/k) of a table’s N blocks.)

Sound like magic? No, it’s just math. Let’s walk through another example. Imagine you run a website and want to analyze your top webpages and customers. You have two sort keys (customer ID and page ID) and data split across four blocks. How do you achieve fast analytics across both webpages and customers? You can choose to sort your data by customer ID and then by page ID via a compound sort key (the notation [1, 1] means that customer 1 viewed page 1):

This means pages viewed by a given customer are stored in the same block (orange rectangle). While you only need to scan one block to return the page views of a particular customer, you must scan all four blocks to return all customers that viewed a given page.

Alternatively, if you create an interleaved sort, both customer and page are treated the same:

As you can see, the first two customer IDs are stored in the first block, along with the first two page IDs (arrows indicate the sort order). Therefore, you only scan 2 blocks to return data on a given customer or a given webpage.

The performance benefit of interleaved sorting increases with table size. If your table contains 1,000,000 blocks (1 TB per column) with an interleaved sort key of both customer ID and page ID, you scan 1,000 blocks when you filter on a specific customer or page, a speedup of 1000x compared to the unsorted case.

To create an interleaved sort, simply define your sort keys as INTERLEAVED in your CREATE TABLE statement, as shown below. You can specify up to eight sort keys.

CREATE TABLE PageViews (customerID INTEGER, pageURL VARCHAR(300), duration INTEGER)
INTERLEAVED SORTKEY (pageURL, customerID); 

When we tuned a 45,000,000 row dataset with interleaved sort keys, queries on a subset of the sort columns ran 12x faster on average compared to using compound sort keys, as shown below.

SELECT COUNT(DISTINCT pageURL) FROM PageViews WHERE customerID = 57285;

Run time...

Compound Sort: 21.7 sec
Interleaved Sort: 1.7 sec


Interleaved sorting can even be useful for single-column keys. Our zone maps store the first eight bytes of sort key column values. When we generate a zone map on the pageURL column above using compound sort keys, the leading eight bytes would only cover “http://w”, which is not very selective. However, interleaved sorting compresses the key column, allowing 2^64 distinct URLs to fit in the zone map.

There are tradeoffs between using compound and interleaved sort keys. If the columns you query are always the same, you may be better off using compound sort keys. In the example above, a query filtering on the leading sort column, pageURL, runs 0.4 seconds faster using compound sort keys vs. interleaved sorting. This is not surprising, since compound keys favor leading columns. Depending on the nature of the dataset, vacuum time overheads can increase by 10-50% for interleaved sort keys vs. compound keys. If you have data that increases monotonically, such as dates, your interleaved sort order will skew over time, requiring you to run a vacuum operation to re-analyze the distribution and re-sort the data.
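For interleaved tables, that maintenance step is a VACUUM REINDEX, which re-analyzes the sort key distribution before re-sorting. A minimal sketch, using the PageViews table from above:

-- Re-analyze the interleaved sort key distribution, then re-sort the table
VACUUM REINDEX PageViews;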

The interleaved sorting feature will be deployed in every region over the next seven days. The new cluster version will be 1.0.921.

Learn More
Our tutorial on Optimizing for Star Schemas with Amazon Redshift goes over how to make these trade-offs in more detail. APN business intelligence partner Looker has also written a great blog post about the performance benefits of interleaved sorting. For more information on defining sort keys, please review our documentation on Best Practices for Designing Tables. We’d also appreciate your feedback at redshift-feedback@amazon.com.

Stay Tuned
There’s more to come from Amazon Redshift, so stay tuned. To get the latest feature announcements, log in to the Amazon Redshift Forum and subscribe to the Amazon Redshift Announcements thread. Or use the Developer Guide History and the Management Guide History to track product updates.

Tina Adams, Senior Product Manager