Tag: Amazon Redshift


Amazon Redshift Monitoring Now Supports End User Queries and Canaries

by Ian Meyers

Ian Meyers is a Solutions Architecture Senior Manager with AWS

The serverless Amazon Redshift Monitoring utility lets you gather important performance metrics from your Redshift cluster’s system tables and persist the results in Amazon CloudWatch. This serverless solution uses AWS Lambda to schedule custom SQL queries and process the results. With this utility, you can use Amazon CloudWatch to monitor disk-based queries, WLM queue wait time, alerts, average query times, and other data. This allows you to create visualizations with CloudWatch dashboards, set alarms on specific values, and create rules that react to those alarms.

[Image: redshift_monitoring]

You can now create your own diagnostic queries and plug in “canaries” that monitor the runtime of your most vital end user queries. These user-defined metrics can be used to create dashboards and trigger alarms, and they should improve visibility into the workloads running on a cluster. They might also help inform cluster sizing discussions.
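The plug-in format for user-defined queries is described in the utility’s README. Purely as an illustration (the label, metric name, and time window here are hypothetical), a canary-style diagnostic query might report the average runtime of a labeled dashboard query over the last hour:

-- Hypothetical canary metric: average runtime, in seconds, of queries that the
-- BI tool labels 'dashboard_sales' (via SET query_group), over the last hour.
SELECT
  'dashboard_sales_canary' AS metric_name,
  AVG(DATEDIFF(ms, starttime, endtime)) / 1000.0 AS avg_runtime_seconds
FROM stl_query
WHERE label = 'dashboard_sales'
  AND starttime > DATEADD(hour, -1, GETDATE());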

View the README to get started.

(more…)

Run Mixed Workloads with Amazon Redshift Workload Management

by Suresh Akena

Mixed workloads run batch and interactive workloads (short-running and long-running queries or reports) concurrently to support business needs or demand. Typically, managing and configuring mixed workloads requires a thorough understanding of access patterns, how system resources are used, and performance requirements.

It’s common for mixed workloads to have some processes that require higher priority than others. Sometimes, this means a certain job must complete within a given SLA. Other times, this means you only want to prevent a non-critical reporting workload from consuming too many cluster resources at any one time.

Without workload management (WLM), each query is prioritized equally, which can allow a person, team, or workload to consume excessive cluster resources for a process that isn’t as valuable as other, more business-critical jobs.

This post provides guidelines on common WLM patterns and shows how you can use WLM query insights to optimize configuration in production workloads.

Workload concepts

You can use WLM to define the separation of business concerns and to prioritize the different types of concurrently running queries in the system:

  • Interactive: Software that accepts input from humans as it runs. Interactive software includes most popular programs, such as BI tools or reporting applications.
    • Short-running, read-only user queries, such as a Tableau dashboard query with low latency requirements.
    • Long-running, read-only user queries, such as a complex structured report that aggregates the last 10 years of sales data.
  • Batch: Execution of a series of jobs without manual intervention (non-interactive). A batch process runs a series of programs on a set, or “batch,” of inputs rather than on a single input.
    • Batch queries include bulk INSERT, UPDATE, and DELETE transactions, for example, ETL or ELT programs.

Amazon Redshift Workload Management

Amazon Redshift is a fully managed, petabyte-scale, columnar, massively parallel data warehouse that offers scalability, security, and high performance. Amazon Redshift provides an industry-standard JDBC/ODBC driver interface, which allows customers to connect their existing business intelligence tools and reuse existing analytics queries.

Amazon Redshift is a good fit for any type of analytical data model, for example, star and snowflake schemas, or simple denormalized tables.

Managing workloads

Amazon Redshift Workload Management allows you to manage workloads of various sizes and complexity for specific environments. Parameter groups contain the WLM configuration, which determines how many query queues are available for processing and how queries are routed to those queues. The default parameter group settings are not configurable; create a custom parameter group to modify the settings in that group, and then associate it with your cluster. The following settings can be configured:

  • How many queries can run concurrently in each queue
  • How much memory is allocated among the queues
  • How queries are routed to queues, based on criteria such as the user who is running the query or a query label
  • Query timeout settings for a queue

When a user runs a query, WLM assigns the query to the first matching queue and applies rules based on the WLM configuration. For more information about WLM query queues, concurrency, user groups, query groups, timeout configuration, and queue hopping capability, see Defining Query Queues. For more information about the configuration properties that can be changed dynamically, see WLM Dynamic and Static Configuration Properties.
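As a quick sketch of query routing (the group, user, label, and sales table names below are only examples), you can route queries either by adding users to a database group listed in a queue’s user group list, or by labeling a session’s queries with a query group:

-- Route by user group: queries from members of bi_group go to the queue that lists it.
CREATE GROUP bi_group WITH USER bi_user;

-- Route by query group label for the current session.
SET query_group TO 'short_running';
SELECT COUNT(*) FROM sales;   -- runs in the queue whose query group list includes 'short_running'
RESET query_group;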

For example, the WLM configuration in the following screenshot has three queues to support ETL, BI, and other users. ETL jobs are assigned to the long-running queue and BI queries to the short-running queue. Other user queries are executed in the default queue.

[Screenshot: WLM configuration with three queues for ETL, BI, and other users]

(more…)

Converging Data Silos to Amazon Redshift Using AWS DMS

by Pratim Das

Organizations often grow organically—and so does their data in individual silos. Such systems are often powered by traditional RDBMS engines, and they grow independently in size and features. To gain intelligence across heterogeneous data sources, you have to join the data sets. However, this imposes new challenges, as joining data over dblinks or into a single view is extremely cumbersome and an operational nightmare.

This post walks through using AWS Database Migration Service (AWS DMS) and other AWS services to make it easy to converge multiple heterogeneous data sources to Amazon Redshift. You can then use Amazon QuickSight to visualize the converged dataset and gain additional business insights.

AWS service overview

Here’s a brief overview of AWS services that help with data convergence.

AWS DMS

With DMS, you can migrate your data to and from the most widely used commercial and open-source databases. The service supports homogenous migrations such as Oracle to Oracle, as well as heterogeneous migrations between different database platforms, such as Oracle to Amazon Aurora or Microsoft SQL Server to MySQL. It also allows you to stream data to Amazon Redshift from any of the supported sources, including:

  • Amazon Aurora
  • PostgreSQL
  • MySQL
  • MariaDB
  • Oracle
  • SAP ASE
  • SQL Server

DMS enables consolidation and easy analysis of data in the petabyte-scale data warehouse. It can also be used for continuous data replication with high availability.

Amazon QuickSight

Amazon QuickSight provides very fast, easy-to-use, cloud-powered business intelligence at 1/10th the cost of traditional BI solutions. QuickSight uses a new, super-fast, parallel, in-memory calculation engine (“SPICE”) to perform advanced calculations and render visualizations rapidly.

QuickSight integrates automatically with AWS data services, enables organizations to scale to hundreds of thousands of users, and delivers fast and responsive query performance to them. You can easily connect QuickSight to AWS data services, including Amazon Redshift, Amazon RDS, Amazon Aurora, Amazon S3, and Amazon Athena. You can also upload CSV, TSV, and spreadsheet files or connect to third-party data sources such as Salesforce.

Amazon Redshift

Amazon Redshift delivers fast query performance by using columnar storage technology to improve I/O efficiency and by parallelizing queries across multiple nodes. Amazon Redshift is typically priced at 1/10th the price of the competition. We have many customers running petabyte-scale data analytics on AWS using Amazon Redshift.

Amazon Redshift is also ANSI SQL compliant, supports JDBC/ODBC, and is easy to connect to your existing business intelligence (BI) solution. However, if your storage requirement is in the tens of terabytes and your workload requires high concurrency across small queries, you may want to consider Amazon Aurora as the target converged database.

Walkthrough

Assume that you have an events company specializing in sports, and you have built a MySQL database that holds data for the players and the sporting events. Customer and ticket information is stored in another database, in this case PostgreSQL, which gets updated when customers purchase tickets from your website and mobile apps. You can download a sample dataset from the aws-database-migration-samples GitHub repo.

These databases could be anywhere: at an on-premises facility; on AWS in Amazon EC2 or Amazon RDS, or at another cloud provider; or in a mixture of such locations. To complicate things a little more, you can assume that the lost opportunities (where a customer added a ticket to the shopping cart but didn’t complete the purchase) are streamed via clickstream through Amazon Kinesis and then stored on Amazon S3. You then use AWS Data Pipeline to orchestrate a process that cleanses that data using Amazon EMR and makes it ready for loading into Amazon Redshift. The clickstream integration is not covered in this post but was demonstrated in the recent Real-time Clickstream Anomaly Detection with Amazon Kinesis Analytics post.

Architecture

In this solution, you use DMS to bring the two data sources into Amazon Redshift and run analytics to gain business insights. The following diagram demonstrates the proposed solution.

[Diagram: proposed solution architecture for converging data silos to Amazon Redshift using AWS DMS]
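After DMS lands both sources in Amazon Redshift, a single query can join them. The following sketch is illustrative only; the table and column names are hypothetical and don’t necessarily match the sample dataset:

-- Revenue per sporting event, joining ticket sales (replicated from PostgreSQL)
-- with event details (replicated from MySQL).
SELECT e.event_name,
       COUNT(t.ticket_id)  AS tickets_sold,
       SUM(t.ticket_price) AS total_revenue
FROM tickets t
JOIN sporting_events e ON t.event_id = e.event_id
GROUP BY e.event_name
ORDER BY total_revenue DESC
LIMIT 10;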

(more…)

Decreasing Game Churn: How Upopa used ironSource Atom and Amazon ML to Engage Users

by Tom Talpir

This is a guest post by Tom Talpir, Software Developer at ironSource. ironSource is an Advanced AWS Partner Network (APN) Technology Partner and an AWS Big Data Competency Partner.

Ever wondered what it takes to keep a user from leaving your game or application after all the hard work you put in? Wouldn’t it be great to get a chance to interact with the users before they’re about to leave?

Finding these users can be difficult, mainly because most churn happens within the first few minutes or hours of a user’s gameplay. However, machine learning (ML) can make this possible by providing insights to help developers identify these users and engage with them to decrease the churn rate.

Upopa is a gaming studio that creates cool games (that you should definitely check out), and it was a great fit for our new project: leveraging Amazon Machine Learning (Amazon ML) to give game developers the ability to predict the future actions of their players and, ultimately, to reduce churn without having to learn complex ML algorithms.

[Image: o_upopa_1]

Upopa sends all their data to Amazon Redshift, using ironSource Atom, a data flow management solution that allows developers to send data from their application into many different types of data targets (including Amazon Redshift, Amazon S3, Amazon Elasticsearch Service, and other relational databases) with great ease.

Amazon ML turned out to be the right solution for Upopa because it integrates easily with Amazon Redshift, and it makes everything much easier with visualization tools and wizards that guide you through the process of creating ML models.

(more…)

Powering Amazon Redshift Analytics with Apache Spark and Amazon Machine Learning

by Radhika Ravirala and Wangechi Doble

Air travel can be stressful due to the many factors that are simply out of airline passengers’ control. As passengers, we want to minimize this stress as much as we can. We can do this by using past data to predict how likely a flight is to be delayed based on the time of day or the airline carrier.

In this post, we generate a predictive model for flight delays that can be used to help us pick the flight least likely to add to our travel stress. To accomplish this, we will use Apache Spark running on Amazon EMR for extracting, transforming, and loading (ETL) the data, Amazon Redshift for analysis, and Amazon Machine Learning for creating predictive models. This solution gives a good example of combining multiple AWS services to build a sophisticated analytical application in the AWS Cloud.

Architecture

At a high level, our solution includes the following steps:

Step 1 is to ingest the datasets (the flight data into Amazon Redshift and the weather data into a Hive table on Amazon EMR).

Step 2 is to enrich data by using ETL:

  • We transform the maximum and minimum temperature columns from Celsius to Fahrenheit in the weather table in Hive by using a user-defined function in Spark.
  • We enrich the flight data in Amazon Redshift to compute and include extra features and columns (departure hour, days to the nearest holiday) that help the Amazon Machine Learning algorithm’s learning process (a sketch of this enrichment follows this list).
  • We then combine both datasets in the Spark environment by using the spark-redshift package to load data from the Amazon Redshift cluster into Spark running on an Amazon EMR cluster. We write the enriched data back to an Amazon Redshift table using the spark-redshift package.
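As a rough sketch of the enrichment step in Amazon Redshift (the flights table, its columns, and the holiday lookup table are hypothetical), the extra features can be derived with ordinary SQL:

-- Derive departure-hour and days-to-nearest-holiday features before training.
CREATE TABLE flights_enriched AS
SELECT f.*,
       EXTRACT(hour FROM f.departure_ts)                    AS departure_hour,
       DATEDIFF(day, TRUNC(f.departure_ts), h.holiday_date) AS days_to_holiday
FROM flights f
JOIN nearest_holiday h ON h.flight_date = TRUNC(f.departure_ts);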

Step 3 is to perform predictive analytics:

  • In this last step, we use Amazon Machine Learning to create and train an ML model using Amazon Redshift as our data source. The trained Amazon ML model is used to generate predictions for the test dataset, which are output to an S3 bucket.

[Image: o_powering_redshift_analytics_1]

(more…)

Building an Event-Based Analytics Pipeline for Amazon Game Studios’ Breakaway

by Brent Nash

All software developers strive to build products that are functional, robust, and bug-free, but video game developers have an extra challenge: they must also create a product that entertains. When designing a game, developers must consider how the various elements—such as characters, story, environment, and mechanics—will fit together and, more importantly, how players will interact with those elements.

It’s not enough to just assume that those interactions occur as intended—is a particular level too difficult? Are the controls responsive enough? Is the combat balanced?  While in-house and focus testing can help answer those questions during development, nothing provides a better answer than actual data from real world players.

[Image: o_breakaway_1]

We’re currently developing Amazon Game Studios’ new title Breakaway; an online 4v4 team battle sport that delivers fast action, teamwork, and competition. We’re releasing Breakaway in a live alpha state on December 15, 2016, so that we can iterate and improve the game by analyzing massive amounts of gameplay data from our player community.

(more…)

Amazon Redshift Engineering’s Advanced Table Design Playbook: Table Data Durability

by Zach Christopherson

Part 1: Preamble, Prerequisites, and Prioritization
Part 2: Distribution Styles and Distribution Keys
Part 3: Compound and Interleaved Sort Keys
Part 4: Compression Encodings
Part 5: Table Data Durability (Translated into Japanese)


In the fifth and final installment of the Advanced Table Design Playbook, I’ll discuss how to use two simple table durability properties to squeeze even more performance out of your Amazon Redshift configuration.

Limiting Automated Backups

With automated backups, Amazon Redshift frequently pushes incremental block changes to Amazon S3. These automated backups are triggered based on a threshold of time or blocks changed. This approach makes recovery features such as snapshot restore and table-level restore operations possible.

The actual cost of the automated backup is generally insignificant. However, because it transfers data blocks to S3, it can potentially affect your cluster’s throughput to S3 for workloads that constantly perform large COPY or UNLOAD operations. To reduce the cost of these automated backups, we can use the table property BACKUP NO for tables whose data doesn’t need to be stored durably in S3. Tables where you can use this property include the following:

  1. Tables that are populated with easily recreated data
  2. Tables containing data that is transient or often changing in nature
  3. Tables that are populated with unimportant or experimental data

We use this property to achieve two things:

  1. Preventing inconsequential block changes from triggering frequent automated backups
  2. Reducing the quantity of blocks transferred to S3 during backups, to reduce backup duration

Note: Tables defined as temporary inherit this property by default, because these tables are cleaned up at the end of the session in which they were created.
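As a minimal sketch (the table and columns are hypothetical), the property is set at table creation time:

-- Staging data that can be re-created from source files in S3, so its blocks
-- are excluded from automated backups.
CREATE TABLE staging_clickstream (
  event_id   BIGINT,
  event_time TIMESTAMP,
  payload    VARCHAR(4096)
)
BACKUP NO;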

Reducing Synchronous Data Replication

By default, multinode Amazon Redshift configurations automatically store redundant copies of table data on other nodes in the same cluster, so the cluster can tolerate multiple disk and node failures. As with all synchronous data replication, some overhead is associated with the process. For temporary tables, this replication doesn’t occur, because the system assumes that the data is truly transient. Additionally, blocks associated with temporary tables don’t count toward the threshold that triggers automated backups. Thus, by using temporary tables for truly transient data, we can also avoid triggering backup processes and improve the rate at which the automated backup processes complete, just as we did with the BACKUP NO option.
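For example (the sales table here is illustrative), a temporary staging table used only within a single load session avoids both the replication and the backup overhead:

-- Session-scoped staging table: not replicated, not backed up, dropped automatically
-- at the end of the session.
CREATE TEMPORARY TABLE tmp_sales_load (LIKE sales);
-- COPY into tmp_sales_load, transform, then merge into the permanent table here.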

(more…)

Amazon Redshift Engineering’s Advanced Table Design Playbook: Compression Encodings

by Zach Christopherson

Part 1: Preamble, Prerequisites, and Prioritization
Part 2: Distribution Styles and Distribution Keys
Part 3: Compound and Interleaved Sort Keys
Part 4: Compression Encodings (Translated into Japanese)
Part 5: Table Data Durability


In part 4 of this blog series, I’ll be discussing when and when not to apply column encoding for compression, methods for determining ideal column encodings, and automatic compression behaviors in Amazon Redshift.

Data compression in database systems isn’t new. However, historically it was used to reduce data footprint, rather than boost performance, because of its expensive decompression overhead. In Amazon Redshift, using column encodings translates to both a data footprint reduction and a performance gain, because the cost of decompression is negligible relative to the reduction of disk I/O.

Within an Amazon Redshift table, each column can be specified with an encoding that is used to compress the values within each block. In general, compression should be used for almost every column in an Amazon Redshift cluster, but there are a few scenarios where it is better to leave columns unencoded.
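For reference, column encodings are declared in the table DDL, and ANALYZE COMPRESSION reports suggested encodings for an existing table. The table below is a hypothetical sketch; one commonly cited case for skipping compression is the leading column of a compound sort key:

-- Hypothetical fact table with explicit column encodings.
CREATE TABLE fact_sales (
  sale_date   DATE          ENCODE RAW,      -- leading sort key column left unencoded
  customer_id BIGINT        ENCODE LZO,
  region      VARCHAR(32)   ENCODE BYTEDICT, -- low-cardinality text
  amount      DECIMAL(12,2) ENCODE LZO
)
SORTKEY (sale_date);

-- Sample the table and report suggested encodings per column.
ANALYZE COMPRESSION fact_sales;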

The method I use for prioritizing compression-related optimizations is a bit different from the prioritization method we used in the preceding parts, largely because you might already have implemented automatic or manual compression for many of your important tables. The approach we use to determine which tables to review for compression is based on measuring the amount of uncompressed storage per table:

-- For each table, report how much of its storage is held in unencoded (raw)
-- columns, alongside its total size (both measured in 1 MB blocks).
SELECT 
  ti.schema||'.'||ti."table" tablename,
  uncompressed_size.size uncompressed_size,
  ti.size total_size
FROM svv_table_info ti
LEFT JOIN (
  SELECT tbl, COUNT(*) size 
  FROM stv_blocklist 
  WHERE (tbl,col) IN (
    SELECT attrelid, attnum-1 
    FROM pg_attribute 
    WHERE attencodingtype=0 AND attnum>0) 
  GROUP BY 1) uncompressed_size ON ti.table_id = uncompressed_size.tbl
ORDER BY 2 DESC NULLS LAST;

Using the results of the preceding query to prioritize efforts, we can work through this simple flowchart to determine whether or not to compress:

[Flowchart: deciding whether to compress a column (o_redshift_tables_4_1)]

(more…)

Amazon Redshift Engineering’s Advanced Table Design Playbook: Compound and Interleaved Sort Keys

by Zach Christopherson

 

Part 1: Preamble, Prerequisites, and Prioritization
Part 2: Distribution Styles and Distribution Keys
Part 3: Compound and Interleaved Sort Keys (Translated into Japanese)
Part 4: Compression Encodings
Part 5: Table Data Durability


In this installment, I’ll cover different sort key options, when to use sort keys, and how to identify the optimal sort key configuration for your tables. I’ll also present another methodology with which to work through your specific workload. This methodology offers concrete guidance on how to properly use sort keys for performance.

Defining a table with a sort key results in the physical ordering of data within each slice, based on the sort type and the columns chosen in the key definition. In Amazon Redshift, we allow for a table to be defined with compound sort keys, interleaved sort keys, or no sort keys. Each of these styles of sort key is useful for certain table access patterns. In practice, a compound sort key is most appropriate for over 90% of workloads.
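For reference, the sort key style is declared in the table DDL. The events table and its columns below are hypothetical:

-- Compound sort key: best when queries filter on a consistent column prefix.
CREATE TABLE events_compound (
  event_time  TIMESTAMP,
  customer_id BIGINT,
  event_type  VARCHAR(32)
)
COMPOUND SORTKEY (event_time, customer_id);

-- Interleaved sort key: weights each key column equally for varied filter patterns.
CREATE TABLE events_interleaved (
  event_time  TIMESTAMP,
  customer_id BIGINT,
  event_type  VARCHAR(32)
)
INTERLEAVED SORTKEY (event_time, customer_id);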

There are numerous benefits of ordering your data in Amazon Redshift:

  1. You can reduce disk I/O by improving zone map effectiveness.
  2. You can reduce compute overhead and I/O by avoiding sort steps or reducing their cost.
  3. You can improve join performance by enabling MERGE JOIN operations.

Starting at the most fundamental question, and diving deeper, we can construct a logical flowchart applicable to any table that can be used for identifying the ideal sort type and sort key columns for your workload.

Will queries against my tables benefit from a sort key?

Almost always the answer is yes. However, in a few edge cases sorting a table doesn’t result in a performance improvement and only adds minor overhead to data ingestion.

As discussed previously, you most often use sort keys to improve the effectiveness of the zone maps, which results in reduced block I/O for read operations. The next most frequent benefit provided by sort keys is removing, or reducing the cost of, sort steps required by SQL operations like ORDER BY, PARTITION BY, GROUP BY, and so on. Least common, but still important, is sorting to facilitate a MERGE JOIN operation. MERGE JOIN is the fastest of the three join operations supported by Amazon Redshift.
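As a sketch of the MERGE JOIN case (table and column names are hypothetical), a merge join generally becomes possible when both tables are distributed and sorted on the join column and are well vacuumed; EXPLAIN then shows a Merge Join step instead of a Hash Join:

CREATE TABLE customers (customer_id BIGINT, name VARCHAR(64))
  DISTKEY (customer_id) SORTKEY (customer_id);

CREATE TABLE orders (order_id BIGINT, customer_id BIGINT, amount DECIMAL(12,2))
  DISTKEY (customer_id) SORTKEY (customer_id);

-- With both tables sorted and co-located on customer_id, the planner can merge join.
EXPLAIN
SELECT c.name, SUM(o.amount)
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
GROUP BY c.name;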

If you have a table that is accessed in a pattern where none of these three optimizations benefits you, then you have one of the few scenarios where defining a sort key makes no difference. In these circumstances, you don’t need to specify a sort key.

Together, these steps can be reduced to “Does this table benefit from sorting?” Expanded, that question looks like the following:

[Flowchart: does this table benefit from sorting? (o_redshift_tables_3_1)]

(more…)

Amazon Redshift Engineering’s Advanced Table Design Playbook: Distribution Styles and Distribution Keys

by Zach Christopherson

 

Part 1: Preamble, Prerequisites, and Prioritization
Part 2: Distribution Styles and Distribution Keys (Translated into Japanese)
Part 3: Compound and Interleaved Sort Keys
Part 4: Compression Encodings
Part 5: Table Data Durability


The first table and column properties we discuss in this blog series are table distribution styles (DISTSTYLE) and distribution keys (DISTKEY). This blog installment presents a methodology to guide you through the identification of optimal DISTSTYLEs and DISTKEYs for your unique workload.

When you load data into a table, Amazon Redshift distributes the rows to each of the compute nodes according to the table’s DISTSTYLE. Within each compute node, the rows are assigned to a cluster slice. Depending on the node type, each compute node contains 2, 16, or 32 slices. You can think of a slice as a virtual compute node. During query execution, all slices process the rows assigned to them in parallel. The primary goal in selecting a table’s DISTSTYLE is to evenly distribute the data throughout the cluster for parallel processing.

When you execute a query, the query optimizer might redistribute or broadcast the intermediate tuples throughout the cluster to facilitate any join or aggregation operations. The secondary goal in selecting a table’s DISTSTYLE is to minimize the cost of data movement necessary for query processing. To minimize that cost, data should already be located where it needs to be before the query is executed.

A table might be defined with a DISTSTYLE of EVEN, KEY, or ALL. If you’re unfamiliar with these table properties, you can watch my presentation at the 2016 AWS Santa Clara Summit, where I discussed the basics of distribution starting at the 17-minute mark. I summarize them here, with a short DDL sketch after the list:

  • EVEN distributes rows round-robin across the slices.
  • KEY requires a single column to be defined as the DISTKEY. On ingest, Amazon Redshift hashes each DISTKEY column value and consistently routes rows with the same hash value to the same slice.
  • ALL stores a full copy of the table on the first slice of each node.
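Here is that sketch; the orders table and its columns are hypothetical:

-- The same hypothetical table declared with each distribution style.
CREATE TABLE orders_even (order_id BIGINT, customer_id BIGINT) DISTSTYLE EVEN;

CREATE TABLE orders_key  (order_id BIGINT, customer_id BIGINT)
  DISTSTYLE KEY DISTKEY (customer_id);

CREATE TABLE orders_all  (order_id BIGINT, customer_id BIGINT) DISTSTYLE ALL;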

Which style is most appropriate for your table is determined by several criteria. This post presents a two-phase flowchart that guides you through questions to ask of your data profile to arrive at the ideal DISTSTYLE and DISTKEY for your scenario.

Phase 1: Identifying Appropriate DISTKEY Columns

Phase 1 seeks to determine if KEY distribution is appropriate. To do so, first determine if the table contains any columns that would appropriately distribute the table data if they were specified as a DISTKEY. If we find that no columns are acceptable DISTKEY columns, then we can eliminate DISTSTYLE KEY as a potential DISTSTYLE option for this table.
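As a rough sketch of that check (the orders table and customer_id column are hypothetical), high cardinality and low skew toward any single value are the main things to look for; after a table is loaded, svv_table_info.skew_rows reports the resulting row skew across slices:

-- Cardinality of a candidate DISTKEY column.
SELECT COUNT(*) AS total_rows, COUNT(DISTINCT customer_id) AS distinct_values
FROM orders;

-- Row count held by the single most frequent value (a large share signals skew).
SELECT TOP 1 customer_id, COUNT(*) AS rows_for_value
FROM orders
GROUP BY customer_id
ORDER BY rows_for_value DESC;

-- After loading, review the actual distribution skew per table.
SELECT "table", diststyle, skew_rows FROM svv_table_info ORDER BY skew_rows DESC;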

[Flowchart: Phase 1, identifying appropriate DISTKEY columns (o_redshift_table_design_1)]

 

(more…)