Tag: Amazon Redshift


Build a Healthcare Data Warehouse Using Amazon EMR, Amazon Redshift, AWS Lambda, and OMOP

by Ryan Hood

In the healthcare field, data comes in all shapes and sizes. Despite efforts to standardize terminology, some concepts (e.g., blood glucose) are still often depicted in different ways. This post demonstrates how to convert an openly available dataset called MIMIC-III, which consists of de-identified medical data for about 40,000 patients, into an open source data model known as the Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM). It describes the architecture and steps for analyzing data across various disconnected sources of health datasets so you can start applying Big Data methods to health research.

Before designing and deploying a healthcare application on AWS, make sure that you read through the AWS HIPAA Compliance whitepaper. It covers the information necessary for processing and storing protected health information (PHI).

Note: If you arrived at this page looking for more info on the movie Mimic 3: Sentinel, you might not enjoy this post.

OMOP overview

The OMOP CDM helps standardize healthcare data and makes it easier to analyze outcomes at a large scale. The CDM is gaining a lot of traction in the health research community, which is deeply involved in developing and adopting a common data model. Community resources are available for converting datasets, and there are software tools to help unlock your data after it’s in the OMOP format. The great advantage of converting data sources into a standard data model like OMOP is that it allows for streamlined, comprehensive analytics and helps remove the variability associated with analyzing health records from different sources.

OMOP ETL with Apache Spark

Observational Health Data Sciences and Informatics (OHDSI) provides the OMOP CDM in a variety of formats, including Apache Impala, Oracle, PostgreSQL, and SQL Server. (See the OHDSI Common Data Model repo in GitHub.) In this scenario, the data is moved to AWS to take advantage of the unbounded scale of Amazon EMR and serverless technologies, and the variety of AWS services that can help make sense of the data in a cost-effective way—including Amazon Machine Learning, Amazon QuickSight, and Amazon Redshift.

This example demonstrates an architecture that can be used to run SQL-based extract, transform, load (ETL) jobs to map any data source to the OMOP CDM. It uses MIMIC ETL code provided by Md. Shamsuzzoha Bayzid. The code was modified to run in Amazon Redshift.

Getting access to the MIMIC-III data

Before you can retrieve the MIMIC-III data, you must request access on the PhysioNet website, which is hosted on Amazon S3 as part of the Amazon Web Services (AWS) Public Dataset Program. However, you don’t need access to the MIMIC-III data to follow along with this post.

Solution architecture and loading process

The following diagram shows the architecture that is used to convert the MIMIC-III dataset to the OMOP CDM.

(more…)

Manage Query Workloads with Query Monitoring Rules in Amazon Redshift

by Suresh Akena and Gaurav Saxena

Data warehousing workloads are known for high variability due to seasonality, potentially expensive exploratory queries, and the varying skill levels of SQL developers.

To obtain high performance in the face of highly variable workloads, Amazon Redshift workload management (WLM) enables you to flexibly manage priorities and resource usage. With WLM, short, fast-running queries don’t get stuck in queues behind long-running queries. Even so, a query can sometimes consume a disproportionate share of resources, penalizing other queries in the system. Such queries are commonly known as rogue or runaway queries.

While WLM provides a way to restrict memory use and to move queries to other queues by using a timeout, more granular control is often desirable. You can now use query monitoring rules to create resource usage rules for queries, monitor a query’s resource use, and then perform actions if a query violates a rule.

Workload management concurrency and query monitoring rules

In an Amazon Redshift environment, a single cluster supports a maximum of 500 simultaneous connections. Throughput is usually expressed as queries per hour to maximize performance, whereas row-oriented databases like MySQL scale by adding concurrent connections. In Amazon Redshift, workload management (WLM) maximizes throughput irrespective of concurrency. WLM has two main parts: queues and concurrency. Queues allow you to allocate memory at a user group or query group level. Concurrency, or memory slots, is how you further subdivide and allocate memory to individual queries.

For example, assume that you have one queue (100% memory allocation) with a concurrency of 10. This means that each query gets a maximum of 10% of the memory. If the majority of your queries need 20% memory, then these queries spill to disk, lowering throughput. However, if you lower the concurrency to 5, each query is assigned 20% memory, and the net result is higher throughput and overall faster response times for SQL clients. When moving from a row-oriented database to a columnar one, it is a common pitfall to assume that higher concurrency leads to better performance.
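
If only the occasional query needs more memory, an alternative to lowering concurrency for the whole queue is to let that session temporarily claim extra slots with the wlm_query_slot_count parameter. A minimal sketch; the table name is hypothetical:

-- Claim 2 of the queue's slots so the next query gets roughly double
-- the per-slot memory allocation for this session.
set wlm_query_slot_count to 2;

-- Run the memory-hungry query (hypothetical table).
select channel, sum(sales_amount)
from sales_history
group by channel;

-- Return to the default of one slot per query.
set wlm_query_slot_count to 1;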

Now that you understand concurrency, here are more details about query monitoring rules. You define a rule based on resource usage and a corresponding action to take if a query violates that rule. Twelve different resource usage metrics are available, such as a query’s use of CPU, query execution time, rows scanned, rows returned, nested loop join, and so on.

Each rule includes up to three conditions, or predicates, and one action. A predicate consists of a metric, a comparison condition (=, <, or >), and a value. If all of the predicates for any rule are met, that rule’s action is triggered. Possible rule actions are log, hop, and abort.

This allows you to catch a rogue or runaway query long before it causes severe problems. The rule triggers an action to free up the queue, which in turn improves throughput and responsiveness.

For example, for a queue that’s dedicated to short-running queries, you might create a rule that aborts queries running for more than 60 seconds. To track poorly designed queries, you might have another rule that logs queries containing nested loops. There are predefined rule templates in the Amazon Redshift console to get you started.

Scenarios

Use query monitoring rules to perform query-level actions ranging from simply logging the query to aborting it. All of the actions taken are logged in the STL_WLM_RULE_ACTION table; a sample query against that table appears after the following list.

  • The Log action logs the information and continues to monitor the query.
  • The Hop action terminates the query and restarts it in the next matching queue. If there is no other matching queue, the query is canceled.
  • The Abort action aborts rule-violating queries.
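
A quick way to review which rules have fired recently is to query STL_WLM_RULE_ACTION directly. A minimal sketch, using its documented columns:

-- Most recent query monitoring rule actions, newest first.
select recordtime, userid, query, service_class, rule, action
from stl_wlm_rule_action
order by recordtime desc
limit 50;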

The following three sample scenarios show how to use query monitoring rules. 

Scenario 1: How do you govern a suboptimal query in your ad hoc queue?

A runaway query that joins two large tables could return a billion or more rows. You can protect your ad hoc queue by creating a rule to abort any queries that return more than a billion rows. Logically, this would look like the following:

IF return_row_count > 1B rows then ABORT
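
Before settling on a threshold, it can help to see how many rows your workload's queries actually return. One way to check, assuming the SVL_QUERY_METRICS_SUMMARY system view is available in your cluster:

-- Queries that returned the most rows, to sanity-check the 1B-row threshold.
select query, return_row_count, query_execution_time
from svl_query_metrics_summary
order by return_row_count desc
limit 20;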

In the following screenshot, any query returning more than a billion rows in the BI_USER group is aborted.

(more…)

Amazon Redshift Monitoring Now Supports End User Queries and Canaries

by Ian Meyers

Ian Meyers is a Solutions Architecture Senior Manager with AWS

The serverless Amazon Redshift Monitoring utility lets you gather important performance metrics from your Redshift cluster’s system tables and persist the results in Amazon CloudWatch. This serverless solution uses AWS Lambda to schedule custom SQL queries and process the results. With this utility, you can use Amazon CloudWatch to monitor disk-based queries, WLM queue wait time, alerts, average query times, and other data. This allows you to create visualizations with CloudWatch dashboards, set alarms on specific values, and create rules to react to those alarms.

[Diagram: Amazon Redshift monitoring utility]

You can now create your own diagnostic queries and plug in “canaries” that monitor the runtime of your most vital end user queries. These user-defined metrics can be used to create dashboards and trigger alarms, and they should improve visibility into workloads running on a cluster. They might also facilitate sizing discussions.
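
For example, a canary might simply be a representative end user query whose elapsed time you want tracked as a custom CloudWatch metric; the mechanism for registering it is described in the utility's README. The query below is purely illustrative, with hypothetical table and column names:

-- Hypothetical canary: a typical dashboard query whose runtime is worth watching.
select d.region, sum(f.revenue) as total_revenue
from fact_orders f
join dim_store d on f.store_id = d.store_id
where f.order_date >= dateadd(day, -7, current_date)
group by d.region;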

View the README to get started.

(more…)

Run Mixed Workloads with Amazon Redshift Workload Management

by Suresh Akena

Mixed workloads run batch and interactive workloads (short-running and long-running queries or reports) concurrently to support business needs or demand. Typically, managing and configuring mixed workloads requires a thorough understanding of access patterns, how system resources are being used, and performance requirements.

It’s common for mixed workloads to have some processes that require higher priority than others. Sometimes, this means a certain job must complete within a given SLA. Other times, this means you only want to prevent a non-critical reporting workload from consuming too many cluster resources at any one time.

Without workload management (WLM), each query is prioritized equally, which can allow a person, team, or workload to consume excessive cluster resources for a process that isn’t as valuable as other, more business-critical jobs.

This post provides guidelines on common WLM patterns and shows how you can use WLM query insights to optimize configuration in production workloads.

Workload concepts

You can use WLM to define the separation of business concerns and to prioritize the different types of concurrently running queries in the system:

  • Interactive: Software that accepts input from humans as it runs. Interactive software includes most popular programs, such as BI tools or reporting applications.
    • Short-running, read-only user queries, such as a Tableau dashboard query, with low latency requirements.
    • Long-running, read-only user queries such as a complex structured report that aggregates the last 10 years of sales data.
  • Batch: Execution of a job series in a server program without manual intervention (non-interactive). The jobs run against a set, or “batch,” of inputs rather than a single input (which would instead be a custom job).
    • Batch queries include bulk INSERT, UPDATE, and DELETE transactions, for example, ETL or ELT programs.

Amazon Redshift Workload Management

Amazon Redshift is a fully managed, petabyte-scale, columnar, massively parallel data warehouse that offers scalability, security, and high performance. Amazon Redshift provides an industry-standard JDBC/ODBC driver interface, which allows customers to connect their existing business intelligence tools and reuse existing analytics queries.

Amazon Redshift is a good fit for any type of analytical data model: for example, star and snowflake schemas, or simple denormalized tables.

Managing workloads

Amazon Redshift Workload Management allows you to manage workloads of various sizes and complexity for specific environments. Parameter groups contain WLM configuration, which determines how many query queues are available for processing and how queries are routed to those queues. The default parameter group settings are not configurable. Create a custom parameter group to modify the settings in that group, and then associate it with your cluster. The following settings can be configured:

  • How many queries can run concurrently in each queue
  • How much memory is allocated among the queues
  • How queries are routed to queues, based on criteria such as the user who is running the query or a query label
  • Query timeout settings for a queue

When a user runs a query, WLM assigns the query to the first matching queue and applies rules based on the WLM configuration. For more information about WLM query queues, concurrency, user groups, query groups, timeout configuration, and queue hopping capability, see Defining Query Queues. For more information about the configuration properties that can be changed dynamically, see WLM Dynamic and Static Configuration Properties.
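
For example, a client can label its queries so that WLM routes them to a particular queue by query group. A minimal sketch; the 'report' group and table name are hypothetical, and the group must match a queue's query group in your WLM configuration:

-- Route this session's subsequent queries to the queue that matches
-- the 'report' query group.
set query_group to 'report';

select count(*) from daily_sales_summary;

-- Return to default routing for the rest of the session.
reset query_group;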

For example, the WLM configuration in the following screenshot has three queues to support ETL, BI, and other users. ETL jobs are assigned to the long-running queue and BI queries to the short-running queue. Other user queries are executed in the default queue.

[Screenshot: WLM configuration with three queues for ETL, BI, and default users]
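
Once the queues are in place, you can confirm where queries are actually landing by inspecting WLM state. A sketch, assuming the STV_WLM_QUERY_STATE system view:

-- Queries currently queued or executing, and the service class (queue)
-- each one was assigned to; times are reported in microseconds.
select query, service_class, slot_count, state, queue_time, exec_time
from stv_wlm_query_state
order by service_class;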

(more…)

Converging Data Silos to Amazon Redshift Using AWS DMS

by Pratim Das

Organizations often grow organically—and so does their data in individual silos. Such systems are often powered by traditional RDBMS systems and they grow orthogonally in size and features. To gain intelligence across heterogeneous data sources, you have to join the data sets. However, this imposes new challenges, as joining data over dblinks or into a single view is extremely cumbersome and an operational nightmare.

This post walks through using AWS Database Migration Service (AWS DMS) and other AWS services to make it easy to converge multiple heterogeneous data sources to Amazon Redshift. You can then use Amazon QuickSight to visualize the converged dataset and gain additional business insights.

AWS service overview

Here’s a brief overview of AWS services that help with data convergence.

AWS DMS

With DMS, you can migrate your data to and from most widely used commercial and open-source databases. The service supports homogenous migrations such as Oracle to Oracle, as well as heterogeneous migrations between different database platforms, such as Oracle to Amazon Aurora or Microsoft SQL Server to MySQL. It also allows you to stream data to Amazon Redshift from any of the supported sources including:

  • Amazon Aurora
  • PostgreSQL
  • MySQL
  • MariaDB
  • Oracle
  • SAP ASE
  • SQL Server

DMS enables consolidation and easy analysis of data in the petabyte-scale data warehouse. It can also be used for continuous data replication with high availability.

Amazon QuickSight

Amazon QuickSight provides very fast, easy-to-use, cloud-powered business intelligence at 1/10th the cost of traditional BI solutions. QuickSight uses a new, super-fast, parallel, in-memory calculation engine (“SPICE”) to perform advanced calculations and render visualizations rapidly.

QuickSight integrates automatically with AWS data services, enables organizations to scale to hundreds of thousands of users, and delivers fast and responsive query performance to them. You can easily connect QuickSight to AWS data services, including Amazon Redshift, Amazon RDS, Amazon Aurora, Amazon S3, and Amazon Athena. You can also upload CSV, TSV, and spreadsheet files or connect to third-party data sources such as Salesforce.

Amazon Redshift

Amazon Redshift delivers fast query performance by using columnar storage technology to improve I/O efficiency and parallelizing queries across multiple nodes. Amazon Redshift is typically priced at 1/10th of the price of the competition. We have many customers running petabyte scale data analytics on AWS using Amazon Redshift.

Amazon Redshift is also ANSI SQL compliant, supports JDBC/ODBC, and is easy to connect to your existing business intelligence (BI) solution. However, if your storage requirement is in the tens of terabytes and you need high levels of concurrency across small queries, you may want to consider Amazon Aurora as the target converged database.

Walkthrough

Assume that you have an events company specializing in sports, and that you have built a MySQL database that holds data for the players and the sporting events. Customer and ticket information is stored in another database; in this case, assume it is PostgreSQL, and it gets updated when customers purchase tickets from the website and mobile apps. You can download a sample dataset from the aws-database-migration-samples GitHub repo.

These databases could be anywhere: at an on-premises facility; on AWS in Amazon EC2 or Amazon RDS, or with another cloud provider; or in a mixture of such locations. To complicate things a little more, assume that the lost opportunities (where a customer added a ticket to the shopping cart but didn’t complete the purchase) are streamed via clickstream through Amazon Kinesis and then stored on Amazon S3. You then use AWS Data Pipeline to orchestrate a process that cleanses that data using Amazon EMR and makes it ready for loading to Amazon Redshift. The clickstream integration is not covered in this post but was demonstrated in the recent Real-time Clickstream Anomaly Detection with Amazon Kinesis Analytics post.

Architecture

In this solution, you use DMS to bring the two data sources into Amazon Redshift and run analytics to gain business insights. The following diagram demonstrates the proposed solution.

[Architecture diagram: converging data silos to Amazon Redshift using AWS DMS]
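
Once DMS has landed both sources in Amazon Redshift, a single query can span what used to be separate silos. A sketch of the kind of analysis this enables, with entirely hypothetical schema and table names:

-- Ticket sales joined with sporting event details, which previously lived
-- in two different source databases (names are hypothetical).
select e.sport_type, e.event_date, count(t.ticket_id) as tickets_sold
from tickets.ticket_purchase t
join sports.sporting_event e on t.event_id = e.event_id
group by e.sport_type, e.event_date
order by tickets_sold desc;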

(more…)

Decreasing Game Churn: How Upopa used ironSource Atom and Amazon ML to Engage Users

by Tom Talpir

This is a guest post by Tom Talpir, Software Developer at ironSource. ironSource is an Advanced AWS Partner Network (APN) Technology Partner and an AWS Big Data Competency Partner.

Ever wondered what it takes to keep a user from leaving your game or application after all the hard work you put in? Wouldn’t it be great to get a chance to interact with the users before they’re about to leave?

Finding these users can be difficult, mainly because most churn happens within the first few minutes or hours of a user’s gameplay. However, machine learning (ML) can make this possible by providing insights to help developers identify these users and engage with them to decrease the churn rate.

Upopa is a gaming studio that creates cool games (that you should definitely check out), and they were a great fit for our new project: using Amazon Machine Learning (Amazon ML) to give game developers the ability to predict the future actions of their players and ultimately reduce churn, without having to learn complex ML algorithms.


Upopa sends all their data to Amazon Redshift, using ironSource Atom, a data flow management solution that allows developers to send data from their application into many different types of data targets (including Amazon Redshift, Amazon S3, Amazon Elasticsearch Service, and other relational databases) with great ease.

Amazon ML turned out to be the right solution for Upopa because it integrates easily with Amazon Redshift and makes everything much easier with visualization tools and wizards that guide you through the process of creating ML models.

(more…)

Powering Amazon Redshift Analytics with Apache Spark and Amazon Machine Learning

by Radhika Ravirala and Wangechi Doble

Air travel can be stressful due to the many factors that are simply out of airline passengers’ control. As passengers, we want to minimize this stress as much as we can. We can do this by using past data to make predictions about how likely a flight is to be delayed, based on factors such as the time of day or the airline carrier.

In this post, we generate a predictive model for flight delays that can be used to help us pick the flight least likely to add to our travel stress. To accomplish this, we will use Apache Spark running on Amazon EMR for extracting, transforming, and loading (ETL) the data, Amazon Redshift for analysis, and Amazon Machine Learning for creating predictive models. This solution gives a good example of combining multiple AWS services to build a sophisticated analytical application in the AWS Cloud.

Architecture

At a high level, our solution includes the following steps:

Step 1 is to ingest the datasets: the weather data is loaded into a Hive table on Amazon EMR, and the flight data is loaded into Amazon Redshift.

Step 2 is to enrich the data by using ETL:

  • We transform the maximum and minimum temperature columns from Celsius to Fahrenheit in the weather table in Hive by using a user-defined function in Spark.
  • We enrich the flight data in Amazon Redshift to compute and include extra features and columns (departure hour, days to the nearest holiday) that help the Amazon Machine Learning algorithm’s learning process (see the sketch after this list).
  • We then combine both datasets in the Spark environment by using the spark-redshift package to load data from the Amazon Redshift cluster to Spark running on an Amazon EMR cluster. We write the enriched data back to an Amazon Redshift table using the spark-redshift package.
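
As a sketch of the Amazon Redshift enrichment step, the departure-hour feature could be derived with a simple CTAS; the table and column names here are hypothetical:

-- Derive a departure-hour feature for the ML model to learn from
-- (hypothetical source table and columns).
create table flights_enriched as
select f.*,
       extract(hour from f.dep_timestamp) as dep_hour
from flights f;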

Step 3 is to perform predictive analytics:

  • In this last step, we use Amazon Machine Learning to create and train an ML model using Amazon Redshift as our data source. The trained Amazon ML model is used to generate predictions for the test dataset, which are output to an S3 bucket.

[Architecture diagram: Apache Spark on Amazon EMR, Amazon Redshift, and Amazon ML]

(more…)

Building an Event-Based Analytics Pipeline for Amazon Game Studios’ Breakaway

by Brent Nash

All software developers strive to build products that are functional, robust, and bug-free, but video game developers have an extra challenge: they must also create a product that entertains. When designing a game, developers must consider how the various elements—such as characters, story, environment, and mechanics—will fit together and, more importantly, how players will interact with those elements.

It’s not enough to just assume that those interactions occur as intended—is a particular level too difficult? Are the controls responsive enough? Is the combat balanced? While in-house and focus testing can help answer those questions during development, nothing provides a better answer than actual data from real-world players.


We’re currently developing Amazon Game Studios’ new title Breakaway, an online 4v4 team battle sport that delivers fast action, teamwork, and competition. We’re releasing Breakaway in a live alpha state on December 15, 2016, so that we can iterate and improve the game by analyzing massive amounts of gameplay data from our player community.

(more…)

Amazon Redshift Engineering’s Advanced Table Design Playbook: Table Data Durability

by Zach Christopherson

Part 1: Preamble, Prerequisites, and Prioritization
Part 2: Distribution Styles and Distribution Keys
Part 3: Compound and Interleaved Sort Keys
Part 4: Compression Encodings
Part 5: Table Data Durability (Translated into Japanese)


In the fifth and final installment of the Advanced Table Design Playbook, I’ll discuss how to use two simple table durability properties to squeeze even more performance out of your Amazon Redshift configuration.

Limiting Automated Backups

With automated backups, Amazon Redshift frequently pushes incremental block changes to Amazon S3. These automated backups are triggered based on a threshold of time or blocks changed. This approach makes possible recovery features such as snapshot and table-level restore operations.

The actual cost of the automated backup is generally insignificant. However, because it transfers data blocks to S3, it can potentially affect your cluster’s throughput to S3 for workloads that constantly perform large COPY or UNLOAD operations. To reduce the cost of these automated backups, we can use the table property BACKUP NO for tables that don’t contain data that needs to be stored durably in S3. Tables where you can use this property include the following:

  1. Tables that are populated with easily recreated data
  2. Tables containing data that is transient or often changing in nature
  3. Tables that are populated with unimportant or experimental data

We use this property to achieve two things:

  1. Preventing inconsequential block changes from triggering frequent automated backups
  2. Reducing the quantity of blocks transferred to S3 during backups, which shortens backup duration
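
As a sketch, a staging table whose contents can easily be reloaded might be declared like this (table and column names are hypothetical):

-- Easily recreated staging data, so exclude it from automated backups.
create table staging_clickstream (
  event_id   bigint,
  user_id    bigint,
  event_time timestamp,
  page_url   varchar(2048)
)
backup no
distkey (user_id)
sortkey (event_time);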

Note: Tables defined as temporary inherit this property by default, because these tables are cleaned up at the end of the session in which they were created.

Reducing Synchronous Data Replication

By default, multinode Amazon Redshift configurations automatically store redundant copies of table data on other nodes in the same cluster, so we can tolerate multiple disk and node failures. As with all synchronous data replication, some overhead is associated with the process. For temporary tables, this replication process doesn’t occur because the system makes the assumption that the data is truly transient. Additionally, blocks associated with temporary tables don’t count toward the threshold which triggers automated backups. Thus, by using temporary tables for truly transient data we can also avoid triggering backup processes and improve the rate at which the automated backup processes complete—just as we did with the BACKUP NO option.
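
For example, a scratch table that only needs to live for the duration of a session can be created as a temporary table; the source table here is hypothetical:

-- Transient working data: a temporary table skips cross-node replication
-- and its blocks don't count toward the automated backup threshold.
create temp table session_scratch as
select user_id, max(event_time) as last_seen
from clickstream_events
group by user_id;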

(more…)

Amazon Redshift Engineering’s Advanced Table Design Playbook: Compression Encodings

by Zach Christopherson

Part 1: Preamble, Prerequisites, and Prioritization
Part 2: Distribution Styles and Distribution Keys
Part 3: Compound and Interleaved Sort Keys
Part 4: Compression Encodings (Translated into Japanese)
Part 5: Table Data Durability


In part 4 of this blog series, I’ll be discussing when and when not to apply column encoding for compression, methods for determining ideal column encodings, and automatic compression behaviors in Amazon Redshift.

Data compression in database systems isn’t new. However, historically it was used to reduce data footprint, rather than boost performance, because of its expensive decompression overhead. In Amazon Redshift, using column encodings translates to both a data footprint reduction and a performance gain, because the cost of decompression is negligible relative to the reduction of disk I/O.

Within an Amazon Redshift table, each column can be specified with an encoding that is used to compress the values within each block. In general, compression should be used for almost every column within an Amazon Redshift cluster, but there are a few scenarios where it is better to avoid encoding columns.
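
For reference, encodings are declared per column at table creation time. A hedged sketch with hypothetical names; the first sort key column is commonly left unencoded (RAW), one of the scenarios where skipping compression makes sense:

-- Hypothetical table with explicit per-column encodings.
create table page_views (
  view_time timestamp     encode raw,      -- leading sort key column left unencoded
  user_id   bigint        encode lzo,
  device    varchar(16)   encode bytedict, -- low-cardinality text
  page_url  varchar(2048) encode lzo
)
distkey (user_id)
sortkey (view_time);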

The method I use for prioritizing compression-related optimizations is a bit different from the prioritization method we used in the preceding parts, largely because you might already have implemented automatic or manual compression for many of your important tables. The approach we use to determine which tables to review for compression is based on measuring the size of uncompressed storage per table:

SELECT
  ti.schema||'.'||ti."table" tablename,
  uncompressed_size.size uncompressed_size,
  ti.size total_size
FROM svv_table_info ti
LEFT JOIN (
  -- Count the 1 MB blocks that belong to columns stored without an
  -- encoding (attencodingtype = 0 means raw/none).
  SELECT tbl,COUNT(*) size
  FROM stv_blocklist
  WHERE (tbl,col) IN (
    SELECT attrelid, attnum-1
    FROM pg_attribute
    WHERE attencodingtype=0 AND attnum>0)
  GROUP BY 1) uncompressed_size ON ti.table_id = uncompressed_size.tbl
ORDER BY 2 DESC NULLS LAST;

Using the results of the preceding query to prioritize efforts, we can work through this simple flowchart to determine whether or not to compress:

[Flowchart: deciding whether to apply compression]
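
For a table that surfaces near the top of the earlier query's results and that the flowchart says should be compressed, one low-effort next step is to ask Amazon Redshift to recommend encodings from a sample of the data. The table name below is hypothetical:

-- Recommend a compression encoding for each column, based on a row sample.
analyze compression page_views;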

(more…)