AWS Big Data Blog
You have come up with an exciting hypothesis, and now you are keen to find and analyze as much data as possible to prove (or refute) it. There are many datasets that might be applicable, but they have been created at different times by different people and don’t conform to any common standard. They use different names for variables that mean the same thing and the same names for variables that mean different things. They use different units of measurement and different categories. Some have more variables than others. And they all have data quality issues (for example, badly formed dates and times, invalid geographic coordinates, and so on).
You first need a way to harmonize these datasets, to identify the variables that mean the same thing and make sure that these variables have the same names and units. You also need to clean up or remove records with invalid data.
After the datasets are harmonized, you need to search through the data to find the datasets you’re interested in. Not all of them have records that are relevant to your hypothesis, so you want to filter on a number of important variables to narrow down the datasets and verify they contain enough matching records to be significant.
Having identified the datasets of interest, you are ready to run your custom analyses on the data they contain so that you can prove your hypothesis and create beautiful visualizations to share with the world!
In this blog post, we will describe a sample application that illustrates how to solve these problems. You can install our sample app, which will:
- Harmonize and index three disparate datasets to make them searchable.
- Present a data-driven, customizable UI for searching the datasets to do preliminary analysis and to locate relevant datasets.
- Integrate with Amazon Athena and Amazon QuickSight for custom analysis and visualization.
The Police Data Initiative seeks to improve community and law enforcement relations through the public availability of data related to police activity. Datasets from participating cities, available through the Public Safety Open Data Portal, have many of the problems just outlined. Despite the commonality of crime and location metadata, there is no standard naming or value scheme. Datasets are stored in various locations and in various formats. There is no central search and discovery engine. To gain insights and value from this data, you have to analyze datasets city by city.
Although the focus of this post is police incident data, the same approach can be used for datasets in other domains, such as IoT, personalized medicine, news, weather, finance, and much more.
Our architecture uses the following AWS services:
- Amazon EMR (with Apache Spark and Jupyter notebooks) to explore, clean, harmonize (transform), describe, and save multiple, loosely coupled datasets.
- Amazon S3 to store both raw and harmonized datasets.
- Amazon Elasticsearch Service (Amazon ES) to host secure, searchable indexes of selected dataset variables and the associated dictionary/metadata used to power the search web page.
- Amazon EC2 Container Service (ECS) to host a web-based search UI.
- Amazon Athena and Amazon QuickSight to provide analysis and reporting.
- AWS CodeBuild and AWS CodePipeline to build and deliver the search UI application on ECS.
- AWS Identity and Access Management (IAM) policies and instance roles allow least-privilege access to Amazon ES from the UI containers and from the EMR cluster.
- AWS CloudFormation to orchestrate the provisioning of the environment.
The diagram below illustrates the solution architecture:
Jose Kunnackal is a Senior Product Manager for Amazon Quicksight
This blog post has been translated into Japanese.
In November 2016, we launched Amazon QuickSight, a cloud-powered, business analytics service that lets you quickly and easily visualize your data. QuickSight uses SPICE (Super-fast, Parallel, In-Memory Calculation Engine), a fully managed data store that enables blazing fast visualizations and can ingest data from AWS, on-premises, and cloud sources. Data in SPICE could be refreshed at any time with the click of a button within QuickSight.
Today, we are launching the ability to schedule these refreshes!
To schedule a SPICE data set refresh
Select a SPICE data set and choose Schedule refresh. Then, specify the time zone, frequency, and start time of the refresh.
Most systems use Java Script Object Notation (JSON) to log event information. Although it’s efficient and flexible, deriving information from JSON is difficult.
In this post, you will use the tightly coupled integration of Amazon Kinesis Firehose for log delivery, Amazon S3 for log storage, and Amazon Athena with JSONSerDe to run SQL queries against these logs without the need for data transformation or insertion into a database. It’s done in a completely serverless way. There’s no need to provision any compute.
Amazon SES provides highly detailed logs for every message that travels through the service and, with SES event publishing, makes them available through Firehose. However, parsing detailed logs for trends or compliance data would require a significant investment in infrastructure and development time. Athena is a boon to these data seekers because it can query this dataset at rest, in its native format, with zero code or architecture. On top of that, it uses largely native SQL queries and syntax.
Walkthrough: Establishing a dataset
We start with a dataset of an SES send event that looks like this:
This dataset contains a lot of valuable information about this SES interaction. There are thousands of datasets in the same format to parse for insights. Getting this data is straightforward.
1. Create a configuration set in the SES console or CLI that uses a Firehose delivery stream to send and store logs in S3 in near real-time.
The AWS Big Data team is coming to HIMSS, the industry-leading conference for professionals in the field of healthcare technology. The conference brings together more than 40,000 health IT professionals, clinicians, administrators, and vendors to talk about the latest innovations in health technology. Because transitioning healthcare to the cloud is at the forefront of this year’s conversations, for the first time, HIMSS is hosting a conference pre-day on February 19 that is focused on the use of cloud in healthcare.
This year’s conference will be held at the Orange County Convention Center in Orlando, Florida from February 20 – 23. You can visit us at booth 6969 to learn about how AWS healthcare customers like Cambia and Cleveland Clinic are leveraging cloud-based analytics to support healthcare’s digital transformation. The booth will be staffed by AWS certified solution architects who can answer questions about transitioning existing health applications into the cloud or creating new big data solutions to meet the evolving needs of healthcare.
If you’re interested in understanding how your health data skills fit in at AWS, there will be recruiters and hiring manages onsite to discuss AWS career opportunities. Just send e-mail to email@example.com to set up an informal chat.
Thousands of healthcare customers are using AWS to change the way they deliver care, engage with patients, or incorporate new technology into their organization by using HIPAA-eligible big data services such as:
- Amazon EMR, a managed Hadoop framework.
- Amazon DynamoDB, a fast and flexible NoSQL database service.
- Amazon Aurora [MySQL-compatible edition only], a relational database engine that combines the speed and availability of high-end commercial databases with the simplicity and cost-effectiveness of open source databases.
- Amazon Redshift, a fast, simple, cost-effective data warehouse.
- Amazon S3, a durable, massively scalable object store.
Check out some past AWS Big Data Blog posts to see how these technologies are being used to improve healthcare:
- Readmission Prediction Through Patient Risk Stratification Using Amazon Machine Learning
- Introduction to Python UDFs in Amazon Redshift (CMS Open Payment Data)
- Create a Healthcare Data Hub with AWS and Mirth Connect
- How Eliza Corporation Moved Healthcare Data to the Cloud
- Will Spark Power the Data behind Precision Medicine?
- Interactive Analysis of Genomic Datasets Using Amazon Athena
For more information about how healthcare customers are using AWS, visit aws.amazon.com/health.
For customers who use Hive external tables on Amazon EMR, or any flavor of Hadoop, a key challenge is how to effectively migrate an existing Hive metastore to Amazon Athena, an interactive query service that directly analyzes data stored in Amazon S3. With Athena, there are no clusters to manage and tune, and no infrastructure to set up or manage. Customers pay only for the queries they run.
In this post, I discuss an approach to migrate an existing Hive metastore to Athena, as well as how to use the Athena JDBC driver to run scripts. I demonstrate two scripts.
- The first script exports external tables from a Hive metastore on EMR, or other Hadoop flavors, as a Hive script. This script handles both Hive metastores local to the cluster or metastores stored in an external database.
- The second script executes the Hive script in Athena over JDBC to import the external tables into the Athena catalog.
Both scripts are available in the aws-blog-athena-importing-hive-metastores GitHub repo.
You must have the following resources available:
- A working Python 2.7+ environment. (required for the first script)
- A working Java 1.8 runtime environment
- Groovy, if not already installed
- The Java classpath set to point to the Athena JDBC driver JAR file location
In EMR, you can use the following commands to complete the prerequisites (Python comes already installed):
Exporting external tables from a Hive metastore
The Python script exportdatabase.py exports external tables only from the Hive metastore, and saves them to a local file as a Hive script.
Here’s the sample output:
Athena does not support every data type and SerDe supported by Hive. Edit or replace contents in the generated Hive script as needed to ensure compatibility. For more information about supported datatypes and SerDes, see the Amazon Athena documentation.
Applications log a large amount of data that—when analyzed in real time—provides significant insight into your applications. Real-time log analysis can be used to ensure security compliance, troubleshoot operation events, identify application usage patterns, and much more.
Ingesting and analyzing this data in real time can be accomplished by using a variety of open source tools on Amazon EC2. Alternatively, you can use a set of simple, managed AWS services to perform serverless log analytics. The Amazon Kinesis platform includes the following managed services:
- Amazon Kinesis Streams streams data on AWS, which allows you to collect, store, and process TBs per hour at a low cost.
- Amazon Kinesis Firehose loads streaming data in to Amazon Kinesis Analytics, Amazon S3, Amazon Redshift, or Amazon Elasticsearch Service.
- Amazon Kinesis Analytics helps you analyze streaming data by writing SQL queries and in turn overcoming the management and monitoring of streaming logs in near real time. Analytics allows you to reference metadata stored in S3 in SQL queries for real-time analytics.
In this post, I show you how to implement a solution that analyzes streaming Apache access log data from an EC2 instance aggregated over 5 minutes. The solution helps you understand where requests to your applications are coming from. If the source is an unknown application or if a particular source application is trying to clog your application, you can contact the application owner.
Some challenges that this solution entails:
- You do not want to maintain (patch/upgrade) the log application or servers to do log analytics. You also want your log analytics to scale on demand by default, and so all components are managed services.
- Apache Logs logs the host IP address or host name. However, that information isn’t useful in the cloud where servers are fungible and hosts change constantly either to scale or heal automatically. So you maintain a flat file list of servers in an S3 bucket that can be updated by Auto Scaling policies and mapped to streaming log data.
The following diagram shows how this solution works.
- Application nodes run Apache applications and write Apache logs locally to disk. The Amazon Kinesis agent on the EC2 instance ingests the log stream in to the Amazon Kinesis stream.
- The log input stream from various application nodes is ingested in to the Amazon Kinesis stream.
- Machine metadata about the machine or application is stored in flat files in an S3 bucket. It is a mapping of host IP addresses with the application name and contact.
- The Analytics application processes streaming logs over tumbling windows by adding referenced machine metadata from S3.
- The output stream, which is the result of the aggregated responses from the Analytics application, is written into the Amazon Kinesis stream.
- The Lambda function consumes the aggregated response from the destination stream, processes it, and publishes it to Amazon CloudWatch. It is event driven: as soon as new records are pushed to the destination stream, they are processed in batches of 200 records.
- The CloudWatch dashboard is used to view response trends.
- Alarms on aggregated data are generated when specified thresholds are reached.
Another month of big data solutions on the Big Data Blog!
Take a look at our summaries below and learn, comment, and share. Thank you for reading!
Decreasing Game Churn: How Upopa used ironSource Atom and Amazon ML to Engage Users
Ever wondered what it takes to keep a user from leaving your game or application after all the hard work you put in? Wouldn’t it be great to get a chance to interact with the users before they’re about to leave? In this post, learn how ironSource worked with gaming studio Upopa to build an efficient, cheap, and accurate way to battle churn and make data-driven decisions using ironSource Atom’s data pipeline and Amazon ML.
Create a Healthcare Data Hub with AWS and Mirth Connect
Healthcare providers record patient information across different software platforms. Each of these platforms can have varying implementations of complex healthcare data standards. Also, each system needs to communicate with a central repository called a health information exchange (HIE) to build a central, complete clinical record for each patient. In this post, learn how to consume different data types as messages, transform the information within the messages, and then use AWS services to take action depending on the message type.
Call for Papers! DEEM: 1st Workshop on Data Management for End-to-End Machine Learning
Amazon and Matroid will hold the first workshop on Data Management for End-to-End Machine Learning (DEEM) on May 14th, 2017 in conjunction with the premier systems conference SIGMOD/PODS 2017 in Raleigh, North Carolina. DEEM brings together researchers and practitioners at the intersection of applied machine learning, data management, and systems research to discuss data management issues in ML application scenarios. The workshop is soliciting research papers that describe preliminary and ongoing research results.
Converging Data Silos to Amazon Redshift Using AWS DMS
In this post, learn to use AWS Database Migration Service (AWS DMS) and other AWS services to easily converge multiple heterogonous data sources to Amazon Redshift. You can then use Amazon QuickSight, to visualize the converged dataset to gain additional business insights.
Run Mixed Workloads with Amazon Redshift Workload Management
It’s common for mixed workloads to have some processes that require higher priority than others. Sometimes, this means a certain job must complete within a given SLA. Other times, this means you only want to prevent a non-critical reporting workload from consuming too many cluster resources at any one time. Without workload management (WLM), each query is prioritized equally, which can cause a person, team, or workload to consume excessive cluster resources for a process which isn’t as valuable as other more business-critical jobs. This post provides guidelines on common WLM patterns and shows how you can use WLM query insights to optimize configuration in production workloads.
Secure Amazon EMR with Encryption
In this post, learn how to set up encryption of data at multiple levels using security configurations with EMR. You’ll walk through the step-by-step process to achieve all the encryption prerequisites, such as building the KMS keys, building SSL certificates, and launching the EMR cluster with a strong security configuration.
In the last few years, there has been a rapid rise in enterprises adopting the Apache Hadoop ecosystem for critical workloads that process sensitive or highly confidential data. Due to the highly critical nature of the workloads, the enterprises implement certain organization/industry wide policies and certain regulatory or compliance policies. Such policy requirements are designed to protect sensitive data from unauthorized access.
A common requirement within such policies is about encrypting data at-rest and in-flight. Amazon EMR uses “security configurations” to make it easy to specify the encryption keys and certificates, ranging from AWS Key Management Service to supplying your own custom encryption materials provider.
You create a security configuration that specifies encryption settings and then use the configuration when you create a cluster. This makes it easy to build the security configuration one time and use it for any number of clusters.
In this post, I go through the process of setting up the encryption of data at multiple levels using security configurations with EMR. Before I dive deep into encryption, here are the different phases where data needs to be encrypted.
Mixed workloads run batch and interactive workloads (short-running and long-running queries or reports) concurrently to support business needs or demand. Typically, managing and configuring mixed workloads requires a thorough understanding of access patterns, how the system resources are being used and performance requirements.
It’s common for mixed workloads to have some processes that require higher priority than others. Sometimes, this means a certain job must complete within a given SLA. Other times, this means you only want to prevent a non-critical reporting workload from consuming too many cluster resources at any one time.
Without workload management (WLM), each query is prioritized equally, which can cause a person, team, or workload to consume excessive cluster resources for a process which isn’t as valuable as other more business-critical jobs.
This post provides guidelines on common WLM patterns and shows how you can use WLM query insights to optimize configuration in production workloads.
You can use WLM to define the separation of business concerns and to prioritize the different types of concurrently running queries in the system:
- Interactive: Software that accepts input from humans as it runs. Interactive software includes most popular programs, such as BI tools or reporting applications.
- Short-running, read-only user queries such as Tableau dashboard query with low latency requirements.
- Long-running, read-only user queries such as a complex structured report that aggregates the last 10 years of sales data.
- Batch: Execution of a job series in a server program without manual intervention (non-interactive). The execution of a series of programs, on a set or “batch” of inputs, rather than a single input, would instead be a custom job.
- Batch queries includes bulk INSERT, UPDATE, and DELETE transactions, for example, ETL or ELT programs.
Amazon Redshift Workload Management
Amazon Redshift is a fully managed, petabyte scale, columnar, massively parallel data warehouse that offers scalability, security and high performance. Amazon Redshift provides an industry standard JDBC/ODBC driver interface, which allows customers to connect their existing business intelligence tools and re-use existing analytics queries.
Amazon Redshift is a good fit for any type of analytical data model, for example, star and snowflake schemas, or simple de-normalized tables.
Amazon Redshift Workload Management allows you to manage workloads of various sizes and complexity for specific environments. Parameter groups contain WLM configuration, which determines how many query queues are available for processing and how queries are routed to those queues. The default parameter group settings are not configurable. Create a custom parameter group to modify the settings in that group, and then associate it with your cluster. The following settings can be configured:
- How many queries can run concurrently in each queue
- How much memory is allocated among the queues
- How queries are routed to queues, based on criteria such as the user who is running the query or a query label
- Query timeout settings for a queue
When the user runs a query, WLM assigns the query to the first matching queue and executes rules based on the WLM configuration. For more information about WLM query queues, concurrency, user groups, query groups, timeout configuration, and queue hopping capability, see Defining Query Queues. For more information about the configuration properties that can be changed dynamically, see WLM Dynamic and Static Configuration Properties.
For example, the WLM configuration in the following screenshot has three queues to support ETL, BI, and other users. ETL jobs are assigned to the long-running queue and BI queries to the short-running queue. Other user queries are executed in the default queue.
Organizations often grow organically—and so does their data in individual silos. Such systems are often powered by traditional RDBMS systems and they grow orthogonally in size and features. To gain intelligence across heterogeneous data sources, you have to join the data sets. However, this imposes new challenges, as joining data over dblinks or into a single view is extremely cumbersome and an operational nightmare.
This post walks through using AWS Database Migration Service (AWS DMS) and other AWS services to make it easy to converge multiple heterogonous data sources to Amazon Redshift. You can then use Amazon QuickSight, to visualize the converged dataset to gain additional business insights.
AWS service overview
Here’s a brief overview of AWS services that help with data convergence.
With DMS, you can migrate your data to and from most widely used commercial and open-source databases. The service supports homogenous migrations such as Oracle to Oracle, as well as heterogeneous migrations between different database platforms, such as Oracle to Amazon Aurora or Microsoft SQL Server to MySQL. It also allows you to stream data to Amazon Redshift from any of the supported sources including:
- Amazon Aurora
- SAP ASE
- SQL Server
DMS enables consolidation and easy analysis of data in the petabyte-scale data warehouse. It can also be used for continuous data replication with high availability.
Amazon QuickSight provides very fast, easy-to-use, cloud-powered business intelligence at 1/10th the cost of traditional BI solutions. QuickSight uses a new, super-fast, parallel, in-memory calculation engine (“SPICE”) to perform advanced calculations and render visualizations rapidly.
QuickSight integrates automatically with AWS data services, enables organizations to scale to hundreds of thousands of users, and delivers fast and responsive query performance to them. You can easily connect QuickSight to AWS data services, including Amazon Redshift, Amazon RDS, Amazon Aurora, Amazon S3, and Amazon Athena. You can also upload CSV, TSV, and spreadsheet files or connect to third-party data sources such as Salesforce.
Amazon Redshift delivers fast query performance by using columnar storage technology to improve I/O efficiency and parallelizing queries across multiple nodes. Amazon Redshift is typically priced at 1/10th of the price of the competition. We have many customers running petabyte scale data analytics on AWS using Amazon Redshift.
Amazon Redshift is also ANSI SQL compliant, supports JDBC/ODBC, and is easy to connect to your existing business intelligence (BI) solution. However, if your storage requirement is in the 10s of TB range and requires high levels of concurrency across small queries, you may want to consider Amazon Aurora as the target converged database.
Assume that you have an events company specializing on sports, and have built a MySQL database that holds data for the players and the sporting events. Customers and ticket information is stored in another database; in this case, assume it is PostgresSQL and this gets updated when customer purchases tickets from our website and mobile apps. You can download a sample dataset from the aws-database-migration-samples GitHub repo.
These databases could be anywhere: at an on-premises facility; on AWS in Amazon EC2 or Amazon RDS, or other cloud provider; or in a mixture of such locations. To complicate things a little more, you can assume that the lost opportunities (where a customer didn’t complete buying the ticket even though it was added to the shopping cart) are streamed via clickstream through Amazon Kinesis and then stored on Amazon S3. We then use AWS Data Pipeline to orchestrate a process to cleanse that data using Amazon EMR and make it ready for loading to Amazon Redshift. The clickstream integration is not covered in this post but was demonstrated in the recent Real-time Clickstream Anomaly Detection with Amazon Kinesis Analytics post.
In this solution, you use DMS to bring the two data sources into Amazon Redshift and run analytics to gain business insights. The following diagram demonstrates the proposed solution.