AWS Big Data Blog
Most systems use Java Script Object Notation (JSON) to log event information. Although it’s efficient and flexible, deriving information from JSON is difficult.
In this post, you will use the tightly coupled integration of Amazon Kinesis Firehose for log delivery, Amazon S3 for log storage, and Amazon Athena with JSONSerDe to run SQL queries against these logs without the need for data transformation or insertion into a database. It’s done in a completely serverless way. There’s no need to provision any compute.
Amazon SES provides highly detailed logs for every message that travels through the service and, with SES event publishing, makes them available through Firehose. However, parsing detailed logs for trends or compliance data would require a significant investment in infrastructure and development time. Athena is a boon to these data seekers because it can query this dataset at rest, in its native format, with zero code or architecture. On top of that, it uses largely native SQL queries and syntax.
Walkthrough: Establishing a dataset
We start with a dataset of an SES send event that looks like this:
This dataset contains a lot of valuable information about this SES interaction. There are thousands of datasets in the same format to parse for insights. Getting this data is straightforward.
1. Create a configuration set in the SES console or CLI that uses a Firehose delivery stream to send and store logs in S3 in near real-time.
The AWS Big Data team is coming to HIMSS, the industry-leading conference for professionals in the field of healthcare technology. The conference brings together more than 40,000 health IT professionals, clinicians, administrators, and vendors to talk about the latest innovations in health technology. Because transitioning healthcare to the cloud is at the forefront of this year’s conversations, for the first time, HIMSS is hosting a conference pre-day on February 19 that is focused on the use of cloud in healthcare.
This year’s conference will be held at the Orange County Convention Center in Orlando, Florida from February 20 – 23. You can visit us at booth 6969 to learn about how AWS healthcare customers like Cambia and Cleveland Clinic are leveraging cloud-based analytics to support healthcare’s digital transformation. The booth will be staffed by AWS certified solution architects who can answer questions about transitioning existing health applications into the cloud or creating new big data solutions to meet the evolving needs of healthcare.
If you’re interested in understanding how your health data skills fit in at AWS, there will be recruiters and hiring manages onsite to discuss AWS career opportunities. Just send e-mail to email@example.com to set up an informal chat.
Thousands of healthcare customers are using AWS to change the way they deliver care, engage with patients, or incorporate new technology into their organization by using HIPAA-eligible big data services such as:
- Amazon EMR, a managed Hadoop framework.
- Amazon DynamoDB, a fast and flexible NoSQL database service.
- Amazon Aurora [MySQL-compatible edition only], a relational database engine that combines the speed and availability of high-end commercial databases with the simplicity and cost-effectiveness of open source databases.
- Amazon Redshift, a fast, simple, cost-effective data warehouse.
- Amazon S3, a durable, massively scalable object store.
Check out some past AWS Big Data Blog posts to see how these technologies are being used to improve healthcare:
- Readmission Prediction Through Patient Risk Stratification Using Amazon Machine Learning
- Introduction to Python UDFs in Amazon Redshift (CMS Open Payment Data)
- Create a Healthcare Data Hub with AWS and Mirth Connect
- How Eliza Corporation Moved Healthcare Data to the Cloud
- Will Spark Power the Data behind Precision Medicine?
- Interactive Analysis of Genomic Datasets Using Amazon Athena
For more information about how healthcare customers are using AWS, visit aws.amazon.com/health.
For customers who use Hive external tables on Amazon EMR, or any flavor of Hadoop, a key challenge is how to effectively migrate an existing Hive metastore to Amazon Athena, an interactive query service that directly analyzes data stored in Amazon S3. With Athena, there are no clusters to manage and tune, and no infrastructure to set up or manage. Customers pay only for the queries they run.
In this post, I discuss an approach to migrate an existing Hive metastore to Athena, as well as how to use the Athena JDBC driver to run scripts. I demonstrate two scripts.
- The first script exports external tables from a Hive metastore on EMR, or other Hadoop flavors, as a Hive script. This script handles both Hive metastores local to the cluster or metastores stored in an external database.
- The second script executes the Hive script in Athena over JDBC to import the external tables into the Athena catalog.
Both scripts are available in the aws-blog-athena-importing-hive-metastores GitHub repo.
You must have the following resources available:
- A working Python 2.7+ environment. (required for the first script)
- A working Java 1.8 runtime environment
- Groovy, if not already installed
- The Java classpath set to point to the Athena JDBC driver JAR file location
In EMR, you can use the following commands to complete the prerequisites (Python comes already installed):
Exporting external tables from a Hive metastore
The Python script exportdatabase.py exports external tables only from the Hive metastore, and saves them to a local file as a Hive script.
Here’s the sample output:
Athena does not support every data type and SerDe supported by Hive. Edit or replace contents in the generated Hive script as needed to ensure compatibility. For more information about supported datatypes and SerDes, see the Amazon Athena documentation.
Applications log a large amount of data that—when analyzed in real time—provides significant insight into your applications. Real-time log analysis can be used to ensure security compliance, troubleshoot operation events, identify application usage patterns, and much more.
Ingesting and analyzing this data in real time can be accomplished by using a variety of open source tools on Amazon EC2. Alternatively, you can use a set of simple, managed AWS services to perform serverless log analytics. The Amazon Kinesis platform includes the following managed services:
- Amazon Kinesis Streams streams data on AWS, which allows you to collect, store, and process TBs per hour at a low cost.
- Amazon Kinesis Firehose loads streaming data in to Amazon Kinesis Analytics, Amazon S3, Amazon Redshift, or Amazon Elasticsearch Service.
- Amazon Kinesis Analytics helps you analyze streaming data by writing SQL queries and in turn overcoming the management and monitoring of streaming logs in near real time. Analytics allows you to reference metadata stored in S3 in SQL queries for real-time analytics.
In this post, I show you how to implement a solution that analyzes streaming Apache access log data from an EC2 instance aggregated over 5 minutes. The solution helps you understand where requests to your applications are coming from. If the source is an unknown application or if a particular source application is trying to clog your application, you can contact the application owner.
Some challenges that this solution entails:
- You do not want to maintain (patch/upgrade) the log application or servers to do log analytics. You also want your log analytics to scale on demand by default, and so all components are managed services.
- Apache Logs logs the host IP address or host name. However, that information isn’t useful in the cloud where servers are fungible and hosts change constantly either to scale or heal automatically. So you maintain a flat file list of servers in an S3 bucket that can be updated by Auto Scaling policies and mapped to streaming log data.
The following diagram shows how this solution works.
- Application nodes run Apache applications and write Apache logs locally to disk. The Amazon Kinesis agent on the EC2 instance ingests the log stream in to the Amazon Kinesis stream.
- The log input stream from various application nodes is ingested in to the Amazon Kinesis stream.
- Machine metadata about the machine or application is stored in flat files in an S3 bucket. It is a mapping of host IP addresses with the application name and contact.
- The Analytics application processes streaming logs over tumbling windows by adding referenced machine metadata from S3.
- The output stream, which is the result of the aggregated responses from the Analytics application, is written into the Amazon Kinesis stream.
- The Lambda function consumes the aggregated response from the destination stream, processes it, and publishes it to Amazon CloudWatch. It is event driven: as soon as new records are pushed to the destination stream, they are processed in batches of 200 records.
- The CloudWatch dashboard is used to view response trends.
- Alarms on aggregated data are generated when specified thresholds are reached.
Another month of big data solutions on the Big Data Blog!
Take a look at our summaries below and learn, comment, and share. Thank you for reading!
Decreasing Game Churn: How Upopa used ironSource Atom and Amazon ML to Engage Users
Ever wondered what it takes to keep a user from leaving your game or application after all the hard work you put in? Wouldn’t it be great to get a chance to interact with the users before they’re about to leave? In this post, learn how ironSource worked with gaming studio Upopa to build an efficient, cheap, and accurate way to battle churn and make data-driven decisions using ironSource Atom’s data pipeline and Amazon ML.
Create a Healthcare Data Hub with AWS and Mirth Connect
Healthcare providers record patient information across different software platforms. Each of these platforms can have varying implementations of complex healthcare data standards. Also, each system needs to communicate with a central repository called a health information exchange (HIE) to build a central, complete clinical record for each patient. In this post, learn how to consume different data types as messages, transform the information within the messages, and then use AWS services to take action depending on the message type.
Call for Papers! DEEM: 1st Workshop on Data Management for End-to-End Machine Learning
Amazon and Matroid will hold the first workshop on Data Management for End-to-End Machine Learning (DEEM) on May 14th, 2017 in conjunction with the premier systems conference SIGMOD/PODS 2017 in Raleigh, North Carolina. DEEM brings together researchers and practitioners at the intersection of applied machine learning, data management, and systems research to discuss data management issues in ML application scenarios. The workshop is soliciting research papers that describe preliminary and ongoing research results.
Converging Data Silos to Amazon Redshift Using AWS DMS
In this post, learn to use AWS Database Migration Service (AWS DMS) and other AWS services to easily converge multiple heterogonous data sources to Amazon Redshift. You can then use Amazon QuickSight, to visualize the converged dataset to gain additional business insights.
Run Mixed Workloads with Amazon Redshift Workload Management
It’s common for mixed workloads to have some processes that require higher priority than others. Sometimes, this means a certain job must complete within a given SLA. Other times, this means you only want to prevent a non-critical reporting workload from consuming too many cluster resources at any one time. Without workload management (WLM), each query is prioritized equally, which can cause a person, team, or workload to consume excessive cluster resources for a process which isn’t as valuable as other more business-critical jobs. This post provides guidelines on common WLM patterns and shows how you can use WLM query insights to optimize configuration in production workloads.
Secure Amazon EMR with Encryption
In this post, learn how to set up encryption of data at multiple levels using security configurations with EMR. You’ll walk through the step-by-step process to achieve all the encryption prerequisites, such as building the KMS keys, building SSL certificates, and launching the EMR cluster with a strong security configuration.
In the last few years, there has been a rapid rise in enterprises adopting the Apache Hadoop ecosystem for critical workloads that process sensitive or highly confidential data. Due to the highly critical nature of the workloads, the enterprises implement certain organization/industry wide policies and certain regulatory or compliance policies. Such policy requirements are designed to protect sensitive data from unauthorized access.
A common requirement within such policies is about encrypting data at-rest and in-flight. Amazon EMR uses “security configurations” to make it easy to specify the encryption keys and certificates, ranging from AWS Key Management Service to supplying your own custom encryption materials provider.
You create a security configuration that specifies encryption settings and then use the configuration when you create a cluster. This makes it easy to build the security configuration one time and use it for any number of clusters.
In this post, I go through the process of setting up the encryption of data at multiple levels using security configurations with EMR. Before I dive deep into encryption, here are the different phases where data needs to be encrypted.
Mixed workloads run batch and interactive workloads (short-running and long-running queries or reports) concurrently to support business needs or demand. Typically, managing and configuring mixed workloads requires a thorough understanding of access patterns, how the system resources are being used and performance requirements.
It’s common for mixed workloads to have some processes that require higher priority than others. Sometimes, this means a certain job must complete within a given SLA. Other times, this means you only want to prevent a non-critical reporting workload from consuming too many cluster resources at any one time.
Without workload management (WLM), each query is prioritized equally, which can cause a person, team, or workload to consume excessive cluster resources for a process which isn’t as valuable as other more business-critical jobs.
This post provides guidelines on common WLM patterns and shows how you can use WLM query insights to optimize configuration in production workloads.
You can use WLM to define the separation of business concerns and to prioritize the different types of concurrently running queries in the system:
- Interactive: Software that accepts input from humans as it runs. Interactive software includes most popular programs, such as BI tools or reporting applications.
- Short-running, read-only user queries such as Tableau dashboard query with low latency requirements.
- Long-running, read-only user queries such as a complex structured report that aggregates the last 10 years of sales data.
- Batch: Execution of a job series in a server program without manual intervention (non-interactive). The execution of a series of programs, on a set or “batch” of inputs, rather than a single input, would instead be a custom job.
- Batch queries includes bulk INSERT, UPDATE, and DELETE transactions, for example, ETL or ELT programs.
Amazon Redshift Workload Management
Amazon Redshift is a fully managed, petabyte scale, columnar, massively parallel data warehouse that offers scalability, security and high performance. Amazon Redshift provides an industry standard JDBC/ODBC driver interface, which allows customers to connect their existing business intelligence tools and re-use existing analytics queries.
Amazon Redshift is a good fit for any type of analytical data model, for example, star and snowflake schemas, or simple de-normalized tables.
Amazon Redshift Workload Management allows you to manage workloads of various sizes and complexity for specific environments. Parameter groups contain WLM configuration, which determines how many query queues are available for processing and how queries are routed to those queues. The default parameter group settings are not configurable. Create a custom parameter group to modify the settings in that group, and then associate it with your cluster. The following settings can be configured:
- How many queries can run concurrently in each queue
- How much memory is allocated among the queues
- How queries are routed to queues, based on criteria such as the user who is running the query or a query label
- Query timeout settings for a queue
When the user runs a query, WLM assigns the query to the first matching queue and executes rules based on the WLM configuration. For more information about WLM query queues, concurrency, user groups, query groups, timeout configuration, and queue hopping capability, see Defining Query Queues. For more information about the configuration properties that can be changed dynamically, see WLM Dynamic and Static Configuration Properties.
For example, the WLM configuration in the following screenshot has three queues to support ETL, BI, and other users. ETL jobs are assigned to the long-running queue and BI queries to the short-running queue. Other user queries are executed in the default queue.
Organizations often grow organically—and so does their data in individual silos. Such systems are often powered by traditional RDBMS systems and they grow orthogonally in size and features. To gain intelligence across heterogeneous data sources, you have to join the data sets. However, this imposes new challenges, as joining data over dblinks or into a single view is extremely cumbersome and an operational nightmare.
This post walks through using AWS Database Migration Service (AWS DMS) and other AWS services to make it easy to converge multiple heterogonous data sources to Amazon Redshift. You can then use Amazon QuickSight, to visualize the converged dataset to gain additional business insights.
AWS service overview
Here’s a brief overview of AWS services that help with data convergence.
With DMS, you can migrate your data to and from most widely used commercial and open-source databases. The service supports homogenous migrations such as Oracle to Oracle, as well as heterogeneous migrations between different database platforms, such as Oracle to Amazon Aurora or Microsoft SQL Server to MySQL. It also allows you to stream data to Amazon Redshift from any of the supported sources including:
- Amazon Aurora
- SAP ASE
- SQL Server
DMS enables consolidation and easy analysis of data in the petabyte-scale data warehouse. It can also be used for continuous data replication with high availability.
Amazon QuickSight provides very fast, easy-to-use, cloud-powered business intelligence at 1/10th the cost of traditional BI solutions. QuickSight uses a new, super-fast, parallel, in-memory calculation engine (“SPICE”) to perform advanced calculations and render visualizations rapidly.
QuickSight integrates automatically with AWS data services, enables organizations to scale to hundreds of thousands of users, and delivers fast and responsive query performance to them. You can easily connect QuickSight to AWS data services, including Amazon Redshift, Amazon RDS, Amazon Aurora, Amazon S3, and Amazon Athena. You can also upload CSV, TSV, and spreadsheet files or connect to third-party data sources such as Salesforce.
Amazon Redshift delivers fast query performance by using columnar storage technology to improve I/O efficiency and parallelizing queries across multiple nodes. Amazon Redshift is typically priced at 1/10th of the price of the competition. We have many customers running petabyte scale data analytics on AWS using Amazon Redshift.
Amazon Redshift is also ANSI SQL compliant, supports JDBC/ODBC, and is easy to connect to your existing business intelligence (BI) solution. However, if your storage requirement is in the 10s of TB range and requires high levels of concurrency across small queries, you may want to consider Amazon Aurora as the target converged database.
Assume that you have an events company specializing on sports, and have built a MySQL database that holds data for the players and the sporting events. Customers and ticket information is stored in another database; in this case, assume it is PostgresSQL and this gets updated when customer purchases tickets from our website and mobile apps. You can download a sample dataset from the aws-database-migration-samples GitHub repo.
These databases could be anywhere: at an on-premises facility; on AWS in Amazon EC2 or Amazon RDS, or other cloud provider; or in a mixture of such locations. To complicate things a little more, you can assume that the lost opportunities (where a customer didn’t complete buying the ticket even though it was added to the shopping cart) are streamed via clickstream through Amazon Kinesis and then stored on Amazon S3. We then use AWS Data Pipeline to orchestrate a process to cleanse that data using Amazon EMR and make it ready for loading to Amazon Redshift. The clickstream integration is not covered in this post but was demonstrated in the recent Real-time Clickstream Anomaly Detection with Amazon Kinesis Analytics post.
In this solution, you use DMS to bring the two data sources into Amazon Redshift and run analytics to gain business insights. The following diagram demonstrates the proposed solution.
Amazon and Matroid will hold the first workshop on Data Management for End-to-End Machine Learning (DEEM) on May 14th, 2017 in conjunction with the premier systems conference SIGMOD/PODS 2017 in Raleigh, North Carolina. For more details about the workshop focus, see Challenges and opportunities in machine learning below.
DEEM brings together researchers and practitioners at the intersection of applied machine learning, data management, and systems research to discuss data management issues in ML application scenarios.
We’re soliciting research papers that describe preliminary and ongoing research results. We’re also looking for reports from industry describing end-to-end ML deployments. Submissions can either be short papers (4 pages) or long papers (up to 10 pages) following the ACM proceedings format.
Register and submit: https://cmt3.research.microsoft.com/DEEM2017/ (account needed)
Submission Deadline: February 1, 2017
Notification of Acceptance: March 1, 2017
Final papers due: March 20, 2017
Workshop: May 14th, 2017
Follow us on twitter @deem_workshop.
Challenges and opportunities in machine learning
Applying machine learning (ML) in real-world scenarios is challenging. In recent years, the database community has focused on creating systems and abstractions for efficiently training ML models on large datasets. But model training is only one of many steps in an end-to-end ML application. Many orthogonal data management problems arise from the large-scale use of ML. The data management community needs to focus on these problems.
For example, preprocessing data and extracting feature workloads causes complex pipelines that often require the simultaneous execution of relational and linear algebraic operations. Next, the class of the ML model to use needs to be chosen. For that, a set of popular approaches such as linear models, decision trees, and deep neural networks often must be analyzed, evaluated, and interpreted.
The prediction quality of such ML models depends on the choice of features and hyperparameters, which are typically selected in a costly offline evaluation process. Afterwards, the resulting models must be deployed and integrated into existing business workflows in a way that enables fast and efficient predictions while allowing for the lifecycle of models (that become stale over time) to be managed.
As a further complication, the resulting systems need to take the target audience of ML applications into account. This audience is heterogeneous, ranging from analysts without programming skills that possibly prefer an easy-to-use, cloud-based solution, to teams of data processing experts and statisticians that develop and deploy custom-tailored algorithms.
DEEM aims to bring together researchers and practitioners at the intersection of applied machine learning, data management and systems research to discuss data management issues in ML application scenarios. This workshop solicits regular research papers describing preliminary and ongoing research results. In addition, the workshop encourages the submission of industrial experience reports of end-to-end ML deployments.
Questions? Please send them to firstname.lastname@example.org
As anyone visiting their doctor may have noticed, gone are the days of physicians recording their notes on paper. Physicians are more likely to enter the exam room with a laptop than with paper and pen. This change is the byproduct of efforts to improve patient outcomes, increase efficiency, and drive population health. Pushing for these improvements has created many new data opportunities as well as challenges. Using a combination of AWS services and open source software, we can use these new datasets to work towards these goals and beyond.
When you get a physical examination, your doctor’s office has an electronic chart with information about your demographics (name, date of birth, address, etc.), healthcare history, and current visit. When you go to the hospital for an emergency, a whole new record is created that may contain duplicate or conflicting information. A simple example would be that my primary care doctor lists me as Joe whereas the hospital lists me as Joseph.
Providers record patient information across different software platforms. Each of these platforms can have varying implementations of complex healthcare data standards. Also, each system needs to communicate with a central repository called a health information exchange (HIE) to build a central, complete clinical record for each patient.
In this post, I demonstrate the capability to consume different data types as messages, transform the information within the messages, and then use AWS service to take action depending on the message type.
Overview of Mirth Connect
Using open source technologies on AWS, you can build a system that transforms, stores, and processes this data as needed. The system can scale to meet the ever-increasing demands of modern medicine. The project that ingests and processes this data is called Mirth Connect.
Mirth Connect is an open source, cross-platform, bidirectional, healthcare integration engine. This project is a standalone server that functions as a central point for routing and processing healthcare information.
Running Mirth Connect on AWS provides the necessary scalability and elasticity to meet the current and future needs of healthcare organizations.
Healthcare data hub walkthrough
Healthcare information comes from various sources and can generate large amounts of data:
- Health information exchange (HIE)
- Electronic health records system (EHR)
- Practice management system (PMS)
- Insurance company systems
- Pharmacy systems
- Other source systems that can make data accessible
Messages typically require some form of modification (transformation) to accommodate ingestion and processing in other systems. Using another project, Blue Button, you can dissect large healthcare messages and locate the sections/items of interest. You can also convert those messages into other formats for storage and analysis.
The examples in this post focus on the following data types representing information made available from a typical healthcare organization:
- HL7 (Health Level Seven) version 2 messages
- CDA (Clinical Data Architecture)/CDD (Continuity of Care Document)
- DICOM (Digital Imaging and Communications in Medicine)
- CSV (comma-separated variable)
HL7 version 2 messages define both a message format and communication protocol for health information. They are broken into different message types depending on the information that they transmit.
There are many message types available, such as ordering labs, prescription dispensing, billing, and more. During a routine doctor visit, numerous messages are created for each patient. This provides a lot of information but also a challenge in storage and processing. For a full list of message types, see Data Definition Tables, section A6. The two types used for this post are:
- ADT A01 (patient admission and visit notification)
- SIU S12 (new appointment booking)
As you can see, this text is formatted as delimited data, where the delimiters are defined in the top line message called the MSG segment. Mirth Connect can parse these messages and communicate using the standard HL7 network protocol.