AWS Big Data Blog
Amazon Athena is an interactive query service that makes it easy to analyze data stored in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run. Athena is easy to use. Simply point to your data in Amazon S3, define the schema, and start querying using standard SQL.
In this blog post, we will review the top 10 tips that can improve query performance. We will focus on aspects related to storing data in Amazon S3 and tuning specific to queries. Amazon Athena uses Presto to run SQL queries, so some of this advice also applies if you are running Presto on Amazon EMR.
This post assumes that you have knowledge of different file formats, such as Parquet, ORC, Text files, Avro, CSV, TSV, and JSON.
Best practices: storage
This section discusses how to structure your data so that you can get the most out of Athena. The same practices can be applied to Amazon EMR data processing applications such as Spark, Presto, and Hive when your data is stored on Amazon S3.
1. Partition your data
Partitioning divides your table into parts and keeps the related data together based on column values such as date, country, region, etc. Partitions act as virtual columns. You define them at table creation, and they can help reduce the amount of data scanned per query, thereby improving performance. You can restrict the amount of data scanned by a query by specifying filters based on the partition. For more details, see Partitioning Data.
Athena supports Hive partitioning, which uses one of the following naming conventions:
a) The partition column name followed by an equals sign ('=') and then the value.
If your dataset is partitioned in this format, then you can run the MSCK REPAIR TABLE command to add partitions to your table automatically.
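For example, if the flights data used later in this post were stored under keys such as s3://yourBucket/pathToTable/year=1991/ (the table name and path here are illustrative), a single command would load every partition found under the table location:

MSCK REPAIR TABLE flights;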
b) If the "path" of your data does not follow the above format, you can add the partitions manually using the ALTER TABLE ADD PARTITION command for each partition. For example:
ALTER TABLE <tablename> ADD PARTITION (PARTITION_COLUMN_NAME = <VALUE>, PARTITION_COLUMN2_NAME = <VALUE>) LOCATION 's3://yourBucket/pathToTable/YYYY/MM/DD/';
Note: Using this approach, you can map any S3 location to whatever partition values you want to refer to it by.
The following example shows how data is partitioned on the year column for the flights table stored in an S3 bucket.
You can restrict the partitions that are scanned in a query by using the column in the ‘WHERE’ clause.
SELECT dest, origin FROM flights WHERE year = 1991
You can also use multiple columns as partition keys and then scan only the data that matches specific values of each, and so on.
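For example, if the flights table were also partitioned by a hypothetical month column in addition to year, the following query would scan only the files under the matching year/month partitions:

SELECT dest, origin FROM flights WHERE year = 1991 AND month = 1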
When deciding the columns on which to partition, consider the following:
- Columns that are used as filters are good candidates for partitioning.
- Partitioning has a cost. As the number of partitions in your table increases, so does the overhead of retrieving and processing the partition metadata, and your files get smaller. Partitioning too finely can wipe out the initial benefit.
- If your data is heavily skewed to one partition value, and most queries use that value, then the overhead may wipe out the initial benefit.
The table below compares query run times between a partitioned and a non-partitioned table. Both tables contain 74 GB of uncompressed data stored in text format. The partitioned table is partitioned by the l_shipdate column and has 2,526 partitions.
| Query | Run time (non-partitioned) | Data scanned (non-partitioned) | Cost (non-partitioned) | Run time (partitioned) | Data scanned (partitioned) | Cost (partitioned) | Savings |
| --- | --- | --- | --- | --- | --- | --- | --- |
| SELECT count(*) FROM lineitem WHERE l_shipdate = '1996-09-01' | 9.71 seconds | 74.1 GB | $0.36 | 2.16 seconds | 29.06 MB | $0.0001 | |
| SELECT count(*) FROM lineitem WHERE l_shipdate >= '1996-09-01' AND l_shipdate < '1996-10-01' | 10.41 seconds | 74.1 GB | $0.36 | 2.73 seconds | 871.39 MB | $0.004 | 98% cheaper |
However, partitioning also has a penalty as shown in the following run times. Make sure that you don’t over-partition your data.
| Query | Run time (non-partitioned) | Data scanned (non-partitioned) | Cost (non-partitioned) | Run time (partitioned) | Data scanned (partitioned) | Cost (partitioned) | Savings |
| --- | --- | --- | --- | --- | --- | --- | --- |
| SELECT count(*) FROM lineitem; | 8.4 seconds | 74.1 GB | $0.36 | 10.65 seconds | 74.1 GB | $0.36 | 27% slower |
2. Compress and split files
Compressing your data can speed up your queries significantly, as long as the files are either of an optimal size (see the next section) or the files are splittable. The smaller data sizes reduce network traffic from S3 to Athena.
Splittable files allow Athena’s execution engine to split the reading of a single file across multiple readers to increase parallelism. If you have a single unsplittable file, then only a single reader can read the file while all other readers sit idle. Not all compression algorithms are splittable. The following table lists common compression formats and their attributes.
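Splittability aside, no special DDL is needed to query compressed text data; Athena decompresses files based on their extension (for example, .gz). As a minimal sketch, with the bucket path and column list as placeholders, the definition below simply points at a prefix containing GZIP-compressed, pipe-delimited files. Because GZIP is not splittable, each file is read by a single reader, so keep individual files at a reasonable size:

CREATE EXTERNAL TABLE lineitem_gz (
  l_orderkey BIGINT,
  l_shipdate STRING,
  l_quantity DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
LOCATION 's3://yourBucket/lineitem-gzip/';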
Data scientists are often concerned about managing the infrastructure behind big data platforms while running SQL queries from R. Amazon Athena is an interactive query service that works directly with data stored in S3 and makes it easy to analyze data using standard SQL without the need to manage infrastructure. Integrating R with Amazon Athena gives data scientists a powerful platform for building interactive analytical solutions.
In this blog post, you’ll connect R/RStudio running on an Amazon EC2 instance with Athena.
Before you get started, complete the following steps.
- Have your AWS account administrator give your AWS account the required permissions to access Athena via the AWS Identity and Access Management (IAM) console. This can be done by attaching the associated Athena policies to your data scientist user group in IAM.
Ian Meyers is a Solutions Architecture Senior Manager with AWS
The serverless Amazon Redshift Monitoring utility lets you gather important performance metrics from your Redshift cluster’s system tables and persist the results in Amazon CloudWatch. This serverless solution leverages AWS Lambda to schedule custom SQL queries and process the results. With this utility, you can use Amazon CloudWatch to monitor disk-based queries, WLM queue wait time, alerts, average query times, and other data. This allows you to create visualizations with CloudWatch dashboards, generate alarms on specific values, and create rules to react to those alarms.
You can now create your own diagnostic queries and plug in “canaries” that monitor the runtime of your most vital end-user queries. These user-defined metrics can be used to create dashboards and trigger alarms, and should improve visibility into the workloads running on a cluster. They might also facilitate sizing discussions.
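For example, a user-defined metric that counts queries spilling to disk could be driven by a statement along these lines (a sketch against the SVL_QUERY_SUMMARY system view, not one of the utility's built-in queries):

SELECT COUNT(DISTINCT query) AS diskbased_queries
FROM svl_query_summary
WHERE is_diskbased = 't';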
View the README to get started.
Another month of big data solutions on the Big Data Blog!
Take a look at our summaries below and learn, comment, and share. Thank you for reading!
Implement Serverless Log Analytics Using Amazon Kinesis Analytics
In this post, learn how to implement a solution that analyzes streaming Apache access log data from an EC2 instance aggregated over 5 minutes.
Migrate External Table Definitions from a Hive Metastore to Amazon Athena
For customers who use Hive external tables on Amazon EMR, or any flavor of Hadoop, a key challenge is how to effectively migrate an existing Hive metastore to Amazon Athena, an interactive query service that directly analyzes data stored in Amazon S3. In this post, learn an approach to migrate an existing Hive metastore to Athena, as well as how to use the Athena JDBC driver to run scripts.
AWS Big Data is Coming to HIMSS!
This year’s HIMSS conference was held at the Orange County Convention Center in Orlando, Florida from February 20 – 23. This blog post lists past AWS Big Data Blog posts to show how AWS technologies are being used to improve healthcare.
Create Tables in Amazon Athena from Nested JSON and Mappings Using JSONSerDe
In this post, you will use the tightly coupled integration of Amazon Kinesis Firehose for log delivery, Amazon S3 for log storage, and Amazon Athena with JSONSerDe to run SQL queries against these logs without the need for data transformation or insertion into a database.
Scheduled Refresh for SPICE Data Sets on Amazon QuickSight
QuickSight uses SPICE (Super-fast, Parallel, In-Memory Calculation Engine), a fully managed data store that enables blazing fast visualizations and can ingest data from AWS, on-premises, and cloud sources. Data in SPICE can be refreshed at any time with the click of a button within QuickSight. This post announced the ability to schedule these refreshes!
Harmonize, Search, and Analyze Loosely Coupled Datasets on AWS
You have come up with an exciting hypothesis, and now you are keen to find and analyze as much data as possible to prove (or refute) it. There are many datasets that might be applicable, but they have been created at different times by different people and don’t conform to any common standard. In this blog post, we describe a sample application that illustrates how to solve these problems. You can install our sample app, which harmonizes and indexes three disparate datasets to make them searchable; presents a data-driven, customizable UI for searching the datasets to do preliminary analysis and locate relevant datasets; and integrates with Amazon Athena and Amazon QuickSight for custom analysis and visualization.
We’re back in San Jose for the Strata conference, March 13-16, 2017, to talk all things big data at AWS and show you some of our latest innovations. Come meet the AWS Big Data team at booth #928, where big data experts will be happy to answer your questions, hear about your requirements, and help you with your big data initiatives.
New this year, we’re hosting a hands-on tutorial on Tuesday, where big data solutions architects will guide you through creating a sample big data application using services such as Amazon Kinesis, Amazon Athena, and AWS Lambda.
As in previous years, Strata will host a number of technical sessions where you can get best practices and learn more about big data on AWS. Here are some recommendations to get you started.
Amazon Kinesis Data Streaming Services
Roger Barga, AWS
11:50am Wednesday, March 15, 2017. Room LL20 C
Serverless Big Data Architectures: Design Patterns and Best Practices
Siva Raghupathy & Ben Snively, AWS
5:10pm Wednesday, March 15, 2017. Room 210 B/F
Feature Engineering for Diverse Data Types
Alice Zheng, Amazon
5:10pm–5:50pm Wednesday, March 15, 2017. Room 230 C
Zillow: Transforming real estate through big data and data science
Jasjeet Thind, Zillow
11:50am–12:30pm Wednesday, March 15, 2017. Room 230 A
Recommending 1+ billion items to 100+ million users in real time: Harnessing the structure of the user-to-object graph to extract ranking signals at scale
Jure Leskovec, Pinterest
11:50am–12:30pm Wednesday, March 15, 2017. Room 230 C
Shifting left for continuous quality in an Agile data world
Avinash Padmanabhan, Intuit
1:50pm–2:30pm Wednesday, March 15, 2017. Room LL20 A
Looking forward to seeing you at Strata + Hadoop World!
Many business and operational processes require you to analyze large volumes of frequently updated data. Log analysis, for example, involves querying and visualizing large volumes of log data to identify behavioral patterns, understand application processing flows, and investigate and diagnose issues.
VPC flow logs capture information about the IP traffic going to and from network interfaces in VPCs in the Amazon VPC service. The logs allow you to investigate network traffic patterns and identify threats and risks across your VPC estate. Flow log data is stored using Amazon CloudWatch Logs. After you’ve created a flow log, you can view and retrieve its data in Amazon CloudWatch Logs.
Flow logs can help you with a number of tasks. For example, you can use them to troubleshoot why specific traffic is not reaching an instance, which in turn can help you diagnose overly restrictive security group rules. You can also use flow logs as a security tool to monitor the traffic that is reaching your instance.
This blog post shows how to build a serverless architecture by using Amazon Kinesis Firehose, AWS Lambda, Amazon S3, Amazon Athena, and Amazon QuickSight to collect, store, query, and visualize flow logs. In building this solution, you will also learn how to implement Athena best practices with regard to compressing and partitioning data so as to reduce query latencies and drive down query costs.
Summary of the solution
The solution described here is divided into three parts:
- Send VPC Flow Logs to S3 for Analysis with Athena. This section describes how to use Lambda and Firehose to publish flow log data to S3, and how to create a table in Athena so that you can query this data.
- Visualize Your Logs in QuickSight. Here you’ll learn how to use QuickSight and its Athena connector to build flow log analysis dashboards that you can share with other users in your organization.
- Partition Your Data in Athena for Improved Query Performance and Reduced Costs. This section shows how you can use a Lambda function to automatically partition Athena data as it arrives in S3. This function will work with any Firehose stream and any other delivery mechanism that writes data to S3 using a year/month/day/hour prefix.
Partitioning your data is one of three strategies for improving Athena query performance and reducing costs. The other two are compressing your data, and converting it into columnar formats such as Apache Parquet. The solution described here automatically compresses your data, but it doesn’t convert it into a columnar format. Even if you don’t convert your data to a columnar format, as is the case here, it’s always worth compressing and partitioning it. For any large-scale solution, you should also consider converting it to Parquet.
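As a sketch of what the partitioning step produces (the table name, partition column, and bucket path here are illustrative, not the exact definitions used later in the post), the Lambda function adds a partition for each new year/month/day/hour prefix with a statement such as:

ALTER TABLE vpc_flow_logs ADD IF NOT EXISTS PARTITION (ingestdatetime = '2017-03-15-10')
LOCATION 's3://yourBucket/vpc-flow-logs/2017/03/15/10/';

Queries that filter on the partition column then scan only the objects under the matching hour.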
Serverless Architecture for Analyzing VPC Flow Logs
Below is a diagram showing how the various services work together.
When you create a flow log for a VPC, the log data is published to a log group in CloudWatch Logs. By using a CloudWatch Logs subscription, you can send a real-time feed of these log events to a Lambda function that uses Firehose to write the log data to S3.
AWS offers a range of training resources to help you advance your knowledge with practical skills so you can get more out of the cloud. We’ve updated Big Data on AWS, a three-day, instructor-led training course to keep pace with the latest AWS big data innovations. This course allows you to hear big data best practices from an expert, get answers to your questions in person, and get hands-on practice using AWS big data services. Anyone interested in learning about the services and architecture patterns behind big data solutions on AWS will benefit from this training.
Specifically, this course introduces you to cloud-based big data solutions such as Amazon EMR, Amazon Redshift, Amazon Kinesis, and the rest of the AWS big data platform. This course shows you how to use Amazon EMR to process data using the broad ecosystem of Hadoop tools like Hive and Hue. We also teach you how to create big data environments, work with Amazon DynamoDB, Amazon Redshift, Amazon QuickSight, Amazon Athena, and Amazon Kinesis, and leverage best practices to design big data environments for security and cost-effectiveness.
This new version of the course incorporates feedback and adds new content. There’s a new module around Big Data Processing and Analytics that focuses on Amazon Athena. We’ve also updated the course with more context for IoT, more content for Kinesis Firehose, new content for Kinesis Analytics and Amazon Snowball, and added content for Amazon QuickSight.
If you’re interested in this course, you can search for a local Big Data on AWS class in our Global Class Schedule. Or, if you’d like to arrange a private onsite class for your team, you can contact us about scheduling. You can also explore other training courses on our Classes & Workshops page.
As organizations move their workloads to the cloud, audit logs provide a wealth of information on the operations, governance, and security of assets and resources. As the complexity of the workloads increases, so does the volume of audit logs being generated. It becomes increasingly difficult for organizations to analyze and understand what is happening in their accounts without a significant investment of time and resources.
AWS CloudTrail and Amazon Athena help make it easier by combining the detailed CloudTrail log files with the power of the Athena SQL engine to easily find, analyze, and respond to changes and activities in an AWS account.
AWS CloudTrail records API calls and account activities and publishes the log files to Amazon S3. Account activity is tracked as an event in the CloudTrail log file. Each event carries information such as who performed the action, when the action was done, which resources were impacted, and many more details. Multiple events are stitched together and structured in a JSON format within the CloudTrail log files.
Amazon Athena uses Apache Hive’s data definition language (DDL) to create tables and Presto, a distributed SQL engine, to run queries. Apache Hive does not natively support files in JSON, so we’ll have to use a SerDe to help Hive understand how the records should be processed. A SerDe interface is a combination of a serializer and a deserializer: the deserializer converts raw data into Java objects that Hive can process, while the serializer converts Java objects back into a representation Hive can write.
In this blog post, we will walk through how to set up and use the recently released Amazon Athena CloudTrail SerDe to query CloudTrail log files for EC2 security group modifications, console sign-in activity, and operational account activity. This post assumes that customers already have AWS CloudTrail configured. For more information about configuring CloudTrail, see Getting Started with AWS CloudTrail in the AWS CloudTrail User Guide.
Setting up Amazon Athena
Let’s start by signing in to the Amazon Athena console and performing the following steps.
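As a rough sketch of where those steps lead (the bucket name, account ID placeholder, and abbreviated column list are illustrative, not the full definition), the CloudTrail SerDe is referenced in the table DDL, after which security group changes can be found by filtering on event names:

CREATE EXTERNAL TABLE cloudtrail_logs (
  eventversion STRING,
  eventtime STRING,
  eventsource STRING,
  eventname STRING,
  awsregion STRING,
  sourceipaddress STRING,
  useragent STRING
)
ROW FORMAT SERDE 'com.amazon.emr.hive.serde.CloudTrailSerde'
STORED AS INPUTFORMAT 'com.amazon.emr.cloudtrail.CloudTrailInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://yourCloudTrailBucket/AWSLogs/<account-id>/CloudTrail/';

SELECT eventtime, eventname, sourceipaddress FROM cloudtrail_logs WHERE eventname LIKE '%SecurityGroup%';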
You have come up with an exciting hypothesis, and now you are keen to find and analyze as much data as possible to prove (or refute) it. There are many datasets that might be applicable, but they have been created at different times by different people and don’t conform to any common standard. They use different names for variables that mean the same thing and the same names for variables that mean different things. They use different units of measurement and different categories. Some have more variables than others. And they all have data quality issues (for example, badly formed dates and times, invalid geographic coordinates, and so on).
You first need a way to harmonize these datasets, to identify the variables that mean the same thing and make sure that these variables have the same names and units. You also need to clean up or remove records with invalid data.
After the datasets are harmonized, you need to search through the data to find the datasets you’re interested in. Not all of them have records that are relevant to your hypothesis, so you want to filter on a number of important variables to narrow down the datasets and verify they contain enough matching records to be significant.
Having identified the datasets of interest, you are ready to run your custom analyses on the data they contain so that you can prove your hypothesis and create beautiful visualizations to share with the world!
In this blog post, we will describe a sample application that illustrates how to solve these problems. You can install our sample app, which will:
- Harmonize and index three disparate datasets to make them searchable.
- Present a data-driven, customizable UI for searching the datasets to do preliminary analysis and to locate relevant datasets.
- Integrate with Amazon Athena and Amazon QuickSight for custom analysis and visualization.
The Police Data Initiative seeks to improve community and law enforcement relations through the public availability of data related to police activity. Datasets from participating cities, available through the Public Safety Open Data Portal, have many of the problems just outlined. Despite the commonality of crime and location metadata, there is no standard naming or value scheme. Datasets are stored in various locations and in various formats. There is no central search and discovery engine. To gain insights and value from this data, you have to analyze datasets city by city.
Although the focus of this post is police incident data, the same approach can be used for datasets in other domains, such as IoT, personalized medicine, news, weather, finance, and much more.
Our architecture uses the following AWS services:
- Amazon EMR (with Apache Spark and Jupyter notebooks) to explore, clean, harmonize (transform), describe, and save multiple, loosely coupled datasets.
- Amazon S3 to store both raw and harmonized datasets.
- Amazon Elasticsearch Service (Amazon ES) to host secure, searchable indexes of selected dataset variables and the associated dictionary/metadata used to power the search web page.
- Amazon EC2 Container Service (ECS) to host a web-based search UI.
- Amazon Athena and Amazon QuickSight to provide analysis and reporting.
- AWS CodeBuild and AWS CodePipeline to build and deliver the search UI application on ECS.
- AWS Identity and Access Management (IAM) policies and instance roles to allow least-privilege access to Amazon ES from the UI containers and from the EMR cluster.
- AWS CloudFormation to orchestrate the provisioning of the environment.
The diagram below illustrates the solution architecture:
Jose Kunnackal is a Senior Product Manager for Amazon QuickSight
This blog post has been translated into Japanese.
In November 2016, we launched Amazon QuickSight, a cloud-powered business analytics service that lets you quickly and easily visualize your data. QuickSight uses SPICE (Super-fast, Parallel, In-Memory Calculation Engine), a fully managed data store that enables blazing fast visualizations and can ingest data from AWS, on-premises, and cloud sources. Data in SPICE can be refreshed at any time with the click of a button within QuickSight.
Today, we are launching the ability to schedule these refreshes!
To schedule a SPICE data set refresh
Select a SPICE data set and choose Schedule refresh. Then, specify the time zone, frequency, and start time of the refresh.