Tag: Amazon Athena


Build a Serverless Architecture to Analyze Amazon CloudFront Access Logs Using AWS Lambda, Amazon Athena, and Amazon Kinesis Analytics

by Rajeev Srinivasan and Sai Sriparasa

Nowadays, it’s common for a web server to be fronted by a global content delivery service, like Amazon CloudFront. This type of front end accelerates delivery of websites, APIs, media content, and other web assets to provide a better experience to users across the globe.

The insights gained from analyzing Amazon CloudFront access logs help you improve website availability through bot detection and mitigation, optimize web content for the devices and browsers used to view your webpages, and reduce perceived latency by caching popular objects closer to their viewers. The result is a significant improvement in the overall experience for your users.

This blog post provides a way to build a serverless architecture to generate some of these insights. To do so, we analyze Amazon CloudFront access logs both at rest and in transit through the stream. This serverless architecture uses Amazon Athena to analyze large volumes of CloudFront access logs (on the scale of terabytes per day), and Amazon Kinesis Analytics for streaming analysis.

The analytic queries in this blog post focus on three common use cases:

  1. Detection of common bots using the user agent string
  2. Calculation of current bandwidth usage per Amazon CloudFront distribution per edge location
  3. Determination of the current top 50 viewers

However, you can easily extend the architecture described here to power dashboards for monitoring and reporting, and to trigger alarms based on deeper insights gained from processing and analyzing the logs. Examples include dashboards for cache performance, usage and viewer patterns, and so on.
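To make the batch analysis side concrete, the following is a minimal sketch of the kind of Athena query behind the third use case. It assumes the CloudFront access logs have already been registered as a table named cloudfront_logs with a requestip column; the table and column names are illustrative rather than the exact schema used later in the post.

SELECT requestip, count(*) AS request_count
FROM cloudfront_logs
GROUP BY requestip
ORDER BY request_count DESC
LIMIT 50;

A similar aggregation over a user agent column would underpin the bot-detection use case.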

The following diagram shows this architecture.

(more…)

Querying OpenStreetMap with Amazon Athena

by Seth Fitzsimmons

This is a guest post by Seth Fitzsimmons, member of the 2017 OpenStreetMap US board of directors. Seth works with clients including the Humanitarian OpenStreetMap Team, Mapzen, the American Red Cross, and World Bank to craft innovative geospatial solutions.

OpenStreetMap (OSM) is a free, editable map of the world, created and maintained by volunteers and available for use under an open license. Companies and non-profits like Mapbox, Foursquare, Mapzen, the World Bank, the American Red Cross and others use OSM to provide maps, directions, and geographic context to users around the world.

In the 12 years of OSM’s existence, editors have created and modified several billion features (physical things on the ground like roads or buildings). The main PostgreSQL database that powers the OSM editing interface is now over 2TB and includes historical data going back to 2007. As new users join the open mapping community, more and more valuable data is being added to OpenStreetMap, requiring increasingly powerful tools, interfaces, and approaches to explore its vastness.

This post explains how anyone can use Amazon Athena to quickly query publicly available OSM data stored in Amazon S3 (updated weekly) as an AWS Public Dataset. Imagine that you work for an NGO interested in improving knowledge of and access to health centers in Africa. You might want to know what’s already been mapped, to facilitate the production of maps of surrounding villages, and to determine where infrastructure investments are likely to be most effective.

Note: If you run all the queries in this post, you will be charged approximately $1 based on the number of bytes scanned. All queries used in this post can be found in this GitHub gist.

What is OpenStreetMap?

As an open content project, OSM makes regular data archives available to the public via planet.openstreetmap.org in a few different formats (XML, PBF). These include both snapshots of the current state of the data in OSM and historical archives.

Working with “the planet” (as the data archives are referred to) can be unwieldy. Because it contains data spanning the entire world, the size of a single archive is on the order of 50 GB. The format is bespoke and extremely specific to OSM. The data is incredibly rich, interesting, and useful, but the size, format, and tooling can often make it very difficult to even start the process of asking complex questions.

Heavy users of OSM data typically download the raw data and import it into their own systems, tailored for their individual use cases, such as map rendering, driving directions, or general analysis. Now that OSM data is available in the Apache ORC format on Amazon S3, it’s possible to query the data using Athena without even downloading it.

How does Athena help?

You can use Athena along with data made publicly available via OSM on AWS. You don’t have to learn how to install, configure, and populate your own server instances and go through multiple steps to download and transform the data into a queryable form. Thanks to AWS and partners, a regularly updated copy of the planet file (available within hours of OSM’s weekly publishing schedule) is hosted on S3 and made available in a format that lends itself to efficient querying using Athena.

Asking questions with Athena involves registering the OSM planet file as a table and making SQL queries. That’s it. Nothing to download, nothing to configure, nothing to ingest. Athena distributes your queries and returns answers within seconds, even while querying over 9 years and billions of OSM elements.

You’re in control. S3 provides high availability for the data and Athena charges you per TB of data scanned. Plus, we’ve gone through the trouble of keeping scanning charges as small as possible by transcoding OSM’s bespoke format as ORC. All the hard work of transforming the data into something highly queryable and making it publicly available is done; you just need to bring some questions.

Registering Tables

The OSM Public Datasets consist of three tables:

  • planet
    Contains the current versions of all elements present in OSM.
  • planet_history
    Contains a historical record of all versions of all elements (even those that have been deleted).
  • changesets
    Contains information about changesets in which elements were modified (and which have a foreign key relationship to both the planet and planet_history tables).

To register the OSM Public Datasets within your AWS account so you can query them, open the Athena console (make sure you are using the us-east-1 region) to paste and execute the following table definitions:
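As a preview, the planet table definition has roughly the following shape. Treat this as an illustrative sketch rather than the canonical DDL: the exact column list and the S3 location (shown here as s3://osm-pds/planet/) should be taken from the table definitions published with the dataset.

CREATE EXTERNAL TABLE planet (
  id BIGINT,
  type STRING,
  tags MAP<STRING,STRING>,
  lat DECIMAL(9,7),
  lon DECIMAL(10,7),
  nds ARRAY<STRUCT<ref:BIGINT>>,
  members ARRAY<STRUCT<type:STRING, ref:BIGINT, role:STRING>>,
  changeset BIGINT,
  `timestamp` TIMESTAMP,
  uid BIGINT,
  `user` STRING,
  version BIGINT
)
STORED AS ORC
LOCATION 's3://osm-pds/planet/';

Once the table is registered, queries run directly against the public data. For example, a rough count of mapped health facilities (nodes tagged as hospitals, clinics, or doctors) looks like this:

SELECT count(*)
FROM planet
WHERE type = 'node'
  AND tags['amenity'] IN ('hospital', 'clinic', 'doctors');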

(more…)

Top 10 Performance Tuning Tips for Amazon Athena

by Manjeet Chayel and Mert Hocanin

This blog post has been translated into Japanese.

Amazon Athena is an interactive query service that makes it easy to analyze data stored in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run. Athena is easy to use. Simply point to your data in Amazon S3, define the schema, and start querying using standard SQL.

In this blog post, we will review the top 10 tips that can improve query performance. We will focus on aspects related to storing data in Amazon S3 and on tuning specific to queries. Because Amazon Athena uses Presto to run SQL queries, some of this advice also applies if you are running Presto on Amazon EMR.

This post assumes that you have knowledge of different file formats, such as Parquet, ORC, Text files, Avro, CSV, TSV, and JSON.

Best practices: storage

This section discusses how to structure your data so that you can get the most out of Athena. The same practices can be applied to Amazon EMR data processing applications such as Spark, Presto, and Hive when your data is stored on Amazon S3.

1. Partition your data

Partitioning divides your table into parts and keeps the related data together based on column values such as date, country, region, etc. Partitions act as virtual columns. You define them at table creation, and they can help reduce the amount of data scanned per query, thereby improving performance. You can restrict the amount of data scanned by a query by specifying filters based on the partition. For more details, see Partitioning Data.

Athena supports Hive partitioning, which follows one of the following naming conventions:

a) Partition column name followed by an equal symbol (‘=’) and then the value.

s3://yourBucket/pathToTable/<PARTITION_COLUMN_NAME>=<VALUE>/<PARTITION_COLUMN_NAME>=<VALUE>/

If your dataset is partitioned in this format, then you can run the MSCK REPAIR TABLE command to add partitions to your table automatically.
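As an illustration, the following sketch shows a table declared with a partition column together with the repair command that discovers Hive-style prefixes. The table name, columns, and bucket are hypothetical.

-- Hypothetical table; adjust the columns and location to match your data
CREATE EXTERNAL TABLE flights (
  origin STRING,
  dest STRING
)
PARTITIONED BY (year INT)
STORED AS PARQUET
LOCATION 's3://yourBucket/pathToTable/';

-- Picks up partitions laid out as .../year=1987/, .../year=1988/, and so on
MSCK REPAIR TABLE flights;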

b) If the “path” of your data does not follow the above format, you can add the partitions manually using the ALTER TABLE ADD PARTITION command for each partition. For example:

s3://yourBucket/pathToTable/YYYY/MM/DD/

ALTER TABLE <tablename> ADD PARTITION (PARTITION_COLUMN_NAME = <VALUE>, PARTITION_COLUMN2_NAME = <VALUE>) LOCATION 's3://yourBucket/pathToTable/YYYY/MM/DD/';

Note: Using this approach, you can map any S3 location to whatever partition values you want to refer to it by.

The following example shows how data is partitioned on the year column for the flights table stored in an S3 bucket.

$ aws s3 ls s3://athena-examples/flight/parquet/
PRE year=1987/
PRE year=1988/
PRE year=1989/
PRE year=1990/
PRE year=1991/
PRE year=1992/
PRE year=1993/

You can restrict the partitions that are scanned in a query by using the partition column in the WHERE clause.

SELECT dest, origin FROM flights WHERE year = 1991

You can also use multiple columns as partition keys, which lets you scan the data for specific values, such as a particular day:

s3://athena-examples/flight/parquet/year=1991/month=1/day=1/

s3://athena-examples/flight/parquet/year=1991/month=1/day=2/

When deciding the columns on which to partition, consider the following:

  • Columns that are used as filters are good candidates for partitioning.
  • Partitioning has a cost. As the number of partitions in your table increases, so does the overhead of retrieving and processing the partition metadata, and your files get smaller. Partitioning too finely can wipe out the initial benefit.
  • If your data is heavily skewed to one partition value, and most queries use that value, then the overhead may wipe out the initial benefit.

Example:

The table below compares query run times between a partitioned and a non-partitioned table. Both tables contain 74 GB of uncompressed data stored in text format. The partitioned table is partitioned by the l_shipdate column and has 2,526 partitions.

Query: SELECT count(*) FROM lineitem WHERE l_shipdate = '1996-09-01'
  Non-partitioned table: 9.71 seconds run time, 74.1 GB scanned, $0.36
  Partitioned table: 2.16 seconds run time, 29.06 MB scanned, $0.0001
  Savings: 99% cheaper, 77% faster

Query: SELECT count(*) FROM lineitem WHERE l_shipdate >= '1996-09-01' AND l_shipdate < '1996-10-01'
  Non-partitioned table: 10.41 seconds run time, 74.1 GB scanned, $0.36
  Partitioned table: 2.73 seconds run time, 871.39 MB scanned, $0.004
  Savings: 98% cheaper, 73% faster

However, partitioning also has a penalty as shown in the following run times. Make sure that you don’t over-partition your data.

Query: SELECT count(*) FROM lineitem;
  Non-partitioned table: 8.4 seconds run time, 74.1 GB scanned, $0.36
  Partitioned table: 10.65 seconds run time, 74.1 GB scanned, $0.36
  Result: 27% slower

2. Compress and split files

Compressing your data can speed up your queries significantly, as long as the files are either of an optimal size (see the next section) or the files are splittable. The smaller data sizes reduce network traffic from S3 to Athena.

Splittable files allow Athena’s execution engine to split the reading of a single file across multiple readers to increase parallelism. If you have a single unsplittable file, then only a single reader can read the file while all other readers sit idle. Not all compression algorithms are splittable. The following table lists common compression formats and their attributes.

(more…)

Running R on Amazon Athena

by Gopal Wunnava

This blog post has been translated into Japanese.

Data scientists are often concerned about managing the infrastructure behind big data platforms while running SQL from R. Amazon Athena is an interactive query service that works directly with data stored in S3 and makes it easy to analyze data using standard SQL without the need to manage infrastructure. Integrating R with Amazon Athena gives data scientists a powerful platform for building interactive analytical solutions.

In this blog post, you’ll connect R/RStudio running on an Amazon EC2 instance with Athena.

Prerequisites

Before you get started, complete the following steps.

  1. Have your AWS account administrator give your AWS account the required permissions to access Athena through the AWS Identity and Access Management (IAM) console. This can be done by attaching the associated Athena policies to your data scientist user group in IAM.

 


(more…)

Analyzing VPC Flow Logs with Amazon Kinesis Firehose, Amazon Athena, and Amazon QuickSight

by Ian Robinson and Ben Snively

Many business and operational processes require you to analyze large volumes of frequently updated data. Log analysis, for example, involves querying and visualizing large volumes of log data to identify behavioral patterns, understand application processing flows, and investigate and diagnose issues.

VPC flow logs capture information about the IP traffic going to and from network interfaces in your VPCs. The logs allow you to investigate network traffic patterns and identify threats and risks across your VPC estate. Flow log data is stored using Amazon CloudWatch Logs. After you’ve created a flow log, you can view and retrieve its data in Amazon CloudWatch Logs.

Flow logs can help you with a number of tasks. For example, you can use them to troubleshoot why specific traffic is not reaching an instance, which in turn can help you diagnose overly restrictive security group rules. You can also use flow logs as a security tool to monitor the traffic that is reaching your instance.

This blog post shows how to build a serverless architecture by using Amazon Kinesis Firehose, AWS Lambda, Amazon S3, Amazon Athena, and Amazon QuickSight to collect, store, query, and visualize flow logs. In building this solution, you will also learn how to implement Athena best practices with regard to compressing and partitioning data so as to reduce query latencies and drive down query costs.

Summary of the solution

The solution described here is divided into three parts:

  • Send VPC Flow Logs to S3 for Analysis with Athena. This section describes how to use Lambda and Firehose to publish flow log data to S3, and how to create a table in Athena so that you can query this data.
  • Visualize Your Logs in QuickSight. Here you’ll learn how to use QuickSight and its Athena connector to build flow log analysis dashboards that you can share with other users in your organization.
  • Partition Your Data in Athena for Improved Query Performance and Reduced Costs. This section shows how you can use a Lambda function to automatically partition Athena data as it arrives in S3. This function will work with any Firehose stream and any other delivery mechanism that writes data to S3 using a year/month/day/hour prefix.

Partitioning your data is one of three strategies for improving Athena query performance and reducing costs. The other two are compressing your data, and converting it into columnar formats such as Apache Parquet. The solution described here automatically compresses your data, but it doesn’t convert it into a columnar format. Even if you don’t convert your data to a columnar format, as is the case here, it’s always worth compressing and partitioning it. For any large-scale solution, you should also consider converting it to Parquet.
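To make the first and third parts of the solution more concrete, here is a rough sketch of what the flow log table and a manually added partition might look like. The columns simply mirror the standard VPC flow log record fields, and the bucket, prefix, and partition naming are assumptions rather than the exact DDL used in the walkthrough.

CREATE EXTERNAL TABLE vpc_flow_logs (
  version INT,
  account STRING,
  interfaceid STRING,
  sourceaddress STRING,
  destinationaddress STRING,
  sourceport INT,
  destinationport INT,
  protocol INT,
  numpackets INT,
  numbytes BIGINT,
  starttime INT,
  endtime INT,
  action STRING,
  logstatus STRING
)
PARTITIONED BY (ingestdatetime STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
LOCATION 's3://your-flow-log-bucket/flowlogs/';

-- Registers one hour of data written by Firehose under a year/month/day/hour prefix
ALTER TABLE vpc_flow_logs
  ADD PARTITION (ingestdatetime = '2017-02-01-03')
  LOCATION 's3://your-flow-log-bucket/flowlogs/2017/02/01/03/';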

Serverless Architecture for Analyzing VPC Flow Logs

Below is a diagram showing how the various services work together.

[Diagram: serverless architecture for analyzing VPC Flow Logs]

When you create a flow log for a VPC, the log data is published to a log group in CloudWatch Logs. By using a CloudWatch Logs subscription, you can send a real-time feed of these log events to a Lambda function that uses Firehose to write the log data to S3.

(more…)

Analyze Security, Compliance, and Operational Activity Using AWS CloudTrail and Amazon Athena

by Sai Sriparasa and Bob O'Dell

As organizations move their workloads to the cloud, audit logs provide a wealth of information on the operations, governance, and security of assets and resources. As the complexity of the workloads increases, so does the volume of audit logs being generated. It becomes increasingly difficult for organizations to analyze and understand what is happening in their accounts without a significant investment of time and resources.

AWS CloudTrail and Amazon Athena help make it easier by combining the detailed CloudTrail log files with the power of the Athena SQL engine to easily find, analyze, and respond to changes and activities in an AWS account.

AWS CloudTrail records API calls and account activities and publishes the log files to Amazon S3. Account activity is tracked as an event in the CloudTrail log file. Each event carries information such as who performed the action, when the action was done, which resources were impacted, and many more details. Multiple events are stitched together and structured in a JSON format within the CloudTrail log files.

Amazon Athena uses Apache Hive’s data definition language (DDL) to create tables and Presto, a distributed SQL engine, to run queries. Apache Hive does not natively support files in JSON, so we’ll have to use a SerDe to help Hive understand how the records should be processed. A SerDe interface is a combination of a serializer and a deserializer. The deserializer takes raw data and converts it into a Java object, while the serializer converts a Java object back into a usable representation.

In this blog post, we will walk through how to set up and use the recently released Amazon Athena CloudTrail SerDe to query CloudTrail log files for EC2 security group modifications, console sign-in activity, and operational account activity. This post assumes that customers already have AWS CloudTrail configured. For more information about configuring CloudTrail, see Getting Started with AWS CloudTrail in the AWS CloudTrail User Guide.
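As a taste of what this enables, a query for recent console sign-in activity might look like the following. The table name and fields are assumptions based on standard CloudTrail event attributes; the actual table is created with the CloudTrail SerDe as described in the walkthrough.

SELECT useridentity.username, sourceipaddress, awsregion, eventtime
FROM cloudtrail_logs
WHERE eventname = 'ConsoleLogin'
ORDER BY eventtime DESC
LIMIT 50;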

Setting up Amazon Athena

Let’s start by signing in to the Amazon Athena console and performing the following steps.


(more…)

Harmonize, Search, and Analyze Loosely Coupled Datasets on AWS

by Ryan Jancaitis, Oliver Atoa, and Bob Strahan

You have come up with an exciting hypothesis, and now you are keen to find and analyze as much data as possible to prove (or refute) it. There are many datasets that might be applicable, but they have been created at different times by different people and don’t conform to any common standard. They use different names for variables that mean the same thing and the same names for variables that mean different things. They use different units of measurement and different categories. Some have more variables than others. And they all have data quality issues (for example, badly formed dates and times, invalid geographic coordinates, and so on).

You first need a way to harmonize these datasets, to identify the variables that mean the same thing and make sure that these variables have the same names and units. You also need to clean up or remove records with invalid data.

After the datasets are harmonized, you need to search through the data to find the datasets you’re interested in. Not all of them have records that are relevant to your hypothesis, so you want to filter on a number of important variables to narrow down the datasets and verify they contain enough matching records to be significant.

Having identified the datasets of interest, you are ready to run your custom analyses on the data they contain so that you can prove your hypothesis and create beautiful visualizations to share with the world!

In this blog post, we will describe a sample application that illustrates how to solve these problems. You can install our sample app, which will:

  • Harmonize and index three disparate datasets to make them searchable.
  • Present a data-driven, customizable UI for searching the datasets to do preliminary analysis and to locate relevant datasets.
  • Integrate with Amazon Athena and Amazon QuickSight for custom analysis and visualization.

Example data

The Police Data Initiative seeks to improve community and law enforcement relations through the public availability of data related to police activity. Datasets from participating cities, available through the Public Safety Open Data Portal, have many of the problems just outlined. Despite the commonality of crime and location metadata, there is no standard naming or value scheme. Datasets are stored in various locations and in various formats. There is no central search and discovery engine. To gain insights and value from this data, you have to analyze datasets city by city.

Although the focus of this post is police incident data, the same approach can be used for datasets in other domains, such as IoT, personalized medicine, news, weather, finance, and much more.

Architecture

Our architecture uses the following AWS services:

The diagram below illustrates the solution architecture:

[Diagram: solution architecture]

(more…)

Create Tables in Amazon Athena from Nested JSON and Mappings Using JSONSerDe

by Rick Wiggins

Most systems use JavaScript Object Notation (JSON) to log event information. Although JSON is efficient and flexible, deriving information from it can be difficult.

In this post, you will use the tightly coupled integration of Amazon Kinesis Firehose for log delivery, Amazon S3 for log storage, and Amazon Athena with JSONSerDe to run SQL queries against these logs without the need for data transformation or insertion into a database. It’s done in a completely serverless way. There’s no need to provision any compute.

Amazon SES provides highly detailed logs for every message that travels through the service and, with SES event publishing, makes them available through Firehose. However, parsing detailed logs for trends or compliance data would require a significant investment in infrastructure and development time. Athena is a boon to these data seekers because it can query this dataset at rest, in its native format, with zero code or architecture. On top of that, it uses largely native SQL queries and syntax.

Walkthrough: Establishing a dataset

We start with a dataset of an SES send event that looks like this:

{
	"eventType": "Send",
	"mail": {
		"timestamp": "2017-01-18T18:08:44.830Z",
		"source": "youraddress@example.com",
		"sourceArn": "arn:aws:ses:us-west-2:111222333:identity/youraddress@example.com",
		"sendingAccountId": "111222333",
		"messageId": "01010159b2c4471e-fc6e26e2-af14-4f28-b814-69e488740023-000000",
		"destination": ["success@simulator.amazonses.com"],
		"headersTruncated": false,
		"headers": [{
				"name": "From",
				"value": "youraddress@example.com"
			}, {
				"name": "To",
				"value": "success@simulator.amazonses.com"
			}, {
				"name": "Subject",
				"value": "Bounced Like a Bad Check"
			}, {
				"name": "MIME-Version",
				"value": "1.0"
			}, {
				"name": "Content-Type",
				"value": "text/plain; charset=UTF-8"
			}, {
				"name": "Content-Transfer-Encoding",
				"value": "7bit"
			}
		],
		"commonHeaders": {
			"from": ["youraddress@example.com"],
			"to": ["success@simulator.amazonses.com"],
			"messageId": "01010159b2c4471e-fc6e26e2-af14-4f28-b814-69e488740023-000000",
			"subject": "Test"
		},
		"tags": {
			"ses:configuration-set": ["Firehose"],
			"ses:source-ip": ["54.55.55.55"],
			"ses:from-domain": ["amazon.com"],
			"ses:caller-identity": ["root"]
		}
	},
	"send": {}
}

This dataset contains a lot of valuable information about this SES interaction. There are thousands of datasets in the same format to parse for insights. Getting this data is straightforward.

1. Create a configuration set in the SES console or CLI that uses a Firehose delivery stream to send and store logs in S3 in near real-time.

(more…)
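Although the excerpt ends here, it may help to see roughly where the walkthrough is headed. The following is a minimal, illustrative JSONSerDe mapping for a few of the nested SES fields shown above; it is not the post's full table definition, and the bucket and prefix are placeholders.

CREATE EXTERNAL TABLE sesblog (
  eventType STRING,
  mail STRUCT<source:STRING,
              messageId:STRING,
              destination:ARRAY<STRING>,
              tags:MAP<STRING,ARRAY<STRING>>>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://your-ses-log-bucket/firehose-prefix/';

With a mapping like this in place, nested fields such as mail.destination or mail.tags['ses:source-ip'] can be addressed directly in SQL.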

Migrate External Table Definitions from a Hive Metastore to Amazon Athena

by Neil Mukerje

For customers who use Hive external tables on Amazon EMR, or any flavor of Hadoop, a key challenge is how to effectively migrate an existing Hive metastore to Amazon Athena, an interactive query service that directly analyzes data stored in Amazon S3. With Athena, there are no clusters to manage and tune, and no infrastructure to set up or manage. Customers pay only for the queries they run.

In this post, I discuss an approach to migrate an existing Hive metastore to Athena, as well as how to use the Athena JDBC driver to run scripts. I demonstrate two scripts.

  1. The first script exports external tables from a Hive metastore on EMR, or other Hadoop flavors, as a Hive script. This script handles Hive metastores both local to the cluster and stored in an external database.
  2. The second script executes the Hive script in Athena over JDBC to import the external tables into the Athena catalog.

Both scripts are available in the aws-blog-athena-importing-hive-metastores GitHub repo.

Prerequisites

You must have the following resources available:

  • A working Python 2.7+ environment (required for the first script)
  • A working Java 1.8 runtime environment
  • Groovy, if not already installed
  • The Java classpath set to point to the Athena JDBC driver JAR file location

In EMR, you can use the following commands to complete the prerequisites (Python comes already installed):

# set Java to 1.8
EMR $> export JAVA_HOME=/usr/lib/jvm/java-1.8.0

# Download Groovy and set Groovy binary in PATH
EMR $> wget https://dl.bintray.com/groovy/maven/apache-groovy-binary-2.4.7.zip
EMR $> unzip apache-groovy-binary-2.4.7.zip
EMR $> export PATH=$PATH:`pwd`/groovy-2.4.7/bin/

# Download latest Athena JDBC driver and set it in JAVA CLASSPATH
EMR $> aws s3 cp s3://athena-downloads/drivers/AthenaJDBC41-1.0.0.jar .
EMR $> export CLASSPATH=`pwd`/AthenaJDBC41-1.0.0.jar

Exporting external tables from a Hive metastore

The Python script exportdatabase.py exports external tables only from the Hive metastore, and saves them to a local file as a Hive script.

EMR $> python exportdatabase.py <<Hive database name>> 

Here’s the sample output:

EMR $> python exportdatabase.py default

Found 10 tables in database...

Database metadata exported to default_export.hql.

Athena does not support every data type and SerDe supported by Hive. Edit or replace the contents of the generated Hive script as needed to ensure compatibility. For more information about supported data types and SerDes, see the Amazon Athena documentation.
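For reference, the exported .hql file contains plain Hive DDL. A hypothetical entry might look like the following (the table, columns, and bucket are purely illustrative); statements of this shape are what the second script replays against Athena over JDBC.

CREATE EXTERNAL TABLE IF NOT EXISTS default.weblogs (
  request_time STRING,
  request_ip STRING,
  status INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION 's3://yourBucket/weblogs/';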

(more…)

Derive Insights from IoT in Minutes using AWS IoT, Amazon Kinesis Firehose, Amazon Athena, and Amazon QuickSight

by Ben Snively

Ben Snively is a Solutions Architect with AWS

Speed and agility are essential with today’s analytics tools. The quicker you can get from idea to first results, the more you can experiment and innovate with your data, perform ad-hoc analysis, and drive answers to new business questions.

Serverless architectures help in this respect by taking care of the undifferentiated heavy lifting in your environment, such as managing servers, clusters, and device endpoints, allowing you to focus on assembling your IoT system, analyzing data, and building meaningful reports quickly and efficiently.

In this post, I show how you can build a business intelligence capability for streaming IoT device data using AWS serverless and managed services. You can be up and running in minutes, starting small but able to grow easily to millions of devices and billions of messages.
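To give a flavor of the ad hoc analysis this pipeline enables once device data lands in Amazon S3, a query like the following could be run in Athena. The table, columns, and partition values here are hypothetical and are not taken from this post.

SELECT device_id, avg(temperature) AS avg_temperature
FROM iot_sensor_data
WHERE year = '2017' AND month = '02'
GROUP BY device_id
ORDER BY avg_temperature DESC
LIMIT 10;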

AWS serverless services

AWS services offer a quicker time to intelligence. The following is a high-level diagram showing how the services in this post are configured:

[Diagram: high-level architecture of the services used in this post]

(more…)