Tag: Amazon Quicksight

Analyzing VPC Flow Logs with Amazon Kinesis Firehose, Amazon Athena, and Amazon QuickSight

by Ian Robinson and Ben Snively | on | | Comments

Many business and operational processes require you to analyze large volumes of frequently updated data. Log analysis, for example, involves querying and visualizing large volumes of log data to identify behavioral patterns, understand application processing flows, and investigate and diagnose issues.

VPC flow logs capture information about the IP traffic going to and from network interfaces in VPCs in the Amazon VPC service. The logs allow you to investigate network traffic patterns and identify threats and risks across your VPC estate. Flow log data is stored using Amazon CloudWatch Logs. After you’ve created a flow log, you can view and retrieve its data in Amazon CloudWatch Logs.

Flow logs can help you with a number of tasks. For example, you can use them to troubleshoot why specific traffic is not reaching an instance, which in turn can help you diagnose overly restrictive security group rules. You can also use flow logs as a security tool to monitor the traffic that is reaching your instance.

This blog post shows how to build a serverless architecture by using Amazon Kinesis Firehose, AWS Lambda, Amazon S3, Amazon Athena, and Amazon QuickSight to collect, store, query, and visualize flow logs. In building this solution, you will also learn how to implement Athena best practices with regard to compressing and partitioning data so as to reduce query latencies and drive down query costs.

Summary of the solution

The solution described here is divided into three parts:

  • Send VPC Flow Logs to S3 for Analysis with Athena. This section describes how to use Lambda and Firehose to publish flow log data to S3, and how to create a table in Athena so that you can query this data.
  • Visualize Your Logs in QuickSight. Here you’ll learn how to use QuickSight and its Athena connector to build flow log analysis dashboards that you can share with other users in your organization.
  • Partition Your Data in Athena for Improved Query Performance and Reduced Costs. This section shows how you can use a Lambda function to automatically partition Athena data as it arrives in S3. This function will work with any Firehose stream and any other delivery mechanism that writes data to S3 using a year/month/day/hour prefix.

Partitioning your data is one of three strategies for improving Athena query performance and reducing costs. The other two are compressing your data, and converting it into columnar formats such as Apache Parquet. The solution described here automatically compresses your data, but it doesn’t convert it into a columnar format. Even if you don’t convert your data to a columnar format, as is the case here, it’s always worth compressing and partitioning it. For any large-scale solution, you should also consider converting it to Parquet.

Serverless Architecture for Analyzing VPC Flow Logs

Below is a diagram showing how the various services work together.


When you create a flow log for a VPC, the log data is published to a log group in CloudWatch Logs. By using a CloudWatch Logs subscription, you can send a real-time feed of these log events to a Lambda function that uses Firehose to write the log data to S3.


Harmonize, Search, and Analyze Loosely Coupled Datasets on AWS

by Ryan Jancaitis, Oliver Atoa, and Bob Strahan | on | | Comments

You have come up with an exciting hypothesis, and now you are keen to find and analyze as much data as possible to prove (or refute) it. There are many datasets that might be applicable, but they have been created at different times by different people and don’t conform to any common standard. They use different names for variables that mean the same thing and the same names for variables that mean different things. They use different units of measurement and different categories. Some have more variables than others. And they all have data quality issues (for example, badly formed dates and times, invalid geographic coordinates, and so on).

You first need a way to harmonize these datasets, to identify the variables that mean the same thing and make sure that these variables have the same names and units. You also need to clean up or remove records with invalid data.

After the datasets are harmonized, you need to search through the data to find the datasets you’re interested in. Not all of them have records that are relevant to your hypothesis, so you want to filter on a number of important variables to narrow down the datasets and verify they contain enough matching records to be significant.

Having identified the datasets of interest, you are ready to run your custom analyses on the data they contain so that you can prove your hypothesis and create beautiful visualizations to share with the world!

In this blog post, we will describe a sample application that illustrates how to solve these problems. You can install our sample app, which will:

  • Harmonize and index three disparate datasets to make them searchable.
  • Present a data-driven, customizable UI for searching the datasets to do preliminary analysis and to locate relevant datasets.
  • Integrate with Amazon Athena and Amazon QuickSight for custom analysis and visualization.

Example data

The Police Data Initiative seeks to improve community and law enforcement relations through the public availability of data related to police activity. Datasets from participating cities, available through the Public Safety Open Data Portal, have many of the problems just outlined. Despite the commonality of crime and location metadata, there is no standard naming or value scheme. Datasets are stored in various locations and in various formats. There is no central search and discovery engine. To gain insights and value from this data, you have to analyze datasets city by city.

Although the focus of this post is police incident data, the same approach can be used for datasets in other domains, such as IoT, personalized medicine, news, weather, finance, and much more.


Our architecture uses the following AWS services:

The diagram below illustrates the solution architecture:

HarmonizeSearch_1 (more…)

Scheduled Refresh for SPICE Data Sets on Amazon QuickSight

by Jose Kunnackal | on | | Comments

Jose Kunnackal is a Senior Product Manager for Amazon Quicksight

This blog post has been translated into Japanese.

In November 2016, we launched Amazon QuickSight, a cloud-powered, business analytics service that lets you quickly and easily visualize your data. QuickSight uses SPICE (Super-fast, Parallel, In-Memory Calculation Engine), a fully managed data store that enables blazing fast visualizations and can ingest data from AWS, on-premises, and cloud sources. Data in SPICE could be refreshed at any time with the click of a button within QuickSight.

Today, we are launching the ability to schedule these refreshes!

To schedule a SPICE data set refresh

Select a SPICE data set and choose Schedule refresh. Then, specify the time zone, frequency, and start time of the refresh.



Converging Data Silos to Amazon Redshift Using AWS DMS

by Pratim Das | on | | Comments

Organizations often grow organically—and so does their data in individual silos. Such systems are often powered by traditional RDBMS systems and they grow orthogonally in size and features. To gain intelligence across heterogeneous data sources, you have to join the data sets. However, this imposes new challenges, as joining data over dblinks or into a single view is extremely cumbersome and an operational nightmare.

This post walks through using AWS Database Migration Service (AWS DMS) and other AWS services to make it easy to converge multiple heterogonous data sources to Amazon Redshift. You can then use Amazon QuickSight, to visualize the converged dataset to gain additional business insights.

AWS service overview

Here’s a brief overview of AWS services that help with data convergence.


With DMS, you can migrate your data to and from most widely used commercial and open-source databases. The service supports homogenous migrations such as Oracle to Oracle, as well as heterogeneous migrations between different database platforms, such as Oracle to Amazon Aurora or Microsoft SQL Server to MySQL. It also allows you to stream data to Amazon Redshift from any of the supported sources including:

  • Amazon Aurora
  • PostgreSQL
  • MySQL
  • MariaDB
  • Oracle
  • SQL Server

DMS enables consolidation and easy analysis of data in the petabyte-scale data warehouse. It can also be used for continuous data replication with high availability.

Amazon QuickSight

Amazon QuickSight provides very fast, easy-to-use, cloud-powered business intelligence at 1/10th the cost of traditional BI solutions. QuickSight uses a new, super-fast, parallel, in-memory calculation engine (“SPICE”) to perform advanced calculations and render visualizations rapidly.

QuickSight integrates automatically with AWS data services, enables organizations to scale to hundreds of thousands of users, and delivers fast and responsive query performance to them. You can easily connect QuickSight to AWS data services, including Amazon Redshift, Amazon RDS, Amazon Aurora, Amazon S3, and Amazon Athena. You can also upload CSV, TSV, and spreadsheet files or connect to third-party data sources such as Salesforce.

Amazon Redshift

Amazon Redshift delivers fast query performance by using columnar storage technology to improve I/O efficiency and parallelizing queries across multiple nodes. Amazon Redshift is typically priced at 1/10th of the price of the competition. We have many customers running petabyte scale data analytics on AWS using Amazon Redshift.

Amazon Redshift is also ANSI SQL compliant, supports JDBC/ODBC, and is easy to connect to your existing business intelligence (BI) solution. However, if your storage requirement is in the 10s of TB range and requires high levels of concurrency across small queries, you may want to consider Amazon Aurora as the target converged database.


Assume that you have an events company specializing on sports, and have built a MySQL database that holds data for the players and the sporting events. Customers and ticket information is stored in another database; in this case, assume it is PostgresSQL and this gets updated when customer purchases tickets from our website and mobile apps. You can download a sample dataset from the aws-database-migration-samples GitHub repo.

These databases could be anywhere: at an on-premises facility; on AWS in Amazon EC2 or Amazon RDS, or other cloud provider; or in a mixture of such locations. To complicate things a little more, you can assume that the lost opportunities (where a customer didn’t complete buying the ticket even though it was added to the shopping cart) are streamed via clickstream through Amazon Kinesis and then stored on Amazon S3. We then use AWS Data Pipeline to orchestrate a process to cleanse that data using Amazon EMR and make it ready for loading to Amazon Redshift. The clickstream integration is not covered in this post but was demonstrated in the recent Real-time Clickstream Anomaly Detection with Amazon Kinesis Analytics post.


In this solution, you use DMS to bring the two data sources into Amazon Redshift and run analytics to gain business insights. The following diagram demonstrates the proposed solution.



Derive Insights from IoT in Minutes using AWS IoT, Amazon Kinesis Firehose, Amazon Athena, and Amazon QuickSight

by Ben Snively | on | | Comments

Ben Snively is a Solutions Architect with AWS

Speed and agility are essential with today’s analytics tools. The quicker you can get from idea to first results, the more you can experiment and innovate with your data, perform ad-hoc analysis, and drive answers to new business questions.

Serverless architectures help in this respect by taking care of the non-differentiated heavy lifting in your environment―tasks such as managing servers, clusters, and device endpoints – allowing you to focus on assembling your IoT system, analyzing data, and building meaningful reports quickly and efficiently.

In this post, I show how you can build a business intelligence capability for streaming IoT device data using AWS serverless and managed services. You can be up and running in minutes―starting small, but able to easily grow to millions of devices and billions of messages.

AWS serverless services

AWS services offer a quicker time to intelligence. The following is a high-level diagram showing how the services in this post are configured:



Build a Community of Analysts with Amazon QuickSight

by Nick Corbett | on | | Comments

Imagine you’ve just landed your dream job. You’ve always liked tackling the hardest problems and you’ve got one now: You’ll work for a chain of coffee shops that’s struggling against fierce competition, tight budgets, and low morale. But there’s a new management team in place. As head of business intelligence (BI), you think you can be part of the team that turns things around.

Your new company (let’s call them CoffeeCo) has already invested heavily in analytics, and staff at the head office has access to a wealth of reports. But you quickly see that the problem the company faces isn’t a lack of reports—it lies elsewhere.

You know that there’s a manager at each coffee shop who’s working hard to boost sales by running promotions and encouraging use of a reward card. However, CoffeeCo’s tight budgets prevent sharing much of the existing analytics with these managers, so they struggle to see the impact of their actions. In fact, anything beyond a daily email containing limited sales data is considered too expensive. Perhaps more information will give the managers the insights they need to boost their sales. By using AWS, you’re sure you can test your ideas by building a solution that’s easy to use, robust and low cost.

Day one

On your first day, you sit down with the IT team and sketch out the architecture that you’d like built:


Your architecture is based on fully managed AWS services to ensure cost-effective operation and no long-term lock-in. Data will be sent to Amazon Kinesis Firehose, which is a fully managed service for delivering real-time streaming data to destinations such as Amazon Simple Storage Service (Amazon S3), Amazon Redshift, or Amazon Elasticsearch Service (Amazon ES). You choose to deliver your data to Amazon Redshift, a fully managed, petabyte-scale data warehouse service. Finally, in order to visualize the data, you will load it into Amazon QuickSight.