AWS Big Data Blog

Query and Visualize AWS Cost and Usage Data Using Amazon Athena and Amazon QuickSight

If you’ve ever wondered whether a serverless alternative exists for consuming and querying your AWS Cost and Usage report data, wonder no more. The answer is yes, and this post both introduces that solution and shows how simple it is to deploy.

This solution uses the AWS serverless technologies Amazon Athena, AWS Lambda, AWS CloudFormation, and Amazon S3. It doesn’t stop there: you can also use Amazon QuickSight, a serverless, cloud-powered business analytics service, to build visualizations and perform ad hoc analysis of your sanitized AWS Cost and Usage report data.

Amazon Athena, what’s that?

Amazon Athena is an interactive query service that makes it easy to analyze data directly in Amazon S3 using standard SQL. It’s a serverless platform, so there is no infrastructure to set up or manage. Athena scales automatically—executing queries in parallel—so results are fast, even with large datasets and complex queries.

Athena exposes several API operations that allow developers to automate running queries, or to use services like Lambda to trigger queries in response to events in other services such as S3. This solution takes advantage of these capabilities and lets you focus on running the SQL queries that yield the results you are looking for. It builds the components in Athena that you need to run these queries, such as the database and corresponding table.
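
As a hedged illustration of that kind of automation (not code from this solution), the following Python sketch uses the boto3 Athena client to start a query and poll until it finishes. The database name, results bucket, table, and column names are hypothetical placeholders.

import time
import boto3

athena = boto3.client("athena")

# Placeholder names; substitute your own database and query-results bucket.
DATABASE = "cost_and_usage_db"
OUTPUT_LOCATION = "s3://my-athena-results-bucket/cur-queries/"

def run_query(sql):
    """Start an Athena query and wait for it to reach a terminal state."""
    start = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": DATABASE},
        ResultConfiguration={"OutputLocation": OUTPUT_LOCATION},
    )
    query_id = start["QueryExecutionId"]
    while True:
        status = athena.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return query_id, state
        time.sleep(2)

# Example: total cost per service (the table and column names are assumptions).
query_id, state = run_query(
    "SELECT product_code, SUM(unblended_cost) AS cost "
    "FROM cur_table GROUP BY product_code ORDER BY cost DESC LIMIT 10"
)
print(query_id, state)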

So, what’s the solution?

Today, the Billing and Cost Management service writes your AWS Cost and Usage report to an S3 bucket that you designate when you create the report. You can have AWS write these files as GZIP or ZIP files on a schedule, either hourly or daily.

The CloudFormation template that accompanies this blog post builds a serverless environment that contains a Lambda function that takes a CUR file, unzips it in memory, removes the header row, and writes the modified report to an S3 bucket. The Lambda function writes this file into an S3 bucket with a directory structure of “year=current-year” and “month=current-month”. For example, if a file is written for June 1st, 2017, the Lambda function writes it to the folder path “bucket-name/year=2017/month=06/file_name”. The S3 bucket in which the Lambda function creates this directory structure is created by the CloudFormation template when the environment is built.
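
As an illustration of the behavior described above (the actual function ships with the CloudFormation template), a minimal Lambda handler might look like the following sketch. It assumes a GZIP-compressed CSV report, an S3 event trigger, and a hypothetical destination bucket, and it hard-codes the partition values that the real function would derive from the report.

import gzip
import boto3

s3 = boto3.client("s3")

# Hypothetical destination bucket; the CloudFormation template creates the real one.
DEST_BUCKET = "my-cur-athena-bucket"

def handler(event, context):
    """Triggered by S3 when a new Cost and Usage report (GZIP CSV) arrives."""
    record = event["Records"][0]["s3"]
    src_bucket = record["bucket"]["name"]
    src_key = record["object"]["key"]

    # Read and decompress the report in memory.
    body = s3.get_object(Bucket=src_bucket, Key=src_key)["Body"].read()
    text = gzip.decompress(body).decode("utf-8")

    # Drop the header row so Athena does not read it as data.
    cleaned = "\n".join(text.split("\n")[1:])

    # Derive year=/month= partition folders; placeholders here, parsed from the
    # report key or delivery date in practice.
    year, month = "2017", "06"
    dest_key = "year={}/month={}/{}".format(
        year, month, src_key.split("/")[-1].replace(".gz", "")
    )

    s3.put_object(Bucket=DEST_BUCKET, Key=dest_key, Body=cleaned.encode("utf-8"))
    return dest_key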

The following diagram provides an example of what you should see in the AWS Management Console after your Lambda function runs.

(more…)

Create Custom AMIs and Push Updates to a Running Amazon EMR Cluster Using Amazon EC2 Systems Manager

Amazon EMR lets you have complete control over your cluster, giving you the flexibility to customize a cluster and install additional applications easily. EMR customers often use bootstrap actions to install and configure custom software in a cluster. However, bootstrap actions only run during the cluster or node startup. This makes it difficult for you to make configuration changes after a cluster is already running.

EMR clusters can also use a custom Amazon Machine Image (AMI). With the new support for launching clusters with custom Amazon Linux AMIs, customizing an EMR cluster is now even easier. However, creating and managing custom AMIs can become increasingly difficult as the number of AMIs in your environment grows.

Amazon EC2 Systems Manager helps you automate various management tasks such as automating AMI creation or running a command or script across hundreds of instances. In this post, I show how Systems Manager Automation can be used to automate the creation and patching of custom Amazon Linux AMIs for EMR.

Systems Manager Run Command lets you remotely manage the configuration of Amazon EC2 instances or on-premises machines. Run Command can be used to help you perform the following types of tasks on your EMR cluster nodes: install applications, restart daemons (HDFS, YARN, Presto, etc.), and make configuration changes. I also show how you can use Run Command to send commands to all nodes of a running EMR cluster.
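
For example, a minimal boto3 sketch for sending a shell command to every node of a running cluster might look like this; the cluster ID and the command are placeholders, and the nodes must be running the SSM agent with an instance profile that permits Systems Manager access.

import boto3

emr = boto3.client("emr")
ssm = boto3.client("ssm")

CLUSTER_ID = "j-XXXXXXXXXXXX"  # hypothetical cluster ID

# Collect the EC2 instance IDs of the cluster's nodes.
instance_ids = [
    i["Ec2InstanceId"]
    for i in emr.list_instances(ClusterId=CLUSTER_ID)["Instances"]
]

# Send a shell command to all nodes at once via Run Command.
response = ssm.send_command(
    InstanceIds=instance_ids,
    DocumentName="AWS-RunShellScript",
    Parameters={"commands": ["sudo yum -y update mypackage"]},
)
print(response["Command"]["CommandId"])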

Benefits of using a custom AMI

Although you can easily customize an EMR cluster using bootstrap actions, there can be benefits to using a custom AMI.

    • Reduction of cluster start time
      There are certain scenarios where a bootstrap action may affect your cluster start time. For example, a bootstrap action might download a large program over the internet, delaying the point at which your cluster is ready. Installing the program directly in the AMI instead can reduce the time it takes for a cluster launch to complete.
    • Prevent unexpected bootstrap action failures
      There are also scenarios where installing and configuring custom software directly in the AMI reduces the risk of unexpected failures. For example, a mirror or repo used by your bootstrap action to download a program might be offline or inaccessible. This could cause your bootstrap action to fail, which could cause a cluster launch failure.
    • Support for Amazon EBS root volume encryption
      A number of security and encryption features are available with EMR security configurations. These include the ability to encrypt data at rest for HDFS (local volumes/Amazon EBS) and Amazon S3. However, certain regulatory/compliance policies may require that the root (boot) volume also be encrypted. By bringing your own Amazon Linux AMI, you can create AMIs that use encrypted EBS root volumes and use those AMIs for your EMR clusters (see the sketch after this list).
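
As a rough sketch of launching a cluster from such an AMI with boto3 (the AMI ID, instance types, and role names below are assumptions, not values from this post):

import boto3

emr = boto3.client("emr")

# Hypothetical AMI ID, for example one produced by a Systems Manager Automation workflow.
CUSTOM_AMI_ID = "ami-0123456789abcdef0"

response = emr.run_job_flow(
    Name="cluster-with-custom-ami",
    ReleaseLabel="emr-5.7.0",  # custom AMIs are supported on newer EMR releases (5.7.0+ at the time of writing)
    CustomAmiId=CUSTOM_AMI_ID,
    Applications=[{"Name": "Hadoop"}, {"Name": "Spark"}],
    Instances={
        "MasterInstanceType": "m4.large",
        "SlaveInstanceType": "m4.large",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])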

(more…)

Unite Real-Time and Batch Analytics Using the Big Data Lambda Architecture, Without Servers!

The Big Data Lambda Architecture seeks to provide data engineers and architects with a scalable, fault-tolerant data processing architecture and framework using loosely coupled, distributed systems. At a high level, the Lambda Architecture is designed to handle both real-time and historically aggregated batched data in an integrated fashion. It separates the duties of real-time and batch processing so purpose-built engines, processes, and storage can be used for each, while serving and query layers present a unified view of all of the data.

Historically, the Lambda Architecture demanded the use of various complex systems to achieve the outcomes of uniting batch and real-time views. Data platform engineers and architects were required to implement services running on Amazon EC2 for data collection and ingestion, batch processing, stream processing, serving layers, and dashboards/reporting. As time has gone on, AWS customers have continued to ask for managed solutions that scale seamlessly and put less focus on infrastructure, allowing teams to focus on what really matters: the data and the resulting insights.

In this post, I show you how you can use AWS services like AWS Glue to build a Lambda Architecture completely without servers. I use a practical demonstration to examine the tight integration between serverless services on AWS and create a robust data processing Lambda Architecture system.

New: AWS Glue

With the launch of AWS Glue, AWS provides a portfolio of services to architect a Big Data platform without managing any servers or clusters. AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it easy for customers to prepare and load their data for analytics. You can create and run an ETL job with a few clicks in the AWS Management Console. You simply point AWS Glue to your data stored on AWS, and AWS Glue discovers your data and stores the associated metadata (for example, the table definition and schema) in the AWS Glue Data Catalog. After it’s cataloged, your data is immediately searchable, queryable, and available for ETL.

AWS Glue generates the code to execute your data transformations and data loading processes. Furthermore, AWS Glue provides a managed Spark execution environment to run ETL jobs against a data lake in Amazon S3. In short, you can now run a Lambda Architecture on AWS in a completely serverless fashion!
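
To give a feel for what a Glue ETL script looks like, here is a simplified PySpark sketch that reads a table from the AWS Glue Data Catalog and writes it back to S3 as Parquet. The database, table, field, and bucket names are hypothetical and stand in for whatever the crawler discovers in your account.

import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read the raw sensor data that the crawler cataloged (names are placeholders).
raw = glueContext.create_dynamic_frame.from_catalog(
    database="sensor_db", table_name="raw_sensor_data"
)

# Keep only the fields needed for the batch efficiency analysis.
trimmed = raw.select_fields(
    ["device_id", "temperature_setting", "temperature_reading", "timestamp"]
)

# Write the result back to S3 as Parquet for efficient querying with Athena.
glueContext.write_dynamic_frame.from_options(
    frame=trimmed,
    connection_type="s3",
    connection_options={"path": "s3://my-analytics-bucket/batch-output/"},
    format="parquet",
)

job.commit()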

“Serverless” applications allow you to build and run applications without thinking about servers. What this means is that you can now stream data in real-time, process huge volumes of data in S3, and run SQL queries and visualizations against that data without managing server provisioning, installation, patching, or capacity scaling. This frees up your users to spend more time interpreting the data and deriving business value for your organization.

Solution overview

The AWS services involved in this solution include:

  • Amazon Kinesis Firehose
  • Amazon S3
  • AWS Glue
  • Amazon Kinesis Analytics
  • Amazon Athena
  • Amazon QuickSight

The following diagram explains how the services work together:

None of the included services require the creation, configuration, or installation of servers, clusters, and databases.

In this example, you use these services to send simulated streaming data from sensor devices to Kinesis Firehose and store the raw data in S3. Using AWS Glue, you analyze the raw data from S3 in a batch-oriented fashion to look at thermostat efficiency over time against the historical data, and store the results back in S3.
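
A minimal producer for the simulated sensor stream could look like the sketch below; the delivery stream name and the record fields are illustrative assumptions rather than the exact schema used in this post.

import json
import random
import time
import boto3

firehose = boto3.client("firehose")

STREAM_NAME = "sensor-data-stream"  # hypothetical delivery stream name

def send_simulated_reading(device_id):
    """Send one simulated thermostat reading to Kinesis Firehose."""
    reading = {
        "device_id": device_id,
        "temperature_setting": random.choice([70, 71, 72]),
        "temperature_reading": round(random.uniform(65.0, 85.0), 1),
        "timestamp": int(time.time()),
    }
    firehose.put_record(
        DeliveryStreamName=STREAM_NAME,
        Record={"Data": json.dumps(reading) + "\n"},
    )

for i in range(100):
    send_simulated_reading("thermostat-{}".format(i % 10))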

Using Amazon Kinesis Analytics, you analyze and filter the data to detect inefficient sensors in real time. In this example, you detect inefficiencies in thermostat devices by comparing their temperature settings against the temperatures they are reading (thermostats are typically set between 70°F and 72°F).

Finally, you use Athena and Amazon QuickSight to query and visualize the data and build a dashboard that can be shared with other users of Amazon QuickSight in your organization.

(more…)

Implement Continuous Integration and Delivery of Apache Spark Applications using AWS

by Luis Caro Perez and Samuel Schmidt

When you develop Apache Spark–based applications, you might face some additional challenges when dealing with continuous integration and deployment pipelines, such as the following common issues:

  • Applications must be tested on real clusters using automation tools (live test).
  • Any user or developer must be able to easily deploy and use different versions of both the application and infrastructure to be able to debug, experiment on, and test different functionality.
  • Infrastructure needs to be evaluated and tested along with the application that uses it.

In this post, we walk you through a solution that implements a continuous integration and deployment pipeline supported by AWS services. The pipeline offers the following workflow:

  • Deploy the application to a QA stage after a commit is performed to the source code.
  • Perform a unit test using Spark local mode.
  • Deploy to a dynamically provisioned Amazon EMR cluster and test the Spark application on it.
  • Update the application as an AWS Service Catalog product version, allowing a user to deploy any version (commit) of the application on demand.

Solution overview

The following diagram shows the pipeline workflow.

The solution uses AWS CodePipeline, which allows users to orchestrate and automate the build, test, and deploy stages for application source code. The solution consists of a pipeline that contains the following stages:

  • Source: Both the Spark application source code and the AWS CloudFormation template file for deploying the application are committed to version control. In this example, we use AWS CodeCommit. For an example of the application source code, see zip.
  • Build: In this stage, you use Apache Maven both to generate the application .jar binaries and to execute all of the application unit tests that end with *Spec.scala. In this example, we use AWS CodeBuild, which runs the unit tests given that they are designed to use Spark local mode.
  • QADeploy: In this stage, the .jar file built previously is deployed using the CloudFormation template included with the application source code. All the resources are created in this stage, such as networks, EMR clusters, and so on. 
  • LiveTest: In this stage, you use Apache Maven to execute all the application tests that end with *SpecLive.scala. The tests submit EMR steps to the cluster created as part of the QADeploy stage. The tests verify that the steps ran successfully and validate their results.
  • LiveTestApproval: This stage is included in case a pipeline administrator approval is required to deploy the application to the next stages. The pipeline pauses in this stage until an administrator manually approves the release. 
  • QACleanup: In this stage, you use an AWS Lambda function to delete the CloudFormation stack deployed as part of the QADeploy stage. The function does not affect any resources other than those deployed as part of the QADeploy stage (a rough sketch of such a cleanup function follows this list).
  • DeployProduct: In this stage, you use a Lambda function that creates or updates an AWS Service Catalog product and portfolio. Every time the pipeline releases a change to the application, the AWS Service Catalog product gets a new version, with the commit of the change as the version description. 
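
As a rough illustration of the QACleanup idea (not the pipeline's actual function), a Lambda handler invoked by CodePipeline could delete the QA stack and report success back to the pipeline. The stack name below is a placeholder.

import boto3

cloudformation = boto3.client("cloudformation")
codepipeline = boto3.client("codepipeline")

# Hypothetical stack name; a real pipeline would pass it in through the event
# or a CodePipeline user parameter.
QA_STACK_NAME = "spark-app-qa-stack"

def handler(event, context):
    """Tear down the stack that QADeploy created, leaving other resources untouched."""
    cloudformation.delete_stack(StackName=QA_STACK_NAME)

    # Report success back to CodePipeline so the stage can complete.
    job_id = event["CodePipeline.job"]["id"]
    codepipeline.put_job_success_result(jobId=job_id)
    return "delete initiated"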

(more…)

Amazon QuickSight Now Supports Search, Filter Groups, and Amazon S3 Analytics Connector

Today, I’m excited to share information about some new features in Amazon QuickSight. First, you can now search for datasets, analyses, and dashboards in Amazon QuickSight using the unified search box, making it faster and easier to find and access your data. Next, you can now create filter groups with multiple filter conditions that are evaluated together using the OR operation. Finally, you can now use the built-in Amazon S3 analytics connector to visualize your S3 storage access patterns across multiple S3 buckets and configurations within a single Amazon QuickSight dashboard to optimize for cost.

Search

You can now easily and quickly find and access your datasets, analyses, and dashboards using the unified search box in Amazon QuickSight. Type in what you’re looking for and you get a list of all matches in a unified view. From there, you can take actions such as creating an analysis from a dataset, modifying a dataset, or accessing an analysis or dashboard.

(more…)

Analyzing Salesforce Data with Amazon QuickSight

Salesforce Sales Cloud is a powerful platform for managing customer data. One of the key functions that the platform provides is the ability to track customer opportunities. Opportunities in Salesforce are used to track revenue, sales pipelines, and other activities from the very first contact with a potential customer to a closed sale.

Amazon QuickSight is a rich data visualization tool that provides the ability to connect to Salesforce data and use it as a data source for creating analyses, stories, and dashboards, and to easily share them with others in the organization. This post focuses on how to connect to Salesforce as a data source and create a useful opportunity dashboard, incorporating Amazon QuickSight features like relative date filters, Key Performance Indicator (KPI) charts, and more.

Walkthrough

In this post, you walk through the following tasks:

  • Creating a new data set based on Salesforce data
  • Creating your analysis and adding visuals
  • Creating an Amazon QuickSight dashboard
  • Working with filters

Note: For this walkthrough, I am using my own Salesforce.com Developer Edition account. You can sign up for your own free developer account at https://developer.salesforce.com/.

Creating a new Amazon QuickSight data set based on Salesforce data

To start, you need to create a new Amazon QuickSight data set. Sign in to Amazon QuickSight at https://quicksight.aws using the link from the home page. Enter your Amazon QuickSight account name and choose Continue. Next, enter your email address or user name and password, then choose Sign In.

On the Amazon QuickSight start page, choose Manage Data, which takes you to a list of your data sets. Choose New Data Set, and choose Salesforce as your data source. Enter a data source name—in this example, I called mine “SFDC Opportunity.” Choose Create Data Source to open the Salesforce authentication page, where you can enter your Salesforce user name and password.

After you are authenticated to Salesforce, you are presented with a drop-down list that lets you select data from Reports or Objects. For this tutorial, choose Object. Scroll down in the list to choose the Opportunity object, and then choose Select.

(more…)

From Data Lake to Data Warehouse: Enhancing Customer 360 with Amazon Redshift Spectrum

Achieving a 360° view of your customer has become increasingly challenging as companies embrace omni-channel strategies, engaging customers across websites, mobile, call centers, social media, physical sites, and beyond. The promise of a web where online and physical worlds blend makes understanding your customers more challenging, but also more important. Businesses that are successful in this medium have a significant competitive advantage.

The big data challenge requires the management of data at high velocity and volume. Many customers have identified Amazon S3 as a great data lake solution that removes the complexities of managing a highly durable, fault tolerant data lake infrastructure at scale and economically.

AWS data services substantially lessen the heavy lifting of adopting technologies, allowing you to spend more time on what matters most—gaining a better understanding of customers to elevate your business. In this post, I show how a recent Amazon Redshift innovation, Redshift Spectrum, can enhance a customer 360 initiative.

Customer 360 solution

A successful customer 360 view benefits from using a variety of technologies to deliver different forms of insights. These could range from real-time analysis of streaming data from wearable devices and mobile interactions to historical analysis that requires interactive, on-demand queries on billions of transactions. In some cases, insights can only be inferred using AI techniques such as deep learning. Finally, the value of your customer data and insights can’t be fully realized until it is operationalized at scale—readily accessible by fleets of applications. Companies are leveraging the breadth of AWS services that cover these domains to drive their data strategy.

A number of AWS customers stream data from various sources into an S3 data lake through Amazon Kinesis. They use Kinesis and technologies in the Hadoop ecosystem like Spark running on Amazon EMR to enrich this data. High-value data is loaded into an Amazon Redshift data warehouse, which allows users to analyze and interact with data through a choice of client tools. Redshift Spectrum expands on this analytics platform by enabling Amazon Redshift to blend and analyze data beyond the data warehouse and across a data lake.

The following diagram illustrates the workflow for such a solution.

This solution delivers value by:

  • Reducing complexity and time to value to deeper insights. For instance, an existing data model in Amazon Redshift may provide insights across dimensions such as customer, geography, time, and product on metrics from sales and financial systems. Down the road, you may gain access to streaming data sources like customer-care call logs and website activity that you want to blend in with the sales data on the same dimensions to understand how web and call center experiences may be correlated with sales performance. Redshift Spectrum can join these dimensions in Amazon Redshift with data in S3 to allow you to quickly gain new insights, and avoid the slow and more expensive alternative of fully integrating these sources with your data warehouse (a sketch of such a join follows this list).
  • Providing an additional avenue for optimizing costs and performance. In cases like call logs and clickstream data where volumes could be many TBs to PBs, storing the data exclusively in S3 yields significant cost savings. Interactive analysis on massive datasets may now be economically viable in cases where data was previously analyzed periodically through static reports generated by inexpensive batch processes. In some cases, you can improve the user experience while simultaneously lowering costs. Spectrum is powered by a large-scale infrastructure external to your Amazon Redshift cluster, and excels at scanning and aggregating large volumes of data. For instance, your analysts may be performing data discovery on customer interactions across millions of consumers over years of data across various channels. On this large dataset, certain queries could be slow if you didn’t have a large Amazon Redshift cluster. Alternatively, you could use Redshift Spectrum to achieve a better user experience with a smaller cluster.
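
To make the join idea concrete, here is a hedged sketch that runs a Redshift Spectrum query from Python using psycopg2. The external schema, tables, and columns are invented for illustration; your own data model and cluster endpoint will differ.

import psycopg2

# Connection details are placeholders; use your own cluster endpoint and credentials.
conn = psycopg2.connect(
    host="my-cluster.abc123xyz.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="analytics",
    user="admin",
    password="REPLACE_ME",
)

# Join a local Amazon Redshift dimension table with clickstream data that lives
# only in S3, exposed through a hypothetical Spectrum external schema ("spectrum").
sql = """
SELECT d.customer_segment,
       COUNT(*) AS page_views
FROM   spectrum.clickstream_events e
JOIN   dim_customer d
  ON   d.customer_id = e.customer_id
GROUP  BY d.customer_segment
ORDER  BY page_views DESC;
"""

with conn, conn.cursor() as cur:
    cur.execute(sql)
    for segment, views in cur.fetchall():
        print(segment, views)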

(more…)

Analyzing AWS Cost and Usage Reports with Looker and Amazon Athena

This is a guest post by Dillon Morrison at Looker. Looker is, in their own words, “a new kind of analytics platform–letting everyone in your business make better decisions by getting reliable answers from a tool they can use.” 

As the breadth of AWS products and services continues to grow, customers are able to more easily move their technology stack and core infrastructure to AWS. One of the attractive benefits of AWS is the cost savings. Rather than paying upfront capital expenses for large on-premises systems, customers can instead pay variable expenses for on-demand services. To further reduce expenses, AWS users can reserve resources for specific periods of time, and automatically scale resources as needed.

The AWS Cost Explorer is great for aggregated reporting. However, conducting analysis on the raw data using the flexibility and power of SQL allows for much richer detail and insight, and can be the better choice for the long term. Thankfully, with the introduction of Amazon Athena, monitoring and managing these costs is now easier than ever.

In this post, I walk through setting up the data pipeline for cost and usage reports, Amazon S3, and Athena, and discuss some of the most common levers for cost savings. I surface tables through Looker, which comes with a host of pre-built data models and dashboards to make analysis of your cost and usage data simple and intuitive.

Analysis with Athena

With Athena, there’s no need to create hundreds of Excel reports, move data around, or deploy clusters to house and process data. Athena uses Apache Hive’s DDL to create tables, and the Presto querying engine to process queries. Analysis can be performed directly on raw data in S3. Conveniently, AWS exports raw cost and usage data directly into a user-specified S3 bucket, making it simple to start querying with Athena quickly. This makes continuous monitoring of costs virtually seamless, since there is no infrastructure to manage. Instead, users can leverage the power of the Athena SQL engine to easily perform ad-hoc analysis and data discovery without needing to set up a data warehouse.

After the data pipeline is established, cost and usage data (the recommended billing data, per AWS documentation) provides a plethora of comprehensive information around usage of AWS services and the associated costs. Whether you need the report segmented by product type, user identity, or region, this report can be sliced and diced any number of ways to properly allocate costs for any of your business needs. You can then drill into any specific line item to see even further detail, such as the selected operating system, tenancy, purchase option (on-demand, spot, or reserved), and so on.

Walkthrough

By default, the Cost and Usage report exports CSV files, which you can compress using gzip (recommended for performance). There are some additional configuration options for tuning performance further, which are discussed below.

Prerequisites

If you want to follow along, you need the following resources:

Enable the cost and usage reports

First, enable the Cost and Usage report. For Time unit, select Hourly. For Include, select Resource IDs. You are prompted for all of these options in the report-creation window.

The Cost and Usage report delivers CSV files to the specified S3 bucket. Note that it can take up to 24 hours for the first file to be delivered after you enable the report.

(more…)

Harmonize, Query, and Visualize Data from Various Providers using AWS Glue, Amazon Athena, and Amazon QuickSight

Have you ever been faced with many different data sources in different formats that need to be analyzed together to drive value and insights? You need to be able to query, analyze, process, and visualize all your data as one canonical dataset, regardless of the data source or original format.

In this post, I walk through using AWS Glue to create a query optimized, canonical dataset on Amazon S3 from three different datasets and formats. Then, I use Amazon Athena and Amazon QuickSight to query against that data quickly and easily.

AWS Glue overview

AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it easy for customers to prepare and load their data for query and analytics. You can create and run an ETL job with a few clicks in the AWS Management Console. Point AWS Glue to your data stored on AWS, and a crawler discovers your data, classifies it, and stores the associated metadata (such as table definitions) in the AWS Glue Data Catalog. After it’s cataloged, your data is immediately searchable, queryable, and available for ETL. AWS Glue generates the ETL code for data transformation, and loads the transformed data into a target data store for analytics.
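
Crawlers can also be created and started programmatically. The boto3 sketch below shows the general shape; the crawler name, IAM role, database, and S3 path are placeholders rather than values from this walkthrough.

import boto3

glue = boto3.client("glue")

# All names and paths below are hypothetical placeholders.
glue.create_crawler(
    Name="nyc-taxi-crawler",
    Role="AWSGlueServiceRoleDefault",
    DatabaseName="nyc_taxi_db",
    Targets={"S3Targets": [{"Path": "s3://my-raw-data-bucket/nyc-taxi/"}]},
)

# Run the crawler; it classifies the files and writes table definitions
# to the AWS Glue Data Catalog.
glue.start_crawler(Name="nyc-taxi-crawler")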

The AWS Glue ETL engine generates Python code that is entirely customizable, reusable, and portable. You can edit the code using your favorite IDE or notebook and share it with others using GitHub. After your ETL job is ready, you can schedule it to run on the AWS Glue fully managed, scale-out Spark environment, using its flexible scheduler with dependency resolution, job monitoring, and alerting.

AWS Glue is serverless. It automatically provisions the environment needed to complete the job, and customers pay only for the compute resources consumed while running ETL jobs. With AWS Glue, data can be available for analytics in minutes.

Walkthrough

In this post, I step through an example using the New York City Taxi Records dataset. I focus on one month of data, but you could easily do this for the entire eight years of data. At the time of this post, AWS Glue is available in US-East-1 (N. Virginia).

As you crawl the unknown dataset, you discover that the data is in different formats, depending on the type of taxi. You then convert the data to a canonical form, start to analyze it, and build a set of visualizations… all without launching a single server.

(more…)

Upsert into Amazon Redshift using AWS Glue and SneaQL

This is a guest post by Jeremy Winters and Ritu Mishra, Solution Architects at Full 360. In their own words, “Full 360 is a cloud first, cloud native integrator, and true believers in the cloud. Since inception in 2007, our focus has been on helping customers with their journey into the cloud. Our practice areas – Big Data and Warehousing, Application Modernization, and Cloud Ops/Strategy – represent deep, but focused expertise.”

AWS Glue is a fully managed ETL (extract, transform, and load) service that makes it simple and cost-effective to categorize your data, clean it, enrich it, and move it reliably between various data stores. As a company that has been building data warehouse solutions in the cloud for 10 years, we at Full 360 were interested to see how we could leverage AWS Glue in customer solutions. This post details our experience and lessons learned from using AWS Glue for an Amazon Redshift data integration use case.

UI-based ETL Tools

We have been anticipating the release of AWS Glue since it was announced at re:Invent 2016. Many of our customers are looking for easy-to-use, UI-based tooling to manage their data transformation pipelines. However, in our experience, the complexity of any production pipeline tends to be difficult to unwind, regardless of the technology used to create it. At Full 360, we build cloud-native, script-based applications deployed in containers to handle data integration. We think script-based transformation provides the balance of robustness and flexibility necessary to handle any data problem that comes our way.

AWS Glue caters both to developers who prefer writing scripts and to those who want UI-based interactions. It is possible to initially create jobs using the UI by selecting a data source and target. Under the hood, AWS Glue auto-generates the Python code for you, which can be edited if needed, though this isn’t necessary for the majority of use cases.

Of course, you don’t have to rely on the UI at all. You can simply write your own Python scripts, store them in Amazon S3 with any dependent libraries, and import them into AWS Glue. AWS Glue also supports IDE integration with tools such as PyCharm, and interestingly enough, Zeppelin notebooks! These integrations are targeted toward developers who prefer writing Python themselves and want a cleaner dev/test cycle.

Developers who are already in the business of scripting ETL will be excited by the ability to manage Python scripts with AWS Glue using their existing workflows for source control and CI/CD, and to have those scripts deployed and executed in a fully managed manner by AWS. The UI experience of AWS Glue works well, but it is good to know that the tool accommodates those who prefer traditional coding. You can also dig into complex edge cases for data handling where the UI doesn’t cut it.

Serverless!

When AWS Lambda was released, we were excited about its potential to host our ETL processes. With Lambda, however, you are limited to a five-minute maximum for function execution, so we resorted to running Docker containers on Amazon ECS to orchestrate many of our customers’ long-running tasks. With this approach, we are still required to manage the underlying infrastructure.

After a closer look at AWS Glue, we realized that it is a fully serverless PySpark runtime, accompanied by an Apache Hive metastore-compatible catalog-as-a-service. This means that you are not just running a script on a single core; instead, you have the full power of a multi-worker Spark environment available. If you’re not familiar with the Spark framework, it introduces a new paradigm that allows for the processing of distributed, in-memory datasets. Spark has many uses, from data transformation to machine learning.

In AWS Glue, you use PySpark dataframes to transform data before reaching your database. Dataframes manage data in a way similar to relational databases, so the methods are likely to be familiar to most SQL users. Additionally, you can use SQL in your PySpark code to manipulate the data.
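
Here is a small, self-contained PySpark sketch of that pattern, mixing the DataFrame API with SQL over a temporary view; the S3 path and column names are assumptions for illustration, and inside a Glue job you would obtain the session from the GlueContext instead.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("upsert-prep").getOrCreate()

# Read staged JSON records from S3 (the path is a placeholder).
df = spark.read.json("s3://my-staging-bucket/beer-drinkers/")

# DataFrame API: filter and rename before loading into Amazon Redshift.
cleaned = (
    df.filter(df["purchase_amount"] > 0)
      .withColumnRenamed("purchase_amount", "amount_usd")
)

# The same logic can also be expressed in SQL against a temporary view.
cleaned.createOrReplaceTempView("purchases")
summary = spark.sql(
    "SELECT customer_id, SUM(amount_usd) AS total_spend "
    "FROM purchases GROUP BY customer_id"
)
summary.show(10)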

AWS Glue also simplifies the management of the runtime environment by providing a DPU setting, which allows you to dial up or down the amount of compute resources used to run your job. One DPU is equivalent to 4 vCPUs and 16 GB of memory.

Common use case

We can see that most customers would leverage AWS Glue to load one or many files from S3 into Amazon Redshift. To accelerate this process, you can use the crawler, an AWS console-based utility, to discover the schema of your data and store it in the AWS Glue Data Catalog, whether your data sits in a file or a database. We were able to discover the schemas of our source file and target table, then have AWS Glue construct and execute the ETL job. It worked! We successfully loaded the beer drinkers’ dataset, JSON files used in our advanced tuning class, into Amazon Redshift!

Our use case

For our use case, we integrated AWS Glue and Amazon Redshift Spectrum with an open-source project that we initiated called SneaQL. SneaQL is an open source, containerized framework for sneaking interactivity into static SQL. SneaQL provides an extension to ANSI SQL with command tags to provide functionality such as loops, variables, conditional execution, and error handling to your scripts. It allows you to manage your scripts in an AWS CodeCommit Git repo, which then gets deployed and executed in a container.

We use SneaQL for complex data integrations, usually with an ELT model, where the data is loaded into the database, then transformed into fact tables using parameterized SQL. SneaQL enables advanced use cases like partial upsert aggregation of data, where multiple data sources can merge into the same fact table.

We think AWS Glue, Redshift Spectrum, and SneaQL offer a compelling way to build a data lake in S3, with all of your metadata accessible through a variety of tools such as Hive, Presto, Spark, and Redshift Spectrum. You can then build your aggregation tables in Amazon Redshift to drive your dashboards or other high-performance analytics.

In the video below, you see a demonstration of using AWS Glue to convert JSON documents into Parquet for partitioned storage in an S3 data lake. We then access the data in S3 from Amazon Redshift by way of Redshift Spectrum. Near the end of the AWS Glue job, we call the AWS boto3 SDK to trigger an Amazon ECS SneaQL task that performs an upsert of the data into our fact table. All the sample artifacts needed for this demonstration are available in the Full360/Sneaql GitHub repository.
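
For reference, triggering an ECS task from a Python script generally looks like the sketch below; the cluster, task definition, container name, and environment variable are hypothetical stand-ins for the SneaQL setup described above.

import boto3

ecs = boto3.client("ecs")

# Names are placeholders; the real values come from your ECS setup and
# the SneaQL container's task definition.
response = ecs.run_task(
    cluster="sneaql-cluster",
    taskDefinition="sneaql-upsert-task",
    count=1,
    overrides={
        "containerOverrides": [
            {
                "name": "sneaql",
                "environment": [
                    {"name": "SNEAQL_STEP", "value": "fact_table_upsert"}
                ],
            }
        ]
    },
)
print(response["tasks"][0]["taskArn"])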

(more…)