AWS Big Data Blog

Extract multidimensional data from Microsoft SQL Server Analysis Services using AWS Glue

AWS Glue is a fully managed service that makes it easier for you to extract, transform, and load (ETL) data for analytics. You can easily create ETL jobs to connect to backend data sources. There are several natively supported data sources, but what if you need to extract data from an unsupported data source? What if […]

Read More

Effective data lakes using AWS Lake Formation, Part 2: Creating a governed table for streaming data sources

We announced the preview of AWS Lake Formation transactions, row-level security, and acceleration at AWS re:Invent 2020. In Part 1 of this series, we explained how to set up a governed table and add objects to it. In this post, we expand on this example, and demonstrate how to ingest streaming data into governed tables using Lake Formation transactions. […]

Read More

Amazon EMR 6.2.0 adds persistent HFile tracking to improve performance with HBase on Amazon S3

Apache HBase is an open-source NoSQL database that you can use to achieve low-latency random access to billions of rows. Starting with Amazon EMR 5.2.0, you can enable HBase on Amazon Simple Storage Service (Amazon S3). With HBase on Amazon S3, the HBase data files (HFiles) are written to Amazon S3, enabling data lake […]

Read More

Easily ingest and analyze Google Analytics data with Upsolver and Amazon AppFlow

This post is co-written by Mei Long at Upsolver. Software as a service (SaaS) applications are in demand today, and customers have a growing need to adopt many of them in their use cases. As adoption grows, extracting data from these various SaaS applications and running analytics across them gets complicated. Although there are several […]

Read More

Best Western slashes analytics costs, improves operations worldwide using Amazon QuickSight

This is a guest blog post by Best Western Hotels and Resorts. In their own words, “Best Western Hotels & Resorts is an award-winning global network of hotels located in over 100 countries and territories that offers accommodations for all types of travelers.” With 18 brands and varied ownership structures across geographies, Best Western Hotel […]

Read More

How 1Strategy simplified their spreadsheet ETL process using AWS Glue DataBrew

This is a guest blog post by Pat Reilly and Gary Houk at 1Strategy. In their own words, “1Strategy is an APN Premier Consulting Partner focusing exclusively on AWS solutions. 1Strategy consultants help businesses architect, migrate, and optimize their workloads on AWS, creating scalable, cost-effective, secure, and reliable solutions. 1Strategy holds the AWS DevOps, Migration, […]

Read More

Automate dynamic mapping and renaming of column names in data files using AWS Glue: Part 2

In Part 1 of this two-part post, we looked at how we can create an AWS Glue ETL job that is agnostic enough to rename columns of a data file by mapping to column names of another file. The solution focused on using a single file that was populated in the AWS Glue Data Catalog […]

Read More

Automate dynamic mapping and renaming of column names in data files using AWS Glue: Part 1

A common challenge ETL and big data developers face is working with data files that don’t have proper name header records. They’re tasked with renaming the columns of the data files appropriately so that downstream applications and mappings for data load can work seamlessly. One example use case is while working with ORC files and […]

Read More

New features from Apache Hudi available in Amazon EMR

Apache Hudi is an open-source data management framework used to simplify incremental data processing and data pipeline development by providing record-level insert, update, and delete capabilities. This record-level capability is helpful if you’re building your data lakes on Amazon S3 or HDFS. You can use it to comply with data privacy regulations and simplify data […]

Read More

Top 9 performance tuning tips for PrestoDB on Amazon EMR

Presto is a popular distributed SQL query engine for interactive data analytics. With its massively parallel processing (MPP) architecture, it’s capable of directly querying large datasets without the need for time-consuming and costly ETL processes. With a properly tuned Presto cluster, you can run fast queries against big data with response times ranging from subsecond […]

Read More