AWS Big Data Blog

Category: AWS Big Data

Extract multidimensional data from Microsoft SQL Server Analysis Services using AWS Glue

AWS Glue is a fully managed service that makes it easier for you to extract, transform, and load (ETL) data for analytics. You can easily create ETL jobs to connect to backend data sources. There are several natively supported data sources, but what if you need to extract data from an unsupported data source? What if […]

Read More

Effective data lakes using AWS Lake Formation, Part 2: Creating a governed table for streaming data sources

We announced the preview of AWS Lake Formation transactions, row-level security, and acceleration at AWS re:Invent 2020. In Part 1 of this series, we explained how to set up a governed table and add objects to it. In this post, we expand on this example, and demonstrate how to ingest streaming data into governed tables using Lake Formation transactions. […]

Read More

Amazon EMR 6.2.0 adds persistent HFile tracking to improve performance with HBase on Amazon S3

Apache HBase is an open-source NoSQL database that you can use to achieve low-latency random access to billions of rows. Starting with Amazon EMR 5.2.0, you can enable HBase on Amazon Simple Storage Service (Amazon S3). With HBase on Amazon S3, the HBase data files (HFiles) are written to Amazon S3, enabling data lake […]

Read More

Automate dynamic mapping and renaming of column names in data files using AWS Glue: Part 2

In Part 1 of this two-part post, we looked at how we can create an AWS Glue ETL job that is agnostic enough to rename columns of a data file by mapping to column names of another file. The solution focused on using a single file that was populated in the AWS Glue Data Catalog […]

Read More

Automate dynamic mapping and renaming of column names in data files using AWS Glue: Part 1

A common challenge ETL and big data developers face is working with data files that don’t have proper name header records. They’re tasked with renaming the columns of the data files appropriately so that downstream applications and mappings for data load can work seamlessly. One example use case is while working with ORC files and […]

Read More

Top 9 performance tuning tips for PrestoDB on Amazon EMR

Presto is a popular distributed SQL query engine for interactive data analytics. With its massively parallel processing (MPP) architecture, it’s capable of directly querying large datasets without the need for time-consuming and costly ETL processes. With a properly tuned Presto cluster, you can run fast queries against big data with response times ranging from subsecond […]

Read More

Ingest Salesforce data into Amazon S3 using the CData JDBC custom connector with AWS Glue

Organizations that successfully generate business value from their data will outperform their peers. Many AWS customers require a data storage and analytics solution that combines the prospect information stored in Salesforce, a popular and widely used customer relationship management (CRM) platform, with other structured and unstructured data in their data lake to innovate and build […]

Read More

Effective data lakes using AWS Lake Formation, Part 1: Getting started with governed tables

Thousands of customers are building their data lakes on Amazon Simple Storage Service (Amazon S3). You can use AWS Lake Formation to build your data lakes easily—in a matter of days as opposed to months. However, there are still some difficult challenges to address with your data lakes: Supporting streaming updates and deletes in your data […]

Read More

Run usage analytics on Amazon QuickSight using AWS CloudTrail

Amazon QuickSight is a cloud-native BI service that allows end users to create and publish dashboards in minutes, without provisioning any servers or requiring complex licensing. You can view these dashboards on the QuickSight product console or embed them into applications and websites. After you deploy a dashboard, it’s important to assess how they and […]

Read More

Retaining data streams up to one year with Amazon Kinesis Data Streams

Streaming data is used extensively for use cases like sharing data between applications, streaming ETL (extract, transform, and load), real-time analytics, processing data from internet of things (IoT) devices, application monitoring, fraud detection, live leaderboards, and more. Typically, data streams are stored for short durations of time before being loaded into a permanent data store […]

Read More