AWS Big Data Blog

Tag: Data Lake

Visualize Big Data with Amazon QuickSight, Presto, and Apache Spark on Amazon EMR

Last December, we introduced the Amazon Athena connector in Amazon QuickSight, in the Derive Insights from IoT in Minutes using AWS IoT, Amazon Kinesis Firehose, Amazon Athena, and Amazon QuickSight post. The connector allows you to visualize your big data easily in Amazon S3 using Athena’s interactive query engine in a serverless fashion. This turned […]

Read More

Derive Insights from IoT in Minutes using AWS IoT, Amazon Kinesis Firehose, Amazon Athena, and Amazon QuickSight

Ben Snively is a Solutions Architect with AWS Speed and agility are essential with today’s analytics tools. The quicker you can get from idea to first results, the more you can experiment and innovate with your data, perform ad-hoc analysis, and drive answers to new business questions. Serverless architectures help in this respect by taking […]

Read More

Introducing the Data Lake Solution on AWS

NOTE: The solution in this post is in the process of being updated. For the most current information, please visit the What is a data lake? page. This blog post has been translated into Japanese. Many of our customers choose to build their data lake on AWS. They find the flexible, pay-as-you-go, cloud model is […]

Read More

Optimize Amazon S3 for High Concurrency in Distributed Workloads

In today’s blog post, I will discuss how to optimize Amazon S3 for an architecture commonly used to enable genomic data analyses. This optimization is important to my work in genomics because, as genome sequencing continues to drop in price, the rate at which data becomes available is accelerating.

Read More

Data Lake Ingestion: Automatically Partition Hive External Tables with AWS

In this post, I introduce a simple data ingestion and preparation framework based on AWS Lambda, Amazon DynamoDB, and Apache Hive on EMR for data from different sources landing in S3. This solution lets Hive pick up new partitions as data is loaded into S3 because Hive by itself cannot detect new partitions as data lands.

Read More

Using Spark SQL for ETL

Ben Snively is a Solutions Architect with AWS With big data, you deal with many different formats and large volumes of data. SQL-style queries have been around for nearly four decades. Many systems support SQL-style syntax on top of the data layers, and the Hadoop/Spark ecosystem is no exception. This allows companies to try new […]

Read More