AWS Big Data Blog
Tag: Data Lake
Visualize Big Data with Amazon QuickSight, Presto, and Apache Spark on Amazon EMR
Last December, we introduced the Amazon Athena connector in Amazon QuickSight, in the Derive Insights from IoT in Minutes using AWS IoT, Amazon Kinesis Firehose, Amazon Athena, and Amazon QuickSight post. The connector allows you to visualize your big data easily in Amazon S3 using Athena’s interactive query engine in a serverless fashion. This turned […]
Top 10 Performance Tuning Tips for Amazon Athena
May 2022: This post was reviewed and updated with more details like using EXPLAIN ANALYZE, updated compression, ORDER BY and JOIN tips, using partition indexing, updated stats (with performance improvements), added bonus tips. Amazon Athena is an interactive query service that makes it easy to analyze data stored in Amazon Simple Storage Service (Amazon S3) […]
Derive Insights from IoT in Minutes using AWS IoT, Amazon Kinesis Firehose, Amazon Athena, and Amazon QuickSight
Ben Snively is a Solutions Architect with AWS Speed and agility are essential with today’s analytics tools. The quicker you can get from idea to first results, the more you can experiment and innovate with your data, perform ad-hoc analysis, and drive answers to new business questions. Serverless architectures help in this respect by taking […]
Introducing the Data Lake Solution on AWS
NOTE: The solution in this post is in the process of being updated. For the most current information, please visit the What is a data lake? page. This blog post has been translated into Japanese. Many of our customers choose to build their data lake on AWS. They find the flexible, pay-as-you-go, cloud model is […]
Optimize Amazon S3 for High Concurrency in Distributed Workloads
In today’s blog post, I will discuss how to optimize Amazon S3 for an architecture commonly used to enable genomic data analyses. This optimization is important to my work in genomics because, as genome sequencing continues to drop in price, the rate at which data becomes available is accelerating.
Data Lake Ingestion: Automatically Partition Hive External Tables with AWS
In this post, I introduce a simple data ingestion and preparation framework based on AWS Lambda, Amazon DynamoDB, and Apache Hive on EMR for data from different sources landing in S3. This solution lets Hive pick up new partitions as data is loaded into S3 because Hive by itself cannot detect new partitions as data lands.
Using Spark SQL for ETL
Ben Snively is a Solutions Architect with AWS With big data, you deal with many different formats and large volumes of data. SQL-style queries have been around for nearly four decades. Many systems support SQL-style syntax on top of the data layers, and the Hadoop/Spark ecosystem is no exception. This allows companies to try new […]