AWS Big Data Blog

Category: AWS Big Data

Amazon EMR launches support for Amazon EC2 M6A, R6A instances to improve cost performance for Spark workloads by 15–50% 

Amazon EMR provides a managed service to easily run analytics applications using open-source frameworks such as Apache Spark, Hive, Presto, Trino, HBase, and Flink. The Amazon EMR runtime for Spark and Presto includes optimizations that provide over 2x performance improvements over open-source Apache Spark and Presto. With Amazon EMR release 6.8, you can now use […]

Gain visibility into your Amazon MSK cluster by deploying the Conduktor Platform

This is a guest post by AWS Data Hero and co-founder of Conduktor, Stephane Maarek. Deploying Apache Kafka on AWS is now easier, thanks to Amazon Managed Streaming for Apache Kafka (Amazon MSK). In a few clicks, it provides you with a production-ready Kafka cluster on which you can run your applications and create data […]

Introducing the Cloud Shuffle Storage Plugin for Apache Spark

AWS Glue is a serverless data integration service that makes it easy to discover, prepare, and combine data for analytics, machine learning (ML), and application development. In AWS Glue, you can use Apache Spark, an open-source, distributed processing system for your data integration tasks and big data workloads. Apache Spark utilizes in-memory caching and optimized […]

Build your Apache Hudi data lake on AWS using Amazon EMR – Part 1

Apache Hudi is an open-source transactional data lake framework that greatly simplifies incremental data processing and data pipeline development. It does this by bringing core warehouse and database functionality directly to a data lake on Amazon Simple Storage Service (Amazon S3) or Apache HDFS. Hudi provides table management, instantaneous views, efficient upserts/deletes, advanced indexes, streaming […]

How Etleap and Amazon Redshift Serverless optimize costs for ETL

Amazon Redshift Serverless lets you avoid managing infrastructure while only paying for what you use. Etleap provides data integration software that is natively built on AWS. It’s an AWS Advanced Technology Partner with the AWS Data & Analytics Competency and Amazon Redshift Service Ready designation. In this post, we share how you can minimize the […]

How GoDaddy built a data mesh to decentralize data ownership using AWS Lake Formation

This is a guest post co-written with Ankit Jhalaria from GoDaddy. GoDaddy is empowering everyday entrepreneurs by providing all the help and tools to succeed online. With more than 20 million customers worldwide, GoDaddy is the place people come to name their idea, build a professional website, attract customers, and manage their work. GoDaddy is […]

Get started with data integration from Amazon S3 to Amazon Redshift using AWS Glue interactive sessions

Organizations are placing a high priority on data integration, especially to support analytics, machine learning (ML), business intelligence (BI), and application development initiatives. Data is growing exponentially and is generated by increasingly diverse data sources. Data integration becomes challenging when processing data at scale and the inherent heavy lifting associated with infrastructure required to manage […]

Share and publish your Snowflake data to AWS Data Exchange using Amazon Redshift data sharing

Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. You can start with just a few hundred gigabytes of data and scale to a petabyte or more. Today, tens of thousands of AWS customers—from Fortune 500 companies, startups, and everything in between—use Amazon Redshift to run mission-critical business intelligence (BI) dashboards, […]

Use an event-driven architecture to build a data mesh on AWS

In this post, we take the data mesh design discussed in Design a data mesh architecture using AWS Lake Formation and AWS Glue, and demonstrate how to initialize data domain accounts to enable managed sharing; we also go through how we can use an event-driven approach to automate processes between the central governance account and […]

How Hudl built a cost-optimized AWS Glue pipeline with Apache Hudi datasets

This is a guest blog post co-written with Addison Higley and Ramzi Yassine from Hudl. Hudl Agile Sports Technologies, Inc. is a Lincoln, Nebraska based company that provides tools for coaches and athletes to review game footage and improve individual and team play. Its initial product line served college and professional American football teams. Today, […]