AWS Big Data Blog

Category: Analytics

Optimize Amazon EMR costs for legacy and Spark workloads

December 2023: This post was reviewed and updated for accuracy. Customers migrating from large on-premises Hadoop clusters to Amazon EMR want to reduce their operational costs while running resilient applications. On-premises customers typically use inelastic, large, fixed-size Hadoop clusters, which incur high capital expenditure. You can now migrate your mixed workloads to Amazon EMR, which […]

Identify source schema changes using AWS Glue

In today’s world, organizations are collecting an unprecedented amount of data from all kinds of different data sources, such as transactional data stores, clickstreams, log data, IoT data, and more. This data is often in different formats, such as structured data or unstructured data, and is usually referred to as the three Vs of big […]

Run Apache Spark with Amazon EMR on EKS backed by Amazon FSx for Lustre storage

September 2023: This post was reviewed and updated for accuracy to reflect recent improvements and changes. Traditionally, Spark workloads have been run on a dedicated setup like a Hadoop stack with YARN or Mesos as a resource manager. Starting from Apache Spark 2.3, Spark added support for Kubernetes as a resource manager. The new Kubernetes […]

Choose the k-NN algorithm for your billion-scale use case with OpenSearch

February 2023: This post was reviewed and updated for accuracy of the code. When organizations set out to build machine learning (ML) applications such as natural language processing (NLP) systems, recommendation engines, or search-based systems, k-Nearest Neighbor (k-NN) search is often used at some point in the workflow. As the number of data […]

Fine-grained entitlements in Amazon Redshift: A case study from TrustLogix

This post is co-written with Srikanth Sallaka from TrustLogix as the lead author. TrustLogix is a cloud data access governance platform that monitors data usage to discover patterns, provide insights on least privileged access controls, and manage fine-grained data entitlements across data lake storage solutions like Amazon Simple Storage Service (Amazon S3), data warehouses like […]

Amazon migrates financial reporting to Amazon QuickSight

This is a guest post by Chitradeep Barman and Yaniv Ackerman from Amazon Finance Technology (FinTech). Amazon Finance Technology (FinTech) is responsible for financial reporting on Earth’s largest transaction dataset, as the central organization supporting accounting and tax operations across Amazon. Amazon FinTech’s accounting, tax, and business finance teams close books and file taxes […]

New additions to line charts in Amazon QuickSight

Amazon QuickSight is a fully managed, cloud-native business intelligence (BI) service that makes it easy to create and deliver insights to everyone in your organization, or even to your customers and partners. You can make your data come to life with rich interactive charts and create beautiful dashboards to be shared with thousands of users, either […]

Integrate AWS IAM Identity Center (successor to AWS Single Sign-On) with AWS Lake Formation fine-grained access controls

A data lake is a centralized repository for storing structured and unstructured data at scale. Data lakes enable you to create dashboards, perform big data processing and real-time analytics, and create machine learning (ML) models on your data to drive business decisions. Many customers are choosing AWS Lake Formation as their data lake management solution. Lake […]

Implement a highly available key distribution center for Amazon EMR

High availability (HA) is the property of a system or service to operate continuously without failing for a designated period of time. Implementing HA properties over a system allows you to eliminate single points of failure that usually translate to service disruptions, which can then lead to a business loss or the inability to use […]

Create single output files for recipe jobs using AWS Glue DataBrew

July 2023: This post was reviewed for accuracy. AWS Glue DataBrew offers over 350 pre-built transformations to automate data preparation tasks (such as filtering anomalies, standardizing formats, and correcting invalid values) that would otherwise require days or weeks of writing hand-coded transformations. You can now choose single or multiple output files instead of autogenerated files for […]