AWS Big Data Blog

Automate replication of relational sources into a transactional data lake with Apache Iceberg and AWS Glue

Organizations have chosen to build data lakes on top of Amazon Simple Storage Service (Amazon S3) for many years. A data lake is the most popular choice for organizations to store all their data generated by different teams, across business domains, in all different formats, and even across its full history. According to a study, the […]

Monitor Apache HBase on Amazon EMR using Amazon Managed Service for Prometheus and Amazon Managed Grafana

Amazon EMR provides a managed Apache Hadoop framework that makes it straightforward, fast, and cost-effective to run Apache HBase. Apache HBase is a massively scalable, distributed big data store in the Apache Hadoop ecosystem. It is an open-source, non-relational, versioned database that runs on top of the Apache Hadoop Distributed File System (HDFS). It’s built […]

Chargeback Gurus empowers eCommerce merchants with advanced chargeback intelligence to recover millions using Amazon QuickSight

This is a guest post by Suresh Dakshina and Damodharan Sampathkumar from Chargeback Gurus. Chargeback Gurus, a global financial technology company, helps businesses fight, prevent, and win chargebacks. To date, we have helped businesses worldwide recover over $2 billion in lost revenue. As trusted advisors to card networks and Fortune 500 companies, we are known […]

How OLX Group migrated to Amazon Redshift RA3 for simpler, faster, and more cost-effective analytics

This is a guest post by Miguel Chin, Data Engineering Manager at OLX Group and David Greenshtein, Specialist Solutions Architect for Analytics, AWS. OLX Group is one of the world’s fastest-growing networks of online marketplaces, operating in over 30 countries around the world. We help people buy and sell cars, find housing, get jobs, buy […]

Synchronize your Salesforce and Snowflake data to speed up your time to insight with Amazon AppFlow

This post was co-written with Amit Shah, Principal Consultant at Atos. Customers across industries seek meaningful insights from the data captured in their Customer Relationship Management (CRM) systems. To achieve this, they combine their CRM data with a wealth of information already available in their data warehouse, enterprise systems, or other software as a service […]

Use fuzzy string matching to approximate duplicate records in Amazon Redshift

It’s common to ingest multiple data sources into Amazon Redshift to perform analytics. Often, each data source has its own processes for creating and maintaining data, which can lead to data quality challenges within and across sources. One challenge you may face when performing analytics is the presence of imperfect duplicate records within the source data. This post presents one possible approach to addressing this challenge in an Amazon Redshift data warehouse using fuzzy matching.
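As a rough illustration of the idea outside Redshift itself, the sketch below uses Python’s standard-library difflib to flag likely duplicate pairs. The sample records, column names, and threshold are hypothetical; the post’s actual approach applies fuzzy matching within the Redshift warehouse.

```python
from difflib import SequenceMatcher
from itertools import combinations

def similarity(a: str, b: str) -> float:
    """Return a similarity ratio in [0, 1] after light normalization."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

# Hypothetical source records: (record_id, company_name)
records = [
    ("1", "Acme Corporation"),
    ("2", "Acme Corporation Inc."),
    ("3", "Globex Inc"),
]

THRESHOLD = 0.8  # illustrative cutoff; tune against known duplicate pairs
for (id_a, name_a), (id_b, name_b) in combinations(records, 2):
    score = similarity(name_a, name_b)
    if score >= THRESHOLD:
        print(f"Possible duplicates: {id_a} ~ {id_b} (score {score:.2f})")
```

Comparing every pair is quadratic in the number of records, so in practice a blocking key (such as a normalized prefix or zip code) is usually used to limit which pairs get scored.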

Extract data from SAP ERP using AWS Glue and the SAP SDK

This is a guest post by Siva Manickam and Prahalathan M from Vyaire Medical Inc. Vyaire Medical Inc. is a global company, headquartered in suburban Chicago, focused exclusively on supporting breathing through every stage of life. Established from legacy brands with a 65-year history of pioneering breathing technology, the company’s portfolio of integrated solutions is […]

Automate schema evolution at scale with Apache Hudi in AWS Glue

In the data analytics space, organizations often deal with many tables in different databases and file formats that hold data for different business functions. Business needs often drive changes to table structure, such as schema evolution (adding new columns, removing existing columns, renaming columns, and so on) for some of these tables […]
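As a minimal sketch of what schema evolution looks like with Apache Hudi on Spark (the table name, keys, and S3 path below are illustrative, not taken from the post): a later write whose DataFrame carries an extra column causes Hudi to evolve the table schema to include it.

```python
from pyspark.sql import SparkSession

# Requires the Hudi Spark bundle on the classpath (for example, an AWS Glue
# job with Hudi enabled). All names and paths here are hypothetical.
spark = (SparkSession.builder
         .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
         .getOrCreate())

hudi_options = {
    "hoodie.table.name": "customers",
    "hoodie.datasource.write.recordkey.field": "customer_id",
    "hoodie.datasource.write.precombine.field": "updated_at",
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.keygenerator.class":
        "org.apache.hudi.keygen.NonpartitionedKeyGenerator",
}
path = "s3://my-bucket/hudi/customers/"  # hypothetical location

# Initial write: schema is (customer_id, name, updated_at)
df_v1 = spark.createDataFrame(
    [(1, "Alice", "2023-01-01")], ["customer_id", "name", "updated_at"])
df_v1.write.format("hudi").options(**hudi_options).mode("overwrite").save(path)

# Later write adds an email column; Hudi evolves the table schema, and
# earlier rows read back with email = null.
df_v2 = spark.createDataFrame(
    [(2, "Bob", "2023-02-01", "bob@example.com")],
    ["customer_id", "name", "updated_at", "email"])
df_v2.write.format("hudi").options(**hudi_options).mode("append").save(path)
```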

Deep dive into the AWS ProServe Hadoop Migration Delivery Kit TCO tool

In the post Introducing the AWS ProServe Hadoop Migration Delivery Kit TCO tool, we introduced the AWS ProServe Hadoop Migration Delivery Kit (HMDK) TCO tool and the benefits of migrating on-premises Hadoop workloads to Amazon EMR. In this post, we dive deep into the tool, walking through every step: log ingestion, transformation, visualization, and […]

Introducing the AWS ProServe Hadoop Migration Delivery Kit TCO tool

When migrating Hadoop workloads to Amazon EMR, it’s often difficult to identify the optimal cluster configuration without analyzing existing workloads by hand. To solve this, we’re introducing the Hadoop migration assessment Total Cost of Ownership (TCO) tool, now available within the AWS ProServe Hadoop Migration Delivery Kit (HMDK). […]