AWS Big Data Blog

Category: Analytics

Power highly resilient use cases with Amazon Redshift

Amazon Redshift is the most popular and fastest cloud data warehouse, offering seamless integration with your data lake and other data sources, up to three times faster performance than any other cloud data warehouse, automated maintenance, separation of storage and compute, and up to 75% lower cost than any other cloud data warehouse. This post […]

Enrich datasets for descriptive analytics with AWS Glue DataBrew

Data analytics remains a constantly hot topic. More and more businesses are beginning to understand the potential their data has to allow them to serve customers more effectively and give them a competitive advantage. However, for many small to medium businesses, gaining insight from their data can be challenging because they often lack in-house data […]

Automating Index State Management for Amazon OpenSearch Service

When it comes to time-series data, it’s more common to access new data than existing data, such as the last four hours or one day. Often, application teams must maintain multiple indexes for diverse data workloads, which bring new requirements to set up a custom solution to manage the index lifecycles. This becomes tedious as […]

Build a modern data architecture on AWS with Amazon AppFlow, AWS Lake Formation, and Amazon Redshift

This is a guest post written by Dr. Yannick Misteli, lead cloud platform and ML engineering in global product strategy (GPS) at Roche. Recently the Roche Data Insights (RDI) initiative was launched to achieve our vision using new ways of working and collaboration in order to build shared, interoperable data & insights with federated governance. […]

New features from Apache Hudi 0.7.0 and 0.8.0 available on Amazon EMR

Apache Hudi is an open-source transactional data lake framework that greatly simplifies incremental data processing and data pipeline development by providing record-level insert, update, and delete capabilities. This record-level capability is helpful if you’re building your data lakes on Amazon Simple Storage Service (Amazon S3) or Hadoop Distributed File System (HDFS). You can use it […]

Query cross-account AWS Glue Data Catalogs using Amazon Athena

Many AWS customers rely on a multi-account strategy to scale their organization and better manage their data lake across different projects or lines of business. The AWS Glue Data Catalog contains references to data used as sources and targets of your extract, transform, and load (ETL) jobs in AWS Glue. Using a centralized Data Catalog […]

Ibotta builds a self-service data lake with AWS Glue

This is a guest post co-written by Erik Franco at Ibotta. Ibotta is a free cash back rewards and payments app that gives consumers real cash for everyday purchases when they shop and pay through the app. Ibotta provides thousands of ways for consumers to earn cash on their purchases by partnering with more than […]

Introducing new features for Amazon Redshift COPY: Part 1

Amazon Redshift is a fast, scalable, secure, and fully managed cloud data warehouse that makes it simple and cost-effective to analyze all your data using standard SQL. Amazon Redshift offers up to three times better price performance than any other cloud data warehouse. Tens of thousands of customers use Amazon Redshift to process exabytes of […]

How Goldman Sachs built persona tagging using Apache Flink on Amazon EMR

The Global Investment Research (GIR) division at Goldman Sachs is responsible for providing research and insights to the firm’s clients in the equity, fixed income, currency, and commodities markets. One of the long-standing goals of the GIR team is to deliver a personalized experience and relevant research content to their research users. Previously, in order to customize […]