AWS Big Data Blog

Category: Technical How-to

How ATPCO enables governed self-service data access to accelerate innovation with Amazon DataZone

ATPCO is the backbone of modern airline retailing, enabling airlines and third-party channels to deliver the right offers to customers at the right time. ATPCO’s reach is impressive, with its fare data covering over 89% of global flight schedules. In this post, using one of ATPCO’s use cases, we show you how ATPCO uses AWS services, including Amazon DataZone, to make data discoverable by data consumers across different business units so that they can innovate faster. We encourage you to read Amazon DataZone concepts and terminologies first to become familiar with the terms used in this post.

Configure SAML federation with Amazon OpenSearch Serverless and Keycloak

Amazon OpenSearch Serverless is a serverless version of Amazon OpenSearch Service, a fully managed open search and analytics platform. On Amazon OpenSearch Service you can run petabyte-scale search and analytics workloads without the heavy lifting of managing the underlying OpenSearch Service clusters and Amazon OpenSearch Serverless supports workloads up to 30TB of data for time-series […]

Visual representation of the relationships of the high level entities in the customer, event and product subject areas

How ActionIQ built a truly composable customer data platform using Amazon Redshift

This post is written in collaboration with Mackenzie Johnson and Phil Catterall from ActionIQ. ActionIQ is a leading composable customer data (CDP) platform designed for enterprise brands to grow faster and deliver meaningful experiences for their customers. ActionIQ taps directly into a brand’s data warehouse to build smart audiences, resolve customer identities, and design personalized […]

Streamline your data governance by deploying Amazon DataZone with the AWS CDK

Managing data across diverse environments can be a complex and daunting task. Amazon DataZone simplifies this so you can catalog, discover, share, and govern data stored across AWS, on premises, and third-party sources. Many organizations manage vast amounts of data assets owned by various teams, creating a complex landscape that poses challenges for scalable data […]

Migrate from Apache Solr to OpenSearch

OpenSearch is an open source, distributed search engine suitable for a wide array of use-cases such as ecommerce search, enterprise search (content management search, document search, knowledge management search, and so on), site search, application search, and semantic search. It’s also an analytics suite that you can use to perform interactive log analytics, real-time application […]

How Volkswagen streamlined access to data across multiple data lakes using Amazon DataZone – Part 1

This blog post introduces Amazon DataZone and explores how VW used it to build their data mesh to enable streamlined data access across multiple data lakes. It focuses on the key aspect of the solution, which was enabling data providers to automatically publish data assets to Amazon DataZone, which served as the central data mesh for enhanced data discoverability. Additionally, the post provides code to guide you through the implementation.

Migrate data from an on-premises Hadoop environment to Amazon S3 using S3DistCp with AWS Direct Connect

This post demonstrates how to migrate nearly any amount of data from an on-premises Apache Hadoop environment to Amazon Simple Storage Service (Amazon S3) by using S3DistCp on Amazon EMR with AWS Direct Connect. To transfer resources from a target EMR cluster, the traditional Hadoop DistCp must be run on the source cluster to move […]

Improve your Amazon OpenSearch Service performance with OpenSearch Optimized Instances

Amazon OpenSearch Service introduced the OpenSearch Optimized Instances (OR1), deliver price-performance improvement over existing instances. The newly introduced OR1 instances are ideally tailored for heavy indexing use cases like log analytics and observability workloads. OR1 instances use a local and a remote store. The local storage utilizes either Amazon Elastic Block Store (Amazon EBS) of […]

Run Apache XTable on Amazon MWAA to translate open table formats

In this post, we show you how to get started with Apache XTable on AWS and how you can use it in a batch pipeline orchestrated with Amazon Managed Workflows for Apache Airflow (Amazon MWAA). To understand how XTable and similar solutions work, we start with a high-level background on metadata management in an OTF and then dive deeper into XTable and its usage.

Architecture Diagram

How EchoStar ingests terabytes of data daily across its 5G Open RAN network in near real-time using Amazon Redshift Serverless Streaming Ingestion

EchoStar, a connectivity company providing television entertainment, wireless communications, and award-winning technology to residential and business customers throughout the US, deployed the first standalone, cloud-native Open RAN 5G network on AWS public cloud. This post provides an overview of real-time data analysis with Amazon Redshift and how EchoStar uses it to ingest hundreds of megabytes per second. As data sources and volumes grew across its network, EchoStar migrated from a single Redshift Serverless workgroup to a multi-warehouse architecture with live data sharing.