AWS Big Data Blog

Amazon OpenSearch Serverless expands support for larger workloads and collections

We recently announced enhancements to Amazon OpenSearch Serverless that let you scan and search up to 6 TB of source data. At launch, OpenSearch Serverless supported searching one or more indexes within a collection, with a total combined size of up to 1 TB. With support for 6 TB of source data, you can now scale up your log analytics, machine learning applications, and ecommerce data more effectively. With OpenSearch Serverless, you get the benefits of these expanded limits without having to worry about sizing, monitoring your usage, or manually scaling an OpenSearch domain.

Introducing AWS Glue crawler and create table support for Apache Iceberg format

Apache Iceberg is an open table format for large datasets in Amazon Simple Storage Service (Amazon S3) and provides fast query performance over large tables, atomic commits, concurrent writes, and SQL-compatible table evolution. Iceberg has become very popular for its support for ACID transactions in data lakes and features like schema and partition evolution, time […]

Implement a serverless CDC process with Apache Iceberg using Amazon DynamoDB and Amazon Athena

Apache Iceberg is an open table format for very large analytic datasets. Iceberg manages large collections of files as tables, and it supports modern analytical data lake operations such as record-level insert, update, delete, and time travel queries. The Iceberg specification allows seamless table evolution such as schema and partition evolution, and its design is […]

Derive operational insights from application logs using Automated Data Analytics on AWS

Automated Data Analytics (ADA) on AWS is an AWS solution that enables you to derive meaningful insights from data in a matter of minutes through a simple and intuitive user interface. ADA offers an AWS-native data analytics platform that is ready to use out of the box by data analysts for a variety of use […]

Use Amazon Athena to query data stored in Google Cloud Platform

As customers accelerate their migrations to the cloud and transform their businesses, some find themselves in situations where they have to manage data analytics in a multi-cloud environment, such as acquiring a company that runs on a different cloud provider. Customers who use multi-cloud environments often face challenges in data access and compatibility that can […]

The art and science of data product portfolio management

This post is the first in a series dedicated to the art and science of practical data mesh implementation (for an overview of data mesh, read the original whitepaper The data mesh shift). The series attempts to bridge the gap between the tenets of data mesh and its real-life implementation by deep-diving into the functional […]

How Ontraport reduced data processing cost by 80% with AWS Glue

This post is written in collaboration with Elijah Ball from Ontraport. Customers are implementing data and analytics workloads in the AWS Cloud to optimize cost. When implementing data processing workloads in AWS, you have the option to use technologies like Amazon EMR or serverless technologies like AWS Glue. Both options minimize the undifferentiated heavy lifting […]

Introducing Apache Airflow version 2.6.3 support on Amazon MWAA

Amazon Managed Workflows for Apache Airflow (Amazon MWAA) is a managed orchestration service for Apache Airflow that makes it simple to set up and operate end-to-end data pipelines in the cloud. Trusted across various industries, Amazon MWAA helps organizations like Siemens, ENGIE, and Choice Hotels International enhance and scale their business workflows, while significantly improving security […]

Perform Amazon Kinesis load testing with Locust

February 9, 2024: Amazon Kinesis Data Firehose has been renamed to Amazon Data Firehose. Read the AWS What’s New post to learn more. Building a streaming data solution requires thorough testing at the scale it will operate in a production environment. Streaming applications operating at scale often handle large volumes of up to GBs per […]

Monitor data pipelines in a serverless data lake

AWS serverless services, including but not limited to AWS Lambda, AWS Glue, AWS Fargate, Amazon EventBridge, Amazon Athena, Amazon Simple Notification Service (Amazon SNS), Amazon Simple Queue Service (Amazon SQS), and Amazon Simple Storage Service (Amazon S3), have become the building blocks for any serverless data lake, providing key mechanisms to ingest and transform data […]