AWS Big Data Blog
Category: Learning Levels
Build an analytics pipeline that is resilient to Avro schema changes using Amazon Athena
This post demonstrates how to build a solution by combining Amazon Simple Storage Service (Amazon S3) for data storage, AWS Glue Data Catalog for schema management, and Amazon Athena for one-time querying. We’ll focus specifically on handling Avro-formatted data in partitioned S3 buckets, where schemas can change frequently while providing consistent query capabilities across all data regardless of schema versions.
Amazon OpenSearch Service 101: How many shards do I need
Customers new to Amazon OpenSearch Service often ask how many shards their indexes need. An index is a collection of shards, and an index’s shard count can affect both indexing and search request efficiency. OpenSearch Service can take in large amounts of data, split it into smaller units called shards, and distribute those shards across a dynamically changing set of instances. In this post, we provide some practical guidance for determining the ideal shard count for your use case.
Optimizing vector search using Amazon S3 Vectors and Amazon OpenSearch Service
We now have a public preview of two integrations between Amazon Simple Storage Service (Amazon S3) Vectors and Amazon OpenSearch Service that give you more flexibility in how you store and search vector embeddings. In this post, we walk through this seamless integration, providing you with flexible options for vector search implementation.
Integrating Amazon OpenSearch Ingestion with Amazon RDS and Amazon Aurora
We are happy to announce the general availability of the integration of Amazon OpenSearch Service with Amazon Relational Database Service (Amazon RDS) and Amazon Aurora. This new integration eliminates complex data pipelines and enables near real-time data synchronization between Amazon Aurora (including Amazon Aurora MySQL-Compatible Edition and Amazon Aurora PostgreSQL-Compatible Edition) and Amazon RDS databases (including Amazon RDS for MySQL and Amazon RDS for PostgreSQL), and Amazon OpenSearch Service, unlocking advanced search capabilities such as hybrid search, ranked results, and faceted search on transactional databases.
Scale your AWS Glue for Apache Spark jobs with R type, G.12X, and G.16X workers
This post demonstrates how AWS Glue R type, G.12X, and G.16X workers help you scale up your AWS Glue for Apache Spark jobs.
Orchestrate data processing jobs, querybooks, and notebooks using visual workflow experience in Amazon SageMaker
Today, we are excited to launch a new visual workflows builder in SageMaker Unified Studio. With the new visual workflow experience, you don’t need to code the Python DAGs manually. Instead, you can visually define the orchestration workflow in SageMaker Unified Studio, and the visual definition is automatically converted to a Python DAG definition that is supported in Airflow.This post demonstrates the new visual workflow experience in SageMaker Unified Studio.
Perform per-project cost allocation in Amazon SageMaker Unified Studio
Amazon SageMaker Unified Studio enables per-project cost allocation through resource tagging, allowing organizations to track and manage costs across different projects and domains effectively. This post demonstrates how to implement cost tracking using AWS Billing and Cost Management tools, including Cost Explorer and Data Exports, to help finance and business analysts follow FinOps best practices for controlling cloud infrastructure costs.
Near real-time baggage operational insights for airlines using Amazon Kinesis Data Streams
This post explores a framework developed by IBM to modernize baggage analytics using AWS managed services like Amazon Kinesis Data Streams, DynamoDB Streams, and other AWS services within a serverless architecture. The solution enables near real-time baggage operational insights for airlines, delivering cost savings, enhanced scalability, and improved performance while providing better security and operational efficiency to meet evolving airline needs.
Overcome your Kafka Connect challenges with Amazon Data Firehose
We’re happy to announce a new feature in the Amazon Data Firehose integration with Amazon MSK. You can now specify the Firehose stream to either read from the earliest position on the Kafka topic or from a custom timestamp to begin reading from your MSK topic. In this post of this series, we focus on managed data delivery from Kafka to your data lake.
How Stifel built a modern data platform using AWS Glue and an event-driven domain architecture
In this post, we show you how Stifel implemented a modern data platform using AWS services and open data standards, building an event-driven architecture for domain data products while centralizing the metadata to facilitate discovery and sharing of data products.









