AWS Big Data Blog
Category: Learning Levels
Cluster manager communication simplified with Remote Publication
With OpenSearch Service version 2.17, Amazon OpenSearch Service now supports 1,000-node domains capable of handling 500,000 shards, a significant leap forward in scalability and performance. This post explains cluster state publication, Remote Publication, and their benefits in improving durability, scalability, and availability.
Build data pipelines with dbt in Amazon Redshift using Amazon MWAA and Cosmos
In this post, we explore a streamlined, configuration-driven approach to orchestrate dbt Core jobs using Amazon Managed Workflows for Apache Airflow (Amazon MWAA) and Cosmos, an open source package. These jobs run transformations on Amazon Redshift. With this setup, teams can collaborate effectively while maintaining data quality, operational efficiency, and observability.
Boosting search relevance: Automatic semantic enrichment in Amazon OpenSearch Serverless
In this post, we show how automatic semantic enrichment removes friction and makes the implementation of semantic search for text data seamless, with step-by-step instructions to enhance your search functionality.
Create an OpenSearch dashboard with Amazon OpenSearch Service
This post demonstrates how to harness OpenSearch Dashboards to analyze logs visually and interactively. With this solution, IT administrators, developers, and DevOps engineers can create custom dashboards to monitor system behavior, detect anomalies early, and troubleshoot issues faster through interactive charts and graphs.
Build a multi-tenant healthcare system with Amazon OpenSearch Service
In this post, we address common multi-tenancy challenges and provide actionable solutions for security, tenant isolation, workload management, and cost optimization across diverse healthcare tenants.
Develop and deploy a generative AI application using Amazon SageMaker Unified Studio
In this post, we demonstrate how to use Amazon Bedrock Flows in SageMaker Unified Studio to build a sophisticated generative AI application for financial analysis and investment decision-making.
Amazon Redshift out-of-the-box performance innovations for data lake queries
In this post, we first briefly review how planner statistics are collected and what impact they have on queries. Then, we discuss Amazon Redshift features that deliver optimal plans on Iceberg tables and Parquet data even with the lack of statistics. Finally, we review some example queries that now execute faster because of these latest Amazon Redshift innovations.
Optimize traffic costs of Amazon MSK consumers on Amazon EKS with rack awareness
In this post, we walk you through a solution for implementing rack awareness in consumer applications that are dynamically deployed across multiple Availability Zones using Amazon EKS.
Build an analytics pipeline that is resilient to Avro schema changes using Amazon Athena
This post demonstrates how to build a solution by combining Amazon Simple Storage Service (Amazon S3) for data storage, AWS Glue Data Catalog for schema management, and Amazon Athena for one-time querying. We’ll focus specifically on handling Avro-formatted data in partitioned S3 buckets, where schemas can change frequently, and on providing consistent query capabilities across all data regardless of schema version.
Amazon OpenSearch Service 101: How many shards do I need
Customers new to Amazon OpenSearch Service often ask how many shards their indexes need. An index is a collection of shards, and an index’s shard count can affect both indexing and search request efficiency. OpenSearch Service can take in large amounts of data, split it into smaller units called shards, and distribute those shards across a dynamically changing set of instances. In this post, we provide some practical guidance for determining the ideal shard count for your use case.