AWS Big Data Blog

Category: Analytics

Amazon CloudWatch metrics for Amazon OpenSearch Service storage and shard skew health

In this post, we explore how to deploy Amazon CloudWatch metrics using an AWS CloudFormation template to monitor an OpenSearch Service domain’s storage and shard skew. This solution uses an AWS Lambda function to extract storage and shard distribution metadata from your OpenSearch Service domain, calculates the level of skew, and then pushes this information to CloudWatch metrics so that you can easily monitor, alert, and respond.

Try semantic search with the Amazon OpenSearch Service vector engine

Amazon OpenSearch Service has long supported both lexical and vector search, since the introduction of its kNN plugin in 2020. With recent developments in generative AI, including AWS’s launch of Amazon Bedrock earlier in 2023, you can now use Amazon Bedrock-hosted models in conjunction with the vector database capabilities of OpenSearch Service, allowing you to implement semantic search, retrieval augmented generation (RAG), recommendation engines, and rich media search based on high-quality vector search. The recent launch of the vector engine for Amazon OpenSearch Serverless makes it even easier to deploy such solutions.

Amazon OpenSearch Serverless expands support for larger workloads and collections

We recently announced new enhancements to Amazon OpenSearch Serverless that can scan and search source data sizes of up to 6 TB. At launch, OpenSearch Serverless supported searching one or more indexes within a collection, with the total combined size of up to 1 TB. With the support for 6 TB source data, you can now scale up your log analytics, machine learning applications, and ecommerce data more effectively. With OpenSearch Serverless, you can enjoy the benefits of these expanded limits without having to worry about sizing, monitoring your usage, or manually scaling an OpenSearch domain.

Introducing AWS Glue crawler and create table support for Apache Iceberg format

Apache Iceberg is an open table format for large datasets in Amazon Simple Storage Service (Amazon S3) and provides fast query performance over large tables, atomic commits, concurrent writes, and SQL-compatible table evolution. Iceberg has become very popular for its support for ACID transactions in data lakes and features like schema and partition evolution, time […]

Implement a serverless CDC process with Apache Iceberg using Amazon DynamoDB and Amazon Athena

Apache Iceberg is an open table format for very large analytic datasets. Iceberg manages large collections of files as tables, and it supports modern analytical data lake operations such as record-level insert, update, delete, and time travel queries. The Iceberg specification allows seamless table evolution such as schema and partition evolution, and its design is […]

Derive operational insights from application logs using Automated Data Analytics on AWS

Automated Data Analytics (ADA) on AWS is an AWS solution that enables you to derive meaningful insights from data in a matter of minutes through a simple and intuitive user interface. ADA offers an AWS-native data analytics platform that is ready to use out of the box by data analysts for a variety of use […]

Use Amazon Athena to query data stored in Google Cloud Platform

As customers accelerate their migrations to the cloud and transform their businesses, some find themselves in situations where they have to manage data analytics in a multi-cloud environment, such as acquiring a company that runs on a different cloud provider. Customers who use multi-cloud environments often face challenges in data access and compatibility that can […]

How Ontraport reduced data processing cost by 80% with AWS Glue

This post is written in collaboration with Elijah Ball from Ontraport. Customers are implementing data and analytics workloads in the AWS Cloud to optimize cost. When implementing data processing workloads in AWS, you have the option to use technologies like Amazon EMR or serverless technologies like AWS Glue. Both options minimize the undifferentiated heavy lifting […]

Perform Amazon Kinesis load testing with Locust

February 9, 2024: Amazon Kinesis Data Firehose has been renamed to Amazon Data Firehose. Read the AWS What’s New post to learn more. Building a streaming data solution requires thorough testing at the scale it will operate in a production environment. Streaming applications operating at scale often handle large volumes of up to GBs per […]

Monitor data pipelines in a serverless data lake

AWS serverless services, including but not limited to AWS Lambda, AWS Glue, AWS Fargate, Amazon EventBridge, Amazon Athena, Amazon Simple Notification Service (Amazon SNS), Amazon Simple Queue Service (Amazon SQS), and Amazon Simple Storage Service (Amazon S3), have become the building blocks for any serverless data lake, providing key mechanisms to ingest and transform data […]