AWS Big Data Blog

Apply fine-grained access and transformation on the SUPER data type in Amazon Redshift

Amazon Redshift is a fast, scalable, secure, and fully managed cloud data warehouse that makes it simple and cost-effective to analyze all your data using standard SQL and your existing ETL (extract, transform, and load), business intelligence (BI), and reporting tools. Tens of thousands of customers use Amazon Redshift to process exabytes of data per […]

Build multimodal search with Amazon OpenSearch Service

Multimodal search enables both text and image search capabilities, transforming how users access data through search applications. Consider building an online fashion retail store: you can enhance the users’ search experience with a visually appealing application that customers can use to not only search using text but they can also upload an image depicting a […]

Introducing AWS Glue usage profiles for flexible cost control

AWS Glue is a serverless data integration service that enables you to run extract, transform, and load (ETL) workloads on your data in a scalable and serverless manner. One of the main advantages of using a cloud platform is its flexibility; you can provision compute resources when you actually need them. However, with this ease […]

Disaster recovery strategies for Amazon MWAA – Part 2

Amazon Managed Workflows for Apache Airflow (Amazon MWAA) is a fully managed orchestration service that makes it straightforward to run data processing workflows at scale. Amazon MWAA takes care of operating and scaling Apache Airflow so you can focus on developing workflows. However, although Amazon MWAA provides high availability within an AWS Region through features […]

Ingest and analyze your data using Amazon OpenSearch Service with Amazon OpenSearch Ingestion

In today’s data-driven world, organizations are continually confronted with the task of managing extensive volumes of data securely and efficiently. Whether it’s customer information, sales records, or sensor data from Internet of Things (IoT) devices, the importance of handling and storing data at scale with ease of use is paramount. A common use case that […]

How Swisscom automated Amazon Redshift as part of their One Data Platform solution using AWS CDK – Part 2

In this series, we talk about Swisscom’s journey of automating Amazon Redshift provisioning as part of the Swisscom One Data Platform (ODP) solution using the AWS Cloud Development Kit (AWS CDK), and we provide code snippets and the other useful references. In Part 1, we did a deep dive on provisioning a secure and compliant […]

How Swisscom automated Amazon Redshift as part of their One Data Platform solution using AWS CDK – Part 1

In this post, we deep dive into provisioning a secure and compliant Redshift cluster using the AWS CDK and discuss the best practices of secret rotation. We also explain how Swisscom used AWS CDK custom resources in automating the creation of dynamic user groups that are relevant for the AWS Identity and Access management (IAM) roles matching different job functions.

Optimize storage costs in Amazon OpenSearch Service using Zstandard compression

As part of an indexing operation, the ingested documents are stored as immutable segments. Each segment is a collection of various data structures, such as inverted index, block K dimensional tree (BKD), term dictionary, or stored fields, and these data structures are responsible for retrieving the document faster during the search operation. Out of these data structures, stored fields, which are largest fields in the segment, are compressed when stored on the disk and based on the compression strategy used, the compression speed and the index storage size will vary. In this post, we discuss the performance of the Zstandard algorithm, which was introduced in OpenSearch v2.9, amongst other available compression algorithms in OpenSearch.

How Cloudinary transformed their petabyte scale streaming data lake with Apache Iceberg and AWS Analytics

This post is co-written with Amit Gilad, Alex Dickman and Itay Takersman from Cloudinary.  Enterprises and organizations across the globe want to harness the power of data to make better decisions by putting data at the center of every decision-making process. Data-driven decisions lead to more effective responses to unexpected events, increase innovation and allow […]

Design a data mesh pattern for Amazon EMR-based data lakes using AWS Lake Formation with Hive metastore federation

In this post, we delve into the key aspects of using Amazon EMR for modern data management, covering topics such as data governance, data mesh deployment, and streamlined data discovery. One of the key challenges in modern big data management is facilitating efficient data sharing and access control across multiple EMR clusters. Organizations have multiple […]