AWS Big Data Blog

Category: Technical How-to

Set up fine-grained permissions for your data pipeline using MWAA and EKS

This blog post shows how to improve security in a data pipeline architecture based on Amazon Managed Workflows for Apache Airflow (Amazon MWAA) and Amazon Elastic Kubernetes Service (Amazon EKS) by setting up fine-grained permissions, using HashiCorp Terraform for infrastructure as code.

Stitch Fix seamless migration: Transitioning from self-managed Kafka to Amazon MSK

Stitch Fix is a personalized clothing styling service for men, women, and kids. In this post, we will describe how and why we decided to migrate from self-managed Kafka to Amazon Managed Streaming for Apache Kafka (Amazon MSK).

Externalize Amazon MSK Connect configurations with Terraform

Managing configurations for Amazon MSK Connect, a feature of Amazon Managed Streaming for Apache Kafka (Amazon MSK), can become challenging, especially as the number of topics and configurations grows. In this post, we address this complexity by using Terraform to optimize the configuration of the Kafka topic to Amazon S3 Sink connector. By adopting this […]

Operational Data Processing Framework for Modern Data Architectures

Simplify operational data processing in data lakes using AWS Glue and Apache Hudi

AWS has invested in native service integration with Apache Hudi and published technical contents to enable you to use Apache Hudi with AWS Glue (for example, refer to Introducing native support for Apache Hudi, Delta Lake, and Apache Iceberg on AWS Glue for Apache Spark, Part 1: Getting Started). In AWS ProServe-led customer engagements, the use cases we work on usually come with technical complexity and scalability requirements. In this post, we discuss a common use case in relation to operational data processing and the solution we built using Apache Hudi and AWS Glue.

Securely process near-real-time data from Amazon MSK Serverless using an AWS Glue streaming ETL job with IAM authentication

Streaming data has become an indispensable resource for organizations worldwide because it offers real-time insights that are crucial for data analytics. The escalating velocity and magnitude of collected data has created a demand for real-time analytics. This data originates from diverse sources, including social media, sensors, logs, and clickstreams, among others. With streaming data, organizations […]

Build streaming data pipelines with Amazon MSK Serverless and IAM authentication

Amazon’s serverless Apache Kafka offering, Amazon Managed Streaming for Apache Kafka (Amazon MSK) Serverless, is attracting a lot of interest. It’s appreciated for its user-friendly approach, ability to scale automatically, and cost-saving benefits over other Kafka solutions. However, a hurdle encountered by many users is the requirement of MSK Serverless to use AWS Identity and Access Management (IAM) access control. At the time of writing, the Amazon MSK library for IAM is exclusive to Kafka libraries in Java, creating a challenge for users of other programming languages. In this post, we aim to address this issue and present how you can use Amazon API Gateway and AWS Lambda to navigate around this obstacle.

Use the reverse token filter to enable suffix matching queries in OpenSearch

In this post, we show how you can implement a suffix-based search. OpenSearch is an open-source RESTful search engine built on top of the Apache Lucene library. OpenSearch full-text search is fast, can give the result of complex queries within a fraction of a second. With OpenSearch, you can convert unstructured text into structured text using different text analyzers, tokenizers, and filters to improve search. OpenSearch uses a default analyzer, called the standard analyzer, which works well for most use cases out of the box. But for some use cases, it may not work best, and you need to use a specific analyzer.

Deploy Amazon OpenSearch Serverless with Terraform

This post demonstrates how to use Terraform to create, deploy, and clean up OpenSearch Serverless infrastructure.. Amazon OpenSearch Serverless provides the search and analytical functionality of OpenSearch without the manual overhead of configuring, managing, and scaling OpenSearch clusters. It automatically scales the resources based on your workload, and you only pay for the resources consumed. Managing OpenSearch Serverless is simple, but with infrastructure as code (IaC) software like Terraform, you can simplify your resource management even more.

Monitor Apache Spark applications on Amazon EMR with Amazon Cloudwatch

To improve a Spark application’s efficiency, it’s essential to monitor its performance and behavior. In this post, we demonstrate how to publish detailed Spark metrics from Amazon EMR to Amazon CloudWatch. This will give you the ability to identify bottlenecks while optimizing resource utilization.