Raj Ramasubbu | AWS Big Data Blog

Build stateful streaming applications with Apache Spark 4.0 on Amazon EMR Serverless

In this post, we demonstrate how to build a production-ready IoT device monitoring system using Spark 4.0’s transformWithState API on Amazon EMR Serverless. This example showcases the key capabilities of stateful streaming and provides a template you can adapt for your own use cases.

Deploy real-time analytics with StarTree for managed Apache Pinot on AWS

In this post, we introduce StarTree as a managed solution on AWS for teams seeking the advantages of Pinot. We highlight the key distinctions between open-source Pinot and StarTree, and provide valuable insights for organizations considering a more streamlined approach to their real-time analytics infrastructure.

Read and write S3 Iceberg table using AWS Glue Iceberg Rest Catalog from Open Source Apache Spark

In this post, we will explore how to harness the power of Open source Apache Spark and configure a third-party engine to work with AWS Glue Iceberg REST Catalog. The post will include details on how to perform read/write data operations against Amazon S3 tables with AWS Lake Formation managing metadata and underlying data access using temporary credential vending.

Build a real-time analytics solution with Apache Pinot on AWS

In this, we will provide a step-by-step guide showing you how you can build a real-time OLAP datastore on Amazon Web Services (AWS) using Apache Pinot on Amazon Elastic Compute Cloud (Amazon EC2) and do near real-time visualization using Tableau. You can use Apache Pinot for batch processing use cases as well but, in this post, we will focus on a near real-time analytics use case.

Monitoring Amazon OpenSearch Serverless using AWS User Notifications

Amazon OpenSearch Serverless is a serverless deployment option for Amazon OpenSearch Service that makes it simple for you to run search and analytics workloads without having to think about infrastructure management. The compute capacity used for data ingestion, and search and query in OpenSearch Serverless is measured in OpenSearch Compute Units (OCUs). Customers can configure […]

Create an Apache Hudi-based near-real-time transactional data lake using AWS DMS, Amazon Kinesis, AWS Glue streaming ETL, and data visualization using Amazon QuickSight

We recently announced support for streaming extract, transform, and load (ETL) jobs in AWS Glue version 4.0, a new version of AWS Glue that accelerates data integration workloads in AWS. AWS Glue streaming ETL jobs continuously consume data from streaming sources, clean and transform the data in-flight, and make it available for analysis in seconds. AWS also offers a broad selection of services to support your needs. A database replication service such as AWS Database Migration Service (AWS DMS) can replicate the data from your source systems to Amazon Simple Storage Service (Amazon S3), which commonly hosts the storage layer of the data lake. This post demonstrates how to apply CDC changes from Amazon Relational Database Service (Amazon RDS) or other relational databases to an S3 data lake, with flexibility to denormalize, transform, and enrich the data in near-real time.

Access Amazon OpenSearch Serverless collections using a VPC endpoint

Amazon OpenSearch Serverless helps you index, analyze, and search your logs and data using OpenSearch APIs and dashboards. The OpenSearch Serverless collection is a group of indexes. API and dashboard clients can access the collections from public networks or one or more VPCs. For VPC access to collections and dashboards, you can create VPC endpoints. […]

AWS Big Data Blog

Author: Raj Ramasubbu

Build stateful streaming applications with Apache Spark 4.0 on Amazon EMR Serverless

Deploy real-time analytics with StarTree for managed Apache Pinot on AWS

Read and write S3 Iceberg table using AWS Glue Iceberg Rest Catalog from Open Source Apache Spark

Build a real-time analytics solution with Apache Pinot on AWS

Monitoring Amazon OpenSearch Serverless using AWS User Notifications

Create an Apache Hudi-based near-real-time transactional data lake using AWS DMS, Amazon Kinesis, AWS Glue streaming ETL, and data visualization using Amazon QuickSight

Access Amazon OpenSearch Serverless collections using a VPC endpoint

Learn

Resources

Developers

Help