AWS Big Data Blog
Batch data ingestion into Amazon OpenSearch Service using AWS Glue
This post showcases how to use Spark on AWS Glue to seamlessly ingest data into OpenSearch Service. We cover batch ingestion methods, share practical examples, and discuss best practices to help you build optimized and scalable data pipelines on AWS.
Build a high-performance quant research platform with Apache Iceberg
In our previous post Backtesting index rebalancing arbitrage with Amazon EMR and Apache Iceberg, we showed how to use Apache Iceberg in the context of strategy backtesting. In this post, we focus on data management implementation options such as accessing data directly in Amazon Simple Storage Service (Amazon S3), using popular data formats like Parquet, or using open table formats like Iceberg. Our experiments are based on real-world historical full order book data, provided by our partner CryptoStruct, and compare the trade-offs between these choices, focusing on performance, cost, and quant developer productivity.
Cost Optimized Vector Database: Introduction to Amazon OpenSearch Service quantization techniques
This blog post introduces a new disk-based vector search approach that allows efficient querying of vectors stored on disk without loading them entirely into memory. By implementing these quantization methods, organizations can achieve compression ratios of up to 64x, enabling cost-effective scaling of vector databases for large-scale AI and machine learning applications.
Use CI/CD best practices to automate Amazon OpenSearch Service cluster management operations
This post explores how to automate Amazon OpenSearch Service cluster management using CI/CD best practices. It presents two options: the Terraform OpenSearch provider and the Evolution library. The solution demonstrates how to use AWS CDK, Lambda, and CodeBuild to implement automated index template creation and management. By applying these techniques, organizations can improve the consistency, reliability, and efficiency of their OpenSearch operations.
Ingest data from Google Analytics 4 and Google Sheets to Amazon Redshift using Amazon AppFlow
Amazon AppFlow bridges the gap between Google applications and Amazon Redshift, empowering organizations to unlock deeper insights and drive data-informed decisions. In this post, we show you how to establish the data ingestion pipeline between Google Analytics 4, Google Sheets, and an Amazon Redshift Serverless workgroup.
Amazon EMR 7.5 runtime for Apache Spark and Iceberg can run Spark workloads 3.6 times faster than Spark 3.5.3 and Iceberg 1.6.1
The Amazon EMR runtime for Apache Spark offers a high-performance runtime environment while maintaining 100% API compatibility with open source Apache Spark and Apache Iceberg table format. In this post, we demonstrate the performance benefits of using the Amazon EMR 7.5 runtime for Spark and Iceberg compared to open source Spark 3.5.3 with Iceberg 1.6.1 tables on the TPC-DS 3TB benchmark v2.13.
Fitch Group achieves multi-Region resiliency for mission-critical Kafka infrastructure with Amazon MSK Replicator
In this post, we explore how Fitch Group, one of the top credit rating companies, used Amazon MSK and Amazon MSK Replicator to achieve multi-Region resiliency for their mission-critical Kafka infrastructure.
Amazon Q data integration adds DataFrame support and in-prompt context-aware job creation
Amazon Q data integration, introduced in January 2024, allows you to use natural language to author extract, transform, load (ETL) jobs and operations in AWS Glue specific data abstraction DynamicFrame. This post introduces exciting new capabilities for Amazon Q data integration that work together to make ETL development more efficient and intuitive. We’ve added support for DataFrame-based code generation that works across any Spark environment. We’ve also introduced in-prompt context-aware development that applies details from your conversations, working seamlessly with a new iterative development experience.
Jumia builds a next-generation data platform with metadata-driven specification frameworks
Jumia is a technology company born in 2012, present in 14 African countries, with its main headquarters in Lagos, Nigeria. In this post, we share part of the journey that Jumia took with AWS Professional Services to modernize its data platform that ran under a Hadoop distribution to AWS serverless based solutions.
HEMA accelerates their data governance journey with Amazon DataZone
HEMA is a household Dutch retail brand name since 1926, providing daily convenience products using unique design. This post describes how HEMA used Amazon DataZone to build their data mesh and enable streamlined data access across multiple business areas. It explains HEMA’s unique journey of deploying Amazon DataZone, the key challenges they overcame, and the transformative benefits they have realized since deployment in May 2024. From establishing an enterprise-wide data inventory and improving data discoverability, to enabling decentralized data sharing and governance, Amazon DataZone has been a game changer for HEMA.