AWS Big Data Blog

Category: Learning Levels

How Socure achieved 50% cost reduction by migrating from self-managed Spark to Amazon EMR Serverless

Socure is one of the leading providers of digital identity verification and fraud solutions. Socure’s data science environment includes a streaming pipeline called Transaction ETL (TETL), built on OSS Apache Spark running on Amazon EKS. TETL ingests and processes data volumes ranging from small to large datasets while maintaining high-throughput performance. In this post, we show how Socure was able to achieve 50% cost reduction by migrating the TETL streaming pipeline from self-managed spark to Amazon EMR serverless.

Introducing Apache Iceberg materialized views in AWS Glue Data Catalog

Hundreds of thousands of customers build artificial intelligence and machine learning (AI/ML) and analytics applications on AWS, frequently transforming data through multiple stages for improved query performance—from raw data to processed datasets to final analytical tables. Data engineers must solve complex problems, including detecting what data has changed in base tables, writing and maintaining transformation […]

Introducing AWS Glue 5.1 for Apache Spark

AWS recently announced Glue 5.1, a new version of AWS Glue that accelerates data integration workloads in AWS. AWS Glue 5.1 upgrades the Spark engines to Apache Spark 3.5.6, giving you newer Spark release along with the newer dependent libraries so you can develop, run, and scale your data integration workloads and get insights faster. In this post, we describe what’s new in AWS Glue 5.1, key highlights on Spark and related libraries, and how to get started on AWS Glue 5.1.

Auto-optimize your Amazon OpenSearch Service vector database

AWS recently announced the general availability of auto-optimize for the Amazon OpenSearch Service vector engine. This feature streamlines vector index optimization by automatically evaluating configuration trade-offs across search quality, speed, and cost savings. You can then run a vector ingestion pipeline to build an optimized index on your desired collection or domain. Previously, optimizing index […]

Build billion-scale vector databases in under an hour with GPU acceleration on Amazon OpenSearch Service

AWS recently announced the general availability of GPU-accelerated vector (k-NN) indexing on Amazon OpenSearch Service. You can now build billion-scale vector databases in under an hour and index vectors up to 10 times faster at a quarter of the cost. This feature dynamically attaches serverless GPUs to boost domains and collections running CPU-based instances. With […]

SAP data ingestion and replication with AWS Glue zero-ETL

AWS Glue zero-ETL with SAP now supports data ingestion and replication from SAP data sources such as Operational Data Provisioning (ODP) managed SAP Business Warehouse (BW) extractors, Advanced Business Application Programming (ABAP), Core Data Services (CDS) views, and other non-ODP data sources. Zero-ETL data replication and schema synchronization writes extracted data to AWS services like Amazon Redshift, Amazon SageMaker lakehouse, and Amazon S3 Tables, alleviating the need for manual pipeline development. In this post, we show how to create and monitor a zero-ETL integration with various ODP and non-ODP SAP sources.

Achieve 2x faster data lake query performance with Apache Iceberg on Amazon Redshift

In 2025, Amazon Redshift delivered several performance optimizations that improved query performance over twofold for Iceberg workloads on Amazon Redshift Serverless, delivering exceptional performance and cost-effectiveness for your data lake workloads. In this post, we describe some of the optimizations that led to these performance gains.