AWS Big Data Blog
Category: Advanced (300)
Use account-agnostic, reusable project profiles in Amazon SageMaker to streamline governance
Amazon SageMaker now supports account-agnostic project profiles, so you can create reusable project templates across multiple AWS accounts and organizational units. In this post, we demonstrate how account-agnostic project profiles can help you simplify and streamline the management of SageMaker project creation while maintaining security and governance features. We walk through the technical steps to configure account-agnostic, reusable project profiles, helping you maximize the flexibility of your SageMaker deployments.
Deploy Apache YuniKorn batch scheduler for Amazon EMR on EKS
This post explores Kubernetes scheduling fundamentals, examines the limitations of the default kube-scheduler for batch workloads, and demonstrates how YuniKorn addresses these challenges. We discuss how to deploy YuniKorn as a custom scheduler for Amazon EMR on EKS, its integration with job submissions, how to configure queues and placement rules, and how to establish resource quotas. We also show these features in action through practical Spark job examples.
Modernize Amazon Redshift authentication by migrating user management to AWS IAM Identity Center
Amazon Redshift is a powerful cloud-based data warehouse that organizations can use to analyze both structured and semi-structured data through advanced SQL queries. As a fully managed service, it provides high performance and scalability while allowing secure access to the data stored in the data warehouse. Organizations worldwide rely on Amazon Redshift to handle massive […]
How Ancestry optimizes a 100-billion-row Iceberg table
This is a guest post by Thomas Cardenas, Staff Software Engineer at Ancestry, in partnership with AWS. Ancestry, the global leader in family history and consumer genomics, uses family trees, historical records, and DNA to help people on their journeys of personal discovery. Ancestry has the largest collection of family history records, consisting of 40 […]
How AppZen enhances operational efficiency, scalability, and security with Amazon OpenSearch Serverless
AppZen is a leading provider of AI-driven finance automation solutions. The company’s core offering centers around an innovative AI platform designed for modern finance teams, featuring expense management, fraud detection, and autonomous accounts payable solutions. AppZen’s technology stack uses computer vision, deep learning, and natural language processing (NLP) to automate financial processes and ensure compliance. […]
Zeta reduces banking incident response time by 80% with Amazon OpenSearch Service observability
In this post, we explain how Zeta built a more unified monitoring solution using Amazon OpenSearch Service that improved performance, reduced manual processes, and increased end-user satisfaction. Zeta has achieved over an 80% reduction in mean time to resolution (MTTR), with incident response times decreasing from 30+ minutes to under 5 minutes.
Improve Amazon EMR HBase availability and tail latency using generational ZGC
Large-scale HBase deployments on Amazon EMR suffer from unpredictable garbage collection behavior that creates performance bottlenecks for business-critical applications. To solve this problem, Amazon EMR leverages Oracle’s generational ZGC technology from JDK 21 to deliver predictable, sub-millisecond pause times. This post shows you how to configure generational ZGC in Amazon EMR 7.10.0, apply performance tuning methods, and optimize HBase RegionServer garbage collection settings.
Guide to adopting Amazon SageMaker Unified Studio from ATPCO’s Journey
ATPCO is the backbone of modern airline retailing, helping airlines and third-party channels deliver the right offers to customers at the right time. ATPCO addressed data governance challenges using Amazon DataZone. SageMaker Unified Studio, built on the same architecture as Amazon DataZone, offers additional capabilities, so users can complete various tasks such as building data pipelines using AWS Glue and Amazon EMR, or conducting analyses using Amazon Athena and Amazon Redshift query editor across diverse datasets, all within a single, unified environment. In this post, we walk you through the challenges ATPCO addresses for their business using SageMaker Unified Studio.
Achieve low-latency data processing with Amazon EMR on AWS Local Zones
By deploying Amazon EMR on AWS Local Zones, organizations can process data with single-digit millisecond latency for their applications while maintaining data residency compliance. This post demonstrates how to use AWS Local Zones to deploy EMR clusters closer to your users.
Export JMX metrics from Kafka connectors in Amazon Managed Streaming for Apache Kafka Connect with a custom plugin
In this post, we demonstrate how you can export the JMX metrics for the Debezium connector when used with Amazon MSK Connect.