AWS Big Data Blog

Category: Learning Levels

Automate alerting and reporting for AWS Glue job resource usage

Data transformation plays a pivotal role in providing the necessary data insights for businesses in any organization, small and large. To gain these insights, customers often perform ETL (extract, transform, and load) jobs from their source systems and output an enriched dataset. Many organizations today are using AWS Glue to build ETL pipelines that bring data […]

Improve operational efficiencies of Apache Iceberg tables built on Amazon S3 data lakes

Apache Iceberg is an open table format for large datasets in Amazon Simple Storage Service (Amazon S3) and provides fast query performance over large tables, atomic commits, concurrent writes, and SQL-compatible table evolution. When you build your transactional data lake using Apache Iceberg to solve your functional use cases, you need to focus on operational […]

Real-time time series anomaly detection for streaming applications on Amazon Kinesis Data Analytics

Detecting anomalies in real time from high-throughput streams is key for informing on timely decisions in order to adapt and respond to unexpected scenarios. Stream processing frameworks such as Apache Flink empower users to design systems that can ingest and process continuous flows of data at scale. In this post, we present a streaming time […]

Simplify AWS Glue job orchestration and monitoring with Amazon MWAA

Organizations across all industries have complex data processing requirements for their analytical use cases across different analytics systems, such as data lakes on AWS, data warehouses (Amazon Redshift), search (Amazon OpenSearch Service), NoSQL (Amazon DynamoDB), machine learning (Amazon SageMaker), and more. Analytics professionals are tasked with deriving value from data stored in these distributed systems […]

What’s new with Amazon MWAA support for startup scripts

Amazon Managed Workflow for Apache Airflow (Amazon MWAA) is a managed service for Apache Airflow that lets you use the same familiar Apache Airflow environment to orchestrate your workflows and enjoy improved scalability, availability, and security without the operational burden of having to manage the underlying infrastructure. In April 2023, Amazon MWAA added support for […]

Dive deep into AWS Glue 4.0 for Apache Spark

Deriving insight from data is hard. It’s even harder when your organization is dealing with silos that impede data access across different data stores. Seamless data integration is a key requirement in a modern data architecture to break down data silos. AWS Glue is a serverless data integration service that makes data preparation simpler, faster, […]

Stream data with Amazon MSK Connect using an open-source JDBC connector

Customers are adopting Amazon Managed Service for Apache Kafka (Amazon MSK) as a fast and reliable streaming platform to build their enterprise data hub. In addition to streaming capabilities, setting up Amazon MSK enables organizations to use a pub/sub model for data distribution with loosely coupled and independent components. To publish and distribute the data […]

Improve power utility operational efficiency using smart sensor data and Amazon QuickSight

This blog post is co-written with Steve Alexander at PG&E. In today’s rapidly changing energy landscape, power disturbances cause businesses millions of dollars due to service interruptions and power quality issues. Large utility territories make it difficult to detect and locate faults when power outages occur, leading to longer restoration times, recurring outages, and unhappy […]

Amazon OpenSearch Service Under the Hood: Multi-AZ with Standby

Amazon OpenSearch Service recently announced Multi-AZ with Standby, a new deployment option for managed clusters that enables 99.99% availability and consistent performance for business-critical workloads. With Multi-AZ with Standby, clusters are resilient to infrastructure failures like hardware or networking failure. This option provides improved reliability and the added benefit of simplifying cluster configuration and management […]

Perform secure database write-backs with Amazon QuickSight

Amazon QuickSight is a scalable, serverless, machine learning (ML)-powered business intelligence (BI) solution that makes it easy to connect to your data, create interactive dashboards, get access to ML-enabled insights, and share visuals and dashboards with tens of thousands of internal and external users, either within QuickSight itself or embedded into any application. A write-back […]