AWS Big Data Blog
Category: Learning Levels
Getting started with AWS Glue Data Quality from the AWS Glue Data Catalog
AWS Glue is a serverless data integration service that makes it simple to discover, prepare, and combine data for analytics, machine learning (ML), and application development. You can use AWS Glue to create, run, and monitor data integration and ETL (extract, transform, and load) pipelines and catalog your assets across multiple data stores. Hundreds of […]
Deep dive on Amazon MSK tiered storage
In the first post of the series, we described some core concepts of Apache Kafka cluster sizing, the best practices for optimizing the performance, and the cost of your Kafka workload. This post explains how the underlying infrastructure affects Kafka performance when you use Amazon Managed Streaming for Apache Kafka (Amazon MSK) tiered storage. We […]
Introducing in-place version upgrades with Amazon MWAA
Today, AWS is announcing the availability of in-place version upgrades for Amazon Managed Workflow for Apache Airflow (Amazon MWAA). This enhancement allows you to seamlessly upgrade your existing Apache Airflow version 2.x environments to newer available versions while retaining the workflow run history and environment configurations. You can now take advantage of the latest capabilities […]
Advanced patterns with AWS SDK for pandas on AWS Glue for Ray
September 2023: This post was reviewed and updated with a new dataset and related code blocks and images. AWS SDK for pandas is a popular Python library among data scientists, data engineers, and developers. It simplifies interaction between AWS data and analytics services and pandas DataFrames. It allows easy integration and data movement between 22 […]
Enable complex row-level security in embedded dashboards for non-provisioned users in Amazon QuickSight with OR-based tags
Amazon QuickSight is a fully managed, cloud-native business intelligence (BI) service that makes it easy to connect to your data, create interactive dashboards, and share these with tens of thousands of users, both within QuickSight and embedded in your software as a service (SaaS) applications. QuickSight Enterprise edition started supporting nested conditions within row-level security […]
Migrate from Google BigQuery to Amazon Redshift using AWS Glue and Custom Auto Loader Framework
Amazon Redshift is a widely used, fully managed, petabyte-scale cloud data warehouse. Tens of thousands of customers use Amazon Redshift to process exabytes of data every day to power their analytic workloads. Customers are looking for tools that make it easier to migrate from other data warehouses, such as Google BigQuery, to Amazon Redshift to […]
Real-time inference using deep learning within Amazon Managed Service for Apache Flink
August 30, 2023: Amazon Kinesis Data Analytics has been renamed to Amazon Managed Service for Apache Flink. Read the announcement in the AWS News Blog and learn more. Apache Flink is a framework and distributed processing engine for stateful computations over data streams. Amazon Managed Service for Apache Flink is a fully managed service that […]
Configure Amazon OpenSearch Service for high availability
September 2025: This post was reviewed for accuracy. Amazon OpenSearch Service is a fully open-source search and analytics engine that securely unlocks real-time search, monitoring, and analysis of business and operational data for use cases like recommendation engines, ecommerce sites, and catalog search. To be successful in your business, you need your systems to be […]
Join a streaming data source with CDC data for real-time serverless data analytics using AWS Glue, AWS DMS, and Amazon DynamoDB
Customers have been using data warehousing solutions to perform their traditional analytics tasks. Recently, data lakes have gained lot of traction to become the foundation for analytical solutions, because they come with benefits such as scalability, fault tolerance, and support for structured, semi-structured, and unstructured datasets. Data lakes are not transactional by default; however, there […]
Automate alerting and reporting for AWS Glue job resource usage
Data transformation plays a pivotal role in providing the necessary data insights for businesses in any organization, small and large. To gain these insights, customers often perform ETL (extract, transform, and load) jobs from their source systems and output an enriched dataset. Many organizations today are using AWS Glue to build ETL pipelines that bring data […]









