AWS Big Data Blog

Cybersecurity Awareness Month: Learn about the job zero of securing your data using Amazon Redshift

Amazon Redshift is a fast, petabyte-scale cloud data warehouse delivering the best price-performance. It allows you to run complex analytic queries against terabytes to petabytes of structured and semi-structured data, using sophisticated query optimization, columnar storage on high-performance storage, and massively parallel query execution. At AWS, we embrace the culture that security is job zero, by […]

Copy large datasets from Google Cloud Storage to Amazon S3 using Amazon EMR

You can migrate data between Google Cloud Storage (GCS) and Amazon S3 by using Hadoop's native support for S3 object storage together with a Google-provided Hadoop connector for GCS. This post demonstrates how to configure an EMR cluster for DistCp and S3DistCp, goes over the settings and parameters for both tools, copies a 9.4 TB test dataset, and compares the copy performance of the two tools.
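
As a rough illustration of the two tools mentioned above, the sketch below builds the command lines that would be run on the EMR cluster. It is a minimal sketch, not the post's actual configuration: the helper names and bucket paths are hypothetical, the `s3://` scheme assumes EMRFS on the cluster, and it assumes the GCS connector is already installed so that `gs://` paths resolve. Only the basic `-m` option of DistCp and the `--src`/`--dest` options of S3DistCp are used.

```python
import shlex


def distcp_command(src, dest, max_maps=None):
    """Build a `hadoop distcp` invocation as an argument list.

    `max_maps` maps to DistCp's -m option (maximum simultaneous copies).
    """
    cmd = ["hadoop", "distcp"]
    if max_maps is not None:
        cmd += ["-m", str(max_maps)]
    return cmd + [src, dest]


def s3distcp_command(src, dest):
    """Build an `s3-dist-cp` invocation as an argument list."""
    return ["s3-dist-cp", f"--src={src}", f"--dest={dest}"]


if __name__ == "__main__":
    # Hypothetical bucket names, for illustration only.
    source = "gs://example-gcs-bucket/dataset/"
    target = "s3://example-s3-bucket/dataset/"
    print(shlex.join(distcp_command(source, target, max_maps=64)))
    print(shlex.join(s3distcp_command(source, target)))
```

Building the commands as lists keeps them safe to pass to `subprocess.run` or to submit as an EMR step without shell-quoting problems.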

Automate building an integrated analytics solution with AWS Analytics Automation Toolkit

This blog post was last reviewed and updated in July 2022 to be consistent with the new menu interface launched by the AWS Analytics Automation Toolkit. Amazon Redshift is a fast, fully managed, widely popular cloud data warehouse that powers the modern data architecture enabling fast and deep insights or machine learning (ML) predictions using SQL […]

Accelerate large-scale data migration validation using PyDeequ

March 2023: You can now use AWS Glue Data Quality to measure and manage the quality of your data. AWS Glue Data Quality is built on Deequ, and it offers a simplified user experience for customers who want to use this open-source package. Refer to the blog and documentation for additional details. Many enterprises are migrating their […]

Stream data from relational databases to Amazon Redshift with upserts using AWS Glue streaming jobs

Traditionally, read replicas of relational databases are often used as a data source for non-online transactions of web applications such as reporting, business analysis, ad hoc queries, operational excellence, and customer services. Due to the exponential growth of data volume, it became common practice to replace such read replicas with data warehouses or data lakes […]

Build operational metrics for your enterprise AWS Glue Data Catalog at scale

Over the last several years, enterprises have accumulated massive amounts of data. Data volumes have increased at an unprecedented rate, exploding from terabytes to petabytes and sometimes exabytes of data. Increasingly, many enterprises are building highly scalable, available, secure, and flexible data lakes on AWS that can handle extremely large datasets. After data lakes are […]

Configure single sign-on authentication for Amazon Athena with Azure AD integrated to on-premises AD

Amazon Athena is an interactive query service that makes it easier to analyze data directly in Amazon Simple Storage Service (Amazon S3) using standard SQL. Cloud operation teams can use AWS Identity and Access Management (IAM) federation to centrally manage access to Athena. This simplifies administration by allowing a governing team to control user access […]

Automate Amazon Redshift Cluster management operations using AWS CloudFormation

Amazon Redshift is a fast, petabyte-scale cloud data warehouse delivering the best price-performance. Tens of thousands of customers run business-critical workloads on Amazon Redshift. Amazon Redshift offers many features that enable you to build scalable, highly performant, cost-effective, and easy-to-manage workloads. For example, you can scale an Amazon Redshift cluster up or down based on […]

Compare different node types for your workload using Amazon Redshift

February 2023: This post was reviewed and updated to include support for Amazon Redshift Serverless. The latest release of the Amazon Redshift Node Configuration Comparison utility now supports Amazon Redshift Serverless to test your workload performance. If you want to either explore different Amazon Redshift Serverless configurations or a combination of Amazon Redshift Provisioned and Serverless configurations based […]

Migrate to an Amazon Redshift Lake House Architecture from Snowflake

The need to derive meaningful and timely insights increases proportionally with the amount of data being collected. Data warehouses play a key role in storing, transforming, and making data easily accessible to enable a wide range of use cases, such as data mining, business intelligence (BI) and reporting, and diagnostics, as well as predictive, prescriptive, […]