AWS Big Data Blog

Category: Analytics

Query hierarchical data models within Amazon Redshift

In a hierarchical database model, information is stored in a tree-like structure or parent-child structure, where each record can have a single parent but many children. Hierarchical databases are useful when you need to represent data in a tree-like hierarchy. The perfect example of a hierarchical data model is the navigation file and folders or […]

Now Available: Updated guidance on the Data Analytics Lens for AWS Well-Architected Framework

Nearly all businesses today require some form of data analytics processing, from auditing user access to generating sales reports. For all your analytics needs, the Data Analytics Lens for AWS Well-Architected Framework provides prescriptive guidance to help you assess your workloads and identify best practices aligned to the AWS Well-Architected Pillars: Operational Excellence, Security, Reliability, […]

Cybersecurity Awareness Month: Learn about the job zero of securing your data using Amazon Redshift

Amazon Redshift is a fast, petabyte-scale cloud data warehouse delivering the best price-performance. It allows you to run complex analytic queries against terabytes to petabytes of structured and semi-structured data, using sophisticated query optimization, columnar on high-performance storage, and massively parallel query execution. At AWS, we embrace the culture that security is job zero, by […]

Copy large datasets from Google Cloud Storage to Amazon S3 using Amazon EMR

Data migration between GCS and Amazon S3 is possible by utilizing Hadoop’s native support for S3 object storage and using a Google-provided Hadoop connector for GCS. This post demonstrates how to configure an EMR cluster for DistCp and S3DistCP, goes over the settings and parameters for both tools, performs a copy of a test 9.4 TB dataset, and compares the performance of the copy.

Accelerate large-scale data migration validation using PyDeequ

March 2023: You can now use AWS Glue Data Quality to measure and manage the quality of your data. AWS Glue Data Quality is built on DeeQu and it offers a simplified user experience for customers who want to this open-source package. Refer to the blog and documentation for additional details. Many enterprises are migrating their […]

Stream data from relational databases to Amazon Redshift with upserts using AWS Glue streaming jobs

Traditionally, read replicas of relational databases are often used as a data source for non-online transactions of web applications such as reporting, business analysis, ad hoc queries, operational excellence, and customer services. Due to the exponential growth of data volume, it became common practice to replace such read replicas with data warehouses or data lakes […]

Build operational metrics for your enterprise AWS Glue Data Catalog at scale

Over the last several years, enterprises have accumulated massive amounts of data. Data volumes have increased at an unprecedented rate, exploding from terabytes to petabytes and sometimes exabytes of data. Increasingly, many enterprises are building highly scalable, available, secure, and flexible data lakes on AWS that can handle extremely large datasets. After data lakes are […]

Configure single sign-on authentication for Amazon Athena with Azure AD integrated to on-premises AD

Amazon Athena is an interactive query service that makes it easier to analyze data directly in Amazon Simple Storage Service (Amazon S3) using standard SQL. Cloud operation teams can use AWS Identity and Access Management (IAM) federation to centrally manage access to Athena. This simplifies administration by allowing a governing team to control user access […]

Automate Amazon Redshift Cluster management operations using AWS CloudFormation

Amazon Redshift is a fast, petabyte-scale cloud data warehouse delivering the best price-performance. Tens of thousands of customers run business-critical workloads on Amazon Redshift. Amazon Redshift offers many features that enable you to build scalable, highly performant, cost-effective, and easy-to-manage workloads. For example, you can scale an Amazon Redshift cluster up or down based on […]

Compare different node types for your workload using Amazon Redshift

February 2023: This post was reviewed and updated to include support for Amazon Redshift Serverless. The Amazon Redshift Node Configuration Comparison utility latest release now supports Amazon Redshift Serverless to test your workload performance. If you want to either explore different Amazon Redshift Serverless configurations or combination of Amazon Redshift Provisioned and Serverless configurations based […]