AWS Big Data Blog

Category: Analytics

Copy large datasets from Google Cloud Storage to Amazon S3 using Amazon EMR

Many organizations have data sitting in various data sources in a variety of formats. Even though data is a critical component of decision-making, for many organizations this data is spread across multiple public clouds. Organizations are looking for tools that make it easy and cost-effective to copy large datasets across cloud vendors. With Amazon EMR […]

Accelerate large-scale data migration validation using PyDeequ

Many enterprises are migrating their on-premises data stores to the AWS Cloud. During data migration, a key requirement is to validate all the data that has been moved from on premises to the cloud. This data validation is a critical step and if not done correctly, may result in the failure of the entire project. […]

Stream data from relational databases to Amazon Redshift with upserts using AWS Glue streaming jobs

Traditionally, read replicas of relational databases are often used as a data source for non-online transactions of web applications such as reporting, business analysis, ad hoc queries, operational excellence, and customer services. Due to the exponential growth of data volume, it became common practice to replace such read replicas with data warehouses or data lakes […]

Build operational metrics for your enterprise AWS Glue Data Catalog at scale

Over the last several years, enterprises have accumulated massive amounts of data. Data volumes have increased at an unprecedented rate, exploding from terabytes to petabytes and sometimes exabytes of data. Increasingly, many enterprises are building highly scalable, available, secure, and flexible data lakes on AWS that can handle extremely large datasets. After data lakes are […]

Configure single sign-on authentication for Amazon Athena with Azure AD integrated to on-premises AD

Amazon Athena is an interactive query service that makes it easier to analyze data directly in Amazon Simple Storage Service (Amazon S3) using standard SQL. Cloud operation teams can use AWS Identity and Access Management (IAM) federation to centrally manage access to Athena. This simplifies administration by allowing a governing team to control user access […]

Automate Amazon Redshift Cluster management operations using AWS CloudFormation

Amazon Redshift is the fastest and most widely used cloud data warehouse. Tens of thousands of customers run business-critical workloads on Amazon Redshift. Amazon Redshift offers many features that enable you to build scalable, highly performant, cost-effective, and easy-to-manage workloads. For example, you can scale an Amazon Redshift cluster up or down based on your […]

Compare different node types for your workload using Amazon Redshift

This post was last reviewed and updated May, 2022 to include a new section for limitations and additional details around access permissions and security. Amazon Redshift is a fast, fully managed, widely popular cloud data warehouse that offers different node types to meet the flexible workload patterns of our customers. Amazon Redshift RA3 with managed […]

Migrate to an Amazon Redshift Lake House Architecture from Snowflake

The need to derive meaningful and timely insights increases proportionally with the amount of data being collected. Data warehouses play a key role in storing, transforming, and making data easily accessible to enable a wide range of use cases, such as data mining, business intelligence (BI) and reporting, and diagnostics, as well as predictive, prescriptive, […]

Extract, prepare, and analyze Salesforce.com data using Amazon AppFlow, AWS Glue DataBrew, and Amazon Athena

As organizations embark on their data modernization journey, big data analytics and machine learning (ML) use cases are becoming even more integral parts of business. The ease for data preparation and seamless integration with third-party data sources is of paramount importance in order to gain insights quickly and make critical business decisions faster. AWS Glue […]

Create and reuse governed datasets in Amazon QuickSight with new Dataset-as-a-Source feature

Amazon QuickSight is a fast, cloud-powered, business intelligence (BI) service that makes it easy to deliver insights to everyone in your organization. QuickSight recently introduced Dataset-as-a-Source, a new feature that allows data owners to create authoritative datasets that can then be reused and further extended by thousands of users across the enterprise. This post walks […]