AWS Big Data Blog

Introducing Amazon S3 shuffle in AWS Glue

Nov 2022: Newer version of the product is now available to be used for this post. AWS Glue is a serverless data integration service that makes it easy to discover, prepare, and combine data for analytics, machine learning (ML), and application development. In AWS Glue, you can use Apache Spark, which is an open-source, distributed processing […]

Integrate AWS Glue DataBrew and Amazon PinPoint to launch marketing campaigns

Marketing teams often rely on data engineers to provide a consumer dataset that they can use to launch marketing campaigns. This can sometimes cause delays in launching campaigns and consume data engineers’ bandwidth. The campaigns are often launched using complex solutions that are either code heavy or using licensed tools. The processes of both extract, […]

TrueBlue uses Amazon QuickSight to deliver more accurate pricing and grow business

This is a guest post by TrueBlue. In their own words, “Founded in 1989, TrueBlue provides specialized workforce solutions, including staffing, talent management, and recruitment process outsourcing (RPO). In 2020, the company connected approximately 490,000 people with work.” At TrueBlue, we offer solutions that help employers connect with workers worldwide. Every day, sales teams at […]

Query hierarchical data models within Amazon Redshift

In a hierarchical database model, information is stored in a tree-like structure or parent-child structure, where each record can have a single parent but many children. Hierarchical databases are useful when you need to represent data in a tree-like hierarchy. The perfect example of a hierarchical data model is the navigation file and folders or […]

Now Available: Updated guidance on the Data Analytics Lens for AWS Well-Architected Framework

Nearly all businesses today require some form of data analytics processing, from auditing user access to generating sales reports. For all your analytics needs, the Data Analytics Lens for AWS Well-Architected Framework provides prescriptive guidance to help you assess your workloads and identify best practices aligned to the AWS Well-Architected Pillars: Operational Excellence, Security, Reliability, […]

Cybersecurity Awareness Month: Learn about the job zero of securing your data using Amazon Redshift

Amazon Redshift is a fast, petabyte-scale cloud data warehouse delivering the best price-performance. It allows you to run complex analytic queries against terabytes to petabytes of structured and semi-structured data, using sophisticated query optimization, columnar on high-performance storage, and massively parallel query execution. At AWS, we embrace the culture that security is job zero, by […]

Copy large datasets from Google Cloud Storage to Amazon S3 using Amazon EMR

Many organizations have data sitting in various data sources in a variety of formats. Even though data is a critical component of decision-making, for many organizations this data is spread across multiple public clouds. Organizations are looking for tools that make it easy and cost-effective to copy large datasets across cloud vendors. With Amazon EMR […]

Automate building an integrated analytics solution with AWS Analytics Automation Toolkit

This blog post was last reviewed and updated July 2022, to be consistent with the new menu interface launched by the AWS Analytics Automation Toolkit. Amazon Redshift is a fast, fully managed, widely popular cloud data warehouse that powers the modern data architecture enabling fast and deep insights or machine learning (ML) predictions using SQL […]

Accelerate large-scale data migration validation using PyDeequ

Many enterprises are migrating their on-premises data stores to the AWS Cloud. During data migration, a key requirement is to validate all the data that has been moved from on premises to the cloud. This data validation is a critical step and if not done correctly, may result in the failure of the entire project. […]

Stream data from relational databases to Amazon Redshift with upserts using AWS Glue streaming jobs

Traditionally, read replicas of relational databases are often used as a data source for non-online transactions of web applications such as reporting, business analysis, ad hoc queries, operational excellence, and customer services. Due to the exponential growth of data volume, it became common practice to replace such read replicas with data warehouses or data lakes […]