AWS Big Data Blog

Category: Analytics

Estimating scoring probabilities by preparing soccer matches data with AWS Glue DataBrew

In soccer (or football outside of the US), players decide to take shots when they think they can score. But how do they make that determination vs. when to pass or dribble? In a fraction of a second, in motion, while chased from multiple directions by other professional athletes, they think about their distance from […]

Read More

Orchestrating an AWS Glue DataBrew job and Amazon Athena query with AWS Step Functions

As data volumes grow across the industry, big data analytics is becoming a common requirement in data analytics and machine learning (ML) use cases. Also, as we start building complex data engineering or data analytics pipelines, we look for a simpler orchestration mechanism with graphical user interface-based ETL (extract, transform, load) tools. Recently, AWS […]

Read More

The best new features for data analysts in Amazon Redshift in 2020

This is a guest post by Helen Anderson, data analyst and AWS Data Hero. Every year, the Amazon Redshift team launches new and exciting features, and 2020 was no exception. New features to improve the data warehouse service and add interoperability with other AWS services were rolling out all year. I am part of a […]

Read More

Building a real-time notification system with Amazon Kinesis Data Streams for Amazon DynamoDB and Amazon Kinesis Data Analytics for Apache Flink

Amazon DynamoDB helps you capture high-velocity data such as clickstream data to form customized user profiles and Internet of Things (IoT) data so that you can develop insights on sensor activity across various industries, including smart spaces, connected factories, smart packing, fitness monitoring, and more. It’s important to store these data points in a centralized […]

Read More

Accessing and visualizing data from multiple data sources with Amazon Athena and Amazon QuickSight

Amazon Athena now supports federated query, a feature that allows you to query data in sources other than Amazon Simple Storage Service (Amazon S3). You can use federated queries in Athena to query the data in place or build pipelines that extract data from multiple data sources and store them in Amazon S3. With Athena […]

Read More

Testing data quality at scale with PyDeequ

You generally write unit tests for your code, but do you also test your data? Incoming data quality can make or break your application. Incorrect, missing, or malformed data can have a large impact on production systems. Examples of data quality issues include the following: Missing values can lead to failures in production systems that […]

Read More

Running queries securely from the same VPC where an Amazon Redshift cluster is running

Customers who don’t need to set up a VPN or a private connection to AWS often use public endpoints to access AWS. Although this is acceptable for testing out the services, most production workloads need a secure connection to their VPC on AWS. If you’re running your production data warehouse on Amazon Redshift, you can […]

Read More

Building a serverless data quality and analysis framework with Deequ and AWS Glue

With ever-increasing amounts of data at their disposal, large organizations struggle to cope with not only the volume but also the quality of the data they manage. Indeed, alongside volume and velocity, veracity is an equally critical issue in data analysis, often seen as a precondition to analyzing data and guaranteeing its value. High-quality data […]

Read More

7 most common data preparation transformations in AWS Glue DataBrew

For all analytics and ML modeling use cases, data analysts and data scientists spend the bulk of their time running data preparation tasks manually to get clean, formatted data that meets their needs. We ran a survey among data scientists and data analysts to understand the most frequently used transformations in their data […]

Read More

Scheduling SQL queries on your Amazon Redshift data warehouse

Amazon Redshift is the most popular cloud data warehouse today, with tens of thousands of customers collectively processing over 2 exabytes of data on Amazon Redshift daily. Amazon Redshift is fully managed, scalable, secure, and integrates seamlessly with your data lake. In this post, we discuss how to set up and use the new query […]

Read More