AWS Big Data Blog

Category: Amazon SageMaker

Provide data reliability in Amazon Redshift at scale using Great Expectations library

Ensuring data reliability is one of the key objectives of maintaining data integrity and is crucial for building data trust across an organization. Data reliability means that the data is complete and accurate. It’s the catalyst for delivering trusted data analytics and insights. Incomplete or inaccurate data leads business leaders and data analysts to make […]

Read More

WeatherBug reduced ETL latency to 30 times faster using Amazon Redshift Spectrum

This post is co-written with data engineers, Anton Morozov and James Phillips, from Weatherbug. WeatherBug is a brand owned by GroundTruth, based in New York City, that provides location-based advertising solutions to businesses. WeatherBug consists of a mobile app reporting live and forecast data on hyperlocal weather to consumer users. The WeatherBug Data Engineering team […]

Read More

How MEDHOST’s cardiac risk prediction successfully leveraged AWS analytic services

MEDHOST has been providing products and services to healthcare facilities of all types and sizes for over 35 years. Today, more than 1,000 healthcare facilities are partnering with MEDHOST and enhancing their patient care and operational excellence with its integrated clinical and financial EHR solutions. MEDHOST also offers a comprehensive Emergency Department Information System with […]

Read More

How Imperva uses Amazon Athena for machine learning botnets detection

This is a guest post by Ori Nakar, Principal Engineer at Imperva. In their own words, “Imperva is a large cyber security company and an AWS Partner Network (APN) Advanced Technology Partner, who protects web applications and data assets. Imperva protects over 6,200 enterprises worldwide and many of them use Imperva Web Application Firewall (WAF) […]

Read More

Effective data lakes using AWS Lake Formation, Part 3: Using ACID transactions on governed tables

Data lakes on Amazon Simple Storage Service (Amazon S3) have become the default repository for all enterprise data and serve as common choice for a large number of users querying from a variety of analytics and ML tools. Often times you want to ingest data continuously into the data lake from multiple sources and query against the […]

Read More
Let’s look at PyDeequ’s main components, and how they relate to Deequ (shown in the following diagram)

Testing data quality at scale with PyDeequ

You generally write unit tests for your code, but do you also test your data? Incoming data quality can make or break your application. Incorrect, missing, or malformed data can have a large impact on production systems. Examples of data quality issues include the following: Missing values can lead to failures in production system that […]

Read More

Optimize Python ETL by extending Pandas with AWS Data Wrangler

Developing extract, transform, and load (ETL) data pipelines is one of the most time-consuming steps to keep data lakes, data warehouses, and databases up to date and ready to provide business insights. You can categorize these pipelines into distributed and non-distributed, and the choice of one or the other depends on the amount of data […]

Read More

Exploring the public AWS COVID-19 data lake

This post walks you through accessing the AWS COVID-19 data lake through the AWS Glue Data Catalog via Amazon SageMaker or Jupyter and using the open-source AWS Data Wrangler library. AWS Data Wrangler is an open-source Python package that extends the power of Pandas library to AWS and connects DataFrames and AWS data-related services (such as Amazon Redshift, Amazon S3, AWS Glue, Amazon Athena, and Amazon EMR). For more information about what you can build by using this data lake, see the associated public Jupyter notebook on GitHub.

Read More

Build machine learning-powered business intelligence analyses using Amazon QuickSight

Imagine you can see the future—to know how many customers will order your product months ahead of time so you can make adequate provisions, or to know how many of your employees will leave your organization several months in advance so you can take preemptive actions to encourage staff retention. For an organization that sees […]

Read More

Provisioning the Intuit Data Lake with Amazon EMR, Amazon SageMaker, and AWS Service Catalog

This post outlines the approach taken by Intuit, though it is important to remember that there are many ways to build a data lake (for example, AWS Lake Formation). We’ll cover the technologies and processes involved in creating the Intuit Data Lake at a high level, including the overall structure and the automation used in provisioning accounts and resources. Watch this space in the future for more detailed blog posts on specific aspects of the system, from the other teams and engineers who worked together to build the Intuit Data Lake.

Read More