AWS Big Data Blog

Category: Analytics

Use HyperLogLog for trend analysis with Amazon Redshift

Amazon Redshift is a fast, scalable, secure, and fully managed cloud data warehouse that makes it simple and cost-effective to analyze all your data using standard SQL. Amazon Redshift offers up to three times better price performance than any other cloud data warehouse. Tens of thousands of customers use Amazon Redshift to process exabytes of […]

Monitor data quality in your data lake using PyDeequ and AWS Glue

March 2023: You can now use AWS Glue Data Quality to measure and manage the quality of your data. AWS Glue Data Quality is built on Deequ and offers a simplified user experience for customers who want to use this open-source package. Refer to the blog and documentation for additional details. In our previous post, we […]

Effective data lakes using AWS Lake Formation, Part 3: Using ACID transactions on governed tables

February 2023: The content of this blog post can now be found in the AWS Lake Formation public documentation. Please refer to it instead. Data lakes on Amazon Simple Storage Service (Amazon S3) have become the default repository for all enterprise data and serve as a common choice for a large number of users querying from a variety […]

Use Grok patterns in AWS Glue to process streaming data into Amazon Elasticsearch Service

September 8, 2021: Amazon Elasticsearch Service has been renamed to Amazon OpenSearch Service. See details. Recently, we launched AWS Glue custom connectors for Amazon OpenSearch Service, which provide the capability to ingest data into Amazon OpenSearch Service with just a few clicks. You can now use Amazon OpenSearch Service as a data store for your […]

How OrthoFi delivers better insights for customers with Amazon Redshift and AWS Glue

This is a guest post by Christa Pierson and Jon Fearer at OrthoFi. OrthoFi is an orthodontic industry leader in revenue cycle management (RCM), and has partnered with more than 550 orthodontic practices across the country, delivering an end-to-end platform that enables orthodontists to bring on more patients and run their businesses more effectively. To […]

How Digital Infuzion solves the challenge of large-scale scientific data collaboration with Amazon QuickSight

This is a guest post by Digital Infuzion. In their own words, “Digital Infuzion (DIFZ), a leader in information technology, helps solve complex challenges related to genomics, health, and biomedical data, while collaborating with partners including the J. Craig Venter Institute, Gryphon Scientific, ICF International, and others engaged in scientific research. Together, we create novel […]

Orchestrate AWS Glue DataBrew jobs using Amazon Managed Workflows for Apache Airflow

As the industry grows with more data volume, big data analytics is becoming a common requirement in data analytics and machine learning (ML) use cases. Analysts are building complex data transformation pipelines that include multiple steps for data preparation and cleansing. However, analysts may want a simpler orchestration mechanism with a graphical user interface that […]

Enrich your data stream asynchronously using Amazon Kinesis Data Analytics for Apache Flink

August 30, 2023: Amazon Kinesis Data Analytics has been renamed to Amazon Managed Service for Apache Flink. Read the announcement in the AWS News Blog and learn more. Streaming data into or out of a data system must be fast. One of the most expensive pieces of any streaming system is the I/O of the […]

The following graph shows performance improvements measured as total runtime for TPC-DS queries. Amazon EMR 5.31 with EMR runtime has the better (lower) runtime.

Amazon EMR introduces EMR runtime for Presto, providing a 2.6 times speedup

Presto is an open-source distributed SQL query engine for running interactive analytic queries against data sources of all sizes ranging from gigabytes to petabytes. Presto was designed and written from the ground up for interactive analytics, and approaches the speed of commercial data warehouses while scaling to the size of organizations like Facebook. Running Presto […]

Amazon Redshift announces general availability of support for JSON and semi-structured data processing

At AWS re:Invent 2020, we announced the preview of native support for JSON and semi-structured data in Amazon Redshift. This includes a new data type, SUPER, which allows you to store JSON and other semi-structured data in Amazon Redshift tables, and support for the PartiQL query language, which allows you to seamlessly query and process […]