AWS Big Data Blog

Category: AWS Big Data

Enforce customized data quality rules in AWS Glue DataBrew

GIGO (garbage in, garbage out) is a concept common to computer science and mathematics: the quality of the output is determined by the quality of the input. In modern data architecture, you bring data from different data sources, which creates challenges around volume, velocity, and veracity. You might write unit tests for applications, but it’s […]

Read More
architecture diagram

Create a serverless event-driven workflow to ingest and process Microsoft data with AWS Glue and Amazon EventBridge

Microsoft SharePoint is a document management system for storing files, organizing documents, and sharing and editing documents in collaboration with others. Your organization may want to ingest SharePoint data into your data lake, combine the SharePoint data with other data that’s available in the data lake, and use it for reporting and analytics purposes. AWS […]

Read More

How MEDHOST’s cardiac risk prediction successfully leveraged AWS analytic services

MEDHOST has been providing products and services to healthcare facilities of all types and sizes for over 35 years. Today, more than 1,000 healthcare facilities are partnering with MEDHOST and enhancing their patient care and operational excellence with its integrated clinical and financial EHR solutions. MEDHOST also offers a comprehensive Emergency Department Information System with […]

Read More

Query a Teradata database using Amazon Athena Federated Query and join with data in your Amazon S3 data lake

If you use data lakes in Amazon Simple Storage Service (Amazon S3) and use Teradata as your transactional data store, you may need to join the data in your data lake with Teradata in the cloud, Teradata running on Amazon Elastic Compute Cloud (Amazon EC2), or with an on-premises Teradata database, for example to build […]

Read More

Query Snowflake using Athena Federated Query and join with data in your Amazon S3 data lake

If you use data lakes in Amazon Simple Storage Service (Amazon S3) and use Snowflake as your data warehouse solution, you may need to join your data in your data lake with Snowflake. For example, you may want to build a dashboard by joining historical data in your Amazon S3 data lake and the latest […]

Read More
You can plot the output into a chart using matplotlib.

Analyzing petabytes of trade and quote data with Amazon FinSpace

We recently announced Amazon FinSpace, a fully-managed data management and analytics service that makes it easy to store, catalog, and prepare financial industry data at scale, reducing the time it takes for financial services industry (FSI) customers to find and access all types of financial data for analysis from months to minutes. Financial services organizations […]

Read More
The following graph shows performance improvements measured as total runtime for TPC-DS queries. Amazon EMR 5.31 with EMR runtime has the better (lower) runtime.

Amazon EMR introduces EMR runtime for Presto, providing a 2.6 times speedup

Presto is an open-source distributed SQL query engine for running interactive analytic queries against data sources of all sizes ranging from gigabytes to petabytes. Presto was designed and written from the ground up for interactive analytics, and approaches the speed of commercial data warehouses while scaling to the size of organizations like Facebook. Running Presto […]

Read More
The following diagram shows the overall architecture to address our two challenges.

Extract multidimensional data from Microsoft SQL Server Analysis Services using AWS Glue

AWS Glue is fully managed service that makes it easier for you to extract, transform, and load (ETL) data for analytics. You can easily create ETL jobs to connect to backend data sources. There are several natively supported data sources, but what if you need to extract data from an unsupported data source? What if […]

Read More
The following diagram shows our solution architecture.

Effective data lakes using AWS Lake Formation, Part 2: Creating a governed table for streaming data sources

We announced the general availability of AWS Lake Formation transactions, row-level security, and acceleration at AWS re:Invent 2021. In Part 1 of this series, we explained how to set up a governed table and add objects to it. In this post, we expand on this example, and demonstrate how to ingest streaming data into governed tables […]

Read More
The following graph shows that the minimum throughput achieved with the persistent HFile

Amazon EMR 6.2.0 adds persistent HFile tracking to improve performance with HBase on Amazon S3

Apache HBase is an open-source, NoSQL database that you can use to achieve low latency random access to billions of rows. Starting with Amazon EMR 5.2.0, you can enable HBase on Amazon Simple Storage Service (Amazon S3). With HBase on Amazon S3, the HBase data files (HFiles) are written to Amazon S3, enabling data lake […]

Read More