AWS Big Data Blog

Category: Analytics

Entity resolution and fuzzy matches in AWS Glue using the Zingg open source library

In this post, we explore how to use Zingg’s entity resolution capabilities within an AWS Glue notebook, which you can later run as an extract, transform, and load (ETL) job. By integrating Zingg in your notebooks or ETL jobs, you can effectively address data governance challenges and provide consistent and accurate data across your organization.

Introducing blueprint discovery and other UI enhancements for Amazon OpenSearch Ingestion

Amazon OpenSearch Ingestion is a fully managed serverless pipeline that allows you to ingest, filter, transform, enrich, and route data to an Amazon OpenSearch Service domain or Amazon OpenSearch Serverless collection. OpenSearch Ingestion is capable of ingesting data from a wide variety of sources and has a rich ecosystem of built-in processors to take care […]

Use AWS Data Exchange to seamlessly share Apache Hudi datasets

Apache Hudi was originally developed by Uber in 2016 to bring to life a transactional data lake that could quickly and reliably absorb updates to support the massive growth of the company’s ride-sharing platform. Apache Hudi is now widely used to build very large-scale data lakes by many across the industry. Today, Hudi is the […]


AVB accelerates search in LINQ with Amazon OpenSearch Service

AVB Marketing delivers custom digital solutions for their members across a wide range of products. LINQ, AVB’s proprietary product information management system, empowers their appliance, consumer electronics, and furniture retailer members to streamline the management of their product catalog. In this post, we share how AVB reduced their average search time from 3 seconds to 300 milliseconds in LINQ by adopting Amazon OpenSearch Service while processing 14.5 million record updates daily.

Understanding Apache Iceberg on AWS with the new technical guide

We’re excited to announce the launch of the Apache Iceberg on AWS technical guide. Whether you are new to Apache Iceberg on AWS or already running production workloads on AWS, this comprehensive technical guide offers detailed guidance, from foundational concepts to advanced optimizations, to help you build your transactional data lake with Apache Iceberg on AWS.

Amazon DocumentDB zero-ETL integration with Amazon OpenSearch Service is now available

Today, we are announcing the general availability of Amazon DocumentDB (with MongoDB compatibility) zero-ETL integration with Amazon OpenSearch Service. Amazon DocumentDB provides native text search and vector search capabilities. With Amazon OpenSearch Service, you can perform advanced search analytics, such as fuzzy search, synonym search, cross-collection search, and multilingual search, on Amazon DocumentDB data. Zero-ETL […]

Safely remove Kafka brokers from Amazon MSK provisioned clusters

Today, we are announcing broker removal capability for Amazon Managed Streaming for Apache Kafka (Amazon MSK) provisioned clusters, which lets you remove multiple brokers from your provisioned clusters. You can now reduce your cluster’s storage and compute capacity by removing sets of brokers, with no availability impact, data durability risk, or disruption to your data streaming […]

Figure 1 – Map built with CARTO Builder and the native support to visualize H3 indexes

Breaking barriers in geospatial: Amazon Redshift, CARTO, and H3

In this post, we discuss how Amazon Redshift spatial index functions based on the Hexagonal hierarchical geospatial indexing system (H3) can be used to represent spatial data for fast spatial lookups at scale. Navigating the vast landscape of data-driven insights has always been an exciting endeavor. As technology continues to evolve, one specific facet of this journey is reaching unprecedented proportions: geospatial data.

Achieve peak performance and boost scalability using multiple Amazon Redshift serverless workgroups and Network Load Balancer

As data analytics use cases grow, factors of scalability and concurrency become crucial for businesses. Your analytic solution architecture should be able to handle large data volumes at high concurrency and without compromising speed, thereby delivering a scalable high-performance analytics environment. Amazon Redshift Serverless provides a fully managed, petabyte-scale, auto scaling cloud data warehouse to […]

Use AWS Glue Data Catalog views to analyze data

In this post, we show you how to use the new views feature in the AWS Glue Data Catalog. SQL views are a powerful object used across relational databases. You can use views to decrease the time to insights of data by tailoring the data that is queried. Additionally, you can use the power of SQL […]