AWS Big Data Blog

Tag: Data Lake

Access and manage data from multiple accounts from a central AWS Lake Formation account

his post shows how to access and manage data in multiple accounts from a central AWS Lake Formation account. The walkthrough demonstrates a centralized catalog residing in the master Lake Formation account, with data residing in the different accounts. The post shows how to grant access permissions from the Lake Formation service to read, write and update the catalog and access data in different accounts.

Getting started with AWS Lake Formation

AWS Lake Formation enables you to set up a secure data lake. A data lake is a centralized, curated, and secured repository storing all your structured and unstructured data, at any scale. You can store your data as-is, without having first to structure it. And you can run different types of analytics to better guide […]

Integrate and deduplicate datasets using AWS Lake Formation FindMatches

AWS Lake Formation FindMatches is a new machine learning (ML) transform that enables you to match records across different datasets as well as identify and remove duplicate records, with little to no human intervention. FindMatches is part of Lake Formation, a new AWS service that helps you build a secure data lake in a few simple steps.

To use FindMatches, you don’t have to write code or know how ML works. Your data doesn’t have to include a unique identifier, nor must fields match exactly.

Amazon EMR Migration Guide

Today, we’re introducing the Amazon EMR Migrations Guide (first published June 2019.) This paper is a comprehensive guide to offer sound technical advice to help customers in planning how to move from on-premises big data deployments to EMR.

Build and automate a serverless data lake using an AWS Glue trigger for the Data Catalog and ETL jobs

September 2022: This post was reviewed and updated with latest screenshots and instructions. Today, data is flowing from everywhere, whether it is unstructured data from resources like IoT sensors, application logs, and clickstreams, or structured data from transaction applications, relational databases, and spreadsheets. Data has become a crucial part of every business. This has resulted […]

Build a Data Lake Foundation with AWS Glue and Amazon S3

A data lake is an increasingly popular way to store and analyze data that addresses the challenges of dealing with massive volumes of heterogeneous data. A data lake allows organizations to store all their data—structured and unstructured—in one centralized repository. Because data can be stored as-is, there is no need to convert it to a predefined schema. This post walks you through the process of using AWS Glue to crawl your data on Amazon S3 and build a metadata store that can be used with other AWS offerings.

From Data Lake to Data Warehouse: Enhancing Customer 360 with Amazon Redshift Spectrum

Achieving a 360o-view of your customer has become increasingly challenging as companies embrace omni-channel strategies, engaging customers across websites, mobile, call centers, social media, physical sites, and beyond. The promise of a web where online and physical worlds blend makes understanding your customers more challenging, but also more important. Businesses that are successful in this […]

Building a Real World Evidence Platform on AWS

Deriving insights from large datasets is central to nearly every industry, and life sciences is no exception. To combat the rising cost of bringing drugs to market, pharmaceutical companies are looking for ways to optimize their drug development processes. They are turning to big data analytics to better quantify the effect that their drug compounds […]

New AWS Training: Building a Serverless Data Lake

AWS Training allows you to learn from the experts so that you can advance your knowledge with practical skills and get more out of the AWS Cloud. We are adding one of our most popular event boot camps, Building a Serverless Data Lake, to our permanent instructor-led training portfolio. This one-day course is designed to […]