AWS Big Data Blog

How to delete user data in an AWS data lake

General Data Protection Regulation (GDPR) is an important aspect of today’s technology world, and processing data in compliance with GDPR is a necessity for those who implement solutions within the AWS public cloud. One article of GDPR is the “right to erasure” or “right to be forgotten” which may require you to implement a solution […]

Read More

Streaming data from Amazon S3 to Amazon Kinesis Data Streams using AWS DMS

Stream processing is very useful in use cases where we need to detect a problem quickly and improve the outcome based on data, for example production line monitoring or supply chain optimizations. This blog post walks you through process of streaming existing data files and ongoing changes from Amazon Simple Storage Service (Amazon S3) to […]

Read More

Using the Amazon Redshift Data API to interact with Amazon Redshift clusters

This post was updated on July 28, 2021, to include multi-statement and parameterization support. Amazon Redshift is a fast, scalable, secure, and fully managed cloud data warehouse that makes it simple and cost-effective to analyze all your data using standard SQL and your existing ETL (extract, transform, and load), business intelligence (BI), and reporting tools. […]

Read More

Analyzing Amazon S3 server access logs using Amazon ES

September 8, 2021: Amazon Elasticsearch Service has been renamed to Amazon OpenSearch Service. See details. When you use Amazon Simple Storage Service (Amazon S3) to store corporate data and host websites, you need additional logging to monitor access to your data and the performance of your application. An effective logging solution enhances security and improves […]

Read More

Implementing LDAP authentication for Hive on a multi-tenant Amazon EMR cluster

As Amazon EMR continues its widespread adoption, it’s important to enforce separation of duties using role-based access when submitting your hive jobs on EMR clusters in multi-tenant environments. In this post, we walk through the steps to set up authentication for Hive using Lightweight Directory Access Protocol (LDAP) and Microsoft Active Directory Domain Controller. Solution […]

Read More

Enhanced monitoring and automatic scaling for Apache Flink

Thousands of developers use Apache Flink to build streaming applications to transform and analyze data in real time. Apache Flink is an open-source framework and engine for processing data streams. It’s highly available and scalable, delivering high throughput and low latency for the most demanding stream-processing applications. Monitoring and scaling your applications is critical to […]

Read More

Stream CDC into an Amazon S3 data lake in Parquet format with AWS DMS

Most organizations generate data in real time and ever-increasing volumes. Data is captured from a variety of sources, such as transactional and reporting databases, application logs, customer-facing websites, and external feeds. Companies want to capture, transform, and analyze this time-sensitive data to improve customer experiences, increase efficiency, and drive innovations. With increased data volume and […]

Read More

Amazon EMR supports Apache Hive ACID transactions

Apache Hive is an open-source data warehouse package that runs on top of an Apache Hadoop cluster. You can use Hive for batch processing and large-scale data analysis. Hive uses Hive Query Language (HiveQL), which is similar to SQL. ACID (atomicity, consistency, isolation, and durability) properties make sure that the transactions in a database are […]

Read More

Zoopla drives KPIs with centralized data using Fivetran ELT for Amazon Redshift

This is a guest post by Steven Collings, Senior Data Consultant at Zoopla Zoopla is a property website that enables users to find residential or commercial property to buy or rent in the UK and overseas. Since acquiring Property Software Group and Expert Agent, we also offer a backend software that agents can use to […]

Read More