AWS Big Data Blog

Using the Amazon Redshift Data API to interact with Amazon Redshift clusters

This post was updated on July 28, 2021, to include multi-statement and parameterization support. Amazon Redshift is a fast, scalable, secure, and fully managed cloud data warehouse that makes it simple and cost-effective to analyze all your data using standard SQL and your existing ETL (extract, transform, and load), business intelligence (BI), and reporting tools. […]

Analyzing Amazon S3 server access logs using Amazon OpenSearch Service

This blog post was last reviewed and updated April, 2022. When you use Amazon Simple Storage Service (Amazon S3) to store corporate data and host websites, you need additional logging to monitor access to your data and the performance of your application. An effective logging solution enhances security and improves the detection of security incidents. […]

Implementing LDAP authentication for Hive on a multi-tenant Amazon EMR cluster

As Amazon EMR continues its widespread adoption, it’s important to enforce separation of duties using role-based access when submitting your hive jobs on EMR clusters in multi-tenant environments. In this post, we walk through the steps to set up authentication for Hive using Lightweight Directory Access Protocol (LDAP) and Microsoft Active Directory Domain Controller. Solution […]

Enhanced monitoring and automatic scaling for Apache Flink

Thousands of developers use Apache Flink to build streaming applications to transform and analyze data in real time. Apache Flink is an open-source framework and engine for processing data streams. It’s highly available and scalable, delivering high throughput and low latency for the most demanding stream-processing applications. Monitoring and scaling your applications is critical to […]

Stream CDC into an Amazon S3 data lake in Parquet format with AWS DMS

Most organizations generate data in real time and ever-increasing volumes. Data is captured from a variety of sources, such as transactional and reporting databases, application logs, customer-facing websites, and external feeds. Companies want to capture, transform, and analyze this time-sensitive data to improve customer experiences, increase efficiency, and drive innovations. With increased data volume and […]

Amazon EMR supports Apache Hive ACID transactions

December 2022: The best practice of using EMRFS consistent in this blog post is now obsolete as Amazon S3 has supported strong read-after-write consistency since December, 2020.  Apache Hive is an open-source data warehouse package that runs on top of an Apache Hadoop cluster. You can use Hive for batch processing and large-scale data analysis. […]

Zoopla drives KPIs with centralized data using Fivetran ELT for Amazon Redshift

This is a guest post by Steven Collings, Senior Data Consultant at Zoopla Zoopla is a property website that enables users to find residential or commercial property to buy or rent in the UK and overseas. Since acquiring Property Software Group and Expert Agent, we also offer a backend software that agents can use to […]

Fast and predictable performance with serverless compilation using Amazon Redshift

Amazon Redshift is a fast, fully managed cloud data warehouse that makes it simple and cost-effective to analyze all your data using standard SQL and your existing business intelligence (BI) tools. Customers tell us that they want extremely fast query response times so they can make equally fast decisions. This post presents the recently launched, […]

Power data analytics, monitoring, and search use cases with the Open Distro for Elasticsearch SQL Engine on Amazon ES

September 8, 2021: Amazon Elasticsearch Service has been renamed to Amazon OpenSearch Service. See details. Amazon OpenSearch Service is a popular choice for log analytics, search, real-time application monitoring, clickstream analysis, and more. One commonality among these use cases is the need to write and run queries to obtain search results at lightning speed. However, […]