AWS Big Data Blog

Making ETL easier with AWS Glue Studio

AWS Glue Studio is an easy-to-use graphical interface that speeds up the process of authoring, running, and monitoring extract, transform, and load (ETL) jobs in AWS Glue. The visual interface allows those who don’t know Apache Spark to design jobs without coding experience and accelerates the process for those who do. AWS Glue Studio was […]

Automating bucketing of streaming data using Amazon Athena and AWS Lambda

In today’s world, data plays a vital role in helping businesses understand and improve their processes and services to reduce cost. You can use several tools to gain insights from your data, such as Amazon Kinesis Data Analytics or open-source frameworks like Structured Streaming and Apache Flink to analyze the data in real time. Alternatively, […]

Best practices using AWS SCT and AWS Snowball to migrate from Teradata to Amazon Redshift

This is a guest post from ZS. In their own words, “ZS is a professional services firm that works closely with companies to help develop and deliver products and solutions that drive customer value and company results. ZS engagements involve a blend of technology, consulting, analytics, and operations, and are targeted toward improving the commercial […]

Bringing the power of embedded analytics to your apps and services with Amazon QuickSight

In the world we live in today, companies need to quickly react to change—and to anticipate it. Customers tell us that their reliance on data has never been greater than it is today. To improve your decision-making, you have two types of data transformation needs: data agility, the speed at which data turns into […]

Building an AWS Glue ETL pipeline locally without an AWS account

If you’re new to AWS Glue and looking to understand its transformation capabilities without incurring an added expense, or if you’re simply wondering if AWS Glue ETL is the right tool for your use case and want a holistic view of AWS Glue ETL functions, then please continue reading. In this post, we walk you […]

How to delete user data in an AWS data lake

The General Data Protection Regulation (GDPR) is an important aspect of today’s technology world, and processing data in compliance with GDPR is a necessity for those who implement solutions within the AWS public cloud. One article of GDPR is the “right to erasure” or “right to be forgotten,” which may require you to implement a solution […]

Streaming data from Amazon S3 to Amazon Kinesis Data Streams using AWS DMS

Stream processing is very useful in use cases where we need to detect a problem quickly and improve the outcome based on data, for example production line monitoring or supply chain optimization. This blog post walks you through the process of streaming existing data files and ongoing changes from Amazon Simple Storage Service (Amazon S3) to […]

Using the Amazon Redshift Data API to interact with Amazon Redshift clusters

This post was updated on July 28, 2021, to include multi-statement and parameterization support. Amazon Redshift is a fast, scalable, secure, and fully managed cloud data warehouse that makes it simple and cost-effective to analyze all your data using standard SQL and your existing ETL (extract, transform, and load), business intelligence (BI), and reporting tools. […]

Analyzing Amazon S3 server access logs using Amazon ES

September 8, 2021: Amazon Elasticsearch Service has been renamed to Amazon OpenSearch Service. When you use Amazon Simple Storage Service (Amazon S3) to store corporate data and host websites, you need additional logging to monitor access to your data and the performance of your application. An effective logging solution enhances security and improves […]

Implementing LDAP authentication for Hive on a multi-tenant Amazon EMR cluster

As Amazon EMR continues its widespread adoption, it’s important to enforce separation of duties using role-based access when submitting your Hive jobs on EMR clusters in multi-tenant environments. In this post, we walk through the steps to set up authentication for Hive using Lightweight Directory Access Protocol (LDAP) and Microsoft Active Directory Domain Controller. Solution […]
