AWS Big Data Blog

Category: Analytics

Chasing earthquakes: How to prepare an unstructured dataset for visualization via ETL processing with Amazon Redshift

As organizations expand their analytics practices and hire data scientists and other specialized roles, big data pipelines are growing increasingly complex. Sophisticated models are being built on the troves of data collected every second. The bottleneck today is often not a lack of analytical know-how. Rather, it’s the difficulty of building and maintaining ETL (extract, transform, and load) jobs using tools that might be unsuitable for the cloud. In this post, I demonstrate a solution to this challenge.


Dynamically scale up storage on Amazon EMR clusters

In a managed Apache Hadoop environment, such as an Amazon EMR cluster, there is no convenient way to deal with a cluster whose storage capacity has filled up. This situation occurs because you set up Amazon Elastic Block Store (Amazon EBS) volumes and configure mount points when the cluster is launched, so it’s difficult to modify […]


Closing the customer journey loop with Amazon Redshift at Equinox Fitness Clubs

Clickstream analysis tools handle their data well, and some even have impressive BI interfaces. However, analyzing clickstream data in isolation comes with many limitations. For example, suppose a customer is interested in a product or service on your website and then goes to your physical store to purchase it. The clickstream analyst asks, “What happened after they […]


Advanced analytics with table calculations in Amazon QuickSight

Amazon QuickSight recently launched table calculations, which enable you to perform complex calculations on your data to derive meaningful insights. In this blog post, we go through examples of applying these calculations to a sample sales dataset so that you can start using them for your own needs. You can find the sample data […]


Restrict access to your AWS Glue Data Catalog with resource-level IAM permissions and resource-based policies

Data cataloging is an important part of many analytical systems. The AWS Glue Data Catalog integrates with a wide range of tools. Using the Data Catalog, you can also specify a policy that grants permissions to objects in the Data Catalog. Data lakes require detailed access control at both the content level and the level of the metadata describing the content. In this post, we show how you can define access policies for the metadata in the catalog.


Migrate to Apache HBase on Amazon S3 on Amazon EMR: Guidelines and Best Practices

This whitepaper walks you through the stages of a migration. It also helps you determine when to choose Apache HBase on Amazon S3 on Amazon EMR, plan for platform security, tune Apache HBase and EMRFS to support your application SLA, identify options to migrate and restore your data, and manage your cluster in production.


Connect to Amazon Athena with federated identities using temporary credentials

This post walks through three scenarios that enable trusted users to access Athena using temporary security credentials. First, we use SAML federation where user credentials are stored in Active Directory. Second, we use a custom credentials provider library to enable cross-account access. Third, we use an EC2 instance profile role to provide temporary credentials for users in our organization to access Athena.


How Annalect built an event log data analytics solution using Amazon Redshift

By establishing a data warehouse strategy using Amazon S3 for storage and Redshift Spectrum for analytics, we increased the size of the datasets we support by over an order of magnitude. In addition, we improved our ability to ingest large volumes of data quickly and maintained fast performance without increasing our costs. Our analysts and modelers can now perform deeper analytics to improve ad buying strategies and results.
