AWS Big Data Blog

Access web interfaces securely on Amazon EMR launched in a private subnet using an Application Load Balancer

Amazon EMR web interfaces are hosted on the master node of an EMR cluster. When you launch an EMR cluster in a private subnet, the EMR master node doesn’t have a public DNS record. The web interfaces hosted in a private subnet aren’t easily accessible outside the subnet. You can use an Application Load Balancer (ALB), launched in a public subnet, as an HTTPS proxy to access EMR web interfaces over the internet without requiring SSH tunneling through a bastion host. This approach greatly simplifies accessing EMR web interfaces. This post outlines how to use an ALB to securely access EMR web interfaces over the internet for an EMR cluster launched in a private subnet.

Best practices for Amazon Redshift Federated Query

This post discusses 10 best practices to help you maximize the benefits of Federated Query when you have large federated data sets, when your federated queries retrieve large volumes of data, or when you have many Redshift users accessing federated data sets. These techniques are not necessary for general usage of Federated Query. They are intended for advanced users who want to make the most of this exciting feature.

Analyzing Google Analytics data with Amazon AppFlow and Amazon Athena

This post demonstrates how you can transfer Google Analytics data to Amazon S3 using Amazon AppFlow, and analyze it with Amazon Athena. You no longer need to build your own application to extract data from Google Analytics and other SaaS applications. Amazon AppFlow enables you to develop a fully automated data transfer and transformation workflow and an integrated query environment in one place.

Setting up trust between ADFS and AWS and using Active Directory credentials to connect to Amazon Athena with ODBC driver

This post walks you through configuring ADFS 3.0 on a Windows Server 2012 R2 Amazon Elastic Compute Cloud (Amazon EC2) instance and setting up trust between the ADFS 3.0 IdP and AWS through SAML 2.0. The post then demonstrates how to install the Athena ODBC driver on an Amazon Linux EC2 instance (RHEL instance) and configure it to use ADFS for authentication.

Using Random Cut Forests for real-time anomaly detection in Amazon Elasticsearch Service

Anomaly detection is a rich field of machine learning. Many mathematical and statistical techniques have been used to discover outliers in data, and as a result, many algorithms have been developed for performing anomaly detection in a computational setting. In this post, we take a close look at the output and accuracy of the anomaly detection feature available in Amazon Elasticsearch Service (Amazon ES) and Open Distro for Elasticsearch, and explain why we chose Random Cut Forests (RCF) as the core anomaly detection algorithm.

Running a high-performance SAS Grid Manager cluster on AWS with Amazon FSx for Lustre

SAS® is a provider of data science and analytics software used by enterprises and government organizations. SAS Grid is a highly available, fast-processing analytics platform that offers centralized management and balances workloads across different compute nodes. This application suite is capable of data management, visual analytics, governance and security, forecasting and text mining, statistical […]

Moving to managed: The case for the Amazon Elasticsearch Service

You need to factor several considerations into your decision to move to a managed service. Obviously, you want your teams focused on doing meaningful work that propels the growth of your company. Deciding which processes to offload to a managed service and which are best self-managed can be a challenge. Based on my experience managing Elasticsearch at my prior employer, and having worked with thousands of customers who have migrated to AWS, I consider the following sections important topics for you to review.

Monitor and control the storage space of a schema with quotas with Amazon Redshift

Many organizations are moving toward self-service analytics, where different personas create their own insights on the evolving volume, variety, and velocity of data to keep up with the acceleration of business. This data democratization creates the need to enforce data governance, control cost, and prevent data mismanagement. Controlling the storage quota of different personas is a significant challenge for data governance and data storage operation. This post shows you how to set up Amazon Redshift storage quotas for different personas.
