AWS Big Data Blog

Best practices for Amazon Redshift Federated Query

This post discusses 10 best practices to help you maximize the benefits of Federated Query when you have large federated data sets, when your federated queries retrieve large volumes of data, or when you have many Redshift users accessing federated data sets. These techniques are not necessary for general usage of Federated Query. They are intended for advanced users who want to make the most of this exciting feature.

Analyzing Google Analytics data with Amazon AppFlow and Amazon Athena

This post demonstrates how you can transfer Google Analytics data to Amazon S3 using Amazon AppFlow, and analyze it with Amazon Athena. You no longer need to build your own application to extract data from Google Analytics and other SaaS applications. Amazon AppFlow enables you to develop a fully automated data transfer and transformation workflow and an integrated query environment in one place.

Setting up trust between ADFS and AWS and using Active Directory credentials to connect to Amazon Athena with ODBC driver

This post walks you through configuring ADFS 3.0 on a Windows Server 2012 R2 Amazon Elastic Compute Cloud (Amazon EC2) instance and setting up trust between the ADFS 3.0 IdP and AWS through SAML 2.0. The post then demonstrates how to install the Athena ODBC driver on an Amazon Linux EC2 instance (RHEL instance) and configure it to use ADFS for authentication.

Using Random Cut Forests for real-time anomaly detection in Amazon OpenSearch Service

September 8, 2021: Amazon Elasticsearch Service has been renamed to Amazon OpenSearch Service. Anomaly detection is a rich field of machine learning. Many mathematical and statistical techniques have been used to discover outliers in data, and as a result, many algorithms have been developed for performing anomaly detection in a computational setting. In […]

Running a high-performance SAS Grid Manager cluster on AWS with Amazon FSx for Lustre

SAS® is a software provider of data science and analytics used by enterprises and government organizations. SAS Grid is a highly available, fast processing analytics platform that offers centralized management that balances workloads across different compute nodes. This application suite is capable of data management, visual analytics, governance and security, forecasting and text mining, statistical […]

Moving to managed: The case for Amazon OpenSearch Service

September 8, 2021: Amazon Elasticsearch Service has been renamed to Amazon OpenSearch Service. Prior to joining AWS, I led a development team that built mobile advertising solutions with Elasticsearch. Elasticsearch is a popular open-source search and analytics engine for log analytics, real-time application monitoring, clickstream analysis, and (of course) search. The platform I […]

Monitor and control the storage space of a schema with quotas with Amazon Redshift

Many organizations are moving toward self-service analytics, where different personas create their own insights from the growing volume, variety, and velocity of data to keep up with the acceleration of business. This data democratization creates the need to enforce data governance, control cost, and prevent data mismanagement. Controlling the storage quota of different personas is a significant challenge for data governance and data storage operation. This post shows you how to set up Amazon Redshift storage quotas for different personas.

How Goldman Sachs builds cross-account connectivity to their Amazon MSK clusters with AWS PrivateLink

This guest post presents patterns for accessing an Amazon Managed Streaming for Apache Kafka cluster across your AWS account or Amazon Virtual Private Cloud (Amazon VPC) boundaries using AWS PrivateLink. In addition, the post discusses the pattern that the Transaction Banking team at Goldman Sachs (TxB) chose for their cross-account access, the reasons behind their […]