AWS Big Data Blog

Category: Amazon Redshift

Scale your cloud data warehouse and reduce costs with the new Amazon Redshift RA3 nodes with managed storage

One of our favorite things about working on Amazon Redshift, the cloud data warehouse service at AWS, is the inspiring stories from customers about how they’re using data to gain business insights. Many of our recent engagements have been with customers upgrading to the new instance type, Amazon Redshift RA3 with managed storage. In this […]

Read More

Optimize Python ETL by extending Pandas with AWS Data Wrangler

Developing extract, transform, and load (ETL) data pipelines is one of the most time-consuming steps to keep data lakes, data warehouses, and databases up to date and ready to provide business insights. You can categorize these pipelines into distributed and non-distributed, and the choice of one or the other depends on the amount of data […]

Read More

Stream Twitter data into Amazon Redshift using Amazon MSK and AWS Glue streaming ETL

This post demonstrates how customers, system integrator (SI) partners, and developers can use the serverless streaming ETL capabilities of AWS Glue with Amazon Managed Streaming for Kafka (Amazon MSK) to stream data to a data warehouse such as Amazon Redshift. We also show you how to view Twitter streaming data on Amazon QuickSight via Amazon Redshift.

Read More

Manage and control your cost with Amazon Redshift Concurrency Scaling and Spectrum

This post shares the simple steps you can take to use the new Amazon Redshift usage controls feature to monitor and control your usage and associated cost for Amazon Redshift Spectrum and Concurrency Scaling features. Redshift Spectrum enables you to power a lake house architecture to directly query and join data across your data warehouse and data lake, and Concurrency Scaling enables you to support thousands of concurrent users and queries with consistently fast query performance.

Read More

Federate access to your Amazon Redshift cluster with Active Directory Federation Services (AD FS): Part 2

In the first post of this series, Federating access to your Amazon Redshift cluster with Active Directory: Part 1, you set up Microsoft Active Directory Federation Services (AD FS) and Security Assertion Markup Language (SAML) based authentication and tested the SAML federation using a web browser. In Part 2, you learn to set up an […]

Read More

Federate access to your Amazon Redshift cluster with Active Directory Federation Services (AD FS): Part 1

Many customers request detailed steps to set up federated single sign-on (SSO) using Microsoft Active Directory Federation Services (AD FS) for Amazon Redshift. In this two-part series, you will find detailed steps to achieve federated SSO using AD FS. Part 1 contains steps to set up a Windows 2016 domain controller and AD FS and […]

Read More

Develop an application migration methodology to modernize your data warehouse with Amazon Redshift

This post demonstrates how to develop a comprehensive, wave-based application migration methodology for a complex project to modernize a traditional MPP data warehouse with Amazon Redshift. It provides best practices and lessons learned by considering business priority, data dependency, workload profiles and existing service level agreements (SLAs).

Read More

Restrict Amazon Redshift Spectrum external table access to Amazon Redshift IAM users and groups using role chaining

With Amazon Redshift Spectrum, you can query the data in your Amazon Simple Storage Service (Amazon S3) data lake using a central AWS Glue metastore from your Amazon Redshift cluster. This capability extends your petabyte-scale Amazon Redshift data warehouse to unbounded data storage limits, which allows you to scale to exabytes of data cost-effectively. Like Amazon EMR, you get the benefits of open data formats and inexpensive storage, and you can scale out to thousands of Redshift Spectrum nodes to pull data, filter, project, aggregate, group, and sort. Like Amazon Athena, Redshift Spectrum is serverless and there’s nothing to provision or manage. You only pay $5 for every 1 TB of data scanned. This post discusses how to configure Amazon Redshift security to enable fine grained access control using role chaining to achieve high-fidelity user-based permission management.

Read More

How Wind Mobility built a serverless data architecture

We parse through millions of scooter and user events generated daily (over 300 events per second) to extract actionable insight. We selected AWS Glue to perform this task. Our primary ETL job reads the newly added raw event data from Amazon S3, processes it using Apache Spark, and writes the results to our Amazon Redshift data warehouse. AWS Glue plays a critical role in our ability to scale on demand. After careful evaluation and testing, we concluded that AWS Glue ETL jobs meet all our needs and free us from procuring and managing infrastructure.

Read More

Process data with varying data ingestion frequencies using AWS Glue job bookmarks

We often have data processing requirements in which we need to merge multiple datasets with varying data ingestion frequencies. Some of these datasets are ingested one time in full, received infrequently, and always used in their entirety, whereas other datasets are incremental, received at certain intervals, and joined with the full datasets to generate output. To address this requirement, this post demonstrates how to build an extract, transform, and load (ETL) pipeline using AWS Glue.

Read More