AWS Big Data Blog

Category: Analytics

Joining and Enriching Streaming Data on Amazon Kinesis

Are you trying to move away from a batch-based ETL pipeline? You might do this, for example, to get real-time insights into your streaming data, such as clickstream, financial transactions, sensor data, customer interactions, and so on.  If so, it’s possible that as soon as you get down to requirements, you realize your streaming data […]

Read More

Amazon Redshift Engineering’s Advanced Table Design Playbook: Table Data Durability

Part 1: Preamble, Prerequisites, and Prioritization Part 2: Distribution Styles and Distribution Keys Part 3: Compound and Interleaved Sort Keys Part 4: Compression Encodings Part 5: Table Data Durability (Translated into Japanese) In the fifth and final installment of the Advanced Table Design Playbook, I’ll discuss how to use two simple table durability properties to […]

Read More

Interactive Analysis of Genomic Datasets Using Amazon Athena

Aaron Friedman is a Healthcare and Life Sciences Solutions Architect with Amazon Web Services The genomics industry is in the midst of a data explosion. Due to the rapid drop in the cost to sequence genomes, genomics is now central to many medical advances. When your genome is sequenced and analyzed, raw sequencing files are […]

Read More

Amazon Redshift Engineering’s Advanced Table Design Playbook: Compression Encodings

Part 1: Preamble, Prerequisites, and Prioritization Part 2: Distribution Styles and Distribution Keys Part 3: Compound and Interleaved Sort Keys Part 4: Compression Encodings (Translated into Japanese) Part 5: Table Data Durability In part 4 of this blog series, I’ll be discussing when and when not to apply column encoding for compression, methods for determining ideal […]

Read More

Amazon Redshift Engineering’s Advanced Table Design Playbook: Compound and Interleaved Sort Keys

  Part 1: Preamble, Prerequisites, and Prioritization Part 2: Distribution Styles and Distribution Keys Part 3: Compound and Interleaved Sort Keys (Translated into Japanese) Part 4: Compression Encodings Part 5: Table Data Durability In this installment, I’ll cover different sort key options, when to use sort keys, and how to identify the most optimal sort key […]

Read More

Amazon Redshift Engineering’s Advanced Table Design Playbook: Distribution Styles and Distribution Keys

  Part 1: Preamble, Prerequisites, and Prioritization Part 2: Distribution Styles and Distribution Keys (Translated into Japanese) Part 3: Compound and Interleaved Sort Keys Part 4: Compression Encodings Part 5: Table Data Durability The first table and column properties we discuss in this blog series are table distribution styles (DISTSTYLE) and distribution keys (DISTKEY). This blog […]

Read More

Amazon Redshift Engineering’s Advanced Table Design Playbook: Preamble, Prerequisites, and Prioritization

  Part 1: Preamble, Prerequisites, and Prioritization (Translated into Japanese) Part 2: Distribution Styles and Distribution Keys Part 3: Compound and Interleaved Sort Keys Part 4: Compression Encodings Part 5: Table Data Durability Amazon Redshift is a fully managed, petabyte scale, massively parallel data warehouse that offers simple operations and high performance. AWS customers use Amazon […]

Read More

Implementing Authorization and Auditing using Apache Ranger on Amazon EMR

Updated 3/30/2022: Amazon EMR has announced official support of Apache Ranger (link). Open-source plugin support will not be maintained moving forward and compatibility with latest versions will not be tested. We recommend customers to move to the Amazon EMR support for Apache Ranger. Ranger Presto plugin support on EMR has been deprecated. Updated 12/03/2020: Support for […]

Read More

Analyzing Data in S3 using Amazon Athena

This blog was last reviewed May, 2022. Amazon Athena is an interactive query service that makes it easy to analyze data directly from Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to set up or manage and you can start analyzing your data immediately. You don’t even need to load […]

Read More

Low-Latency Access on Trillions of Records: FINRA’s Architecture Using Apache HBase on Amazon EMR with Amazon S3

John Hitchingham is Director of Performance Engineering at FINRA The Financial Industry Regulatory Authority (FINRA) is a private sector regulator responsible for analyzing 99% of the equities and 65% of the option activity in the US. In order to look for fraud, market manipulation, insider trading, and abuse, FINRA’s technology group has developed a robust […]

Read More