AWS Big Data Blog

Month in Review (January 2016)

Lots for big data enthusiasts in January on the AWS Big Data Blog. Take a look! Running an External Zeppelin Instance using S3 Backed Notebooks with Spark on Amazon EMR: Learn how to set up Zeppelin running “off-cluster” on a separate EC2 instance. You’ll be able to submit Spark jobs to an EMR cluster directly […]

Turning Amazon EMR into a Massive Amazon S3 Processing Engine with Campanile

Michael Wallman is a senior consultant with AWS ProServ. Have you ever had to copy a huge Amazon S3 bucket to another account or region? Or create a list based on object name or size? How about mapping a function over millions of objects? Amazon EMR to the rescue! EMR allows you to deploy large […]

Agile Analytics with Amazon Redshift

Nick Corbett is a Big Data Consultant for AWS Professional Services. What makes outstanding business intelligence (BI)? It needs to be accurate and up-to-date, but this alone won’t differentiate a solution. Perhaps a better measure is to consider the reaction you get when your latest report or metric is released to the business. Good BI […]

Querying Amazon Kinesis Streams Directly with SQL and Spark Streaming

Amo Abeyaratne is a Big Data consultant with AWS Professional Services. Introduction: What if you could use your SQL knowledge to discover patterns directly from an incoming stream of data? Streaming analytics is a very popular topic of conversation around big data use cases. These use cases can vary from just accumulating simple web transaction […]

Join us at the AWS Big Data Meetup on January 13th in San Francisco

The AWS Big Data Meetup brings Big Data developers and enthusiasts together to discuss Big Data solutions with each other and AWS team members. At the event you will hear speakers from AWS and the wider community who are pushing the boundaries of Big Data. We are committed to maintaining a technical focus, and invite […]

Running an External Zeppelin Instance using S3 Backed Notebooks with Spark on Amazon EMR

Dominic Murphy is an Enterprise Solution Architect with Amazon Web Services. Apache Zeppelin is an open source GUI which creates interactive and collaborative notebooks for data exploration using Spark. You can use Scala, Python, SQL (using Spark SQL), or HiveQL to manipulate data and quickly visualize results. Zeppelin notebooks can be shared among several users, […]

Month in Review: December 2015

Lots for big data enthusiasts in December on the AWS Big Data Blog. Take a look! Top 10 Performance Tuning Techniques for Amazon Redshift: “This post takes you through the most common issues that customers find as they adopt Amazon Redshift, and gives you concrete guidance on how to address each.” Migrating Metadata when Encrypting […]

Query Routing and Rewrite: Introducing pgbouncer-rr for Amazon Redshift and PostgreSQL

This post was last reviewed and updated in August 2022, with a section on Deploying pgbouncer in Elastic Kubernetes Service (EKS). NOTE: You can now use federated queries in Amazon Redshift to query and analyze data across operational databases, data warehouses, and data lakes. For more information, please review the Amazon Redshift documentation article, “Querying Data […]

Securely Access Web Interfaces on Amazon EMR Launched in a Private Subnet

Ben Snively is a Solutions Architect with AWS. Private subnets allow you to limit access to deployed components, and to control security and routing of the system. You can also use a private subnet to connect an on-premises local network to AWS through a VPN or AWS Direct Connect. Amazon EMR allows customers to launch […]

Performance Tuning Your Titan Graph Database on AWS

At AWS re:Invent 2017, we announced the preview of Amazon Neptune, a fast and reliable graph database built for the cloud. Neptune is fully managed and highly available, and it includes read replicas, point-in-time recovery, and continuous backups to Amazon S3. If you are about to build an application yourself and need a graph database, […]