Tag: DynamoDB

Near Zero Downtime Migration from MySQL to DynamoDB

by YongSeong Lee | on | | Comments

Many companies consider migrating from relational databases like MySQL to Amazon DynamoDB, a fully managed, fast, highly scalable, and flexible NoSQL database service. For example, DynamoDB can increase or decrease capacity based on traffic, in accordance with business needs. The total cost of servicing can be optimized more easily than for the typical media-based RDBMS.

However, migrations can have two common issues:

  • Service outage due to downtime, especially when customer service must be seamlessly available 24/7/365
  • Different key design between RDBMS and DynamoDB

This post introduces two methods of seamlessly migrating data from MySQL to DynamoDB, minimizing downtime and converting the MySQL key design into one more suitable for NoSQL.

AWS services

I’ve included sample code that uses the following AWS services:

  • AWS Database Migration Service (AWS DMS) can migrate your data to and from most widely used commercial and open-source databases. It supports homogeneous and heterogeneous migrations between different database platforms.
  • Amazon EMR is a managed Hadoop framework that helps you process vast amounts of data quickly. Build EMR clusters easily with preconfigured software stacks that include Hive and other business software.
  • Amazon Kinesis can continuously capture and retain a vast amount of data such as transaction, IT logs, or clickstreams for up to 7 days.
  • AWS Lambda helps you run your code without provisioning or managing servers. Your code can be automatically triggered by other AWS services such Amazon Kinesis Streams.

Migration solutions

Here are the two options I describe in this post:

  1. Use AWS DMS

AWS DMS supports migration to a DynamoDB table as a target. You can use object mapping to restructure original data to the desired structure of the data in DynamoDB during migration.

  1. Use EMR, Amazon Kinesis, and Lambda with custom scripts

Consider this method when more complex conversion processes and flexibility are required. Fine-grained user control is needed for grouping MySQL records into fewer DynamoDB items, determining attribute names dynamically, adding business logic programmatically during migration, supporting more data types, or adding parallel control for one big table.

After the initial load/bulk-puts are finished, and the most recent real-time data is caught up by the CDC (change data capture) process, you can change the application endpoint to DynamoDB.

The method of capturing changed data in option 2 is covered in the AWS Database post Streaming Changes in a Database with Amazon Kinesis. All code in this post is available in the big-data-blog GitHub repo, including test codes.

Solution architecture

The following diagram shows the overall architecture of both options.


Amazon EMR-DynamoDB Connector Repository on AWSLabs GitHub

by Mike Grimes | on | | Comments

Mike Grimes is a Software Development Engineer with Amazon EMR

Amazon Web Services is excited to announce that the Amazon EMR-DynamoDB Connector is now open-source. The EMR-DynamoDB Connector is a set of libraries that lets you access data stored in DynamoDB with Spark, Hadoop MapReduce, and Hive jobs. These libraries are currently shipped with EMR releases, but we will now build these from the emr-dynamodb-connector GitHub repository. The code you see in the repository is exactly what is available on your EMR cluster, making it easier to build applications with this component.

Amazon EMR regularly contributes open-source changes to improve applications for developers, data scientists, and analysts in the community. Check out the README to learn how to build, contribute to, and use the Amazon EMR-DynamoDB Connector!

Data Lake Ingestion: Automatically Partition Hive External Tables with AWS

by Songzhi Liu | on | | Comments

Songzhi Liu is a Professional Services Consultant with AWS

The data lake concept has become more and more popular among enterprise customers because it collects data from different sources and stores it where it can be easily combined, governed, and accessed.

On the AWS cloud, Amazon S3 is a good candidate for a data lake implementation, with large-scale data storage. Amazon EMR provides transparent scalability and seamless compatibility with many big data applications on Hadoop. However, no matter what kind of storage or processing is used, data must be defined.

In this post, I introduce a simple data ingestion and preparation framework based on AWS Lambda, Amazon DynamoDB, and Apache Hive on EMR for data from different sources landing in S3. This solution lets Hive pick up new partitions as data is loaded into S3 because Hive by itself cannot detect new partitions as data lands.

Apache Hive

Hive is a great choice as it is a general data interfacing language thanks to its well-designed Metastore and other related projects like HCatalog. Many other Hadoop applications like Pig, Spark, and Presto, etc. can leverage the schemas defined in Hive.

Moreover, external tables make Hive a great data definition language to define the data coming from different sources on S3, such as streaming data from Amazon Kinesis, log files from Amazon CloudWatch and AWS CloudTrail, or data ingested using other Hadoop applications like Sqoop or Flume.

To maximize the efficiency of data organization in Hive, you should leverage external tables and partitioning. By properly partitioning the data, you can largely reduce the amount of data needs to be retrieved and improve the efficiency during ETL or other types of analysis.

Solving the problem with AWS services

For many of the aforementioned services or applications, data is loaded periodically, as in one batch every 15 minutes. Because Hive external tables don’t pick up new partitions automatically, you need to update and add new partitions manually; this is difficult to manage at scale. A framework based on Lambda, DynamoDB, and S3 can assist with this challenge.

Architectural diagram

As data is ingested from different sources to S3, new partitions are added by this framework and become available in the predefined Hive external tables.


Monitor Your Application for Processing DynamoDB Streams

by Asmita Barve-Karandikar | on | | Comments

Asmita Barve-Karandikar is an SDE with DynamoDB

DynamoDB Streams can handle requests at scale, but you risk losing stream records if your processing application lags: DynamoDB Stream records are unavailable after 24 hours. Therefore, when you maintain multiregion read replicas of your DynamoDB table, you might be afraid of losing data.

In this post, I suggest ways you can monitor the Amazon Kinesis Client Library (KCL) application you use to process DynamoDB Streams to quickly track and resolve issues or failures so you can avoid losing data. Dashboards, metrics, and application logs all play a part. This post may be most relevant to Java applications running on Amazon EC2 instances.

Before you read further, please read my previous posts about designing KCL applications for DynamoDB Streams using a single worker or multiple workers.


You can make a dashboard from important CloudWatch metrics to get a complete overview of the state of your application. A sample dashboard for an application processing a 256-shard DynamoDB stream with two c4.large EC2 workers is shown below:

(FailoverTimeMillis = 60000)


Process Large DynamoDB Streams Using Multiple Amazon Kinesis Client Library (KCL) Workers

by Asmita Barve-Karandikar | on | | Comments

Asmita Barve-Karandikar is an SDE with DynamoDB


Imagine you own a popular mobile health app, with millions of users worldwide, that continuously records new information. It sends over one million updates per second to its master data store and needs the updates to be relayed to various replicas across different regions in real time. Amazon DynamoDB and DynamoDB Streams are accustomed to operating at this scale; and, they can handle the data storage and the updates capture for you. Developing a stream consumer application to replicate the captured updates to different regions at this scale may seem like a daunting task.

In a previous post, I described how you can use the Amazon Kinesis Client Library (KCL) and DynamoDB Streams Kinesis Adapter to efficiently process DynamoDB streams. In this post, I will focus on the KCL configurations that are likely to have an impact on the performance of your application when processing a large DynamoDB stream. By large, I mean a stream that requires your application to spin up two or more worker machines to consume and process it. Theoretically, the number of KCL workers you require depends on the following:

  1. The memory and CPU footprint of your application, which are directly proportional to the number of shards and the throughput per shard in your stream. My previous post provides more details about how to estimate this.
  2. The memory and CPU capacity on the machines you are looking to use for your application

More generally, a stream with a large number of shards (>100) or with a high throughput per shard (>200) will likely need you to spin up multiple workers.

The following figure shows three KCL workers in action, with DynamoDB stream shards distributed among them:

Three KCL workers in action


Processing Amazon DynamoDB Streams Using the Amazon Kinesis Client Library

by Asmita Barve-Karandikar | on | | Comments

Asmita Barve-Karandikar is an SDE with DynamoDB

Customers often want to process streams on an Amazon DynamoDB table with a significant number of partitions or with a high throughput. AWS Lambda and the DynamoDB Streams Kinesis Adapter are two ways to consume DynamoDB streams in a scalable way.

While Lambda lets you run your application without having to manage infrastructure, using the DynamoDB Streams Kinesis Adapter gives you more control over the behavior of your application–mainly, the state of stream-processing. And if your application requires more sophisticated record processing, such as buffering or aggregating records based on some criterion, using the adapter might be preferable.

The Amazon Kinesis Client Library (KCL) provides useful abstractions over the low-level Amazon Kinesis Streams API. The adapter implements the Amazon Kinesis Streams interface so that KCL can consume DynamoDB streams. By managing tasks like load balancing and checkpointing for you, the KCL lets you focus on writing code for processing your stream records. However, it is important to understand the configurable properties that you can tune to best fit your use case.

In this post, I demystify the KCL by explaining some of its important configurable properties and estimate its resource consumption. The KCL version used is amazon-kinesis-client 1.6.2.

KCL overview

To begin, your application needs to create one or more KCL workers that pull data from DynamoDB streams for your table. Apart from reading stream data, the KCL tracks worker state and progress in a DynamoDB table that I will call the “leases table” (the name and provisioned throughput of the table are configurable).

The following diagram shows a single KCL worker in action. The DynamoDB Streams Kinesis Adapter, which sits between DynamoDB Streams and the KCL, is not shown for the sake of simplicity.


Using Spark SQL for ETL

by Ben Snively | on | | Comments

Ben Snively is a Solutions Architect with AWS

With big data, you deal with many different formats and large volumes of data. SQL-style queries have been around for nearly four decades. Many systems support SQL-style syntax on top of the data layers, and the Hadoop/Spark ecosystem is no exception. This allows companies to try new technologies quickly without learning a new query syntax for basic retrievals, joins, and aggregations.

Amazon EMR is a managed service for the Hadoop and Spark ecosystem that allows customers to quickly focus on the analytics they want to run, not the heavy lifting of cluster management.

In this post, we demonstrate how you can leverage big data platforms and still write queries using a SQL-style syntax over data that is in different data formats within a data lake. We first show how you can use Hue within EMR to perform SQL-style queries quickly on top of Apache Hive. Then we show you how to query the dataset much faster using the Zeppelin web interface on the Spark execution engine. Lastly, we show you how to take the result from a Spark SQL query and store it in Amazon DynamoDB.

Hive and Spark SQL history

For versions <= 1.x, Apache Hive executed native Hadoop MapReduce to run the analytics and often required the interpreter to write multiple jobs that were chained together in phases.  This allowed massive datasets to be queried but was slow due to the overhead of Hadoop MapReduce jobs.

SparkSQL adds this same SQL interface to Spark, just as Hive added to the Hadoop MapReduce capabilities. SparkSQL is built on top of the Spark Core, which leverages in-memory computations and RDDs that allow it to be much faster than Hadoop MapReduce.

Spark integrates easily with many big data repositories.  The following illustration shows some of these integrations.


Real-time in-memory OLTP and Analytics with Apache Ignite on AWS

by Babu Elumalai | on | | Comments

Babu Elumalai is a Solutions Architect with AWS

Organizations are generating tremendous amounts of data, and they increasingly need tools and systems that help them use this data to make decisions. The data has both immediate value (for example, trying to understand how a new promotion is performing in real time) and historic value (trying to understand the month-over-month revenue of launched offers on a specific product).

The Lambda  architecture (not AWS Lambda) helps you gain insight into immediate and historic data by having a speed layer and a batch layer. You can use the speed layer for real-time insights and the batch layer for historical analysis.

In this post, we’ll walk through how to:

  1. Build a Lambda architecture using Apache Ignite
  2. Use Apache Ignite to perform ANSI SQL on real-time data
  3. Use Apache Ignite as a cache for online transaction processing (OLTP) reads

To illustrate these approaches, we’ll discuss a simple order-processing application. We will extend the architecture to implement analytics pipelines and then look at how to use Apache Ignite for real-time analytics.

A classic online application

Let’s assume that you’ve built a system to handle the order-processing pipeline for your organization. You have an immutable stream of order documents that are persisted in the OLTP data store. You use Amazon DynamoDB to store the order documents coming from the application.

Below is an example order payload for this system:

{'BillAddress': '5719 Hence Falls New Jovannitown  NJ 31939', 'BillCity': 'NJ', 'ShipMethod': '1-day', 'UnitPrice': 14, 'BillPostalCode': 31939, 'OrderQty': 1, 'OrderDate': 20160314050030, 'ProductCategory': 'Healthcare'}

{'BillAddress': '89460 Johanna Cape Suite 704 New Fionamouth  NV 71586-3118', 'BillCity': 'NV', 'ShipMethod': '1-hour', 'UnitPrice': 3, 'BillPostalCode': 71586, 'OrderQty': 1, 'OrderDate': 20160314050030, 'ProductCategory': 'Electronics'}

Here is example code that I used to generate sample order data like the preceding and write the sample orders  into DynamoDB.

The illustration following shows the current architecture for this example.


Analyze a Time Series in Real Time with AWS Lambda, Amazon Kinesis and Amazon DynamoDB Streams

by JustGiving | on | | Comments

This is a guest post by Richard Freeman, Ph.D., a solutions architect and data scientist at JustGiving. JustGiving in their own words: We are one of the world’s largest social platforms for giving that’s helped 26.1 million registered users in 196 countries raise $3.8 billion for over 27,000 good causes.”


As more devices, sensors and web servers continuously collect real-time streaming data, there is a growing need to analyze, understand and react to events as they occur, rather than waiting for a report that is generated the next day. For example, your support staff could be immediately notified of sudden peaks in traffic, abnormal events, or suspicious activities, so they can quickly take the appropriate corrective actions to minimize service downtime, data leaks or financial loss.

Traditionally, this would have gone through a data warehouse or a NoSQL database, and the data pipeline code could be custom built or based on third-party software. These models resulted in analysis that had a long propagation delay: the time between a check out occurring and the event being available for analysis would typically be several hours. Using a streaming analytics architecture, we can provide analysis of events typically within one minute or less.

Amazon Kinesis Streams is a service that can continuously capture and store terabytes of data from hundreds or thousands of sources. This might include website clickstreams, financial transactions, social media feeds, application logs, and location-tracking events. A variety of software platforms can be used to build an Amazon Kinesis consumer application, including the Kinesis Client Library (KCL), Apache Spark Streaming, or Elastic MapReduce via Hive.

Using Lambda and DynamoDB gives you a truly serverless architecture, where all the infrastructure including security and scalability is managed by AWS. Lambda supports function creation in Java, Node.js, and Python; at JustGiving, we use Python to give us expressiveness and flexibility in building this type of analysis.

This post explains how to perform time-series analysis on a stream of Amazon Kinesis records, without the need for any servers or clusters, using AWS Lambda, Amazon Kinesis Streams, Amazon DynamoDB and Amazon CloudWatch.  We demonstrate how to do time-series analysis on live web analytics events stored in Amazon Kinesis Streams and present the results in near real-time for use cases like live key performance indicators, ad-hoc analytics, and quality assurance, as used in our AWS-based data science and analytics  RAVEN (Reporting, Analytics, Visualization, Experimental, Networks) platform at JustGiving.


Analyze Your Data on Amazon DynamoDB with Apache Spark

by Manjeet Chayel | on | | Comments

Manjeet Chayel is a Solutions Architect with AWS

Every day, tons of customer data is generated, such as website logs, gaming data, advertising data, and streaming videos. Many companies capture this information as it’s generated and process it in real time to understand their customers.

Amazon DynamoDB is a fast and flexible NoSQL database service for applications that need consistent, single-digit-millisecond latency at any scale. It is a fully managed database, supporting both key-value and key-sorted set schemas.

Its flexible data model and reliable performance make it a great fit for mobile, web, gaming, ad tech, the Internet of Things, and many other applications, including the type of real-time data processing I’m talking about. In this blog post, I’ll show you how to use Apache Spark to process customer data in DynamoDB.

With the Amazon EMR 4.3.0 release, you can run Apache Spark 1.6.0 for your big data processing. When you launch an EMR cluster, it comes with the emr-hadoop-ddb.jar library required to let Spark interact with DynamoDB. Spark also natively supports applications written in Scala, Python, and Java and includes several tightly integrated libraries for SQL (Spark SQL), machine learning (MLlib), stream processing (Spark Streaming), and graph processing (GraphX). These tools make it easier to leverage the Spark framework for a variety of use cases.