A Data Sharing Platform Based on AWS Lambda

September 8, 2021: Amazon Elasticsearch Service has been renamed to Amazon OpenSearch Service. See details.

Julien Lepine
Solutions Architect

As developers, one of our top priorities is to build reliable systems; this is a core pillar of the AWS Well-Architected Framework. A common pattern to fulfill this goal is to have an architecture built around loosely coupled components.

Amazon Kinesis Streams offers an excellent answer for this, as the events generated can be consumed independently by multiple consumers and remain available for 1 to 7 days. Building an Amazon Kinesis consumer application is done by leveraging the Amazon Kinesis Client Library (KCL) or native integration with AWS Lambda.

When I spoke with other developers and customers about their use of Amazon Kinesis, a few patterns came up. This post addresses those common patterns.

Protecting streams

Amazon Kinesis has made the implementation of event buses easy and inexpensive, so that applications can send meaningful information to their surrounding ecosystem. As your applications grow and get more usage within your company, more teams will want to consume the data generated, possibly including external parties such as business partners or customers.

When the applications get more usage, some concerns may arise:

When a new consumer starts (or re-starts after some maintenance), it needs to read a lot of data from the stream (its backlog) in a short amount of time in order to get up to speed
A customer may start many consumers at the same time, reading a lot of events in parallel or having a high call rate to Amazon Kinesis
A consumer may have an issue (such as infinite loop, retry error) that causes it to call Amazon Kinesis at an extremely high rate

These cases may lead to a depletion of the resources available in your stream, and that could potentially impact all your consumers.

Managing the increased load can be done by leveraging the scale-out model of Amazon Kinesis through the addition of shards to an existing stream. Each shard adds both input (ingestion) and output (consumption) capacity to your stream:

Up to 1,000 records and up to 1 megabyte per second for ingesting events
Up to 5 read transactions and up to 2 megabytes per second for consuming events
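
To make these limits concrete, here is a rough sizing sketch. It assumes every consumer reads the full stream and that traffic is spread evenly across shards; the function name and the workload figures are illustrative only.

```python
import math

# Per-shard limits quoted in the list above.
WRITE_RECORDS_PER_SEC = 1000
WRITE_BYTES_PER_SEC = 1_000_000  # 1 MB, taken as 10**6 bytes for this estimate
READ_BYTES_PER_SEC = 2_000_000   # 2 MB


def shards_needed(records_per_sec, avg_record_bytes, consumers):
    """Estimate the smallest shard count that fits the write and read byte limits."""
    ingest_bytes = records_per_sec * avg_record_bytes
    by_write_count = records_per_sec / WRITE_RECORDS_PER_SEC
    by_write_bytes = ingest_bytes / WRITE_BYTES_PER_SEC
    # Each consumer reads every event, so read volume grows with the consumer count.
    by_read_bytes = ingest_bytes * consumers / READ_BYTES_PER_SEC
    return max(1, math.ceil(max(by_write_count, by_write_bytes, by_read_bytes)))


# Hypothetical workload: 3,000 events/s of 500 bytes each, read by 4 consumers.
print(shards_needed(3000, 500, 4))
```

The limit of 5 read transactions per second is left out of the calculation on purpose: each consumer polls every shard, so that limit caps the number of consumers multiplied by their polling rate, whatever the shard count.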

You could avoid these scenarios by scaling out your streams and provisioning for peak load, but that would create inefficiencies and may still not fully protect your consumers from the behavior of others.

What becomes apparent in these cases is the impact that a single failing consumer may have on all other consumers, a problem often called the “noisy neighbor” effect. Addressing it means managing the blast radius of your system: the key point is to limit the impact that a single consumer can have on others.

A solution is to compartmentalize your platform: create multiple streams, then create groups of consumers that share the same stream. This lets you limit the impact a single consumer can have on its neighbors, and potentially offer a model where some customers have a dedicated stream.

You can build an Amazon Kinesis consumer application (via the KCL or Lambda) that reads a source stream and sends the messages to the “contained” streams that the actual consumers will use.
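
As a minimal sketch of such a consumer, the Lambda handler below copies every record from the source stream into one stream per consumer group. The stream names and the GROUP_STREAMS mapping are hypothetical, and the optional kinesis argument exists only so the logic can be exercised without AWS access.

```python
import base64
from collections import defaultdict

# Hypothetical mapping of consumer groups to their "contained" streams.
GROUP_STREAMS = {
    "internal": "events-internal",
    "partners": "events-partners",
}


def build_batches(event, group_streams=GROUP_STREAMS):
    """Copy every incoming Kinesis record into one batch per contained stream."""
    batches = defaultdict(list)
    for record in event["Records"]:
        payload = record["kinesis"]
        entry = {
            # Lambda delivers the Kinesis payload base64-encoded.
            "Data": base64.b64decode(payload["data"]),
            "PartitionKey": payload["partitionKey"],
        }
        for stream_name in group_streams.values():
            batches[stream_name].append(entry)
    return dict(batches)


def handler(event, context, kinesis=None):
    if kinesis is None:
        import boto3  # imported lazily so the batching logic runs without boto3
        kinesis = boto3.client("kinesis")
    for stream_name, records in build_batches(event).items():
        # PutRecords accepts at most 500 records per call.
        for start in range(0, len(records), 500):
            kinesis.put_records(
                StreamName=stream_name,
                Records=records[start:start + 500],
            )
```

A production version would also inspect the FailedRecordCount field of each PutRecords response and retry the rejected entries.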

Transforming streams

Another use case I see from customers is the need to transfer the data in their stream to other services:

Some applications may have limitations in their ability to receive or process the events
They may not have connectors to Amazon Kinesis, and only support Amazon SQS
They may only support a push model, where their APIs need to be called directly when a message arrives
Some analytics/caching/search may be needed on the events generated
Data may need to be archived or sent to a data warehouse engine

There are many other cases, but the core need is having the ability to get the data from Amazon Kinesis into other platforms.

The solution for these use cases is to build an Amazon Kinesis consumer application that reads a stream and prepares these messages for other services.
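
For instance, an application that only supports Amazon SQS could be fed by a function along these lines. This is a sketch under the assumption that the events are UTF-8 text; to_sqs_batches and forward_to_queue are illustrative names, and the queue URL is supplied by the caller.

```python
import base64


def to_sqs_batches(event, batch_size=10):
    """Turn the Kinesis records of a Lambda event into SendMessageBatch entries."""
    entries = []
    for record in event["Records"]:
        body = base64.b64decode(record["kinesis"]["data"]).decode("utf-8")
        # Entry ids only need to be unique within a batch; a running index is enough.
        entries.append({"Id": str(len(entries)), "MessageBody": body})
    # SQS accepts at most 10 entries per SendMessageBatch call.
    return [entries[i:i + batch_size] for i in range(0, len(entries), batch_size)]


def forward_to_queue(event, sqs, queue_url):
    """Send every batch to the queue; sqs is a boto3 SQS client or a stand-in."""
    for batch in to_sqs_batches(event):
        sqs.send_message_batch(QueueUrl=queue_url, Entries=batch)
```

The same shape applies to the other targets: decode the record, reshape it for the destination, then call that service's API.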

Sharing data with external parties

The final request I have seen is the possibility to process a stream from a different AWS account or region. While you can give access to your resources to an external AWS account through cross-account IAM roles, that feature requires development and is not supported natively by some services. For example, you cannot subscribe a Lambda function to a stream in a different AWS account or region.

The solution is to replicate the Amazon Kinesis stream or the events to another environment (AWS account, region, or service).

This can be done through a single Amazon Kinesis consumer application that reads a stream and forwards the events to the remote environment.
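
One way to sketch this forwarding, assuming the remote account exposes an IAM role that the consumer is allowed to assume, is shown below. The role ARN, region, session name, and stream name are placeholders chosen by the caller, and forward expects at most 500 records per call.

```python
def remote_kinesis_client(role_arn, region):
    """Create a Kinesis client that acts in the remote account and region."""
    import boto3  # imported lazily so forward() can be tried without boto3

    creds = boto3.client("sts").assume_role(
        RoleArn=role_arn,
        RoleSessionName="fanout-sketch",  # arbitrary session label
    )["Credentials"]
    return boto3.client(
        "kinesis",
        region_name=region,
        aws_access_key_id=creds["AccessKeyId"],
        aws_secret_access_key=creds["SecretAccessKey"],
        aws_session_token=creds["SessionToken"],
    )


def forward(records, kinesis, stream_name, max_attempts=3):
    """Send records and retry only those the service reports as failed.

    Returns the records that still failed after max_attempts (empty on success).
    """
    pending = list(records)
    for _ in range(max_attempts):
        if not pending:
            break
        response = kinesis.put_records(StreamName=stream_name, Records=pending)
        # Results come back in request order; failed entries carry an ErrorCode.
        pending = [
            rec for rec, res in zip(pending, response["Records"])
            if "ErrorCode" in res
        ]
    return pending
```

Partial failures are normal with PutRecords when the remote stream is throttled, which is why the sketch resends only the rejected entries.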

Solution: A Lambda-based fan-out function

These three major needs have a common solution: the deployment of an Amazon Kinesis consumer application that listens to a stream and is able to send messages to other instances of Amazon Kinesis, services, or environments (AWS accounts or regions).

In the aws-lambda-fanout GitHub repository, you’ll find a Lambda function that specifically supports this scenario. This function is made to forward incoming messages from Amazon Kinesis or DynamoDB Streams.

The architecture of the function is made to be simple and extensible, with one core file fanout.js that loads modules for the different providers. The currently supported providers are as follows:

Amazon SNS
Amazon SQS
Amazon Elasticsearch Service
Amazon Kinesis Streams
Amazon Kinesis Firehose
AWS IoT
AWS Lambda
Amazon ElastiCache for Memcached
Amazon ElastiCache for Redis

The function is built to support multiple inputs:

Amazon Kinesis streams
Amazon Kinesis streams containing Amazon Kinesis Producer Library (KPL) records
DynamoDB Streams records

It relies on Lambda for a fully-managed environment where scaling, logging, and monitoring are automated by the platform. It also supports Lambda functions in a VPC for Amazon ElastiCache.

The configuration is stored in a DynamoDB table, and associates the output configuration with each function. This table has a simple schema:

sourceArn (Partition Key): The Amazon Resource Name (ARN) of the input Amazon Kinesis stream
id [String]: The name of the mapping
type [String]: The destination type
destination [String]: The ARN or name of the destination
active [Boolean]: Whether that mapping is active

Depending on the target, some other properties are also stored.
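
As an illustration of this schema, a mapping item and a lookup over it might look as follows. The ARN, account number, and the "kinesis" type value are made-up examples, and active_targets is a hypothetical helper rather than a function from the repository.

```python
# A hypothetical configuration item following the schema above.
mapping = {
    "sourceArn": "arn:aws:kinesis:us-east-1:123456789012:stream/inputStream",
    "id": "target1",
    "type": "kinesis",
    "destination": "outputStream",
    "active": True,
}


def active_targets(mappings, source_arn):
    """Return the active mappings registered for one input stream."""
    return [
        m for m in mappings
        if m["sourceArn"] == source_arn and m["active"]
    ]


disabled = dict(mapping, id="target2", active=False)
targets = active_targets([mapping, disabled], mapping["sourceArn"])
print([t["id"] for t in targets])  # prints ['target1']
```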

The function can also group records together for services that don’t natively support batching, such as Amazon SQS, Amazon SNS, or AWS IoT. Amazon DynamoDB Streams records can also be transformed to plain JSON objects to simplify management in later stages. The function comes with a Bash-based command-line interface to make deployment and management easier.
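
To show what the DynamoDB Streams transformation mentioned above involves, here is a minimal sketch that converts the typed attribute format into plain values. It is not the repository's implementation, and it skips the binary types (B and BS).

```python
def to_plain(attr):
    """Convert one DynamoDB attribute value, such as {"S": "abc"}, to a plain value."""
    (kind, value), = attr.items()  # each attribute holds exactly one type key
    if kind == "S":
        return value
    if kind == "N":
        # Numbers arrive as strings; keep integers as int, everything else as float.
        return float(value) if any(c in value for c in ".eE") else int(value)
    if kind == "BOOL":
        return value
    if kind == "NULL":
        return None
    if kind == "L":
        return [to_plain(item) for item in value]
    if kind == "M":
        return {key: to_plain(item) for key, item in value.items()}
    if kind == "SS":
        return list(value)
    if kind == "NS":
        return [to_plain({"N": number}) for number in value]
    raise ValueError("unsupported attribute type: " + kind)


def image_to_plain(image):
    """Convert a NewImage or OldImage map from a stream record."""
    return {key: to_plain(item) for key, item in image.items()}
```

For example, image_to_plain({"id": {"S": "a"}, "count": {"N": "3"}}) returns {"id": "a", "count": 3}, which downstream consumers can handle without knowing the DynamoDB type descriptors.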

As an example, the following lines deploy the function, register a mapping from one stream (inputStream) to another (outputStream), and hook the function to the input stream.

./fanout deploy --function fanout

./fanout register kinesis --function fanout --source-type kinesis --source inputStream --id target1 --destination outputStream --active true

./fanout hook --function fanout --source-type kinesis --source inputStream

Summary

There are many options available for you to forward your events from one service or environment to another. For more information about this topic, see Using AWS Lambda with Amazon Kinesis. Happy eventing!

If you have questions or suggestions, please comment below.

AWS Compute Blog
