AWS Big Data Blog
Streaming web content with a log-based architecture with Amazon MSK
Content such as breaking news or sports scores requires updates in near-real time. To stay up to date, you may find yourself constantly refreshing your browser or mobile app. Building APIs to deliver this content at speed and scale can be challenging. In this post, I present an alternative to an API-based approach. I outline the concept and virtues of a log-based architecture: a software architecture that uses a commit log to capture data changes, making it easy to build new services and databases on top of other services’ full datasets. These data changes can also be content for the web. The architecture enables you to stream this content in real time, and it’s simple and easy to scale.
The following video clip shows you an example of this architecture in action.
In this post, I show you how to use Amazon Managed Streaming for Apache Kafka (Amazon MSK) to build a log-based architecture, along with the other technologies you need to stream content on the web. I also show you an example microblogging service that puts everything into action. For more information, see the GitHub repo. It contains a web application, the backend service that runs on AWS Fargate on Amazon Elastic Container Service (Amazon ECS), and the AWS CloudFormation templates for creating the infrastructure on the AWS Cloud. You can walk through running the example on AWS Cloud9 or on your local machine.
Benefits of a push model
Most websites and mobile applications distribute content by pulling it from an API. The client has to pull any new content, or updates to existing content, by making a new request to the backend. A common behavior is a browser window refresh or a pull-to-refresh on a mobile app. Another behavior is to poll the server for new content at defined intervals. This is known as the pull model.
When clients use this model, they create a new request to the server every time they update the content. Every request puts load on your application: you need to check for updates in a database or cache and send the data to the client. This consumes CPU and memory, in addition to the overhead of creating new connections to your services.
A different approach is to update the client from the server side, known as the push model. The server pushes new content or updates to the client. It’s an asynchronous communication between the server and the client. The following diagram illustrates the architecture for publish/subscribe (pub/sub) messaging, a common pattern for this asynchronous communication.
In the pub/sub pattern, new or updated content is an event. It originates from the publisher of the event and gets distributed to the subscribers. This pattern is used between the reader and the pub/sub service: the reader subscribes to the service and waits for the service to publish new articles. The pub/sub service in turn subscribes to the article log to consume the articles that the editor publishes to it. A reader that is subscribed to the pub/sub service only has to keep the connection to the service open and wait for new articles to flow in. In the example of the microblogging service, the reader uses a simple React app to read the articles. The web application keeps the connection to the backend service open, waits for new articles to be published, and updates the displayed articles on publication.
When a new article is published to the log, it is stored as a message. A message is a key-value pair, and the log keeps the order in which the messages are stored. For the example use case, the message stored here is an article, but it can be any kind of data, because messages are stored as serialized bytes. The most common formats are plain text, JSON, Protobuf, and Avro. The pub/sub service consumes the messages as they flow in and publishes them to the connected clients. Again, a client can be a web application in a browser or a mobile application on iOS or Android; in the example it’s a React app in the browser. The subscribers receive the new articles without the need to pull for the content.
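To make this concrete, the following is a minimal sketch, in Go, of a publisher that appends an article to the log using the github.com/segmentio/kafka-go client. The broker address, the topic name articles, and the Article fields are placeholder assumptions for illustration; the backend service in the GitHub repo may do this differently.

package main

import (
    "context"
    "encoding/json"
    "log"
    "time"

    "github.com/segmentio/kafka-go"
)

// Article is a placeholder shape for the content; the log itself only sees bytes.
type Article struct {
    Title   string `json:"title"`
    Content string `json:"content"`
}

func main() {
    // The writer appends messages to the articles topic.
    w := &kafka.Writer{
        Addr:  kafka.TCP("b-1.example.kafka.eu-west-1.amazonaws.com:9092"), // placeholder MSK broker
        Topic: "articles",                                                  // placeholder topic name
    }
    defer w.Close()

    // The producer is responsible for serialization; here the article becomes JSON.
    value, err := json.Marshal(Article{Title: "Breaking news", Content: "Something just happened."})
    if err != nil {
        log.Fatal(err)
    }

    // The key-value pair is appended to the log, which preserves the append order.
    err = w.WriteMessages(context.Background(), kafka.Message{
        Key:   []byte(time.Now().UTC().Format(time.RFC3339Nano)),
        Value: value,
    })
    if err != nil {
        log.Fatal(err)
    }
}

With a single partition, the order in which these messages are appended is the global order in which readers later see the articles.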
This behavior of pushing the content to the clients is called content streaming, because the client waits to read from a stream of messages. These messages contain the published content. It’s similar to video streaming, in which videos are continuously delivered in pieces to the viewer.
The virtues of log-based architectures
A common approach to give consumers access to your content is building APIs. For a long time, these have been RESTful APIs; more recently, GraphQL-based APIs have gained momentum. However, API-based architectures have several issues:
- Many content producers define their own schema for representing the data. The names and fields vary between them, especially as the number of endpoints increases and the API evolves.
- The endpoints made available differ in behavior. They use different semantics and have different request parameters.
- Many APIs don’t include a feature to notify clients about new content or updates. They need additional notification services or extensive polling mechanisms to offer this behavior.
This post’s approach is to use a log-based architecture to stream content on the web and on mobile. The idea is to use a log as a generic mechanism to store and distribute content. For more information, see Turning the database inside-out with Apache Samza and Designing Data-Intensive Applications.
Although many different log technologies exist, Apache Kafka has become an industry standard and has created a rich ecosystem. Amazon MSK became generally available last year, and is a fully managed service for building applications that use Apache Kafka. The sample microblogging service you run in this post uses Amazon MSK. It appends the published posts to the log in a partition of a topic. Partitions allow you to parallelize the processing of messages in a topic by splitting its data; within a partition, the order of messages is preserved. The topic in the microblogging example has only one partition because the global order of articles should be guaranteed; with multiple partitions, only the order within each partition is guaranteed. The pub/sub service in the example consumes the articles by reading the log in chronological order and publishes them in that order to the subscribers.
Using a log to store the articles has advantages compared to a traditional database system. First, a log is schema-less. It stores simple key-value pairs with some additional metadata. The value can be any kind of binary-encoded data, which means the producer and consumer are responsible for serializing and deserializing the stored value. This makes it easy to change the stored data over time without any downtime.
A log is also easy to back up and restore. You can consume the individual messages in the log and store them as objects in Amazon Simple Storage Service (Amazon S3), and restore the messages by republishing them to a new log. The replay of the messages happens from the object store, which is cost-effective and secure, and uses the capabilities of Amazon S3 such as object versioning and lifecycle management.
Replication is also much easier compared to traditional database systems. Because the log is in chronological order, replicas are also always in order, and it’s easy to determine if they are in sync or not.
Furthermore, building derived stores (for example, the latest articles) is much easier this way. The log contains everything needed to build up those stores, whereas a database only represents the latest state. Log-based architectures are inherently consistent.
A log is an ordered representation of all events that happened to the system. The events themselves are the changes to the system; this can be a new article or an update to an article. The log is used to create materialized views. This can be a NoSQL database like Amazon DynamoDB, which drives a GraphQL API via AWS AppSync, or any other specific view of the data. The data in a log can be materialized into any kind of view because the needed state can always be recreated by replaying the messages in the log. These views are what the data is eventually consumed from. Every service can have its own data store and view of the data, and can expose as much or as little of the data as it needs for its view. The databases of these services can be more purposeful and are easier to maintain; one service might use DynamoDB and another Amazon Aurora. The microblogging example service, however, doesn’t use a materialized view to show the posts. It publishes and consumes directly to and from the log.
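The example service doesn’t build such a view, but the following sketch shows the idea: a Go helper that a log consumer could call for every message to keep a DynamoDB table of the latest articles up to date. The table name articles-view, the attribute names, and the Article fields are assumptions for illustration; it uses the AWS SDK for Go v2.

package views

import (
    "context"
    "encoding/json"

    "github.com/aws/aws-sdk-go-v2/aws"
    "github.com/aws/aws-sdk-go-v2/service/dynamodb"
    "github.com/aws/aws-sdk-go-v2/service/dynamodb/types"
    "github.com/segmentio/kafka-go"
)

// Article is a placeholder shape for the messages stored in the log.
type Article struct {
    ID      string `json:"id"`
    Title   string `json:"title"`
    Content string `json:"content"`
}

// materialize upserts one log message into a DynamoDB table keyed by article ID.
// Replaying the whole log through this function always rebuilds the same view.
func materialize(ctx context.Context, db *dynamodb.Client, msg kafka.Message) error {
    var a Article
    if err := json.Unmarshal(msg.Value, &a); err != nil {
        // Let the caller decide whether to skip or dead-letter the message.
        return err
    }
    _, err := db.PutItem(ctx, &dynamodb.PutItemInput{
        TableName: aws.String("articles-view"), // placeholder table name
        Item: map[string]types.AttributeValue{
            "id":      &types.AttributeValueMemberS{Value: a.ID},
            "title":   &types.AttributeValueMemberS{Value: a.Title},
            "content": &types.AttributeValueMemberS{Value: a.Content},
        },
    })
    return err
}

A consumer loop, such as the one sketched after the next paragraph, would call materialize for every message it reads from the log.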
Log-based architectures are appealing to content providers because there is no distinction between accessing current and future data. Consumers are always replaying the data; they get current and all future data from the point where they start to read the log. The current position of the consumer is the offset, a simple integer that points to the last record the consumer read from the log. The number of messages between the offset and the latest message in the log is the lag. When a consumer starts to read messages from an offset, everything from this point on fuses into a single message stream. If you combine this with a protocol that can push data to the browser, you can stream these messages to the web.
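The following is a minimal sketch of such a consumer in Go, again using github.com/segmentio/kafka-go with placeholder broker and topic names. It starts reading at the first offset, and everything from that point on, including articles published in the future, arrives as one continuous stream.

package main

import (
    "context"
    "fmt"
    "log"

    "github.com/segmentio/kafka-go"
)

func main() {
    // Read the single partition directly so the global order of articles is preserved.
    r := kafka.NewReader(kafka.ReaderConfig{
        Brokers:   []string{"b-1.example.kafka.eu-west-1.amazonaws.com:9092"}, // placeholder MSK broker
        Topic:     "articles",                                                 // placeholder topic name
        Partition: 0,
    })
    defer r.Close()

    // Start at the first offset; the consumer's position in the log is its offset.
    if err := r.SetOffset(kafka.FirstOffset); err != nil {
        log.Fatal(err)
    }

    for {
        // ReadMessage blocks until the next message is available, so current and
        // future articles arrive as one continuous stream.
        msg, err := r.ReadMessage(context.Background())
        if err != nil {
            log.Fatal(err)
        }
        fmt.Printf("offset=%d article=%s\n", msg.Offset, msg.Value)
    }
}

Restarting the consumer from a stored offset picks up exactly where it left off, which is what makes replay and catch-up reads cheap.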
Streaming content with gRPC
gRPC is a high-performance RPC framework that can run in any environment. It’s often used to connect services in distributed backend systems, but is also applicable to connecting devices, mobile applications, and browsers in the last mile. An RPC framework allows applications to call a function in a remote process. gRPC uses protocol buffers to define a service and its remote calls. It’s a powerful binary serialization tool and language. Protocol buffers are Google’s language-neutral, platform-neutral, extensible mechanism for serializing structured data and sending it over the wire. You define the structure of your data and use source code generation to create the code to easily read and write data in your data structures. You don’t need to write object model wrappers in your services or on the client, and it supports many popular programming languages.
The microblogging example for this post uses the Go support for the backend service, and grpc-web in the React app, which is the client for the service. The code for both languages is created from one Protobuf definition. If you’re developing a mobile application, there is support for Swift and Android. The elimination of object model wrappers to unwrap JSON into an object is a big advantage.
The most important feature for streaming content is bidirectional streaming with HTTP/2-based transport. The example uses a server-side streaming RPC to list the published articles. The client sends a request to the server and gets back a stream to read the sequence of articles. Because the connection is kept alive, the server continues to push all future articles. gRPC guarantees message ordering within an individual RPC call, which is important for this example, because the articles are consumed directly from the log, and the articles in the log are in chronological order.
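Putting the two together, the following is a sketch of what such a server-side streaming handler can look like in Go. The generated types pb.ListArticlesRequest, pb.Article, and pb.Articles_ListArticlesServer stand in for whatever the Protobuf definition in the repo actually generates, and the article fields are assumptions; the point is the loop that reads the log and pushes each article over the open stream with Send.

package service

import (
    "encoding/json"

    "github.com/segmentio/kafka-go"

    pb "example.com/microblog/proto" // placeholder import path for the generated Protobuf code
)

type server struct {
    brokers []string
}

// ListArticles reads the article log from the beginning and streams every article,
// past and future, to the client until the client disconnects.
func (s *server) ListArticles(req *pb.ListArticlesRequest, stream pb.Articles_ListArticlesServer) error {
    r := kafka.NewReader(kafka.ReaderConfig{
        Brokers:   s.brokers,
        Topic:     "articles", // placeholder topic name
        Partition: 0,          // a single partition keeps the global order
    })
    defer r.Close()

    if err := r.SetOffset(kafka.FirstOffset); err != nil {
        return err
    }

    for {
        // stream.Context() is canceled when the client closes the connection.
        msg, err := r.ReadMessage(stream.Context())
        if err != nil {
            return err
        }

        var a struct {
            Title   string `json:"title"`
            Content string `json:"content"`
        }
        if err := json.Unmarshal(msg.Value, &a); err != nil {
            continue // skip messages that don't deserialize
        }

        // Send pushes the article to the client; gRPC preserves the order of sends
        // within the call.
        if err := stream.Send(&pb.Article{Title: a.Title, Content: a.Content}); err != nil {
            return err
        }
    }
}

Because the connection stays open, the handler only returns when the client disconnects or a read fails; until then it keeps pushing every new article that is appended to the log.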
Microblogging example service
Learning about the virtues of log-based architectures is one thing; building a service on top of this pattern is another. This post provides a complementary microblogging service that uses the pattern and gRPC to exemplify everything that you learn. For more information, see the GitHub repo.
The service has two main components:
- A simple React app that connects to the backend service via gRPC to list all published articles and create new articles
- A backend that implements the gRPC service that the app calls
The React app publishes new articles to a topic in Amazon MSK and lists these articles as they are appended to the log. Listing the articles is a call to the ListArticles remote procedure, which subscribes to the topic of articles and reads the log from the beginning. The articles are pushed to the web application as they are read from the log.
The GitHub repo also contains the CloudFormation templates to create the needed infrastructure and deploy the services. The following diagram illustrates the architecture of the services and infrastructure.
The service uses Amazon Elastic Container Registry (Amazon ECR) to store the Docker images for Fargate. The pub/sub service uses Amazon MSK for its commit log and is discovered via AWS Cloud Map. Clients connect to the service via an Application Load Balancer. This load balancer is used because it allows a longer idle timeout on the connection, so data can be pushed from the service to the client without the client having to manage the connection.
Running on AWS Cloud9
You can deploy and run the service from your development machine. If you want to try it out and experiment with it, you can also run it using AWS Cloud9. AWS Cloud9 is a cloud-based integrated development environment (IDE) that lets you write, run, and debug your code with just a browser.
The only prerequisite for completing this walkthrough is an AWS account. For instructions on creating one, see How do I create and activate a new AWS account?
Creating the environment and running the example consumes AWS resources. Make sure you remove all resources when you’re finished to avoid ongoing charges to your AWS account.
Creating the AWS Cloud9 environment
To create your AWS Cloud9 environment, complete the following steps:
- On the AWS Cloud9 console, choose Create environment.
- Name the environment mskworkshop.
- Choose Next step.
- For Instance type, choose small.
- Accept all default values and choose Next Step.
- On the summary page, review your inputs and choose Create environment.
AWS Cloud9 provides a default auto-hibernation setting of 30 minutes for your Amazon Elastic Compute Cloud (Amazon EC2) instances created through it. With this setting, your EC2 instances automatically stop 30 minutes after you close the IDE. They only start again when you reopen the environment.
- When your environment is ready, close the Welcome tab.
- From the remaining tab, choose New Terminal.
Your environment should look like the following screenshot.
For more information, see Working with Environments in AWS Cloud9.
Preparing the AWS Cloud9 environment
To proceed with the walkthrough, you have to install the most current version of the AWS Command Line Interface (AWS CLI), install the needed tools, and resize the AWS Cloud9 environment.
- To view your current version of AWS CLI, enter the following code:
Bash $ > aws --version
- To update to the latest version, enter the following code:
Bash $ > pip install --user --upgrade awscli
You can ignore warnings about the outdated pip version and check the installed AWS CLI. You also need the jq command installed. It's a lightweight and flexible command-line JSON processor that the example scripts use.
- Install the tool with the following code:
Bash $ > sudo yum install -y jq
The client runs in the Cloud9 environment. It needs a current version of Node.js and the Yarn package manager to be installed.
- To manage the installed Node.js, use the Node Version Manager (nvm). See the following code:
Bash $ > curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.34.0/install.sh | bash
- Activate the version manager in your environment with the following code:
Bash $ > . ~/.nvm/nvm.sh
- Use nvm to install the current version of Node.js:
Bash $ > nvm install node
- Use npm to install the latest version of the Yarn package manager:
Bash $ > npm install yarn -g
You have finished preparing your AWS Cloud9 environment. Next, you clone the example repository and set up the example.
Cloning and setting up the example repository
You first need to change into the folder you want to clone the repository to. See the following code:
Bash $ > cd ~/environment
Clone the repository and enter the folder with the source code:
Bash $ > git clone https://github.com/aws-samples/aws-msk-content-streaming aws-msk-content-streaming && cd $_
You have successfully cloned the source code in your Cloud9 environment.
Resizing the environment
By default, AWS Cloud9 has 8 GB of storage attached. The example service needs various Docker containers, which the provided scripts build, and they consume more than the default storage. Therefore, you have to resize the attached storage. See the following code:
Bash $ > make resize
This resizes the attached storage to 20 GB, which is sufficient for the microblogging service. If you encounter an error that the /dev/nvme0n1 device doesn't exist, you're not running on a Nitro-based architecture. For instructions on replacing the devices, see Moving an environment or resizing an Amazon EBS volume.
The microblogging service includes the option of a bastion host, a special-purpose computer specifically designed for securely accessing resources in the network. You can access this host via SSH.
Creating an SSH key
To generate an SSH key, enter the following code (if needed, you can use this key to access Amazon MSK and your Fargate containers):
Bash $ > ssh-keygen
Press Enter three times to accept the default choices. Next, upload the public key to Amazon EC2 in your Region with the following code:
Bash $ > aws ec2 import-key-pair --key-name ${C9_PROJECT} --public-key-material file://~/.ssh/id_rsa.pub
Set the key as the KEY_PAIR environment variable for the scripts to use. You're now ready to deploy the example application to your AWS account.
Deploying the application
Before you can run the web application, you have to deploy the needed services to your AWS account. The deployment scripts create two CloudFormation stacks: one stack with the core services (VPC, NAT gateway, internet gateway, and Amazon MSK), and an application stack with the Docker containers and the Application Load Balancer.
To deploy these stacks, enter the following code:
Bash $ > make deploy
The deployment process may take some time. When the deployment process is complete, you can start the web application. It starts a local development server that is also exposed to the public. See the following code:
Bash $ > make start
The build process is finished when you see a Compiled successfully! message and the URLs of the development server. You can access a preview of the web application by choosing Preview, Preview Running Application on the toolbar. This opens a split view with a window and the web application.
Unfortunately, you can’t post or read any content in this preview because you didn’t configure a custom domain for the Application Load Balancer, and therefore the deployed service is only accessible via HTTP. However, AWS Cloud9 is a secure environment and expects content to be served via HTTPS.
To make the example work, you can either copy the full URL from the preview (https://12345678910.vfs.cloud9.eu-west-1.amazonaws.com/) or choose the Pop Out Into New Window icon next to the browser bar. In the URL, replace https with http. You can now access the service with HTTP.
You can test the example by creating a new item. Give it a title and add some content. When you’re finished, choose Create Post. If you see an error because of a connection problem, refresh your browser.
Cleaning up
When you’re finished exploring the example and discovering the deployed resources, the last step is to clean up your account. The following code deletes all the resources you created:
Bash $ > make delete
Running on your machine
If you want to run the example on your local machine, you need to install the required tools. Installing them enables you to build the needed Docker containers for the pub/sub service, create the needed infrastructure, deploy the containers, and run the React app to create and read posts. For more information, see the GitHub repo.
To run the infrastructure and access the example, you also clone the repository to your machine and enter the folder. See the following code:
bash $ > git clone https://github.com/aws-samples/aws-msk-content-streaming aws-msk-content-streaming && cd $_
In the next step, you build the backend service, the envoy proxy for the gRPC calls, and the needed infrastructure for the service. The envoy proxy is needed to bridge the gRPC-Web client in the web app to the gRPC server in the backend service. The calls from the web app are text-encoded, while the backend service uses the binary protobuf format. You have to set the following environment variables for the deployment to work:
bash $ > export PROJECT_NAME=<YOUR_PROJECT_NAME>
bash $ > export AWS_ACCOUNT_ID=<YOUR_ACCOUNT_ID>
bash $ > export AWS_DEFAULT_REGION=<YOUR_AWS_REGION>
bash $ > export KEY_PAIR=<YOUR_AWS_EC2_KEY_PAIR>
Replace the <> placeholders with your details: the name of the project, the ID of the account you want to deploy the infrastructure to, the Region you want to deploy the stack to, and the name of the EC2 key pair to use.
To deploy the same CloudFormation stacks as in the AWS Cloud9 environment, enter the following code:
bash $ > make deploy
To start a development webserver at localhost:3030 and open a browser window with this URL in your default browser, enter the following code:
bash $ > make start
The client is configured with the environment variable REACT_APP_ENDPOINT, which is set to the URL of the Application Load Balancer.
Create your articles
You can now create a new article with a title and content and publish it to the log. The list of articles should then update automatically as the new article is pushed to the client. You can also test this by duplicating the tab and creating a new article in the new tab.
The following diagram illustrates the behavior of the solution from the client perspective:
The remote call to the pub/sub service subscribes the client to the list of articles in the article log, and a unidirectional gRPC stream is created. The pub/sub service pushes all available articles and all new articles to the client. The envoy proxy filters the grpc-web calls, which are in text format, and translates them into binary gRPC calls for the pub/sub service. Inserting an article is an additional unary gRPC call to the pub/sub service.
Summary
This post discussed log-based architectures and how you can use them to stream content on the web. We talked about the virtues of gRPC to stream the content to the browser or to mobile devices. Furthermore, you experimented with a log-based architecture with a microblogging service built on Amazon MSK. You also saw how to deploy the needed infrastructure and run it with the sample code.
You can use the underlying idea and the provided example to build your own solution. Please share what you build, or any questions you have about running log-based architectures on the AWS Cloud.
About the Author
Sebastian Doell is a Startup Solutions Architect at AWS. He helps startups execute on their ideas at speed and at scale. Sebastian also maintains a number of open-source projects and is an advocate of Dart and Flutter.