How Behalf Met its Streaming Data Scaling Demands with Amazon Managed Streaming for Apache Kafka
By Nadav Sharir, VP Engineering at Behalf
By Liran Keren, Director of Engineering at Behalf
By Balaji Gopalan, Sr. Solutions Architect at AWS
To be a successful fintech startup, companies have to build solutions fast so the business can achieve its goals. However, you can’t compromise on security, reliability, or support.
As an AWS Financial Services Competency Partner, Behalf is committed to delivering reliable, secure, low-cost payment processing and credit options to business customers.
In this post, we will detail the decision-making process Behalf went through in evaluating cloud solutions for its core platform.
We’ll also outline why the engineering team chose Amazon Managed Streaming for Apache Kafka (Amazon MSK) to meet its increasing streaming data needs in a reliable and cost-efficient manner.
Behalf’s core platform offers loans to small businesses, and its online financing platform works as a line of credit so that organizations only pay interest on the amount of credit they use to fund their purchases.
By paying vendors up front, Behalf eliminates the vendors’ credit risk. Vendors get paid on day one, bringing their days sales outstanding (DSO) to zero, and removing their responsibility for collections.
Behalf’s cloud-based service is available at many points of sale, including e-commerce, managed sales, and invoicing.
The engineering team at Behalf recognized it needed to establish an architecture that would scale with the growing number of merchants, customers, and transactions the company was managing—all while providing the best user experience possible for its business customers.
Technically, this meant Behalf needed to quickly handle increasing demand on its system without compromising on performance and throughput. The team accomplished this by building a distributed architecture based on microservices.
Goals for the project included:
- Scalability: The solution needed to manage an increasing load of transactions while maintaining throughput and performance.
- Resilience: Behalf’s solution needed to be tolerant of failures, and the system needed to be designed to avoid collapsed failure in which surrounding services degrade when a single service is down.
- Consistency: Behalf is a financial services solution, so its data needed to be fully consistent upon action.
Behalf’s architecture is based on microservices, where each service drives a clear business domain such as orders, customers, payments, and risks. Each service has its own bounded context and, therefore, owns its own data model; other services just need access to the data.
The key to Behalf’s architecture is based on the need for autonomy, meaning that each service is independent at runtime from any other service. This calls for asynchronous and event-based communication between the different services. For example, service A produces a business event, while service B reacts to it.
In many modern software design patterns, data plays a vital role in driving architectural decisions. Since Behalf wanted to guarantee the independence of each service, it needed to have the public data of each service be available for other services, with minimum dependency on the availability of the given services.
Since it also needed to assure scalability, resiliency, and data consistency, Behalf’s engineering team had to find a way to make the data both available and current for the services. When a service acted upon an event, the data needed to be up to date.
Behalf also wanted to guarantee each service would be able to evolve its data model without breaking other services via an anti-corruption layer that assured its long-term compatibility. This goal was especially challenging because Behalf wanted to assure a real-time experience for end-users, without any lags.
Data Solution Alternatives
In terms of data, Behalf wanted to allow its internal business intelligence and risk/fraud analysis services to obtain a full copy of the entire distributed data model. To do this, the engineering team identified ways to address this data challenge.
Data Access by Synchronous REST API
REST API would achieve the data consistency Behalf needed, and with an anti-corruption layer that defends the inner data model of a bounded context. With this solution when microservice A, receives a business event E, it can fetch data using REST from another microservice.
Although this solution is simple, it couldn’t work for Behalf because it would have broken runtime autonomy, and eventually would have led to a system where all components are tightly coupled.
Business Events Carrying Entities
The engineering team at Behalf then explored a Domain-Driven-Design (DDD) solution. Upon publishing a business event, the producer side can paste a public model of entities related to that event.
With this solution, Behalf could easily achieve resiliency with true isolation between bounded contexts. However, the data was available to the consumer only when a business event arrived.
What if the data is needed for later use? It can “duplicate” the data, having it persisted into its own data model, which is completely fine. But then, what if microservice A gets an event from microservice B and now it needs data from microservice C?
Exploring these kinds of ideas, Behalf learned it was not practical and many new features would demand changes to many microservices.
In addition, systems like business intelligence (BI) and risk needed access to all the data (current and historic) regardless of the running business flow. Clearly, this approach was not feasible for such systems, as they would have a need to register to all business events and remodel all the public data models into their own data model.
Data Events Stream and Business Events Stream
Behalf concluded it needed to stream a public adaptation of the data models and have it pushed into consumer-side databases. This would achieve isolation and resiliency, and provide access to the public data at any point in time, as needed.
Having more than one stream means there’s no guarantee in data consistency, however. Let’s say microservice producer A publishes a business event and microservice B consumes it. Data that changed in A before publishing of the event may not have arrived to B yet. It could be processing the business event using stale data. Causal ordering is critical.
So, Behalf’s team started to think about how to synchronize between the business events stream and the data events stream. Several ideas cropped up—such as joining the streams on the consumer side—but then they would have to pay with latency caused by the windowing. Windowing is a heuristic approach, and without really knowing the right size of the window, they could lose some data.
Another approach was to have business events carry metadata of versions of the public data model. Consumers would be able to know their data (from other services) is stale and know to wait with the processing of events until the relevant data arrives.
This idea was soon dropped, as well, because of its complexity. Microservice A publishing an event E to microservice B might be related to data from microservice C. To publish the related entities versions, data versioning had to be global and that implies having a centralized component.
The Single Stream
Behalf ultimately concluded a single queue for both data events and business events was needed.
In this case, microservice A would run some business code and make changes to its data, resulting in the publishing of corresponding data events to the stream. When microservice A wants to publish the business event E, it publishes to the same stream. Microservice B consumes the single stream and processes the data events, storing them in its own schema.
When microservice B arrives to deal with event E, all the prior data would have already been processed. With this approach, Behalf achieved all of its design goals of resiliency, isolation, and data consistency. It also delivered a true event-based solution and no dummy latency from windowing.
In Figure 1, you see that when publishers publish data events D1 and D2 and business event E, all events are streamed on the same partition. As a result, consumers consume the events in the same order: D1 > D2 > E.
Figure 1 – Streaming data and message ordering.
Knowing that Behalf needed a single topic for both data and business events, it faced the problem of maintaining transactionality. During a run of some unit of work, this means that if there are failures both events should not be published.
At first, the engineering team thought of publishing data events from the repository layer, which would guarantee transactionality and ordering in a publishing unit of work. However, what happens with data migrations?
Behalf could make sure any data migration would go through the application data model. But that’s very limiting and would tie its hands from doing any data manipulation directly on the database.
Another way to accomplish this is by streaming change data capture (CDC). Every database has a notion of a transaction log describing all data manipulation events in a manner of event sourcing, for example:
Fetching the account with id “1234” will return the account with the name Jim, since the record was created with the name of John and then updated into Jim. This history exists in the database transaction log and is the core of the replication abilities of databases.
Instead of doing full replication of one database into another, Behalf’s team thought of an ability to “subscribe” to the log and stream all events so that each consumer would be able to deal with the events. This approach guaranteed every change was committed to the database, no matter if it would be streamed out manually or through the application.
But that brought the team right back to square one, where its business events were published without synchronization to the data events.
So, they said, “Why not consider the business events as data events also?” Instead of publishing them directly to the broker, Behalf could create a table named business_events in all microservices. Then, publishing a business event would mean storing a record in that table.
Until now, Behalf’s engineering team used RabbitMQ as the event broker. There was never a problem of scale, as they only published business events and each type was published to its own topic. Now, the team wanted to publish enormous volumes to just one topic and needed to guarantee data retention and high throughput of production and consumption.
Needing a fault-tolerant, durable streaming, and messaging platform, Behalf chose to adopt Kafka, an open-source event broker that also saves the data and can act as an event store.
In addition to Kafka, the engineering team decided to use Kafka Connect, which is an open-source component for Kafka that abstracts connection to data sources like databases, search indexes, or even files, streaming the data through Kafka into targets.
Kafka Connect abstraction consists of connectors that are also developed by the open-source community. Debezium is an implementation of such a connector, and it connects to the database replication slot and streams CDC events. Debezium also supports transformation and blacklisting of events that enabled Behalf to achieve an anti-corruption layer to guarantee isolation and loose-coupling.
Kafka can handle massive amounts of streaming data and supports partitioning, meaning that events in the topic can be partitioned by some key. Order within the partition is maintained by using a single listener (thread) per partition. One listener can handle several partitions, but one partition should always be handled by just one thread.
Kafka Connect also consists of source and sink (target). For this solution, Behalf uses only the source (Debezium) and its microservices users to consume directly from Kafka.
The team created an infrastructure layer for the microservices, both for the producer side and the consumer side.
The producer side features an anti-corruption layer and abstraction for sending a business event, which is actually saving it into the database. On the consumer side is a component in the microservices that distinguishes between data events and business events and delegates them to actors in the application layer.
In Figure 2, you see that customer-service (source) writes data to its own database, and Kafka-connect transforms changes into Kafka messages. From there, Kafka transfers data events and business events on the main single topic.
Finally, the order-service core layer (target) consumes the events and pushes customer-service data into downstream.customer (<schema>.<table name>) and triggers business event handlers for the business events.
Figure 2 – Message flow between components.
Amazon Managed Streaming for Apache Kafka (Amazon MSK)
Having a high-level solution, Behalf was ready to choose the technology and services that would help it build out the new streaming platform.
Its streaming platform was going to be the core engine of Behalf, so security was a key criterion. In addition, the team wanted to use a service that was cost-effective, easy to operate and monitor, had low latency, and handled high throughput. Most importantly, they wanted someone to pick up the phone if something failed.
Behalf chose Amazon MSK over other Kafka offerings for the following reasons:
- Security: With Amazon MSK for the Apache Kafka clusters, a virtual private cloud (VPC) provides network isolation. Data is encrypted at rest, while encryption in transit is enabled through TLS for inter-broker communications. The broker to client communication is also encrypted.
- High availability: Amazon MSK makes it easy to deploy clusters with multi-Availability Zone (AZ) replication between the Apache Kafka brokers.
- Operational considerations: Amazon MSK is fully managed and offers monitoring through Amazon CloudWatch. Behalf left the Apache Kafka infrastructure management to AWS, monitoring its own infrastructure and applications using SumoLogic, a fellow APN Technology Partner, and CloudWatch integration.
- Low latency: The team needed to ensure the message trip of Amazon Relational Database Service (Amazon RDS) > Kafka Connect > Kafka > Microservice would be as short as possible, and Amazon MSK guaranteed exactly that.
- Support 24×7: Behalf’s platform runs 24×7, so the team needed to be able to reach out to technical support whenever they were needed.
- Open source compatibility: Kafka is a scalable, fault-tolerant open source event broker that saves the data and acts as an event store. Amazon MSK supports native Apache Kafka APIs and existing open-source tools (like Debezium) built against those APIs.
- Cost-effectiveness: From its initial analysis, Behalf found the Amazon MSK solution to be cost-effective.
Having a secure and reliable data streaming platform has enabled Behalf to move data between microservices with ease. The effect was seen on the technical side, and it also had a meaningful impact in a way it delivers true value to the business.
Behalf’s engineering teams are now much more independent than before. They can easily develop cross-domain features and consume data across companion services without being dependent on them at runtime. This is possible because the needed data is accessible locally within their own sub-domain.
The streaming platform also made testing more efficient, as engineers don’t need to mock any data-level APIs because the services are self-contained. They’re now able to quickly develop, test, and roll out new features for customers in an agile manner.
Behalf was also able to rapidly rebuild two meaningful uses cases on top of this new infrastructure. The first is BI streaming at near real-time, and the second is its risk and fraud platform that enables Behalf to perform at scale and with low latency.
Having the data streaming platform built on top of Amazon MSK simplified Behalf’s integration efforts and increased its software development velocity. This allows the team to extract more value from their data than ever before without increasing total cost of ownership (TCO).
The durability and performance has been proven to be great as well, with an average latency of just a few milliseconds from event producing to consumption, while experiencing stability over time.
Behalf had based its architecture on distributed microservices when the interaction between its services is driven by asynchronous business events.
In addition, public data of each service is being replicated to any desired target service that needs to consume data from its co-service. This approach assures that each service is independent during its runtime. Therefore, availability and durability can be achieved.
Since Behalf is a financial service, it could not compromise on data consistency. Data must be consistent at the point of action, meaning that when services act upon an event the data that’s accessible to the service must be up to date at near real-time.
To achieve this, Behalf chose Kafka as its streaming platform. When the core of the solution is based on the fact that data events and business events are streamed via a joint partitioned stream, the ordering of both events is assured and therefore consistent. Besides guaranteed ordering, Kafka can deliver high throughput at scale, which cab result in improved user experience.
Behalf chose Amazon MSK as its managed Kafka provider to reduce the cost of establishing and operating a Kafka cluster. Amazon MSK also offers 24×7 support while providing better security, lowered latency, and multi-zone availability. This made the solution optimal for Behalf in terms of its cost-effectiveness.
Since making this move, Behalf has experienced great improvements in terms of velocity, throughput, and durability while leveraging the new streaming platform for its rapidly growing business needs.
Behalf – APN Partner Spotlight
Behalf is an AWS Financial Services Competency Partner. They offer short-term financing with flexible repayment terms to U.S.-based businesses, and are committed to delivering reliable, secure, low-cost payment processing and credit options to business customers.
*Already worked with Behalf? Rate this Partner
*To review an APN Partner, you must be an AWS customer that has worked with them directly on a project.