How WalkMe Solves Data Sequencing Challenges
Guest post by Yotam Spenser, Head of Data Engineering
In this article, I will discuss the nature of the WalkMe Insights product offering, some of the challenges that arose as we developed this technology, and how we came to our solution.
Digital Adoption Platform (DAP) pioneer WalkMe offers a 360-degree solution to leading organizations worldwide, helping them drive employee and customer adoption and secure a smooth digital transformation. One of the key elements of this offering is WalkMe Insights, a state-of-the-art analytics and BI solution designed for the business user that provides detailed insight into the activity of individual end users and end-user groups.
It is important for analytics tools like Insights to maintain the sequential integrity of customer data. Sequencing matters because we show customers when events happened on their end users' client devices, and because product features that depend on receiving sequential data (for example, Funnels, one of WalkMe's analytics applications) must process events in order to paint an accurate picture of end-user engagement.
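To illustrate why order matters, here is a hypothetical sketch of a funnel metric; the step names and the `client_ts` field are illustrative, not WalkMe's actual schema:

```python
# Hypothetical sketch of why event order matters for a funnel metric.
# Step names and the "client_ts" field are illustrative only.

def funnel_progress(events, steps):
    """Return how many funnel steps a user completed, evaluating the
    events in client-timestamp order."""
    step_index = 0
    for event in sorted(events, key=lambda e: e["client_ts"]):
        if step_index < len(steps) and event["name"] == steps[step_index]:
            step_index += 1
    return step_index

# Events as received, out of order. Without sorting by client
# timestamp, "visit" would appear to happen after "checkout" and the
# funnel would undercount.
events = [
    {"name": "checkout",    "client_ts": 300},
    {"name": "visit",       "client_ts": 100},
    {"name": "add_to_cart", "client_ts": 200},
]
print(funnel_progress(events, ["visit", "add_to_cart", "checkout"]))  # prints 3
```

Fed the raw arrival order instead, the same funnel would report only two completed steps, which is exactly the kind of distortion sequential integrity prevents.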
Sending analytics data to Insights in sequence is challenging because of the asynchronous nature of client-server architecture: the order in which data is captured does not determine the order in which it is sent, so one can't necessarily know when a given data point will arrive.
In WalkMe's case, Insights allows end-user client devices to send "late events," that is, events that occurred in the past but that client devices were unable to send at the time of occurrence.
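A minimal way to flag such an event is to compare its client-side timestamp with the time the server received it. The threshold and field names below are assumptions for illustration, not WalkMe's actual values:

```python
# Hypothetical sketch of detecting a "late event". The threshold and
# field names are assumptions for illustration.

LATE_THRESHOLD_SECONDS = 60  # assumed cutoff, not a WalkMe value

def is_late(event, arrival_ts, threshold=LATE_THRESHOLD_SECONDS):
    """An event is 'late' when its client-side timestamp is well
    behind the time the server actually received it."""
    return (arrival_ts - event["client_ts"]) > threshold

# A device reconnects and flushes an event captured 500 seconds ago:
print(is_late({"client_ts": 1000}, arrival_ts=1500))  # prints True
```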
To complicate matters further, WalkMe stores and partitions data in a read-optimized, no-append architecture (data that is easily readable and queryable by the backend, but not easily appendable or updatable) in order to allow fast querying. This rules out one of the potentially easier fixes to the challenges above: re-sequencing and updating out-of-sequence data retroactively as new data arrives.
WalkMe’s architecture processes data in real time using Amazon Kinesis and persists it to Amazon S3 (via Amazon Kinesis Data Firehose) for backups and batch processing. After streaming, we run a Spark job on AWS Glue (whose input is the output of Kinesis Data Firehose) that sorts the data by client timestamp, then compresses and stores it in a read-optimized format. Because late events keep arriving, the Spark job creates new files on each run, even over old data, which can cause an influx of small files.
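The core of that batch step, stripped of the Spark and S3 machinery, can be sketched in plain Python. This is a hypothetical, simplified stand-in, not the actual job; the hourly frame size and field names are assumptions:

```python
# Hypothetical, simplified stand-in for the Spark batch step: group
# events into fixed time-frame partitions by client timestamp and
# sort within each partition. Frame size and fields are illustrative.
from collections import defaultdict

def sort_and_partition(events, frame_seconds=3600):
    """Bucket events into partitions keyed by the start of their
    time frame, each partition sorted by client timestamp."""
    partitions = defaultdict(list)
    for event in events:
        frame_start = event["client_ts"] // frame_seconds * frame_seconds
        partitions[frame_start].append(event)
    return {frame: sorted(batch, key=lambda e: e["client_ts"])
            for frame, batch in partitions.items()}

# Two events from the first hour plus one late-arriving event from
# the next hour end up in two separate, internally sorted partitions:
parts = sort_and_partition([
    {"client_ts": 3700}, {"client_ts": 20}, {"client_ts": 10},
])
print(sorted(parts))  # prints [0, 3600]
```

In the real pipeline each partition would be compressed and written as a file under a time-frame prefix on S3, which is where the small-files problem comes from: every run that touches an old time frame emits another file for it.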
We wrote a custom partitioner that checks for optimal file sizes and can either create a new file or append to an existing one for a given time frame. The partitioner distinguishes between timely and late events and, when events are late, recalibrates the optimal size of a previously closed batch of data. To save storage space, it also weeds out duplicate and irrelevant information and restructures the existing data for maximally efficient storage. Finally, it re-creates the partitions on Amazon S3.
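The late-event path of such a partitioner might look like the following hypothetical sketch; the size limit, the deduplication key, and the `est_size` callback are all assumptions, not details of WalkMe's implementation:

```python
# Hypothetical sketch of a partitioner's late-event path: fold late
# events into a previously closed partition, deduplicate, re-sort,
# and decide whether the result still fits one file. The size limit,
# dedup key, and est_size callback are assumptions.

MAX_PARTITION_BYTES = 128 * 1024 * 1024  # illustrative target file size

def merge_partition(existing, late_events, est_size):
    """Merge late events into an already-written partition.

    Returns the deduplicated, re-sorted event list plus an action:
    "rewrite" if the merged batch still fits the target file size,
    "split" if it should be broken into smaller files.
    """
    seen, merged = set(), []
    for event in sorted(existing + late_events,
                        key=lambda e: e["client_ts"]):
        key = (event["client_ts"], event["name"])  # assumed dedup key
        if key not in seen:
            seen.add(key)
            merged.append(event)
    action = "rewrite" if est_size(merged) <= MAX_PARTITION_BYTES else "split"
    return merged, action

existing = [{"client_ts": 200, "name": "b"}, {"client_ts": 100, "name": "a"}]
late = [{"client_ts": 50, "name": "x"}, {"client_ts": 100, "name": "a"}]
merged, action = merge_partition(existing, late,
                                 est_size=lambda batch: len(batch) * 100)
print([e["client_ts"] for e in merged], action)  # prints [50, 100, 200] rewrite
```

The duplicate of the 100-second event is dropped, the late 50-second event is slotted in front, and the partition is small enough to be rewritten as a single file.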
We use the AWS Glue Data Catalog as the meta store for our data schemas, which are updated automatically by an AWS Glue crawler, and we share this single meta store across all of our services.
This solution makes it easier to do batch processing with Amazon EMR (Elastic MapReduce) based on client-device timestamp, and to query the data with Amazon Athena (which uses the dynamically updated meta store mentioned above).
Thanks to this process, WalkMe Insights stores data in sequence and presents it to the customer in the proper order, making Insights a state-of-the-art analytics and BI solution and a key element of WalkMe's Digital Adoption Platform.