AWS Business Intelligence Blog

How Ruparupa gained updated insights with an Amazon S3 data lake, AWS Glue, Apache Hudi, and Amazon QuickSight

This post is co-written with Olivia Michele and Dariswan Janweri P. at Ruparupa.

In this post, we show how Ruparupa implemented an incrementally updated data lake to get insights into their business using Amazon Simple Storage Service (Amazon S3), AWS Glue, Apache Hudi, and Amazon QuickSight. We also discuss the benefits Ruparupa gained after the implementation.

Ruparupa was built by PT. Omni Digitama Internusa with the vision to cultivate synergy and create a seamless digital ecosystem within Kawan Lama Group that touches and enhances the lives of many.

Ruparupa is the first digital platform built by Kawan Lama Group to give the best shopping experience for household, furniture, and lifestyle needs. Ruparupa’s goal is to help you live a better life, reflected in the name ruparupa, which means “everything.” We believe that everyone deserves the best, and home is where everything starts.

The data lake implemented by Ruparupa uses Amazon S3 as the storage platform, AWS Database Migration Service (AWS DMS) as the ingestion tool, AWS Glue as the ETL (extract, transform, and load) tool, and QuickSight for analytic dashboards.

Amazon S3 is an object storage service with very high scalability, durability, and security, which makes it an ideal storage layer for a data lake. AWS DMS is a database migration service that supports many relational database management systems as sources and also supports Amazon S3 as a migration target.

An AWS Glue ETL job, using the Apache Hudi connector, updates the S3 data lake hourly with incremental data. The AWS Glue job can transform the raw data in Amazon S3 to Parquet format, which is optimized for analytic queries. The AWS Glue Data Catalog stores the metadata, and Amazon Athena (a serverless query engine) is used to query data in Amazon S3.

AWS Secrets Manager is an AWS service that can be used to store sensitive data, enabling users to keep data such as database credentials out of source code. In this implementation, Secrets Manager is used to store the configuration of the Apache Hudi job for various tables.

Data analytic challenges

As an ecommerce company, Ruparupa produces a lot of data from their ecommerce website, their inventory systems, and distribution and finance applications. The data can be structured data from existing systems, and can also be unstructured or semi-structured data from their customer interactions. This data contains insights that, if unlocked, can help management make decisions to help increase sales and optimize cost.

Before implementing a data lake on AWS, Ruparupa had no infrastructure capable of processing the volume and variety of data formats in a short time. Data had to be processed manually by data analysts, and data mining took a long time. Because of the fast growth of data, ingesting just hundreds of thousands of rows took 1–1.5 hours.

The manual process also led to inconsistent data cleansing: cleansing steps were often missed, and all the data then had to go through another round of cleansing.

This long processing time reduced the analytic team’s productivity, and they could only produce weekly and monthly reports. The low reporting frequency delayed important insights to management, who couldn’t move fast enough to anticipate changes in the business.

The method used to create analytic dashboards was manual and could only produce a few routine reports. The audience of these few reports was limited—a maximum of 20 people from management. Other business units in Kawan Lama Group only consumed weekly reports that were prepared manually. Even the weekly reports couldn’t cover all important metrics, because some metrics were only available in monthly reports.

Initial solution for a real-time dashboard

The following diagram illustrates the initial solution Ruparupa implemented.

Initial solution architecture

Ruparupa started a data initiative within the organization to create a single source of truth within the company. Previously, business users could only get the sales data from the day before, and they didn’t have any visibility to current sales activities in their stores and websites.

To gain trust from business users, we wanted to provide the most up-to-date data in an interactive QuickSight dashboard. We used an AWS DMS replication task to stream real-time change data capture (CDC) updates to an Amazon Aurora MySQL-Compatible Edition database, and built a QuickSight dashboard to replace the static presentation deck.

This pilot dashboard was very well received by users, who now had visibility into their current data. However, the data source for the dashboard still resided in an Aurora MySQL database and only covered a single data domain.

The initial design had some additional challenges:

  • Diverse data sources – The data sources in an ecommerce platform consist of structured, semi-structured, and unstructured data, which require flexible data storage. The initial data warehouse design in Ruparupa only stored transactional data, and data from other systems, including user interaction data, wasn’t consolidated yet.
  • Data updates – Data inside the initial data warehouse was delayed by 1 day. This was an improvement over the weekly report, but still not fast enough to support quicker decisions.
  • Cost and scalability – Ruparupa wanted to build a future-proof data platform solution that could scale up to terabytes of data in the most cost-effective way.

This setup served only as a stopgap; we needed a more complete analytics solution that could handle larger and more complex data sources faster and more cost-effectively.

Real-time data lake solution

To fulfill their requirements, Ruparupa introduced a mutable data lake, as shown in the following diagram.

Real-time data lake solution architecture

Let’s look at each main component in more detail.

AWS DMS CDC process

To get the real-time data from the source, we streamed the database CDC log using AWS DMS (component 1 in the architecture diagram). The CDC records consist of all inserts, updates, and deletes from the source database. This raw data is stored in the raw layer of the S3 data lake.
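
To give a sense of what the raw layer contains, the sketch below reads a batch of DMS output with PySpark and counts records by operation type; DMS marks each CDC record with an Op column (I for insert, U for update, D for delete). The bucket and prefix are hypothetical, and the snippet assumes the DMS task writes Parquet files.

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session

# Hypothetical raw-layer location written by the AWS DMS task
RAW_PATH = "s3://ruparupa-datalake-raw/sales/sales_order/"

# Each CDC record carries an Op column: I = insert, U = update, D = delete
cdc_df = spark.read.parquet(RAW_PATH)
cdc_df.groupBy("Op").count().show()
```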

An S3 lifecycle policy is used to manage data retention, where the older data is moved to Amazon S3 Glacier.
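
A lifecycle rule of this kind could be defined along the following lines; the bucket name, prefix, and retention periods are hypothetical, since the post doesn’t prescribe specific values.

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical rule: move raw CDC files to S3 Glacier after 90 days
# and expire them after one year.
s3.put_bucket_lifecycle_configuration(
    Bucket="ruparupa-datalake-raw",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-raw-cdc",
                "Filter": {"Prefix": "sales/"},
                "Status": "Enabled",
                "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
                "Expiration": {"Days": 365},
            }
        ]
    },
)
```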

AWS Glue ETL job

The second S3 data lake layer is the transformed layer, where the data is transformed to an optimized format that is ready for user queries. The files are converted to the Parquet columnar format with Snappy compression and table partitioning to optimize SQL queries from Athena.

In order to create a mutable data lake that can merge changes from the data source, we introduced the Apache Hudi data lake framework. With Apache Hudi, we can perform upserts and deletes on the transformed layer to keep the data consistent in a reliable manner. With a Hudi data lake, Ruparupa can create a single source of truth for all its data sources quickly and easily. The Hudi framework takes care of the underlying metadata of the updates, making it simple to implement across hundreds of tables in the data lake. We only need to configure the writer to create a copy-on-write or merge-on-read table, depending on the access requirements. For the writer, we use an AWS Glue job combined with the AWS Glue Connector for Apache Hudi from AWS Marketplace. The additional library from the connector helps AWS Glue understand how to write to Hudi.
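
As a rough illustration of what such a writer could look like, the following PySpark sketch upserts a batch of changes into a copy-on-write Hudi table. The S3 paths, table name, key fields, and partition column are hypothetical placeholders, and the snippet assumes the Hudi libraries supplied by the connector are already on the Glue job’s classpath.

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session

# Hypothetical raw (DMS CDC) and transformed (Hudi) layer locations
RAW_PATH = "s3://ruparupa-datalake-raw/sales/sales_order/"
HUDI_PATH = "s3://ruparupa-datalake-transformed/sales/sales_order/"

# Hudi writer configuration for a copy-on-write table; field names are placeholders
hudi_options = {
    "hoodie.table.name": "sales_order",
    "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.recordkey.field": "order_id",
    "hoodie.datasource.write.precombine.field": "updated_at",
    "hoodie.datasource.write.partitionpath.field": "order_date",
    # Sync the table definition to the AWS Glue Data Catalog
    "hoodie.datasource.hive_sync.enable": "true",
    "hoodie.datasource.hive_sync.database": "datalake_transformed",
    "hoodie.datasource.hive_sync.table": "sales_order",
    "hoodie.datasource.hive_sync.use_jdbc": "false",
}

# Read the latest CDC batch from the raw layer and merge it into the Hudi table
changes_df = spark.read.parquet(RAW_PATH)
(
    changes_df.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save(HUDI_PATH)
)
```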

An AWS Glue ETL job is used to get the changes from the raw layer and merge them into the transformed layer (component 2 in the architecture diagram). With AWS Glue, we can create a PySpark job to get the data, and we use the AWS Glue Connector for Apache Hudi to simplify importing the Hudi library into the AWS Glue job. With AWS Glue, all the changes from AWS DMS can be merged easily into the Hudi data lake. The jobs are scheduled every hour using the built-in scheduler in AWS Glue.
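
As a reference point, an hourly schedule like this could also be created programmatically as a Glue trigger; the trigger name, job name, and job argument below are hypothetical stand-ins rather than Ruparupa’s actual configuration.

```python
import boto3

glue = boto3.client("glue")

# Hypothetical trigger that starts the generic transformation job at the top
# of every hour; the job name and argument are placeholders.
glue.create_trigger(
    Name="hourly-hudi-upsert",
    Type="SCHEDULED",
    Schedule="cron(0 * * * ? *)",
    Actions=[
        {
            "JobName": "datalake-hudi-transform",
            "Arguments": {"--table_name": "sales_order"},
        }
    ],
    StartOnCreation=True,
)
```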

Secrets Manager is used to store all the related parameters that are required to run the job. Instead of creating one transformation job for each table, Ruparupa created a single generic job that can transform multiple tables by using several parameters. The parameters that describe each table’s structure are stored in Secrets Manager and can be retrieved using the name of the table as the key. With these custom parameters, Ruparupa doesn’t need a job for every table; a single job can ingest the data for all the different tables when the table name is passed to it.
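
A minimal sketch of that lookup, assuming each table’s parameters are stored as a JSON secret keyed by the table name; the secret fields shown here (record key, precombine field, partition column) are hypothetical examples of what such a configuration might contain.

```python
import json

import boto3


def get_table_config(table_name: str) -> dict:
    """Retrieve the per-table transformation parameters stored in Secrets Manager."""
    client = boto3.client("secretsmanager")
    response = client.get_secret_value(SecretId=table_name)
    return json.loads(response["SecretString"])


# The generic Glue job receives the table name as a job argument and looks up
# the fields it needs to build the Hudi writer options for that table.
config = get_table_config("sales_order")
record_key = config["record_key"]
precombine_field = config["precombine_field"]
partition_field = config["partition_field"]
```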

All the metadata of the tables is stored in the AWS Glue Data Catalog, including the Hudi tables. This catalog is used by the AWS Glue ETL job, Athena query engine, and QuickSight dashboard.

Athena queries

Users can then query the latest data for their report using Athena (component 3 in the architecture diagram). Athena is serverless, so there is no infrastructure to provision or maintain. We can immediately use SQL to query the data lake to create a report or ingest the data to the dashboard.
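
For example, a query like the following could be run against the catalog tables, either interactively in the Athena console or programmatically; the database name, table, columns, and results bucket below are hypothetical.

```python
import boto3

athena = boto3.client("athena")

# Hypothetical report query: aggregate the last day's orders from the Hudi
# table registered in the AWS Glue Data Catalog.
response = athena.start_query_execution(
    QueryString="""
        SELECT order_date, COUNT(*) AS orders, SUM(grand_total) AS revenue
        FROM sales_order
        WHERE order_date >= date_add('day', -1, current_date)
        GROUP BY order_date
    """,
    QueryExecutionContext={"Database": "datalake_transformed"},
    ResultConfiguration={"OutputLocation": "s3://ruparupa-athena-results/"},
)
print(response["QueryExecutionId"])
```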

QuickSight dashboard

Business users can use a QuickSight dashboard to query the data lake (component 4 in the architecture diagram). The existing dashboard was modified to get its data from Athena, replacing the previous Aurora MySQL source. New dashboards were also created to fulfill the continuously evolving needs for insights from multiple business units.

QuickSight is also used to notify certain parties when a metric reaches a defined threshold. An email alert is sent to an external notification and messaging platform so it can reach the end user.

Business results

The data lake implementation at Ruparupa took around 3 months, with an additional month for data validation, before it was considered ready for production. With this solution, management can view the current state of the business with data that is at most 1 hour old. Previously, they could only generate weekly reports; now insights are available 168 times faster.

The QuickSight dashboards, which can be updated automatically, reduce the manual effort required from the analytic team. The dashboards now have more content: not only transactional data, but also other metrics such as new SKUs, operational escalations for free services to customers, and SLA monitoring. Since April 2021, when Ruparupa started their QuickSight pilot, the number of dashboards has grown to around 70 based on requests from business users.

Ruparupa has hired new personnel to join the data analytic team to explore new possibilities and new use cases. The analytic team has grown from just one person to seven to handle new analytic use cases:

  • Merchandising
  • Operations
  • Store manager performance measurement
  • Insights of trending SKUs

Kawan Lama Group also has offline stores besides the ecommerce platform managed by Ruparupa. With the new dashboard, it’s easier to compare transaction data from online and offline stores because they now use the same platform.

The new dashboards can also be consumed by a broader audience, including other business units in Kawan Lama Group. The number of users consuming the dashboards has increased from just 20 people in management to around 180 users (a 9-times increase).

Since the implementation, other business units in Kawan Lama Group have increased their trust in the S3 data lake platform implemented by Ruparupa, because the data is more up to date and they can drill down to the SKU level to validate that the data is correct. Other business units can now act faster after an event like a marketing campaign. This data lake implementation has helped increase sales revenue in various business units in Kawan Lama Group.

Conclusion

Implementing a real-time data lake using Amazon S3, Apache Hudi, AWS Glue, Athena, and QuickSight gave Ruparupa the following benefits:

  • Yielded faster insights (hourly compared to weekly)
  • Unlocked new insights
  • Enabled more people in more business units to consume the dashboard
  • Helped business units in Kawan Lama Group act faster and increase sales revenue

If you’re interested in gaining similar benefits, check out Build a Data Lake Foundation with AWS Glue and Amazon S3.

You can also learn how to get started with QuickSight in the Getting Started guide.

Last but not least, you can learn about running Apache Hudi on AWS Glue in Writing to Apache Hudi tables using AWS Glue Custom Connector.


About the Authors

Olivia Michele is a Data Scientist Lead at Ruparupa, where she has worked in a variety of data roles over the past 5 years, including building and integrating Ruparupa data systems with AWS to improve user experience with data and reporting tools. She is passionate about turning raw information into valuable actionable insights and delivering value to the company.

Dariswan Janweri P. is a Data Engineer at Ruparupa. He treats challenges and problems as interesting riddles and finds satisfaction in solving them, and even more in being able to help his colleagues and friends at the same time: “two birds, one stone.” He is excited to be a major player in Indonesia’s technology transformation.

Adrianus Budiardjo Kurnadi is a Senior Solutions Architect at Amazon Web Services Indonesia. He has a strong passion for databases and machine learning, and works closely with the Indonesian machine learning community to introduce them to various AWS Machine Learning services. In his spare time, he enjoys singing in a choir, reading, and playing with his two children.

Nico Anandito is an Analytics Specialist Solutions Architect at Amazon Web Services Indonesia. He has years of experience working in data integration, data warehouses, and big data implementation in multiple industries. He is certified in AWS data analytics and holds a master’s degree in the data management field of computer science.