How ALICE uses AWS Glue to Solve Complex Data Migration Challenges
September 8, 2021: Amazon Elasticsearch Service has been renamed to Amazon OpenSearch Service.
Guest post by Michael Dreikorn, Tech Lead, ALICE
In this article, we will discuss the ALICE product offering, how we leveraged AWS Glue to migrate customers from a large acquisition onto our platform, some of the challenges we faced using this technology, and how we solved them.
ALICE offers a platform for hotel staff to deliver exceptional hospitality to their guests. We do this with a suite of integrated products that help every department of a hotel get their work done. ALICE is used by hotel staff as both a task management and communication system. Staff use our messaging products to communicate work with other staff members and to engage guests using channels such as SMS or Facebook Messenger. ALICE is also used by hotel managers to ensure that service is being delivered in a consistent manner and that the hotel complies with the brand standards. ALICE is used by over 2,500 of the world’s leading hotels across 75 countries.
At the end of 2017, ALICE acquired a large competitor, GoConcierge, with a global customer base of over 1,000 hotels. GoConcierge was built around the concierge experience, and we needed to migrate those customers to ALICE so they could benefit from the platform. ALICE offered a similar concierge product, and our challenge was to upgrade these customers without getting in the way of their operations.
Concierges live and die by the data that they build up over the years about their guests' preferences, guest activities, and local contacts. This information was stored in a complex data hierarchy, which was analogous to, but different from, the one used in ALICE. Our challenge was to develop a toolchain that could migrate hotels and all their historical data into our platform in a resilient, scalable, and fully automatic manner. We needed to do this quickly, within an hour, so that a concierge did not miss a beat. To make matters more complicated, GoConcierge offered two products, and both needed their own data migration.
On both sides, relational databases are used as the persistence layer, but the database vendors differ, as do the schema designs and the strong referential-integrity constraints within the data structures. Additionally, on the ALICE side, Amazon OpenSearch Service (successor to Amazon Elasticsearch Service) makes all data available for search. As a consequence of all of the above, a simple database-to-database dump was not a viable solution. Hotels operate 24/7, and so does the ALICE platform, so downtime for any kind of migration was not possible. ALICE needed to have data loaded in bulk via API in a way that was resilient, traceable, and would not disturb operations.
An additional layer of complexity was brought on by the fact that it was not an option to have just one big migration day in which all data would be brought over in one large effort. Moving a customer required training, and so every migration was contingent on the availability of the hotel to be trained, and our success team to perform the training. Also, every customer has their own timeline: some are happy early adopters of the ALICE platform, while others required certain functionality to be built before migration. This made migration timelines unpredictable, and the compute power required to move data was not well understood ahead of time since scheduling was constantly in flux.
Lastly, the data in question had significant diversity, and mapping it to ALICE was no easy feat. We needed a system that would allow us to flexibly adjust mapping logic, as sometimes it was based on complex algorithms that extended to different parts of the data hierarchy. We needed a way to build this logic in a stable and testable way, which also allowed multiple engineers to collaborate so we could finish the project on schedule, and evolve it when necessary.
Given the scenario outlined above, it was clear that we would need a technology that was highly versatile and flexible in transforming one data structure into another. Our overall requirements were as follows:
● Capability to derive the foreign key values of the target system from the data in the source on-the-fly
● Elastic infrastructure that would dynamically scale and adapt to the ever-changing amount of data to be moved at a certain moment without any human intervention
● Possibility to store the pre-transformed data in an intermediate database so we would be able to revise it before loading it into the destination database and iterate over it independently
● Clear separation between the ETL logic that would create the data, and loading, which would chunk that data and push it into the ALICE API in a throttled manner
● Ability to run an ETL in parts, or in a sequence, in the case that data issues were discovered and needed to be corrected
● Prebuilt tools to deal with data manipulation and libraries that made transposition simple and intuitive to do
● Ability to call all operations through an intuitive API so we could layer in a user interface for our success team to run the ETL and load the data via a task scheduling orchestrator
● Ability to run multiple hotels’ data loads concurrently
After looking into several different options it quickly became clear that AWS Glue would be a perfect fit to achieve our goals.
Using AWS Glue
The ALICE architecture leverages AWS Glue (shown in yellow above) to load the data from the source database, transform it into the target data model, and build new foreign keys for re-establishing the relationships within and between data sets. A data set may contain one or more tables and represents a logical concept such as a concierge task, hotel guest, or vendor. Building each data set requires establishing the appropriate parent-child relationships. Data sets may also be related to one another; for example, a concierge task is related to the hotel guest who requested the service and the vendor who is handling fulfillment.
The Glue code extracts 30+ multi-level data sets in the format of the destination system and deposits them into an intermediate database. We use the source system as a data mart to compose the target data sets required to interact with the destination system's API. Inserting into the destination API happens as a separate process; the real complexity lies in the logic used to generate each data set.
Data set generation logic varies across the hierarchy. Some data sets are a one-to-one mapping between source and destination, others are a composition, and others are algorithmically defined. Some data sets even require enrichment from third-party sources via external APIs. Since generating the data sets requires flexible logic, we use a combination of SQL, Spark, and Python. We use SQL at the top of the Glue function to extract an initial view of data from the source database as the foundation for each data set. We found that this was simpler and more flexible than using the out-of-the-box catalog function. It was also more performant on large tables to start with a query. We then transpose and enrich that data with Spark and Python to follow various heuristics-based rules, and ultimately deposit it in an intermediate DB. The logic for each data set is isolated in separate Glue functions, and shared libraries allow data sets generated earlier in the chain to be associated to form relationships.
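The SQL-first extraction pattern described above can be sketched as follows. This is only an illustration: the `guests` table, its columns, and the JDBC details are hypothetical, and the Spark session is assumed to come from the Glue job context.

```python
def guest_pushdown_query(hotel_id: int) -> str:
    """Initial SQL view of one hotel's guests, pushed down to the source
    database as a subquery instead of going through the Glue Data Catalog.
    (hotel_id is an int, so direct interpolation is safe here.)"""
    return f"""
        (SELECT g.id, g.first_name, g.last_name, g.email
           FROM guests g
          WHERE g.hotel_id = {hotel_id}) AS src
    """.strip()

def extract_guests(spark, jdbc_url: str, hotel_id: int):
    """Load the query result as a Spark DataFrame, which can then be
    transposed and enriched in Python before landing in the intermediate DB."""
    return (spark.read.format("jdbc")
            .option("url", jdbc_url)
            .option("dbtable", guest_pushdown_query(hotel_id))
            .load())
```

Passing a parenthesized subquery as `dbtable` is the standard Spark JDBC trick for starting from a query rather than a full table scan, which is what makes this faster than the catalog on large tables.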
Our ETL algorithm starts by generating the data sets that are fully independent. For each independent data set, we generate unique keys based on its attributes. Once all the independent data sets are generated and saved, we build a directed graph of the data sets that refer to them. For instance, a hotel is an independent data set that others, such as hotel guest, refer to. We then walk the graph and build the related data sets at each level, linking them to the data sets created earlier based on keys found in the lookup tables. New data sets generate new foreign keys that are used as lookups by subsequent levels. This procedure is repeated until we reach the leaf nodes of the graph.
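The level-by-level walk above can be sketched in a few lines of Python. The dependency map is a hypothetical miniature of the real hierarchy: each data set lists the data sets it refers to, and independent sets (no parents) form the first level.

```python
# Hypothetical dependency graph: data set -> data sets it refers to.
DEPENDS_ON = {
    "hotel": [],
    "guest": ["hotel"],
    "vendor": ["hotel"],
    "task": ["guest", "vendor"],
}

def build_order(depends_on):
    """Return the data sets grouped into build levels. A data set is
    built only after all of its parents, so the foreign keys generated
    at one level are available as lookups for the next."""
    built, order = set(), []
    while len(built) < len(depends_on):
        level = [ds for ds, parents in depends_on.items()
                 if ds not in built and all(p in built for p in parents)]
        if not level:
            raise ValueError("cycle in data set dependencies")
        order.append(sorted(level))
        built.update(level)
    return order
```

For the map above, `build_order` yields `hotel` first, then `guest` and `vendor` together, then `task`, mirroring how the real ETL walks from independent data sets down to the leaf nodes.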
Glue makes it easy to load the entire data set per customer into an underlying Apache Spark cluster and manipulate it at our convenience. The available computing resources can be adjusted by changing the requested Glue Data Processing Units (DPUs) for every execution of a job definition. We have yet to find a data set that is too large to process. This is where parent-child lookup logic is able to execute efficiently. Where possible, we create multiple Glue job instances, allowing for concurrent runs differentiated by passing the required DPUs as a parameter. This concurrency allows the whole integration to scale as we need it.
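Launching concurrent, DPU-parameterized runs can be sketched with the Glue API via boto3. The job name and the `--hotel_id` argument are hypothetical; `MaxCapacity` is the Glue API's parameter for the requested DPUs per run.

```python
def launch_hotel_runs(glue, job_name, hotel_ids, dpus):
    """Start one Glue job run per hotel, passing the hotel id as a job
    argument and the requested DPUs per run. `glue` is a boto3 Glue
    client (or a stub in tests); returns {hotel_id: run_id}."""
    run_ids = {}
    for hotel_id in hotel_ids:
        resp = glue.start_job_run(
            JobName=job_name,
            MaxCapacity=dpus,                        # DPUs for this run
            Arguments={"--hotel_id": str(hotel_id)},  # picked up by the script
        )
        run_ids[hotel_id] = resp["JobRunId"]
    return run_ids
```

Because each call starts an independent job run, several hotels can be migrated in parallel, with capacity chosen per run rather than fixed on the job definition.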
Once we generate all the necessary data sets in the intermediate Amazon Relational Database Service (Amazon RDS) database, we are ready to load them into ALICE via our API. Having the data sets fully built at rest allows us to load the hotel multiple times and into multiple destinations/environments. We first load the data into a QA environment, and if we find issues, we adjust the Glue scripts and re-run the ETL for the relevant data sets. Not needing to re-import the full hotel saves compute time and money. The source data is quite diverse, and many changes to the algorithm were necessary to fix problems, so having a decoupled code base allowed our team to iterate quickly on the project.
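The decoupled load step can be sketched as a chunked, throttled push into the destination API. The actual HTTP call is abstracted behind `post_fn` (a hypothetical callback), and the chunk size and delay are illustrative defaults.

```python
import time

def load_data_set(rows, post_fn, chunk_size=100, delay_s=0.5):
    """Push a fully built data set into the destination API in throttled
    chunks. Returning per-chunk results means a failed chunk can be
    retried or re-loaded without re-running the ETL."""
    results = []
    for start in range(0, len(rows), chunk_size):
        chunk = rows[start:start + chunk_size]
        results.append(post_fn(chunk))  # e.g. one bulk POST per chunk
        time.sleep(delay_s)             # simple throttle between chunks
    return results
```

Keeping the loader this dumb is deliberate: all the mapping complexity lives in the Glue ETL, so the same rows at rest in RDS can be pushed into QA and production alike.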
One additional benefit is that it was very easy to set up a CI/CD pipeline with Glue. This ETL requires thousands of lines of code, and Glue allowed us to separate everything into functions and keep our logic coherently managed. This enables the team to collaborate seamlessly and add engineers when the project needs to accelerate. As we write our logic in Python, deployment is as easy as pushing a zip archive with our code to a specific Amazon S3 bucket and using it as the Python library path for the job. All of this is fully automated, and we store the Python code in a repository that is pushed to Glue.
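Assuming boto3 and hypothetical bucket, key, and job names, that deployment step might look like the following sketch, which uploads the library zip and points the job's `--extra-py-files` special parameter at it.

```python
def deploy_glue_library(s3, glue, bucket, key, zip_path, job_name):
    """Upload the shared Python library zip to S3 and update the Glue
    job so its --extra-py-files argument references the new archive.
    `s3` and `glue` are boto3 clients (or stubs in tests)."""
    s3.upload_file(zip_path, bucket, key)
    job = glue.get_job(JobName=job_name)["Job"]
    args = dict(job.get("DefaultArguments", {}))
    args["--extra-py-files"] = f"s3://{bucket}/{key}"
    glue.update_job(
        JobName=job_name,
        JobUpdate={
            "Role": job["Role"],          # update_job requires these
            "Command": job["Command"],    # fields to be re-supplied
            "DefaultArguments": args,
        },
    )
```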
Running the actual ETL is orchestrated through a centralized Node.js layer, internally called the Orchestrator, which hits the Glue API to create the appropriate data sets in a specific sequence. The Orchestrator also gives our users a way to track the progress of jobs, using the Glue API to provide visibility and logging details. This allows our success team to run the ETL directly, without needing to involve our engineers. We developed a Task Scheduler system that allows jobs to be queued up and executed when appropriate for the customer.
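The real Orchestrator makes these calls from Node.js via the AWS SDK; a Python equivalent of the progress-tracking call is sketched below, with the field names taken from the Glue `GetJobRun` API.

```python
def job_run_status(glue, job_name, run_id):
    """Fetch one run's state and error (if any) for display in an
    Orchestrator-style progress view. `glue` is a boto3 Glue client."""
    run = glue.get_job_run(JobName=job_name, RunId=run_id)["JobRun"]
    return {
        "state": run["JobRunState"],        # e.g. RUNNING, SUCCEEDED, FAILED
        "error": run.get("ErrorMessage"),   # present only on failure
    }
```

Polling this per run id is enough to drive a simple UI, since Glue reports a small, well-defined set of run states.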
Since we are dealing with a lot of data it was also important for us to build agility in adjusting for edge cases missed in the initial testing. As data issues occur, the orchestrator provides sufficient details for us to triage the problem, make the Glue fix, and deploy quickly with our CI/CD chain. AWS Glue Support is very responsive, and our ability to re-run certain data sets on-demand keeps the trains running on time, allowing our concierges to serve their guests without interruption.
It is amazing how seamlessly Glue Jobs can be integrated into a technology landscape. We have now migrated over 500 properties with this tool, and it continues to be stable. An average hotel takes one hour to run end-to-end. We began with the first GoConcierge product and then implemented the second, while sharing common utilities. Glue made this all a very simple process and has been at the foundation of successfully upgrading our customers to the ALICE platform.