CallRail Paves the Way for Further Cloud Modernization by Migrating to Amazon RDS for PostgreSQL

By Lucas Ward, DevOps/Cloud Engineer – Ippon Technologies
By Wanchen Zhao, Sr. Partner Solutions Architect, DB Specialist – AWS
By David Mbonu, Sr. Solutions Architect – AWS

Ippon

Customers who move to managed databases innovate faster, spending more time building new applications and less time managing infrastructure.

Managed databases integrate seamlessly with other services (such as artificial intelligence and data warehouse), reduce database administrative burden, allow migrations without the need to re-architect existing applications, and achieve greater performance, availability, scalability, and security.

Amazon RDS for PostgreSQL is a fully managed relational database engine that’s easy to set up, operate, and provides a scalable relational database in the cloud. It offers cost-efficient and resizable capacity while managing time-consuming database administration tasks, allowing you to focus on your applications and business.

Ippon Technologies is an AWS Specialization Partner and AWS Marketplace Seller with Competencies in Migration, DevOps, and Data and Analytics. Ippon supports the digital transformation of companies and helps customers accelerate their journey of database migration and modernization.

In this post, we’ll share how Ippon assisted a customer’s successful migration from self-managed PostgreSQL to Amazon RDS for PostgreSQL. This migration alleviated the strain caused by substantial workloads on the database and unblocked cost savings by eliminating manual database maintenance and management.

Customer Use Case

CallRail offers comprehensive return on investment (ROI) calculations for marketing campaigns through its advanced Call Tracking Products, constituting its primary business application. Additionally, the company possesses a diverse range of capabilities beyond this core service.

CallRail manages several extensive and vital databases, among which is a large database, terabytes in size, previously self-managed by engineers on Amazon Elastic Compute Cloud (Amazon EC2). As is typical for enterprise workloads, it has a primary database, read replica, and a second “maintenance” read replica.

This database is the backbone of several applications, including CallRail’s primary call tracking database. Every time a phone call was tracked, several data points would move through this database. CallRail needed the ability to perform maintenance in a more predictable and controlled manner while allowing further modernizations efforts. Because of these requirements, a migration was pertinent.

Migration Challenges

The intricacy of the migration arose from three main factors:

Database in question held the utmost importance for the business, allowing for minimal to no downtime during the migration process. The primary database could not be stopped for any significant period, even for a maintenance window. Furthermore, the secondary (read replica) also could not be stopped. The separate “maintenance” read replica also maintains a strong disaster recovery posture.
Migration entailed a version upgrade of PostgreSQL v9.6 to PostgreSQL Engine v12. Migrating to v12 added complexity because PostgreSQL-native logical replication does not accommodate a skip version migration.
Sheer volume of data and the immense daily throughput presented obstacles that had to be carefully managed throughout the migration. The database size is approximately 4.5 TB, with 10s of billions of rows with a single table being upwards of 1 TB.

Migration Design and Process

To minimize performance impact and downtime to the primary database, the database was migrated in two phases as illustrated below.

Figure 1 – High-level architecture for database migration process.

Phase I – Full Load

Use pg_dump and \copy to get the bulk of the data transferred from the “maintenance” read replica in a paused state to ensure the consistency of the snapshot. PostgreSQL-native APIs for dump and restore are utilized for streamlining the operations and index creation after the initial bulk load. Multiple dumps are run in parallel to maximize the migration throughput, thus reducing the number of changes to capture in the second phase.

Phase II – Change Data Capture with AWS Database Migration Service

AWS DMS is a managed migration and replication service that helps move your database and analytics workloads to AWS quickly, securely, and with minimal downtime and zero data loss. Multiple AWS DMS tasks were used to transfer the remaining “data delta” into Amazon RDS PostgreSQL. The “data delta” is simply any data that has come into the database after pausing the “maintenance” read replica. Phase II is performed against the primary database.

The following figure illustrates the migration timeline for the two phases discussed above. Once the replication latency becomes zero and the data is validated, engineers perform a cutover to point the application to the target Amazon RDS PostgreSQL.

Figure 2 – Migration timeline for full load and change data capture.

Migration Tips and Best Practices

Leverage a Read Replica for Full Load

Logical dump is an input/output-intensive workload for a database. To minimize performance impact to the production database, it’s recommended to use a read replica for the full load phase. To ensure a consistent snapshot can be taken from the source database, pause the “maintenance” read replica to capture the log sequence number (LSN) for the replication to start from. Note that once the read replica is paused, it will begin accumulating xlogs or wal_logs on disk. Be sure to monitor your database system to ensure free space.

Split Up the Tables to Maximize Migration Throughput

To make the full load operation run as fast as possible, split the large database into varied pieces to load in parallel. One way to split the database is according to the size of the tables. For example, take your five largest tables and put them each in a separate screen session—running your pg_dump commands. Then, put all of the remaining and relatively smaller tables into the sixth screen session and kick them off at the same time.

The table splits are also utilized for creating replication sets and replication slots, as well as AWS tasks accordingly. Each AWS DMS task handles a subset of the replication workloads, which is better in terms of performance, scalability, and issue isolation.

Figure 3 – Example of splitting up the tables.

It’s also recommended to turn on Amazon RDS Performance Insights in your Amazon RDS PostgreSQL to monitor potential performance bottleneck, top SQL commands, and how far along the full load process is.

Choose a Proper Logical Replication Plugin

AWS DMS supports test_decoding and pglogical plugins for logical decoding. pglogical can perform selective decoding based on the replication scope of an AWS DMS task. In comparison to test_decoding, the pglogical plugin consumes less network bandwidth, CPU, and I/O during the replication for a use case that just requires replication for selective tables.

As shown in Figure 3 above, multiple Aws DMS tasks run in parallel for A subset of the tables in each task. Instead of decoding the write ahead log (WAL) entirely for each AWS DMS task, pglogical only decodes and transfers the changes per each task scope.

Verify the LSN to Start Replication From

The LSN captured before full load needs to be verified from the primary database for AWS DMS/CDC-only task to run smoothly, otherwise the task fails after one hour complaining the LSN cannot be found. Here’s an example of how to verify if the LSN is valid to start replication from when using pglogical plugin.

For PostgreSQL earlier than version 10, run this command on the source database:

SELECT location,xid,jsonb_pretty(data::jsonb) FROM pg_logical_slot_peek_changes('dms_slot_01',null, null, 'min_proto_version', '1', 'max_proto_version', '1', 'startup_params_format', '1', 'proto_format', 'json', 'pglogical.replication_set_names', 'dms_slot_01,idms_slot_01') where location>='{LSN_FROM_MAINTENANCE_READ_REPLICA}' limit 5;

For PostgreSQL version 10 or higher, replace location in the above command with “lsn”.

Be sure to replace dms_slot_01, idms_slot_01 with the name of your replication slots and replication sets. Replace {LSN_FROM_MAINTENANCE_READ_REPLICA} with the LSN. If the command returns the details associated with the LSN supplied, then it’s a valid position to specify in the AWS DMS/CDC-only task. If the output does not contain the LSN supplied, use the next LSN that’s higher than the one supplied.

Properly Handle Large Objects

When performing change data capture with AWS DMS, tables with large objects (LOBs) are handled according to the LOB handling mode setting. It’s recommended you use limited LOB mode whenever possible for performance and effective resource utilization of the AWS DMS replication instance.

To set limited LOB mode properly, first output a list of tables that contain LOBs. More importantly, for each table that has LOBs, a value is required (in kilobytes) of the largest possible LOB for that table. In addition, validate if there are any tables where LOB truncation is acceptable. If not, use table-level LOB setting with inline LOB mode. Note that only the tables with primary key or unique key can be replicated with LOBs.

If a LOB that is bigger than the size specified by the limited lob mode setting, and the table does not have a special table-setting for LOBs in the table mapping, then the following warning will be generated.

[SOURCE_CAPTURE ]W: Value of column 'column_name_001' in table 'schema.table_name_001' was truncated to 131072 bytes, actual length: 143687 bytes (postgres_pglogical.c:2550)

Improve CDC Latency with Batch Apply

Besides running AWS DMS tasks in parallel, AWS DMS provides the task-level setting BatchApplyEnabled to minimize the changes to apply to the target using batches. BatchApplyEnabled is useful if there’s high workload on the source DB, and a task with high target CDC latency.

It’s important to note that for AWS DMS batch apply to work properly, the table must have a primary key or a unique key (if no primary key exists), and no other unique constraints on other columns. If a table has both primary key and unique constraints on other columns, missing data in the target database would occur during replication. Ensure to disable BatchApplyEnabled for these types of tables.

Monitoring During Replication

As managed services, Amazon RDS PostgreSQL and AWS DMS provide rich monitoring tooling and services, including Amazon CloudWatch metrics and logs, Amazon RDS Enhanced Monitoring, Amazon RDS Performance Insights, AWS DMS control tables.

For monitoring replication progress and issue resolution:

Monitor replication log for AWS DMS tasks from Amazon CloudWatch. Search for log signatures annotated with E: and W: and use Amazon CloudWatch Logs Insights to interactively search and analyze your replication log.
Monitor CloudWatch metrics CDCLatencySource and CDCLatencyTarget for replication latency of AWS DMS tasks. Monitor CPUUtilization, FreeableMemory, SwapUsage, and FreeStorageSpace of the AWS DMS replication instance for resource utilization.
Monitor AWS DMS control tables, especially awsdms_status and awsdms_apply_exceptions for replication status and issues. For example, if the LOB truncation occurs to a JSON column, besides a warning in the replication log as the log example above, an entry with the timestamp, the SQL statement failed, and the error message is recorded by AWS DMS control table awsdms_apply_exceptions. Choose to back fill the correct JSON from the source table to the target accordingly.

Plan for Migration Disaster Recovery

Best practices dictate the need to plan for worst-case scenarios. As such, the following migration disaster recovery options are advised:

Resume from a failed task: This is useful when the issue is intermediate, and the WAL stopped or failed at is still available in the source database.
Partial restart: This is useful when only a small set of the tables fails to replicate but is not able to resume. Using pglogical and AWS DMS, it’s feasible to migrate a slice of the database.
Full restart: This is the last resort of the recovery attempt. Preserve the failure state and conduct a root cause analysis before restarting from fresh.

Besides the above, doing a dry run prior to the actual production migration will be helpful to find any unforeseen issues and fix them in advance.

Conclusion

By collaborating with Ippon and AWS, CallRail was able to realize numerous benefits. The successful migration of CallRail’s final database to Amazon RDS marked a significant milestone in its cloud modernization endeavors.

The company can now proceed with optimizing the same critical database for enhanced operational efficiency. This newfound operational efficiency instills a sense of assurance and liberates valuable engineering resources, enabling them to focus on other projects.

.

.

Ippon Technologies – AWS Partner Spotlight

Ippon is an AWS Partner that supports the digital transformation of companies and helps customers accelerate their journey of database migration and modernization.

Contact Ippon | Partner Overview | AWS Marketplace