How Tradeshift boosted operational efficiency and scalability with Amazon RDS

This is a guest post by Mircea Bud, Alexandra Munteanu, and Daniel Urzica from Tradeshift.

In 2023, Tradeshift migrated one of its core PostgreSQL databases from self-managed Amazon Elastic Compute Cloud (Amazon EC2) instances to Amazon Relational Database Service (Amazon RDS) for PostgreSQL. The decision followed mounting operational risks and performance limits that made the existing setup increasingly unsustainable.

The database had grown to 18TB and supported key backend services. It ran on aging infrastructure, with high storage usage and time-consuming recovery procedures. Performance degradation, patching delays, and architectural drift from the rest of our platform made continued investment in the EC2 setup unviable.

Tradeshift needed a managed solution that could reduce downtime risk, improve observability, and simplify ongoing operations. Amazon RDS met those requirements. In this post, we explain why we migrated to Amazon RDS, how we executed the migration, and what benefits it delivered in terms of safety, flexibility, and audit compliance.

The challenges of a self-managed PostgreSQL database on EC2

The self-managed EC2-based PostgreSQL setup had become increasingly difficult to maintain due to operational overhead. Recovery workflows were slow and largely manual. Recovery Time Objective (RTO) had risen to nearly 48 hours. Recovery Point Objective (RPO) hovered around one hour.

The cluster was provisioned on i3.metal instances with fixed NVMe storage, and capacity usage consistently exceeded 90 percent. Because i3.metal instances use local, fixed-size storage by design (to deliver the highest IOPS performance), increasing capacity required downtime and reconfiguration. The only alternatives were workarounds, such as attaching Amazon EBS volumes and altering the database schema to use additional tablespaces on them, to which the least critical tables could eventually be relocated.

Patch compliance was also a concern. Operating system and database patches weren’t applied frequently enough to satisfy audit standards. This was due to the complex and time-consuming manual process required for upgrades:

In the case of an OS upgrade, the legacy solution required spawning a new read-replica, allowing it to sync (which took hours), promoting it as the primary, and then replacing the former replica with another one. This workflow required a minimum of 20 minutes of downtime if all steps went well.
For a PostgreSQL version upgrade, downtime was even longer as the application service wouldn’t start without a read-replica present.

This cluster was the last in our fleet still managed with Puppet (customized manifests derived from the public puppetlabs-postgresql repo), which increased risk and reduced our ability to standardize operational practices.

Why we chose Amazon RDS for PostgreSQL

After reviewing options, we selected Amazon RDS for PostgreSQL. It provided a fully managed PostgreSQL environment with high compatibility and mature tooling.

Improvements in availability and recovery

Amazon RDS delivered substantial improvements in availability and recovery capabilities. After a data corruption event, a Point-in-Time Recovery (PITR) may require restoring a snapshot (RTO in minutes) and then replaying WAL from the archive (RTO in hours). For hardware failure, RTO is in the range of minutes. RDS also delivered much shorter RPO intervals, enhancing our data protection capabilities. Automated failover, snapshots, and RDS automated backups made recovery scenarios more predictable and less reliant on manual steps.

Simplified patching and audit readiness

Amazon RDS handles regular updates to both the PostgreSQL engine and the underlying operating system. This simplified our audit workflows and eliminated patch drift.

Better visibility into query behavior

The built-in Performance Insights dashboard helped us monitor workload patterns in real time. Our teams can identify and resolve slow queries more quickly with Amazon CloudWatch alarms and metrics. We also relied on several PostgreSQL extensions supported by RDS:

log_fdw and postgres_fdw for collecting OS-level logs within the database
pg_cron for scheduling internal database maintenance tasks
aws_s3 for interacting with Amazon Simple Storage Service (Amazon S3) as a long-term storage solution for audit logs and aging data

These tools improved our ability to detect and respond to performance issues without external automation or third-party agents.

Executing the migration with minimal downtime

Migrating a production system of this size required careful planning. The 18TB dataset had a steady stream of write traffic, and downtime had to be kept to a minimum.

We selected native PostgreSQL logical replication as our migration method. It offered full compatibility with our workloads and didn’t require OS-level access. With logical replication, we synchronized data incrementally without blocking application writes.

Our team designed the initial load to minimize the performance impact on the source database by spreading it across a two-week period. This involved dividing the work across 20 publications (using various table grouping criteria) and limiting the number of replication workers for each active logical replication slot.
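The grouping step described above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure we ran: it assumes the grouping criterion is table size alone (the real criteria varied), and the table names, sizes, and the `mig_pub` prefix are hypothetical. It assigns the largest tables first to whichever publication is currently lightest, then renders standard `CREATE PUBLICATION ... FOR TABLE` statements.

```python
import heapq


def plan_publications(table_sizes, n_publications):
    """Greedy balancing: place each table (largest first) in the lightest group."""
    heap = [(0, idx) for idx in range(n_publications)]  # (total size, group index)
    heapq.heapify(heap)
    groups = [[] for _ in range(n_publications)]
    ordered = sorted(table_sizes.items(), key=lambda kv: (-kv[1], kv[0]))
    for table, size in ordered:
        total, idx = heapq.heappop(heap)
        groups[idx].append(table)
        heapq.heappush(heap, (total + size, idx))
    return groups


def publication_sql(groups, prefix="mig_pub"):
    """Render one CREATE PUBLICATION statement per non-empty group.

    Assumes table names are already valid, schema-qualified identifiers;
    no quoting is applied here.
    """
    statements = []
    for number, tables in enumerate(groups, start=1):
        if tables:
            statements.append(
                f"CREATE PUBLICATION {prefix}_{number:02d} "
                f"FOR TABLE {', '.join(tables)};"
            )
    return statements


if __name__ == "__main__":
    # Hypothetical sizes in GB, for illustration only.
    sizes = {
        "public.invoices": 900,
        "public.documents": 700,
        "public.events": 400,
        "public.users": 100,
    }
    for statement in publication_sql(plan_publications(sizes, 2)):
        print(statement)
```

Enabling the resulting publications one at a time, rather than all at once, is what spreads the initial copy over a longer period. The per-slot worker limit is a separate subscriber-side setting and is not modeled here.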

We created a new Amazon RDS cluster, then set up replication slots to mirror changes from the self-managed PostgreSQL cluster (running on EC2). Once the replication lag was under control and the data was validated, we scheduled a short downtime window to perform the final cutover. Credentials and user configurations were migrated without changes. Kubernetes services were updated to point to the new database endpoints. IAM database authentication replaced manual credential handling for read access, which aligned with our existing standards for the rest of our Amazon RDS fleet.
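One way to make "replication lag under control" concrete is to compare the source's current WAL position (`pg_current_wal_lsn()`) with each slot's `confirmed_flush_lsn` from `pg_replication_slots`. The sketch below does that arithmetic in Python. It is an illustration under assumptions: the threshold and the sample LSN values are hypothetical, and in practice the same difference can be computed on the server with `pg_wal_lsn_diff`.

```python
def lsn_to_int(lsn):
    """Convert a PostgreSQL LSN such as '16/B374D848' to a byte position.

    The text form is two hexadecimal numbers: the high and low 32 bits.
    """
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def lag_bytes(source_lsn, confirmed_flush_lsn):
    """Bytes of WAL the subscriber has not yet confirmed."""
    return lsn_to_int(source_lsn) - lsn_to_int(confirmed_flush_lsn)


def ready_for_cutover(lags, max_lag_bytes):
    """True only if every slot is within the (hypothetical) lag threshold."""
    return all(0 <= lag <= max_lag_bytes for lag in lags)


if __name__ == "__main__":
    # Hypothetical LSN values for two slots.
    current = "16/B374D848"
    slots = {"slot_01": "16/B374D000", "slot_02": "16/B3700000"}
    lags = [lag_bytes(current, lsn) for lsn in slots.values()]
    print(lags, ready_for_cutover(lags, max_lag_bytes=1024 * 1024))
```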

Architectural improvements: A cleaner, more scalable architecture

The migration also helped us remove legacy components from our platform. We deprecated Consul-based service discovery, which had previously handled database endpoint resolution. Instead, Kubernetes-native service names now provide clean and consistent connectivity.

IAM authentication replaced manual user and password management for operational access. This improved security and simplified onboarding for new users and services.

We also introduced a new approach for querying databases across environments. By combining RDS IAM authentication with standard command-line tools (aws-cli, jq, psql), we were able to issue cross-instance queries within and across VPCs. These scripts replaced fragmented custom tooling and now provide visibility into the fleet in the form of a unified output record set.
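The pattern can be sketched as follows. Our scripts use aws-cli, jq, and psql directly; this Python version is only an approximation of the same flow, with Python's `json` module standing in for jq. It assumes the instance list comes from `aws rds describe-db-instances`, that the database user is enabled for IAM authentication, and that psql 12 or later is available (for `--csv`). The function names and sample values are hypothetical.

```python
import csv
import io
import json
import os
import subprocess


def endpoints_from_describe(describe_json):
    """Extract (identifier, host, port) from describe-db-instances JSON."""
    found = []
    for instance in json.loads(describe_json).get("DBInstances", []):
        endpoint = instance.get("Endpoint")
        if endpoint:  # instances still being created have no endpoint yet
            found.append(
                (instance["DBInstanceIdentifier"], endpoint["Address"], endpoint["Port"])
            )
    return found


def token_command(host, port, user, region):
    """aws-cli call that returns a short-lived IAM authentication token."""
    return ["aws", "rds", "generate-db-auth-token", "--hostname", host,
            "--port", str(port), "--username", user, "--region", region]


def psql_command(host, port, user, dbname, sql):
    """psql call that prints rows as CSV without a header line."""
    conninfo = f"host={host} port={port} user={user} dbname={dbname} sslmode=require"
    return ["psql", conninfo, "--no-psqlrc", "--csv", "--tuples-only",
            "--command", sql]


def tag_rows(identifier, csv_text):
    """Prefix every row with its instance identifier."""
    return [[identifier] + row for row in csv.reader(io.StringIO(csv_text)) if row]


def query_fleet(describe_json, user, dbname, region, sql):
    """Run one query on every instance and merge the rows into one record set."""
    records = []
    for identifier, host, port in endpoints_from_describe(describe_json):
        token = subprocess.run(token_command(host, port, user, region),
                               capture_output=True, text=True, check=True).stdout.strip()
        env = dict(os.environ, PGPASSWORD=token)  # the token is used as the password
        output = subprocess.run(psql_command(host, port, user, dbname, sql), env=env,
                                capture_output=True, text=True, check=True).stdout
        records.extend(tag_rows(identifier, output))
    return records
```

Tagging each row with the instance identifier is what turns many per-instance results into a single record set that can be sorted or filtered as a whole.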

Results and business impact

The migration produced measurable improvements in multiple areas of our platform operations.

Improved availability and resilience

Recovery from failure is now faster and more predictable. Recovery times decreased considerably, and the frequency of backup points increased, enhancing our resilience posture. The ability to scale compute and IOPS independently means we can adapt the database to real-time platform needs, especially during incidents or traffic spikes.

Operational efficiency

Maintenance tasks such as partition management and archiving are now handled internally using pg_cron. While task scheduling could be done in the EC2 setup using bash-crontab, adopting the pg_cron extension means scheduling, monitoring, and maintaining scheduled tasks are now reduced to pure SQL, eliminating the need for OS interaction. Performance monitoring has improved due to deeper integration between database metrics and our observability stack.
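As an illustration of what "pure SQL" scheduling looks like, the sketch below renders calls to pg_cron's `cron.schedule(job_name, schedule, command)` function. It is written in Python only to keep the examples in this post in one language; the job names, schedules, and the `maintenance.archive_old_partitions` procedure are hypothetical and are not our actual jobs.

```python
def sql_literal(value):
    """Quote a string as a SQL literal by doubling embedded single quotes.

    Assumes standard_conforming_strings is on (the PostgreSQL default).
    """
    return "'" + value.replace("'", "''") + "'"


def cron_schedule_sql(job_name, schedule, command):
    """Render a pg_cron scheduling call for one named job."""
    return (
        f"SELECT cron.schedule({sql_literal(job_name)}, "
        f"{sql_literal(schedule)}, {sql_literal(command)});"
    )


if __name__ == "__main__":
    # Hypothetical maintenance jobs, for illustration only.
    jobs = [
        ("nightly-vacuum", "0 3 * * *", "VACUUM (ANALYZE) public.events"),
        ("archive-partitions", "30 2 * * 0", "CALL maintenance.archive_old_partitions()"),
    ]
    for job in jobs:
        print(cron_schedule_sql(*job))
```

Once scheduled, jobs are visible in the `cron.job` table, and recent pg_cron versions record each run in `cron.job_run_details`, so monitoring is also a query rather than a log file on the host.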

Better developer experience

IAM-based access control has replaced manual credentials, making developer onboarding and user provisioning simpler and more secure. Query tuning and issue diagnosis are faster due to the visibility provided by RDS Performance Insights and PostgreSQL telemetry extensions.

Platform-wide benefits

The RDS fleet now supports SSO-based cross-environment query execution. This allows our team to inspect and inventory database instances at scale. Previously, this level of access required separate tools and often multiple manual steps.

The migration has aligned one of our most critical backend services with the rest of our platform architecture and removed operational debt so we can scale more confidently as the business grows.

Conclusion

By migrating to Amazon RDS for PostgreSQL, we replaced legacy infrastructure with a managed service that delivers higher availability, better observability, and easier scaling. The combination of logical replication and Amazon RDS tools gave us a low-risk path to modernization, even with a large, active dataset.

The project eliminated long-standing technical debt, improved compliance, and reduced the operational overhead of maintaining a custom PostgreSQL EC2-based solution. We now have a standardized foundation for PostgreSQL that can evolve with the needs of the business, without compromising reliability or performance.

Visit Amazon RDS to learn more.

AWS Database Blog
