AWS Database Blog
Manage long-running transactions for AWS DMS performance
In this post, we show you how long-running transactions affect AWS Database Migration Service (AWS DMS) change data capture (CDC) latency, walk through monitoring approaches for Oracle, PostgreSQL, MySQL, and SQL Server, and provide ready-to-use scripts to identify and resolve problematic transactions before they impact your replication performance.
Long-running transactions are one of the most common and often overlooked causes of rising CDC source latency during AWS Database Migration Service (AWS DMS) migrations. When a database transaction stays open for hours instead of seconds, DMS must hold its replication position, buffering all subsequent changes until that transaction commits. The result is a cascading delay across all pending changes that can stall your migration pipeline right when low latency matters most, during your final cutover window. At that point, every minute of replication delay translates directly to extended downtime and higher migration risk.
By detecting these transactions early, you can avoid unnecessary replication instance scaling, reduce cutover downtime, and keep migration on schedule.
DMS source latency
When working with AWS DMS, source latency (CDCLatencySource) is a key metric. It measures the delay between the commit time of the last event captured from the source endpoint and the current system timestamp of the replication instance. In other words, it measures how far behind the replication instance is from the latest committed change at the source. Lower latency indicates healthier replication performance.
Several factors contribute to elevated source latency: an uncommitted transaction prevents DMS from applying/forwarding changes to the target, network throughput limitations between source and replication instance, source database performance bottlenecks, and heavy workload during peak processing times. These issues often interconnect, compounding their impact on replication performance.
To troubleshoot effectively, analyze latency patterns over time at hourly, daily, and weekly intervals. This helps you distinguish between temporary spikes during expected peak periods and systematic issues that indicate underlying problems.
The path to resolving source latency isn’t always straightforward, but it typically involves investigating six key areas:
- High database workload
- Network connectivity issues
- Insufficient system resources
- Large transaction volumes
- Long-running transactions
- Archival log processing delays
Each of these factors requires a different troubleshooting approach. Pinpointing the exact root cause before initiating remediation is essential for an effective fix. For the scope of this post, we focus on long-running transactions.
Long-running transactions
Long-running transactions are database operations that remain open for minutes or even hours. In AWS DMS, they can cause significant replication delays.
When a transaction stays open, DMS must retain all transaction log entries from the point that transaction began. For example, if a transaction opens at 9:00 AM and does not commit until 11:00 AM, DMS preserves two hours of log data. This increases storage consumption and memory usage on the replication instance.
The open transaction also blocks DMS from advancing its log-reading position. All changes that commit after the open transaction must wait in queue, causing CDC source latency to climb steadily. In high-transaction environments, a single long-running transaction can create a backlog that takes hours to clear after the transaction finally completes.
Common causes include complex queries processing large datasets, batch operations that run longer than expected, application code that holds transactions open unnecessarily, and missing commit or rollback statements. Each cause calls for a different troubleshooting approach.
To reduce the impact, set up automated monitoring to detect open transactions that exceed a defined threshold. Keep transaction duration as short as possible by optimizing application code and committing frequently. Avoid user interaction within open transactions. Also make sure the replication instance has sufficient storage and computing resources to handle temporary backlogs.
Some long-running transactions are unavoidable for certain business operations. The goal is not to eliminate them entirely but to detect them early and prevent them from stalling your replication pipeline.
How long-running transactions impact DMS source latency
Long-running transactions directly impact your AWS DMS CDC source latency. Identifying them is the first step to troubleshooting performance issues.
How DMS processes transactions
When DMS operates in CDC mode, it captures changes from the source database’s transaction logs (such as Oracle redo logs, SQL Server transaction logs, PostgreSQL WAL, or MySQL binlogs). However, DMS can only capture and replicate committed transactions to maintain data consistency.
A bottleneck forms because DMS cannot skip past the open transaction:
CDCLatencySourceincreases from seconds to potentially hours.CDCIncomingChangesaccumulates.CDCChangesMemorySourceorCDCChangesDiskSourceincreases as DMS buffers pending changes.
For instance, consider a batch job that starts a transaction at 10:00 AM and runs for 2 hours without committing. Any changes captured during that window cannot be applied to the target until the transaction commits at 12:00 PM. The DMS replication task is the unit of work responsible for reading the source transaction log and applying changes to the target. During this period, the task buffers all subsequent changes it has read, regardless of whether those changes involve related tables. It maintains transactional ordering within its scope, so no captured changes can be applied until the open transaction completes at noon. This creates a significant replication delay that grows proportionally with the transaction’s duration.
Diagnostic approach
Identifying long-running transactions helps you differentiate between DMS configuration issues and database-level problems when diagnosing source latency. This identification process follows a systematic approach:
- Initial investigation: Monitor
CDCLatencySourceto identify when replication is falling behind, then investigate replication instance metrics and source database open transactions to determine the root cause. - Root cause analysis: Use monitoring scripts to identify active long-running transactions and correlate transaction start times with
CDCLatencySourcespikes. Track specific identifiers such as Oracle SCN (such as 12345678) or SQL Server LSN (such as 0000000A:00000B00:0001). - Targeted solutions: Once identified, you can request a commit or rollback from the application team, optimize the logic to use smaller transaction batches, or implement transaction timeout policies.
Without identifying these long-running transactions, teams often waste time tuning DMS task settings (like
MemoryLimitTotalor parallel threads) or scaling up replication instances unnecessarily, when the actual solution requires addressing the source database’s transaction management practices. The monitoring scripts thus serve as a diagnostic bridge between observing DMS performance symptoms and pinpointing their database-level causes.
Queries to identify long-running transactions
To proactively manage this issue, implementing alerts for long-running transactions is essential. Here are the monitoring queries for each database engine:
For Oracle source engine:
For PostgreSQL source engine:
For MySQL source engine:
For SQL Server source engine:
Monitoring scripts
We’ve developed monitoring scripts for Oracle, PostgreSQL, MySQL, and SQL Server. These scripts alert when long-running transactions stall DMS replication. They detect and alert on long-running transactions (default: 15 minutes). Each script is designed for its specific database engine while following a consistent monitoring framework.
The full scripts are available on GitHub: sample-dms-long-running-transaction-detection.
Infrastructure requirements
- Linux / Unix environment (bash shell required)
- Appropriate database client tools (
sqlplus,psql,mysql, orsqlcmd) - An Amazon Simple Notification Service (Amazon SNS) topic for receiving alert notifications
- AWS Command Line Interface (AWS CLI) configured with IAM permissions for SNS publish
- Proper database access permissions
- Network connectivity from the monitoring host to the database endpoint
Common features across all scripts
All four scripts share a consistent design pattern with the following features:
| Feature | Description |
| Interactive Prompts | All connection parameters are collected through interactive prompts at runtime. Credentials are securely retrieved from AWS Secrets Manager |
| TLS/SSL Encryption | All database connections are encrypted by default: TCPS for Oracle, -N flag for SQL Server, sslmode=require for PostgreSQL, --ssl-mode=REQUIRED for MySQL |
| AMAZON SNS Alerts | Alerts are published to a configurable AMAZON SNS topic, enabling distribution to multiple subscribers (email, SMS, AWS Lambda, and more) |
| Prerequisite Checks | Each script validates that required tools (AWS CLI, database client) are installed before proceeding |
| False Positive Filtering | Internal/system processes and client warnings are filtered out to prevent false alerts |
Core functions
Each script implements these core functions:
collect_inputs(): Interactive prompt-based parameter collection.validate_inputs(): Input validation and prerequisite checks.check_long_sessions()/check_long_transactions(): Core monitoring query execution.send_alert(): SNS alert publishing with formatted message.log_message(): Timestamped logging to file.build_*_connstr()/build_*_args(): TLS/SSL connection string construction.main(): Orchestration with configuration summary and log directory setup.
Script details by engine
Oracle long-running session monitor
This bash script automatically detects Oracle database sessions running longer than the configured threshold and publishes alerts to Amazon SNS. It connects using the TCPS protocol for TLS/SSL encryption, with optional Oracle Wallet support for certificate-based authentication.
Usage:
Interactive prompts:
TLS/SSL configuration: The script uses TCPS protocol by default (port 2484). For Oracle Wallet-based SSL, set the following environment variables before running:
Key filters: Excludes BACKGROUND processes and the RDSADMIN user. Only monitors ACTIVE sessions.
PostgreSQL long-running transaction monitor
This bash script automatically detects PostgreSQL transactions exceeding the configured threshold and publishes alerts to Amazon SNS. It enforces TLS/SSL via the sslmode connection parameter.
Usage:
Interactive prompts:
TLS/SSL configuration: Default sslmode=require. For stricter verification, set environment variables:
False positive prevention: The script separates stderr from stdout to prevent psql client warnings (such as libpq.so version messages) from being treated as transaction data.
MySQL long-running transaction monitor
This bash script automatically detects MySQL transactions running longer than the configured threshold, providing detailed reports and publishing alerts to Amazon SNS. It enforces TLS/SSL via the --ssl-mode parameter.
Usage:
Interactive prompts:
TLS/SSL configuration: Default --ssl-mode=REQUIRED. For stricter verification with the RDS CA bundle:
False positive prevention: The script filters out MySQL internal processes (event_scheduler, rdsadmin, system user, Daemon commands) and strips the mysql: [Warning] Using a password on the command line interface can be insecure warning from output.
SQL Server long-running transaction monitor
This bash script automatically detects SQL Server transactions exceeding the configured threshold and publishes alerts to Amazon SNS. It enforces TLS/SSL via the -N flag with sqlcmd.
Usage:
Interactive prompts:
TLS/SSL configuration: The script uses -N (encrypt) and -C (trust server certificate) by default. For strict certificate verification, install the RDS CA bundle into the system trust store and set:
False positive prevention: Filters to is_user_process = 1 only and strips (N rows affected) noise from sqlcmd output. SQL text is truncated to 200 characters.
Sample SNS alert output
When a long-running transaction is detected, the script publishes a formatted alert to the configured SNS topic. Here is an example alert from the Oracle monitor:
All subscribers to the SNS topic (email, SMS, Lambda, and more) receive this alert, so the operations team can respond quickly.
Sample log output
Scheduling with crontab
For continuous monitoring, configure the scripts as cron jobs. Since the scripts use interactive prompts, you can create a wrapper script that pipes the inputs, or modify the scripts to retrieve credentials from AWS Secrets Manager. Alternatively, for crontab usage, you can set the variables directly in a non-interactive wrapper:
Note: The scripts use bash-specific syntax (such as, [[ =~ ]] for regex matching). Always invoke them with bash script_name.sh rather than sh script_name.sh.
Demonstration
We are going to perform a demo of the script in action by choosing Oracle as our source endpoint and check if the script is indeed able to capture the long-running sessions. For the demo we have used Amazon Relational Database Service (Amazon RDS) for Oracle 19c Enterprise Edition as source:
- Create a full load and CDC task to load the
EVENTS_LOBtable. Table structure: - Schedule the monitoring script in crontab to execute every 15 minutes:
- Start the DMS task and have it completed the full load and move on to the CDC phase of the migration task.
- Generate high volume of DML in the source with commit and wait for the latency spike to happen:
- Observe the DMS task Amazon CloudWatch metrics. You will see an increase in
CDCLatencySourceas the long-running transaction holds up replication. - Execute the monitoring script. The script detects the long-running session and publishes an alert to the SNS topic with full session details including SID, username, duration, SQL text, and wait events.
- SNS subscribers receive the alert. All subscribers to the configured SNS topic (email, SMS, Lambda, and more) receive the formatted alert, so the operations team can investigate and take action. A sample alert is included in the following section for reference:
Security considerations
The monitoring scripts incorporate several security best practices:
- No command-line credentials: All connection parameters are collected via interactive prompts, preventing password exposure in
psoutput, shell history, or process listings. - TLS/SSL encryption: All database connections are encrypted by default across all four engines.
- IAM-based alerting: Alerts use Amazon SNS with IAM authentication, so no email credentials are stored in the scripts.
- Password: Credentials are collected via silent interactive prompts (
read -s), preventing exposure in shell history or process listings. For automated scheduling, the scripts support--secret-idto retrieve credentials directly from AWS Secrets Manager, eliminating plaintext passwords entirely. - SQL ID only: For security, scripts report only
SQL_ID/query_idreferences, not actual SQL text. Use these identifiers to look up queries in database monitoring views if needed. - Temporary file cleanup: All temporary SQL files and stderr capture files are cleaned up after use.
Conclusion
Long-running transactions can increase your CDC source latency from seconds to hours and cause replication delays that grow until the transaction commits. The monitoring scripts provided for Oracle, PostgreSQL, MySQL, and SQL Server offer a proactive solution to identify these problematic transactions before they become critical issues. As demonstrated in our Oracle example, these tools effectively detect and alert administrators about long-running sessions, enabling timely intervention and optimization.
The scripts enforce TLS/SSL encryption on all database connections, use Amazon SNS for scalable alert distribution, and include false-positive filtering to make sure alerts are actionable. By implementing this monitoring suite, you transform from reactive troubleshooting to proactive management. This post helps you identify potential latency issues before they impact database operations or AWS DMS replication, ensuring smooth, efficient database operations across all your platforms.