AWS Database Blog
AWS DMS implementation guide: Building resilient database migrations through testing, monitoring, and SOPs
AWS Database Migration Service (AWS DMS) simplifies database migration and replication, offering a managed solution for customers. Our observations across numerous enterprise deployments indicate that investing time in proactive database migration planning yields substantial returns. Organizations that embrace comprehensive setup strategies consistently experience fewer disruptions and achieve better migration outcomes.
In this post, we present proactive measures for optimizing AWS DMS implementations from the initial setup phase. By using strategic planning and architectural foresight, organizations can enhance their replication system’s reliability, improve performance, and avoid common pitfalls.
We explore strategies and best practices across the following key areas:
- Planning and running proof of concept (PoC)
- Implementing systematic failure testing
- Developing standard operating procedures (SOPs)
- Monitoring and alerting
- Applying AWS Well-Architected Framework principles
Planning and executing a PoC
Performing a PoC helps to discover and remediate environmental issues early. It also helps generate information that you can use to estimate your overall migration time and resource commitments.
Following are the high-level steps for conducting a successful PoC:
- Plan and deploy a testing environment with adequate AWS DMS replication instances, tasks, and endpoints. For more information on planning and provisioning resources, refer to Best practices for AWS Database Migration Service.
- Use a production-like workload. Simulate your production environment as closely as possible to increase the likelihood of encountering a variety of issues.
- Execute failure testing based on the scenarios discussed in the table in the next section.
- Keep track of resource utilization and bottlenecks that occur during the PoC and revisit planning and deployment of resources accordingly.
- Document your observations and perform a migration assessment by comparing them with your business outcomes. This includes evaluating migration recovery times and the application Service Level Agreements (SLAs) for both migration activities and ongoing business operations. If these migration and operational requirements aren't met, revisit the planning phase to align with your business needs.
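The assessment step can be as simple as comparing measured recovery times against your SLA targets. The following minimal Python sketch illustrates this; the scenario names and thresholds are hypothetical examples, not values produced by AWS DMS.

```python
# Hypothetical helper for summarizing PoC observations against SLA targets.
# Scenario names and thresholds below are illustrative only.

def assess_poc(observations, sla_targets):
    """Compare measured recovery times (seconds) against SLA targets.

    observations: {scenario: measured_recovery_seconds}
    sla_targets:  {scenario: max_allowed_seconds}
    Returns the scenarios that exceeded their target.
    """
    failures = []
    for scenario, measured in observations.items():
        target = sla_targets.get(scenario)
        if target is not None and measured > target:
            failures.append(scenario)
    return failures

observed = {"multi_az_failover": 95, "task_restart_after_log_purge": 1400}
targets = {"multi_az_failover": 120, "task_restart_after_log_purge": 600}
print(assess_poc(observed, targets))  # → ['task_restart_after_log_purge']
```

Scenarios that appear in the output are the ones whose recovery procedures need rework before you leave the planning phase.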
Implementing systematic failure testing
All systems, regardless of their robustness, can experience failures and downtime. For organizations running critical workloads, proactive planning becomes essential to maintain business continuity and meet SLAs. This section provides a strategic framework for developing SOPs that establish clear recovery protocols and minimize operational impact during system disruptions.
When implementing AWS DMS, understanding potential failure points becomes crucial for building resilient systems. The following table outlines common failure scenarios encountered in AWS DMS replication systems, serving as a foundation for your testing strategy. Although it’s comprehensive, we encourage you to expand upon these scenarios based on your specific architecture, compliance requirements, and business objectives to achieve complete coverage of potential failure modes in your environment.
| Points of failure | Scenarios with potential downtime | Test | Potential mitigation strategy |
| --- | --- | --- | --- |
| Source and target database | Performance bottleneck on the database server, such as high CPU or memory constraints | Generate high load on the database server using a benchmarking tool such as sysbench. | Provision a read-only database node for the engines AWS DMS supports, using a read replica as the source (refer to Sources for data migration). You can also scale the database resources and optimize database parameters. |
| | Data access issues due to insufficient privileges | Use a database user for AWS DMS with insufficient privileges. | Create a database user following the least-privilege principle. Refer to the respective AWS DMS source endpoint documentation for the privileges DMS requires for each database engine. |
| | Database failover (if using a primary/standby setup) | Perform a database failover from the primary node to the secondary node. | If AWS DMS attempts to connect to the old primary after failover, the behavior depends on the DNS Time to Live (TTL); restart the task after the TTL is refreshed. Refer to Why is my connection redirected to the reader instance when I try to connect to my Amazon Aurora writer endpoint? |
| | Database shutdown or failure | Stop the database during ongoing DMS replication. | Record the DMS task behavior for the SOP and resume the task after fixing the database issue. |
| | Unavailability of transaction logs | Purge the logs using a short retention period while the task is offline or running behind. | Record the DMS task behavior for the SOP and resume the task after making the transaction logs available, or perform a fresh full load if the logs are unavailable. |
| | Structural changes to schemas, tables, indexes, partitions, and data types | Run different data definition language (DDL) statements for relevant table modifications. | See the list of supported DDL statements and task settings. |
| Network failures (applicable to source and target) | Connectivity issues, including network, DNS, and SSL failures | Remove the source IP from the AWS DMS security group or modify iptables; remove certificates from the DMS endpoint; modify the MTU (maximum transmission unit) values. | Refer to Troubleshooting AWS DMS endpoint connectivity failures and Issues connecting to Amazon RDS. |
| | Packet loss | Use the traffic control (tc) command on Linux systems, or use AWS Fault Injection Simulator (FIS). | Refer to Troubleshooting networking issues during database migration with the AWS DMS diagnostic support AMI and Working with the AWS DMS diagnostic support AMI. |
| AWS DMS failures | Rebooting a Single-AZ replication instance | Reboot the AWS DMS replication instance. | DMS automatically resumes the tasks after the replication instance reboots. |
| | Rebooting a Multi-AZ replication instance with failover during ongoing replication | Reboot the AWS DMS replication instance, selecting the "Reboot with planned failover?" option. | DMS automatically resumes the tasks after the Multi-AZ failover of the replication instance. |
| | EBS storage full | Enable detailed debug logging for multiple log components so that AWS DMS logs fill the storage. | Set up alerting at 80% storage capacity and scale the storage volume associated with the DMS replication instance. For more information, refer to Why is my AWS DMS replication DB instance in the storage-full status? |
| | Applying changes in the maintenance window | Modify a configuration for your DMS replication instance that results in downtime, selecting the "Apply during the next scheduled maintenance window" option. | DMS resumes the tasks automatically after the maintenance activity. |
| | Resource contention on the replication instance (high CPU, memory contention, higher-than-baseline IOPS) | Create multiple DMS tasks with a high value for MaxFullLoadSubTasks on a small DMS replication instance. | Set up monitoring and alerting on critical CloudWatch metrics, as discussed in the Monitoring and alerting section. Scale up the instance class, or move tasks to a new replication instance. |
| | DMS replication instance version upgrade | Upgrade the DMS replication instance version. DMS deprecates older versions, which requires you to upgrade the replication instance. For more information, refer to the AWS DMS release notes. | To minimize the downtime associated with this activity, conduct a thorough PoC first. Afterwards, create new replication instances running the latest DMS version and move your tasks during off-peak hours when change data capture (CDC) latency is zero or minimal. For more information, refer to Moving a task and Perform a side-by-side upgrade in AWS DMS by moving tasks to minimize business impact. |
| Data issues | Data duplication | Run a full-load-only task twice: stop the task partway through the first run, then run it a second time with the "DO NOTHING" setting for target table preparation mode. | Use DMS validation for supported database engines. If validation reports a mismatch, investigate based on the exact error. To mitigate, perform a backfill by creating a full load task or a table reload (if available) for the affected tables, followed by an ongoing replication task. |
| | Data loss | Create triggers on the target that delete or truncate random records. | We recommend using DMS validation to catch these issues. Perform a table or task reload, or create a new full load and CDC task to freshly load the affected tables. |
| | Table errors | Acquire an access exclusive lock on tables before the DMS task starts, or use unsupported data types. | These errors can be caused by an unsupported feature or configuration with DMS and require investigation based on the exact error. For more information, refer to Why is my AWS DMS task in an error status? |
| Latency issues | Swap file accumulation on the replication instance | Start long-running transactions with a high number of changes and monitor the CloudWatch metric CDCChangesDiskSource. | Monitor the CDCChangesDiskSource and CDCChangesDiskTarget metrics. For creating the SOP, refer to this Knowledge Center article: What are swap files and why do the files use space on my AWS DMS instance? |
| | BatchApply failures | Delete a record on the target, then update the same record on the source with BatchApply enabled on the task. | Set up alerting on DMS CloudWatch logs for "Bulk apply operation failed" (refer to the Monitoring and alerting section for detailed instructions). For troubleshooting and creating SOPs, refer to this Knowledge Center article: How can I troubleshoot why Amazon Redshift switched to one-by-one mode? |
| Data validation issues | Missing source or missing target records | Emulate by removing data on the source or the target. | Review the supported use cases and limitations of AWS DMS data validation, and refer to the following Knowledge Center article: Why did my AWS DMS task validation fail, or why isn't the validation progressing? |
| | Record differences | Create different table schemas on the source and target to emulate this scenario. | See the row above. |
| | No eligible primary/unique keys found | Run validation against a table without a primary or unique key, or one that uses LOB or other unsupported data types. | Validation requires a primary key or unique key on the table; LOB and a few other data types are unsupported with DMS validation. For more details, refer to the validation limitations. |
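Some of the DMS-side scenarios above can be scripted for repeatable testing. The following Python sketch builds the request for the Multi-AZ reboot-with-failover test; it assumes boto3 is available, the ARN is a placeholder, and the parameter names follow the RebootReplicationInstance API. Run it only against a disposable test instance.

```python
# Sketch of driving the Multi-AZ reboot-with-failover test via boto3.
# The ARN and helper names are illustrative, not part of AWS DMS itself.

def reboot_params(instance_arn, planned=True):
    """Build parameters for dms.reboot_replication_instance.

    planned=True requests a planned failover (the console's
    "Reboot with planned failover?" option); DMS should resume
    tasks automatically after the failover completes.
    """
    return {
        "ReplicationInstanceArn": instance_arn,
        "ForcePlannedFailover": planned,
    }

def reboot_with_failover(dms_client, instance_arn):
    """Trigger the reboot (not invoked here; pass boto3.client('dms'))."""
    return dms_client.reboot_replication_instance(**reboot_params(instance_arn))

# Example parameters for a hypothetical test instance:
params = reboot_params("arn:aws:dms:us-east-1:123456789012:rep:EXAMPLE")
print(params)
```

Separating parameter construction from the API call makes it straightforward to review and unit test the failure-injection step before pointing it at a real instance.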
By systematically testing these scenarios and documenting the results, organizations can develop robust recovery procedures that address both common and unique failure modes. This proactive approach not only helps maintain system reliability but also provides operations teams with clear protocols for addressing issues when they arise.
Developing standard operating procedures (SOPs)
During failure testing scenarios, carefully document the impact of each issue on your replication system. This documentation forms the foundation for creating customized SOPs that your team can rely on when managing AWS DMS implementations. The mitigation strategies outlined in our failure testing framework serve as an excellent starting point for developing these procedures.
Your initial SOPs will emerge during the early phases of PoC testing. These procedures should be considered living documents, requiring regular updates and refinements as you gain more operational experience and encounter new scenarios. The dynamic nature of database migrations means that your SOPs will evolve alongside your understanding of system behavior and emerging challenges.
For additional guidance on handling complex migration scenarios, we recommend reviewing our comprehensive three-part blog series on debugging AWS DMS migrations. These resources provide valuable insights that can help you develop systematic approaches to troubleshooting, even for situations not covered by your existing SOPs. You can find these detailed guides at:
- Debugging Your AWS DMS Migrations: What to Do When Things Go Wrong (Part 1)
- Debugging Your AWS DMS Migrations: What to Do When Things Go Wrong (Part 2)
- Debugging Your AWS DMS Migrations: What to Do When Things Go Wrong? (Part 3)
By documenting and testing these procedures, organizations can accurately measure and validate their replication system’s ability to meet SLAs, particularly during critical failure events. This proactive approach helps identify potential bottlenecks and areas for improvement in your disaster recovery strategy, ultimately strengthening your data replication architecture’s resilience and reliability.
When designing your data replication strategy using AWS DMS, it's crucial to establish comprehensive contingency plans for scenarios involving service unavailability or data discrepancies. A thorough evaluation of your business's recovery time objective (RTO) and recovery point objective (RPO) should drive the development of SOPs. This strategic planning not only promotes business continuity but also provides valuable insights into the actual performance of your replication system during failure scenarios.
Monitoring and alerting
Maximizing AWS DMS effectiveness requires a strategic approach to monitoring and reporting capabilities. A robust monitoring framework is essential for maintaining seamless replication operations and promoting data integrity throughout the migration process.
Configuring appropriate alerts during initial setup provides real-time visibility into replication tasks and enables quick response to anomalies. These monitoring capabilities act as an early warning system, helping maintain the health and efficiency of your database migration infrastructure.
Proactive monitoring and alerting implementation enhances operational reliability while providing insights into resource utilization and performance patterns. This systematic approach enables data-driven decisions and maintains optimal replication performance throughout the migration lifecycle.
AWS DMS provides the following monitoring features:
- Amazon CloudWatch metrics – AWS DMS automatically publishes these metrics, giving you insights into resource utilization at the individual task and replication instance level. For a list of all the metrics available with AWS DMS, refer to AWS Database Migration Service metrics.
- AWS DMS CloudWatch logs and Time Travel logs – AWS DMS generates error logs based on the logging level you set for each component. For more information, refer to Viewing and managing AWS DMS task logs. When CloudWatch logs are enabled, AWS DMS enables context logging by default. DMS also offers Time Travel logs to assist with debugging replication tasks. For more information about Time Travel logs, refer to Time Travel task settings; for best practices, refer to Troubleshooting replication tasks with Time Travel.
- Task and table status – AWS DMS provides a near real-time dashboard for reporting the states of the task and tables. For a detailed list of task status, refer to Task status. For table states, refer to Table state during tasks.
- AWS CloudTrail logs – AWS DMS is integrated with AWS CloudTrail, a service that provides a record of actions taken by a user, role, or an AWS service in AWS DMS. CloudTrail captures all API calls for AWS DMS as events, including calls from the AWS DMS console and from code calls to the AWS DMS API operations. For more information on setting up CloudTrail, refer to the AWS CloudTrail User Guide.
- Monitoring dashboard – The enhanced monitoring dashboard provides comprehensive visibility into critical metrics for your replication tasks and replication instances, letting you filter, aggregate, and visualize metrics for the specific resources you want to track. The dashboard directly publishes existing CloudWatch metrics to monitor resource performance without altering data point sampling times. For more information, refer to Overview of the enhanced monitoring dashboard.
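If you poll task status programmatically, a small helper can reduce the API response to the fields an operator cares about. The following sketch assumes the boto3 `describe_replication_tasks` response shape; the sample response and task identifiers below are fabricated for illustration.

```python
# Hypothetical helper for summarizing DMS task status. Field names follow
# the DescribeReplicationTasks response shape; the sample data is invented.

def summarize_tasks(response):
    """Extract (identifier, status, stop reason) tuples per task
    from a DescribeReplicationTasks response dictionary."""
    return [
        (t["ReplicationTaskIdentifier"], t["Status"], t.get("StopReason", ""))
        for t in response.get("ReplicationTasks", [])
    ]

# In practice: response = boto3.client("dms").describe_replication_tasks()
sample = {
    "ReplicationTasks": [
        {"ReplicationTaskIdentifier": "orders-cdc", "Status": "running"},
        {"ReplicationTaskIdentifier": "orders-full-load", "Status": "stopped",
         "StopReason": "Stop Reason FULL_LOAD_ONLY_FINISHED"},
    ]
}
print(summarize_tasks(sample))
```

Feeding a summary like this into your alerting pipeline lets you page on unexpected `stopped` or `failed` states rather than polling the console.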
We recommend setting up CloudWatch alarms on essential metrics and logs events to proactively identify potential issues before they escalate into system-wide disruptions. Although this foundational monitoring approach serves as a starting point, it’s essential to expand your monitoring strategy based on your specific use case requirements and business objectives.
| Metric type | Metric name | Remediation |
| --- | --- | --- |
| Host metrics | CPU utilization, Free memory, SwapUsage, FreeStorageSpace, WriteIOPS, ReadIOPS | Set up alarms on these metrics to alert the operator, because resource contention affects your DMS tasks' performance. Depending on the limitation on the host, upgrade the DMS instance class (for CPU or memory contention) or increase the storage (for low storage or throttled baseline IOPS). For information on choosing the right replication instance, refer to Selecting the best size for a replication instance. |
| Replication task metrics | CDCLatencySource, CDCLatencyTarget | Based on your SLA requirements, set up alarm thresholds for the latency metrics. Latency in DMS can have multiple causes; for troubleshooting and creating SOPs, refer to Troubleshooting latency issues in AWS Database Migration Service. |
| DMS events for the replication instance | Configuration change, Creation, Deletion, Maintenance, LowStorage, Failover, Failure | Each of these is a category with different events associated with it. Set up notifications on specific events based on your requirements. For a detailed list and descriptions of these events, refer to AWS DMS event categories and event messages for SNS notifications. |
| DMS events for replication tasks | Failure, State change, Creation, Deletion | As with the instance-level categories above, subscribe to the specific event categories that matter for your operations. |
For a complete list of available metrics, you can refer to the AWS DMS User Guide for AWS Database Migration Service metrics.
You can use Amazon EventBridge to provide notification when an AWS DMS event occurs or use Amazon Simple Notification Service (Amazon SNS) to create alerts for critical events. For more information on EventBridge events in DMS, refer to the EventBridge events User Guide. For more information on using Amazon SNS with AWS DMS, refer to Amazon SNS events User Guide.
In addition to setting up CloudWatch alarms, you can create custom alerts based on AWS DMS CloudWatch error logs using metric filters. For a detailed, step-by-step guide on implementing these custom alerts, refer to the blog post Send alerts on custom AWS DMS errors from Amazon CloudWatch Logs. This resource provides comprehensive instructions for enhancing your custom error monitoring capabilities.
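As a sketch of that approach, the following builds parameters for the CloudWatch Logs `put_metric_filter` API to count "Bulk apply operation failed" messages (the log event surfaced in the failure testing table). The log group, filter, and custom metric names are illustrative assumptions.

```python
# Sketch of a metric filter that counts "Bulk apply operation failed"
# log events in a DMS task log group. Names below are placeholders.

def bulk_apply_filter_params(log_group_name):
    """Build parameters for logs.put_metric_filter; each matching log
    event increments a custom metric you can then alarm on."""
    return {
        "logGroupName": log_group_name,
        "filterName": "dms-bulk-apply-failures",
        "filterPattern": '"Bulk apply operation failed"',
        "metricTransformations": [{
            "metricName": "BulkApplyFailures",
            "metricNamespace": "Custom/DMS",   # hypothetical namespace
            "metricValue": "1",
        }],
    }

# In practice: boto3.client("logs").put_metric_filter(**params)
params = bulk_apply_filter_params("dms-tasks-orders-cdc")  # placeholder group
print(params["filterName"])
```

Once the filter is in place, a standard CloudWatch alarm on the custom metric turns a silent log message into an operator page.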
Applying AWS Well-Architected Framework principles
The AWS Well-Architected Framework helps you understand the pros and cons of the decisions you make when building systems in the cloud. The six pillars of the framework teach you architectural best practices for designing and operating reliable, secure, efficient, cost-effective, and sustainable systems.
Using the AWS Well-Architected Tool, available at no charge in the AWS Management Console, you can review your workloads against these best practices by answering a set of questions for each pillar.
For more expert guidance and best practices for your cloud architecture—reference architecture deployments, diagrams, and whitepapers—refer to the AWS Architecture Center.
Conclusion
In this post, we presented a comprehensive framework for building resilient AWS DMS implementations. The effectiveness of this guidance directly correlates with the depth of your implementation and how well you adapt it to your specific environment. We strongly encourage you to thoroughly review each section and use it as a foundation for developing a customized migration strategy that aligns with your unique use case.
By carefully evaluating and incorporating these recommendations into your migration planning process, you can develop a comprehensive and reliable approach for using AWS DMS, facilitating long-term success in your data movement strategies.
For additional support and resources, visit the AWS DMS documentation and engage with AWS Support.
About the Authors
Sanyam Jain is a Database Engineer with the AWS Database Migration Service team. He collaborates closely with customers, offering technical guidance for migrating on-premises workloads to the AWS Cloud. Additionally, he plays a pivotal role in enhancing the quality and functionality of AWS data migration products.
Sushant Deshmukh is a Senior Partner Solutions Architect working with Global System Integrators. He is passionate about designing highly available, scalable, resilient, and secure architectures on AWS. He helps AWS Customers and Partners successfully migrate and modernize their applications to the AWS Cloud. Outside of work, he enjoys traveling to explore new places and cuisines, playing volleyball, and spending quality time with family and friends.
Alex Anto is a Data Migration Specialist Solutions Architect with the Amazon Database Migration Accelerator team at Amazon Web Services. He works as an Amazon DMA Advisor to help AWS customers migrate their on-premises data to AWS Cloud database solutions.