AWS Application Migration Service best practices

Introduction

Large-scale cloud migrations present challenges such as multiple tasks, scaling complexities, manual processes, numerous tools, and stakeholder involvement that can be difficult to coordinate. AWS Application Migration Service (AWS MGN) is designed to overcome these challenges for large and complex migrations that require re-hosting (also referred to as “lift and shift” migration). AWS MGN is a highly automated lift-and-shift (re-host) solution that simplifies, expedites, and reduces the cost of migrating applications to AWS. It enables companies to lift-and-shift a large number of physical, virtual, or cloud servers without compatibility issues, performance disruption, or long cutover windows.

To realise the full potential of a lift-and-shift migration with AWS MGN, you have to consider factors such as bandwidth availability and consistency, source environment constraints, routine operations, the total number of servers migrated per wave, and MGN quota limits.

This blog shares best practices for accelerating and successfully implementing your migration using AWS MGN, a highly automated lift-and-shift solution. It also addresses key concerns that arise during the migration.

Key considerations:

1. AWS MGN Initial Settings:

A detailed guide to the prerequisites and initial setup can be found here.

2. Replication planning:

Replication settings determine how data is replicated from your source servers to AWS. They are governed by the replication template, which you must configure before adding your source servers to Application Migration Service.

Best practices:

Choose the right replication server instance type: Do not change the default replication server instance type unless there is a business need to do so. You can change the replication server instance type for servers that are replicating too slowly, are constantly busy, or experience frequent spikes. Consider m5a.xlarge or higher to meet this kind of requirement. Use AMD instances to keep costs low.
Use a dedicated instance for the replication server: When the source server is very write-intensive, the replication of data from its disks to a shared replication server can interfere with the data replication of other servers. In these cases, it would be best to choose the “Use dedicated Replication Server” option.

Note: Using a dedicated replication server may increase the EC2 cost you incur during replication.

Fig.1 Replication Server Settings

Fig.2 Replication Server Settings

Pausing a migration: To pause data replication, stop the flow of data between the source server and MGN by closing TCP port 1500 outbound on your firewall or by closing TCP port 1500 inbound in the replication server security group.

Fig.3 Replication Server Settings

3. Launch Template creation:

The launch template lets you control how MGN launches instances in AWS. It is important to define the launch template according to your expected infrastructure requirements in the cloud, based on the technical considerations of your architecture. The launch template consists of these details:

Instance type right-sizing
Start instance upon launch
Copy private IP
Transfer server tags
Operating system licensing

Best practices:

Right-sizing: The MGN right-sizing feature determines the best-match instance type for each of your source servers. A best practice is to use this feature and allow MGN to select the instance type for each server instead of manually selecting it server by server.
Create an account-level launch template as per your requirements. It applies to your whole account. If a specific server needs different settings, use the Launch Settings for that server.
Set up your three MGN templates before you start installing agents and migrating: the Replication Template, the Launch Template, and the Post-Launch Template. After your templates are set up, install your agents and begin the migration. Each server inherits its default settings from the templates, so the order is important. Template changes are applied to newly installed agents, not to your existing server inventory: if you install your agents first and then update the templates, you will need to go back and update each server, one by one. In short, configure the templates first, then install the agents.

4. Bandwidth availability and planning:

If the migration runs over shared internet bandwidth using a VPN, the available network bandwidth will vary at different times of the day as production workloads are prioritised. Furthermore, replication over the shared link can saturate the available bandwidth. This can push AWS MGN into a rescan state (which leads to rescanning of source servers and servers going into a stalled, unhealthy, or lag state) from which recovery is difficult, thereby risking the entire wave cutover plan.

Best practices:

Provision dedicated constant bandwidth: To determine a baseline replication speed, you should perform a control test between your target AWS region and the nearest region to your source workloads. For example, if your source workloads are in a data centre in Rome and your target region is Paris, run a test between eu-south-1 (Milan) and eu-west-3 (Paris). This will give a theoretical upper bandwidth limit, as replication will occur over the AWS backbone. If the target region is already the closest Region to your source workloads, run the test from within the same region.
Factors to consider while sizing bandwidth: Calculate the bandwidth by working backwards from the cutover/go-live date, taking into account:
- Total data that has to be transferred
- Disk write speed at source
- Source disk I/O
- Environment (production / UAT / test, etc.)

The required bandwidth for transferring the replicated data over TCP Port 1500 is based on the write speed of the participating source machines. You need to know the write speed of your source machines. Recommended minimum bandwidth is the sum of the average write speed of all replicated source machines. For example, let’s say you are replicating two source machines. One has an average write speed of 5 MegaBytes per second (i.e. it writes 5 megabytes of data every second), while the other has a write speed of 7 MegaBytes per second (MBps). In this case, the recommended minimum bandwidth is 12 MBps.

Note: Disk write speed is represented in MBps while bandwidth connectivity is represented in MegaBits per second (Mbps). 100 Mbps internet speed = 12.5 MBps (conversion – 1 Mbps = 0.125 MBps)
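The arithmetic above can be sketched in a few lines of Python. This is only an illustration of the rule stated here (sum the average write speeds, then convert megabytes per second to megabits per second); the function name is hypothetical and the input values are the two example machines from the text.

```python
def min_bandwidth_mbps(write_speeds_mbytes_per_s):
    """Return the recommended minimum bandwidth in megabits per second (Mbps).

    write_speeds_mbytes_per_s: average write speed of each replicated
    source machine, in megabytes per second (MBps).
    """
    total_mbytes_per_s = sum(write_speeds_mbytes_per_s)
    return total_mbytes_per_s * 8  # 1 MBps = 8 Mbps


speeds = [5, 7]  # the two example source machines, in MBps
print(sum(speeds), "MBps")                 # 12 MBps
print(min_bandwidth_mbps(speeds), "Mbps")  # 96 Mbps
```

In other words, the two example machines need a link that can sustain roughly 96 Mbps for replication alone, before any headroom for other traffic.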

Tools to find the write speed of your source machines:
Linux:

Use the iostat command-line utility, located in the sysstat package. The iostat utility monitors system input/output device loading and generates statistical reports. You can install the iostat utility with yum (RHEL/CentOS), apt-get (Ubuntu), or zypper (SUSE).
To use iostat for checking the write speed of a source machine, enter the following: iostat -x <interval>
- -x - displays extended statistics.
- <interval> - the number of seconds iostat waits between each report. Each subsequent report covers the time since the previous report.
- For example, to check the write speed of a machine every 3 seconds, enter the following command: iostat -x 3
- We recommend that you run the iostat utility for at least 24 hours, since the write speed to the disk changes during the day, and it will take 24 hours of runtime to identify the average running speed.
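Once iostat has been running for a while, its saved output can be averaged with a short script. The following is a minimal sketch, assuming the output was redirected to a text file and that the extended report contains a write-throughput column named wkB/s (the column name differs between sysstat versions, so treat it as an assumption and adjust the `column` argument as needed). The function name is hypothetical.

```python
def average_write_mbytes_per_s(iostat_text, column="wkB/s"):
    """Average the total write throughput across all iostat reports, in MB/s.

    Each report's device rows are summed, then the per-report totals are
    averaged. The column name is an assumption; check your iostat header.
    """
    totals = []        # one summed value per iostat report
    col_index = None   # position of the write column in the current table
    for line in iostat_text.splitlines():
        fields = line.split()
        if not fields:
            col_index = None          # a blank line ends a device table
            continue
        if fields[0].startswith("Device"):
            col_index = fields.index(column)
            totals.append(0.0)
        elif col_index is not None:
            totals[-1] += float(fields[col_index])
    if not totals:
        raise ValueError("no device tables found in iostat output")
    return sum(totals) / len(totals) / 1024  # kB/s -> MB/s
```

Note that the first report iostat prints covers the time since boot, so over a 24-hour capture it has little effect on the average, but for short captures you may want to discard it.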

Windows:

Install and use the DiskMon application. DiskMon logs and displays all hard disk activity on a Windows system. For installation instructions, refer to Installing DiskMon.
DiskMon presents read and write offsets in terms of sectors (512 bytes). Events can be either timed for their duration (in microseconds) or stamped with the absolute time at which they were initiated.

5. Source server utilisation impact:

Utilisation of more than 90% of the source server’s CPU and RAM can limit replication speed. In addition, background processes such as antivirus scans and backups tend to slow down the source server, which causes replication tasks to be deprioritised and slows replication.

Best practices:

Stop background processes: Keep an eye on resource utilisation while antivirus scans and backups are running. Avoid scheduling scan windows and backups during the test and final cutover.

6. Source server disk speed impact:

AWS MGN doesn’t write any cache or do any journaling to disk. The agent holds an in-memory buffer (~250 MB) that is large enough to map all of the volume’s blocks. The agent then acts like a write filter and replicates changed blocks directly from memory to the replication server. In cases where the data is no longer in memory, the agent reads the block directly from the volume. This is when you may see a backlog in the MGN console; the cause is that the change volume is greater than the available bandwidth.

Best practices:

If you face replication speed issues, the first place to look is the network bandwidth. From a source machine within your internal network, run a speed test to calculate your bandwidth out to the internet; common test providers include Cloudflare, Ookla, and Google. This is your bandwidth to the internet, not to AWS.
Next, to confirm the data flow from within your data centre, run tracert (Windows) or traceroute (Linux). Identify any unusual network hops or points that may be throttling bandwidth (due to hardware limitations or configuration). To measure the maximum bandwidth between your data centre and the AWS subnet used for data replication, while accounting for Secure Sockets Layer (SSL) encapsulation, use the CloudEndure SSL bandwidth tool. MGN uses LZ4 compression during transit, resulting in 60-70% compression depending on the type of data. Use a data transfer / bandwidth calculator, like the one here.
The output of this is an ideal scenario. You must also consider network overheads, disk busy time at the source, and CPU utilisation at the source, and add adequate time to arrive at a realistic replication time.
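As a rough illustration of what such a calculator does, the sketch below estimates the initial replication time from the data size and the measured bandwidth. Everything here is an assumption for illustration: the function and parameter names are hypothetical, `compression_savings` treats the compression figure quoted above as the fraction of bytes saved on the wire (verify this against your own data), and `efficiency` is a placeholder for the network and source-side overheads mentioned above.

```python
def estimate_replication_hours(data_gb, bandwidth_mbps,
                               compression_savings=0.0, efficiency=1.0):
    """Estimate hours needed to transfer data_gb over a bandwidth_mbps link.

    compression_savings: fraction of bytes removed by compression (0 to <1).
    efficiency: usable share of the link after overheads (>0 to 1).
    Uses decimal units (1 GB = 1000 MB, 1 MB = 8 megabits).
    """
    if not 0 <= compression_savings < 1:
        raise ValueError("compression_savings must be in [0, 1)")
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    gb_on_wire = data_gb * (1 - compression_savings)
    megabits = gb_on_wire * 1000 * 8
    seconds = megabits / (bandwidth_mbps * efficiency)
    return seconds / 3600


# 1 TB over a 100 Mbps link with no compression and no overhead
print(round(estimate_replication_hours(1000, 100), 1))  # 22.2
```

Lowering `efficiency` (for example to 0.5 for a busy shared link) doubles the estimate, which is a simple way to add the safety margin the ideal figure lacks.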

7. Source server I/O impact:

The underlying storage of servers can be a point of contention. If the storage is maxing out its read speeds, it will impact the data replication rate. If storage I/O is utilised beyond its threshold, it can impact block replication by AWS MGN. The threshold will vary depending on the type of the storage at the source (HDD, SSD, etc).

Best practices:

Ensure the source server disk is not busy or under constant I/O so that replication can proceed smoothly.

8. Source server rate of change of data impact:

If writes make up a high share of the read-write ratio, they tend to create a bottleneck for the overall replication.

Best practices:

Choose throttle bandwidth: You can control the amount of network bandwidth used for data replication per server. By default, MGN will use all available network bandwidth utilising five concurrent connections. Choose Throttle bandwidth if you want to control the transfer rate of data sent from your source servers to the replication servers over TCP port 1500. Otherwise, choose ‘Do not throttle bandwidth’.

Fig.4 Replication Server Settings

Fig.5 Replication Server Settings

9. Migration methodology for databases:

Databases are often write-intensive, and without the necessary network bandwidth between the source and staging area, MGN might not keep up with a high rate of write transactions. In addition, databases sometimes have scheduled backup jobs running locally, which means that they periodically dump a large amount of data onto one of the replicated disks, and that data represents all of the changes that took place in the database since the previous backup. If AWS MGN is configured to replicate the backup disk, this generates an additional write load on the agent.

Best practices:

Use methods other than MGN for databases: MGN should not be considered for highly read/write-intensive workloads such as databases; native replication tools can be used instead. Step-by-step walkthroughs are available on how you can use AWS Database Migration Service (AWS DMS) to migrate your data to and from the most widely used commercial and open-source databases.
Do not replicate backups: If you are using MGN to replicate databases, do not replicate the database instance that stores backups. Backups tend to generate large amounts of writes, which slows down replication.

10. Monitoring:

Visibility into your migration is important when planning a large migration. Monitoring is also an important part of maintaining the reliability, availability, and performance of Application Migration Service, so that you can report when something is wrong and take automatic actions using tools such as Amazon CloudWatch and AWS CloudTrail.

Best practices:

Configure notifications for AWS MGN events using Amazon EventBridge.
Sample MGN events:

- MGN Source server launch result
- MGN Source server lifecycle state change
- MGN Source server data replication stalled change

Consider the data replication stalled change event: configure notifications whenever an AWS MGN source server’s data replication state changes to STALLED. Detailed steps are in Registering Event Rules.

- Step 1: Create an Amazon EventBridge rule with the following attributes. See the steps for Creating a rule that reacts to events.

Fig.6 Sample Rule in EventBridge

- Step 2: Create an Amazon SNS topic and subscribe to it by email; detailed steps here.
- Step 3: Add the SNS topic as a target in the Amazon EventBridge rule “MGN_dataStalledState”.

Fig.7 Add target as SNS topic

- Step 4: Use the SNS topic created in step 2. Now every time a replicating VM goes into a stalled state, a notification will be sent. If you also want to maintain a complete log of activities, you can capture logs using a CloudWatch log group and create metric filters. You can also create a CloudWatch alarm that will be triggered.

Fig.8 Add target as CloudWatch log group
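For readers who prefer to script steps 1 and 3 instead of using the console, the following is a minimal sketch using boto3. The rule name comes from the steps above; the exact detail-type string and the `state` field in the event detail are assumptions based on the event names listed earlier, so confirm them against a sample event in your account. The `matches` helper is a simplified local check, not the full EventBridge matching logic, and the target Id is an arbitrary label.

```python
import json

RULE_NAME = "MGN_dataStalledState"  # rule name used in the steps above

# Assumed event pattern: verify detail-type and state against a real event.
STALLED_PATTERN = {
    "source": ["aws.mgn"],
    "detail-type": ["MGN Source Server Data Replication Stalled Change"],
    "detail": {"state": ["STALLED"]},
}


def matches(pattern, event):
    """Simplified local matcher: each pattern value lists accepted values."""
    for key, expected in pattern.items():
        if isinstance(expected, dict):
            if not matches(expected, event.get(key, {})):
                return False
        elif event.get(key) not in expected:
            return False
    return True


def create_rule(topic_arn):
    """Create the rule and attach an existing SNS topic as its target."""
    import boto3  # imported lazily so the pattern can be tested offline
    events = boto3.client("events")
    events.put_rule(Name=RULE_NAME,
                    EventPattern=json.dumps(STALLED_PATTERN),
                    State="ENABLED")
    events.put_targets(Rule=RULE_NAME,
                       Targets=[{"Id": "sns-email", "Arn": topic_arn}])
```

Calling `create_rule` requires AWS credentials and an SNS topic whose access policy allows EventBridge to publish to it; the local `matches` helper is only there to sanity-check the pattern before deploying it.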

11. AWS MGN Security:

Cloud security at AWS is the highest priority. As an AWS customer, you benefit from a data centre and network architecture built to meet the requirements of the most security-sensitive organisations. The customer is responsible for making sure that no misconfigurations are present during and after the migration process, including:

Access to replication servers should be allowed only from the source servers’ CIDR range, by applying proper security group rules on the replication servers.
After the migration, expose only allowed ports to the public internet.
Hardening of OS packages and other software deployed on the servers is entirely the customer’s responsibility, and we recommend the following:
- Packages should be up to date and free of known vulnerabilities.
- Only necessary OS/application services should be up and running.
Enable anti-DDoS protection (AWS Shield) in the customer’s AWS account to reduce the risk of denial-of-service attacks on the replication servers and the migrated servers.

Best practices:

Use the MGN security group: The best practice is to have MGN automatically attach and monitor the default MGN Security Group. This group opens inbound TCP port 1500 for receiving the transferred replicated data. When the default MGN Security Group is enabled, MGN constantly monitors the security group’s rules in order to maintain uninterrupted data replication, and automatically fixes the issue if these rules are altered.
Select the Always use Application Migration Service security group option to ensure that data can flow from your source servers to the replication servers, and that the replication servers can communicate their state to the MGN servers. Otherwise, select the Do not use Application Migration Service security group option; selecting this option is not recommended.
- Application Migration Service > Source server > Edit replication setting.

Fig.9 Security Group selection in Replication Setting

12. Application management and wave planning:

AWS MGN allows the user to represent a group of servers by associating them with an Application. Users can manage their migration by grouping source servers and applications into waves. To determine the wave size and the total time required for replication to complete, consider the total number of servers and the storage in one wave, and the number of waves to be replicated at once.

Best practices:

Start small and grow over time: Plan your migration and break down your application migration plan into waves. The first wave should be relatively small: a few low-priority applications. Use the first wave to learn how to manage the migration and to ramp up your team. In future waves, increase the number of applications you migrate as you grow in confidence. We recommend leaving critical applications to later waves, after you’ve given your team time to ramp up on the migration process. In later waves, you can also increase the size of the wave, i.e. the number of applications you plan to migrate together.
Work backwards from the planned cutover date, available bandwidth, and service or application owner availability.
Consider the source server environment and the criticality of servers (prod / UAT / test), as production servers have additional considerations such as a more defined cutover and rollback window, application owner availability to validate the cutover, etc.
Avoid migrating two or more sets of applications from one application owner running critical/production servers in the same cutover window, as this can challenge the application owner’s ability to complete their tasks within the assigned time.
Monitor the migration status and progress of an application and its associated servers. Establish multiple checkpoints to monitor replication as part of governance to avoid or reduce last-minute slippages.
Complete the test cutover 2 weeks prior to the actual cutover date. This time frame enables you to identify potential problems, resolve any issues, and ensure the cutover will be successful before the actual migration takes place.
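Working backwards from the cutover date can be expressed as simple date arithmetic. The sketch below is illustrative only: the function name and the example dates are hypothetical, and it assumes that initial replication must finish before the test cutover, with the replication duration supplied from your own bandwidth estimate.

```python
from datetime import date, timedelta


def plan_wave(cutover, replication_days, test_lead=timedelta(weeks=2)):
    """Work backwards from the cutover date to key milestones for one wave.

    cutover: planned cutover date.
    replication_days: estimated days for initial replication (your estimate).
    test_lead: how far ahead of cutover the test cutover should complete.
    """
    test_cutover = cutover - test_lead
    replication_start = test_cutover - timedelta(days=replication_days)
    return {
        "replication_start": replication_start,
        "test_cutover": test_cutover,
        "cutover": cutover,
    }


plan = plan_wave(date(2025, 6, 30), replication_days=5)
print(plan["replication_start"])  # 2025-06-11
print(plan["test_cutover"])       # 2025-06-16
```

If the computed replication start date is already in the past, the wave is too large for the available bandwidth or the cutover date needs to move.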

13. AWS MGN service quota limits:

Service quota limits can impact migration capacity and velocity. For example, MGN can only replicate a specific number of servers at a time. Hence, it is imperative to understand the limits, their implications, and the actions you can take to overcome them. To learn more about MGN service quota limits, click here.

14. Governance:

Even with all technical considerations addressed, delays and failures can still result from a lack of well-defined procedures, of alignment among team members on roles and responsibilities, of proactive monitoring of end-to-end processes from agent installation to cutover, and of the required engagement of service owners to approve and validate the cutover.

Best practices:

It is critical to establish the necessary governance and to have a team or individual accountable for its effectiveness.
Train a field technical team and assign an Application Migration Service SME.
Have a clear project timeline.
Coordinate cutover windows clearly with all the teams involved.

Conclusion

This post explored several key factors and best practices for migrating with AWS MGN. It covered considerations for the major concerns around AWS MGN and ways to mitigate blockers with the help of best practices, from agent installation and the impact of source server activity on replication speed to networking, security, monitoring, and wave planning. We encourage you to consider this guidance for accelerating and successfully implementing your migration.

AWS Cloud Operations & Migrations Blog
