AWS Storage Blog

Direct Supply bolsters availability by migrating to Amazon FSx for Windows File Server

Direct Supply, the leading provider of products and services to the Long-Term Care industry, migrated the bulk of our IT systems to AWS in early 2019. In the run-up to our cutover, we had five Server Message Block (SMB) file systems that needed to live alongside the applications they support. This meant that roughly 25 terabytes of file data needed a home in the cloud.

At the time, the options for a cloud-based, Windows-friendly, highly available file system that met our security and compliance requirements were limited. Although Amazon FSx for Windows File Server was available, we needed multi-Availability Zone (multi-AZ) capabilities as well as access audit logging before we could use it to house production data. Our only viable option then was to select a Network Attached Storage (NAS) product from the AWS Marketplace, and we made the best of the tools available. In the two years that followed, Amazon FSx for Windows File Server matured and AWS DataSync launched support for SMB data. Meanwhile, our file footprint grew to 40 TB.

In mid-2022, we made the decision to go all-in on FSx for Windows File Server and leave our old NAS vendor behind. In this post, I walk you through the reasons we found Amazon FSx for Windows File Server to be the best file storage platform for our file data, how we used DataSync to migrate our data, and some lessons we learned along the way.

Making the case for FSx for Windows File Server

As the dust settled from our initial cloud migration (five file systems to migrate: three production, one QA, and one dev environment – about 40 TB in total), our focus shifted to optimizing our cloud environment. We were investing in automation, standardization, and projects that would reduce administrative overhead and cognitive load. We had also built out a financial operations practice to help us govern and manage our cloud spend. In analyzing our cloud file system usage, it became clear that we had work to do to improve efficiency and resilience, reduce operational toil, and lower cost. The business case for switching to FSx for Windows File Server practically wrote itself.

Unlike our previous self-managed storage solution, FSx for Windows File Server is a fully managed service. This meant that we would no longer have to worry about building and managing our own RAID-protected storage pools or scheduling downtime to deploy firmware updates. Using Infrastructure as Code (IaC), we built standard templates for deploying new FSx for Windows file systems, making the patterns available for new use cases without requiring deep storage knowledge.
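To give a sense of what such a template can look like, here is a minimal sketch of a multi-AZ file system definition in Terraform; the subnet IDs, directory ID, capacity, and names are illustrative placeholders, not our production values:

locals {
  subnet_ids   = ["subnet-aaaa1111", "subnet-bbbb2222"] # placeholder subnets in two AZs
  directory_id = "d-1234567890"                         # placeholder AWS Managed Microsoft AD
}

# Sketch of a standard FSx for Windows File Server template.
# All values shown are illustrative, not our production configuration.
resource "aws_fsx_windows_file_system" "example" {
  deployment_type     = "MULTI_AZ_1"      # automatic failover across AZs
  storage_type        = "SSD"             # "HDD" is the lower-cost option
  storage_capacity    = 1024              # GiB
  throughput_capacity = 128               # MB/s
  subnet_ids          = local.subnet_ids
  preferred_subnet_id = local.subnet_ids[0]
  active_directory_id = local.directory_id

  automatic_backup_retention_days = 7

  tags = {
    Name = "fsx-example"
  }
}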

FSx for Windows File Server has a multi-AZ capability that automates failover and failback of our service when disruptions occur. With our marketplace solution, we had a manual failover process that required human intervention. Outages were measured in minutes rather than seconds, and we knew this had to be improved.

Shifting to a fully managed file service meant we no longer had to assume the financial burden of managing our own RAID protection and redundancy, which translated to a 40% reduction in provisioned storage. After some in-depth performance testing, we were confident that the Amazon FSx for Windows hard disk drive (HDD) storage type would exceed our file system performance needs while lowering costs. The move to FSx for Windows File Server could deliver cost savings greater than 90%, as shown in Figure 1.

Figure 1: Monthly storage cost

As part of our simplification efforts, we wanted to standardize our data protection process using AWS Backup. The marketplace solution had a proprietary backup and restore mechanism that required separate tooling and training. The shift to FSx for Windows pulled all of our backup activity into a single tool.
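For illustration, a consolidated backup policy in Terraform might look like the following sketch; the vault name, schedule, retention period, tag key, and IAM role ARN are assumptions for the example, not our actual configuration:

# Sketch: one AWS Backup plan protecting any FSx file system tagged Backup=true.
resource "aws_backup_vault" "files" {
  name = "file-backups" # placeholder vault name
}

resource "aws_backup_plan" "files" {
  name = "fsx-daily"

  rule {
    rule_name         = "daily"
    target_vault_name = aws_backup_vault.files.name
    schedule          = "cron(0 7 * * ? *)" # daily at 07:00 UTC

    lifecycle {
      delete_after = 35 # days to retain each recovery point
    }
  }
}

resource "aws_backup_selection" "fsx" {
  name         = "fsx-file-systems"
  plan_id      = aws_backup_plan.files.id
  iam_role_arn = "arn:aws:iam::123456789012:role/backup-service-role" # placeholder; needs AWS Backup service permissions

  selection_tag {
    type  = "STRINGEQUALS"
    key   = "Backup"
    value = "true"
  }
}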

Using DataSync to migrate file data to FSx for Windows File Server

In the past, I’ve relied on command line copy tools to migrate data from one file server to another. These copy jobs can be automated with some scripting and task scheduling, but error handling and logging can be difficult. Given my preference for managed services, DataSync’s event logging to Amazon CloudWatch, and the ability to manage the entire configuration through IaC, DataSync was the right choice for moving data into Amazon FSx for Windows.

To set up the DataSync copy process, I needed to:

  1. Create DNS records for the source and destination file systems.
  2. Deploy a DataSync agent – the compute resource that the copy job uses.
  3. Define a source location and a destination location based on the DNS entries I created.
  4. Create a DataSync task based on source and destination locations.
  5. Create a fail-back task in case we had to roll back a migration to Amazon FSx for Windows.

I built a series of reusable Terraform modules to make agent and task creation a straightforward process.

The following script calls a Terraform module we built that stands up an AWS DataSync agent running on Amazon EC2:

module "datasync_agent_1" {
  source           = "../../../modules/agent"
  application      = “aws_datasync_agent_1" #for naming
  environment      = local.environment      #for naming 
  tags             = module.base_tags.tags               
}
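For context, the interior of an agent module like this might wrap the aws_datasync_agent resource roughly as follows. This is a sketch of one common pattern, assuming the agent appliance is already running on an EC2 instance launched from the DataSync agent AMI; the variable names and IP address are illustrative:

# Sketch of an agent module interior. Terraform retrieves the activation
# key over port 80 from the agent instance's IP address.
variable "application" { type = string }
variable "environment" { type = string }
variable "tags"        { type = map(string) }

resource "aws_datasync_agent" "this" {
  ip_address = "10.0.0.25" # placeholder: private IP of the agent EC2 instance
  name       = "${var.application}-${var.environment}"
  tags       = var.tags
}

output "agent_arn" {
  value = aws_datasync_agent.this.arn
}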


The next block calls a Terraform module that performs three functions:
1. Defines the source and destination locations for the data.
2. Creates an AWS DataSync task, running on the agent created above, that copies from source to destination.
3. Creates an AWS DataSync task, running on the agent created above, that copies from destination to source in case a rollback is needed.



module "datasync_task" {
  source             = "../../../modules/task"
  filesystem         = local.filesystem_name
  datasync_agent_arn = [module.agent_user_shared.agent_arn]
  source_path        = "/c$/user/data/"
  dest_path          = "/d$/user/data/"
  filesystem_arn     = module.filesystem.filesystem_arn
  schedule           = "cron(5 */4 * * ? *)". #from crontab.guru
  task_name          = "user_data"
  environment        = local.environment
  bytes_per_second   = "500000000"
  verify_mode        = "NONE"
  tags 			 = merge(module.base_tags.tags, {
    Name 			 = "${local.application}-user-data-to-fsx"
  }
 )
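Inside the task module, the source and destination locations and the forward copy task are built from resources along these lines (the rollback task mirrors the forward task with the locations swapped). This is a simplified sketch: inputs such as the SMB credentials and security group ARNs are assumptions, and our actual module handles more detail:

# Sketch of the task module interior. Hostnames and credentials are
# passed in as variables; declarations are kept minimal for brevity.
variable "filesystem_arn"      { type = string }
variable "datasync_agent_arn"  { type = list(string) }
variable "security_group_arns" { type = list(string) }
variable "source_path"         { type = string }
variable "dest_path"           { type = string }
variable "smb_user"            { type = string }
variable "smb_password"        { type = string }
variable "schedule"            { type = string }
variable "bytes_per_second"    { type = number }
variable "verify_mode"         { type = string }
variable "task_name"           { type = string }
variable "tags"                { type = map(string) }

resource "aws_datasync_location_smb" "source" {
  server_hostname = "files-src.example.com" # placeholder: DNS record for the source
  subdirectory    = var.source_path
  user            = var.smb_user
  password        = var.smb_password
  agent_arns      = var.datasync_agent_arn
}

resource "aws_datasync_location_fsx_windows_file_system" "destination" {
  fsx_filesystem_arn  = var.filesystem_arn
  subdirectory        = var.dest_path
  user                = var.smb_user
  password            = var.smb_password
  security_group_arns = var.security_group_arns
}

resource "aws_datasync_task" "forward" {
  name                     = "${var.task_name}-to-fsx"
  source_location_arn      = aws_datasync_location_smb.source.arn
  destination_location_arn = aws_datasync_location_fsx_windows_file_system.destination.arn

  schedule {
    schedule_expression = var.schedule
  }

  options {
    bytes_per_second = var.bytes_per_second # throttle to protect the source
    verify_mode      = var.verify_mode
  }

  tags = var.tags
}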

Of the five file systems I migrated to FSx for Windows File Server, four were under five terabytes each, with an average file size greater than 1 MB. As such, I was able to accomplish those migrations with just a few DataSync agents and tasks. However, the fifth file system was close to 20 TB in size and contained more than a billion kilobyte-sized files. It required dozens of agents, bundled together in groups of three or four, to run 150 DataSync tasks.

It’s difficult to give specifics on how long the initial data sync took to complete, because throughout the process we were tweaking the file system to increase throughput and optimizing DataSync task settings so that more tasks could run in parallel. I can say this initial sync was measured in days, not hours or minutes.

I had a very limited window in which to have the largest file system offline. Therefore, the final cut-over was choreographed to synchronize the most critical data during a two-hour outage. The less critical data synchronized over the next 14 hours, after we were “live” on FSx for Windows.
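One way to choreograph that kind of split is with DataSync task filters, so a small “critical” task can finish inside the outage window while a second task sweeps up everything else afterward. The following sketch assumes hypothetical folder names and location variables, not our actual layout:

# Sketch: split a cut-over into a fast "critical" task and a slower
# "everything else" task using DataSync filters. Values are placeholders.
variable "source_arn"      { type = string }
variable "destination_arn" { type = string }

resource "aws_datasync_task" "critical" {
  name                     = "cutover-critical"
  source_location_arn      = var.source_arn
  destination_location_arn = var.destination_arn

  includes {
    filter_type = "SIMPLE_PATTERN"
    value       = "/critical" # hypothetical folder holding the must-have data
  }
}

resource "aws_datasync_task" "remainder" {
  name                     = "cutover-remainder"
  source_location_arn      = var.source_arn
  destination_location_arn = var.destination_arn

  excludes {
    filter_type = "SIMPLE_PATTERN"
    value       = "/critical"
  }
}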

The final step of our migration, which we expected to happen during a separate change window, was to convert the file system from solid-state drive (SSD) to HDD storage. This step is what would boost our savings from 30% to greater than 90%. As of this writing, we have not yet made this change.

Lessons learned from the migration to FSx for Windows File Server

Data copy jobs, especially when they aren’t throttled, can degrade the performance of the source file system and the network. In my case, end users would experience latency on the source file system if I was running more than a few DataSync tasks at the same time. Throttling my DataSync tasks reduced the latency that end users experienced.

On the topic of performance, I’d advise you to set your Amazon FSx for Windows throughput capacity as high as you can for the final steps of your migration, especially if you need to contend with a short cut-over window. I know that my file system needs less than 128 MB/s of throughput to support normal file operations, yet I bumped it up to 2,048 MB/s so that DataSync tasks could write as fast as possible. FSx for Windows File Server can now support significantly higher throughput, which could help accelerate migration times.

Deduplication and compression need to factor into your copy plans. If you are migrating compressed or deduplicated data, understand that the data will be fully rehydrated before being copied to FSx for Windows. Note that deduplication and compression happen as a scheduled process on FSx for Windows, meaning that data is stored fully hydrated at first and compressed after the fact. If you sized your FSx for Windows file system based on the size of compressed data on the source file system, expect to pause your data copies and trigger deduplication jobs on FSx for Windows, to avoid having to over-provision capacity for the fully hydrated data. This process is documented in the FSx for Windows File Server documentation.

File system complexity affects copy performance and should factor into how DataSync tasks and agents are organized. When I first started my largest migration, I expected to have multiple DataSync tasks run on a single agent. In some cases, this worked out well for me. In other cases, especially where there were millions of tiny files, I ended up deploying multiple DataSync agents for a single DataSync task. In the end, the ratio of tasks to agents, or agents to tasks, was tied to both file count and copy time (for more information, see the blog How to accelerate your data transfers with AWS DataSync scale out architectures). Because the file system had to be offline for my users during the final cut-over, having tasks that completed quickly was critical.

One final lesson is tied to how Windows access control lists (ACLs) are migrated to FSx for Windows through DataSync. My largest file system had been moved seven times prior to moving to FSx for Windows, so there were several instances where the built-in administrators group had been used to grant system administrators access to the data. DataSync copies these ACLs along with the data, but they remain mapped to the local built-in administrators group of the host serving your file system; they won’t automatically update to the administrators group defined when you launched FSx for Windows File Server. To prevent challenges managing data post-migration, add your FSx administrators group to your file system ACLs before you start the DataSync tasks.

Conclusion

I found that FSx for Windows File Server reduced complexity and cost (from $37K to $24K) when replacing my legacy marketplace-based file server solution, as shown in Figure 2.

Figure 2: Actual savings

As a managed service, it has eliminated the toil of patching and updating the file system, and it provides quick access to my users and applications. DataSync made the migration process much simpler than hand-crafting scripts for command line copy tools, and deploying the entire environment through IaC has allowed me to adapt the DataSync task and agent design to optimize my data copies. Even though we have more work to do to get our data onto HDD storage, we’ve managed to achieve a significant reduction in the cost to store and manage our file data.

I’d love to hear from you – have you used DataSync with FSx for Windows File Server? How was your experience? What did you find most challenging? What lessons did you learn from the process? Feel free to reach out to me on my social platforms linked after this post to continue the conversation!

Note – The content and opinions in this post are those of the third-party author and AWS is not responsible for the content or accuracy of this post.

Dave Stauffacher

Dave Stauffacher is a Chief Platform Engineer at Direct Supply. With a background in data storage and protection, Dave has helped Direct Supply navigate 30,000% data growth over the last 15 years. In his current role, Dave is focused on helping drive Direct Supply’s cloud migration, combining his storage background with cloud automation and standardization practices. Dave has showcased his cloud experience in presentations at AWS Midwest Community Day, AWS re:Invent, HashiConf, the Milwaukee Big Data User Group, and other industry events. Dave is a certified AWS Solutions Architect Associate.