Addressing I/O latency when restoring Amazon EBS volumes from EBS Snapshots
From a storage volume perspective, latency is the time elapsed between sending an I/O request to a volume and receiving an acknowledgement from the volume that the I/O read or write is complete. Latency is a key measurement for applications that are sensitive to the round-trip time (RTT) of I/O operations, such as transaction-intensive workloads and databases.
In AWS, an Amazon Elastic Block Store (EBS) volume is a durable, block-level storage device that you can attach to your instances. It is designed to deliver single-digit millisecond latency.
You can back up the data on your Amazon EBS volumes to Amazon Simple Storage Service (S3) by taking point-in-time EBS snapshots. Each snapshot contains all of the information that is needed to restore your data (from the moment when the snapshot was taken) to a new EBS volume.
When you create an EBS volume from an EBS snapshot, data from the snapshot is lazy loaded into the volume. If an application accesses a part of the volume whose data has not been loaded yet, it encounters higher latency than normal while that data is loaded. This higher latency could lead to a poor user experience for latency-sensitive workloads.
In this blog post, I explain how an EBS volume is restored from an EBS snapshot. I also explain how to improve I/O latency when fetching existing data from a new EBS volume that has been created from an EBS snapshot.
How an Amazon EBS volume is restored from an Amazon EBS Snapshot
Let’s look at the default method of restoring an Amazon EBS volume from an Amazon EBS snapshot.
When you create an EBS volume based on a snapshot, the new volume begins as an exact replica of the original volume that was used to create the snapshot. The replicated volume loads data in the background so that you can begin using it immediately. If you access data that hasn’t been loaded yet, the volume immediately downloads the requested data from Amazon S3, and then continues loading the rest of the volume’s data in the background.
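The lazy-loading behavior described above can be pictured with a small toy model. This is only a sketch: the `LazyVolume` class and both latency constants are hypothetical values chosen for illustration, not EBS internals or measured figures.

```python
# Toy model of lazy loading. Both latency figures are illustrative assumptions.
FIRST_TOUCH_MS = 80.0  # hypothetical: the block must first be fetched from Amazon S3
LOADED_MS = 1.5        # hypothetical: the block is already on the volume


class LazyVolume:
    """Hypothetical volume restored from a snapshot with lazy loading."""

    def __init__(self, num_blocks):
        self.num_blocks = num_blocks
        self.loaded = set()  # indexes of blocks that have been hydrated

    def read(self, block):
        """Return the simulated latency (ms) of reading one block."""
        if not 0 <= block < self.num_blocks:
            raise IndexError(block)
        if block in self.loaded:
            return LOADED_MS
        self.loaded.add(block)  # the first touch pulls the block in
        return FIRST_TOUCH_MS


volume = LazyVolume(num_blocks=8)
first = volume.read(3)   # first access pays the fetch penalty
second = volume.read(3)  # repeat access is served from the volume
print(first, second)
```

In this model only the first read of each block is slow, which is why reading every block once ahead of time removes the penalty for later application reads.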
For most applications, amortizing the initialization cost over the lifetime of the volume is acceptable. However, there are some cases where your workload is latency sensitive. In such cases, you might need to look into different methods of hydrating the volume other than lazy loading in order to minimize latency.
Let’s review the following methods that deliver improved latency for your EBS volume hydration:
- EBS volume initialization
- Fast snapshot restore (FSR)
EBS volume initialization
One way to avoid the initial performance hit in a production environment when accessing volumes that were created from snapshots is to initialize all the blocks of the Amazon EBS volume. This is done by pulling down the storage blocks from Amazon S3 and writing them to the volume before your application accesses them. This process is called EBS initialization (formerly known as pre-warming).
Below, I’ll run a couple of tests to show you how Amazon EBS initialization can be achieved using the dd and fio utilities. I’ll also establish which utility should be preferred to force initialization. For the purpose of testing, an EBS volume with the following configuration is mounted on an EC2 instance of size m5.4xlarge:
|Type|Size|Space used|Provisioned IOPS|Provisioned throughput|
|gp3|1.5 TB|948 GB|5000|500 MB/s|
Testing Amazon EBS initialization with dd utility
dd is a command-line utility for Unix and Unix-like operating systems whose primary purpose is to convert and copy files. With the following sample command, I force dd to read every block of the EBS volume with a block size of 1 MB and copy the data to the null device, which discards everything written to it.
dd if=/dev/<device> of=/dev/null bs=1M
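To make clear what this command does, here is a minimal Python sketch of the same idea: read the source sequentially in 1 MiB chunks and throw the data away. The function name `read_all_blocks` is hypothetical, and the demonstration reads a temporary file as a stand-in for `/dev/<device>`; it is an illustration of the technique, not a replacement for dd.

```python
import os
import tempfile

CHUNK_SIZE = 1024 * 1024  # 1 MiB, matching bs=1M in the dd command


def read_all_blocks(path, chunk_size=CHUNK_SIZE):
    """Read `path` sequentially, discard the data, and return the bytes read."""
    total = 0
    # buffering=0 disables Python-level buffering; the OS page cache still applies.
    with open(path, "rb", buffering=0) as device:
        while True:
            chunk = device.read(chunk_size)
            if not chunk:  # an empty read means end of file or device
                break
            total += len(chunk)
    return total


# Demonstration on a temporary file standing in for /dev/<device>.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\0" * (3 * CHUNK_SIZE + 123))
    demo_path = tmp.name

bytes_read = read_all_blocks(demo_path)
os.remove(demo_path)
print(bytes_read)
```

Like dd with a single input stream, this loop issues one read at a time, so each request waits for the previous one to finish.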
While initializing the EBS volume with the dd command above, the average read bandwidth was around 9 MB/s, as shown in the following screenshot. The total time it took to completely initialize the EBS volume was around 35 hours.
The average latency observed between the EC2 instance and the EBS volume during initialization was around 70 ms, as shown in the following screenshot:
Testing EBS volume initialization with fio utility
Flexible I/O tester (fio) is a benchmarking and workload simulation tool. It can simulate various types of I/O load, such as sequential or random reads and writes, and it lets you define many threads, each performing I/O, and benchmark the result. With the following sample command, I run a job that reads 128 KB blocks sequentially from the EBS volume and keeps up to 32 I/O requests in flight at a time (--iodepth=32).
fio --filename=/dev/<device> --rw=read --bs=128k --iodepth=32 --ioengine=libaio --direct=1 --name=volume-initialize
As you can see in the chart below, when I executed the fio command, the average read bandwidth was around 45 MB/s. The total time it took to completely initialize the EBS volume was around 7 hours.
The average latency observed between the EC2 instance and the EBS volume during initialization was around 90 ms.
Based on the test results, the read bandwidth achieved by fio is five times the bandwidth achieved by dd. As a result, initializing the EBS volume with fio took one fifth of the time that dd took. Also, as initialization progresses, the average read latency between the EC2 instance and the EBS volume starts decreasing. Once initialization is complete, the average read latency reaches single-digit milliseconds, which is within EBS volume latency limits.
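As a rough sanity check on these numbers, the amount of data read is simply average bandwidth multiplied by elapsed time. The sketch below uses the rounded averages reported above and assumes decimal units (1 TB = 1,000,000 MB); the helper name `data_read_tb` is hypothetical.

```python
SECONDS_PER_HOUR = 3600
MB_PER_TB = 1_000_000  # decimal units, an assumption for this estimate


def data_read_tb(bandwidth_mb_s, hours):
    """Approximate data read (TB) at an average bandwidth over a duration."""
    return bandwidth_mb_s * hours * SECONDS_PER_HOUR / MB_PER_TB


dd_tb = data_read_tb(9, 35)   # dd: about 9 MB/s for about 35 hours
fio_tb = data_read_tb(45, 7)  # fio: about 45 MB/s for about 7 hours
bandwidth_ratio = 45 / 9      # fio bandwidth relative to dd
time_ratio = 35 / 7           # dd duration relative to fio

print(dd_tb, fio_tb, bandwidth_ratio, time_ratio)
```

Both runs work out to roughly the same amount of data read, so the fivefold difference in duration follows directly from the fivefold difference in bandwidth.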
Fast snapshot restore (FSR)
In the previous section, I showed how Amazon EBS volume initialization helps address I/O latency when restoring an EBS volume from an EBS snapshot. However, with EBS volume initialization, the volume does not deliver its provisioned performance until it is fully initialized. For a large EBS volume, this can take many hours, as is evident from the tests performed earlier in this post.
If your use case requires an EBS volume to instantly deliver all of its provisioned performance after creation from a snapshot, you can use Amazon EBS Fast Snapshot Restore (FSR).
Some of the use cases where fast snapshot restore is helpful are virtual desktop infrastructure (VDI), backup and restore, test/dev volume copies, and booting from custom AMIs (Amazon Machine Images). By enabling FSR on your snapshot, you will see improved and predictable performance whenever you need to restore data from that snapshot.
EBS volumes created from FSR-enabled snapshots are fully initialized upon creation and instantly deliver all of their provisioned performance. This eliminates the extra latency of an I/O operation on a block that is accessed for the first time. An important point to note is that you need to enable FSR per Availability Zone for your EBS snapshot. Each EBS snapshot and Availability Zone pair counts as one fast snapshot restore.
FSR enables you to restore multiple EBS volumes from a snapshot without the need to initialize the volumes yourself. The number of EBS volumes that can be created with the full performance benefit of FSR is determined by the EBS volume creation credits for the snapshot. Each EBS volume that you create from a snapshot with FSR enabled consumes one credit from the credit bucket. If you create an EBS volume when there is less than one credit in the bucket, the EBS volume is created without the benefit of fast snapshot restore. For more details, see the documentation on volume creation credits.
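As an illustration of the per-Availability Zone requirement, the sketch below wraps the `enable_fast_snapshot_restores` call of the boto3 EC2 client. The helper `enable_fsr`, the snapshot ID, and the zone names are hypothetical, and `FakeEc2Client` is a stand-in I added so that the example runs without AWS credentials; with a real account you would pass `boto3.client("ec2")` instead.

```python
def enable_fsr(ec2_client, snapshot_id, availability_zones):
    """Enable FSR for one snapshot in each listed Availability Zone."""
    response = ec2_client.enable_fast_snapshot_restores(
        AvailabilityZones=list(availability_zones),
        SourceSnapshotIds=[snapshot_id],
    )
    # Each (snapshot, Availability Zone) pair is one fast snapshot restore.
    return [
        (item["SnapshotId"], item["AvailabilityZone"], item["State"])
        for item in response.get("Successful", [])
    ]


class FakeEc2Client:
    """Offline stand-in for boto3.client("ec2"); returns a canned response."""

    def enable_fast_snapshot_restores(self, AvailabilityZones, SourceSnapshotIds):
        successful = [
            {"SnapshotId": snap, "AvailabilityZone": zone, "State": "enabling"}
            for snap in SourceSnapshotIds
            for zone in AvailabilityZones
        ]
        return {"Successful": successful, "Unsuccessful": []}


# Hypothetical snapshot ID and zones, for illustration only.
pairs = enable_fsr(
    FakeEc2Client(), "snap-0123456789abcdef0", ["us-east-1a", "us-east-1b"]
)
print(pairs)
```

Enabling one snapshot in two Availability Zones yields two snapshot and zone pairs, that is, two fast snapshot restores.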
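The credit rule described above can be sketched as a simple bucket. This is a toy model under stated assumptions: the `CreditBucket` class, its capacity, and the refill amount are hypothetical parameters, and the actual bucket size and refill rate are defined in the AWS documentation on volume creation credits.

```python
class CreditBucket:
    """Toy model of the volume creation credit bucket for one snapshot and zone."""

    def __init__(self, capacity):
        self.capacity = capacity  # hypothetical maximum number of credits
        self.credits = capacity   # assume the bucket starts full

    def create_volume(self):
        """Return True if the new volume gets the full FSR benefit."""
        if self.credits >= 1:
            self.credits -= 1     # each FSR volume consumes one credit
            return True
        return False              # created, but without the FSR benefit

    def refill(self, amount):
        """Add credits over time, never exceeding the bucket capacity."""
        self.credits = min(self.capacity, self.credits + amount)


bucket = CreditBucket(capacity=2)
results = [bucket.create_volume() for _ in range(3)]
bucket.refill(1)
after_refill = bucket.create_volume()
print(results, after_refill)
```

With a capacity of two, the third volume in a quick burst is created without the FSR benefit, and a later refill restores it for the next volume.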
Testing FSR performance
To benchmark the performance of the volume created from the FSR-enabled snapshot, I used the fio utility.
The average read latency during the test is 1.5 ms, as shown in the following screenshot:
As shown in the preceding “Time of Test (Hours)” image, while benchmarking the volume created from the FSR-enabled EBS snapshot with fio, I was able to achieve the expected volume performance: 500 MB/s throughput and single-digit millisecond latency.
In this post, I explained why you might observe higher I/O latency when fetching existing data from a new EBS volume created from an EBS snapshot. Then I demonstrated how I/O latency can be improved by initializing (pre-warming) an EBS volume to avoid the initial performance degradation. I also demonstrated how Fast Snapshot Restore (FSR) ensures that the restored EBS volume is fully initialized at the time of creation, so it can instantly deliver its provisioned performance.
Thank you for reading this blog post! I hope these best practices help you maintain application performance when creating a new EBS volume from an EBS snapshot.
If you have any comments or questions, don’t hesitate to leave them in the comments section.