Archiving Data to Amazon S3 Glacier using PowerShell

The final update of the AWS Tools for PowerShell in 2018 added support for Amazon S3 Glacier. Amazon S3 Glacier is a secure, durable, and extremely low-cost cloud storage service for data archiving and long-term backup. The update to the AWS PowerShell modules added cmdlets to support both control and data plane APIs for the service. These can be used from Windows, Linux, and macOS hosts.

In this blog post, I explore how to take advantage of the extremely low cost and durability of S3 Glacier to store large amounts of data that I don’t need immediate access to. I’m using the new support in the AWS PowerShell modules to place data from images I capture in my hobby of astrophotography into cold storage in S3 Glacier.

An astrophotography primer

Outside of my work at AWS, I have an interest in astrophotography. This involves using cameras attached to telescopes to photograph objects in deep space such as galaxies, nebulae, and star clusters. These objects are relatively faint in the night sky, so generating enough signal to image them requires lots of data. Once this data has gone through initial processing it may not be needed again, but it still has value. Storing this data is therefore an ideal scenario for S3 Glacier.

As an example of the kinds of images you can obtain, this is one of my recent images of the Triangulum Galaxy, also known as Messier 33 (or M33).

This galaxy is approximately three million light years away from Earth, and it’s the third largest of what’s known as the “local group” of galaxies. In the local group, the Andromeda galaxy (aka Messier 31, or M31) is largest, followed by our own galaxy, the Milky Way. Both Andromeda and M33 are thought to be on course to collide with the Milky Way a few billion years from now, so we have time to prepare!

Capturing images of these deep space objects can require a lot of exposures, each usually several minutes long, especially if the object is faint. This can equate to a lot of data.

In my setup I use a telescope mounted to a filter wheel containing clear, red, green, and blue filters. Sitting behind the filter wheel is a monochrome camera. For any given object (or “target”) I capture multiple exposures through each filter. In this specific example for M33, I captured 25 three-minute exposures (or “frames”) through the clear filter (this is known as “luminance” data), followed by 10 three-minute frames through each of the red, green, and blue filters. It’s possible, and usually desirable, to capture many more frames, sometimes over many nights, to increase the signal-to-noise ratio on faint objects and so improve the final composited image. I might also use additional filters beyond simple luminance, red, green, and blue to capture different wavelengths of light.

In addition to these “light” frames (the frames in which you can literally “see” something), I also need to capture calibration frames. These calibration frames enable me to remove noise from the images and to handle any dust motes or light falloff through what’s known as the “optical train” (telescope to camera and all equipment in between).

In total, the captured images consume almost 34 GB of disk space. And I could easily have captured more data, and more targets, during a recent stretch of clear weather here in Seattle. Once this data is “integrated” into the single luminance, red, green, and blue images that are then composited and further processed to form the single color image shown here, the data isn’t needed immediately. Placing this capture data into cold storage so that I can access it for reprocessing with additional capture data in future is, therefore, an ideal usage scenario for S3 Glacier. In my setup, I work on both Windows and macOS operating systems (usually depending on whim), so I need an archiving solution targeting S3 Glacier that works on both systems. PowerShell 6 with the AWS Tools for PowerShell Core module is ideal for me.

Getting started with the Amazon S3 Glacier cmdlets

You need version 3.3.428.0 or later of the AWS Tools for PowerShell to take advantage of the new cmdlets for Amazon S3 Glacier. This update was released in December 2018. Later versions may be available by now.

You can install this or later versions from the PowerShell Gallery. If you’re using PowerShell v5.1 or earlier on Windows, you should install the AWS Tools for Windows PowerShell module. Because I’m using PowerShell v6 on both Windows and macOS, I install the AWS Tools for PowerShell Core module.
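
As a minimal sketch, installing from the PowerShell Gallery might look like the following. The gallery module names (AWSPowerShell for Windows PowerShell, AWSPowerShell.NetCore for PowerShell Core) and the choice of -Scope CurrentUser are my assumptions for illustration; adjust them to suit your environment.

```powershell
# Pick the module that matches the PowerShell edition in use.
# PSEdition is 'Core' on PowerShell 6+; it is 'Desktop' (or absent) on Windows PowerShell.
$moduleName = if ($PSVersionTable.PSEdition -eq 'Core') { 'AWSPowerShell.NetCore' } else { 'AWSPowerShell' }

# Install for the current user only, requiring at least the first version with the GLC cmdlets.
Install-Module -Name $moduleName -MinimumVersion 3.3.428.0 -Scope CurrentUser

# Confirm which version ended up installed.
Get-Module -Name $moduleName -ListAvailable | Select-Object Name, Version
```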

After the tools are installed, you need to configure them with AWS access key and secret key credentials, unless you already have credentials set up (for example, from earlier use of the AWS Toolkit for Visual Studio or the AWS CLI). See this article for instructions on how to set up credentials using the tools.
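
For illustration only, one way to store credentials and pick a Region looks roughly like this. The key values are the placeholder examples used in AWS documentation, not real credentials, and the profile name default and Region us-west-2 are assumptions you should replace with your own.

```powershell
# Store an access key and secret key under a named profile (placeholder values shown).
Set-AWSCredential -AccessKey 'AKIAIOSFODNN7EXAMPLE' `
                  -SecretKey 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY' `
                  -StoreAs default

# Use US West (Oregon) as the default Region for the current shell session.
Set-DefaultAWSRegion -Region us-west-2
```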

The AWS PowerShell tools use a naming scheme in which a three-letter or four-letter prefix is applied to the noun portion of the cmdlet names for a given service. For those unfamiliar with PowerShell, a cmdlet is the term given to commands that are used in the PowerShell environment. The prefix for the Amazon S3 Glacier cmdlets is GLC. You can list the complete set of cmdlets available for the service (or any service, if you have the service name or prefix) using the Get-AWSCmdletName cmdlet. I’ve shown this below. (I’ve removed the ServiceName column from the output for clarity.)

PS C:\> Get-AWSCmdletName -Service GLC
CmdletName                          ServiceOperation                
----------                          ----------------                
Add-GLCTagsToVault                  AddTagsToVault                  
Complete-GLCVaultLock               CompleteVaultLock               
Get-GLCDataRetrievalPolicy          GetDataRetrievalPolicy          
Get-GLCJob                          DescribeJob                     
Get-GLCJobList                      ListJobs                        
Get-GLCMultipartUploadList          ListMultipartUploads            
Get-GLCProvisionedCapacityList      ListProvisionedCapacity         
Get-GLCVaultAccessPolicy            GetVaultAccessPolicy            
Get-GLCVault                        DescribeVault                   
Get-GLCVaultList                    ListVaults                      
Get-GLCVaultLock                    GetVaultLock                    
Get-GLCVaultNotification            GetVaultNotifications           
Get-GLCVaultTagsList                ListTagsForVault                
New-GLCProvisionedCapacityPurchase  PurchaseProvisionedCapacity     
New-GLCVault                        CreateVault                     
Read-GLCJobOutput                   GetJobOutput                    
Remove-GLCArchive                   DeleteArchive
Remove-GLCMultipartUpload           AbortMultipartUpload                                                                
Remove-GLCTagsFromVault             RemoveTagsFromVault                                                                 
Remove-GLCVaultAccessPolicy         DeleteVaultAccessPolicy                                                             
Remove-GLCVault                     DeleteVault                                                                         
Remove-GLCVaultNotification         DeleteVaultNotifications                                                            
Set-GLCDataRetrievalPolicy          SetDataRetrievalPolicy                                                              
Set-GLCVaultAccessPolicy            SetVaultAccessPolicy                                                                
Set-GLCVaultNotification            SetVaultNotifications                                                               
Start-GLCJob                        InitiateJob                                                                         
Start-GLCVaultLock                  InitiateVaultLock                                                                   
Stop-GLCVaultLock                   AbortVaultLock                                                                      
Write-GLCArchive                    UploadArchive;InitiateMultipartUpload;UploadMultipartPart;CompleteMultipartUpload

The output shows you the cmdlet name and the underlying S3 Glacier service API that the cmdlet maps to. In the case of the Write-GLCArchive cmdlet, which is used to upload data to an S3 Glacier vault, multiple APIs are wrapped by the cmdlet. This is because the cmdlet inspects the size of the file to upload and automatically chooses between a single UploadArchive API call and the APIs that make up a multipart upload. As the user, you don’t need to think about which set of APIs to use.

Write-GLCArchive is the cmdlet we need to upload data to a vault for archiving. It can upload a single file or a folder hierarchy (as individual archive files). The output from the cmdlet gives you the archive ID for each uploaded file. We need that ID to retrieve a given archive at a later date.

To retrieve data, we must first request that S3 Glacier start a data retrieval job. When the job completes (usually within a few hours), we can retrieve the associated data content. To request the start of a retrieval job, we use the Start-GLCJob cmdlet specifying the ID of the archive we want to subsequently retrieve, and a value of archive-retrieval for the -JobType parameter. The cmdlet outputs the ID of the started job.

To check the status of the retrieval job we use the Get-GLCJob cmdlet. When the job status in the output indicates the job is complete and the data is ready for retrieval, you use the Read-GLCJobOutput cmdlet to download the archived file to your local file system. Note that I could also automate being notified that the job is complete and subsequently downloading the content, by specifying an Amazon SNS topic to the Start-GLCJob cmdlet. But that scenario is out of scope for this blog post.

Archiving the data

Now that we know the basics of using S3 Glacier from PowerShell, let’s look at the specific usage scenario using my data capture from one or more evenings of astrophotography on a particular target.

The image captures are organized into a set of folders based on their type, as shown below. To astrophotographers, these are collectively known as “subs”, hence the overall folder name, Subs.

My workflow calls for all my “light” (luminance/red/green/blue) images to be placed into one folder named Light, with the calibration frames in other folders (Bias, Dark, Flat), based on their particular use. I then integrate this data, which means processing the individual frames of each type into a single master image, so I have one master bias, one master dark, and so on. After integration, I can archive these folders and the files they contain in S3 Glacier and remove them from my file system to save space.

For my archival workflow I chose to compress the individual calibration folders and archive the resulting .zip files, but to preserve the “lights” as individual files. This is because the processed calibration frames can mostly be reused. If I capture additional light frames in future, I can then retrieve only the archived light frames I need (potentially for a particular filter only) for reprocessing, and ignore the calibration data for which I already have integrated master frames.
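
The compression step described above can be sketched with the built-in Compress-Archive cmdlet. The function name Compress-CalibrationFolder and its parameters are hypothetical, invented for this illustration; the folder names come from my layout. Treat it as a starting point rather than the exact script I run.

```powershell
# Hypothetical helper: zip each calibration folder under a Subs folder,
# then delete the original folder once its .zip file exists.
function Compress-CalibrationFolder {
    param(
        [Parameter(Mandatory)] [string] $SubsPath,
        [string[]] $FolderName = @('Bias', 'Dark', 'Flat')
    )

    foreach ($name in $FolderName) {
        $source = Join-Path $SubsPath $name
        # Skip calibration types that were not captured for this target.
        if (-not (Test-Path -LiteralPath $source -PathType Container)) { continue }

        $zip = Join-Path $SubsPath "$name.zip"
        # Passing the folder itself keeps the folder name as the root entry in the archive.
        Compress-Archive -Path $source -DestinationPath $zip -Force

        # Only remove the originals when the archive was actually written.
        if (Test-Path -LiteralPath $zip -PathType Leaf) {
            Remove-Item -LiteralPath $source -Recurse -Force
        }
        $zip   # emit the path of each archive that was written
    }
}
```

Calling Compress-CalibrationFolder -SubsPath .\Subs leaves the Light folder untouched, so the individual light frames can still be uploaded as separate archives.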

To hold my data I need a vault, which I can create easily from PowerShell. My shell is configured to use my credentials and the US West (Oregon) Region by default, so I don’t need to supply any credentials or AWS Region parameters in the examples below.

PS C:\> New-GLCVault -VaultName astrodata
/123456789012/vaults/astrodata

The output is the vault path in my account, which I can ignore. I could also have chosen to use multiple vaults, one per target, but eventually settled on putting all data into one vault for my workflow.

After I have a vault, I’m ready to upload my content. As detailed above, I’ve compressed the various calibration folders and removed the individual calibration image files, but have left the files in the Light folder uncompressed. In total, I have three calibration .zip files and 54 “light” image files to upload for this particular object.

To upload the archives I use the Write-GLCArchive cmdlet mentioned earlier. I use this in a pipeline to capture the output into an inventory file list. This file will allow me to equate a local filename with an archive in the vault, as well as its checksum, which I’ll want to reference when I decide I need to retrieve one or more archives in future. We’ll look at my use of this file in the retrieval process shortly. Note that my working folder for this command is as shown in the image above.

PS C:\> Write-GLCArchive -VaultName astrodata -FolderPath .\ -Recurse | 
             Export-Csv -Path "..\m33.archive_inventory.csv"

While the upload runs, progress output is displayed.

If I had not chosen to capture the cmdlet output into a file, you’d see the file name (including path), the corresponding archive ID, and file checksum being output from the cmdlet to the pipeline as each file is processed.

Because I captured this into a file, we need to open the file to see the data. I chose to use Export-Csv instead of Out-File because the columnar output can be truncated with Out-File. This is especially noticeable when long file paths (as in my case) are involved. Notice that the archive IDs are also quite long, making it very likely I’ll see data truncation in the output from Out-File.
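
To give a feel for how the inventory can be queried later, here is a small sketch. The column names FilePath, ArchiveId, and Checksum are my assumption about the header row that Export-Csv produces from the cmdlet output, and the rows are made-up placeholders rather than real archive IDs. Check the header of your own file before relying on these names.

```powershell
# Illustrative stand-in for the inventory file. The column names (FilePath,
# ArchiveId, Checksum) are assumed; check the header row of your own .csv.
$inventory = @'
"FilePath","ArchiveId","Checksum"
"C:\Subs\Light\M33_Light_Blue_180sec__1x1_frame1.fit","example-archive-id-1","example-checksum-1"
"C:\Subs\Light\M33_Light_Red_180sec__1x1_frame1.fit","example-archive-id-2","example-checksum-2"
"C:\Subs\Bias.zip","example-archive-id-3","example-checksum-3"
'@ | ConvertFrom-Csv

# With a real file: $inventory = Import-Csv -Path "..\m33.archive_inventory.csv"

# Select only the blue-filter light frames, keeping the leaf file name alongside the ID.
# Splitting on both separators works the same on Windows, Linux, and macOS.
$blueFrames = $inventory |
    Where-Object { $_.FilePath -like '*_Light_Blue_*' } |
    Select-Object ArchiveId, @{ Name = 'FileName'; Expression = { ($_.FilePath -split '[\\/]')[-1] } }

$blueFrames | Format-Table -AutoSize
```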

Because I might want to have access to this inventory data reasonably quickly, I don’t upload it to S3 Glacier. Instead I upload it to an Amazon S3 bucket. I also keep a copy on my local file system, but the availability of the cloud backup helps me sleep at night.

PS C:\> Write-S3Object -BucketName myastrodata -File "..\m33.archive_inventory.csv"

Retrieving data for additional work

When I want to retrieve some data for a particular target, likely because I’ve captured additional images that I want to mix in with existing content, I simply download the inventory .csv file I mentioned earlier (using Read-S3Object) and use it to determine the list of archive IDs of the files I need to retrieve. Then for each archive I use the Start-GLCJob cmdlet, passing it the individual archive ID, to request that S3 Glacier prepare the content for download. I need to start one job for each archive I want to retrieve, for example:

PS C:\> Start-GLCJob -VaultName astrodata -JobType 'archive-retrieval' `
            -ArchiveId "v3JwAZ9fGPq...1aQ8OGzgc6Llg" `
            -JobDescription "M33_Light_Blue_180sec__1x1_frame1.fit"
JobId               JobOutputPath Location
-----               ------------- --------
Zt6Ree...yAiV5wwPj                /123456789012/vaults/astrodata/jobs/Zt6Re...5wwPj

You can see that the cmdlet returns an ID for the job. Note how I also set the original file name associated with the archive into the job’s description. I can get the original file name from which the archive was created from my inventory file. Putting the original file name into the job description will help me when I actually download the content later on.
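
When several archives are needed, the one-job-per-archive pattern can be driven from inventory rows. The sketch below only builds the parameter sets; the helper name New-RetrievalJobParameter, the FilePath and ArchiveId property names, and the sample rows are all hypothetical. The parameters themselves are the same ones used in the Start-GLCJob call above.

```powershell
# Hypothetical helper: turn inventory rows into parameter sets for Start-GLCJob.
# Assumes each row has FilePath and ArchiveId properties (assumed column names).
function New-RetrievalJobParameter {
    param(
        [Parameter(Mandatory, ValueFromPipeline)] $Row,
        [Parameter(Mandatory)] [string] $VaultName
    )
    process {
        @{
            VaultName      = $VaultName
            JobType        = 'archive-retrieval'
            ArchiveId      = $Row.ArchiveId
            # Use the leaf file name so it can be reused as the download name later.
            JobDescription = ($Row.FilePath -split '[\\/]')[-1]
        }
    }
}

# Example rows standing in for Import-Csv output; the IDs are placeholders.
$rows = @(
    [pscustomobject]@{ FilePath = 'C:\Subs\Light\M33_Light_Blue_180sec__1x1_frame1.fit'; ArchiveId = 'example-archive-id-1' }
    [pscustomobject]@{ FilePath = 'C:\Subs\Light\M33_Light_Blue_180sec__1x1_frame2.fit'; ArchiveId = 'example-archive-id-2' }
)

$jobParams = @($rows | New-RetrievalJobParameter -VaultName astrodata)

# To actually start the jobs (requires credentials and the AWS module):
# foreach ($p in $jobParams) { Start-GLCJob @p }
```

Keeping the parameter building separate from the call makes it easy to inspect what would be requested before any retrieval jobs are started.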

Given that I didn’t specify an SNS topic to receive notifications, I can keep track of the job’s progress using the Get-GLCJob cmdlet. I’m using standard retrievals, so I expect it will be 3-5 hours before Get-GLCJob reports that the data is available to download.

PS C:\> Get-GLCJob -VaultName astrodata -JobId "Zt6Ree...yAiV5wwPj"
Action : ArchiveRetrieval
ArchiveId : v3JwAZ9fGPq...1aQ8OGzgc6Llg
ArchiveSHA256TreeHash : 36e6802a75fd6339e72a3ac8a05fcb5160194f55221ec5a761648e97d3835b22
ArchiveSizeInBytes : 5962393136
Completed : False
CompletionDate : 1/1/0001 12:00:00 AM
CreationDate : 1/3/2019 12:45:10 PM
InventoryRetrievalParameters :
InventorySizeInBytes : 0
JobDescription : M33_Light_Blue_180sec__1x1_frame1.fit
JobId : Zt6Ree...yAiV5wwPj
JobOutputPath :
OutputLocation :
RetrievalByteRange : 0-5962393135
SelectParameters :
SHA256TreeHash : 36e6802a75fd6339e72a3ac8a05fcb5160194f55221ec5a761648e97d3835b22
SNSTopic :
StatusCode : InProgress
StatusMessage :
Tier : Standard
VaultARN : arn:aws:glacier:us-west-2:123456789012:vaults/astrodata
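
Rather than re-running Get-GLCJob by hand, a simple polling loop could wait on the Completed property shown in the output above. This is only a sketch: the function Wait-ForCompletion, its parameter names, and the default interval and timeout are hypothetical choices of mine, not part of the AWS tools.

```powershell
# Hypothetical helper: poll until a check reports completion or a timeout passes.
# $Check is any script block returning an object with a boolean Completed property,
# for example { Get-GLCJob -VaultName astrodata -JobId $jobId }.
function Wait-ForCompletion {
    param(
        [Parameter(Mandatory)] [scriptblock] $Check,
        [int] $IntervalSeconds = 900,     # poll every 15 minutes by default
        [int] $TimeoutSeconds  = 21600    # give up after 6 hours by default
    )
    $deadline = (Get-Date).AddSeconds($TimeoutSeconds)
    while ($true) {
        $job = & $Check
        if ($job.Completed) { return $job }
        if ((Get-Date) -ge $deadline) { throw 'Timed out waiting for the job to complete.' }
        Start-Sleep -Seconds $IntervalSeconds
    }
}
```

A completed job is not necessarily a successful one, so it is worth inspecting StatusCode and StatusMessage on the returned object before attempting the download.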

Once the job (or jobs, as needed) complete, I’ll use the Read-GLCJobOutput cmdlet to actually download the content. I need to run the cmdlet once per completed job. Because I set the original file name into the job description, I know the target file name to supply to the -FilePath parameter on download for a given archive.

PS C:\> Read-GLCJobOutput -VaultName astrodata -JobId "Zt6Ree...yAiV5wwPj" `
            -FilePath "C:\astro-retrieval\M33_Light_Blue_180sec__1x1_frame1.fit"

In the example above, I’m retrieving and downloading a single file. As with uploading, a progress bar is displayed in the shell.

In practice, I’m more likely to be downloading a batch of files, so I would use a pipeline iterating over completed jobs.

PS C:\> Get-GLCJobList -VaultName astrodata -Completed $true | 
             % { Read-GLCJobOutput -VaultName astrodata -JobId $_.JobId `
                     -FilePath "C:\astro-retrieval\$($_.JobDescription)" }

I’m using a filter on the service API call (-Completed $true) to yield only the completed jobs as output. I could achieve the same with client-side filtering by piping the data for all jobs returned from an unfiltered Get-GLCJobList call to Where-Object { $_.Completed } and then into the download command. Choices are always good!

One thing to be aware of is that completed jobs don’t expire for 24 hours. In other words, I can download the job output as many times as I like in the 24-hour period after the job completes. So in real-world usage of my pipeline above, I’d also check that I had not already downloaded the file associated with a completed job.
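
One way to add that check is a small filter placed in the middle of the pipeline. The function name Select-PendingDownload and its DestinationFolder parameter are hypothetical; the sketch assumes, as in my examples, that each job’s description holds the original file name.

```powershell
# Hypothetical filter: pass through only jobs whose target file is not yet on disk.
# Expects objects with a JobDescription property holding the original file name.
function Select-PendingDownload {
    param(
        [Parameter(Mandatory, ValueFromPipeline)] $Job,
        [Parameter(Mandatory)] [string] $DestinationFolder
    )
    process {
        # Jobs without a description cannot be mapped to a file name, so skip them.
        if ([string]::IsNullOrEmpty($Job.JobDescription)) { return }

        $target = Join-Path $DestinationFolder $Job.JobDescription
        if (-not (Test-Path -LiteralPath $target)) { $Job }
    }
}

# Intended use with the pipeline shown earlier (requires credentials and the AWS module):
# Get-GLCJobList -VaultName astrodata -Completed $true |
#     Select-PendingDownload -DestinationFolder 'C:\astro-retrieval' |
#     ForEach-Object {
#         $path = Join-Path 'C:\astro-retrieval' $_.JobDescription
#         Read-GLCJobOutput -VaultName astrodata -JobId $_.JobId -FilePath $path
#     }
```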

What if the upload to S3 Glacier fails?

Home networks can sometimes fail when processing large data uploads, so I need to be prepared for an upload to the vault exiting prematurely. If this happens, I can potentially be left with partially completed multipart uploads that are chargeable. To clean up these partial uploads so that I’m not charged I can use two other cmdlets in conjunction.

The first is Get-GLCMultipartUploadList. This returns a collection of incomplete multipart uploads that I want to remove. The second is Remove-GLCMultipartUpload, which actually cancels each of those uploads. I can pipe these together like so:

PS C:\> Get-GLCMultipartUploadList -VaultName astrodata | 
             Remove-GLCMultipartUpload -VaultName astrodata -Force

I need to specify the vault name for the removal cmdlet because the data elements in the list output from Get-GLCMultipartUploadList contain the vault’s Amazon Resource Name (ARN) identifier, not the vault name. The result of running this pipeline is that any residual incomplete parts due to earlier upload errors are canceled, and I don’t get charged for them.

Where to go from here?

The workflow I’ve outlined here is my first pass at an archival strategy for my capture files, and its automation is still evolving. Some of the things I’m thinking about:

Constructing an optimal upload strategy for when I add fresh data to an existing target, by using the inventory archive file and the file names and checksums it contains.
Automating the retrieval of content by a script that processes a given inventory archive file and schedules the necessary jobs.
Using notifications to automatically download content when an archive retrieval job signals the content is ready.

The (deep) sky really is the limit!