AWS Storage Blog

Automated cost-effective archiving and on-demand data restoration

Organizations across industries need automated cost-effective archiving and on-demand data restoration solutions to manage explosive data growth driven by digital transformation, regulatory compliance, and operational insights. This data—often stored as unstructured files—is frequently retained for extended periods to meet internal, legal, or analytical requirements. As storage volumes grow into the petabyte range, businesses face a dual challenge: how to cost-effectively archive massive amounts of infrequently accessed data, and how to retrieve specific datasets quickly when business needs arise.

Many organizations store billions of small files in Amazon S3 for compliance and analytics. While Amazon S3 provides excellent scalability and durability for these use cases, organizations with billions of small files face a unique cost optimization opportunity. Each individual file incurs per-object costs for metadata storage, API requests, and storage class transitions. For datasets with extremely high object counts, intelligent batching and compression can dramatically reduce these per-object costs.

This post demonstrates how to implement a serverless archival pipeline that addresses the specific challenges of massive small file datasets. By batching thousands of small files into compressed archives before moving to S3 Glacier Deep Archive, you can reduce per-object costs by up to 98% while maintaining granular, on-demand file restoration capabilities through API-driven workflows.

To address these challenges, we implemented a fully serverless, event-driven solution using AWS that aligns with modern data management needs. The architecture consists of three main components: scalable storage for data archival, automated processing workflows for intelligent data management, and secure APIs for data access and restoration. This serverless design allows organizations to archive data intelligently and restore it on demand, efficiently, securely, and at scale.

In this post, we walk you through how to implement a fully automated, cost-effective archival and restoration solution using a serverless architecture on AWS. We demonstrate how to design and deploy a scalable pipeline that:

  • Archives billions of files intelligently, reducing transition costs
  • Supports on-demand, API-driven restoration of specific files or datasets
  • Enhances operational efficiency with minimal manual intervention
  • Scales effortlessly with data growth and business needs

This solution is ideal for organizations looking to reduce storage costs, meet compliance requirements, and make sure that data remains accessible without compromising performance or budget.

Solution overview

This solution provides a fully automated, serverless pipeline for archiving and restoring data. Core services such as Amazon S3, AWS Lambda, Amazon API Gateway, and Amazon Simple Notification Service (Amazon SNS) provide the foundation for scalable storage, automation, and event-based workflows. Supporting components such as AWS Batch and Amazon Elastic Container Service (Amazon ECS) with AWS Fargate, Amazon DynamoDB, Amazon Elastic Container Registry (Amazon ECR), and Amazon CloudWatch enable large-scale processing. The architecture is designed to scale efficiently, handle large volumes of files, and send notifications upon completion, making it well-suited for enterprise-scale data lifecycle management. The solution is deployed using the AWS Serverless Application Model (AWS SAM), enabling repeatable and end-to-end automation of infrastructure and application components.


Figure 1: Serverless data archiving and restoration architecture                                                    

Cost optimization strategy

This architecture implements several strategies to reduce costs while maintaining performance. We group and compress thousands of files into single .zip archives before moving them to S3 Glacier Deep Archive, thereby significantly reducing the number of S3 objects and associated request, transition, and retrieval costs. The solution uses AWS Batch for intelligent job orchestration, which reduces reliance on costly individual GetObject calls and optimizes container runtime by processing larger payloads efficiently. Furthermore, automated lifecycle management eliminates manual workflows, minimizing operational costs. The solution allows configurable .zip thresholds for total size and number of files, so that organizations can tune their archival process for optimal cost savings.
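To make the batching step concrete, the following minimal Python sketch groups a file manifest into .zip batches using configurable count and size thresholds. It is an illustration of the idea rather than the code used by the Batch containers; the manifest format and the default threshold values are assumptions.

# Minimal sketch of threshold-based batching (illustrative only).
# Each manifest entry is assumed to be a (key, size_in_bytes) tuple.

def plan_batches(manifest, max_files=500_000, max_bytes=500 * 1024**3):
    """Group files into batches, closing a batch when either threshold is hit."""
    batches, current, current_bytes = [], [], 0
    for key, size in manifest:
        if current and (len(current) >= max_files or current_bytes + size > max_bytes):
            batches.append(current)
            current, current_bytes = [], 0
        current.append(key)
        current_bytes += size
    if current:
        batches.append(current)
    return batches

# Example: 12 small files with a 5-file cap collapse into 3 .zip batches.
manifest = [(f"20250410/file_{i}.csv", 4_096) for i in range(12)]
print(len(plan_batches(manifest, max_files=5)))  # -> 3

Closing a batch as soon as either threshold is reached keeps individual archives at a predictable size, which simplifies later extraction during restoration.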

How the solution works

Both archival and restore processes can be integrated with any external system through an API Gateway endpoint.

Archiving process

  1. Trigger archival
    • Any external system invokes the “/archive” API Gateway endpoint with a file manifest (including the file count and total size)
  2. Archive Lambda function (see the sketch following this list)
    • Triggered by API Gateway
    • Validates the input
    • Submits a job to AWS Batch with the file details
  3. AWS Batch (Fargate) processing
    • Launches an Amazon ECS task on Fargate to run the archival process
    • Retrieves files from the source S3 bucket
    • Groups and compresses them into .zip archives based on the defined thresholds
  4. The task downloads the data from the source bucket into the container’s local storage for processing
  5. Based on the input thresholds, .zip archives are created and uploaded to S3 Glacier Deep Archive for cost-effective storage
  6. Track metadata in DynamoDB
    • Each archived file’s metadata (including its location inside the .zip) is stored in DynamoDB for future lookup

Note: The archive Lambda function writes its logs to the CloudWatch log group /aws/lambda/archive-lambda.
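For orientation, here is a minimal sketch of what the archive Lambda function could look like: it validates the request body and hands the heavy lifting to AWS Batch. The field names, environment variable names, and the job queue and job definition names are assumptions for illustration and will differ from the actual code in the repository.

import json
import os
import boto3

batch = boto3.client("batch")

REQUIRED_FIELDS = {"project_name", "src_bucket_name", "dest_bucket_name",
                   "prefix", "file_count", "archive_size"}

def handler(event, context):
    body = json.loads(event.get("body") or "{}")

    # Basic input validation before any work is submitted.
    missing = REQUIRED_FIELDS - body.keys()
    if missing:
        return {"statusCode": 400,
                "body": json.dumps({"error": f"missing fields: {sorted(missing)}"})}

    # Hand the archival work to AWS Batch; parameters are passed to the container.
    job = batch.submit_job(
        jobName=f"archive-{body['project_name']}",
        jobQueue=os.environ.get("JOB_QUEUE", "archive-job-queue"),          # assumed name
        jobDefinition=os.environ.get("JOB_DEFINITION", "archive-job-def"),  # assumed name
        containerOverrides={"environment": [
            {"name": "SRC_BUCKET", "value": body["src_bucket_name"]},
            {"name": "DEST_BUCKET", "value": body["dest_bucket_name"]},
            {"name": "PREFIX", "value": body["prefix"]},
            {"name": "FILE_COUNT", "value": str(body["file_count"])},
            {"name": "ARCHIVE_SIZE", "value": str(body["archive_size"])},
        ]},
    )
    return {"statusCode": 202, "body": json.dumps({"jobId": job["jobId"]})}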

Restoration process

  1. Trigger restoration
    • The “/restore” API Gateway endpoint is invoked with the list of files to retrieve
  2. Restore Lambda function (see the sketch following this list)
    • Looks up the file-to-.zip mapping in DynamoDB
  3. On the first request, the Lambda function initiates restoration of the corresponding .zip object from the DEEP_ARCHIVE storage class, which takes up to 12 hours to complete
  4. Amazon SNS notifies the user the same day that the restoration process has started
  5. When the user submits the same request after the roughly 12-hour restore window, the Lambda function submits a restore job to AWS Batch (described in the following step), which extracts the necessary files from the .zip archive and moves them into the restore bucket
  6. AWS Batch (Fargate) for restore
    • Retrieves the relevant .zip files, restored from S3 Glacier Deep Archive, from the destination bucket
    • Extracts only the requested files
    • Places them in a dedicated restore S3 bucket for further use
  7. Notification through Amazon SNS
    • When the restore process is initiated, the user is notified through email using Amazon SNS
    • The user can check the status of the restore process at any time by submitting the restore request again
    • Restoration from S3 Glacier Deep Archive takes approximately 10-12 hours
    • When the restore is complete, users can retrieve the files from the restore bucket
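The following minimal sketch ties the restore steps above together: look up the file-to-.zip mapping in DynamoDB, check whether the archived .zip object has already been restored, and either initiate the S3 restore (first request) or submit the extraction job to AWS Batch (follow-up request). The table name, key schema, attribute names, and Batch resource names are assumptions for illustration.

import boto3

s3 = boto3.client("s3")
dynamodb = boto3.resource("dynamodb")
batch = boto3.client("batch")
sns = boto3.client("sns")

def restore_file(filename, dest_bucket, restore_bucket, sns_topic_arn,
                 table_name="archive-metadata",   # assumed table name
                 retrieval_tier="Standard"):
    # 1. Find which .zip archive contains the requested file.
    table = dynamodb.Table(table_name)
    item = table.get_item(Key={"filename": filename})["Item"]  # assumed key schema
    zip_key = item["zip_key"]                                  # assumed attribute name

    # 2. Check whether the archived .zip object has already been restored.
    head = s3.head_object(Bucket=dest_bucket, Key=zip_key)
    restore_status = head.get("Restore", "")

    if 'ongoing-request="false"' in restore_status:
        # Restored copy is available: submit the Batch job that extracts the file.
        batch.submit_job(
            jobName="restore-extract",
            jobQueue="restore-job-queue",        # assumed queue name
            jobDefinition="restore-job-def",     # assumed job definition name
            containerOverrides={"environment": [
                {"name": "ZIP_KEY", "value": zip_key},
                {"name": "FILENAME", "value": filename},
                {"name": "RESTORE_BUCKET", "value": restore_bucket},
            ]},
        )
        return "extraction-submitted"

    if not restore_status:
        # First request: initiate the restore from S3 Glacier Deep Archive
        # and notify the caller that the (up to 12 hour) restore has started.
        s3.restore_object(
            Bucket=dest_bucket, Key=zip_key,
            RestoreRequest={"Days": 1,
                            "GlacierJobParameters": {"Tier": retrieval_tier}},
        )
        sns.publish(TopicArn=sns_topic_arn,
                    Message=f"Restore started for {filename}; check back in about 12 hours.")
        return "restore-initiated"

    return "restore-in-progress"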

Prerequisites

Before deploying the solution, make sure that you have the following installed and configured:

  • An AWS account with permissions to create the resources described in this post
  • The AWS CLI, configured with credentials for that account
  • The AWS SAM CLI, used for the guided deployment
  • Docker, used to build and push the container images to Amazon ECR
  • Git, used to clone the sample repository

Implementation

  1. Clone the repository:
    1. Open your terminal or command prompt.
    2. Navigate to the directory where you want to clone the repository.
    3. Run the following command to clone the repository to your local system:
      git clone https://github.com/aws-samples/sample-Serverless-Cost-Effective-Data-Archiving-Restoration.git
  2. Deploy the infrastructure using AWS SAM:
    1. Open your terminal or command prompt and navigate to the cloned repository.
    2. Use the AWS SAM CLI to deploy the infrastructure with guided prompts:
      $ cd sample-Serverless-Cost-Effective-Data-Archiving-Restoration
      $ sam deploy --guided -t infra_deploy.yaml
  3. Input parameters (as prompted during the AWS SAM guided deployment):
  • Stack Name: A unique name for your CloudFormation stack (for example, serverless-archive-stack)
  • Project Name: The project name
  • AWS Region: The AWS Region where the stack is deployed (for example, us-east-1)
  • VPCSelect: The ID of the VPC where AWS Batch jobs run
  • SubnetSelect: Comma-separated list of subnet IDs within the VPC
  • SGSelect: Security group to associate with the AWS Batch compute environment
  • JobvCPU: Number of vCPUs to allocate per AWS Batch job (for example, 1 vCPU)
  • JobMemory: Memory allocation for each job (for example, 2048 MB)
  • MaxCPU: Maximum CPU units available for scaling AWS Batch jobs (for example, 10)
  • ENVFargatetype: Set to FARGATE to run container jobs in a serverless compute environment
  • RestoreNotification: Email address to receive job completion notifications through Amazon SNS
  • Confirm changes before deploy [Y/n]: Y (shows you the resource changes to be deployed and requires a ‘Y’ to initiate the deployment)
  • Allow SAM CLI IAM role creation [Y/n]: Y (AWS SAM needs permission to create the IAM roles required to connect to the resources in your template)
  • Save arguments to configuration file [Y/n]: Y
  • SAM configuration file [samconfig.toml]: samconfig.toml
  • SAM configuration environment [default]: Press Enter to accept the default environment
  4. Build and deploy Docker images to Amazon ECR:

Navigate to the batch application directory and run the provided script:

 $ cd batch-apps/scripts

 $ ./deploy-object-images.sh

The script prompts for the following:

  • Project name: <<Enter Project Name>>
  • AWS Account ID: <<Enter Account ID>>
  • AWS Region: <<Enter AWS Region>>

Testing the solution

These are manual test steps. In a production scenario, these API endpoints can be integrated into other applications, workflows, or microservices to automate data archival and restoration requests.

  • Identify API Gateway endpoint

When the solution is deployed, locate the API Gateway endpoint:

  1. Go to the CloudFormation console.
  2. Choose your deployed stack.
  3. Under the Outputs tab, find the RootUrl value—this is your API Gateway endpoint.
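If you prefer to look up the endpoint programmatically, the following short boto3 sketch reads the RootUrl output from the deployed stack. It assumes the stack name serverless-archive-stack used earlier; adjust it to your own stack name.

import boto3

def get_api_endpoint(stack_name="serverless-archive-stack"):
    """Return the RootUrl output of the deployed CloudFormation stack."""
    cfn = boto3.client("cloudformation")
    stack = cfn.describe_stacks(StackName=stack_name)["Stacks"][0]
    for output in stack.get("Outputs", []):
        if output["OutputKey"] == "RootUrl":
            return output["OutputValue"]
    raise ValueError(f"No RootUrl output found on stack {stack_name}")

if __name__ == "__main__":
    print(get_api_endpoint())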

Archive API request:

curl --location --request POST '<<API Gateway Endpoint>>/live/archive' \
--header 'Content-Type: application/json' \
--data-raw '{
"project_name": "<Project Name>",
"src_bucket_name": "<<Source Bucket Name>>",
"dest_bucket_name": "<<Destination Bucket Name>>",
"prefix": "<<S3 Prefix>>",
"file_count": <<Number of file count you like to zip>>,
"archive_size": <<Size Cap in Bytes>>,
"account": "<<ECR Account number>>",
"region": "<<AWS Region>>",
"storage_class": "<<Storage Class>>",
"output_prefix": "<<Output FileName>>"
}'

Command with sample data:

curl --location --request POST 'https://test123.execute-api.us-east-1.amazonaws.com/LATEST/archive' \
--header 'Content-Type: application/json' \
--data-raw '{
"project_name": "amazon"
"src_bucket_name": "amazon-1234567890-us-east-1_src",
"dest_bucket_name": “amazon-1234567890-us-east-1_dest”,
"prefix": "20250410",
"file_count": 50000,
"archive_size": 500,
"account": "1234567890",
"region": "us-east-1",
"storage_class": "GLACIER",
"output_prefix": "amazon-20240410"
}'
Restore API request:

curl --location --request POST '<<API Gateway Endpoint>>/latest/restore' \
--header 'Content-Type: application/json' \
--data-raw '{
"project_name": "<Project Name>",
"filename": "<<file name to Restore>>",
"dest_bucket_name": <<destination bucket>>,
"restore_bucket_name": <<Restore bucket>>,
"account": "<<ECR Account number>>",
"region": "<<AWS Region>>",
"retrieval_tier": "<<Retrieval Tier>>",
"sns_topic_arn": "<<sns topic arn to receive email>>"
}'
Command with sample data:

curl --location --request POST 'https://test123.execute-api.ap-south-1.amazonaws.com/LATEST/restore' \
--header 'Content-Type: application/json' \
--data-raw '{
"project_name": "amazon",
"filename": "5000_20240517_003.csv",
"dest_bucket_name": “amazon-1234567890-us-east-1_dest”,
"restore_bucket_name": “amazon-1234567890-us-east-1_restore”,
"account": "1234567890",
"region": "us-west-1",
"retrieval_tier": "Expedited",
"sns_topic_arn": "ARN"
}'

To restore multiple files, provide the file names as a comma-separated list in the filename field (for example, "filename": "file1.csv,file2.csv").

Cleanup

To avoid incurring ongoing charges, delete the AWS resources created during this walkthrough when they are no longer needed.

  •  Delete the SAM application:

Use the AWS SAM CLI to delete the stack:

sam delete --stack-name serverless-archive-stack

  •  Delete S3 buckets manually
  1. Empty the S3 buckets (source, destination, and restore buckets) of all objects
  2. Delete the empty S3 buckets through the AWS Management Console or AWS CLI

Key benefits and outcomes

  •  Cost savings: Up to 98% reduction in S3 object storage and transition costs through intelligent compression and batch processing
  •  Fully automated: No manual intervention, fully event-driven pipeline
  •  Scalable: AWS Batch handles millions of files concurrently using Fargate
  •  Flexible and on-demand: API-driven workflows enable targeted file restoration without full archive scans
  • Secure: Fine-grained IAM policies and VPC configurations for all components
  • Production-ready: Designed for real-world data volumes, and optimized for reliability and observability

Cost impact analysis

In this section we consider an organization with 2 billion historical files (total size of 2 PB), where 50 million files (total size of 50 TB) are archived each month. While organizations at this scale typically recognize the need for optimization, implementing a production-ready solution with automated batching and granular restoration capabilities presents technical challenges. Due to the large number of small files and Amazon S3’s per-object charging model, it is more cost-effective to batch and compress files before archival to optimize transition and storage costs. This is true for both storage metadata and transition requests. Although objects under 128 KB would not naturally transition to S3 Glacier Deep Archive, this solution assumes that files are first compressed and batched into larger .zip archives. This batching step qualifies the data for archival, reduces the total object count by several orders of magnitude, and thus significantly lowers costs.

This analysis focuses on transition costs; additional savings from storage compression are not included.

Before implementing the solution

As shown in the following table, archiving the 2 billion historical files without batching incurs approximately $100,000 in one-time transition costs, plus approximately $2,500 per month in recurring transition costs for the ongoing archival.


Figure 2: Transition costs before implementing the batching solution

After implementing the solution

The following table calculates the archival cost for 2 billion historical files and 50 million monthly recurring files. Considering a batch size of 500,000 files per .zip, this creates 4,000 .zip files, resulting in 4,000 Amazon S3 transition requests. Refer to the following calculation.


Figure 3: Transition costs after implementing intelligent batching and compression
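As a rough check on these figures, the following sketch reproduces the transition-request arithmetic. It assumes a lifecycle transition request price of $0.05 per 1,000 requests, consistent with the tables above; confirm current pricing for your Region.

PRICE_PER_1000_TRANSITIONS = 0.05  # assumed USD price per 1,000 requests; confirm for your Region

def transition_cost(objects, price_per_1000=PRICE_PER_1000_TRANSITIONS):
    """Cost of one S3 lifecycle transition request per object."""
    return objects / 1000 * price_per_1000

# Before batching: one transition request per file.
print(transition_cost(2_000_000_000))   # historical files: ~$100,000 one-time
print(transition_cost(50_000_000))      # monthly files:    ~$2,500 per month

# After batching: 500,000 files per .zip archive.
print(transition_cost(2_000_000_000 // 500_000))  # 4,000 requests -> ~$0.20
print(transition_cost(50_000_000 // 500_000))     # 100 requests   -> ~$0.005 per month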

Solution component costs

The following costs are estimated based on typical usage patterns for processing 50 million files monthly. Actual costs may vary depending on specific workload requirements and usage patterns.


Figure 4: Serverless solution component costs breakdown

Comparing the transition costs before and after implementing the solution shows significant reductions for both the historical files (2 billion) and the monthly recurring archival operations. This approach achieves approximately a 98% cost reduction in transition requests, with further savings possible from reduced storage overhead and compression.

The solution components add approximately $2.76 per month in operational costs, which is minimal compared to the approximately $2,497 in monthly savings on transition costs alone. For the one-time historical migration, S3 transition costs drop from approximately $100,000 to $0.20.

Conclusion

In this post, we demonstrated how to implement a fully serverless, event-driven solution for intelligent data archiving and on-demand restoration. We showed how AWS services such as Amazon S3, AWS Lambda, Amazon API Gateway, AWS Batch with Fargate, and DynamoDB work together to automate archival, batch and compress large datasets, and enable secure, API-driven retrieval of individual files. The architecture was deployed using AWS SAM, making it repeatable and production-ready.

By applying strategies like file aggregation and automated lifecycle management, this solution can reduce transition costs by up to 98% while ensuring compliance, scalability, and operational efficiency. It is particularly valuable for organizations that need to retain massive volumes of data for long periods while still being able to retrieve specific datasets quickly when business needs arise.

To try this out, you can deploy the reference implementation with AWS SAM and test the archive and restore APIs in your environment. This hands-on approach will help you evaluate the cost savings and operational benefits of the solution in your own use case.

Pradip Kumar Pandey

Pradip Pandey is a Lead Consultant – DevOps at Amazon Web Services, specializing in DevOps, Containers, GitOps, and Infrastructure as Code (IaC). He works closely with customers to modernize and migrate applications to AWS using DevOps best practices. He helps design and implement scalable, automated solutions that accelerate cloud adoption and drive operational excellence.

Pratap Nanda

Pratap Kumar Nanda is a Lead Consultant – DevOps at Amazon Web Services, specializing in designing and implementing Container-based Applications, GitOps, and Infrastructure as Code (IaC). He helps customers migrate monolithic applications to microservices on container platforms using DevOps best practices.