AWS Storage Blog

Automate data synchronization between AWS Outposts racks and Amazon S3 with AWS DataSync

Many organizations generate large quantities of data locally, including digital imagery, sensor data, and more. Customers require local compute and storage to ingest this data and enable real-time predictions, and they often preprocess it locally before transferring it to the cloud to unlock additional business value such as analysis, reporting, and archiving. Automating transfers to the cloud at scale can often lead to undifferentiated synchronization jobs and result in compute overhead and costs.

AWS Outposts enables you to use familiar AWS APIs at the edge. Outposts helps you meet low-latency and data residency requirements, including the ability to make real-time decisions locally. S3 on Outposts offers a familiar API model and a cost-effective object storage solution. You can sync data from S3 on Outposts to AWS using AWS DataSync for additional processing and archiving.

In this post, we walk through how to automate an AWS Outposts deployment using infrastructure as code (IaC). We demonstrate how to provision an S3 bucket on an Outposts rack, deploy and configure a DataSync agent to replicate data to an S3 bucket in the Region, and test the solution. This solution allows you to move data to AWS in a simple, automated fashion after you have finished processing it locally, helping you optimize cost and continue appropriate offsite processing for further insights depending on your use case.

Solution overview

The following reference architecture represents the solution described in this post, including an S3 on Outposts bucket, a DataSync agent, and an S3 bucket in the same AWS Region. The DataSync agent acts as a proxy, providing the DataSync service access to the S3 on Outposts bucket so that it can sync data from the S3 on Outposts bucket to the destination S3 bucket in the Region.

Figure 1: Reference architecture for Amazon S3 on Outposts data transfers to Amazon S3 with AWS DataSync

Prerequisites

For the following overview, an Outposts rack with S3 on Outposts capacity that is connected to a chosen AWS Region is required. We recommend reviewing the networking section of AWS Outposts High Availability Design and Architecture Considerations and the User Guide for racks for reference. Amazon S3 on Outposts delivers object storage to your on-premises Outposts racks environment to help with local data processing and data residency needs. Using the S3 APIs and features, S3 on Outposts makes it simple to store, secure, tag, retrieve, report on, and control access to the data on your Outposts.

This walkthrough assumes a Virtual Private Cloud (VPC) has been extended to the Outpost with the following configuration (a minimal sketch of the Outposts subnet follows the list):

  • Internet gateway deployed in the Region.
  • Public subnet deployed on the Outpost, which will be used to host the DataSync agent and simplify activation.
  • Destination S3 bucket in the Region as the DataSync destination.
  • A workstation with Terraform.
  • A private subnet in the corresponding AWS Region with internet access through a NAT gateway for a test instance.
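
The following is a minimal sketch of how the subnet on the Outpost might be declared in Terraform, matching the aws_subnet.outpostsAzA reference used later in this post. The variable names (var.outpost_arn, var.outpost_az, var.public_subnet_cidr) are assumptions for illustration; adjust them to your environment.

# Example subnet extended onto the Outpost (assumed variable names)
resource "aws_subnet" "outpostsAzA" {
  vpc_id            = aws_vpc.vpc.id
  cidr_block        = var.public_subnet_cidr # assumed variable: subnet CIDR
  availability_zone = var.outpost_az         # assumed variable: AZ the Outpost is anchored to
  outpost_arn       = var.outpost_arn        # assumed variable: ARN of your Outpost

  tags = {
    Name = "outposts-public-subnet"
  }
}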

Walkthrough

To automate the transfer of data from Amazon S3 on Outposts to an AWS Region, complete the following steps:

  1. Configure S3 on Outposts to complete the initial source configuration.
  2. Deploy a DataSync Agent on Outposts to facilitate data replication.
  3. Test the DataSync Task to validate functionality.

Step 1: Configure Amazon S3 on AWS Outposts

S3 on Outposts provides S3-compatible object storage on Outposts to help meet data residency and latency requirements. S3 buckets on Outposts require S3 on Outposts endpoints to access objects from within the VPC. To create a bucket named example-outpost-bucket in Terraform, modify the following sample code with your Outposts ID. You can obtain the Outposts ID from the AWS Outposts console.

# Create Outposts Bucket
resource "aws_s3control_bucket" "outpost-bucket" {
  bucket     = "example-outpost-bucket"
  outpost_id = data.aws_outposts_outpost.example.id
}
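
The sample above references an aws_outposts_outpost data source to look up the Outposts ID. The following is a minimal sketch of that lookup; the var.outpost_id variable is an assumption, and you can instead hard-code the ID obtained from the console.

# Look up the Outpost by its ID (assumed variable)
data "aws_outposts_outpost" "example" {
  id = var.outpost_id
}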

To access a bucket on Outposts, create an S3 on Outposts endpoint and an S3 on Outposts access point. Requests are routed to the access point through the S3 on Outposts endpoint, and the endpoint is accessible from within the VPC and from on premises via the local gateway. The following code creates a security group, an S3 on Outposts endpoint, and an access point, and associates them with the bucket. You will need to modify the code for your specific use case.

# Create https security group
resource "aws_security_group" "sg_https" {
  name = "sg_https"
  description = "allow https connectivity"
  vpc_id = aws_vpc.vpc.id
  
  # https over tcp from vpc cidr
  ingress {
    from_port = 443
    to_port = 443
    protocol = "tcp"
    # var.cidr = your vpc cidr
    cidr_blocks = [ "${var.cidr}"]
  }

  # allow all out
  egress {
    from_port = 0
    to_port = 0
    protocol = "-1"
    cidr_blocks = [ "0.0.0.0/0" ]
  }
}

# Create S3 endpoint on outposts
resource "aws_s3outposts_endpoint" "s3_endpoint_outpost" {
  outpost_id = data.aws_outposts_outpost.example.id
  security_group_id = aws_security_group.sg_https.id
  subnet_id         = aws_subnet.outpostsAzA.id
}

# Create S3 Accesspoint and associate with bucket
resource "aws_s3_access_point" "outpost-bucket-access-point" {
  bucket = aws_s3control_bucket.outpost-bucket.arn
  name   = "example-bucket-access-point"

  vpc_configuration {
    vpc_id = aws_vpc.vpc.id
  }
}

Step 2: Deploy an AWS DataSync Agent on AWS Outposts

DataSync can be used to send data from an on-premises environment to an AWS Region. The deployment pattern for Outposts is similar to on-premises migrations using DataSync and requires a local deployment of a resource known as the DataSync agent. The DataSync agent requires network connectivity to the source storage system and to the DataSync service endpoint. Configuring a DataSync agent on Outposts therefore requires HTTPS connectivity to the S3 on Outposts endpoint and to the DataSync service endpoint. The DataSync agent also requires activation to enable communication with the AWS DataSync service. To activate it programmatically, the Terraform workstation requires HTTP connectivity to the DataSync agent on Outposts and HTTPS connectivity to the DataSync service endpoint in the Region. For the purposes of this post, the DataSync agent is deployed with a public IP to minimize the amount of infrastructure required; private connectivity from the Terraform workstation to the DataSync agent is recommended in production environments. The following security group allows the Terraform management IP access to the DataSync agent.

#Datasync Agent Security Group
resource "aws_security_group" "sg_datasync_agent" {
  name = "datasync agent security group"
  description = "Datasync Agent Security Group"
  vpc_id = aws_vpc.vpc.id
  
  # INBOUND
  # HTTP for datasync activation from management terraform workstation
  ingress {
    from_port = 80
    to_port = 80
    protocol = "tcp"
    # var.terraform_mgmt = your Terraform management workstation public IP in CIDR notation
    cidr_blocks = [ "${var.terraform_mgmt}" ]
  }

  # OUTBOUND  
  egress {
    from_port = 443
    to_port = 443
    protocol = "tcp"
    cidr_blocks = [ "0.0.0.0/0" ]
  }
}

Next, we will create the IAM permissions to enable the DataSync agent access to the source and destination S3 buckets.

#trust policy
data "aws_iam_policy_document" "datasync_assume_role" {
  statement {
    actions = ["sts:AssumeRole",]
    principals {
      identifiers = ["datasync.amazonaws.com"]
      type        = "Service"
    }
  }
}

#datasync source access policy
data "aws_iam_policy_document" "s3_source_access" {
  statement {
    actions = ["s3-outposts:ListBucket", "s3-outposts:ListBucketMultipartUploads",]
    resources = [
      "${aws_s3_access_point.outpost-bucket-access-point.arn}",
      "${aws_s3control_bucket.outpost-bucket.arn}",
    ]
  }
  statement {
    actions = ["s3-outposts:AbortMultipartUpload",  "s3-outposts:DeleteObject", "s3-outposts:GetObject", "s3-outposts:ListMultipartUploadParts", "s3-outposts:GetObjectTagging", "s3-outposts:PutObjectTagging",]
    resources = [
      "${aws_s3_access_point.outpost-bucket-access-point.arn}/*",
      "${aws_s3control_bucket.outpost-bucket.arn}/*",
    ]
  }
  statement {
    actions = ["s3-outposts:GetAccessPoint",]
    resources = [
      "${aws_s3_access_point.outpost-bucket-access-point.arn}",
    ]
  }
}

#datasync destination policy
data "aws_iam_policy_document" "s3_destination_access" {
  statement {
    actions = ["s3:GetBucketLocation", "s3:ListBucket", "s3:ListBucketMultipartUploads", "s3:AbortMultipartUpload", "s3:DeleteObject", "s3:GetObject", "s3:ListMultipartUploadParts", "s3:GetObjectTagging", "s3:PutObjectTagging", "s3:PutObject",] 
    resources = [
      "${aws_s3_bucket.Region-bucket.arn}",
      "${aws_s3_bucket.Region-bucket.arn}/*",
    ]
  }
}

#create source iam role
resource "aws_iam_role" "datasync-s3-source-access-role" {
  name               = "datasync-s3-source-access-role"
  assume_role_policy = "${data.aws_iam_policy_document.datasync_assume_role.json}"
}

#attach source iam policy
resource "aws_iam_role_policy" "datasync-s3-source-access-policy" {
  name   = "datasync-s3-source-access-policy"
  role   = "${aws_iam_role.datasync-s3-source-access-role.name}"
  policy = "${data.aws_iam_policy_document.s3_source_access.json}"
}

#create destination role
resource "aws_iam_role" "datasync-s3-destination-access-role" {
  name               = "datasync-s3-destintation-access-role"
  assume_role_policy = "${data.aws_iam_policy_document.datasync_assume_role.json}"
}

#attach destination policy
resource "aws_iam_role_policy" "datasync-s3-destination-access-policy" {
  name   = "datasync-s3-destination-access-policy"
  role   = "${aws_iam_role.datasync-s3-destination-access-role.name}"
  policy = "${data.aws_iam_policy_document.s3_destination_access.json}"
}

Next, we will get the most recent DataSync agent Amazon Machine Image (AMI) to deploy on Outposts.

data "aws_ami" "datasync_ami" {
  most_recent = true
  owners      = ["amazon"]
  
  filter {
    name = "name"
    values = [
      "aws-datasync-*-x86_64-gp2"
    ]
  }
  
  filter {
    name = "owner-alias"
    values = [
      "amazon",
    ]
  }
}

resource "aws_instance" "datasync_agent" {
  lifecycle { prevent_destroy = false }
  ami = data.aws_ami.datasync_ami.id
  instance_type = "c5.large"
  key_name = var.key_name
  vpc_security_group_ids = [ "${aws_security_group.sg_datasync_agent.id}" ]
  subnet_id = aws_subnet.outpostsAzA.id
  associate_public_ip_address = true
  
  tags = {
    Name = "outpost-datasync-agent"
  }

  root_block_device {
    volume_type = "gp2"
    volume_size = "100"
  }

  metadata_options {
    instance_metadata_tags = "enabled"
    http_endpoint          = "enabled"
    http_tokens            = "required"
  }
}

Once deployed, the DataSync agent requires activation. Since the Terraform workstation has connectivity to the DataSync agent, the Terraform workstation can activate the DataSync agent. Following activation, we define the source and destination locations as well as the DataSync task.

#data sync agent registration
resource "aws_datasync_agent" "datasync-agent" {
  ip_address = aws_instance.datasync_agent.public_ip
  name       = "datasync-agent"
}

resource "aws_datasync_location_s3" "source" {
  agent_arns    = [aws_datasync_agent.datasync-agent.arn]
  s3_bucket_arn = aws_s3_access_point.outpost-bucket-access-point.arn
  subdirectory  = "/"

  s3_config {
    bucket_access_role_arn = "${aws_iam_role.datasync-s3-source-access-role.arn}"
  }
  s3_storage_class = "OUTPOSTS"

  tags = {
    Name = "datasync-agent-location-s3-source",
  }
}

resource "aws_datasync_location_s3" "destination" {
  s3_bucket_arn = aws_s3_bucket.Region-bucket.arn
  subdirectory  = "/"

  s3_config {
    bucket_access_role_arn = "${aws_iam_role.datasync-s3-destination-access-role.arn}"
  }

  tags = {
    Name = "datasync-agent-location-s3-source",
  }
}

resource "aws_datasync_task" "datasync-task" {
  name                     = "Demo Datasync Task"
  source_location_arn      = aws_datasync_location_s3.source.arn
  destination_location_arn = aws_datasync_location_s3.destination.arn
  
  options {
    bytes_per_second = -1
    posix_permissions = "NONE"
    uid = "NONE"
    gid = "NONE"
    verify_mode = "POINT_IN_TIME_CONSISTENT"
  }
}
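
The task above runs on demand. To automate the synchronization on a recurring basis, the aws_datasync_task resource supports a schedule block; the following is a minimal sketch, assuming an hourly cadence meets your requirements. Add it inside the aws_datasync_task resource above.

  # Optional: run the task automatically on a recurring schedule (assumed hourly cadence)
  schedule {
    schedule_expression = "rate(1 hour)"
  }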

Step 3: Test the DataSync Task

The S3 on Outposts bucket is only accessible through the access point on Outposts. To demonstrate functionality, an EC2 test instance will be deployed into the private subnet in the Region. The following code also creates the required IAM roles, policies, and instance profile for access to the S3 on Outposts bucket.

# iam role/policy/instance profile for test instance
data "aws_iam_policy_document" "ec2_s3_source_access" {
  statement {
    actions = ["s3-outposts:ListBucket", "s3-outposts:ListBucketMultipartUploads",]
    resources = [
      "${aws_s3_access_point.outpost-bucket-access-point.arn}",
      "${aws_s3control_bucket.outpost-bucket.arn}",
    ]
  }
  statement {
    actions = ["s3-outposts:AbortMultipartUpload",  "s3-outposts:DeleteObject", "s3-outposts:GetObject", "s3-outposts:ListMultipartUploadParts", "s3-outposts:GetObjectTagging", "s3-outposts:PutObjectTagging", "s3-outposts:PutObject",]
    resources = [
      "${aws_s3_access_point.outpost-bucket-access-point.arn}/*",
      "${aws_s3control_bucket.outpost-bucket.arn}/*",
    ]
  }
  statement {
    actions = ["s3-outposts:GetAccessPoint",]
    resources = [
      "${aws_s3_access_point.outpost-bucket-access-point.arn}",
    ]
  }
}
data "aws_iam_policy_document" "ec2_assume_role" {
  statement {
    actions = ["sts:AssumeRole",]
    principals {
      identifiers = ["ec2.amazonaws.com"]
      type        = "Service"
    }
  }
}
resource "aws_iam_role" "iam_role_test_ec2_role" {
  name               = "ec2_test_instance_role"
  assume_role_policy = "${data.aws_iam_policy_document.ec2_assume_role.json}"
}
resource "aws_iam_role_policy" "ec2-s3-source-access-policy" {
  name   = "datasync-s3-source-access-policy"
  role   = "${aws_iam_role.iam_role_test_ec2_role.name}"
  policy = "${data.aws_iam_policy_document.ec2_s3_source_access.json}"
}
resource "aws_iam_instance_profile" "iam_role_test_ec2_profile" {
  name = "iam_role_test_ec2_profile"
  role = aws_iam_role.iam_role_test_ec2_role.name
}
resource "aws_iam_role_policy_attachment" "iam_role_attach_1" {
   role = "${aws_iam_role.iam_role_test_ec2_role.name}"
   policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonEC2RoleforSSM"
}
resource "aws_iam_role_policy_attachment" "iam_role_attach_2" {
   role = "${aws_iam_role.iam_role_test_ec2_role.name}"
   policy_arn = "arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore"
}
resource "aws_iam_role_policy_attachment" "iam_role_attach_3" {
   role = "${aws_iam_role.iam_role_test_ec2_role.name}"
   policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonSSMAutomationRole"
}

# Data source to determine the latest Amazon Linux 2 AMI
data "aws_ami" "amazon_linux_ami" {
  most_recent = true
  owners      = ["amazon"]
  
  filter {
    name = "name"
    values = [
      "amzn-ami-hvm-*-x86_64-gp2",
    ]
  }
  
  filter {
    name = "owner-alias"
    values = [
      "amazon",
    ]
  }
}

# Private instance in Region
resource "aws_instance" "test-ec2-instance" {
  lifecycle { prevent_destroy = false }
  ami = data.aws_ami.amazon_linux_ami.id
  instance_type = "t2.micro"
  key_name = var.key_name
  vpc_security_group_ids = [ "${aws_security_group.sg_https.id}" ]
  subnet_id = aws_subnet.privateAzA.id
  iam_instance_profile = "${aws_iam_instance_profile.iam_role_test_ec2_profile.name}"
  tags = {
    Name = "private-test-Region"
  }
  root_block_device {
    volume_type = "gp2"
    volume_size = "10"
  }

  metadata_options {
    instance_metadata_tags = "enabled"
    http_endpoint          = "enabled"
    http_tokens            = "required"
  }
}

Once the instance is deployed, navigate to the S3 on Outposts console and select the S3 on Outposts bucket. Next, navigate to the Outposts access points tab and select the access point. From there, copy the access point alias as shown below.

Figure 2: S3 on Outposts console with access points alias shown
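
Alternatively, the aws_s3_access_point resource exports the access point alias, so you can surface it directly from Terraform instead of copying it from the console. The following output block is a minimal sketch, assuming the provider version in use exports the alias attribute.

# Output the S3 on Outposts access point alias for use on the test instance
output "outpost_access_point_alias" {
  value = aws_s3_access_point.outpost-bucket-access-point.alias
}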

Navigate to the EC2 console and select the test instance. Then select Connect to connect through Session Manager as shown below.

Figure 3: Amazon EC2 console showing test instance

Once connected, modify the commands below with the access point alias. The commands download the latest AWS CLI and upload an object to the S3 on Outposts bucket for replication with DataSync.

#connect to private instance via session manager:
cd /tmp
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install

export AP=<PASTE_ACCESSPOINT_ALIAS_HERE>
/usr/local/bin/aws s3api put-object --bucket ${AP} --key install_upload --body awscliv2.zip
/usr/local/bin/aws s3api list-objects-v2 --bucket ${AP}

Navigate to the DataSync console and select the task created above. Select the Start button and choose Start with defaults as shown in the following screenshot.

Figure 4: DataSync console demo task
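
If you prefer the command line, you can also start the task with the AWS CLI from a workstation with DataSync permissions; the task ARN below is a placeholder for the ARN created by Terraform.

# Start the DataSync task from the CLI (replace the placeholder with your task ARN)
aws datasync start-task-execution --task-arn <YOUR_DATASYNC_TASK_ARN>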

The task status can be monitored through the History tab. The status will progress to Success, as described in AWS DataSync task statuses.

Figure 5: AWS DataSync console task history
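
Task executions can also be monitored from the CLI; the ARNs below are placeholders for the values created when the task is started.

# List executions for the task and inspect the most recent one (placeholder ARNs)
aws datasync list-task-executions --task-arn <YOUR_DATASYNC_TASK_ARN>
aws datasync describe-task-execution --task-execution-arn <YOUR_TASK_EXECUTION_ARN>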

Once the task is completed, list the contents of the destination bucket through the S3 console to confirm the test object has been transferred.

Figure 6: S3 console with test object

Cleaning up

Before cleaning up the EC2 instances that were created during this walkthrough, it is recommended to delete the objects in both the Region S3 bucket and the S3 on Outposts bucket. Objects stored in the AWS Region can be removed via the S3 console. The object within the S3 on Outposts bucket requires a Session Manager connection to the EC2 test instance used earlier. To remove the test object, run the following commands, replacing the placeholder access point alias:

export AP=<PASTE_ACCESSPOINT_ALIAS_HERE>
/usr/local/bin/aws s3api delete-object --bucket ${AP} --key install_upload
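
If you prefer the CLI over the console for the Region bucket as well, the following command removes all objects from it; the bucket name is a placeholder for the destination bucket you created.

# Empty the destination bucket in the Region (replace the placeholder with your bucket name)
aws s3 rm s3://<YOUR_DESTINATION_BUCKET_NAME> --recursive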

Once the test objects are removed from the Region and S3 on Outposts buckets, the infrastructure created during this walkthrough can be removed using terraform destroy.

Conclusion

Customers can automate the transfer of large quantities of object data from S3 on Outposts to AWS Regions for further processing and integration with other AWS services. This post provides an example of using Terraform to create an S3 on Outposts bucket, an S3 on Outposts endpoint, and an associated access point. It then illustrates how to create a DataSync agent and task. Finally, the test instance demonstrates the functionality of the solution.

Using DataSync to automate data movement from S3 on Outposts to the Region is simple and cost effective. Doing so enables you to archive and free up storage capacity in S3 on Outposts for future real-time processing and reduce total storage costs. Once in the Region, you can analyze, query, and visualize the data to unlock virtually unlimited insights. Customers require local compute and storage to ingest data and enable real-time predictions before transferring it to the cloud to unlock additional business value through analysis with Amazon SageMaker, reporting with Amazon QuickSight, and archiving with Amazon S3 storage classes.

We appreciate you reading this blog post and encourage you to deploy this solution if you are trying to automate data transfers from S3 on Outposts to S3 in a cost-effective and efficient way. Leave your feedback and suggestions in the comments section.

Matt Price

Matt Price is an Enterprise Support Lead at AWS. He supports Worldwide Public Sector customers designing hybrid cloud architectures.  In his free time, Matt enjoys running and spending time with family.

Chris Gisseler

Chris Gisseler is an Enterprise Support Lead at AWS with 20+ years of experience working with infrastructure technologies.  Chris draws upon his experience to help customers build scalable & resilient infrastructures at scale.  In his spare time, Chris enjoys spending time with his family and giving back to his community.