AWS Partner Network (APN) Blog

Automating Lifecycle Rules for Multipart Uploads in Amazon S3

by Erin McGill | in Amazon S3, AWS Lambda, AWS Partner Solutions Architect (SA) Guest Post

Erin McGill co-authored this blog with Mike Ruiz. Both Erin and Mike are AWS Partner Solutions Architects. 

Amazon Simple Storage Service (Amazon S3) provides a secure, durable, highly scalable, and simple way to store your object data. It’s easy to create new buckets to contain data, and to store and retrieve any amount of data from those buckets at any time, from anywhere on the web, so the number of buckets and objects stored in an S3 account can quickly become substantial. To help you manage objects during their lifetime, AWS has developed lifecycle configurations for S3 buckets. If you are not familiar with S3 lifecycle rule creation, we recommend that you read Jeff Barr’s blog post on creating lifecycle rules in the console. Jeff’s post walks through applying rules for incomplete multipart uploads and expired object delete markers. In this post, we’ll go beyond manual operations and explore how to automate enforcement of default S3 lifecycle policies.

Automating S3 bucket policy management

Why automate policy management? An APN Partner recently asked us, “How can we apply S3 lifecycle rules across all our S3 buckets to meet IT-directed requirements?” In this example, the Partner’s IT department wanted the AbortIncompleteMultipartUpload rule applied to all of their S3 buckets, regardless of when each bucket was created. This APN Partner works with large objects (objects larger than 5 GB), and uploads hundreds of these objects every day across hundreds of S3 buckets by using the multipart upload API. They explained that manually operating their environment at that scale was time consuming and prone to error and waste, so we set to work identifying all the steps required to automate the process.


The maximum size for a single upload of an object in Amazon S3 is 5 GB. To store larger objects (up to 5 TB), you use the multipart upload API to split objects into smaller parts and upload them independently or in parallel. When using the multipart upload API, the application interacting with S3 starts by breaking the large object into smaller parts. Next, it initiates a multipart upload to a specific bucket, provides a name for the final object, uploads all the parts, and then signals completion by sending S3 a complete multipart upload request. Upon receiving that request, S3 assembles the parts into the final object in the bucket. S3 stores the parts until it receives the complete request or until the upload is aborted, and only then frees the parts storage. If S3 never receives the complete request, the object parts remain in the bucket; they are not visible in the console, but they still incur storage costs. To see any stored object parts, you can use the list-parts S3 API call. To delete object parts left over from an incomplete multipart upload after a specific number of days, you can set a lifecycle rule with the AbortIncompleteMultipartUpload action.
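The flow above can be sketched with Boto3. This is an illustrative sketch, not the Partner's actual code: the byte-range helper, the 100 MB part size, and the file, bucket, and key names are all our own choices for the example.

```python
import os

def split_ranges(object_size, part_size):
    """Return (start, end) inclusive byte ranges covering the object.

    Note: S3 requires every part except the last to be at least 5 MB.
    """
    return [(start, min(start + part_size, object_size) - 1)
            for start in range(0, object_size, part_size)]

def multipart_upload(filename, bucket, key, part_size=100 * 1024 * 1024):
    """Upload a large file using the multipart upload API (sketch)."""
    import boto3  # imported here so split_ranges stays dependency-free
    s3 = boto3.client("s3")
    upload_id = s3.create_multipart_upload(Bucket=bucket, Key=key)["UploadId"]
    completed = []
    with open(filename, "rb") as f:
        for number, (start, end) in enumerate(
                split_ranges(os.path.getsize(filename), part_size), start=1):
            f.seek(start)
            part = s3.upload_part(Bucket=bucket, Key=key, UploadId=upload_id,
                                  PartNumber=number,
                                  Body=f.read(end - start + 1))
            completed.append({"PartNumber": number, "ETag": part["ETag"]})
    # The complete call is the request that tells S3 to assemble the parts;
    # without it, the parts linger invisibly and keep incurring cost.
    s3.complete_multipart_upload(Bucket=bucket, Key=key, UploadId=upload_id,
                                 MultipartUpload={"Parts": completed})
```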


Working with existing policies

It’s important to understand whether one or more policies are already applied to the whole bucket or to a subset of objects within it. The subset is defined by a policy prefix, which designates an S3 bucket directory location or the beginning of an object name or key. For example, if the prefix is xyz, the rule applies only to objects whose key name begins with the string xyz. If no prefix is stated, the policy applies to the whole bucket.

When designating a policy that applies to a whole S3 bucket, you do not provide a Prefix. If you attempt to apply an action in a whole bucket policy when the same action is already applied to a prefix, you will get an error. For example, if an existing policy applies the AbortIncompleteMultipartUpload action to the prefix xyz, and you use the same action in a whole bucket policy, the request will fail:

[Image 1: error returned when a whole bucket lifecycle rule conflicts with an existing prefix rule]

This limitation is the same whether you apply a lifecycle rule through the command line or programmatically. Keep it in mind if you want to programmatically apply your default bucket lifecycle policy, because this error will prevent the rule from being applied to the bucket.

Using AWS Lambda to automate lifecycle policies

Automating enforcement of a default bucket policy benefits many APN Partners and customers, because manually applying a policy to every S3 bucket is a daunting task and prone to error. While there are many ways to automate this enforcement, we’re going to walk through creating a scheduled AWS Lambda function. For those who aren’t familiar with it, AWS Lambda is a compute service that lets you run code (packaged as Lambda functions) without provisioning or managing servers. These functions execute in response to a wide variety of AWS service events and custom events, and can be used in scenarios ranging from big data processing to automation tasks like this one. For the Lambda function in this scenario, you can use Boto3, the AWS SDK for Python, in your AWS account. Boto3 provides an easy-to-use, object-oriented API, as well as low-level direct service access, which makes it simple to interact with the S3 API.

The Lambda function reads a default configuration file in a designated policy bucket, then uses the S3 API to check the lifecycle policies applied to all buckets and verify that the desired action is present in each policy. If the configuration is missing, the function applies it; if there is a variance, it logs an error. The Lambda function can be scheduled to run at a set interval, so that all new buckets created in your account get the default lifecycle policy even if the creator neglects to apply it when creating the bucket.
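A minimal sketch of such a handler might look like the following. The policy bucket name, the config key, and the config file format ({"Rule": {...}}) are all assumptions made for illustration; substitute your own conventions.

```python
import json

def parse_default_rule(config_text):
    """Extract the default lifecycle rule from the JSON config file.

    Assumes (for this sketch) the config looks like {"Rule": {...rule...}}.
    """
    return json.loads(config_text)["Rule"]

def lambda_handler(event, context):
    import boto3  # imported inside so parse_default_rule stays dependency-free
    s3 = boto3.client("s3")
    # "policy-bucket" and the key name are placeholders for your own locations.
    config = s3.get_object(Bucket="policy-bucket", Key="default-lifecycle.json")
    desired_rule = parse_default_rule(config["Body"].read())
    for bucket in s3.list_buckets()["Buckets"]:
        # Here you would fetch the bucket's lifecycle configuration, verify
        # that desired_rule is applied, and apply or log as appropriate.
        print("checking lifecycle configuration for", bucket["Name"])
```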

Applying a lifecycle policy through the S3 API replaces any lifecycle configuration already applied to the S3 bucket. The documentation for put_bucket_lifecycle_configuration states, “If a lifecycle configuration exists, it replaces it.” Whatever configuration you put via the SDK overwrites the existing lifecycle configuration. Because of this limitation, we need to check every bucket to see if it already has a policy applied. If it does, that policy needs to be taken into consideration when applying the default bucket policy.

You also need to consider the location of the S3 bucket when evaluating applied lifecycle policies. You might receive the following error message if you don’t provide the location of the S3 bucket:

botocore.exceptions.ClientError: An error occurred (PermanentRedirect) when 
calling the GetBucketLifecycleConfiguration operation: The bucket you are 
attempting to access must be addressed using the specified endpoint. Please 
send all future requests to this endpoint. 

To resolve this issue, you can make two connections to the S3 client API: one to get a list of all the buckets in the account, and another for all subsequent requests, created with the bucket’s LocationConstraint as the region argument. That way, you always send requests to the specified endpoint, as the API requires. Note, however, that for a bucket in the US Standard region (us-east-1), the location constraint returned is None, so we need to check the returned value and substitute us-east-1 whenever we get None.
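A sketch of that two-client approach, with the None-to-us-east-1 normalization factored into a small helper:

```python
def effective_region(location_constraint):
    """Map a GetBucketLocation LocationConstraint to a usable region name.

    Buckets in the US Standard region return None, which we
    normalize to us-east-1.
    """
    return location_constraint or "us-east-1"

def clients_by_bucket():
    """Yield (bucket_name, regional_client) pairs for every bucket.

    One client lists the buckets; a per-region client handles all
    subsequent requests so they go to the correct endpoint.
    """
    import boto3  # imported here so effective_region stays dependency-free
    listing = boto3.client("s3")
    for bucket in listing.list_buckets()["Buckets"]:
        name = bucket["Name"]
        loc = listing.get_bucket_location(Bucket=name)["LocationConstraint"]
        yield name, boto3.client("s3", region_name=effective_region(loc))
```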

In our example, we are using this desired rule:

    { u'AbortIncompleteMultipartUpload': { u'DaysAfterInitiation': 7},
      u'ID': 'AbortIncompleteMultipartUploadAfter7Days',
      u'Prefix': '',
      u'Status': 'Enabled'}

Automation considerations

In the course of our testing, we found that there are six possible states that we need to consider for an S3 bucket when comparing a desired policy to policies already applied to the bucket:

  1. There are no policies applied to the S3 bucket.
  2. There are some policies that are applied, but they apply only to prefixes within the bucket, and do not match the action from the desired bucket policy within those policies.
  3. There are some policies that are applied, but they apply only to prefixes within the bucket, and one of the actions matches the desired action.
  4. There is a whole bucket policy (where Prefix = ''), and the desired action is not applied.
  5. There is a whole bucket policy, the desired action is applied, and its value is equal to the desired action value.
  6. There is a whole bucket policy, the desired action is applied, and its value is not equal to the desired action value.

If you want to automatically apply the desired action by default to all S3 buckets, then you need to consider all six of these possibilities.

Here is how each state looks in practice (the flags are: policies applied, same prefix, same action, same action value):

State 1 (Policies: No): N/A; nothing is applied to the bucket.

State 2 (Policies: Yes, Same Prefix: No, Same Action: No, Same Action Value: No):

    { …[snip]…
       u'Rules': [ { u'Expiration': { u'Days': 395},
                     u'ID': 'MGFmMTI5MmUtYWNlNi',
                     u'Prefix': 'abc/',
                     u'Status': 'Enabled',
                     u'Transitions': [ { u'Days': 30,
                                         u'StorageClass': 'STANDARD_IA'}]}]}

State 3 (Policies: Yes, Same Prefix: No, Same Action: Yes, Same Action Value: Yes):

    { …[snip]…
       u'Rules': [ { u'AbortIncompleteMultipartUpload': { u'DaysAfterInitiation': 7},
                     u'ID': 'abc123',
                     u'Prefix': 'Xguy/',
                     u'Status': 'Enabled'}]}

State 4 (Policies: Yes, Same Prefix: Yes, Same Action: No, Same Action Value: No):

    { …[snip]…
       u'Rules': [ { u'Expiration': { u'Days': 395},
                     u'ID': 'xxx234',
                     u'Prefix': '',
                     u'Status': 'Enabled',
                     u'Transitions': [ { u'Days': 30,
                                         u'StorageClass': 'STANDARD_IA'}]}]}

State 5 (Policies: Yes, Same Prefix: Yes, Same Action: Yes, Same Action Value: Yes):

    { …[snip]…
       u'Rules': [ { u'AbortIncompleteMultipartUpload': { u'DaysAfterInitiation': 7},
                     u'Expiration': { u'Days': 365},
                     u'ID': 'zzz111',
                     u'Prefix': '',
                     u'Status': 'Enabled'}]}

State 6 (Policies: Yes, Same Prefix: Yes, Same Action: Yes, Same Action Value: No):

    { …[snip]…
       u'Rules': [ { u'AbortIncompleteMultipartUpload': { u'DaysAfterInitiation': 10},
                     u'ID': 'abcde12345',
                     u'Prefix': '',
                     u'Status': 'Enabled'}]}

Listing account S3 buckets

First, request a listing of the S3 buckets in the account. Then, for each bucket in the returned list, request its lifecycle policy. The following is an example of a response from a get_bucket_lifecycle_configuration request (reformatted for readability):

{ 'ResponseMetadata': { 'HTTPHeaders': { 'date': 'Wed, 20 Jul 2016 19:43:05 GMT',
                                         'server': 'AmazonS3',
                                         'transfer-encoding': 'chunked',
                                         'x-amz-id-2': 'XYZ',
                                         'x-amz-request-id': 'XXXX'},
                        'HTTPStatusCode': 200,
                        'HostId': 'xxxxxx',
                        'RequestId': 'XXXX'},
  u'Rules': [ { u'AbortIncompleteMultipartUpload': { u'DaysAfterInitiation': 7},
                u'ID': 'AbortIncompleteMultipartUploadAfter7Days',
                u'Prefix': '',
                u'Status': 'Enabled'}]}

The request returns a dictionary. The key in the dictionary of interest to us is the Rules key—more importantly, the list within the Rules key:

[ { u'AbortIncompleteMultipartUpload': { u'DaysAfterInitiation': 7},
    u'ID': 'AbortIncompleteMultipartUploadAfter7Days',
    u'Prefix': '',
    u'Status': 'Enabled'}]

In this case, the list contains one dictionary entry with four keys: AbortIncompleteMultipartUpload, ID, Prefix, and Status.

Since each bucket lifecycle policy can have multiple rules, you must check each rule and compare it to the desired rule. In this case, you can ignore the ID of the rule, as it is optional and has no effect on the rule itself. You can assume that the Status for all applied lifecycle rules is Enabled.
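That comparison, ignoring the optional ID key, can be written as a small helper:

```python
def rules_match(applied_rule, desired_rule):
    """Compare two lifecycle rules, ignoring the optional ID field,
    which has no effect on the rule's behavior."""
    strip_id = lambda rule: {k: v for k, v in rule.items() if k != "ID"}
    return strip_id(applied_rule) == strip_id(desired_rule)
```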

Verifying existing lifecycle policies

If policies exist in the bucket, you can verify that the desired policy is applied, check for potential conflicts, and identify the bucket as a candidate for policy update.


Checking lifecycle policies on an S3 bucket that does not have any policies raises a ClientError:

botocore.exceptions.ClientError: An error occurred (NoSuchLifecycleConfiguration) when 
calling the GetBucketLifecycleConfiguration operation: The lifecycle configuration 
does not exist

To get around this error, catch the ClientError exception and keep track of buckets that have no lifecycle policies applied.
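A sketch of that exception handling follows; the ImportError fallback exists only so the snippet stays self-contained outside an AWS environment, and the stand-in class mirrors botocore's ClientError shape.

```python
try:
    from botocore.exceptions import ClientError
except ImportError:  # minimal stand-in when botocore is unavailable
    class ClientError(Exception):
        def __init__(self, error_response, operation_name):
            super().__init__(operation_name)
            self.response = error_response

def get_lifecycle_rules(s3, bucket_name):
    """Return a bucket's lifecycle rules, or [] if none are configured."""
    try:
        return s3.get_bucket_lifecycle_configuration(Bucket=bucket_name)["Rules"]
    except ClientError as err:
        if err.response["Error"]["Code"] == "NoSuchLifecycleConfiguration":
            return []  # remember: this bucket needs the default policy
        raise  # any other error (e.g. PermanentRedirect) is a real problem
```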


For buckets that have lifecycle policies, we do the following:

  1. Check if the bucket has the desired action. In this example, the action is AbortIncompleteMultipartUpload.
  2. If the bucket has the desired action, then check the value. If the value doesn’t match, log a message of the mismatch.
  3. If the value does match, check the policy Prefix value.
    • If Prefix = '' for the policy, then the default policy has been applied to the whole bucket. This is the desired state.
    • If Prefix has a value, then log a message because there is a conflicting lifecycle policy applied to a subset of objects within the bucket.
  4. If the bucket has policies that do not match the desired default action, then check the Prefix value for each policy. This value determines how to generate the new lifecycle rule list that brings the bucket to the desired state.
    • If there is a prefix value, then append the desired action to the Rules dictionary.
    • If Prefix = '', then add the action to the existing whole bucket policy.
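The steps above can be condensed into a pure classification function. The returned label strings are our own shorthand for the action to take; they are not part of any AWS API.

```python
ACTION = "AbortIncompleteMultipartUpload"

def classify_bucket(rules, desired_rule):
    """Classify a bucket's lifecycle rules against the desired default rule.

    Returns one of: 'ok', 'value-mismatch', 'prefix-conflict',
    'append-rule', 'merge-into-whole-bucket-rule'.
    """
    for rule in rules:
        if ACTION in rule:
            if rule[ACTION] != desired_rule[ACTION]:
                return "value-mismatch"            # log for manual review
            if rule.get("Prefix", "") == "":
                return "ok"                        # desired state already applied
            return "prefix-conflict"               # action applied to a prefix only
    # No rule carries the desired action; decide how to add it.
    for rule in rules:
        if rule.get("Prefix", "") == "":
            return "merge-into-whole-bucket-rule"  # add action to existing rule
    return "append-rule"                           # append a new whole bucket rule
```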

Programmatically applying the default bucket policy using Python

Now, you want to apply the default whole bucket policy to those buckets that do not have overlapping or conflicting policies.

Because putting a lifecycle configuration via the SDK overwrites existing lifecycle configurations, simply using the put_bucket_lifecycle_configuration method without any logic will remove any previously applied bucket lifecycle configuration.

If there is a bucket policy that already exists for a prefix that also contains the desired action or has the wrong value for the desired action, log a message to request a manual review of these bucket lifecycle configurations. There can be a valid reason why the bucket policies are different from the desired default, so this will give you a chance to highlight the differences and easily review them.

If the bucket already has a lifecycle policy applied to a subset of objects and the action doesn’t overlap, you can use the Python append method on the Rules list. For example, if the bucket_policy JSON contains the following list as the value of the Rules key:

{ 'Rules': 
	[{
	   u'Status': 'Enabled', 
	   u'Prefix': 'IA-destined/', 
	   u'Transitions': [{
		u'Days': 30, 
		u'StorageClass': 'STANDARD_IA'}], 
	   u'ID': 'transition-30day'
	 }]
}

and the desired_whole_bucket_policy dictionary is:


{
  u'Status': 'Enabled', 
  u'Prefix': '', 
  u'AbortIncompleteMultipartUpload': 
    {
      u'DaysAfterInitiation': 7
    }
}

you run bucket_policy['Rules'].append(desired_whole_bucket_policy) and apply a newly constructed policy to the bucket:

[
  {
    u'Status': 'Enabled', 
    u'Prefix': 'IA-destined/', 
    u'Transitions': 
      [{
        u'Days': 30, 
        u'StorageClass': 'STANDARD_IA'
      }], 
    u'ID': 'transition-30day'
  }, 
  {
    u'Status': 'Enabled', 
    u'Prefix': '', 
    u'AbortIncompleteMultipartUpload': 
      {
        u'DaysAfterInitiation': 7
      }
  }
]

To apply the default action to an existing lifecycle policy on the whole bucket, compare the keys of the applied whole bucket lifecycle rule to the keys in the desired default policy. When you find a key that is in the default rule but missing from the applied rule, add it to the applied bucket rule dictionary.

For example, if the following policy is applied to the bucket:

{ 
  u'Expiration': 
    { 
      u'Days': 365
    },
  u'ID': 'WholeBucketPolicy',
  u'Prefix': '',
  u'Status': 'Enabled'
}

Check whether each key in the desired default policy exists in the currently applied policy. If a key is missing, add the key and its value to the dictionary: whole_bucket_lifecycle_rule[action] = desired_policy[action].

The newly constructed whole bucket policy is:

{ 
  u'Expiration': 
    { 
      u'Days': 365
    },
  u'ID': 'WholeBucketPolicy',
  u'AbortIncompleteMultipartUpload': 
  { 
      u'DaysAfterInitiation': 7
  },
  u'Prefix': '',
  u'Status': 'Enabled'
}
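The key-comparison merge described above can be sketched as a small function that copies any missing keys from the desired rule into a copy of the applied rule:

```python
def merge_default_action(whole_bucket_rule, desired_rule):
    """Return a copy of the applied whole bucket rule, with any key
    present in the desired default rule but missing from the applied
    rule copied over (Prefix and Status are typically already present)."""
    merged = dict(whole_bucket_rule)  # copy; leave the applied rule untouched
    for action, value in desired_rule.items():
        if action not in merged:
            merged[action] = value
    return merged
```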

Output/Logging

Your scheduled Lambda function updates buckets with the desired default policy where it is straightforward to do so, and logs the buckets it could not update because an action had an incorrect value or was already applied to a subset of objects.

Conclusion

In this post, we’ve walked through various S3 lifecycle policies, how to deal with pre-existing policies on buckets across various regions in AWS, and how to automate the application of lifecycle policies across hundreds of buckets, specifically focusing on the AbortIncompleteMultipartUpload action. We hope you’re able to use this information to optimize the management of multipart uploads as you scale.

For more information, we encourage you to visit the S3 documentation page.