Managing access to your Amazon S3 objects with a custom authorizer

Data protection is critical for most customers seeking to safeguard information, maintain compliance, secure applications, and more. Protecting data can become challenging when different entities or personas need different levels of access to data. In Amazon S3, access control can be managed with tools like AWS Identity and Access Management (IAM) policies, bucket policies, access control lists (ACLs), and AWS Lake Formation. However, some customers need more granular access control methods that allow them to dynamically parse data, so that each data-requestor gets only the exact data that they need.

Many third-party solutions for granular data-access management require you to create multiple copies of data or host source data on other platforms for proper parsing before delivering results back to your users. These solutions can be costly, inefficient, and complicated, taking away from your ability to get the most out of your data and optimize its value. With S3 Object Lambda, you can add your own code to S3 GET requests to process data as it is returned to an application. For example, you can dynamically resize images, redact confidential data, authorize requests, and much more.

In this blog post, I demonstrate a solution that uses S3 Object Lambda to parse S3 GET requests and deliver the proper output to a data-requestor. Additionally, this output will include only the columns authorized for a specific user or role, optionally considering other limits, like time of day. Using this solution, you can granularly limit access to your data based on specific criteria, allowing you to protect your data through principles of least privilege. In addition, keeping a record of access permissions and policies for an audit trail can bolster security posture, compliance, and regulatory standards within your enterprise.

S3 Object Lambda and data protection

S3 Object Lambda uses AWS Lambda functions to give you the added control of augmenting, modifying, or removing information from an S3 object before sending it back to the requestor. You can also use the Lambda function to add granular custom authorization and, with structured data, take the schema into account to give users access to only certain parts of the data.

I use Amazon DynamoDB to store access rules and permissions. The source data for this solution is Apache Parquet files in Amazon S3. The response to each GET request is also a single Apache Parquet file; the function never returns multiple files. Note that the output file type can be changed as well. For example, you can provide an Excel or CSV file as the output without changing the source data format. The desired output format can be passed as a header on the GET request to the S3 Access Point.

This example’s AWS Lambda function is written in Python. I use the pandas library to query the source files and retrieve the necessary columns for each request. Pandas can be added as a layer in Lambda, or it can be part of the package you upload to Lambda, whether that is a zip file or a container image.

Custom authorizer implementation

When a GET request is issued, the event information sent to the Lambda function includes userIdentity information. This can be the account ID and Amazon Resource Name (ARN) of the user or role making the request:

'userIdentity': {
    'type': 'IAMUser',
    'principalId': '[PRINCIPAL ID]',
    'arn': 'arn:aws:iam::[ACCOUNT NUMBER]:user/[USERNAME]',
    'accountId': '[ACCOUNT NUMBER]',
    'accessKeyId': '[ACCESS KEY ID]'
}

This solution matches the ARN in userIdentity against the records in DynamoDB. Depending on your use case, you can use any of the information available in the event to decide what to respond with. For example, you could deny all requests from a certain account outside working hours.

The DynamoDB key is the full ARN to ensure that the correct users and roles get access to the content they need. When a role is assumed to get content, the ARN in userIdentity is a Security Token Service (STS) ARN with the role session name appended after the role name (a numerical value in the example below). The code accounts for this and strips that trailing segment, leaving only the role. Storing this shortened ARN in DynamoDB lets you restrict each role to certain columns.

userIdentity ARN when using a role: arn:aws:sts::[ACCOUNT NUMBER]:assumed-role/[ROLE NAME]/123984892384
What you store in Amazon DynamoDB: arn:aws:sts::[ACCOUNT NUMBER]:assumed-role/[ROLE NAME]
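
The shortening step can also be written so that it only touches assumed-role ARNs. The following is a minimal sketch, not the code from the full solution: the function name normalize_identity_arn and the account number are hypothetical. It assumes the STS format shown above, where one extra segment follows the role name, and it leaves every other ARN unchanged, including IAM user ARNs that contain a path.

```python
def normalize_identity_arn(arn: str) -> str:
    """Return the ARN to use as the DynamoDB partition key.

    Assumed-role ARNs lose their trailing session segment; any other
    ARN (for example an IAM user) is returned unchanged.
    """
    prefix, _, resource = arn.partition(":assumed-role/")
    if not resource:
        return arn  # not an assumed-role ARN, nothing to strip
    role_name = resource.split("/", 1)[0]
    return f"{prefix}:assumed-role/{role_name}"


# Hypothetical account number and role name, for illustration only
role_arn = "arn:aws:sts::111122223333:assumed-role/analyst/123984892384"
user_arn = "arn:aws:iam::111122223333:user/team/alice"
print(normalize_identity_arn(role_arn))
print(normalize_identity_arn(user_arn))
```

Splitting on every slash would also shorten an IAM user ARN that has a path, so restricting the change to assumed-role ARNs keeps user keys exact.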

Architecture: Managing access to your Amazon S3 objects with a custom authorizer

Every record in DynamoDB consists of the partition key (identity_arn) and the list of columns to which the respective user/role has access. Other elements, such as the times of day or days of the week when access is permitted, can be stored in DynamoDB as well.
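
As an illustration of such a record and a time-based check, here is a hedged sketch. Only identity_arn and columns come from this solution; the attribute names allowed_days and allowed_hours, the function is_access_allowed, and the sample values are assumptions made for this example. The item uses the typed format returned by the low-level DynamoDB client.

```python
from datetime import datetime

DAY_NAMES = ("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")

# Hypothetical record: "allowed_days" and "allowed_hours" are assumed names
example_item = {
    "identity_arn": {"S": "arn:aws:sts::111122223333:assumed-role/analyst"},
    "columns": {"SS": ["order_id", "order_date"]},
    "allowed_days": {"SS": ["Mon", "Tue", "Wed", "Thu", "Fri"]},
    "allowed_hours": {"M": {"start": {"N": "9"}, "end": {"N": "17"}}},
}


def is_access_allowed(item: dict, now: datetime) -> bool:
    """Return True if `now` falls inside the item's optional day/hour limits.

    A missing attribute means "no restriction" for that dimension.
    """
    days = item.get("allowed_days", {}).get("SS")
    if days and DAY_NAMES[now.weekday()] not in days:
        return False
    hours = item.get("allowed_hours", {}).get("M")
    if hours:
        start = int(hours["start"]["N"])
        end = int(hours["end"]["N"])
        # Half-open window: start hour included, end hour excluded
        if not (start <= now.hour < end):
            return False
    return True
```

The caller decides which time zone `now` is expressed in; the rules stored in the table need to use the same one.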

First, the code checks that the user/role from the userIdentity object exists in the user table. If no matching record is found, the function responds with a custom error message. For added security, it can return a generic 403 or 404 error so that the response does not reveal the actual reason for the denial.

import boto3

user_id = event["userIdentity"]
identity_arn = user_id['arn']
# Clean up the ARN for STS: keep only the part before the second slash,
# which drops the trailing segment of an assumed-role ARN
identity_arn = "/".join(identity_arn.split("/")[:2])

dynamodb_client = boto3.client('dynamodb', region_name="us-east-1")
ddb_response = dynamodb_client.get_item(
    TableName='user_access',
    Key={
        'identity_arn': {'S': identity_arn},
    }
)
if ('Item' in ddb_response): #valid user or role found
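
When the lookup returns no Item, the generic denial described earlier could be sent as follows. This is a sketch under stated assumptions: the helper name deny_request is hypothetical, and the client is passed in so the function can be exercised without AWS access. StatusCode, ErrorCode, and ErrorMessage are parameters of the boto3 write_get_object_response call; no Body is sent with an error response.

```python
def deny_request(s3_client, request_route: str, request_token: str) -> int:
    """Answer the pending GET with a generic 403 and return the status sent."""
    status_code = 403
    s3_client.write_get_object_response(
        RequestRoute=request_route,
        RequestToken=request_token,
        StatusCode=status_code,
        ErrorCode="AccessDenied",
        # Deliberately vague so the caller cannot tell why access failed
        ErrorMessage="Access denied.",
    )
    return status_code


# In the Lambda handler this might be called as (assumed wiring):
# deny_request(boto3.client("s3"),
#              event["getObjectContext"]["outputRoute"],
#              event["getObjectContext"]["outputToken"])
```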

If the user/role is found in the user table, the code passes the list of columns from the table to the pandas read_parquet function, together with the original URL of the object. That URL is supplied in the event as inputS3Url within getObjectContext. The call retrieves only the necessary columns from the source data and loads them into a pandas DataFrame.

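# Presigned URL of the source object, supplied by S3 Object Lambda
s3_url = event["getObjectContext"]["inputS3Url"]
if 'columns' in ddb_response['Item']:
    # Read only the columns this user/role is allowed to see
    all_data = pd.read_parquet(s3_url, columns=ddb_response['Item']['columns']['SS'])
else:
    all_data = pd.read_parquet(s3_url)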
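
The same branch can be folded into a small helper that builds the arguments for read_parquet from the event and the DynamoDB item. This is only a sketch: the helper name read_parquet_kwargs is an assumption, and it sorts the column names because a DynamoDB string set has no guaranteed order.

```python
def read_parquet_kwargs(event: dict, item: dict) -> dict:
    """Build keyword arguments for pandas.read_parquet.

    `event` is the S3 Object Lambda event and `item` is the typed
    DynamoDB item for the caller. The "columns" key is omitted when the
    item stores no column list, so the whole object is read.
    """
    kwargs = {"path": event["getObjectContext"]["inputS3Url"]}
    columns = item.get("columns", {}).get("SS")
    if columns:
        kwargs["columns"] = sorted(columns)
    return kwargs


# Assumed usage inside the handler:
# all_data = pd.read_parquet(**read_parquet_kwargs(event, ddb_response["Item"]))
```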

An S3 Object Lambda function must deliver its final output by calling the boto3 write_get_object_response function. The Body parameter of that call must be either bytes or a seekable file-like object.

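# Route and token that tie the response to the original GET request
request_route = event["getObjectContext"]["outputRoute"]
request_token = event["getObjectContext"]["outputToken"]

# compression=None writes the Parquet file uncompressed
all_data.to_parquet('/tmp/outputfile.parquet', compression=None)

s3 = boto3.client('s3')
with open('/tmp/outputfile.parquet', 'rb') as all_content:
    s3.write_get_object_response(Body=all_content.read(),
                                 RequestRoute=request_route,
                                 RequestToken=request_token)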

S3 Object Lambda requires you to download the object locally, even if you’re not going to modify it. This can present a problem with large objects that don’t fit into the memory of the Lambda function. To help you successfully pass data through, consider the following:

You can store the contents of the output in /tmp, as this function does, but /tmp is limited to 512 MB by default. Lambda ephemeral storage can be configured up to 10 GB.
You can mount an Amazon EFS file system to your Lambda function so you don’t have to store the full contents of the source file in memory or in /tmp.
If you’re not modifying the original object, you can use the AWS API to create a pre-signed URL for the object with a short expiration. The function then responds with a custom error code and the URL as the body of the response. Your application can look for that custom error code and download the file directly using the pre-signed URL.

Considerations and enhancements

Lambda limits: Lambda functions have runtime and memory limits so keep those in mind when you’re processing larger files. Even if not modifying the original file, S3 Object Lambda requires you to download the entire file to the Lambda function from S3 before sending it to the requestor.
File cache: This solution runs through the process of parsing and compiling data with every request. A caching layer can be added based on the columns users can see. If many users/roles can see the same columns, there is no need to reprocess the files multiple times. Use S3 to save a cached version of the processed content with a lifecycle policy so only the latest data is available as needed.
Request auditing: Outside of providing a request-specific output to every GET request, the same Lambda function can call another function. This can log requests in a separate data store or Amazon CloudWatch. Metrics and alarms can then be assigned to these logged requests and provide additional detail on usage patterns.
User federation: This solution uses DynamoDB as its authorizer. Using Amazon Cognito or a third-party federation service like Okta can be beneficial if your enterprise is already taking advantage of those services. Connecting to those services can be achieved through AWS Lambda as well.
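
For the file cache idea above, one possible way to let users/roles with identical column sets share a cached result is to derive the cache object key from the source key and the column list. This is a sketch, not part of the solution's code: the function name cache_key and the cache/ prefix are assumptions.

```python
import hashlib
import json


def cache_key(object_key: str, columns) -> str:
    """Return an S3 key for the cached, column-filtered copy of an object.

    The column list is de-duplicated and sorted first, so the order in
    which DynamoDB returns the set does not change the key.
    """
    canonical = json.dumps(sorted(set(columns)))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]
    return f"cache/{digest}/{object_key}"


key_a = cache_key("sales/2023.parquet", ["order_id", "order_date"])
key_b = cache_key("sales/2023.parquet", ["order_date", "order_id"])
print(key_a == key_b)  # same columns in a different order share one key
```

The function would check for this key before reading the source file, and a lifecycle policy on the cache/ prefix would expire stale copies.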

Cleaning up

As part of this solution, you uploaded Amazon S3 objects and set up DynamoDB tables, S3 Object Lambda Access Points, and Lambda functions. Remember to delete these resources when you no longer need them, because stored objects and tables can incur costs even when they are not used. In general, be mindful that there are costs associated with running these services. To be cost-efficient, review the pricing pages for each of them.

Conclusion

In this blog post, I showed you how to set up a custom authorizer using Amazon S3 Object Lambda Access Points, allowing you to limit data available to the requestor based on their profile stored in DynamoDB.

The full code for the Lambda function, with comments, is available for download here.

Using this solution, you can start limiting access to your structured data immediately. It saves you the time and money you would otherwise spend preparing separate copies of the same data for multiple customers. A serverless management screen for the DynamoDB dataset can be created to manage the users and roles that can access the data. For this use case, that approach is preferable to relying on IAM rules and policies alone, because it gives you more granularity.

Thanks for reading this blog post. If you have any comments or questions, feel free to leave them in the comments section.