AWS Storage Blog

Using AWS SFTP Logical Directories to Build a Simple Data Distribution Service

We launched the AWS Transfer for SFTP (AWS SFTP) service in November of 2018, and it has since been adopted by many organizations to enable secure SFTP access to data hosted in Amazon S3. At AWS, we are continuously iterating on our services, and many of our customers have told us that they would like the option to map multiple S3 buckets and keys to a unified, logical namespace for their SFTP-connected users. To address this need, we recently launched a new feature for AWS SFTP called ‘logical directories’. Using this feature, you can customize how S3 bucket paths are visible to your SFTP end users, enabling you to:

  • Easily limit access to files and folders in S3 buckets (e.g. subscription-based access)
  • Preserve file and folder paths referenced in existing applications and scripts
  • Distribute files to multiple consumers without creating copies
  • Prevent S3 bucket names from being visible to SFTP end users for compliance/regulatory purposes

In this post, we’re going to show you how to use logical directories to implement a simple data distribution service for sharing data with subscribers using AWS SFTP.

Building a simple data distribution service

Let’s say ‘Bob’ works as a cloud architect for a financial services organization called SmartTrade. Bob is building a data distribution service that will make financial data available to his subscribers to guide their investment strategies. Users will have access to different data sets, depending upon their subscription profiles, and Bob wants to make sure that users only have access to the data to which they are entitled.

Bob’s data repository is currently built on two S3 buckets named `public-research` and `subscriptions`. Every SmartTrade subscriber gets access to the data in the `public-research` bucket. However, the data in the `subscriptions` bucket is stored across many sub-folders, and access to it is limited based upon a user’s subscription.

Bob currently has two subscribers, Alice and Bryan, and he wants to make data available to them through the following folder structures (i.e. not the raw S3 bucket paths shown as s3://):

Alice

/
├── public
│   └── research      --> s3://public-research
│       └── global   
└── subscribed
    ├── 2018
    │   └── indices   --> s3://subscriptions/historical/2018/indices
    └── 2019
        ├── indices   --> s3://subscriptions/historical/2019/indices
        └── equities  --> s3://subscriptions/historical/2019/equities

Bryan

/
├── public
│   └── research      --> s3://public-research
│       └── global
└── subscribed
    ├── 2018
    │   ├── equities  --> s3://subscriptions/historical/2018/equities
    │   └── indices   --> s3://subscriptions/historical/2018/indices
    └── 2019
        ├── credit    --> s3://subscriptions/historical/2019/credit
        └── equities  --> s3://subscriptions/historical/2019/equities

Also, when a subscriber logs in using their SFTP client, Bob wants to make sure that their login directory displays only the folders they are entitled to access.

For the remainder of this post, we’ll walk you through an architecture that Bob can use to meet the requirements for his data repository. The architecture will utilize an AWS SFTP custom identity provider, along with logical directories to create the customized folder structures that Bob needs for his subscribers. The code snippets below are based on an example custom identity provider available on GitHub.

Architecture

There are four key components to our example architecture, as shown in the diagram below: the AWS Transfer for SFTP server, Amazon API Gateway, AWS Lambda, and two S3 buckets for the data repository.

[Architecture diagram: an AWS Transfer for SFTP server, Amazon API Gateway, AWS Lambda, and two S3 buckets forming the data repository]

When a user authenticates using their client, AWS SFTP invokes the configured API Gateway method to query the custom identity provider. This method uses an AWS Lambda function to authenticate and authorize the user. If authentication succeeds, the Lambda function returns a JSON object describing the user’s SFTP session. The JSON object includes an IAM role and an optional scope-down policy that governs the user’s access to one or more S3 buckets. It also describes the folder structure the user will see and how those folders map to S3 bucket paths. The JSON document is returned to the AWS SFTP server, which uses it to present the authorized S3 bucket(s) to the user for that session. From there, AWS SFTP handles standard SFTP commands from the user’s client, such as get, ls, and put.

Custom Identity Provider

User authentication is a key part of any SFTP solution. AWS SFTP offers a service-managed identity provider that makes it easy to add users and get up and running quickly. However, some customers prefer to use existing authentication systems, such as Amazon Cognito, Microsoft Active Directory, an LDAP service, or third-party providers such as Okta. For this purpose, AWS SFTP supports a custom identity provider.

As noted in the diagram above, to use the custom identity provider, you create an Amazon API Gateway method which invokes an AWS Lambda function. Each time a user authenticates using their SFTP client, the API method is invoked by the AWS SFTP server. In turn, the API Gateway method passes an event object with the username and password to the Lambda function. In the Lambda function below, Bob passes the username and password to the `authenticated` function:

// Look up the username in the user database and compare the
// supplied password against the stored one
function authenticated(username, password) {
    if (username in userDb) {
        var userRecord = userDb[username];

        // Use strict equality to avoid type coercion
        if (password === userRecord.password) {
            return true;
        }
    }

    return false;
}

This function simply verifies that the specified username is in Bob’s “user database” and then compares the password. If the password matches, then the user is authenticated, otherwise authentication fails. Before going into production, Bob would definitely want to replace this code with a call to an identity provider used in his organization.

Note that in addition to password authentication, Bob can also use key-based authentication, which would require him to return the public key(s) stored for the user as part of the response.
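As a sketch of what the key-based flow might look like: when the client supplies no password, the Lambda function returns the user’s stored SSH public keys in the `PublicKeys` field of the response, and AWS SFTP verifies the client’s private key against them. The `userDb` structure, role ARN, and key values below are illustrative placeholders, not part of the AWS example code.

```javascript
// Sketch: supporting both password and key-based authentication.
// The user records and ARNs here are illustrative placeholders.
var userDb = {
    alice: {
        password: "alice-secret",
        publicKeys: ["ssh-rsa AAAAB3NzaC1yc2E... alice@example.com"]
    }
};

function buildAuthResponse(username, password) {
    var userRecord = userDb[username];
    if (!userRecord) {
        return {}; // an empty response denies access
    }

    if (password) {
        // Password flow: verify the password as before
        if (password !== userRecord.password) {
            return {};
        }
        return { Role: "arn:aws:iam::123456789012:role/sftp-user-role" };
    }

    // Key flow: no password was supplied, so return the stored public
    // keys and let AWS SFTP verify the client's private key against them
    return {
        Role: "arn:aws:iam::123456789012:role/sftp-user-role",
        PublicKeys: userRecord.publicKeys
    };
}
```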

Once the user is successfully authenticated, the response is built up by specifying which S3 buckets the user can access, and the IAM roles that control access to the buckets.

Let’s now walk through how Bob can construct logical directories for his users.

Logical Directory Entries

Building the folder structure that Bob wants for his subscribers is easily done using logical directories. With logical directories, one or more S3 buckets can be presented as a single namespace to the user. To provide this data back to AWS SFTP, in addition to an AWS IAM Role, two additional fields have been added to the JSON document for the Lambda function’s response: HomeDirectoryType and HomeDirectoryDetails:

{ 
  "Role": "ARN of IAM role with configured S3 permissions", 
  "Policy": "JSON string of STS Assume role scope down policy", 
  "HomeDirectoryType": "LOGICAL",
  "HomeDirectoryDetails": "JSON string of Entry / Target pairs"
}

Each logical directory in HomeDirectoryDetails contains two fields: `Entry` and `Target`. The `Entry` is the name of the folder that the user will see, and the `Target` is an S3 folder path. The value for `Entry` must be an absolute Unix-style path with a leading `/` and no trailing slash. Similarly, the `Target` value must start with a leading forward slash, followed by the S3 bucket name and then the key prefix, with no trailing slash. Each `Entry`/`Target` pair is independent of the other logical directories, and each `Target` can point to a different S3 bucket if needed. For example, below is the logical directories entry in our user database for Alice:

"directoryMap": [
    {
        "Entry": "/public/research",
        "Target": "/"+ public_bucket
    },
    {
        "Entry": "/subscribed/2018/indices",
        "Target": "/"+ subscription_bucket + "/historical/2018/indices"
    },
    {
        "Entry": "/subscribed/2019/indices",
        "Target": "/"+ subscription_bucket + "/historical/2019/indices"
    },
    {
        "Entry": "/subscribed/2019/equities",
        "Target": "/"+ subscription_bucket + "/historical/2019/equities"
    }
]

As shown above, Bob has logical directories for two buckets: the public bucket and the subscription bucket. He would need to make sure that the IAM role grants access to both of these buckets. Check out the AWS SFTP docs for more information on creating IAM roles for users.
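The leading- and trailing-slash rules above are easy to get wrong, so a small validation helper (our own addition for illustration, not part of the AWS example code) can catch malformed mappings before they reach AWS SFTP:

```javascript
// Sketch: validate Entry/Target pairs against the logical directory
// rules: both must be absolute paths with a leading "/" and no
// trailing slash. Returns a list of human-readable errors.
function validateDirectoryMap(directoryMap) {
    var errors = [];
    directoryMap.forEach(function (mapping) {
        ["Entry", "Target"].forEach(function (field) {
            var value = mapping[field];
            if (typeof value !== "string" || value.charAt(0) !== "/") {
                errors.push(field + " must start with a leading '/': " + value);
            } else if (value.length > 1 && value.charAt(value.length - 1) === "/") {
                errors.push(field + " must not end with a trailing slash: " + value);
            }
        });
    });
    return errors;
}
```

Bob could call this from his Lambda function and fail closed (return an empty response) if any mapping is malformed, rather than presenting a broken namespace to the user.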

Note also that even though the IAM role specified in the response JSON may grant access to an entire S3 bucket, Bob’s end users can only access the folders specified in the logical directory mapping. Also, when using logical directories, the `HomeDirectory` field in the JSON document is no longer required; its value, which determines what the user sees as their login directory, is now implied by what is supplied in the `Entry` fields of the `HomeDirectoryDetails` list.
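For an idea of what a scope-down policy for this setup could look like, the sketch below builds the `Policy` string for a session limited to the public bucket and a user’s subscribed prefixes. The bucket names follow this post’s example, but the exact statement set is illustrative; consult the AWS SFTP documentation for the policy variables and actions your setup needs.

```javascript
// Sketch: build a scope-down policy string limiting this session to
// the public-research bucket and the user's subscribed prefixes in
// the subscriptions bucket. Statements shown are illustrative.
function buildScopeDownPolicy(subscribedPrefixes) {
    var policy = {
        Version: "2012-10-17",
        Statement: [
            {
                Sid: "AllowListing",
                Effect: "Allow",
                Action: ["s3:ListBucket"],
                Resource: [
                    "arn:aws:s3:::public-research",
                    "arn:aws:s3:::subscriptions"
                ]
            },
            {
                Sid: "AllowReads",
                Effect: "Allow",
                Action: ["s3:GetObject"],
                Resource: ["arn:aws:s3:::public-research/*"].concat(
                    subscribedPrefixes.map(function (prefix) {
                        return "arn:aws:s3:::subscriptions/" + prefix + "/*";
                    })
                )
            }
        ]
    };
    // The Policy field of the response expects a JSON string, not an object
    return JSON.stringify(policy);
}
```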

In our example code, the directory mappings are stored in the user database and can be returned directly in the `getDirectoryMapping` function:

function getDirectoryMapping(username) {
    var userRecord = userDb[username];
    return userRecord.directoryMap;
}

Bob’s updated Lambda Handler

All Lambda functions have an entry-point, called a “Handler,” which is where execution of the Lambda function begins. Once Bob’s Lambda handler has queried the identity provider and successfully authenticated his user, access is defined using the IAM Role and logical directories. Bob’s Lambda function returns the following response object to the AWS SFTP server:

response = {
    Role: userRoleArn,
    HomeDirectoryType: "LOGICAL",
    HomeDirectoryDetails: JSON.stringify(directoryMapping)
};

The response object will be returned to the AWS SFTP server and the directory mapping specified in `HomeDirectoryDetails` will be used to construct the folders shown to the authenticated user.

For example, if Alice were to login using her SFTP client and successfully authenticate, she would see the following folder structure:

[Screenshot: Alice’s folder structure as shown in her SFTP client after login and successful authentication]

Conclusion

In this post, we’ve shown you how to create a simple data distribution service using AWS SFTP logical directories, Amazon API Gateway, and an AWS Lambda function. Using an AWS SFTP custom identity provider and the new logical directories feature, you can create granular folder structures to control how users access data in S3 buckets, and provide them with an easy-to-browse experience using their SFTP clients.

To use this in your environment, start with the CloudFormation stack provided in the AWS SFTP documentation and modify the Lambda function using our example code at this GitHub repository. In addition, we recommend that you:

  • Update the authenticated function to use a secure authentication provider such as Amazon Cognito, Active Directory, or other identity provider
  • Replace our hard-coded user database with a more scalable entitlement system, perhaps powered by Amazon Relational Database Service (Amazon RDS) or an Amazon DynamoDB table
  • Explore scope-down policies as a way to further control user access or advertise datasets to your end users
Jason Barto

Jason Barto works as a solutions architect with AWS. Jason supports customers to accelerate and optimize their business by leveraging cloud services. Jason has 20 years of professional experience developing systems for use in secure, sensitive environments. He has led teams of developers and worked as a systems architect to develop petabyte-scale analytics platforms, real-time complex event processing systems, and cyber-defense monitoring systems. Today he is working with financial services customers to implement secure, resilient, and self-healing data and analytics systems using open-source technologies and AWS services.

Jeff Bartley

Jeff Bartley is a Hybrid Cloud Storage and Data Transfer Solutions Architect at AWS.