Creating access control mechanisms for highly distributed datasets
Security is priority number one at Amazon Web Services (AWS). Data stored in Amazon Simple Storage Service (Amazon S3) is private by default. However, some datasets are made to be shared. Research organizations such as the Chan Zuckerberg Biohub (CZB) and the Allen Institute have missions to produce high-quality, open-access datasets for the research community. Other entities are required by law to openly share data for a period of time.
On the AWS Open Data Team, we work with data providers who distribute datasets that follow a “one-to-many,” or even “one-to-all,” distribution pattern. The difference between these two is nuanced, but some of the datasets we help our data providers distribute are open in the sense that the data provider doesn’t even know who is accessing the data. Either way, our data providers make their data available to hundreds, thousands, and sometimes millions of data users.
Within the life sciences, scientists often work with datasets that contain human clinical and genomic data. To protect the privacy of the individual, human clinical and genomic data come with their own set of data use regulations. For example, the Health Insurance Portability and Accountability Act (HIPAA), the National Institutes of Health (NIH) Genomic Data Sharing Policy, and the General Data Protection Regulation (GDPR) all require that you verify a prospective data user’s data use objective prior to permitting access. Some also require that data providers maintain a current list of approved data users in the event that data must be redacted or otherwise recalled. Some datasets have thousands of users, and data providers need to provide secure and seamless access.
In this blog post, we cover the no-cost mechanisms data providers can utilize to create access control policies for their highly distributed open datasets in AWS.
Access control mechanisms for open datasets in Amazon S3
Amazon S3 is the most common storage service used by our open data providers. Along with many other AWS services, Amazon S3 is HIPAA-eligible; check out a comprehensive list of AWS services in scope by compliance program. In this blog post, we focus on access control mechanisms using no-cost* features within Amazon S3 and AWS Identity and Access Management (IAM) that can help data providers securely and seamlessly provide access to approved data users. These are ordered along two factors: first, ease of implementation (how quickly can I stand this mechanism up?) and second, scalability (how simply can I use this mechanism to share my data with thousands or millions of approved users?).
The following are just the mechanisms that we’ve seen in action with our data providers so far. There are many other mechanisms within AWS that can provide you with scalable and secure access controls that are not documented here.
*Note that while the features that we discuss are available features of Amazon S3 and IAM for no additional charge, requesting, transferring, and downloading data from Amazon S3 does incur charges.
Create Amazon S3 Bucket Policies
You can add a statement to your Amazon S3 bucket policy that defines which AWS account IDs can access your Amazon S3 bucket. This takes just a few steps via the Amazon S3 console or with the AWS Command Line Interface (CLI). Use the Bucket Policy Generator for help writing your bucket policy.
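As a minimal sketch, such a policy statement might grant read-only access to a couple of approved accounts (the bucket name and account IDs here are hypothetical placeholders):

```python
import json

# Hypothetical bucket and approved AWS account IDs -- substitute your own.
BUCKET = "example-open-dataset"
APPROVED_ACCOUNTS = ["111122223333", "444455556666"]

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowApprovedAccountsReadOnly",
            "Effect": "Allow",
            # Grant access to the root principal of each approved account.
            "Principal": {
                "AWS": [f"arn:aws:iam::{a}:root" for a in APPROVED_ACCOUNTS]
            },
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                f"arn:aws:s3:::{BUCKET}",    # bucket ARN, needed for ListBucket
                f"arn:aws:s3:::{BUCKET}/*",  # object ARNs, needed for GetObject
            ],
        }
    ],
}

print(json.dumps(policy, indent=2))
```

You could then attach the generated document with something like `aws s3api put-bucket-policy --bucket example-open-dataset --policy file://policy.json`.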
Bucket policies are restricted to 20 KB in size. This limits the number of unique principals you can name in a bucket policy to, at most, a few thousand users, depending on how detailed your read/list/get permissions are. As with IAM roles, if you want to revoke access for any given AWS user, you have to manually alter your bucket policy to do so. Learn more about bucket policies.
Create AWS Identity and Access Management (IAM) roles
An IAM role is an identity you can create that has specific permissions, with credentials that are valid for short durations. With IAM roles, you can permit an AWS principal, or principals, to access a specific Amazon S3 bucket or even a specific prefix within an Amazon S3 bucket. You can also allow users to assume an IAM role with Security Assertion Markup Language (SAML) or web identity authentication. Depending on how your data user assumes a role, role credentials can last from 15 minutes to 12 hours. If you want to revoke a user’s access permission, manually remove them from the IAM role.
Roles can be created through the IAM console and programmatically with the CLI. To create a role for an AWS account or user, you apply a policy that defines what permissions the role has. AWS offers a number of AWS-managed policies designed for common use cases (for example, PowerUserAccess, DataScientist, DatabaseAdministrator) that you can use as-is or adapt into your own custom role policy. AWS-managed policies can be found in the IAM console in the “Policies” section.
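A data-access role needs two documents: a trust policy saying who may assume it, and a permissions policy saying what the role may do. A hedged sketch, with hypothetical bucket, prefix, and account ID:

```python
import json

DATA_BUCKET = "example-open-dataset"  # hypothetical bucket name
APPROVED_ACCOUNT = "111122223333"     # hypothetical approved user's account

# Trust policy: who is allowed to assume the role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"AWS": f"arn:aws:iam::{APPROVED_ACCOUNT}:root"},
        "Action": "sts:AssumeRole",
    }],
}

# Permissions policy: what the role can do once assumed --
# here, read-only access limited to a single prefix.
permissions_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "s3:ListBucket",
            "Resource": f"arn:aws:s3:::{DATA_BUCKET}",
            # Only allow listing within the approved prefix.
            "Condition": {"StringLike": {"s3:prefix": "study-1/*"}},
        },
        {
            "Effect": "Allow",
            "Action": "s3:GetObject",
            "Resource": f"arn:aws:s3:::{DATA_BUCKET}/study-1/*",
        },
    ],
}

print(json.dumps(trust_policy, indent=2))
print(json.dumps(permissions_policy, indent=2))
```

You would typically create the role with `aws iam create-role --assume-role-policy-document file://trust.json` and attach the permissions with `aws iam put-role-policy`.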
Role sessions are temporary. An authenticated user’s session lasts at most 12 hours, after which they need to reauthenticate. Plan for this in your architecture: for example, if a user wants to run a pipeline against your data that takes a few days to complete, make sure you’ve provisioned for that user so their pipeline is uninterrupted. There are limits to the number of roles you can create within an AWS account, as well as to role policy sizes (see specific limits), but roles might still accommodate more approved data users than a bucket policy.
Use pre-signed URLs from Amazon S3 or an Amazon CloudFront distribution
You can use pre-signed Amazon S3 or Amazon CloudFront URLs that point to specific objects within the Amazon S3 bucket. The data user then accesses the data by downloading it with the pre-signed URL. Pre-signed URLs do not require users to have an AWS account, so if your user base prefers to download data and compute on it locally, this is a good option for them. Plus, pre-signed URLs are highly scalable: since you’re just sending a link that expires at some point, you don’t run into principal limits. You can set the expiration of a pre-signed URL anywhere from a few seconds up to a maximum of seven days (when signed with Signature Version 4).
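In practice you would generate these with an AWS SDK (for example, boto3’s `generate_presigned_url`) or with `aws s3 presign`. The stdlib-only sketch below shows, roughly, what the SDK does under the hood with Signature Version 4 query-string signing; the bucket, key, and credentials are hypothetical, and the key is assumed to be URL-safe:

```python
import hashlib
import hmac
import urllib.parse
from datetime import datetime, timezone

def presign_get(bucket, key, access_key, secret_key,
                region="us-east-1", expires=3600):
    """Build a SigV4 query-string pre-signed GET URL for an S3 object.

    Illustrative sketch only -- use the AWS SDK or CLI for real workloads.
    """
    host = f"{bucket}.s3.{region}.amazonaws.com"
    now = datetime.now(timezone.utc)
    amz_date = now.strftime("%Y%m%dT%H%M%SZ")
    datestamp = now.strftime("%Y%m%d")
    scope = f"{datestamp}/{region}/s3/aws4_request"

    # Query parameters that carry the signing metadata (sorted order matters).
    params = {
        "X-Amz-Algorithm": "AWS4-HMAC-SHA256",
        "X-Amz-Credential": f"{access_key}/{scope}",
        "X-Amz-Date": amz_date,
        "X-Amz-Expires": str(expires),
        "X-Amz-SignedHeaders": "host",
    }
    query = "&".join(
        f"{urllib.parse.quote(k, safe='')}={urllib.parse.quote(v, safe='')}"
        for k, v in sorted(params.items())
    )

    # Canonical request -> string to sign -> signature, per SigV4.
    canonical_request = "\n".join(
        ["GET", f"/{key}", query, f"host:{host}", "", "host",
         "UNSIGNED-PAYLOAD"]
    )
    string_to_sign = "\n".join(
        ["AWS4-HMAC-SHA256", amz_date, scope,
         hashlib.sha256(canonical_request.encode()).hexdigest()]
    )

    def _hmac(k, msg):
        return hmac.new(k, msg.encode(), hashlib.sha256).digest()

    signing_key = _hmac(_hmac(_hmac(_hmac(
        ("AWS4" + secret_key).encode(), datestamp), region), "s3"),
        "aws4_request")
    signature = hmac.new(signing_key, string_to_sign.encode(),
                         hashlib.sha256).hexdigest()
    return f"https://{host}/{key}?{query}&X-Amz-Signature={signature}"

url = presign_get("example-open-dataset", "data/file.csv",
                  "AKIDEXAMPLE", "not-a-real-secret")
print(url)
```

Anyone holding the resulting URL can issue a plain HTTPS GET until `X-Amz-Expires` elapses, which is exactly what makes pre-signed URLs both convenient and worth handling carefully.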
If you have a large user base that prefers to download your data to their local machines, you are responsible for paying egress fees. Because pre-signed URLs allow for anonymous access, you cannot use pre-signed URLs in conjunction with Requester Pays. Anyone with access to a valid pre-signed URL can, in theory, use the pre-signed URL—for example, if an approved user emailed a pre-signed URL to a colleague, the colleague could access the data via pre-signed URL, provided the pre-signed URL is still valid. You can limit pre-signed URL capabilities by limiting access to specific network paths, but this might not work well in a one-to-many distribution pattern where you may not know your user’s IP address.
In addition, pre-signed URLs only operate at the object level, meaning that if you want to grant access to many objects, you must generate a pre-signed URL for each one.
Finally, not all AWS services can use a pre-signed URL as a data source. For example, if you were given access to a CSV file via pre-signed URL that you wanted to query using Amazon Athena, you would need to download the file and stage it in an Amazon S3 bucket of your own before Athena could query it. Learn more about pre-signed URLs.
Use the AWS Security Token Service (AWS STS)
Authenticated IAM users or IAM roles can request temporary security credentials that give them access to your Amazon S3 bucket. Temporary security credentials have a limited lifetime, so you do not have to rotate them or explicitly revoke them when they’re no longer needed; once they expire, they cannot be reused, so no cleanup is required. You can specify how long the credentials are valid, up to a maximum limit. When (or even before) temporary security credentials expire, an approved user can request new ones, provided they still have permission to do so, making this fairly self-service. Temporary credentials can also be vended after a user authenticates with an external web identity provider (such as Google or Facebook), making AWS STS a good option for data providers whose users may not have AWS accounts.
Keep in mind that the longest STS credentials can stay valid is 36 hours (via GetSessionToken), and credentials obtained by assuming a role are capped at 12 hours. For some use cases, this may not be enough time for a researcher who is looking to access the data. Learn more about temporary security tokens.
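One way to make expiry seamless for longer-running jobs is to wrap credential fetching in a small cache that refreshes shortly before expiry. The sketch below is illustrative: `fetch` is a hypothetical caller-supplied hook standing in for whatever STS call (for example, AssumeRole via an SDK) your users make, returning a dict with at least an `"Expiration"` datetime:

```python
from datetime import datetime, timedelta, timezone

class RefreshingCredentials:
    """Cache temporary credentials, refreshing them before they expire.

    `fetch` is a caller-supplied function (hypothetical hook) returning a
    credentials dict with an "Expiration" datetime, like the Credentials
    structure an STS AssumeRole response contains.
    """

    def __init__(self, fetch, margin_seconds=300):
        self._fetch = fetch
        self._margin = timedelta(seconds=margin_seconds)
        self._creds = None

    def get(self):
        now = datetime.now(timezone.utc)
        if (self._creds is None
                or self._creds["Expiration"] - now <= self._margin):
            # Expired, or within the safety margin: request fresh credentials.
            self._creds = self._fetch()
        return self._creds
```

A pipeline then calls `get()` before each batch of requests and always holds valid credentials, without the user reauthenticating by hand.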
Create Amazon S3 Access Points
This feature of Amazon S3 allows customers to create access points, each with its own access control policy. With up to 1,000 Amazon S3 Access Points per Region, and no limit on the number of principals who can use a single access point, Amazon S3 Access Points are highly scalable. Because each access point has its own policy, you can get very granular with permissions for different groups of users. For example, if Group A should only get access to data with Prefix 1, Group B should have access to only data with Prefix 2, and Group C can have access to data with Prefixes 1 and 2, you can make three access points. Access points work independently of each other: changing the policy of, or removing, one access point doesn’t disrupt the work of a group using another access point.
Setting up Amazon S3 Access Points does require more policies than a bucket policy or an IAM role: you need a policy for the access point as well as the Amazon S3 bucket. In addition, an access point can only point to one bucket. If you want to give an approved user access to two S3 buckets, direct them to two Amazon S3 access points. Finally, access points do not support anonymous access, so if you have users who do not have an AWS account, architect for an additional access mechanism for them. Like bucket policies, access point policies are limited to 20 KB in size. Learn more about Amazon S3 Access Points.
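Access point policies read much like bucket policies but attach to the access point’s ARN, and object ARNs gain an `/object/` infix. A sketch for the Group A / Prefix 1 case above, with hypothetical account IDs, Region, and access point name:

```python
import json

REGION = "us-east-1"           # hypothetical values throughout
OWNER_ACCOUNT = "123456789012"  # account that owns the bucket and access point
ACCESS_POINT = "group-a-ap"
GROUP_A_ACCOUNT = "111122223333"

ap_arn = f"arn:aws:s3:{REGION}:{OWNER_ACCOUNT}:accesspoint/{ACCESS_POINT}"

access_point_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"AWS": f"arn:aws:iam::{GROUP_A_ACCOUNT}:root"},
        "Action": ["s3:GetObject", "s3:ListBucket"],
        "Resource": [
            ap_arn,
            # Object ARNs through an access point use the /object/ infix.
            f"{ap_arn}/object/prefix-1/*",
        ],
    }],
}

print(json.dumps(access_point_policy, indent=2))
```

Remember that the underlying bucket policy must also delegate access control to the access point; the access point policy alone is not enough.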
Get started with open data on AWS
These are just a handful of ways you can grant and control access to data resources using AWS services. If you are interested in exploring further, check out these blog posts that focus on access control, or talk to your account team.
Many of these mechanisms are in action with certain datasets on the Registry of Open Data on AWS. Explore these and many other openly available datasets, and learn how to become a data provider yourself at opendata.aws. Happy data sharing!
Read more stories about AWS and open data:
- Downscaled CMIP5, 1950 US Census, and open genomics data for Galaxy: The latest open data on AWS
- Preventing the next pandemic: How researchers analyze millions of genomic datasets with AWS
- Street-scale global maps, orca sounds, and COVID-19 detection data: The latest open data on AWS
- Climate data, koala genomes, analysis ready radar data, and highly-queryable genomic data: The latest open data on AWS
- Celebrate Open Science Week with the Allen Institute and available open datasets
Subscribe to the AWS Public Sector Blog newsletter to get the latest in AWS tools, solutions, and innovations from the public sector delivered to your inbox, or contact us.
Please take a few minutes to share insights regarding your experience with the AWS Public Sector Blog in this survey, and we’ll use feedback from the survey to create more content aligned with the preferences of our readers.