This Guidance helps nonprofit research institutes build a modern data sharing portal. Nonprofit funders increasingly seek data sharing policies that give them visibility into a nonprofit's progress toward its goals and its outputs. In addition to funders, government agencies are beginning to require researchers to share data with research participants and communities. In January 2023, the US National Institutes of Health (NIH)* announced that most of the 300,000 researchers and 2,500 institutions that the NIH funds annually will need to include a data management plan in their grant applications and eventually make their data publicly available. This Guidance can help nonprofits meet these requirements by showing how funders can contribute additional datasets to the raw data that nonprofit researchers use, and how researchers can share their findings with research participants and communities.
*Data Management & Sharing Policy Overview, National Institutes of Health (NIH), January 2023
Nonprofit researchers upload their data to Amazon Simple Storage Service (Amazon S3). You can give project funders permission to upload other datasets from previous projects to Amazon S3.
AWS Identity and Access Management (IAM) provides roles and temporary security credentials to securely authorize access to Amazon S3. You can securely configure the Amazon S3 uploader client app to make it easier for researchers to use.
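As a minimal sketch of scoping upload access, an IAM policy attached to a researcher's role might allow `s3:PutObject` only under a per-project prefix. The bucket and prefix names below are hypothetical placeholders, not part of this Guidance's deployed resources:

```python
import json

# Hypothetical bucket and per-project prefix; substitute your own names.
BUCKET = "example-research-data-lake"
PROJECT_PREFIX = "project-alpha/raw/"

# Scoped IAM policy: researchers can upload only into their project's prefix.
upload_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowProjectUploads",
            "Effect": "Allow",
            "Action": ["s3:PutObject"],
            "Resource": [f"arn:aws:s3:::{BUCKET}/{PROJECT_PREFIX}*"],
        }
    ],
}

print(json.dumps(upload_policy, indent=2))
```

A similar policy with a funder-specific prefix can grant project funders permission to upload datasets from previous projects.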
AWS Lake Formation simplifies management and governance of your scalable data lake using Amazon S3 as the underlying storage. Lake Formation enables access controls (for row-, column-, and table-level security), audit trails, and automatic schema discovery.
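As an illustration of column-level security, the request below sketches the parameters for Lake Formation's GrantPermissions API, granting a funder role SELECT on only two columns of a table. The role ARN, database, table, and column names are all hypothetical:

```python
# Parameters for Lake Formation's GrantPermissions API. All names below are
# hypothetical; replace them with your own principal and catalog resources.
grant_request = {
    "Principal": {
        "DataLakePrincipalIdentifier": "arn:aws:iam::111122223333:role/FunderReadOnly"
    },
    "Resource": {
        "TableWithColumns": {
            "DatabaseName": "research_db",
            "Name": "trial_results",
            # Column-level security: the role can query only these columns.
            "ColumnNames": ["measurement", "collected_at"],
        }
    },
    "Permissions": ["SELECT"],
}

# With boto3, this would be passed as keyword arguments:
#   boto3.client("lakeformation").grant_permissions(**grant_request)
```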
Researchers can give Amazon SageMaker or AWS Artificial Intelligence (AI) services access to data in the S3 data lake in order to build, train, and deploy machine learning (ML) models or enrich data sets with AI.
Researchers can use Amazon QuickSight to create visualizations and dashboards for their analysis or to share data insights with funders, research participants, and communities.
Amazon Athena provides an interactive SQL-style query engine that you can use to query data in the data lake.
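As a hedged sketch, a query over the data lake could be submitted through Athena's StartQueryExecution API. The database, table, and results bucket below are hypothetical placeholders:

```python
# Parameters for Athena's StartQueryExecution API. The database, table, and
# output bucket are illustrative placeholders.
query = """
SELECT site, COUNT(*) AS samples
FROM trial_results
WHERE collected_at >= DATE '2023-01-01'
GROUP BY site
ORDER BY samples DESC
"""

athena_params = {
    "QueryString": query,
    "QueryExecutionContext": {"Database": "research_db"},
    "ResultConfiguration": {
        "OutputLocation": "s3://example-athena-results/portal/"
    },
}

# With boto3:
#   boto3.client("athena").start_query_execution(**athena_params)
```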
Researchers can build and deploy a data sharing portal using AWS Amplify, a managed service for building, deploying, and hosting a static website. This service provides libraries to simplify permissions management.
Researchers can embed QuickSight dashboards and display results from SageMaker models or Athena queries on their website so funders, research participants, and communities can gauge findings from the research.
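As a sketch of dashboard embedding, a registered QuickSight user's embed URL can be requested through the GenerateEmbedUrlForRegisteredUser API. The account ID, user ARN, and dashboard ID below are hypothetical:

```python
# Parameters for QuickSight's GenerateEmbedUrlForRegisteredUser API.
# Account ID, user ARN, and dashboard ID are illustrative placeholders.
embed_request = {
    "AwsAccountId": "111122223333",
    "UserArn": "arn:aws:quicksight:us-east-1:111122223333:user/default/portal-viewer",
    "ExperienceConfiguration": {
        "Dashboard": {"InitialDashboardId": "a1b2c3d4-example-dashboard-id"}
    },
    "SessionLifetimeInMinutes": 60,
}

# With boto3:
#   boto3.client("quicksight").generate_embed_url_for_registered_user(**embed_request)
# The returned EmbedUrl can then be placed in an iframe on the portal page.
```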
Amazon Cognito simplifies log-in and permission management to restrict who can view the data sharing portal and what access they have.
The AWS Well-Architected Framework helps you understand the pros and cons of the decisions you make when building systems in the cloud. The six pillars of the Framework allow you to learn architectural best practices for designing and operating reliable, secure, efficient, cost-effective, and sustainable systems. Using the AWS Well-Architected Tool, available at no charge in the AWS Management Console, you can review your workloads against these best practices by answering a set of questions for each pillar.
The architecture diagram above is an example of a Solution created with Well-Architected best practices in mind. To be fully Well-Architected, you should follow as many Well-Architected best practices as possible.
This architecture can be deployed using infrastructure as code (IaC). IaC helps you recover from failures by automating the process of launching new environments and infrastructure. The repeatable aspect of IaC enables consistency and ease of deployments in production or operation. Additionally, you can use Amazon CloudWatch to monitor services.
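As a minimal IaC sketch, the snippet below builds a CloudFormation template as a Python dictionary and emits it as JSON; a template like this can be redeployed repeatedly to recreate infrastructure after a failure. The single versioned S3 bucket shown is illustrative, not the full Guidance stack:

```python
import json

# A minimal CloudFormation template describing one versioned S3 bucket.
# The logical resource name is illustrative.
template = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Resources": {
        "DataLakeBucket": {
            "Type": "AWS::S3::Bucket",
            "Properties": {
                "VersioningConfiguration": {"Status": "Enabled"},
            },
        },
    },
}

# Emitting JSON yields a template deployable with CloudFormation.
rendered = json.dumps(template, indent=2)
print(rendered)
```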
Amazon Cognito provides managed security for controlling access to the data sharing portal. Lake Formation enforces security and governance of the data lake.
The serverless services in this architecture are automatically deployed across multiple Availability Zones, so that if one Availability Zone fails, services remain available in another Availability Zone. Additionally, this architecture decouples storage from compute and uses stateless services to enhance reliability and availability. When compute and storage are decoupled, they operate independently: if a compute failure occurs, the person or process storing data does not have to wait for confirmation that the issue is resolved. Instead, compute processes can perform automated retries and independently notify users of failures.
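The automated-retry pattern described above can be sketched generically as exponential backoff with jitter. The helper and the flaky operation below are hypothetical illustrations, not part of this Guidance's code:

```python
import random
import time

def with_retries(operation, max_attempts=5, base_delay=0.01):
    """Run `operation`, retrying transient failures with exponential
    backoff and full jitter. A generic sketch of the retry pattern."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise  # give up and surface the failure to the caller
            # Sleep a random amount up to base_delay * 2^(attempt-1).
            time.sleep(random.uniform(0, base_delay * 2 ** (attempt - 1)))

# Usage with a hypothetical operation that fails twice, then succeeds:
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "ok"

print(with_retries(flaky))  # succeeds on the third attempt, prints "ok"
```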
Multiple services used in this architecture, including QuickSight, SageMaker, Amazon S3, and AWS Glue, offer AWS Free Tier usage. Within the Free Tier limits, you can experiment and fine-tune architecture configurations without incurring additional costs.
The serverless services in this architecture scale automatically, meaning they can manage growing data volumes while using only the minimum resources required. To manage costs over time, we recommend implementing a standardized process to identify and remove unused resources, such as unused data, SageMaker resources, and extract, transform, load (ETL) jobs.
This architecture supports S3 Lifecycle policies, which allow you to monitor access patterns to discover data that should be moved to lower-cost storage classes, such as infrequent-access or archival (cold) storage. This helps reduce the resources needed to maintain data storage.
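As an illustration, the lifecycle configuration below (the shape accepted by the PutBucketLifecycleConfiguration API) tiers objects down to lower-cost storage classes as they age. The prefix, rule ID, and day thresholds are illustrative choices, not recommendations from this Guidance:

```python
# An S3 Lifecycle configuration that transitions objects under a prefix to
# lower-cost storage classes over time. Thresholds are illustrative.
lifecycle_config = {
    "Rules": [
        {
            "ID": "tier-down-raw-data",
            "Status": "Enabled",
            "Filter": {"Prefix": "raw/"},
            "Transitions": [
                {"Days": 90, "StorageClass": "STANDARD_IA"},   # infrequent access
                {"Days": 365, "StorageClass": "GLACIER"},      # cold storage
            ],
        }
    ]
}

# With boto3:
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="example-research-data-lake",
#       LifecycleConfiguration=lifecycle_config,
#   )
```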
A detailed guide is provided that you can experiment with and use within your AWS account. It walks through each stage of working with the Guidance, including deployment, usage, and cleanup.
The sample code is a starting point. It is industry validated, prescriptive but not definitive, and a peek under the hood to help you begin.
The sample code; software libraries; command line tools; proofs of concept; templates; or other related technology (including any of the foregoing that are provided by our personnel) is provided to you as AWS Content under the AWS Customer Agreement, or the relevant written agreement between you and AWS (whichever applies). You should not use this AWS Content in your production accounts, or on production or other critical data. You are responsible for testing, securing, and optimizing the AWS Content, such as sample code, as appropriate for production grade use based on your specific quality control practices and standards. Deploying AWS Content may incur AWS charges for creating or using AWS chargeable resources, such as running Amazon EC2 instances or using Amazon S3 storage.