AWS Public Sector Blog

How researchers can meet new open data policies for federally-funded research with AWS

In 2022, an order from the White House’s Office of Science and Technology Policy introduced new data sharing requirements for federally-funded research. Researchers across academic institutions, nonprofits, and federal agencies themselves will be expected to make research publications and underlying data accessible to the public at no cost, immediately upon publication. This new model gives researchers direct and equitable access to the raw data behind scientific publications, maximizing data reusability as well as experimental reproducibility.

Under these new requirements, agencies must update their public access policies as soon as possible—no later than the end of 2025—and must achieve full implementation of the public access requirements by 2027. So how should researchers prepare to comply with these new data sharing policies? Learn how federal agencies are enacting these public access policies, and how you can use Amazon Web Services (AWS) to prepare your research to meet these new data management and sharing requirements.

New public access policies push federal agencies towards open science

Movements towards open science are being implemented by federal agencies across all domains of research. Open access data and software prime researchers to pick up where other scientists left off. The United States Department of Agriculture, Department of Defense, Department of Energy, and the National Science Foundation have also taken an open data stance.

While not setting a default requirement for open access, the submission of plans necessitates that researchers consider and strive towards open science from the beginning of a project. Such an approach can help researchers design their data collection systems at the outset for public access. This model also makes sure that researchers address the privacy, legal, and ethical considerations in making certain data public—especially given the sensitive data often used in biomedical and human subject research studies.

“The new mandate will facilitate the necessary cultural change in biomedical research that patients and families deserve,” says Ashwini Davison, MD, healthcare executive advisor of academic medical centers at AWS. “Academic medicine has been grappling with a reproducibility crisis, but the enhanced data sharing and management plan requirements are a step in the right direction towards a future of truly open science.”

How AWS can help researchers align with new public access policies

As these and future data sharing policies take effect, researchers must consider how they can meet these new requirements, broaden access to their findings, and equip the next generation of researchers with better data to improve the health of the public. Researchers can use AWS to meet this challenge and design data architectures that optimize research abilities while supporting secure and cost-effective access to data.

Making your artifacts findable, accessible, and reusable with AWS

The AWS Data Exchange public data catalog lists over 3,000 data products that are available on a subscription basis. Publishing on AWS Data Exchange is self-service. Once you have registered as an AWS Marketplace vendor, you can provide a managed subscription experience for your data users. Data users who subscribe to your data product will receive a copy of the data product in their own Amazon Simple Storage Service (Amazon S3) bucket. At re:Invent 2022, AWS also announced AWS Data Exchange for Amazon S3, a feature for subscribers who want to use third-party data files for their data analysis with AWS services without needing to create or manage data copies, as well as for data providers who want to offer in-place access to data hosted in their Amazon S3 buckets. AWS Data Exchange also supports tables (AWS Data Exchange for Amazon Redshift) and APIs (AWS Data Exchange for APIs).

The Registry of Open Data on AWS is an open-source digital catalog that helps data providers keep their data findable and accessible. Currently, the registry lists 390 datasets spanning the geospatial sciences, climate, weather, sustainability, healthcare, machine learning, and life sciences. Users can search for datasets by keyword or by data provider, and are pointed directly to the resource and mechanism by which they can access the dataset. Nearly all of the datasets on the Registry of Open Data on AWS are distributed using Amazon S3. Data users can access the data in place via AWS native APIs, often without needing an AWS account. To list your dataset on the Registry of Open Data on AWS, make a pull request to the GitHub repository.
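As a rough illustration of accessing data in place without an AWS account, the sketch below uses only the Python standard library to build an unsigned Amazon S3 ListObjectsV2 request URL and to parse the XML listing that Amazon S3 returns. The bucket name `example-open-data`, the Region, and the prefix are hypothetical placeholders rather than a real Registry entry, and the helper names are ours. It assumes the bucket permits anonymous listing, which not every open dataset does.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# XML namespace used by Amazon S3 REST API responses.
S3_NS = {"s3": "http://s3.amazonaws.com/doc/2006-03-01/"}


def build_list_url(bucket, region, prefix="", max_keys=10):
    """Build an unsigned ListObjectsV2 URL for a publicly listable bucket."""
    query = urllib.parse.urlencode(
        {"list-type": "2", "prefix": prefix, "max-keys": str(max_keys)}
    )
    return f"https://{bucket}.s3.{region}.amazonaws.com/?{query}"


def parse_listing(xml_data):
    """Return (key, size_in_bytes) pairs from a ListObjectsV2 XML response."""
    root = ET.fromstring(xml_data)
    return [
        (
            item.findtext("s3:Key", namespaces=S3_NS),
            int(item.findtext("s3:Size", namespaces=S3_NS)),
        )
        for item in root.findall("s3:Contents", S3_NS)
    ]


def list_public_objects(bucket, region, prefix="", max_keys=10):
    """Fetch and parse one page of a public listing (requires network access)."""
    url = build_list_url(bucket, region, prefix, max_keys)
    with urllib.request.urlopen(url) as response:
        return parse_listing(response.read())


# Hypothetical bucket and prefix, shown only to illustrate the URL shape.
print(build_list_url("example-open-data", "us-east-1", prefix="data/", max_keys=5))
```

In practice, most users would reach for an AWS SDK or the AWS CLI with unsigned requests (for example, the CLI's `--no-sign-request` option) rather than hand-building URLs. The point here is only that a public dataset in Amazon S3 can be browsed with plain HTTPS.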

Findability, accessibility, and reusability don’t stop at data. As NASA’s directive indicates, technical research artifacts include software and workflows. The Amazon Elastic Container Registry (Amazon ECR) Public Gallery lets you list and search for public container artifacts. All AWS users get 50 GB of public storage in Amazon ECR every month at no cost, and the Amazon ECR Public Gallery is available for anyone to browse at no cost, even if you do not have an AWS account. Learn more about how to get started with Amazon ECR Public Gallery.
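For container artifacts, images in Amazon ECR Public are addressed as `public.ecr.aws/<registry alias>/<repository>:<tag>`. The minimal sketch below assembles such a reference and the matching `docker pull` command. The alias `example-lab`, the repository `variant-caller`, and the tag are hypothetical names chosen for illustration, and the helper functions are ours, not part of any AWS tool.

```python
def public_image_uri(registry_alias, repository, tag="latest"):
    """Return an Amazon ECR Public image reference for the given parts."""
    for label, value in (
        ("registry_alias", registry_alias),
        ("repository", repository),
        ("tag", tag),
    ):
        # Reject empty values and embedded whitespace, which cannot appear
        # in an image reference.
        if not value or any(ch.isspace() for ch in value):
            raise ValueError(f"invalid {label}: {value!r}")
    return f"public.ecr.aws/{registry_alias}/{repository}:{tag}"


def pull_command(registry_alias, repository, tag="latest"):
    """Return the docker command (as an argument list) that pulls the image."""
    return ["docker", "pull", public_image_uri(registry_alias, repository, tag)]


# Hypothetical alias, repository, and tag.
print(" ".join(pull_command("example-lab", "variant-caller", "1.2.0")))
```

Pinning an explicit version tag instead of relying on `latest` is one small way to support the reproducibility goals discussed above, since collaborators then pull the same image that produced the published results.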

Supporting data interoperability with AWS

Being able to use datasets in concert with other datasets can be a major challenge. AWS offers high level guidance and blueprints that can help users deploy AWS services to aggregate, manage, and integrate different data sources—as in a data lake, for example—with well-architected guidance for multi-modal and multi-omic data.

Several AWS analytics services have features that let you work with datasets in different formats. For example, Amazon Athena Federated Query lets you query across datasets in different formats and databases. AWS Glue custom connectors let you transfer data from applications and custom data sources to your data lake in Amazon S3. The Amazon HealthLake suite of services lets you analyze imaging, structured, and unstructured health data in a HIPAA-eligible environment.
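To make the idea of querying across sources more concrete, the sketch below composes a SQL statement of the general kind a federated query uses: a join between a table in the default AWS Glue Data Catalog (`awsdatacatalog`) and a table exposed through a separately registered data source. Every catalog, database, table, and column name here is hypothetical, and the helpers are ours. The code only builds the query text; submitting it would require the Athena console, an AWS SDK, or the AWS CLI, plus a configured connector.

```python
def qualified(catalog, database, table):
    """Quote each part and join them as "catalog"."database"."table"."""
    parts = (catalog, database, table)
    for part in parts:
        if not part or '"' in part:
            raise ValueError(f"unsupported identifier: {part!r}")
    return ".".join(f'"{part}"' for part in parts)


def federated_join_sql(left, left_cols, right, right_cols, key, limit=100):
    """Build a SELECT that joins two fully qualified tables on a shared key."""
    for name in (*left_cols, *right_cols, key):
        if not name or '"' in name:
            raise ValueError(f"unsupported column name: {name!r}")
    select = ", ".join(
        [f'l."{col}"' for col in left_cols] + [f'r."{col}"' for col in right_cols]
    )
    return (
        f"SELECT {select} "
        f"FROM {left} AS l "
        f'JOIN {right} AS r ON l."{key}" = r."{key}" '
        f"LIMIT {int(limit)}"
    )


# Hypothetical names: a data lake table joined with a table from another source.
sql = federated_join_sql(
    qualified("awsdatacatalog", "research_lake", "samples"),
    ["sample_id", "collected_on"],
    qualified("clinical_db", "public", "outcomes"),
    ["outcome"],
    key="sample_id",
    limit=50,
)
print(sql)
```

Selecting explicit columns, rather than `*` from both sides, avoids duplicate column names in the result and keeps the shared dataset's schema easy to document.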

For research data that may be decentralized, AWS services also support data mesh principles to help customers find, aggregate, and analyze data. AWS customers have used AWS Lake Formation, AWS Glue, and Amazon HealthLake as data platforms built on data mesh principles. Customers who regularly use data from many third-party sources can also use AWS Data Exchange as the basis for a data mesh. At re:Invent 2022, AWS also announced Amazon DataZone to help customers share, search, and discover data at scale across organizational boundaries.

Accelerating public access policy compliance with AWS support services, programs, and partners

AWS offers technical and strategic support to navigate the many options available to researchers to comply with data sharing policies. AWS Professional Services (AWS ProServe) supplements your team with specialized skills and experience to help you build and implement the right data solution for your organization. Through the Data Driven Everything program, AWS works with customers to move faster and with greater precision using a Working Backwards approach to address people, process, and technology-related considerations. The AWS Data Lab offers accelerated, joint engineering engagements between customers and AWS technical resources to create tangible deliverables that accelerate data, analytics, artificial intelligence (AI) and machine learning (ML), serverless, and containers modernization initiatives.

The AWS Open Data Sponsorship Program (ODP) covers the cost of storage for publicly available high-value cloud-optimized datasets. All datasets sponsored by the Open Data Program are listed on the Registry of Open Data on AWS and the AWS Data Exchange, helping you to keep the cost of your shared data low while optimizing findability. The Open Data Program is managed by the open data team, who are experts in best practices for highly distributed datasets.

Researchers can also work with the robust AWS Partner community, which can help you build any data and analytics application in the cloud. Find an AWS data and analytics partner here.

Learn more about AWS for open data

These new public access policies offer opportunities for researchers and higher education institutions. Once implemented, we believe that they will accelerate the rate of scientific discovery and innovation, all while saving the research community time and money by ensuring maximal discoverability and reuse of data.

Do you have questions about how to use AWS to optimize your research for open science? Reach out to your AWS account team, or contact us to learn more.

Erin Chu

Erin Chu is the life sciences lead on the Amazon Web Services (AWS) open data team. Trained to bridge the gap between the clinic and the lab, Erin is a veterinarian and a molecular geneticist, and spent the last four years in the companion animal genomics space. She is dedicated to helping speed time to science through interdisciplinary collaboration, communication, and learning.

Ben Moscovitch

Ben Moscovitch leads Amazon Web Services (AWS) healthcare and life science public policy efforts in the Americas. Ben focuses on legislative, regulatory, and other policy reforms to improve health data interoperability, medical product innovation, clinical research, AI/ML use in healthcare, public health, and quality improvement. Previously, Ben directed health IT, medical device, and data policy at the Pew Charitable Trusts and was a reporter covering federal life science and medical technology policy.

Jack Fenwick

Jack Fenwick is a strategic alliances manager for the Higher Education Research Team at Amazon Web Services (AWS). He supports AWS participation in federally-sponsored cloud programs to expand academic researcher access to cloud-based services and technologies, including the NIH STRIDES and NSF CloudBank initiatives. Jack has nearly a decade of experience establishing research and development partnerships domestically and internationally, and is passionate about the impact of research collaborations and cloud technologies for scientific discovery.

Ken Harris

Ken Harris leads the Amazon Web Services (AWS) vertical focused on academic medicine and state and local government providers. His team of principal trusted advisors in precision medicine, hospital modernization, clinical informatics, and artificial intelligence (AI) and machine learning (ML) are both field-based and customer-facing healthcare executives. He has over 30 years of healthcare experience, including founding and taking public a cell and gene therapy company. Before joining AWS, he served for 10 years as a chief clinical officer, president, and chief executive officer (CEO). Ken has a strong background in building and commercializing regulated products in the device, combination product, clinical software, and therapeutic biologics space.

Scott Friedman

Scott Friedman is a principal technical business development manager for higher education research at Amazon Web Services (AWS). He holds a Ph.D. in computer science from UCLA and has over 25 years of experience enabling research in higher education.