Software Heritage Graph Dataset

Provided by: Software Heritage, part of the AWS Open Data Sponsorship Program

This product is part of the AWS Open Data Sponsorship Program and contains data sets that are publicly available for anyone to access and use. No subscription is required. Unless specifically stated in the applicable data set documentation, data sets available through the AWS Open Data Sponsorship Program are not provided and maintained by AWS.

Description

Software Heritage is the largest existing public archive of software source code and accompanying development history. The Software Heritage Graph Dataset is a fully deduplicated Merkle DAG representation of the Software Heritage archive.

The dataset links together file content identifiers, source code directories, and Version Control System (VCS) commits tracking evolution over time, up to the full states of VCS repositories as observed by Software Heritage during periodic crawls. The dataset's contents come from major development forges (including GitHub and GitLab), FOSS distributions (e.g., Debian), and language-specific package managers (e.g., PyPI). Crawling information is also included, recording when and where all archived source code artifacts were observed in the wild.
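
To illustrate what "fully deduplicated Merkle DAG" means in practice, here is a minimal Python sketch. It is a toy model, not Software Heritage's actual identifier scheme or data format: the `Store` class, the `content_id` and `directory_id` helpers, and the byte encoding fed to SHA-1 are all assumptions made for illustration. The point it shows is that when a node's identifier is derived from its contents (and from its children's identifiers), identical files and directories coming from different repositories collapse into a single node.

```python
import hashlib


def content_id(data: bytes) -> str:
    """Identify a file purely by its bytes (toy scheme, not a real SWHID)."""
    return hashlib.sha1(b"content\0" + data).hexdigest()


def directory_id(entries: dict[str, str]) -> str:
    """Identify a directory by its sorted (name, child id) pairs."""
    h = hashlib.sha1(b"directory\0")
    for name in sorted(entries):
        h.update(name.encode() + b"\0" + entries[name].encode() + b"\n")
    return h.hexdigest()


class Store:
    """Hypothetical content-addressed store: identical nodes are kept once."""

    def __init__(self) -> None:
        self.nodes: dict[str, object] = {}

    def add_content(self, data: bytes) -> str:
        node_id = content_id(data)
        self.nodes.setdefault(node_id, data)  # no-op if already present
        return node_id

    def add_directory(self, entries: dict[str, str]) -> str:
        node_id = directory_id(entries)
        self.nodes.setdefault(node_id, dict(entries))
        return node_id


store = Store()

# The same license file observed in two different "repositories".
license_a = store.add_content(b"MIT License\n")
license_b = store.add_content(b"MIT License\n")
readme = store.add_content(b"hello\n")

# Two directories with identical entries, listed in a different order.
dir_a = store.add_directory({"LICENSE": license_a, "README": readme})
dir_b = store.add_directory({"README": readme, "LICENSE": license_b})

print(license_a == license_b, dir_a == dir_b, len(store.nodes))
```

Because a directory's identifier depends on its children's identifiers, changing any file changes every ancestor's identifier, while unchanged subtrees keep theirs and are shared. Commits and repository snapshots can be layered on top in the same way.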

License

Creative Commons Attribution 4.0 International.

By accessing the dataset, you agree with the Software Heritage Ethical Charter for using the archive data, the terms of use for bulk access, and the Software Heritage principles for large language models.

How to cite

Software Heritage Graph Dataset was accessed on DATE from https://registry.opendata.aws/software-heritage.

Update frequency

Data is updated yearly.

Support information

Managed by: Software Heritage

Contact: aws@softwareheritage.org

General AWS Data Exchange support

Resources on AWS

Description: Software Heritage Graph Dataset
Resource type: S3 Bucket
Amazon Resource Name (ARN): arn:aws:s3:::softwareheritage
AWS Region: us-east-1
AWS CLI Access (No AWS account required):

aws s3 ls --no-sign-request s3://softwareheritage/

Resource type: S3 Bucket
Amazon Resource Name (ARN): arn:aws:s3:::softwareheritage-inventory
AWS Region: us-east-1
AWS CLI Access (No AWS account required):

aws s3 ls --no-sign-request s3://softwareheritage-inventory/
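
For readers who prefer a script to the AWS CLI, the listing commands above can be approximated in Python using only the standard library. This is a sketch under assumptions: it assumes the buckets allow anonymous listing over plain HTTPS in the same way they allow `--no-sign-request`, it uses the S3 ListObjectsV2 REST call against the virtual-hosted bucket URL, and the helper names `list_url`, `parse_listing`, and `list_bucket` are hypothetical. It does not follow pagination, so at most one page of results is returned per call.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# XML namespace used by S3 list responses.
S3_NS = "{http://s3.amazonaws.com/doc/2006-03-01/}"


def list_url(bucket: str, prefix: str = "", region: str = "us-east-1") -> str:
    """Build an unsigned ListObjectsV2 URL for one 'directory level' of a bucket."""
    query = urllib.parse.urlencode(
        {"list-type": "2", "delimiter": "/", "prefix": prefix}
    )
    return f"https://{bucket}.s3.{region}.amazonaws.com/?{query}"


def parse_listing(xml_data: bytes) -> tuple[list[str], list[str]]:
    """Return (sub-prefixes, object keys) found in a ListObjectsV2 response."""
    root = ET.fromstring(xml_data)
    prefixes = [
        e.text or "" for e in root.findall(f"{S3_NS}CommonPrefixes/{S3_NS}Prefix")
    ]
    keys = [e.text or "" for e in root.findall(f"{S3_NS}Contents/{S3_NS}Key")]
    return prefixes, keys


def list_bucket(bucket: str, prefix: str = "") -> tuple[list[str], list[str]]:
    """Fetch and parse one page of an anonymous bucket listing (needs network)."""
    with urllib.request.urlopen(list_url(bucket, prefix), timeout=30) as resp:
        return parse_listing(resp.read())
```

Calling `list_bucket("softwareheritage")` or `list_bucket("softwareheritage-inventory")` would then return the top-level prefixes and keys of each bucket, roughly what the corresponding `aws s3 ls` command prints. For bulk downloads, the AWS CLI or an SDK remains the more practical choice.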

Usage examples