
Overview
Software Heritage is the largest existing public archive of software source code and accompanying development history. The Software Heritage Graph Dataset is a fully deduplicated Merkle DAG representation of the Software Heritage archive.
The dataset links together file content identifiers, source code directories, and Version Control System (VCS) commits that track evolution over time, up to the full states of VCS repositories as observed by Software Heritage during periodic crawls. The dataset’s contents come from major development forges (including GitHub and GitLab), FOSS distributions (e.g., Debian), and language-specific package managers (e.g., PyPI). Crawling information is also included, recording when and where each archived source code artifact was observed in the wild. Author and committer information is anonymized.
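To illustrate what "fully deduplicated" means in a Merkle DAG, here is a minimal sketch of content addressing: each file is identified by a hash of its bytes, so identical files collapse to a single node no matter how many repositories contain them. The sketch assumes Git's blob hashing convention (SHA-1 over a `blob <length>\0` header plus the data). The function names `content_id` and `deduplicate` are hypothetical and are not part of any Software Heritage tooling.

```python
import hashlib


def content_id(data: bytes) -> str:
    """Return a Git-style blob hash for raw file content (assumed scheme)."""
    header = b"blob %d\x00" % len(data)
    return hashlib.sha1(header + data).hexdigest()


def deduplicate(blobs):
    """Map each distinct content hash to one copy of the content."""
    store = {}
    for data in blobs:
        # Identical content yields the same key, so it is stored only once.
        store.setdefault(content_id(data), data)
    return store
```

Calling `deduplicate([b"a", b"a", b"b"])` returns a dictionary with two entries, because the two identical inputs share one identifier. Directories and commits can be hashed the same way over the identifiers of their children, which is what makes the overall structure a Merkle DAG.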
Features and programs
Open Data Sponsorship Program
Pricing
This is a publicly available dataset. No subscription is required.
Delivery details
AWS Data Exchange (ADX)
AWS Data Exchange is a service that helps AWS customers easily share and manage data entitlements from other organizations at scale.
Open data resources
Available with or without an AWS account.
- How to use: To access these resources, reference the Amazon Resource Name (ARN) using the AWS Command Line Interface (CLI).
- Description: Software Heritage Graph Dataset
- Resource type: S3 bucket
- Amazon Resource Name (ARN): arn:aws:s3:::softwareheritage
- AWS region: us-east-1
- AWS CLI access (no AWS account required): aws s3 ls --no-sign-request s3://softwareheritage/
- Description: [S3 Inventory](https://docs.aws.amazon.com/AmazonS3/latest/dev/storage-inventory.html#storage-inventory-contents) files
- Resource type: S3 bucket
- Amazon Resource Name (ARN): arn:aws:s3:::softwareheritage-inventory
- AWS region: us-east-1
- AWS CLI access (no AWS account required): aws s3 ls --no-sign-request s3://softwareheritage-inventory/
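The unsigned `aws s3 ls` commands above can also be approximated without the AWS CLI. The following is a minimal sketch using only the Python standard library. It assumes the bucket accepts anonymous requests to the S3 REST `ListObjectsV2` operation through the virtual-hosted-style endpoint for `us-east-1`. The helper names `listing_url`, `parse_listing`, and `list_bucket` are hypothetical.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# XML namespace used by S3 ListObjectsV2 responses.
S3_NS = {"s3": "http://s3.amazonaws.com/doc/2006-03-01/"}


def listing_url(bucket, region="us-east-1", prefix="", delimiter="/"):
    """Build an unsigned ListObjectsV2 URL (virtual-hosted style)."""
    query = urllib.parse.urlencode(
        {"list-type": "2", "prefix": prefix, "delimiter": delimiter}
    )
    return f"https://{bucket}.s3.{region}.amazonaws.com/?{query}"


def parse_listing(xml_bytes):
    """Extract (common prefixes, object keys) from a listing response."""
    root = ET.fromstring(xml_bytes)
    prefixes = [e.text for e in root.findall("s3:CommonPrefixes/s3:Prefix", S3_NS)]
    keys = [e.text for e in root.findall("s3:Contents/s3:Key", S3_NS)]
    return prefixes, keys


def list_bucket(bucket, region="us-east-1", prefix=""):
    """Fetch and parse the first page of a listing (network access needed)."""
    url = listing_url(bucket, region, prefix)
    with urllib.request.urlopen(url, timeout=30) as resp:
        return parse_listing(resp.read())
```

For example, `list_bucket("softwareheritage")` would return the top-level prefixes and keys of the dataset bucket. The sketch reads only the first page of results (S3 returns at most 1,000 entries per request) and does not follow continuation tokens.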
Resources
Vendor resources
Support
Contact
Managed By
Software Heritage
How to cite
Software Heritage Graph Dataset was accessed on DATE from https://registry.opendata.aws/software-heritage.
License
The term "Software Heritage Graph Dataset" designates the internal structure of the Software Heritage archive, and explicitly excludes the file contents. The "Software Heritage Graph Dataset" is distributed under the Creative Commons Attribution 4.0 International license. For terms of use of all other contents found in the S3 buckets, contact datasets@softwareheritage.org.
By accessing the dataset, you agree with the Software Heritage Ethical Charter for using the archive data, the terms of use for bulk access, and the Software Heritage principles for large language models.