    Software Heritage Graph Dataset

    Open data | Deployed on AWS
    Overview

    [Software Heritage](https://www.softwareheritage.org/) is the largest existing public archive of software source code and accompanying development history. The Software Heritage Graph Dataset is a fully deduplicated Merkle DAG representation of the Software Heritage archive.

    The dataset links together file content identifiers, source code directories, Version Control System (VCS) commits tracking evolution over time, up to the full states of VCS repositories as observed by Software Heritage during periodic crawls. The dataset’s contents come from major development forges (including GitHub and GitLab), FOSS distributions (e.g., Debian), and language-specific package managers (e.g., PyPI). Crawling information is also included, providing timestamps about when and where all archived source code artifacts have been observed in the wild. Author and committer information is anonymized.
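    To make "fully deduplicated Merkle DAG" concrete, here is a minimal toy sketch in Python. It assumes a git-style blob hash for file contents; the directory hash is a simplified stand-in rather than the archive's actual encoding, and `MerkleStore`, `content_id`, and `directory_id` are hypothetical names chosen for illustration only.

```python
import hashlib


def content_id(data: bytes) -> str:
    """Git-style blob hash: SHA-1 over a 'blob <size>' header, a NUL byte, then the data."""
    header = b"blob %d\x00" % len(data)
    return hashlib.sha1(header + data).hexdigest()


def directory_id(entries: dict[str, str]) -> str:
    """Toy directory hash over sorted (name, child id) pairs.

    Simplified for illustration; this is NOT the real git tree encoding.
    """
    h = hashlib.sha1()
    for name in sorted(entries):
        h.update(name.encode() + b"\x00" + entries[name].encode() + b"\n")
    return h.hexdigest()


class MerkleStore:
    """Hypothetical in-memory node store keyed by node identifier."""

    def __init__(self) -> None:
        self.nodes: dict[str, object] = {}

    def add_content(self, data: bytes) -> str:
        node_id = content_id(data)
        self.nodes.setdefault(node_id, data)  # identical bytes are stored once
        return node_id

    def add_directory(self, files: dict[str, bytes]) -> str:
        entries = {name: self.add_content(data) for name, data in files.items()}
        node_id = directory_id(entries)
        self.nodes.setdefault(node_id, entries)  # identical trees are stored once
        return node_id


store = MerkleStore()
repo_a = store.add_directory({"README": b"hello\n", "LICENSE": b"MIT\n"})
repo_b = store.add_directory({"README": b"hello\n", "LICENSE": b"MIT\n"})  # a fork
repo_c = store.add_directory({"README": b"hello\n", "LICENSE": b"GPL\n"})

print(repo_a == repo_b)  # identical trees collapse to a single node
print(len(store.nodes))  # 3 distinct file contents + 2 distinct directories
```

    Because every node is named by a hash of its content and of its children's identifiers, a file or directory that appears in many repositories is represented once and shared by every parent that points to it.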

    Features and programs

    Open Data Sponsorship Program

    This dataset is part of the Open Data Sponsorship Program, an AWS program that covers the cost of storage for publicly available high-value cloud-optimized datasets.

    Pricing

    This is a publicly available dataset. No subscription is required.

    Legal

    Content disclaimer

    Vendors are responsible for their product descriptions and other product content. AWS does not warrant that vendors' product descriptions or other product content are accurate, complete, reliable, current, or error-free.

    Usage information

    Delivery details

    AWS Data Exchange (ADX)

    AWS Data Exchange is a service that helps AWS customers easily share and manage data entitlements from other organizations at scale.

    Open data resources

    Available with or without an AWS account.

    How to use
    To access these resources, reference the Amazon Resource Name (ARN) using the AWS Command Line Interface (CLI).
    Description: Software Heritage Graph Dataset
    Resource type: S3 bucket
    Amazon Resource Name (ARN): arn:aws:s3:::softwareheritage
    AWS region: us-east-1
    AWS CLI access (No AWS account required): aws s3 ls --no-sign-request s3://softwareheritage/

    Description: [S3 Inventory](https://docs.aws.amazon.com/AmazonS3/latest/dev/storage-inventory.html#storage-inventory-contents) files
    Resource type: S3 bucket
    Amazon Resource Name (ARN): arn:aws:s3:::softwareheritage-inventory
    AWS region: us-east-1
    AWS CLI access (No AWS account required): aws s3 ls --no-sign-request s3://softwareheritage-inventory/
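
    For readers without the AWS CLI, the anonymous listing above can be approximated with the Python standard library alone. This is a sketch under assumptions: it targets the public S3 REST endpoint in virtual-hosted style, uses the ListObjectsV2 query parameters, and reads only the first page of results (no continuation-token handling). The helper names `list_url`, `parse_listing`, and `list_bucket` are hypothetical.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# XML namespace used by S3 list responses.
S3_NS = {"s3": "http://s3.amazonaws.com/doc/2006-03-01/"}


def list_url(bucket: str, prefix: str = "") -> str:
    """Build an unsigned ListObjectsV2 URL that groups keys by '/'."""
    query = urllib.parse.urlencode(
        {"list-type": "2", "delimiter": "/", "prefix": prefix}
    )
    return f"https://{bucket}.s3.amazonaws.com/?{query}"


def parse_listing(xml_text: str) -> tuple[list[str], list[tuple[str, int]]]:
    """Return (sub-prefixes, [(key, size_in_bytes), ...]) from a list response."""
    root = ET.fromstring(xml_text)
    prefixes = [
        el.text or "" for el in root.findall("s3:CommonPrefixes/s3:Prefix", S3_NS)
    ]
    objects = [
        (
            item.findtext("s3:Key", default="", namespaces=S3_NS),
            int(item.findtext("s3:Size", default="0", namespaces=S3_NS)),
        )
        for item in root.findall("s3:Contents", S3_NS)
    ]
    return prefixes, objects


def list_bucket(bucket: str, prefix: str = "") -> tuple[list[str], list[tuple[str, int]]]:
    """Fetch and parse the first page of a public bucket listing (needs network)."""
    with urllib.request.urlopen(list_url(bucket, prefix), timeout=30) as resp:
        return parse_listing(resp.read().decode("utf-8"))
```

    Calling `list_bucket("softwareheritage")` plays the role of `aws s3 ls --no-sign-request s3://softwareheritage/`: no credentials are sent, and the result is the set of top-level prefixes and objects in the first page of the listing.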

    Resources

    Support

    Managed By

    Software Heritage

    How to cite

    Software Heritage Graph Dataset was accessed on DATE from https://registry.opendata.aws/software-heritage.

    License

    The term "Software Heritage Graph Dataset" designates the internal structure of the Software Heritage archive, and explicitly excludes the file contents. The "Software Heritage Graph Dataset" is distributed under the Creative Commons Attribution 4.0 International license. For terms of use of all other contents found in the S3 buckets, contact datasets@softwareheritage.org.

    By accessing the dataset, you agree with the Software Heritage Ethical Charter for using the archive data, the terms of use for bulk access, and the Software Heritage principles for large language models.
