Common Crawl Corpus

A corpus of web crawl data composed of over 5 billion web pages. This data set is freely available on Amazon S3 and is released under the Common Crawl Terms of Use.


Size: 541 TB
Source: Common Crawl Foundation
Created On: February 15, 2012 2:23 AM GMT
Last Updated: March 17, 2014 5:51 PM GMT
Available at: s3://aws-publicdatasets/common-crawl/

Common Crawl is a non-profit organization dedicated to providing an open repository of web crawl data that can be accessed and analyzed by everyone.

The most recent crawl data sets include three different types of files: Raw Content, Text Only, and Metadata. Data sets from before 2012 contain only Raw Content files.

For more details about the file formats and directory structure, please see this blog post.

Common Crawl provides the glue code required to launch Hadoop jobs on Amazon Elastic MapReduce against the crawl corpus residing here in the Amazon Public Data Sets. Because Amazon Elastic MapReduce reads the S3-resident data directly, end users avoid network data transfer costs.
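Before launching a full Hadoop job, it can help to browse what is actually in the bucket. The following is a minimal sketch of anonymous access to the public data set using Python's boto3 library (an assumption; any S3 client works). The bucket and prefix come from the listing above; `list_crawl_prefixes` and `s3_uri` are illustrative helper names, not part of any Common Crawl tooling.

```python
# Sketch: enumerate top-level prefixes of the Common Crawl corpus in the
# public S3 bucket. Bucket name and prefix are taken from the data set
# listing (s3://aws-publicdatasets/common-crawl/).

BUCKET = "aws-publicdatasets"
PREFIX = "common-crawl/"


def s3_uri(key):
    """Build the s3:// URI for a key in the public data set bucket."""
    return "s3://{0}/{1}".format(BUCKET, key)


def list_crawl_prefixes():
    """List top-level 'directories' under common-crawl/ without credentials.

    Assumes boto3 is installed; the UNSIGNED config allows anonymous reads
    of public buckets.
    """
    import boto3
    from botocore import UNSIGNED
    from botocore.config import Config

    s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
    resp = s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX, Delimiter="/")
    return [p["Prefix"] for p in resp.get("CommonPrefixes", [])]


if __name__ == "__main__":
    for prefix in list_crawl_prefixes():
        print(s3_uri(prefix))
```

The same listing can be done from the command line with `aws s3 ls s3://aws-publicdatasets/common-crawl/`; reading the data from within Elastic MapReduce in the same region is what avoids the transfer costs mentioned above.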

To learn more about Amazon Elastic MapReduce please see the product detail page.

Common Crawl's Hadoop classes and other code can be found in its GitHub repository.

©2015, Amazon Web Services, Inc. or its affiliates. All rights reserved.