Common Crawl Corpus

A corpus of web crawl data composed of over 5 billion web pages. This data set is freely available on Amazon S3 and is released under the Common Crawl Terms of Use.

Details

Size: 81 TB
Source: Common Crawl Foundation - http://commoncrawl.org
Created On: February 15, 2012 2:23 AM GMT
Last Updated: November 29, 2012 9:01 AM GMT
Available at: s3://aws-publicdatasets/common-crawl/crawl-002/
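Most S3 tooling expects the bucket name and the key prefix as separate values rather than a single URI. The following is a minimal sketch of splitting the location listed above into those two parts; the helper name `split_s3_uri` is hypothetical and not part of any Common Crawl or AWS library.

```python
from urllib.parse import urlparse

# Location taken from the "Available at" line of this listing.
CRAWL_URI = "s3://aws-publicdatasets/common-crawl/crawl-002/"


def split_s3_uri(uri):
    """Split an s3:// URI into (bucket, key_prefix).

    Hypothetical helper for illustration only.
    """
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    # The bucket is the authority part; the prefix is the path without
    # its leading slash.
    return parsed.netloc, parsed.path.lstrip("/")


bucket, prefix = split_s3_uri(CRAWL_URI)
print(bucket)
print(prefix)
```

The resulting pair can then be passed to whichever S3 client is in use, for example as the bucket and prefix arguments of a listing call.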

Common Crawl is a non-profit organization dedicated to providing an open repository of web crawl data that can be accessed and analyzed by everyone.

The current crawl data set includes three different types of files: ARC raw content, Text Only, and Metadata. The archived crawl data sets contain only ARC raw content files. For more details about the file formats, please see the Common Crawl Wiki.

Common Crawl provides the glue code required to launch Hadoop jobs on Amazon Elastic MapReduce that can run against the crawl corpus residing here in the Amazon Public Data Sets. By using Amazon Elastic MapReduce to access the S3-resident data, end users can avoid the cost of transferring it over the network.
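To give a rough sense of why processing the data where it resides matters, the sketch below estimates how long it would take just to copy the corpus elsewhere. It estimates time, not monetary cost, and rests on assumptions not stated in this listing: that the 81 TB figure uses decimal terabytes, and that a hypothetical link sustains its full rated speed with no protocol overhead.

```python
def transfer_days(size_tb, link_gbit_per_s):
    """Estimate days needed to move size_tb terabytes over a link.

    Assumes decimal units (1 TB = 1e12 bytes, 1 Gbit = 1e9 bits) and a
    link that runs at full speed the whole time; real transfers are slower.
    """
    size_bits = size_tb * 1e12 * 8
    seconds = size_bits / (link_gbit_per_s * 1e9)
    return seconds / 86400  # seconds per day


# 81 TB is the size given in the Details above; 1 Gbit/s is an assumed link.
print(f"{transfer_days(81, 1.0):.1f} days")
```

Under these assumptions the copy alone takes about a week on a 1 Gbit/s link, before any analysis begins.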

To learn more about Amazon Elastic MapReduce please see the product detail page.

Common Crawl's Hadoop classes and other code can be found in its GitHub repository.

©2013, Amazon Web Services, Inc. or its affiliates. All rights reserved.