Common Crawl is a non-profit organization dedicated to providing an open repository of web crawl data that can be accessed and analyzed by everyone.
The current crawl data set includes three different types of files: ARC raw content, Text Only, and Metadata. The archived crawl data sets contain only ARC raw content files. For more details about the file formats, please see the Common Crawl Wiki.
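Each record in an ARC raw content file begins with a one-line header whose five space-separated fields are defined by the ARC 1.0 format: URL, IP address, archive date, content type, and record length. As a minimal sketch (this is an illustration of the ARC 1.0 header layout, not Common Crawl's own reader code), such a header line could be parsed like this:

```python
def parse_arc_header(line):
    """Parse a version-1 ARC record header line into its five fields.

    ARC 1.0 record headers are space-separated:
        URL IP-address Archive-date Content-type Archive-length
    """
    parts = line.strip().split(" ")
    if len(parts) != 5:
        raise ValueError("expected 5 space-separated fields, got %d" % len(parts))
    url, ip, date, mime, length = parts
    return {"url": url, "ip": ip, "date": date,
            "mime": mime, "length": int(length)}

# Example header line (values are illustrative):
record = parse_arc_header(
    "http://example.com/ 93.184.216.34 20120101000000 text/html 1234")
```

A full reader would then consume `record["length"]` bytes of content following the header before the next record begins.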
Common Crawl provides the glue code required to launch Hadoop jobs on Amazon Elastic MapReduce against the crawl corpus residing here in the Amazon Public Data Sets. By using Amazon Elastic MapReduce to process the S3-resident data in place, end users avoid the cost of transferring the corpus out of Amazon's network.
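As a rough sketch of what launching such a job looks like, the step definition below builds a custom-JAR step in the format accepted by boto3's `run_job_flow` call. The jar path, class name, and S3 prefixes are placeholders, not actual Common Crawl values; consult the Common Crawl glue code for the real entry points:

```python
def build_emr_step(jar_s3_path, main_class, input_prefix, output_prefix):
    """Build an Elastic MapReduce custom-JAR step definition in the
    dict format boto3's emr client expects in its Steps list.

    All arguments here are placeholders supplied by the caller; the
    real jar and class names come from Common Crawl's glue code.
    """
    return {
        "Name": "common-crawl-example-job",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": jar_s3_path,
            # Passed to the jar as command-line arguments.
            "Args": [main_class, input_prefix, output_prefix],
        },
    }

step = build_emr_step(
    "s3://my-bucket/my-job.jar",      # placeholder jar location
    "org.example.ExampleJob",          # placeholder main class
    "s3://my-input-prefix/",           # placeholder crawl input prefix
    "s3://my-bucket/output/")          # placeholder output location
# The resulting dict would be passed as one entry of the Steps=[...]
# argument to a boto3 emr client's run_job_flow call.
```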
To learn more about Amazon Elastic MapReduce, please see the product detail page.
Common Crawl's Hadoop classes and other code can be found in its GitHub repository.