Common Crawl is a non-profit organization dedicated to providing an open repository of web crawl data that can be accessed and analyzed by everyone.
The most recent crawl data sets include three types of files: Raw Content, Text Only, and Metadata. Data sets from before 2012 contain only Raw Content files.
For more details about the file formats and directory structure, please see this blog post.
Common Crawl provides the glue code required to launch Hadoop jobs on Amazon Elastic MapReduce against the crawl corpus, which resides here in the Amazon Public Data Sets. By using Amazon Elastic MapReduce to access the S3-resident data in place, end users avoid network transfer fees.
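In practice, reading the corpus in place means pointing a Hadoop job's input at s3:// URIs inside the public data set bucket rather than downloading files first. A minimal sketch of building such an input path follows; the bucket, prefix, and segment names used here are illustrative assumptions, not the corpus's actual layout:

```python
# Sketch: construct the s3:// URI a Hadoop job on Elastic MapReduce
# would use as an input path, so the data is read in-region on S3
# and never leaves AWS (avoiding transfer fees).
# NOTE: bucket/prefix/segment values below are hypothetical examples.

def crawl_file_uri(segment, filename,
                   bucket="aws-publicdatasets",
                   prefix="common-crawl"):
    """Return an s3:// URI suitable for a Hadoop input path."""
    return "s3://{}/{}/{}/{}".format(bucket, prefix, segment, filename)

# Example: a metadata file inside one (hypothetical) crawl segment.
uri = crawl_file_uri("segment/1341690147253", "metadata-00000")
print(uri)  # s3://aws-publicdatasets/common-crawl/segment/1341690147253/metadata-00000
```

A URI like this would be passed to the job as its input directory (for example, via the Hadoop job configuration or the Elastic MapReduce step arguments), letting the mappers stream the files directly from S3.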
To learn more about Amazon Elastic MapReduce, please see the product detail page.
Common Crawl's Hadoop classes and other code can be found in its GitHub repository.