The Common Crawl corpus contains web crawl data collected over eight years of web crawling. Common Crawl offers the largest and most comprehensive open repository of web crawl data on the cloud.

Amazon Web Services has made the Common Crawl Corpus freely available on Amazon S3 so that anyone may use AWS’s on-demand computing resources to perform analysis and create new products without needing to worry about the cost of storing the data or the time required to download it.

The corpus contains raw web page data, metadata extracts, and text extracts. Common Crawl releases new web crawl data on a monthly basis. Each month's dataset is available on Amazon S3 and includes raw data, metadata, and text from an average of 2 billion web pages. The data are available for access under Common Crawl's Terms of Use.

Common Crawl is a non-profit organization dedicated to contributing to the thriving commons of open data that will drive innovation, research, and education in the 21st century. By providing an open repository of web crawl data that is, in essence, a copy of the Internet, Common Crawl advances a truly open web and democratizes access to information.

Machine-scale analysis of Common Crawl data provides insight into politics, art, economics, health, popular culture, and almost every other aspect of life. Common Crawl data is used around the world by people and organizations in many fields, including academics, researchers, scientists, businesses, governments, technologists, startups, and hobbyists. Anyone may now access high-quality crawl data that was previously available only to large search engine corporations.

The Common Crawl dataset is composed of numerous individual crawls performed over a number of years. Which crawl archive to use depends on a number of factors. Early crawls, from 2008 to 2012, are available in the ARC data format and were performed over long periods of time. Since 2013, each crawl has been shorter in duration, usually about a month, and the data has been stored in the WARC file format. Starting in 2012, the crawl archives also contain metadata (WAT) and text (WET) extracts, which simplify processing of the data substantially. For full details on the file formats used, up-to-date information on the most recent crawls, and example processing code, refer to Getting Started with Common Crawl.
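
As a small illustration of how the text (WET) extracts can be processed, the sketch below iterates over the records of a locally downloaded WET file using the open-source warcio library. The file name is a placeholder, and this is only a minimal example under those assumptions, not the official processing code referenced in Getting Started with Common Crawl.

    # Minimal sketch: read extracted-text records from a WET file with warcio.
    from warcio.archiveiterator import ArchiveIterator

    wet_path = "CC-MAIN-example.warc.wet.gz"  # placeholder local file name

    with open(wet_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            # WET files store plain-text conversions of crawled pages.
            if record.rec_type == "conversion":
                uri = record.rec_headers.get_header("WARC-Target-URI")
                text = record.content_stream().read().decode("utf-8", errors="replace")
                print(uri, text[:80])

The same iteration pattern works for the raw WARC and metadata WAT files; only the record types and payloads differ.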

Source: Common Crawl Foundation
Category: Encyclopedic
Format: Web ARChive (WARC) format for recent crawls, ARC format for historical crawls
License: This data is available for anyone to use under the Common Crawl Terms of Use.
Storage Service: Amazon S3
Location: s3://commoncrawl in the US East Region
Update Frequency: Monthly
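
Because the bucket is public, it can be browsed with anonymous requests. The sketch below is a minimal example that lists the monthly crawl prefixes with boto3; the crawl-data/CC-MAIN- prefix layout is assumed here for illustration.

    # Sketch: list monthly crawl prefixes in s3://commoncrawl with unsigned requests.
    import boto3
    from botocore import UNSIGNED
    from botocore.config import Config

    s3 = boto3.client("s3", region_name="us-east-1",
                      config=Config(signature_version=UNSIGNED))

    # Returns the first page of common prefixes, e.g. crawl-data/CC-MAIN-2015-18/
    resp = s3.list_objects_v2(Bucket="commoncrawl",
                              Prefix="crawl-data/CC-MAIN-",
                              Delimiter="/")

    for prefix in resp.get("CommonPrefixes", []):
        print(prefix["Prefix"])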

Ross Fairbanks created WikiReverse to interactively query how the web uses and refers to Wikipedia articles. The dataset and codebase, which primarily uses Elastic MapReduce, have been released for others to build upon.

In Yelp’s Engineering blog post “Analyzing the Web For the Price of a Sandwich,” the Yelp team explains how they used Elastic MapReduce with Python to extract 748 million US phone numbers from two billion web pages. Most impressively, the Python code they used is only 134 lines long and cost a total of $10.60 to run.
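
Yelp's actual code is described in their post; purely to illustrate the general idea of scanning extracted page text for US phone numbers, here is a toy sketch using a regular expression. The pattern and function name are illustrative, not Yelp's implementation.

    # Toy illustration (not Yelp's code): find US-style phone numbers in page text.
    import re

    # Matches forms such as (415) 555-0199, 415-555-0199, and 415.555.0199.
    PHONE_RE = re.compile(r"\(?\b[2-9]\d{2}\)?[-.\s]\d{3}[-.\s]\d{4}\b")

    def extract_phone_numbers(page_text: str) -> set:
        """Return the distinct phone-number strings found in one page."""
        return set(PHONE_RE.findall(page_text))

    sample = "Call us at (415) 555-0199 or 212-555-0123 for reservations."
    print(extract_phone_numbers(sample))

Run over WET text records with a framework like Elastic MapReduce, this kind of per-page extraction scales out to the full corpus.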

The Web Data Commons (WDC) project has extracted numerous forms of structured data from the Common Crawl dataset and provides the extracted data for public download, supporting researchers and companies in exploiting the wealth of information that is available on the Web.

Some of the best known datasets that they provide include:

  • WDC Hyperlink Graph - with 3.5 billion nodes and 128 billion edges, this is the largest known freely available real-world graph dataset.
  • WDC Web Tables - 147 million relational web tables refined from 11 billion HTML tables extracted from the initial Common Crawl data.

The Web Data Commons team has also published its extraction framework, which makes effective use of Amazon's EC2, SQS, and S3 services to complete these large processing tasks.
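
The framework itself is documented by the WDC team; the hypothetical sketch below only illustrates the queue-based pattern it is built around, in which a queue of archive file names is consumed by worker instances that fetch each file from S3 and process it. The queue URL, key layout, and processing step are placeholders.

    # Hypothetical worker loop illustrating the SQS + S3 work-queue pattern.
    import boto3

    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/warc-files"  # placeholder
    sqs = boto3.client("sqs", region_name="us-east-1")
    s3 = boto3.client("s3", region_name="us-east-1")

    while True:
        resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=1,
                                   WaitTimeSeconds=20)
        messages = resp.get("Messages", [])
        if not messages:
            break  # queue drained, worker exits
        msg = messages[0]
        key = msg["Body"]  # e.g. an object key under crawl-data/ (placeholder)
        s3.download_file("commoncrawl", key, "/tmp/archive.warc.gz")
        # ... run the extraction over /tmp/archive.warc.gz here ...
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])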

For large amounts of raw text, generating N-gram and language models is frequently useful. Rather than recreating this work yourself, the authors behind "N-gram counts and language models from the CommonCrawl" have published their resulting datasets for anyone to use. They have also released the raw deduplicated text, split by language, which is likely useful for many other tasks.
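
For readers unfamiliar with the underlying aggregation, the toy sketch below counts word n-grams in a piece of text; it is the kind of per-document work that, done at corpus scale, produces the published counts and that the released datasets save you from redoing.

    # Toy sketch: count whitespace-tokenized word n-grams in a piece of text.
    from collections import Counter

    def ngram_counts(text: str, n: int = 3) -> Counter:
        """Count n-grams of whitespace-separated tokens."""
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    counts = ngram_counts("the quick brown fox jumps over the lazy dog", n=2)
    print(counts.most_common(3))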

Stanford’s GloVe is an unsupervised learning algorithm for obtaining word vectors for use in natural language processing tasks. The team released the source code to reproduce their work as well as models trained on 42 billion tokens and 840 billion tokens of Common Crawl data. Using their novel algorithm and the substantial dataset that Common Crawl provides, they showed results competitive with Google’s word2vec.
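
The pretrained vectors are distributed as plain-text files with one token per line followed by its vector components. A minimal loading sketch is shown below; the file name is an example, and the dimension must match the downloaded model.

    # Minimal sketch: load pretrained GloVe vectors from their plain-text format.
    import numpy as np

    def load_glove(path: str, dim: int = 300) -> dict:
        vectors = {}
        with open(path, encoding="utf-8") as f:
            for line in f:
                parts = line.rstrip().split(" ")
                # Take the last `dim` fields as the vector, so tokens that
                # themselves contain spaces are still handled.
                word = " ".join(parts[:-dim])
                vectors[word] = np.asarray(parts[-dim:], dtype=np.float32)
        return vectors

    glove = load_glove("glove.42B.300d.txt")  # example file name
    print(glove["computer"][:5])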

If you would like to show us what you can do with Common Crawl data on AWS, or would like to receive updates, please fill out the form below.

Educators, researchers and students can also apply for free credits to take advantage of Public Data Sets on AWS. If you have a research project that could take advantage of Common Crawl, you can apply for AWS Cloud Credits for Research.