
Common Crawl
Provided by: Common Crawl, part of the AWS Open Data Sponsorship Program
This product is part of the AWS Open Data Sponsorship Program and contains data sets that are publicly available for anyone to access and use. No subscription is required. Unless specifically stated in the applicable data set documentation, data sets available through the AWS Open Data Sponsorship Program are not provided and maintained by AWS.
Description
A corpus of web crawl data composed of over 50 billion web pages.
License
This data is available for anyone to use under the Common Crawl Terms of Use.
Documentation
How to cite
Common Crawl was accessed on DATE from https://registry.opendata.aws/commoncrawl.
Update frequency
Monthly
Support information
Managed by: Common Crawl
General AWS Data Exchange support
Resources on AWS
Description: Crawl data (WARC and ARC format)
Resource type: S3 Bucket
Amazon Resource Name (ARN): arn:aws:s3:::commoncrawl
AWS Region: us-east-1
AWS CLI Access: aws s3 ls s3://commoncrawl/
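The CLI command above lists the top level of the bucket. For scripted access, a minimal Python sketch of the same listing might look like the following. It assumes the third-party boto3 package is installed and that AWS credentials are already configured in the environment. The helper names split_s3_uri and list_top_level are hypothetical and do not come from the listing; only the bucket name and region are taken from it.

```python
from typing import List, Tuple
from urllib.parse import urlparse

BUCKET = "commoncrawl"   # from the listing's ARN, arn:aws:s3:::commoncrawl
REGION = "us-east-1"     # AWS Region stated in the listing


def split_s3_uri(uri: str) -> Tuple[str, str]:
    """Split 's3://bucket/prefix' into (bucket, prefix). Hypothetical helper."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def list_top_level(uri: str = f"s3://{BUCKET}/", max_keys: int = 50) -> List[str]:
    """Return the immediate prefixes and object keys under an S3 URI.

    Roughly what `aws s3 ls <uri>` prints, limited to one page of results.
    Assumes boto3 is installed and credentials are configured.
    """
    import boto3  # third-party; imported lazily so split_s3_uri works without it

    bucket, prefix = split_s3_uri(uri)
    s3 = boto3.client("s3", region_name=REGION)
    # Delimiter="/" groups keys by their next path segment, like a directory listing.
    resp = s3.list_objects_v2(
        Bucket=bucket, Prefix=prefix, Delimiter="/", MaxKeys=max_keys
    )
    prefixes = [p["Prefix"] for p in resp.get("CommonPrefixes", [])]
    keys = [o["Key"] for o in resp.get("Contents", [])]
    return prefixes + keys
```

Calling list_top_level() with no arguments targets s3://commoncrawl/. Only the first page of results is returned, so a full listing of a large prefix would need pagination.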
Usage examples
Tutorials
- Analysing Petabytes of Websites by Mark Litwintschik
- Common Crawl Index Athena by Edward Ross
- Index to WARC Files and URLs in Columnar Format by Sebastian Nagel
- Large-scale graph mining with Spark by Win Suen
- One click to download all the web pages you may want by Jader Dias
- Search the Common Crawl Using Lambda Functions by Andres Riancho
Publications
- Asynchronous pipeline for processing huge corpora on medium to low resource infrastructures by Pedro Javier Ortiz Suárez, Benoît Sagot, Laurent Romary
- Building a Web-Scale Dependency-Parsed Corpus from CommonCrawl by Alexander Panchenko, Eugen Ruppert, Stefano Faralli, Simone Paolo Ponzetto, Chris Biemann
- C4Corpus: Multilingual Web-Size Corpus with Free License by Ivan Habernal, Omnia Zayed, Iryna Gurevych
- CC-News-En: A large English news corpus by Joel Mackenzie, Rodger Benham, Matthias Petri, Johanne R. Trippas, J. Shane Culpepper, Alistair Moffat
- CCAligned: A Massive collection of cross-lingual web-document pairs by Ahmed El-Kishky, Vishrav Chaudhary, Francisco Guzmán, Philipp Koehn
- COYO-700M: Image-text pair dataset by Minwoo Byeon, Beomhee Park, Haecheon Kim, Sungjun Lee, Woonhyuk Baek, Saehoon Kim
- Defending against neural fake news by Rowan Zellers, Ari Holtzman, Hannah Rashkin, Yonatan Bisk, Ali Farhadi, Franziska Roesner, et al.
- Index fun by Philippe Suter
- LAION-5B: An open large-scale dataset for training next generation image-text models by Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, et al.
- Language is not all you need: aligning perception with language models by Shaohan Huang, Li Dong, Wenhui Wang, Yaru Hao, Saksham Singhal, Shuming Ma, et al.
- Language models are few-shot learners by Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, et al.
- Large-scale analysis of style injection by relative path overwrite by Sajjad Arshad, Seyed Ali Mirheidari, Tobias Lauinger, Bruno Crispo, Engin Kirda, William Robertson
- LLaMA: open and efficient foundation language models by Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, et al.
- Mapping languages: The Corpus of Global Language Use by Jonathan Dunn
- mT5: A massively multilingual pre-trained text-to-text transformer by Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, et al.
- Multimodal C4: an open, billion-scale corpus of images interleaved with text by Wanrong Zhu, Jack Hessel, Anas Awadalla, Samir Yitzhak Gadre, Jesse Dodge, Alex Fang, et al.
- N-gram counts and language models from the Common Crawl by Christian Buck, Kenneth Heafield, Bas van Ooyen
- No Language Left Behind: scaling human-centered machine translation by Marta R. Costa-jussà, James Cross, Onur Çelebi, Maha Elbayad, Kenneth Heafield, Kevin Heffernan, et al.
- Of using Common Crawl to play Family Feud by Paul Masurel
- On the impact of publicly available news and information transfer to financial markets by Metod Jazbec, Barna Pásztor, Felix Faltings, Nino Antulov-Fantulin, Petter N. Kolm
- Using open data to predict market movements by DELL EMC
- Web Data Commons - RDFa, microdata, and microformat data sets by Christian Bizer, Robert Meusel, Anna Primpeli
Tools & Applications
- All Around The World: The Common Crawl Dataset - Attack Surface Research by Aliz Hammond
- CCNet: Extracting high quality monolingual datasets from web crawl data by Facebook AI Research
- Dresden Web Table Corpus (DWTC) by Database Systems Group Dresden
- GloVe: Global vectors for word representation by Jeffrey Pennington, Richard Socher, Christopher D. Manning
- Learning word vectors for 157 languages by Facebook AI Research
- Ransacking your password reset tokens by Lukas Euler
- Search the HTML across 25 billion websites for passive reconnaissance using Common Crawl by Ryan Elkins