Public Data Sets

Encyclopedic
A corpus of web crawl data composed of over 5 billion web pages. This data set is freely available on Amazon S3 and is released under the Common Crawl Terms of Use.
Last Modified: Mar 17, 2014 17:51 GMT
A data set containing Google Books n-gram corpora. This data set is freely available on Amazon S3 in a Hadoop-friendly file format and is licensed under a Creative Commons Attribution 3.0 Unported License. The original data set is available from http://books.google.com/ngrams/.
Last Modified: Jan 21, 2012 2:12 AM GMT
A data dump of the basic identifying facts about every topic in Freebase.
Last Modified: Jun 24, 2011 18:08 GMT
A data dump of all the current facts and assertions in Freebase.
Last Modified: Jun 24, 2011 18:04 GMT
Contains 16 months of hourly pageview statistics for all articles in Wikipedia.
Last Modified: Sep 8, 2010 21:47 GMT
DBpedia is a community effort to extract structured information from Wikipedia and to make this information available on the Web.
Last Modified: Aug 10, 2010 15:23 GMT
A complete copy of all Wikimedia wikis, in the form of wikitext source and metadata embedded in XML.
Last Modified: Sep 29, 2009 1:09 AM GMT
Contains 7 months of hourly pageview statistics for all articles in Wikipedia.
Last Modified: Jun 10, 2009 3:29 AM GMT
Freebase is an open database of the world's information, covering millions of topics in hundreds of categories.
Last Modified: Jun 4, 2009 20:22 GMT
A processed dump of the English-language Wikipedia.
Last Modified: Jun 4, 2009 20:21 GMT
©2014, Amazon Web Services, Inc. or its affiliates. All rights reserved.