This dataset contains an 800 GB sample of the data used to power trendingtopics.org. It includes 16 months of hourly page traffic statistics for over 2.5 million Wikipedia articles (~2.5 TB uncompressed), along with the associated Wikipedia content, link graph, and metadata.
Compiled by Peter Skomoroch at Data Wrangling, LLC on Feb 12, 2010.
To mount the snapshot:
localmachine $ ec2-create-volume --snapshot snap-0c155c67 -z us-east-1a
localmachine $ ec2-attach-volume vol-ec123456 -i i-df123456 -d /dev/sdf
root@domU-XX-XX-XX-XX-XX-XX:/mnt# mkdir /mnt/wikidata
root@domU-XX-XX-XX-XX-XX-XX:/mnt# mount /dev/sdf /mnt/wikidata
Contents of the snapshot:
Like Wikipedia itself, all text content, traffic statistics, and link data are released under the Creative Commons Attribution-ShareAlike License: http://en.wikipedia.org/wiki/CC-BY-SA-3.0
wikidata/wikistats (650G)
Contains hourly Wikipedia article traffic statistics covering a 16-month period from October 1, 2008 to February 6, 2010, derived from raw anonymous logs provided by Domas Mituzas at http://dammit.lt/2007/12/10/wikipedia-page-counters/
Each log file is named with the date and time of collection: pagecounts-20090430-230000.gz
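As a minimal sketch of how the collection time can be recovered from a file name, the following assumes every file follows the pagecounts-YYYYMMDD-HHMMSS.gz pattern shown above. The function name parse_pagecounts_timestamp is hypothetical, and the time zone is not stated here, so the result is a naive datetime.

```python
import os
from datetime import datetime


def parse_pagecounts_timestamp(path):
    """Return the collection time encoded in a pagecounts file name.

    Assumes the name looks like pagecounts-YYYYMMDD-HHMMSS.gz.
    The returned datetime is naive (no time zone attached).
    """
    name = os.path.basename(path)
    stem = name.split(".")[0]                 # drop the .gz extension
    _, date_part, time_part = stem.split("-")
    return datetime.strptime(date_part + time_part, "%Y%m%d%H%M%S")


ts = parse_pagecounts_timestamp("pagecounts-20090430-230000.gz")
print(ts.isoformat())  # 2009-04-30T23:00:00
```

Sorting the file list by this timestamp gives the hourly files in chronological order.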
Each line has 4 fields: projectcode, pagename, pageviews, bytes
en Barack_Obama 997 123091092
en Barack_Obama%27s_first_100_days 8 850127
en Barack_Obama,_Jr 1 144103
en Barack_Obama,_Sr. 37 938821
en Barack_Obama_%22HOPE%22_poster 4 81005
en Barack_Obama_%22Hope%22_poster 5 102081
wikidata/wikilinks (1.1G)
Contains a Wikipedia link graph dataset provided by Henry Haselgrove.
These files contain all links between proper English-language Wikipedia pages, that is, pages in "namespace 0". This includes disambiguation pages and redirect pages.
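The four-field layout can be read with a few lines of Python. This is a sketch under stated assumptions: fields are separated by single spaces as in the sample, page names are percent-encoded UTF-8, and lines that do not have exactly four fields are skipped. The names PageCount, parse_line, and read_pagecounts are hypothetical helpers, not part of the dataset.

```python
import gzip
from typing import NamedTuple
from urllib.parse import unquote


class PageCount(NamedTuple):
    project: str     # projectcode, e.g. "en"
    page: str        # pagename, percent-decoded
    views: int       # pageviews in that hour
    bytes_sent: int  # bytes


def parse_line(line):
    """Parse one 'projectcode pagename pageviews bytes' line."""
    parts = line.split()
    if len(parts) != 4:
        raise ValueError("expected 4 fields, got %d" % len(parts))
    project, page, views, size = parts
    return PageCount(project, unquote(page), int(views), int(size))


def read_pagecounts(path):
    """Yield PageCount records from a gzipped hourly log, skipping bad lines."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as fh:
        for line in fh:
            try:
                yield parse_line(line)
            except ValueError:
                continue  # malformed line: wrong field count or non-numeric value


rec = parse_line("en Barack_Obama%27s_first_100_days 8 850127")
print(rec.page, rec.views)  # Barack_Obama's_first_100_days 8
```

Decoding the page name is optional; keeping the raw form may be preferable when joining against other files that store percent-encoded titles.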
In links-simple-sorted.txt, there is one line for each page that has links from it. The format of the lines is:
from1: to11 to12 to13 ...
from2: to21 to22 to23 ...
...
where from1 is an integer labelling a page that has links from it, and to11 to12 to13 ... are integers labelling all the pages that the page links to. To find the page title that corresponds to integer n, just look up the n-th line in the file titles-sorted.txt.
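The link format and title lookup described above can be sketched as follows. The helper names are hypothetical, the sample integers and titles are made up for illustration, and the lookup assumes that "n-th line" means 1-based line numbering in titles-sorted.txt.

```python
def parse_links_line(line):
    """Split 'from: to1 to2 ...' into (from_id, [to_ids])."""
    source, _, targets = line.partition(":")
    return int(source), [int(t) for t in targets.split()]


def load_titles(path):
    """Read titles-sorted.txt into a list, one title per line."""
    with open(path, encoding="utf-8") as fh:
        return [line.rstrip("\n") for line in fh]


def title_for(titles, n):
    """Return the title for page integer n (assumed to be a 1-based line number)."""
    return titles[n - 1]


# Made-up sample line and titles, for illustration only.
src, dsts = parse_links_line("1: 2 3 5")
titles = ["Alpha", "Beta", "Gamma", "Delta", "Epsilon"]
print(title_for(titles, src), "->", [title_for(titles, d) for d in dsts])
# Alpha -> ['Beta', 'Gamma', 'Epsilon']
```

Loading all titles into a list keeps lookups constant-time at the cost of holding the whole file in memory.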
Contains raw Wikipedia dumps along with some processed versions using data from: http://en.wikipedia.org/wiki/Wikipedia_database
See README files in the corresponding subdirectories for more details.
- Data Wrangling -