This dataset contains a 150 GB sample of the data used to power trendingtopics.org. It includes a full 3 months of hourly page traffic statistics from Wikipedia (1/1/2011-3/31/2011).
Compiled by Scott C. Frase at BiggData on April 1, 2011
To mount the snapshot:
localmachine $ ec2-create-volume --snapshot snap-f57dec9a -z us-east-1a
localmachine $ ec2-attach-volume vol-ec123456 -i i-df123456 -d /dev/sdf
root@domU-XX-XX-XX-XX-XX-XX:/mnt# mkdir /mnt/wikidata
root@domU-XX-XX-XX-XX-XX-XX:/mnt# mount /dev/sdf /mnt/wikidata
Contents of the snapshot:
Like Wikipedia itself, all text content is licensed under the GNU Free Documentation License (GFDL). All statistics and link data are also licensed under the GFDL: http://www.gnu.org/copyleft/fdl.html
wikidata/wikistats (150G)
Contains hourly Wikipedia article traffic statistics covering the 3-month period from January 1, 2011 to March 31, 2011. This data is regularly logged from the Wikipedia Squid proxy by Domas Mituzas.
Each of the 2,161 log files is named with the date and time of collection in the form pagecounts-YYYYMMDD-HHMMSS.gz, for example: pagecounts-20090430-230000.gz
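As a minimal sketch of how the collection time could be recovered from a file name, the Python snippet below parses the example name shown above. The helper name parse_log_timestamp is hypothetical, and the result is a naive datetime because the time zone of the timestamps is not stated here.

```python
from datetime import datetime


def parse_log_timestamp(filename: str) -> datetime:
    """Return the collection time encoded in a pagecounts file name.

    Assumes the name follows pagecounts-YYYYMMDD-HHMMSS.gz exactly;
    any other name raises ValueError. No time zone is attached.
    """
    return datetime.strptime(filename, "pagecounts-%Y%m%d-%H%M%S.gz")


if __name__ == "__main__":
    print(parse_log_timestamp("pagecounts-20090430-230000.gz"))
    # prints 2009-04-30 23:00:00
```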
Each line has 4 space-separated fields: projectcode, pagename, pageviews, bytes
en Barack_Obama 997 123091092
en Barack_Obama%27s_first_100_days 8 850127
en Barack_Obama,_Jr 1 144103
en Barack_Obama,_Sr. 37 938821
en Barack_Obama_%22HOPE%22_poster 4 81005
en Barack_Obama_%22Hope%22_poster 5 102081
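The four-field layout above can be read with a few lines of Python. This is only a sketch under stated assumptions: the names PageCount, parse_line and read_pagecounts are hypothetical, page names are assumed to be percent-encoded (as the %27 and %22 in the sample suggest), and lines that do not have exactly four fields are skipped rather than reported.

```python
import gzip
from typing import Iterator, NamedTuple
from urllib.parse import unquote


class PageCount(NamedTuple):
    project: str  # projectcode field, e.g. "en"
    page: str     # pagename field with percent-escapes decoded
    views: int    # pageviews field
    size: int     # bytes field, kept as given in the log


def parse_line(line: str) -> PageCount:
    """Parse one log line; raises ValueError if it is malformed."""
    project, page, views, size = line.split()
    return PageCount(project, unquote(page), int(views), int(size))


def read_pagecounts(path: str) -> Iterator[PageCount]:
    """Yield records from one gzipped log file, skipping bad lines."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as fh:
        for line in fh:
            try:
                yield parse_line(line)
            except ValueError:
                continue  # assumption: malformed lines are safe to ignore


if __name__ == "__main__":
    print(parse_line("en Barack_Obama%27s_first_100_days 8 850127"))
```

For example, read_pagecounts could be pointed at a file under /mnt/wikidata/wikistats once the volume is mounted, and the records filtered on project == "en" to keep only English Wikipedia pages.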