The WestburyLab USENET corpus

The WestburyLab USENET corpus is an anonymized compilation of postings from 47,860 English-language newsgroups from 2005-2010.

Details

Submitted By: adamgray1234
US Snapshot ID (Linux/Unix): snap-c1d156aa (US West)
Size: 40GB
Source: The Usenet
Created On: November 17, 2010 1:17 AM GMT
Last Updated: November 17, 2010 1:17 AM GMT

All NNTP headers were discarded. They cannot be recovered; this was done to protect the privacy of the authors. All message bodies that had the same 128-bit SHA-1 hash as another message body were discarded, which reduces duplication of documents from cross-posts. To reduce the amount of garbage data and non-English text in the corpus, the following pre-processing steps were taken:

  • All documents shorter than 500 words or longer than 500,000 words were omitted.
  • Documents containing less than 90% English words were omitted. (English words were defined as words found in a 100,000-word English dictionary.)
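
The deduplication and filtering steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' actual pipeline: the tokenizer (`WORD_RE`), the UTF-8 encoding, the case-insensitive dictionary lookup, and all function names are hypothetical, since the description does not specify them. Note also that Python's `hashlib.sha1` produces a 160-bit digest, so the sketch does not reproduce the 128-bit hash described above exactly.

```python
import hashlib
import re

# Hypothetical tokenizer: the corpus description does not say how words were counted.
WORD_RE = re.compile(r"[A-Za-z']+")

MIN_WORDS = 500        # shorter documents are omitted
MAX_WORDS = 500_000    # longer documents are omitted
MIN_ENGLISH = 0.90     # minimum fraction of words found in the dictionary


def body_digest(body):
    """Return the SHA-1 hex digest of a message body (UTF-8 encoding assumed)."""
    return hashlib.sha1(body.encode("utf-8")).hexdigest()


def english_fraction(words, dictionary):
    """Fraction of words found in the dictionary, compared case-insensitively."""
    if not words:
        return 0.0
    hits = sum(1 for w in words if w.lower() in dictionary)
    return hits / len(words)


def keep_document(body, dictionary, min_words=MIN_WORDS,
                  max_words=MAX_WORDS, min_english=MIN_ENGLISH):
    """Apply the length filter, then the English-ratio filter."""
    words = WORD_RE.findall(body)
    if len(words) < min_words or len(words) > max_words:
        return False
    return english_fraction(words, dictionary) >= min_english


def filter_corpus(bodies, dictionary):
    """Yield message bodies that are not exact duplicates and pass both filters."""
    seen = set()
    for body in bodies:
        digest = body_digest(body)
        if digest in seen:
            continue  # identical body already seen, e.g. a cross-post
        seen.add(digest)
        if keep_document(body, dictionary):
            yield body
```

Here `dictionary` is assumed to be a set of lowercase words. Hashing catches only byte-identical bodies, so near-duplicates such as quoted replies would still pass through.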

To anonymize the text, we also did the following:

  • Replaced all of the obvious e-mail addresses with the token <EMAILADDRESS>.
  • Replaced all of the obvious HTTP URLs with the token <URL>, and news URLs with <NEWSURL>.
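
A minimal sketch of this kind of token replacement is shown below. The regular expressions are illustrative guesses at what "obvious" addresses and URLs might look like; the patterns and the `anonymize` function are assumptions, not the ones the authors used.

```python
import re

# Illustrative patterns only: the corpus description does not define "obvious".
NEWS_URL_RE = re.compile(r"\bnews:[^\s<>\"]+", re.IGNORECASE)
HTTP_URL_RE = re.compile(r"\bhttps?://[^\s<>\"]+", re.IGNORECASE)
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")


def anonymize(text):
    """Replace news URLs, HTTP URLs, and e-mail addresses with placeholder tokens."""
    # URLs are replaced first so that an '@' inside a URL
    # (for example a message-ID in a news URL) is not treated as an e-mail address.
    text = NEWS_URL_RE.sub("<NEWSURL>", text)
    text = HTTP_URL_RE.sub("<URL>", text)
    text = EMAIL_RE.sub("<EMAILADDRESS>", text)
    return text


sample = "Mail bob@example.com or see http://example.com/page and news:comp.lang.python"
print(anonymize(sample))
```

The patterns are deliberately simple: a URL followed directly by punctuation would have that punctuation absorbed into the token, which a production pipeline would probably want to handle.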

The corpus is over 40 GB in size, compressed, and is delivered as weekly bundles of about 150 MB each.

For more information about this corpus, please see the Westbury Lab Web Site.

Citation:
Shaoul, C. & Westbury, C. (2010). A USENET corpus (2005-2010). Edmonton, AB: University of Alberta.

©2014, Amazon Web Services, Inc. or its affiliates. All rights reserved.