The WestburyLab USENET corpus is an anonymized compilation of postings to 47,860 English-language newsgroups from 2005 to 2010.
All NNTP headers were discarded to protect the privacy of the authors; they cannot be recovered. Message bodies that had the same 128-bit SHA-1 hash as another message body were discarded, which reduces the duplication of documents caused by cross-posts. To reduce the amount of garbage data and non-English text in the corpus, the following pre-processing steps were taken:
- All documents with fewer than 500 words or more than 500,000 words were omitted.
- Documents that contained less than 90% English words were omitted. (English words were defined as words contained in a 100,000-word English dictionary.)
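The deduplication and filtering steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' actual pipeline: the function names, the regex tokenizer, and the choice to keep the first copy of each duplicated body are all assumptions, and `dictionary` stands in for the 100,000-word English dictionary, which is not specified here. Note that Python's `hashlib.sha1` produces a 160-bit digest, and the sketch simply compares full digests.

```python
import hashlib
import re

# Thresholds taken from the description above.
MIN_WORDS = 500
MAX_WORDS = 500_000
MIN_ENGLISH_FRACTION = 0.90


def tokenize(text):
    """Split text into lowercase word tokens (an assumed, simple tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def keep_document(text, dictionary, min_words=MIN_WORDS,
                  max_words=MAX_WORDS, min_english=MIN_ENGLISH_FRACTION):
    """Return True if the document passes the length and English-ratio filters."""
    words = tokenize(text)
    if not words or len(words) < min_words or len(words) > max_words:
        return False
    english = sum(1 for w in words if w in dictionary)
    return english / len(words) >= min_english


def filter_corpus(bodies, dictionary, **limits):
    """Drop repeated message bodies (by SHA-1 digest), then apply the filters.

    Assumption: the first copy of a duplicated body is kept.
    """
    seen = set()
    kept = []
    for body in bodies:
        digest = hashlib.sha1(body.encode("utf-8")).digest()
        if digest in seen:
            continue  # same body already seen, e.g. a cross-post
        seen.add(digest)
        if keep_document(body, dictionary, **limits):
            kept.append(body)
    return kept
```

Calling `filter_corpus(bodies, dictionary)` with an iterable of message bodies and a set of lowercase dictionary words returns the bodies that survive both steps; the thresholds can be overridden by keyword for experimentation.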
To anonymize the text, we also did the following:
- Replaced all of the obvious e-mail addresses with the token <EMAILADDRESS>.
- Replaced all of the obvious HTTP URLs with the token <URL>, and news URLs with <NEWSURL>.
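A rough sketch of this kind of token replacement is shown below. The regular expressions are assumptions chosen for illustration; the exact patterns used to build the corpus are not given here, and what counts as an "obvious" address or URL is a judgment call. News URLs and HTTP URLs are replaced before e-mail addresses because a `news:` URL can contain a message ID with an `@` in it.

```python
import re

# Illustrative patterns only; the corpus authors' actual patterns are unknown.
NEWS_URL_RE = re.compile(r"\bnews:(?://)?[^\s<>\"]+", re.IGNORECASE)
HTTP_URL_RE = re.compile(r"\bhttps?://[^\s<>\"]+", re.IGNORECASE)
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def anonymize(text):
    """Replace news URLs, HTTP URLs, and e-mail addresses with corpus tokens."""
    text = NEWS_URL_RE.sub("<NEWSURL>", text)   # first: may contain '@'
    text = HTTP_URL_RE.sub("<URL>", text)
    text = EMAIL_RE.sub("<EMAILADDRESS>", text)
    return text
```

For example, `anonymize("write to bob@example.com")` would yield `"write to <EMAILADDRESS>"`. A pattern this simple will also swallow punctuation that directly follows a URL, which is one reason to treat it as a sketch rather than a faithful reproduction.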
The corpus is over 40 GB in size when compressed, and is delivered as weekly bundles of about 150 MB each.
For more information about this corpus, please see the Westbury Lab web site.
Shaoul, C., & Westbury, C. (2010). A USENET corpus (2005-2010). Edmonton, AB: University of Alberta.