Google Books Ngrams

A data set containing the Google Books n-gram corpora. This data set is freely available on Amazon S3 in a Hadoop-friendly file format and is licensed under a Creative Commons Attribution 3.0 Unported License. The original dataset is available from http://books.google.com/ngrams/.

Details

Submitted By: adamgray1234
Size: 2.2 TB
Source: Google Books
Created On: January 5, 2011 6:11 PM GMT
Last Updated: January 21, 2012 2:12 AM GMT
Available at: s3://datasets.elasticmapreduce/ngrams/books/

What are n-grams?

N-grams are fixed-size tuples of items. In this case the items are words extracted from the Google Books corpus. The n specifies the number of elements in the tuple, so a 5-gram contains five words.

The n-grams in this dataset were produced by passing a sliding window over the text of the books and outputting a record for each position of the window. For example, the sentence

The yellow dog played fetch.

would produce the following 2-grams:

["The", "yellow"]
["yellow", 'dog"]
["dog", "played"]
["played", "fetch"]
["fetch", "."]

Or the following 3-grams:

["The", "yellow", "dog"]
["yellow", "dog", "played"]
["dog", "played", "fetch"]
["played", "fetch", "."]

You can aggregate equivalent n-grams to find the total number of occurrences of that n-gram. This dataset contains counts of n-grams by year along three axes: total occurrences, number of pages on which they occur, and number of books in which they appear.
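As a rough sketch of the sliding-window process described above (the token list and the ngrams helper are illustrative only, not part of the dataset tooling), the 2-grams and 3-grams from the example sentence can be generated and aggregated in a few lines of Python:

from collections import Counter

def ngrams(tokens, n):
    """Slide a window of size n over a token list and yield each window as a tuple."""
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n])

tokens = ["The", "yellow", "dog", "played", "fetch", "."]

print(list(ngrams(tokens, 2)))  # the five 2-grams listed above
print(list(ngrams(tokens, 3)))  # the four 3-grams listed above

# Aggregating equivalent n-grams gives total occurrence counts,
# which is what the dataset stores per year.
counts = Counter(ngrams(tokens, 2))
print(counts[("yellow", "dog")])  # 1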

Dataset format

There are a number of different datasets available. Each one covers a single n-gram length (1-gram, 2-gram, and so on) for a given input corpus (such as English or Russian text).

Each dataset is stored as a single object in Amazon S3. The file is in Hadoop sequence file format with block-level LZO compression. The sequence file key is the row number of the dataset, stored as a LongWritable, and the value is the raw data, stored as Text.

The value is a tab-separated string containing the following fields:

  • n-gram - The actual n-gram
  • year - The year for this aggregation
  • occurrences - The number of times this n-gram appeared in this year
  • pages - The number of pages this n-gram appeared on in this year
  • books - The number of books this n-gram appeared in during this year

The n-gram field is a space-separated representation of the tuple.

Example: analysis is often described as 1991 1 1 1
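As a minimal sketch of this record layout (the NgramRecord and parse_value names are hypothetical, introduced only for this example), the tab-separated value can be split into its five fields in Python:

from collections import namedtuple

# Hypothetical helper for this example; the field names follow the list above.
NgramRecord = namedtuple("NgramRecord", ["ngram", "year", "occurrences", "pages", "books"])

def parse_value(value):
    """Split the tab-separated value of a sequence-file record into its five fields."""
    ngram, year, occurrences, pages, books = value.split("\t")
    return NgramRecord(ngram, int(year), int(occurrences), int(pages), int(books))

record = parse_value("analysis is often described as\t1991\t1\t1\t1")
print(record.ngram)  # analysis is often described as
print(record.year)   # 1991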

Available Datasets

Not all of the datasets have been released yet, but those that were complete at the time of writing are available. Here are the names of the available corpora and their abbreviations.

  • English One Million - eng-1M
  • American English - eng-us-all
  • British English - eng-gb-all
  • English Fiction - eng-fiction-all
  • Chinese (simplified) - chi-sim-all
  • French - fre-all
  • German - ger-all
  • Hebrew - heb-all
  • Russian - rus-all
  • Spanish - spa-all

Within each corpus there are up to five datasets, representing n-grams of length one through five. These can be found in Amazon S3 at the following location:

s3://datasets.elasticmapreduce/ngrams/books/20090715/[corpus]/[#]gram/data

For example, you can find the American English 1-grams at the following location:

s3://datasets.elasticmapreduce/ngrams/books/20090715/eng-us-all/1gram/data

NOTE: These datasets are hosted in the us-east-1 region. If you process them from another region, you will be charged data transfer fees.
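If you would rather read one of these datasets with PySpark than with Hive, a sketch along the following lines should work on an Amazon EMR cluster (the ngram_path helper is just for this example, and the LZO codec must be available to Spark, as it is on EMR):

from pyspark import SparkContext

def ngram_path(corpus, n):
    """Build the S3 URI for a given corpus abbreviation and n-gram length."""
    return "s3://datasets.elasticmapreduce/ngrams/books/20090715/%s/%dgram/data" % (corpus, n)

sc = SparkContext(appName="ngram-sample")

# Keys are LongWritable row numbers; values are the tab-separated Text records.
rdd = sc.sequenceFile(ngram_path("eng-us-all", 1))
print(rdd.take(5))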

Dataset statistics

The following table lists the number of rows and the compressed size of each available dataset.

Data    Rows    Compressed Size
English
1 gram 472,764,897 4.8 GB
2 gram 6,626,604,215 65.6 GB
3 gram 23,260,642,968 218.1 GB
4 gram 32,262,967,656 293.5 GB
5 gram 24,492,478,978 221.5 GB
English One Million
1 gram 261,823,186 2.6 GB
2 gram 3,383,379,445 32.1 GB
3 gram 10,565,828,499 94.8 GB
4 gram 12,987,703,773 113.1 GB
5 gram 8,747,884,729 75.8 GB
American English
1 gram 291,639,822 3.0 GB
2 gram 3,923,370,881 38.3 GB
3 gram 12,368,376,963 113.9 GB
4 gram 15,118,570,841 135.0 GB
5 gram 10,175,161,944 90.2 GB
British English
1 gram 188,660,459 1.9 GB
2 gram 2,000,106,933 19.1 GB
3 gram 5,186,054,851 46.8 GB
4 gram 5,325,077,699 46.6 GB
5 gram 3,044,234,000 26.4 GB
English Fiction
1 gram 191,545,012 2.0 GB
2 gram 2,516,249,717 24.3 GB
3 gram 7,444,565,856 68.0 GB
4 gram 8,913,702,898 79.1 GB
5 gram 6,282,045,487 55.5 GB
Chinese
1 gram 7,741,178 0.1 GB
2 gram 209,624,705 2.2 GB
3 gram 701,822,863 7.2 GB
4 gram 672,801,944 6.8 GB
5 gram 325,089,783 3.4 GB
French
1 gram 157,551,172 1.6 GB
2 gram 1,501,278,596 14.3 GB
3 gram 4,124,079,420 37.3 GB
4 gram 4,659,423,581 41.2 GB
5 gram 3,251,347,768 28.8 GB
German
1 gram 243,571,225 2.5 GB
2 gram 1,939,436,935 18.3 GB
3 gram 3,417,271,319 30.9 GB
4 gram 2,488,516,783 21.9 GB
5 gram 1,015,287,248 8.9 GB
Hebrew
1 gram 44,400,490 0.5 GB
2 gram 252,069,581 2.4 GB
3 gram 163,471,963 1.5 GB
4 gram 43,778,747 0.4 GB
5 gram 11,088,380 0.1 GB
Russian
1 gram 238,494,121 2.5 GB
2 gram 2,030,955,601 20.2 GB
3 gram 2,707,065,011 25.8 GB
4 gram 1,716,983,092 16.1 GB
5 gram 800,258,450 7.6 GB
Spanish
1 gram 164,009,433 1.7 GB
2 gram 1,580,350,088 15.2 GB
3 gram 3,836,748,867 35.3 GB
4 gram 3,731,672,912 33.6 GB
5 gram 2,013,934,820 18.1 GB

More reading

For more information on getting started with processing n-gram data, read this article about discovering trending terms by decade with Apache Hive on Amazon Elastic MapReduce.

Citation

This work is based on: Jean-Baptiste Michel, Yuan Kui Shen, Aviva Presser Aiden, Adrian Veres, Matthew K. Gray, The Google Books Team, Joseph P. Pickett, Dale Hoiberg, Dan Clancy, Peter Norvig, Jon Orwant, Steven Pinker, Martin A. Nowak, and Erez Lieberman Aiden. Quantitative Analysis of Culture Using Millions of Digitized Books. Science 331 (2011). Published online ahead of print, 12/16/2010.