What are n-grams?
N-grams are fixed-size tuples of items. In this dataset the items are words extracted from the Google Books corpus. The n specifies the number of elements in the tuple, so a 5-gram contains five words or characters.
The n-grams in this dataset were produced by passing a sliding window over the text of books and outputting a record for each new token. For example, the sentence
The yellow dog played fetch.
would produce the following 2-grams:
["The", "yellow"]
["yellow", "dog"]
["dog", "played"]
["played", "fetch"]
["fetch", "."]
Or the following 3-grams:
["The", "yellow", "dog"]
["yellow", "dog", "played"]
["dog", "played", "fetch"]
["played", "fetch", "."]
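The sliding window described above can be sketched in a few lines of Python. This is only an illustration, not the pipeline that produced the dataset: the function name `ngrams` and the tokenization (the final period treated as its own token) are assumptions made for this example.

```python
def ngrams(tokens, n):
    """Return every window of n consecutive tokens, in order."""
    # A text with fewer than n tokens yields an empty list.
    return [tokens[i:i + n] for i in range(len(tokens) - n + 1)]


# Assumed tokenization of the example sentence.
tokens = ["The", "yellow", "dog", "played", "fetch", "."]

for gram in ngrams(tokens, 2):
    print(gram)
```

Calling `ngrams(tokens, 3)` gives the four 3-grams listed above.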
You can aggregate equivalent n-grams to find the total number of occurrences of that n-gram. This dataset contains counts of n-grams by year along three axes: total occurrences, number of pages on which they occur, and number of books in which they appear.
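As a minimal sketch of that aggregation, the following Python sums the per-year occurrence counts for each n-gram. The records are made-up values for illustration only, and the function name `total_occurrences` is hypothetical.

```python
from collections import defaultdict

# Hypothetical records: (n-gram, year, occurrences, pages, books).
records = [
    ("yellow dog", 1990, 3, 3, 2),
    ("yellow dog", 1991, 5, 4, 3),
    ("played fetch", 1991, 2, 2, 1),
]


def total_occurrences(records):
    """Sum the occurrences column across all years for each n-gram."""
    totals = defaultdict(int)
    for ngram, _year, occurrences, _pages, _books in records:
        totals[ngram] += occurrences
    return dict(totals)


totals = total_occurrences(records)
print(totals)
```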
Dataset format
There are a number of different datasets available. Each dataset is a single n-gram type (1-gram, 2-gram, etc.) for a given input corpus (such as English or Russian text).
Each dataset is stored as a single object in Amazon S3. The file is in sequence file format with block-level LZO compression. The sequence file key is the row number of the dataset stored as a LongWritable, and the value is the raw data stored as TextWritable.
The value is a tab-separated string containing the following fields:
- n-gram - The actual n-gram
- year - The year for this aggregation
- occurrences - The number of times this n-gram appeared in this year
- pages - The number of pages this n-gram appeared on in this year
- books - The number of books this n-gram appeared in during this year
The n-gram field is a space-separated representation of the tuple.
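One way to split a value into its fields is sketched below in Python. The names `NgramRecord` and `parse_record` are hypothetical, and the sample string assumes the fields are separated by tab characters as described above.

```python
from typing import NamedTuple, Tuple


class NgramRecord(NamedTuple):
    ngram: Tuple[str, ...]
    year: int
    occurrences: int
    pages: int
    books: int


def parse_record(value: str) -> NgramRecord:
    """Split one tab-separated value into typed fields."""
    ngram, year, occurrences, pages, books = value.rstrip("\n").split("\t")
    return NgramRecord(
        ngram=tuple(ngram.split(" ")),  # the n-gram itself is space-separated
        year=int(year),
        occurrences=int(occurrences),
        pages=int(pages),
        books=int(books),
    )


record = parse_record("analysis is often described as\t1991\t1\t1\t1")
print(record.ngram, record.year)
```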
Example: analysis is often described as 1991 1 1 1
Available Datasets
Not every dataset has been released yet, but those that were complete at the time of writing are available. Here are the names of the available corpora and their abbreviations.
- English One Million - eng-1M
- American English - eng-us-all
- British English - eng-gb-all
- English Fiction - eng-fiction-all
- Chinese (simplified) - chi-sim-all
- French - fre-all
- German - ger-all
- Hebrew - heb-all
- Russian - rus-all
- Spanish - spa-all
Within each corpus there are up to five datasets, representing the n-grams from length one to five. These can be found in Amazon S3 at the following location:
s3://datasets.elasticmapreduce/ngrams/books/20090715/[corpus]/[#]gram/data
For example, you can find the American English 1-grams at the following location:
s3://datasets.elasticmapreduce/ngrams/books/20090715/eng-us-all/1gram/data
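The location pattern can be filled in programmatically. The helper below is a small sketch; the function name `dataset_path` and the constant `BASE` are hypothetical, and the code only builds the string without checking that the object exists.

```python
BASE = "s3://datasets.elasticmapreduce/ngrams/books/20090715"


def dataset_path(corpus: str, n: int) -> str:
    """Build the S3 location for a corpus abbreviation and n-gram length."""
    if not 1 <= n <= 5:
        raise ValueError("n must be between 1 and 5")
    return f"{BASE}/{corpus}/{n}gram/data"


print(dataset_path("eng-us-all", 1))
```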
NOTE: These datasets are hosted in the us-east-1 region. If you process them from another region, you will be charged data transfer fees.
Dataset statistics
This table contains information about all available datasets.
| Data | Rows | Compressed Size |
|---|---|---|
| English | | |
| 1 gram | 472,764,897 | 4.8 GB |
| 2 gram | 6,626,604,215 | 65.6 GB |
| 3 gram | 23,260,642,968 | 218.1 GB |
| 4 gram | 32,262,967,656 | 293.5 GB |
| 5 gram | 24,492,478,978 | 221.5 GB |
| English One Million | | |
| 1 gram | 261,823,186 | 2.6 GB |
| 2 gram | 3,383,379,445 | 32.1 GB |
| 3 gram | 10,565,828,499 | 94.8 GB |
| 4 gram | 12,987,703,773 | 113.1 GB |
| 5 gram | 8,747,884,729 | 75.8 GB |
| American English | | |
| 1 gram | 291,639,822 | 3.0 GB |
| 2 gram | 3,923,370,881 | 38.3 GB |
| 3 gram | 12,368,376,963 | 113.9 GB |
| 4 gram | 15,118,570,841 | 135.0 GB |
| 5 gram | 10,175,161,944 | 90.2 GB |
| British English | | |
| 1 gram | 188,660,459 | 1.9 GB |
| 2 gram | 2,000,106,933 | 19.1 GB |
| 3 gram | 5,186,054,851 | 46.8 GB |
| 4 gram | 5,325,077,699 | 46.6 GB |
| 5 gram | 3,044,234,000 | 26.4 GB |
| English Fiction | | |
| 1 gram | 191,545,012 | 2.0 GB |
| 2 gram | 2,516,249,717 | 24.3 GB |
| 3 gram | 7,444,565,856 | 68.0 GB |
| 4 gram | 8,913,702,898 | 79.1 GB |
| 5 gram | 6,282,045,487 | 55.5 GB |
| Chinese | | |
| 1 gram | 7,741,178 | 0.1 GB |
| 2 gram | 209,624,705 | 2.2 GB |
| 3 gram | 701,822,863 | 7.2 GB |
| 4 gram | 672,801,944 | 6.8 GB |
| 5 gram | 325,089,783 | 3.4 GB |
| French | | |
| 1 gram | 157,551,172 | 1.6 GB |
| 2 gram | 1,501,278,596 | 14.3 GB |
| 3 gram | 4,124,079,420 | 37.3 GB |
| 4 gram | 4,659,423,581 | 41.2 GB |
| 5 gram | 3,251,347,768 | 28.8 GB |
| German | | |
| 1 gram | 243,571,225 | 2.5 GB |
| 2 gram | 1,939,436,935 | 18.3 GB |
| 3 gram | 3,417,271,319 | 30.9 GB |
| 4 gram | 2,488,516,783 | 21.9 GB |
| 5 gram | 1,015,287,248 | 8.9 GB |
| Hebrew | | |
| 1 gram | 44,400,490 | 0.5 GB |
| 2 gram | 252,069,581 | 2.4 GB |
| 3 gram | 163,471,963 | 1.5 GB |
| 4 gram | 43,778,747 | 0.4 GB |
| 5 gram | 11,088,380 | 0.1 GB |
| Russian | | |
| 1 gram | 238,494,121 | 2.5 GB |
| 2 gram | 2,030,955,601 | 20.2 GB |
| 3 gram | 2,707,065,011 | 25.8 GB |
| 4 gram | 1,716,983,092 | 16.1 GB |
| 5 gram | 800,258,450 | 7.6 GB |
| Spanish | | |
| 1 gram | 164,009,433 | 1.7 GB |
| 2 gram | 1,580,350,088 | 15.2 GB |
| 3 gram | 3,836,748,867 | 35.3 GB |
| 4 gram | 3,731,672,912 | 33.6 GB |
| 5 gram | 2,013,934,820 | 18.1 GB |