
Natural Language Corpus Data: Beautiful Data

This directory contains code and data to accompany the chapter "Natural Language Corpus Data" from the book Beautiful Data (Segaran and Hammerbacher, 2009). If you like this, you may also like "How to Write a Spelling Corrector."

Data files are derived from the Google Web Trillion Word Corpus, as described by Thorsten Brants and Alex Franz, and distributed by the Linguistic Data Consortium.

Code copyright (c) 2008-2009 by Peter Norvig. You are free to use this code under the MIT license.

To run this code, download the files listed below. Then, from a shell, execute python -i ngrams.py (or start a Python IDE and import ngrams). To check that everything works, call test(). Note that the hillclimbing function has a random component, so with bad luck some of the tests may fail even if everything is correctly installed. (It is unlikely that they will fail twice in a row.)

Files for Download

0.7 MB   ch14.pdf               The chapter from the book.

0.0 MB   ngrams.py              The Python code for everything in the chapter.
0.0 MB   ngrams-test.txt        Unit tests; run by the Python function test().

4.9 MB   count_1w.txt           The 1/3 million most frequent words, all lowercase, with counts. (Called vocab_common in the chapter, but I changed file names here.)
5.6 MB   count_2w.txt           The 1/4 million most frequent two-word (lowercase) bigrams, with counts.
0.0 MB   count_2l.txt           Counts for all 2-letter (lowercase) bigrams.
0.2 MB   count_3l.txt           Counts for all 3-letter (lowercase) trigrams.
0.0 MB   count_1edit.txt        Counts for all single-edit spelling correction edits, from the file spell-errors.txt.
0.5 MB   spell-errors.txt       A collection of "right: wrong1, wrong2" spelling mistakes, collected from Wikipedia and Roger Mitton.
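
The count files and spell-errors.txt listed above can be read with a few lines of Python. The following is only a minimal sketch, not code from ngrams.py: it assumes each count file holds one "key<TAB>count" entry per line (an assumption about the layout; check the downloaded files), and it relies on the "right: wrong1, wrong2" format described above. The helper names load_counts, relative_frequency, and parse_spell_errors are hypothetical, and the sample counts are made up for illustration.

```python
from collections import Counter


def load_counts(lines, sep="\t"):
    """Build a Counter from lines of the form 'key<sep>count'.

    Assumption: each line holds a key (word, bigram, or letter n-gram)
    and an integer count separated by a tab; pass another `sep` if not.
    """
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Split on the last separator so keys such as "of the" stay whole.
        key, count = line.rsplit(sep, 1)
        counts[key] += int(count)
    return counts


def relative_frequency(counts, key):
    """Return count(key) / total count, or 0.0 for an unseen key."""
    total = sum(counts.values())
    return counts[key] / total if total else 0.0


def parse_spell_errors(lines):
    """Parse 'right: wrong1, wrong2' lines into {right: [wrong, ...]}."""
    errors = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        right, wrongs = line.split(":", 1)
        errors[right.strip()] = [w.strip() for w in wrongs.split(",") if w.strip()]
    return errors


if __name__ == "__main__":
    # Made-up sample data standing in for the real files.
    sample_counts = ["the\t100\n", "of\t50\n", "and\t50\n"]
    counts = load_counts(sample_counts)
    print(relative_frequency(counts, "the"))

    sample_errors = ["raining: rainning, raning\n"]
    print(parse_spell_errors(sample_errors))
```

Because an open file iterates over its lines, a downloaded file can be passed directly, for example by opening count_1w.txt in a with statement and calling load_counts on the file object.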
 
The following files are not referenced in the chapter, but may be useful to you.
 
6.5 MB   big.txt                File of running text used in my spell correction article.
1.0 MB   smaller.txt            Excerpt of file of running text from my spell correction article. Smaller; faster to download.
0.3 MB   count_big.txt          A word count file (29,136 words) for big.txt.
1.5 MB   count_1w100k.txt       A word count file with 100,000 most popular words, all uppercase.
0.02 MB  words4.txt             4360 words of length 4 (for word games).
0.04 MB  sgb-words.txt          5757 words of length 5 (for word games) from Knuth's Stanford GraphBase.
1.1 MB   wordlist.asc           Tom Murphy's word list for portmantout words.
0.03 MB  words.js               1000 most common words of English from xkcd Simple Writer (more than 1,000 words because plurals are included).
4.3 MB   shakespeare.txt        The complete works of Shakespeare, tokenized so that there is a space between words and punctuation. From John DeNero.
4.5 MB   shakespeare_input.txt  Untokenized Shakespeare, from Andrej Karpathy.
6.2 MB   linux_input.txt        Untokenized Linux kernel C code, from Andrej Karpathy.
3.0 MB   sowpods.txt            The SOWPODS word list (267,750 words) -- used by Scrabble players (except in North America) and in other word games.
1.9 MB   TWL06.txt              The Tournament Word List (178,690 words) -- used by North American Scrabble players.
1.9 MB   enable1.txt            The ENABLE word list (172,819 words) -- also used by word game players. Words with Friends uses a variant of this.
2.7 MB   word.list              The YAWL (Yet Another Word List) word list (263,533 words) -- formed by combining the above.
(See Internet Scrabble Club for more lists.)


Peter Norvig, 8 July 2008; updated 22 Nov 2011