Vocabulary Analysis of Project Gutenberg

Zachary Booth Simpson
May 2000

(c)2002 ZBS. http://www.mine-control.com/zack
Please sign my guestbook if you find this work useful.

Introduction

While reading Moby Dick in April 2000, I was astounded by Melville's enormous vocabulary. I wondered what Moby Dick's total vocabulary was and how it compared to other works. Thanks to the Project Gutenberg, an online resource for literature, (and copious spare time) I was able to download a considerable sample of works and perform a word analysis. The following are the results of this informal study, including relative vocabulary densities and anomalous word usage.

Sample Database

The works represented in this study come exclusively from the Project Gutenberg (PG). While most PG works are included, the sample is not complete; some works have been eliminated for obvious reasons (e.g. Pi to 10,000 digits) while other works were eliminated because they were malformed or unavailable. Some books in the Project Gutenberg are split into several separate volumes or, alternatively, several works are combined into one; this may affect the sample slightly, especially the Anomalous Word Charts. In some cases, I have manually combined multiple volumes into one for logical consistency.

The sample index was derived from the Thallason Index of the Project Gutenberg because the master indices from the PG itself were inconsistent. I extend my thanks to their efforts as well as to all contributors to the Project Gutenberg.

View the sample database by TITLE
View the sample database by AUTHOR

NOTE: Due to a change of server, I no longer have sufficient room to store the entire sample database on-line.  My apologies.

Total Vocabulary

'Total Vocabulary' is the number of unique words in a book. A word is defined as a run of alphabetic characters and apostrophes (so that contractions such as can't are included), compared case-insensitively; numbers and punctuation are therefore excluded. Each work is scanned in its entirety, including titles, indices, and page numbers, after eliminating the Gutenberg Preamble which prefixes each work.
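
The word definition above can be sketched in a few lines of Python. This is only an illustration, not the program used for the study: the names `tokenize` and `vocabulary_stats` are hypothetical, and the removal of the Gutenberg Preamble is assumed to have happened before the text is passed in.

```python
import re

# A word is a run of letters and apostrophes; case is ignored.
# Taken literally, this rule also keeps stray quote marks attached
# to a word (e.g. "'tis"), which a more careful scanner might trim.
WORD_RE = re.compile(r"[a-z']+")


def tokenize(text):
    """Return every word in the text, lower-cased, in reading order."""
    return WORD_RE.findall(text.lower())


def vocabulary_stats(text):
    """Return (unique_words, total_words) for a text."""
    words = tokenize(text)
    return len(set(words)), len(words)
```

For instance, `vocabulary_stats("Can't stop, won't stop 42 times!")` counts five words in total, four of them unique, and ignores the number.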

Largest Vocabularies (Regardless Of Book Size)

Title | Author | Vocabulary | Words
Decline and Fall of the Roman Empire, vol 1-6 | Gibbon, Edward | 43113 | 1543676
Roget's Thesaurus | Anonymous / Various | 39023 | 203886
Gargantua and Pantagruel | Rabelais, Francis | 25985 | 323013
1998 CIA World Factbook, The | US CIA | 24220 | 422744
Les Miserables | Hugo, Victor | 23334 | 570508
Anomalies and Curiosities of Medicine | Gould/Pyle | 22930 | 393856
Brann The Iconoclast, vol 1,10,12 | Brann, William Cowper | 22507 | 300783
Plutarch's Lives, trans by A. H. Clough | Plutarch | 20237 | 742013
History Of The Conquest Of Peru (2nd ver), The | Prescott, William H. | 19235 | 300976
Warfare of Science/Theology | White, Andrew Dickson | 19187 | 322799
Bible, Douay-Rheims Version, Challoner Revision, The | Anonymous / Various | 18559 | 1029084
Moby Dick | Melville, Herman | 17227 | 211763
Cloister and the Hearth, The | Reade, Charles | 16911 | 282120
Hackers' Dictionary of Computer Jargon, The | Anonymous / Various | 16757 | 169716
Sketches by Boz | Dickens, Charles | 16413 | 262440
Vanity Fair | Thackeray, William Makepeace | 16349 | 360049
Our Mutual Friend | Dickens, Charles | 16337 | 338266
Dombey and Son | Dickens, Charles | 16332 | 366517
Pickwick Papers, The | Dickens, Charles | 16253 | 313143
Don Quixote (tr John Ormsby) | Cervantes | 16160 | 425814
Count of Monte Cristo, The | Dumas, père, Alexandre | 16110 | 464256
Terminal Compromise/NetNovel | Schartau, Win | 15898 | 213672

Vocabulary Density

'Vocabulary Density' is a measurement of vocabulary usage in comparison to the length of the book. This ratio is expressed as the 'Inverse Absolute Vocabulary Density' and is computed by dividing the Total Words by the Unique Words (W/V). This statistic may be thought of as: 'how many words will be read on average before a new word is encountered.' For example, Moby Dick has a (W/V) score of approximately 12 -- a new word is introduced on approximately every line of the book! That is quite an accomplishment for a work that is almost a quarter of a million words long.
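
As a worked example of the (W/V) ratio, the short Python sketch below plugs in the Moby Dick counts from the Largest Vocabularies table (211763 total words, 17227 unique words). The function name `inverse_density` is made up for illustration.

```python
def inverse_density(total_words, unique_words):
    """Inverse Absolute Vocabulary Density: total words per unique word (W/V)."""
    return total_words / unique_words


# Moby Dick, using the counts listed in the Largest Vocabularies table.
moby_dick = inverse_density(211763, 17227)
print(f"{moby_dick:.2f}")  # 12.29
```

A lower (W/V) means a denser vocabulary: fewer words go by, on average, for each unique word in the book.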

Ideally, the (W/V) statistic allows comparison of one book's style to another. However, this simple metric is complicated by the fact that a short work will inevitably be denser than a longer one, because practically every word in a short work is unique. To understand, consider writing a multi-million word essay. Given that there are only a limited number of words in the English language (~400,000 in this sample), one would eventually run out of new words, and the vocabulary density of such a titanic treatise would drop accordingly. This effect can be seen in the flattening trend of the scatter plots below.
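
The length effect can be observed within a single text by recomputing (W/V) as more of the book is read. The following Python sketch is a hypothetical helper (`density_curve` is not part of the original study) that records the ratio every `step` words, using the same word definition as the Total Vocabulary section.

```python
import re


def density_curve(text, step):
    """Return (words_read, W/V) pairs sampled every `step` words."""
    words = re.findall(r"[a-z']+", text.lower())
    seen = set()   # unique words encountered so far
    curve = []
    for count, word in enumerate(words, start=1):
        seen.add(word)
        if count % step == 0:
            curve.append((count, count / len(seen)))
    return curve


# A tiny repetitive text: no new words appear in the second half,
# so W/V doubles from 1.2 to 2.4 as the text length doubles.
sample = "the cat sat on the mat " * 2
print(density_curve(sample, 6))  # [(6, 1.2), (12, 2.4)]
```

On a real book the ratio starts near 1, since almost every early word is new, and climbs as the text grows and repeats itself.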

Figure 1. 800,000 word domain
Figure 2. 100,000 word domain
Figure 3. 30,000 word domain

Scatter plots of inverse vocabulary density (y-axis) vs. total words (x-axis). Samples below the pink trend line have denser vocabularies than average; those above, sparser. Note that the trend line fits less well for smaller works.

In order that the vocabulary densities of large and small works may be compared, a 'normalizing' curve is fit to the sample, creating a 'normalized density score' useful for comparison. A negative score marks a vocabulary denser than the trend; a positive score, a sparser one. Unfortunately, the one-size-fits-all trend curve (found empirically by minimizing the least mean square error of a square-root scale coefficient) fails to fit the smaller works well, as can be seen in Figure 3. Thus, comparison of large works (> 30,000 words) to smaller ones (< 30,000) is ill-advised, and the following tables isolate these two sample groups.
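
One plausible reading of this normalization, sketched in Python, is a trend curve of the form c * sqrt(W) with the score taken as the distance of a book's (W/V) from that curve. This reading is an assumption: the exact fitting procedure is not given here, and the function names are hypothetical. The constant 0.0392 is not stated anywhere in the study; it is simply the value that approximately reproduces the published scores when back-solved from a few table rows.

```python
import math


def fit_sqrt_coefficient(samples):
    """Least-squares c for the model W/V = c * sqrt(W).

    samples: iterable of (total_words, unique_words) pairs.
    """
    numerator = 0.0
    denominator = 0.0
    for total, unique in samples:
        x = math.sqrt(total)   # predictor: square root of book length
        y = total / unique     # observed inverse density (W/V)
        numerator += x * y
        denominator += x * x
    return numerator / denominator


def normalized_density(total, unique, c):
    """Observed W/V minus the trend value; negative means denser than trend."""
    return total / unique - c * math.sqrt(total)


# Assumed constant, inferred from the published tables (not from the author).
INFERRED_C = 0.0392

# Moby Dick (211763 words, 17227 unique); the table lists -5.75.
print(round(normalized_density(211763, 17227, INFERRED_C), 1))  # -5.7
```

Fitting `fit_sqrt_coefficient` to the table rows alone would be misleading, because the tables list only the extremes of the sample; a real fit would need every work in the database.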

Most Dense Vocabularies, Normalized For Book Size. Books Over 30,000 Words

Title | Author | Vocabulary | Words | Normal Density
Decline and Fall of the Roman Empire, vol 1-6 | Gibbon, Edward | 43113 | 1543676 | -12.92
Roget's Thesaurus | Anonymous / Various | 39023 | 203886 | -12.48
Gargantua and Pantagruel | Rabelais, Francis | 25985 | 323013 | -9.86
Brann The Iconoclast, vol 1,10,12 | Brann, William Cowper | 22507 | 300783 | -8.14
1998 CIA World Factbook, The | US CIA | 24220 | 422744 | -8.04
Anomalies and Curiosities of Medicine | Gould/Pyle | 22930 | 393856 | -7.43
Hackers' Dictionary of Computer Jargon, The | Anonymous / Various | 16757 | 169716 | -6.03
History Of The Conquest Of Peru (2nd ver), The | Prescott, William H. | 19235 | 300976 | -5.87
Moby Dick | Melville, Herman | 17227 | 211763 | -5.75
Warfare of Science/Theology | White, Andrew Dickson | 19187 | 322799 | -5.46
Poems And Songs Of Robert Burns | Burns, Robert | 14968 | 129551 | -5.46
Les Miserables | Hugo, Victor | 23334 | 570508 | -5.17
Travels through France & Italy | Smollett, Tobias | 14625 | 142922 | -5.05
Waverley | Scott, Walter | 15325 | 185273 | -4.79
Tracks of a Rolling Stone | Coke, Henry J. | 13259 | 106525 | -4.77
Terminal Compromise/NetNovel | Schartau, Win | 15898 | 213672 | -4.69
Main Street | Lewis, Sinclair | 14580 | 169912 | -4.51
Sketch-Book of Geoffrey Crayon, The | Irving, Washington | 13241 | 129907 | -4.32
Devil's Dictionary, The | Bierce, Ambrose | 11172 | 60906 | -4.23
Roads of Destiny | O'Henry | 12196 | 98952 | -4.22
Leaves of Grass | Whitman, Walt | 12924 | 124036 | -4.21
Lucasta Poems, The | Lovelace, Richard | 11153 | 62900 | -4.20

Least Dense Vocabularies, Normalized For Book Size. Books Over 30,000 Words

Title | Author | Vocabulary | Words | Normal Density
Book of Mormon, The | Anonymous / Various | 5612 | 275887 | 28.56
Bible, Both Testaments, King James Version, The | Anonymous / Various | 12867 | 790126 | 26.55
Le Morte D'Arthur, vol 2 | Malory, Thomas | 5717 | 194249 | 16.69
Bible, Douay-Rheims Version, Challoner Revision, The | Anonymous / Various | 18559 | 1029084 | 15.67
High History of the Holy Graal, The | Anonymous / Various | 5327 | 158488 | 14.14
Le Morte D'Arthur, vol 1 | Malory, Thomas | 5826 | 169703 | 12.97
Treaty of the European Union [Maastricht], The | Anonymous / Various | 2826 | 59469 | 11.48
Nada the Lily | Haggard, H. Rider | 5040 | 117857 | 9.92
White Knight: Tirant Lo Blanc (tr R.S. Rudder), The | Martorell, Joanot | 6343 | 161871 | 9.74
Story of Burnt Njal (Njal's Saga) Icelandic, The | Anonymous / Various | 5468 | 129135 | 9.52
Moll Flanders | Defoe, Daniel | 6139 | 139300 | 8.05
Heimskringla [Norwegian Kings] | Sturlson, Snorri | 10405 | 306474 | 7.74
Twilight Land | Pyle, Ernie Howard | 4113 | 74003 | 7.32
First Book of Adam and Eve | Platt, Rutherford | 2287 | 32820 | 7.25
On the Origin of Species | Darwin, Charles | 6993 | 155549 | 6.78
Personal Memoirs of U.S. Grant, vol 2 | Grant, Ulysses S. | 6965 | 154177 | 6.74
Two Years in the Forbidden City | Der Ling, Princess | 4962 | 92456 | 6.71
United States Copyright Act of 1976, The | Anonymous / Various | 2271 | 30635 | 6.63
Princess of Cleves, The | Lafayette, Madame de | 3779 | 61809 | 6.61
Emma | Austen, Jane | 7228 | 161099 | 6.55
Flower Fables | Alcott, Louisa May | 2501 | 34525 | 6.52
Parmenides | Plato | 2616 | 36337 | 6.41

Most Dense Vocabularies, Normalized For Book Size. Books Under 30,000 Words

Title | Author | Vocabulary | Words | Normal Density
Biog Study of A. W. Kinglake | Tcikwell, Rev. W. | 6794 | 29001 | -2.41
Waifs and Strays, etc | O'Henry | 5826 | 29482 | -1.67
50 Bab Ballads (vol 1) | Gilbert, W.S. | 5689 | 28588 | -1.61
Style | Raleigh, Walter | 5385 | 24331 | -1.60
Cicero's Orations [selected orations in Latin] | Cicero | 4525 | 13219 | -1.59
New Poems | Thompson, Francis | 5392 | 25151 | -1.55
Chita: A Memory of Last Island | Hearn, Lafcadio | 5495 | 26874 | -1.54
Poems | Henley, William E. | 5301 | 24303 | -1.53
Georgics [English], The | Virgil | 5089 | 21668 | -1.51
Letters on Literature | Lang, Andrew | 5550 | 29479 | -1.42
Shelley | Waterlow, Sydney | 4907 | 21390 | -1.38
Sword Blades and Poppy Seed | Lowell, Amy | 5255 | 26996 | -1.31
Bab Ballads, vol 2, The | Gilbert, W.S. | 4769 | 20582 | -1.31
Who Was Who: 5000 B. C. to Date | Gordon, Irwin L. | 4802 | 21807 | -1.25
Ginx's Baby, A Satire | Jenkins, Edward | 5370 | 29763 | -1.22
Bab Ballads, vol 3, The | Gilbert, W.S. | 4853 | 23153 | -1.20
Foolish Dictionary, The | Wurdz, Gideon | 3826 | 11615 | -1.19
Lays of Ancient Rome | Macaulay, Thomas Babbington | 4987 | 25043 | -1.18
Essay on Comedy, Comic Spirit | Meredith, George | 4344 | 17204 | -1.18
Reginald in Russia and Other Sketches | Saki (H.H. Munro) | 4715 | 22184 | -1.14
Philobiblon of Richard de Bury, The | Bury, Richard de | 4921 | 24906 | -1.13
Reading of Life, and Other Poems, A | Meredith, George | 3878 | 12990 | -1.12

Least Dense Vocabularies, Normalized For Book Size. Books Under 30,000 Words

Title | Author | Vocabulary | Words | Normal Density
New McGuffey First Reader, The | McGuffey (compiler), W.H. | 630 | 8276 | 9.57
Ethics, part 2 (tr Elwes) | Spinoza, Benedict de | 1485 | 18314 | 7.03
Ethics, part 3 (tr Elwes) | Spinoza, Benedict de | 1866 | 22877 | 6.33
Somebody's Little Girl | Young, Martha | 983 | 9795 | 6.08
Ethics, part 1 (tr Elwes) | Spinoza, Benedict de | 1422 | 14046 | 5.23
Berne Universal Copyright Convention [1988], The | Anonymous / Various | 967 | 8023 | 4.78
Adventures of Reddy Fox | Burgess, Thornton W. | 1572 | 14948 | 4.71
Ethics, part 5 (tr Elwes) | Spinoza, Benedict de | 1269 | 10805 | 4.44
Well of the Saints, The | Synge, J.M. | 1695 | 15540 | 4.28
Lady Windermere's Fan | Wilde, Oscar | 2056 | 19942 | 4.16
Organic Syntheses | Conant (Editor), James Bryant | 2202 | 21695 | 4.08
Alice's Adventures in Wonderland | Carroll (C.L. Dodgson), Lewis | 2649 | 27785 | 3.95
White People, The | Burnett, Frances Hodgson | 2262 | 21593 | 3.78
Dreams | Schreiner, Olive | 2137 | 19817 | 3.75
True Story of Christopher Columbus, The | Brooks, Elbridge S. | 2805 | 29141 | 3.69
Woman of No Importance, A | Wilde, Oscar | 2374 | 22496 | 3.59
Meno, second part | Plato | 933 | 6200 | 3.56
Tom Sawyer Detective | Twain (Samuel Clemens), Mark | 2475 | 23467 | 3.47
Rosmersholm | Ibsen, Henrik | 2815 | 28053 | 3.40
Ballad of Reading Gaol | Wilde, Oscar | 1196 | 8287 | 3.36
Deirdre of the Sorrows | Synge, J.M. | 1968 | 16415 | 3.32
Story of Doctor Dolittle, The | Lofting, Hugh | 2759 | 26835 | 3.30

Word Anomalies

It would be interesting to know, for a given book, which words are used uncommonly often or, likewise, uncommonly infrequently. To compute this, the relative frequency of each word is sampled from the database at large and then compared to its frequency in each book.

Not surprisingly, these 'Anomalous Word Summaries' paint an incredibly accurate picture of the work. For example, among Moby Dick's most anomalous words are: whale, sperm, and harpooneer. Of course, proper names tend to dominate these lists; for example, ahab, stubb, and queequeg top Moby Dick's list. Just as interesting is what the book is NOT about. Among Moby Dick's most infrequently used words (i.e. words which are common in other books, but not in this one) are: miss, government, happiness, smiled, and machine.

The Infrequently Used Summaries list only words which are actually used in the work. While it might be logical to list words that are frequently used in other books but that never show up in this book, it would be useless because such a list would be dominated by anachronistic words such as 'thou' and 'thy' that are common in the database but unused in most works.
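
The comparison can be sketched in Python as follows. How the two frequencies are actually compared is not specified here, so the ratio of a word's relative frequency in the book to its relative frequency in the whole database is an assumption, and the names `anomaly_ratios` and `most_and_least` are hypothetical. As described above, only words that actually occur in the book are scored.

```python
from collections import Counter


def anomaly_ratios(book_words, corpus_counts, corpus_total):
    """Map each word in the book to (book frequency) / (corpus frequency).

    book_words:    list of words in the book
    corpus_counts: dict of word -> count over the whole database
    corpus_total:  total number of words in the whole database
    """
    book_counts = Counter(book_words)
    book_total = len(book_words)
    ratios = {}
    for word, count in book_counts.items():
        corpus_count = corpus_counts.get(word, 0)
        if corpus_count == 0:
            continue  # no corpus frequency to compare against
        book_freq = count / book_total
        corpus_freq = corpus_count / corpus_total
        ratios[word] = book_freq / corpus_freq
    return ratios


def most_and_least(ratios, k):
    """Return (k most over-used words, k most under-used words)."""
    ranked = sorted(ratios, key=ratios.get)  # ascending by ratio
    return ranked[-k:][::-1], ranked[:k]
```

A ratio well above 1 marks a word the book uses unusually often; a ratio well below 1 marks a word that is common elsewhere but rare in this book. Words such as 'thou' that never appear in the book receive no score at all, which matches the restriction described above.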

Misspellings significantly skew both the Infrequent and Unique Word Lists and are fairly common due to the use of Optical Character Recognition (OCR) software, which is extremely prone to such mistakes.

The following table is a sample of Word Anomalies picked by hand from the database to illustrate the technique. To view Anomaly Summaries for any work, click on the book name in either the author index or title index.

View the index by TITLE
View the index by AUTHOR
(Click on any title to view the Anomaly Summary)

Sample of Word Anomalies

The Bible (King James Edition); Anonymous / Various
Frequent: unto, lord, isreal, shall, god, moses, jesus, david, offering, tabernacle
Infrequent: girl, boy, school, success, condition, listen, princess
Wonderful Wizard of Oz; Baum, Frank
Frequent: woodman, scarecrow, witch, tin, emerald, monkeys, kansas, brains, winged
Infrequent: mother, money, soul, natural
White Fang; London, Jack
Frequent: musher, beaver, sled, dogs, cherokee, snarl
Infrequent: letter, person, window, green, sweet, loved, party, paper
The Republic; Plato
Frequent: guardians, unjust, true, injustice, state, gymnastic, rulers, democractical
Infrequent: miss, girl, boy, prince
Alice's Adventures In Wonderland; Carroll (C.L. Dodgson), Lewis
Frequent: gryphon, turtle, caterpiller, mock, dodo, mouse, rabbit, hedgehog
Infrequent: death, country, happy, fair, common
On the Origin of Species; Darwin, Charles
Frequent: species, varieties, subaerial, selection, sterility, plants, modification, forms, variability
Infrequent: person, government, love, thinking, god, evil, fire
Communist Manifesto; Marx, Karl/Engels, Friedrich
Frequent: bourgeois, proletariat, communists, antagonisms, revolutionising, socialism, production, class, feudal, reactionary, exploitation, conditions, crises
Infrequent: said, love, why, heart, mother, poor, felt
Paradise Lost; Milton, John
Frequent: wonderous, heaven, satan, dominations
Infrequent: country, church, horses, sister
Apology; Plato
Frequent: corrupter, accusers, demigods, socrates, oracle, indictment
Infrequent: she, work, morning, replied, body
Gargantua and Pantagruel; Rabelais, Francis
Frequent: codpiece, catchpole, ballocks, dingdong, fart, chitterlings, gymnast, arse
Infrequent: smile, existence, feelings, british, professor, suffering
1st Inaugural Speech; Roosevelt, Franklin Delano
Frequent: foreclosure, interdependence, uneconomical, leadership, outgo, unsolvable, values, redistribution, national, emergency
Infrequent: you, her, his
The Jungle; Sinclair, Upton
Frequent: packingtown, packers, stockyards, fertilizer, slaughterhouses, streetcar, lituanian
Infrequent: influence, village, pray, gods, example
20,000 Leagues Under The Sea; Verne, Jules
Frequent: manometer, canadian, captain, frigate, harpoon, cuttlefish, submarine
Infrequent: garden, justice, ladies, laughed, wife
Time Machine; Wells, H. G.
Frequent: psychologist, sphinx, traveller, machine, i, lever, dimension
Infrequent: mother, dear, money, friends, horse, peace
War of the Worlds; Wells, H. G.
Frequent: martians, leatherhead, artilleryman, londonward, cylinder, pit, scullery
Infrequent: love, king, truth, gentleman, joy, youth
Moby Dick; Melville, Herman
Frequent: whale, sperm, harpooner, pequod, leviathan, fishery
Infrequent: miss, fortune, happiness, smiled, angry, enemies