Please sign my guestbook if you find this interesting or helpful. Thanks, Zack
The sample index was derived from the Thallason Index of Project Gutenberg because the master indices from PG itself were inconsistent. My thanks go to its maintainers for their efforts, as well as to all contributors to Project Gutenberg.
View the sample database by TITLE
View the sample database by AUTHOR
NOTE: Due to a change of server, I no longer have sufficient room to store the entire sample database on-line. My apologies.
Ideally, the (W/V) statistic allows one book's style to be compared with another's. However, this simple metric is complicated by the fact that a short work will inevitably be denser than a longer one, because nearly every word in a short work is unique. To see why, consider writing a multi-million-word essay. Since the English language has only a limited number of words (~400,000 in this sample), the author would eventually run out of new words, and the vocabulary density of such a titanic treatise would drop accordingly. This effect can be seen in the flattening trend of the scatter plots below.
Scatter plots of inverse vocabulary density (y-axis) vs. total words (x-axis). Samples below the pink trend line have denser vocabularies than average; those above, sparser. Note that the trend line fits less well for smaller works.
So that the vocabulary densities of large and small works may be compared, a 'normalizing' curve is fit to the sample, yielding a 'normalized density score' useful for comparison. Unfortunately, the one-size-fits-all trend curve (found empirically by minimizing the least mean square error of a square-root scale coefficient) fails to fit the smaller works well, as can be seen in Figure 3. Comparing large works (> 30,000 words) to smaller ones (< 30,000 words) is therefore ill-advised, and the following tables isolate these two sample groups.
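The exact form of the trend curve is not spelled out here, so the following Python sketch rests on an assumption: that inverse density is modeled as c * sqrt(W), with the single coefficient c chosen to minimize the squared error. The score definition (observed value divided by the trend value) and the names `fit_sqrt_coefficient` and `normalized_score` are likewise hypothetical.

```python
import math

def fit_sqrt_coefficient(samples):
    """Fit c in the assumed model W/V ~= c * sqrt(W) by least squares.

    samples: list of (total_words, words_per_vocab) pairs.
    Setting the derivative of sum((y - c*sqrt(W))**2) to zero gives
    c = sum(y * sqrt(W)) / sum(W).
    """
    numerator = sum(wv * math.sqrt(w) for w, wv in samples)
    denominator = sum(w for w, _ in samples)
    return numerator / denominator

def normalized_score(total_words, wv, c):
    """Ratio of a work's W/V to the trend value at its length.

    Assumed definition: below 1.0 means the work sits under the trend
    line (denser vocabulary than average), above 1.0 means sparser.
    """
    return wv / (c * math.sqrt(total_words))

# Made-up sample points for illustration only, not database values.
samples = [(10000, 5.0), (40000, 10.0), (90000, 15.0)]
c = fit_sqrt_coefficient(samples)
print(c)
print(normalized_score(40000, 8.0, c))
```

A single coefficient keeps the fit simple, but it is also the reason one curve cannot follow both the small and the large works.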
Not surprisingly, these 'Anomalous Word Summaries' paint a remarkably accurate picture of a work. For example, among Moby Dick's most anomalous words are: whale, sperm, and harpooneer. Of course, proper names tend to dominate these lists; for example, ahab, stubb, and queequeg top Moby Dick's list. Just as interesting is what the book is NOT about. Among Moby Dick's most infrequently used words (i.e. words which are common in other books, but not in this one) are: miss, government, happiness, smiled, and machine.
The Infrequently Used Summaries list only words which are actually used in the work. While it might be logical to list words that are frequently used in other books but that never show up in this book, it would be useless because such a list would be dominated by anachronistic words such as 'thou' and 'thy' that are common in the database but unused in most works.
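The scoring behind these summaries is not given here, so the Python sketch below only illustrates one plausible approach: compare each word's relative frequency in the book with its relative frequency in the whole database, and rank by the ratio. The ratio metric, the function name `anomaly_summary`, and the sample data are all assumptions. As described above, only words actually used in the book are ranked.

```python
from collections import Counter

def anomaly_summary(book_words, corpus_words, top_n=5):
    """Return (frequent, infrequent) word lists for one book.

    Assumed score: (share of the book) / (share of the corpus).
    Only words that occur in the book are scored, so words absent
    from the book can never appear in the infrequent list.
    """
    book = Counter(book_words)
    corpus = Counter(corpus_words)
    book_total = sum(book.values())
    corpus_total = sum(corpus.values())

    ratios = {}
    for word, count in book.items():
        corpus_count = corpus[word]
        if corpus_count == 0:
            continue  # corpus is assumed to include the book; skip if not
        ratios[word] = (count / book_total) / (corpus_count / corpus_total)

    # Ascending by ratio; ties are broken alphabetically.
    ranked = sorted(ratios, key=lambda w: (ratios[w], w))
    frequent = ranked[::-1][:top_n]
    infrequent = ranked[:top_n]
    return frequent, infrequent

# Toy data for illustration only.
book_words = ["whale"] * 6 + ["the"] * 3 + ["miss"] * 1
corpus_words = book_words + ["the"] * 37 + ["miss"] * 39 + ["garden"] * 14

frequent, infrequent = anomaly_summary(book_words, corpus_words, top_n=1)
print(frequent, infrequent)
```

In the toy data, 'garden' is common in the corpus but never used in the book, so it is left out of both lists.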
Misspellings significantly skew both the Infrequent and Unique Word Lists. They are fairly common due to the use of Optical Character Recognition (OCR) software, which is extremely prone to such mistakes.
The following table is a sample of Word Anomalies picked by hand from the database to illustrate the technique. To view Anomaly Summaries for any work, click on the book name in either the author index or title index.
View the index by TITLE
View the index by AUTHOR
(Click on any title to view the Anomaly Summary)
| The Bible (King James Edition); Anonymous / Various | |
| Frequent: | unto, lord, isreal, shall, god, moses, jesus, david, offering, tabernacle |
| Infrequent: | girl, boy, school, success, condition, listen, princess |
| Wonderful Wizard of Oz; Baum, Frank | |
| Frequent: | woodman, scarecrow, witch, tin, emerald, monkeys, kansas, brains, winged |
| Infrequent: | mother, money, soul, natural |
| White Fang; London, Jack | |
| Frequent: | musher, beaver, sled, dogs, cherokee, snarl |
| Infrequent: | letter, person, window, green, sweet, loved, party, paper |
| The Republic; Plato | |
| Frequent: | guardians, unjust, true, injustice, state, gymnastic, rulers, democractical |
| Infrequent: | miss, girl, boy, prince |
| Alice's Adventures In Wonderland; Carroll (C.L. Dodgson), Lewis | |
| Frequent: | gryphon, turtle, caterpiller, mock, dodo, mouse, rabbit, hedgehog |
| Infrequent: | death, country, happy, fair, common |
| Origin of the Species; Darwin, Charles | |
| Frequent: | species, varieties, subaerial, selection, sterility, plants, modification, forms, variability |
| Infrequent: | person, government, love, thinking, god, evil, fire |
| Communist Manifesto; Marx, Karl/Engels, Friedrich | |
| Frequent: | bourgeois, proletariat, communists, antagonisms, revolutionising, socialism, production, class, feudal, reactionary, exploitation, conditions, crises |
| Infrequent: | said, love, why, heart, mother, poor, felt |
| Paradise Lost; Milton, John | |
| Frequent: | wonderous, heaven, satan, dominations |
| Infrequent: | country, church, horses, sister |
| Apology; Plato | |
| Frequent: | corrupter, accusers, demigods, socrates, oracle, indictment |
| Infrequent: | she, work, morning, replied, body |
| Gargantua and Pantagruel; Rabelais, Francis | |
| Frequent: | codpiece, catchpole, ballocks, dingdong, fart, chitterlings, gymnast, arse |
| Infrequent: | smile, existence, feelings, british, professor, suffering |
| 1st Inaugural Speech; Roosevelt, Franklin Delano | |
| Frequent: | foreclosure, interdependence, uneconomical, leadership, outgo, unsolvable, values, redistribution, national, emergency |
| Infrequent: | you, her, his |
| The Jungle; Sinclair, Upton | |
| Frequent: | packingtown, packers, stockyards, fertilizer, slaughterhouses, streetcar, lituanian |
| Infrequent: | influence, village, pray, gods, example |
| 20,000 Leagues Under The Sea; Verne, Jules | |
| Frequent: | manometer, canadian, captain, frigate, harpoon, cuttlefish, submarine |
| Infrequent: | garden, justice, ladies, laughed, wife |
| Time Machine; Wells, H. G. | |
| Frequent: | psychologist, sphinx, traveller, machine, i, lever, dimension |
| Infrequent: | mother, dear, money, friends, horse, peace |
| War of the Worlds; Wells, H. G. | |
| Frequent: | martians, leatherhead, artilleryman, londonward, cylinder, pit, scullery |
| Infrequent: | love, king, truth, gentleman, joy, youth |
| Moby Dick; Melville, Herman | |
| Frequent: | whale, sperm, harpooner, pequod, leviathan, fishery |
| Infrequent: | miss, fortune, happiness, smiled, angry, enemies |