Why are we interested in counting words? The immediate payoff is not always clear. Many of us are familiar with what I like to call the Moby Dick is About Whales model of quantitative work, wherein we generate some kind of word-frequency chart and the most dominant words turn out to be terms that are already central to the overall story being presented.
In the case made by Moby Dick is About Whales we get words like WHALE, BOAT, CAPTAIN, SEA presented as hugely important terms. Great! There is no doubt that these terms are important to Moby Dick. However, and this is crucial: there is nothing terribly groundbreaking about discovering these words are central to the world of Moby Dick. In fact, it is nothing we couldn’t have discovered if we sat down and read the book ourselves. (Another example of this phenomenon is ‘Shakespeare’s plays are about kings and queens’, lest it sound like I am picking on the 19c Americanists.)
One of the reasons the Moby Dick is About Whales model is so popular is that both humans and computers can handle the saturation of these words. WHALE, BOAT, CAPTAIN, and SEA are indeed very high-frequency words in the novel. But so are the tiny boring words like THE, OF, I, IS, ARE, WHO, DO, FOR, ON, WITH, YOU, SHE, HE, HIS, HER, BUT, WHICH, THAT, FROM. We tend not to notice these terms so much as readers, because they serve a specific function rather than delivering specific content-driven meaning. However, these are by far the most frequent terms in any given English-language document. The overall distribution of these terms often fluctuates based on the kinds of documents we are writing or reading. We can contrast these function words – which have some sort of grammatical purpose, first and foremost – against words that have some kind of content-driven purpose. These content words are often the words that make up what a document is about, which makes them easier to keep track of and care about.
In 1935 and 1949, G. K. Zipf formulated a postulation, now famous as Zipf’s law, that within a group or corpus of documents, the frequency of any word is inversely proportional to its rank in a frequency table. Thus, the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, etc. Zipf’s law, which feels a bit hard to visualize as a reader, is often presented as a logarithmic distribution that looks like Figure 1.
Figure 1. Zipf’s law, visualized.
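For readers who want to see the frequency table behind a chart like this, here is a minimal sketch in Python. It assumes plain-text input and a very simple tokenizer (lowercase runs of letters and straight apostrophes); the function name `rank_frequency` and the sample sentence are mine, for illustration only, and are not taken from any particular tool.

```python
import re
from collections import Counter

def rank_frequency(text):
    """Return (rank, word, count) rows, most frequent word first."""
    # Assumption: a word is any run of lowercase letters or apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    return [(rank, word, count)
            for rank, (word, count) in enumerate(counts.most_common(), start=1)]

# A toy sentence, not a quotation from the novel.
sample = "the whale and the boat and the sea and the captain saw the whale"
table = rank_frequency(sample)

for rank, word, count in table[:3]:
    # If Zipf's law held exactly, rank * count would stay roughly constant.
    print(rank, word, count, rank * count)
```

On a toy sentence the product of rank and count will wobble a great deal; the pattern only starts to settle down on something the size of a novel or a corpus.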
Our very high-frequency function words are way up at the top of the y axis – they are everywhere, to the point of over-saturating. (You could go back and take a highlighter to every one of them in this essay, and you would find that the majority of the document is covered in highlighter.) As readers, writers, and speakers, we simply don’t notice them because there are too many examples of them to keep track of with our puny human minds. We do, however, pay a lot more attention to lower-frequency terms, in part because there is so much variation on the far right-hand side of this graph. These contentful words are much less likely to occur, so the variation in these terms is ultimately much more meaningful to readers. In the middle, where there is an L-shaped curve, is what I will call the sweet spot, where medium-high-frequency words start to creep into the domain of the very noticeable. Over here we have high-saliency content words like WHALES and BOATS that are hugely obvious to readers: they are approaching a high enough frequency that we notice them, but are not so high-frequency as to become basically invisible.
So, the interpretive argument that Moby Dick is About Whales is pretty dependent on words that are high-enough frequency to be noticeable to a linear reader, but low-frequency enough to be content-driven. The challenge in doing quantitative literary analysis, then, is pushing past the obvious stuff in the output. This is not to say that Moby Dick is About Whales is a waste of time – but it should be a way to guide a more complex analysis. Part of the reason it is so easy to criticize the Moby Dick is About Whales model of digital scholarship is that Moby Dick is a long book, and a rather complicated one at that, so it makes sense to want to get the big picture. But there is much more going on in this novel: there are big questions surrounding the ideas of isolation, homosociality, and self-discovery. One of the joys of working with literary language is that we have to offer interpretations of what is happening beyond the most obvious level. When we accept the Moby Dick is About Whales model of scholarship, we are accepting the C-student interpretation of the novel. Most of us who teach literature expect our students to be able to read and interpret beyond the most obvious models of what a book is ‘about’. It is fine if your biggest takeaway from reading the novel is that it is about boats and whales, because the book is indeed about boats and whales. But this is the start of the conversation, not the end of it. When it comes to word counting, we have a quite accessible way of discussing language, style, and variation.
For example, I looked up the use of ‘ship’ or ‘ships’ in Moby Dick. Ship(s) appear 607 times in total across the entire novel. ‘Boat’ or ‘boats’ appears 484 times. Now, if you were anything like me, you spent most of your school years despairing over math problems about Mary having 64 oranges and Tom having 33 apples. Who cares? Why do these people have so much fruit? But in the context of a literary analysis, there is a fascinating question to ask: why does Melville use ‘ship’ so much more frequently than ‘boat’? Doing a survey of keyword-in-context hits for both sets of terms, I observed some general patterns in their usage. ‘Ship’ is largely used as part of a phrase: “that ship”, “the ship”, “whale-ship”, whereas ‘boat’ is largely used in a way that shows possession: X’s boat, her boats, his boats, the boat, (a number of) boats, whale boat. In other words, ‘boat’ shows much more variation than ‘ship’ does in this novel.
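The two steps described here, adding up the forms of a word and then reading each hit in context, can be sketched in a few lines of Python. This is an illustration under assumptions, not the tool I used: the function names (`tokenize`, `count_forms`, `kwic`) are hypothetical, the sample sentence is invented, and this tokenizer splits hyphenated forms like ‘whale-ship’ into two words, so its totals on the full novel may not match the figures quoted above.

```python
import re

def tokenize(text):
    """Lowercase the text and keep words, allowing one internal apostrophe."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def count_forms(tokens, forms):
    """Add together the counts of several forms, e.g. 'boat' and 'boats'."""
    forms = set(forms)
    return sum(1 for token in tokens if token in forms)

def kwic(tokens, forms, window=2):
    """Return each hit with `window` words of context on either side."""
    forms = set(forms)
    lines = []
    for i, token in enumerate(tokens):
        if token in forms:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append(f"{left} [{token}] {right}")
    return lines

# An invented sentence, not a quotation from the novel.
sample = "The ship sailed on. Ahab's boat was lowered, and his boats followed the ship."
tokens = tokenize(sample)

print(count_forms(tokens, {"ship", "ships"}))
print(count_forms(tokens, {"boat", "boats"}))
for line in kwic(tokens, {"boat", "boats"}):
    print(line)
```

The counting is the easy part. The interpretive work starts when you read down the column of bracketed hits and notice what tends to sit to the left of each one.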
Here’s another example, culled from a word cloud of most frequent contentful words in Moby Dick. ‘Old’ appears 450 times; ‘new’ appears 99 times. ‘Old’ is often used as part of the phrase ‘old man’ (and to a lesser extent in reference to Ahab, who is called ‘old Ahab’). Are these the same person? That’s a research question. Meanwhile, ‘new’ largely applies to places, like New England, New Bedford, and New Zealand. Although old and new are theoretically antonyms, one references a person and the other references a place. Are they serving the same purpose? Something similar happens for ‘sea’ compared to ‘ocean’ (taken from the same word cloud): ‘sea’ appears 455 times, whereas ‘ocean’ appears 81 times. The use of ‘ocean’ is more specific, showing up in constructions like ‘the ocean’, ‘Pacific Ocean’, and ‘Indian ocean’. Meanwhile, ‘sea’ suggests more movement: a sea(-something), at sea, the sea.
One final point to make: Computers are very good at keeping track of presence and absence for us. There are around 15 chapters in Moby Dick in which nobody talks about whales at all and we discuss boats in great detail; this is something we may not notice as linear readers but it is something that computers are very good at showing us. As readers we continue to understand that the book remains about whales and boats. But at the level of less-obvious lexical variation, what does this look like? Captain Ahab appears 62 times in Moby Dick; Ahab appears 517 times, and captain appears 329 times. When does Ahab get to be called Captain Ahab? When is he just called by his first name? Who calls him by which name? Why does this matter to our understanding of the novel overall?
To develop these kinds of research questions we haven’t done anything more complicated than simple arithmetic. We did a little bit of addition (we added together the overall frequencies of ‘boat’ and ‘boats’), and then we compared the size of several groups of things against each other, by asking which pile was bigger than the others. Other than that, I simply returned to the text using a concordance’s Keyword In Context viewer to find examples of our terms and observed the ways our terms worked in practice. Computers are good at finding patterns and people are good at interpreting patterns. Moreover, without anything too much more complicated we can ask much more interesting questions. My favourite example of this is that the word ‘she’ is underused in Macbeth compared to other Shakespeare plays, a fact that my friends who study Shakespeare are always shocked by. Lady Macbeth is the most important figure! She is the driving force behind the whole play! Yes, this is true, but also nobody speaks about her in great detail until she falls ill in Act 5 Scene 1. We could have found this fact out by sitting down with highlighters and a copy of the text, but computers make this whole process so much easier, allowing us to get to more interesting questions that linear readers of a text may not always be able to observe.
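Comparing a word across texts of different lengths needs one more piece of arithmetic: dividing by the size of each text. A common convention, which I am assuming here rather than reporting as the method behind the Macbeth observation, is a rate per 1,000 words. The function name `per_thousand` and the two toy texts are hypothetical and are not lines from any play.

```python
import re

def per_thousand(text, word):
    """Occurrences of `word` per 1,000 words of `text`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return 1000 * tokens.count(word) / len(tokens)

# Toy texts of different lengths, standing in for two plays.
text_a = "she said she would go"
text_b = "he said that she would go home today"

print(per_thousand(text_a, "she"))
print(per_thousand(text_b, "she"))
```

Once each text has a rate rather than a raw count, asking which pile is bigger is a fair question even when one play is much longer than another.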
“Quantification”, as Morris Halle says, “is not everything”. And in quantifying, we must consider what we can do with the ability to look at texts in a non-linear fashion: the ability to move between close reading and a more bird’s eye view of a corpus is truly the most powerful thing counting words can offer us.
 See George K. Zipf (1935) The Psychobiology of Language. Houghton-Mifflin, and George K. Zipf (1949) Human Behavior and the Principle of Least Effort. Cambridge, Massachusetts: Addison-Wesley. The Wikipedia page is generally quite good for explaining this, too: https://en.wikipedia.org/wiki/Zipf%27s_law