counting

Of time, of numbers and due course of things

[This is the text, more or less, of a paper I presented to the audience of the Scottish Digital Humanities Network’s “Getting Started In Digital Humanities” meeting in Edinburgh on 9 June 2014. You can view my slides here (pdf)]

Computers help me ask questions in ways that are much more difficult to achieve as a reader. This may sound obvious: reading a full corpus of plays, or really any text, takes time, and by the time I have closely read all of them, I will either not have noticed the minutiae of all the texts or not have remembered some of them. Here, for example, is J. O. Halliwell-Phillipps’s The Works of William Shakespeare; the Text Formed from a New Collation of the Early Editions: to which are Added, All the Original Novels and Tales, on which the Plays are Founded; Copious Archæological Annotations on Each Play; an Essay on the Formation of the Text: and a Life of the Poet, which takes up quite a bit of space on a shelf:
IMG_20140604_160953

This isn’t a criticism, nor is it an excuse for not reading; it just means that humans are not designed to remember the minutiae of collections of words. We remember the thematic aboutness of them, but perhaps not always the smaller details. Having closely read all these plays (though not in this particular edition: I have read the Arden editions, which were much more difficult to stick on one imposing looking shelf), I remember what they were about, but perhaps not at the level of minutiae I might want to have. So today I’m going to illustrate how I might go from sixteen volumes of Shakespeare to a highly specific research question, and to do that, I’m going to start with a calculator.

A calculator is admittedly a rather old and rather simple piece of technology; it’s one that is not particularly impressive now that we have cluster servers that can crunch thousands of data points for us, but it remains useful nonetheless. Without using technology which is more advanced than our humble calculator, I’m going to show how the simple task of counting and a little bit of basic arithmetic can raise some really interesting questions. Straightforward counting is starting to get a bit of a bad rap in digital humanities discourse (cf Jockers and Mimno 2013, 3 and Goldstone and Underwood 2014, 3-4): yes, we can count, but that is simple. We can also complicate this process with calculation and get even more exciting results! This is, of course, true, and provides many new insights to texts which were otherwise unobtainable. Eventually today I will get to more advanced calculation, but for now, let’s stay simple and count some things.

Except that counting is not actually all that simple: decisions have to be made about what to count and how to decide what to count, and then how you are going to do that. I happen to be interested in gender, which I think is one of the more quantifiable social identity variables in textual objects, though it certainly isn’t the only one. Let’s say I wanted to find three historically relevant gendered noun binaries for Shakespeare’s corpus. Looking at the historical thesaurus of the OED for historical contexts, I can decide on lord/lady, man/woman, and knave/wench, as they show a range of formalities (higher – neutral – lower) and these terms are arguably semantically equivalent. The first question I would have is “how often do these terms actually appear in 38 Shakespeare plays?”

shx minus node words pie chart

Turns out the answer is “not much”: they are right up there in the little red sliver there. My immediate next question would be “what makes up the rest of this chart?” The obvious answer is, of course, that it covers everything that is not our node words in Shakespeare. However, there are two main categories of words contained therein: the frequency of function words (those tiny boring words that make up much of language) and the frequency of content words (words that make up what each play is about). We have answers, but instantly I have another question: what does the breakdown of that little red sliver look like?

This next chart shows the frequency of both the singular and the plural form of each node word, in total, for all 38 Shakespeare plays. There are two instantly noticeable things in this chart: first, the male terms are far more frequent than the female terms, and second, wench is not used very much (though we may think of wench as being a rather historical term).
individual node word plurals in Shakespeare (full)

There are more male characters than female characters in Shakespeare – by quite a large margin, regardless of how you choose to divide up gender – but surely they are talking about female characters (as they are the driving force of these plays: either a male character wants to marry or kill a female character). This is not to say that male and female characters won’t talk to each other; there just happen to be a lot more male characters. Biber and Burges (2000) have noted that in 19th century plays, male to male talk is more frequent than male to female talk (and female to female talk). I am not going to claim this is true here, but it seems to be a suggestive model, as male characters dominate speech quantities in the plays. There are lots of questions we can keep asking from this point, and I will return to some of them later, but I want to ask a bigger question: how does Shakespeare’s use of these binaries compare to a larger corpus of his contemporaries, 1512-1662?

It is worth noting that this corpus contains 332 plays, even though it is called the 400 play corpus; some things, I suppose, sound better when rounded up. These terms are still countable, though, and we see a rather different graph for this corpus:
400 play corpus full node words frequencies

The 400 play corpus includes Shakespeare, so we are now comparing Shakespeare to himself and 54 other dramatists.[1] The male nouns are noticeably more frequent than the female nouns, which suggests that maybe the proportion of male to female characters we saw in Shakespeare holds here too. Interestingly, lord is less frequent than man, which is the opposite of what we saw previously. The y axis is different for this graph, as this is a much larger corpus than Shakespeare’s, but it seems like the female nouns are consistent.

One glaring problem with this comparison is that I am looking at two different-sized objects. A corpus of 332 plays is going to be, generally speaking, larger than a corpus of 38 plays.[2] McEnery and Wilson note that comparisons of corpora often require adjustment: “it is necessary in those cases to normalize the data to some proportion […] Proportional statistics are a better approach to presenting frequencies” (2003, 83). When creating proportions, Adam Kilgarriff notes “the thousands or millions cancel out when we do the division, it makes no difference whether we use thousands or millions” (2009, 1), which follows McEnery and Wilson’s assertion that “it is not crucial which option is selected” (2003, 84). For my proportions, I choose parts per million.
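As a rough illustration of what "parts per million" means in practice, here is a minimal Python sketch. The function name `per_million` and every number in it are made up for the example; they are not the counts behind my graphs.

```python
def per_million(raw_count: int, corpus_size: int) -> float:
    """Relative frequency of a term, scaled to occurrences per million words."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return raw_count / corpus_size * 1_000_000


# Hypothetical figures, for illustration only.
smaller = per_million(raw_count=3_000, corpus_size=800_000)
larger = per_million(raw_count=15_000, corpus_size=7_000_000)

print(f"smaller corpus: {smaller:.1f} per million")
print(f"larger corpus:  {larger:.1f} per million")
```

In this invented case the larger corpus has five times the raw count but the lower relative frequency, which is exactly the kind of thing raw counts hide.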
Shakespeare from Martin's corpus 12.16.10 and Martin's Corpus, normalized plural node words graphed
Shakespeare is rather massively overusing lord in his plays compared to his contemporaries, but he is also underusing the female nouns compared to contemporaries. Now we have a few research questions to address, all of which are very interesting:

  • Why does Shakespeare use lord so much more than the rest of Early Modern dramatists?
  • Why do the rest of Early Modern dramatists use wench so much more than Shakespeare?
  • Why is lady more frequent than woman overall in both corpora?

I’m not going to be able to answer all of these today, but let’s talk a little bit about lord. This is a pretty noticeable difference for a term which seems pretty typical of Early Modern drama, which is full of noblemen. If I had to guess, I would say that lord might be more frequent in history plays compared to the tragedies or the comedies. I say this because as a reader I know there are most definitely noblemen, and probably defined as such, in these plays.

So what if we remove the histories from Shakespeare’s corpus, count everything up again, and make a new graph comparing Shakespeare minus the histories to all of Shakespeare? By removing the history plays it is possible to see how Shakespeare’s history plays as a unit compare to his comedy & tragedy plays as a unit. [3]
Shakespeare minus histories compared to shakespeare with histories per million
Female nouns fare better in Shakespeare Without Histories than in Shakespeare Overall, possibly because the female characters are more directly involved in the action of tragedies and comedies than they are in histories (though we know the Henry 4 plays are an exception to that), so that is perhaps not all that interesting. What is interesting, though, is the difference between lord in Shakespeare Without Histories and Shakespeare With Histories. What is going on in the histories? How do Shakespeare’s histories compare to all histories in the 400 play corpus?
history plays, shx vs history plays from 400 play corpus
Now we have even more questions, especially “what on earth is going on with lord in Shakespeare” and “why is wench more frequent in all of the histories?” I’m going to leave the wench question for now, though: not because it’s uninteresting but because it is less noticeable compared to what I’ve been motioning at with lord, which is clearly showing some kind of generic variation.

Remember, we haven’t done anything more complex than counting and a little bit of arithmetic yet, and we have already created a number of questions to address. Now we can create an admittedly low-tech visualization of where in the history plays these terms show up: each black line is one instance, and you read these from left to right (‘start’ to ‘finish’):
Screen shot 2014-06-06 at 4.30.36
And now I instantly have more questions (why are there entire sections of plays without lord? Why do they cluster only in what clearly are certain scenes? etc) but what looks most interesting to me is King John, which has the fewest examples. On a first glance, King John and Richard 3 appear to be outliers (that is, very noticeably different from the others: 42 instances vs 236 instances). Having read King John, I know that there are definitely nobles in the play: King John, King Philip, the Earls of Salisbury, Pembroke, Essex and the excellently named Lord Bigot. And, again, having read the play I know that it is about the relationships between fathers, mothers and brothers – the play centers around Philip the Bastard’s claim to the throne – and also is about the political relationship (or lack thereof) between France and England. From a reader’s perspective, none of that is particularly thematically unique to this play compared to the rest of Shakespeare’s history plays, though.

I can now test my reader’s perspective using a statistical measure of keyness called log likelihood, which asks which words are more or less likely to appear in an analysis text compared to a larger corpus. This process will provide us with words which are positively and negatively ranked overall with a ranking of statistical significance (more stars means more statistically significant). Now I am asking the computer to compare King John to all of Shakespeare’s histories. I have excluded names from this analysis, as a reader definitely knows hubert arthur robert philip faulconbridge geoffrey are in this play without the help of the computer.
Screen shot 2014-06-03 at 10.20.23
However, you can see that the absence of lord in King John is highly statistically significant (marked with four *s, compared to others with fewer *s). Now, we saw this already with the line plots, though it is nice to know that this is in fact one of the most significant differences between King John and the rest of the histories.
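For anyone curious what is happening behind a keyness table like this, here is a minimal Python sketch of a log-likelihood comparison. I am assuming the two-term formulation that is common in corpus linguistics, and I am assuming that the stars map onto the usual chi-square cut-offs for one degree of freedom (3.84, 6.63, 10.83, 15.13); the software that produced the screenshot may differ in detail. The function names and all of the counts are hypothetical.

```python
import math


def log_likelihood(a: int, b: int, c: int, d: int) -> float:
    """G2 for a word seen a times in an analysis text of c words
    and b times in a reference corpus of d words."""
    total = a + b
    expected_analysis = c * total / (c + d)
    expected_reference = d * total / (c + d)
    g2 = 0.0
    if a > 0:  # a zero count contributes nothing to the sum
        g2 += a * math.log(a / expected_analysis)
    if b > 0:
        g2 += b * math.log(b / expected_reference)
    return 2 * g2


def stars(g2: float) -> str:
    """Assumed mapping from G2 to significance stars (chi-square, 1 d.f.)."""
    for cutoff, mark in [(15.13, "****"), (10.83, "***"), (6.63, "**"), (3.84, "*")]:
        if g2 >= cutoff:
            return mark
    return ""


def direction(a: int, b: int, c: int, d: int) -> str:
    """'+' if the word is relatively more frequent in the analysis text, else '-'."""
    return "+" if a / c > b / d else "-"


# Hypothetical counts: a word that is rare in one play, common elsewhere.
a, c = 40, 20_000       # analysis text
b, d = 2_000, 230_000   # reference corpus
g2 = log_likelihood(a, b, c, d)
print(direction(a, b, c, d), round(g2, 1), stars(g2))
```

The sign is worked out separately because the statistic itself only says how surprising the difference is, not which way it goes.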

All of this is nice, and very interesting, as it is something we might not have ever noticed as a reader: because it is a history play with lords in it, it is rather safe to assume that it will contain the word lord more often than it actually does. Revisiting E.A.J. Honigmann’s notes on his Arden edition of King John, there have been contentions about the use of king in the First Folio (2007, xxxiii-xliii), most notably around the confusions surrounding King Lewis, King Philip and King John all labeled as ‘king’ in the Folio (see xxxiv-xxxvii for evidence). But none of this is answering our question about lord’s absence. So what is going on with lord? We can identify patterns with a concordancer, and we get a number of my lords:
Screen shot 2014-06-03 at 10.37.59
This is looking like a fairly frequent construction: we might want to see what other words are likely to appear near lord in Shakespeare overall: is my one of them? As readers, we might not notice how often these two words appear together. I should stress that we still have not answered our initial question about lord in King John, though we are trying to.

To check this, we can use the Dice coefficient, a measure of how likely one lemma (word) is to appear next to another lemma (word) in a corpus. It is the harmonic mean of two conditional probabilities: P(w2|w1), the probability that the second word in the bigram appears given the first, and P(w1|w2), the probability that the first word appears given the second. The relationship is computed on a scale from 0 to 1: 0 means there is no relationship; 1 means the two words always appear together. With this information, you can then show which words are uniquely likely to appear near lord in Shakespeare and contrast that to the kinds of words which are uniquely likely to appear next to lady – and again for the other binaries as well. Interestingly, my only shows up with lord!
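To make the arithmetic concrete, here is a minimal Python sketch. It assumes the usual way of writing the Dice coefficient, 2·f(w1,w2) / (f(w1) + f(w2)), and it only looks at words that are directly adjacent; the software I used works on lemmata and may define its windows differently. The token list is invented.

```python
from collections import Counter


def dice(tokens: list[str], w1: str, w2: str) -> float:
    """Dice coefficient for the adjacent bigram (w1, w2) in a token list."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    denominator = unigrams[w1] + unigrams[w2]
    if denominator == 0:
        return 0.0
    return 2 * bigrams[(w1, w2)] / denominator


# Invented sample, not a quotation from any play.
tokens = "my lord i thank you my lord and my lady my good lord".split()

print(round(dice(tokens, "my", "lord"), 3))
print(round(dice(tokens, "my", "lady"), 3))
```

In this toy sample my sits directly before lord twice and before lady once, so the first score comes out higher than the second.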

Screen shot 2014-06-03 at 10.49.51

This is good, because it shows that lord does indeed behave very differently from our other node words in Shakespeare’s corpus, which suggests that something highly specific is going on with it. However, I’m still not sure what is happening with lord in King John. Why are there so few instances of it?

Presumably if there is an absence of one word or concept, there will be more of a presence of a second word or concept. One such example might be king, but the log-likelihood analysis shows that it is the word this which is comparatively more frequent in King John than in the rest of Shakespeare’s histories (note the second entry on this list)
Screen shot 2014-06-03 at 10.20.23

Now we have two questions: why is lord so absent, and why is the word this so present? From here I might go back to our concordance plot visualizations, but the second question is addressable at the level of grammar: this is a demonstrative pronoun, which Jonathan Hope defines in Shakespeare’s Grammar as “distinguish[ing] number (this/these) and distance (this/these = close; that/those = distant). Distance may be spatial or temporal (for example ‘these days’ and ‘those days’)” (Hope 2003, 24). Now we have a much more nuanced question to address, which a reader would never have noticed: Does King John use abstract, demonstrative pronouns to make up for a lack of the concrete content word lord in the play? I admit I have no idea: does anybody else know?

 

WORKS CITED
Halliwell-Phillipps, J.O. (1970. [1854].) The works of William Shakespeare, the text formed from a new collation of the early editions: to which are added all the original novels and tales on which the plays are founded; copious archæological annotations on each play; an essay on the formation of the text; and a life of the poet. New York: AMS press.

“Early English Books Online: Text Creation Partnership”. Available online: http://quod.lib.umich.edu/e/eebogroup/ and http://www.proquest.com/products-services/eebo.html.

“Early English Books Online: Text Creation Partnership”. Text Creation Partnership. Available online: http://www.textcreationpartnership.org/

Anthony, L. (2012). AntConc (3.3.5m) [Computer Software]. Tokyo, Japan: Waseda University. Available from http://www.antlab.sci.waseda.ac.jp/

Biber, Douglas, and Jená Burges. (2000) “Historical Change in the Language Use of Women and Men: Gender Differences in Dramatic Dialogue”. Journal of English Linguistics 28 (1): 21-37.

DEEP: Database of Early English Playbooks. Ed. Alan B. Farmer and Zachary Lesser. Created 2007. Accessed 4 June 2014. Available online: http://deep.sas.upenn.edu.

Froehlich, Heather. (2013) “How many female characters are there in Shakespeare?” Heather Froehlich. 8 February 2013. https://hfroehlich.wordpress.com/2013/02/08/how-many-female-characters-are-there-in-shakespeare/

Froehlich, Heather. (2013). “How much do female characters in Shakespeare actually say?” Heather Froehlich. 19 February 2013. https://hfroehlich.wordpress.com/2013/02/19/how-much-do-female-characters-in-shakespeare-actually-say/

Froehlich, Heather. (2013). “The 400 play corpus (1512-1662)”. Available online: http://db.tt/ZpHCIePB [.csv file]

Goldstone, Andrew, and Ted Underwood. “The Quiet Transformations of Literary Studies: What Thirteen Thousand Scholars Could Tell Us.” New Literary History, forthcoming.

Hope, Jonathan. (2003). Shakespeare’s Grammar. The Arden Shakespeare. London: Thomson Learning.

Jockers, M.L. and Mimno, D. (2013). Significant themes in 19th-century literature. Poetics. http://dx.doi.org/10.1016/j.poetic.2013.08.005

Kay, Christian, Jane Roberts, Michael Samuels, and Irené Wotherspoon (eds.). (2014) The Historical Thesaurus of English. Glasgow: University of Glasgow. http://historicalthesaurus.arts.gla.ac.uk/.

Kilgarriff, Adam. (2009). “Simple Maths for Keywords”. Proceedings of the Corpus Linguistics Conference 2009, University of Liverpool. Ed. Michaela Mahlberg, Victorina González Díaz, and Catherine Smith. Article 171. Available online: http://ucrel.lancs.ac.uk/publications/CL2009/#papers

McEnery, Tony and Wilson, Andrew. (2003). Corpus Linguistics: An Introduction. Edinburgh: Edinburgh University Press, 2nd Edition. 81-83

Mueller, Martin. WordHoard. [Computer Software]. Evanston, Illinois: Northwestern University. http://wordhoard.northwestern.edu/

Shakespeare, William. (2007). King John. Ed. E. A. J. Honigmann. London: Arden Shakespeare / Cengage Learning.

[1] Please see http://db.tt/ZpHCIePB [.csv file] for the details of contents in the corpus.

[2] This is not always necessarily true: counting texts does not say anything about how big the corpus is! A lot of very short texts may actually be the same size as a very small corpus containing a few very long texts.

[3] The generic decisions described in this essay have been lifted from DEEP and applied by Martin Mueller at Northwestern University. I am very slowly compiling an update to these generic distinctions from DEEP, which uses Annals of English Drama, 975-1700, 3rd edition, ed. Alfred Harbage, Samuel Schoenbaum, and Sylvia Stoler Wagonheim (London: Routledge, 1989) as its source to Martin Wiggins’ more recent British Drama: A Catalog, volumes 1-3 (Oxford: Oxford UP, 2013a, 2013b, 2013c) for further comparison.

Counting things in Early Modern Plays So You Don’t Have To: Type/Token Ratios

If you’re just joining me, I’ve been working on word frequencies of six highly-prototypical lexical items in a corpus of slightly less than 400 Early Modern London plays. I recommend starting with my research notes and then looking at some quick & dirty results.

As I noted in my quick & dirty results, these numbers hadn’t been normalized in any way: it was all raw data. In an effort to move beyond just raw data, I compiled the total number of words in each play in the corpus. I initially was interested in how play length might be a variable over time in my corpus, so I graphed that. The bulk of my plays are from the early 1600s, as you can see:

play length

Overall, plays do seem to get longer until about 1600, at which point they start to get shorter again. 1662 looks to be an outlier here, as the plays in a straight line on the far right-hand side are mostly by Margaret Cavendish. (I am currently trying to figure out how to color my graphs by author, so if you have advice on that, please let me know: I’m rather haphazardly teaching myself to graph in R as I go.)

OK, so I have the total number of tokens in each text. What if I treated every instance of my prototypical lexical items as a specific type, and plotted them as type/token ratios? Type/token ratios have a somewhat messy history in corpus linguistics, as they’re mostly used to calculate vocabulary density (Type/Token Ratios: what do they really tell us?, Richards 1987 [pdf]), but this would show a ratio of the raw frequency of each lexical item of interest in each play compared to the length of each play, which would normalize my data a bit.

type/token ratios

First of all, it’s notable that the lexical-frequency-to-play-length ratios make some pretty clear bell-curve shapes; I haven’t tried to calculate standard deviations of play-length. (I suppose I could do that next.) The average length of an Early-Modern London play in my corpus was 22086.5 words.
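For the record, both calculations are only a few lines in Python. This is a sketch with three invented plays, not my actual data: the titles, lengths and counts are all hypothetical, and `stdev` here is the sample standard deviation from the standard library.

```python
from statistics import mean, stdev

# Hypothetical plays: (title, total tokens, raw count of one lexical item).
plays = [
    ("Play A", 18_000, 90),
    ("Play B", 22_000, 44),
    ("Play C", 26_000, 13),
]

# Ratio of the item's raw frequency to the length of each play.
ratios = {title: count / total for title, total, count in plays}

# Average play length and its sample standard deviation.
lengths = [total for _, total, _ in plays]
average_length = mean(lengths)
length_spread = stdev(lengths)

for title, ratio in ratios.items():
    print(f"{title}: {ratio:.4f}")
print(f"mean length {average_length:.1f}, standard deviation {length_spread:.1f}")
```

The ratio is just count divided by play length, so a long play with many hits and a short play with few can end up looking the same, which is the point of normalizing.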

It seems that as plays get longer, they’re more likely to use man (and, to some extent, wom*n) in ways that are not true for lord/lady and knave/wench. It’s also worth looking at scales here: there are nearly double the number of lords than ladies, although man/woman and knave/wench are more comparable. Also, there are way fewer instances of knave and wench in my corpus overall, which suggests that maybe these words are not nearly as popular as we might like to think.

Counting things in Early Modern Plays So You Don’t Have To: Some Quick & Dirty Results

I was given a corpus of 400 plays for my PhD on gender in Early Modern London plays. Up to this point I had been focusing largely on Shakespeare, but have recently been moving into the larger corpus. So what does one do with 400 plays? My solution was “get to know them a little bit.” I counted the raw frequencies for lord/lady, man/wom*n, and knave/wench in the entire corpus using AntConc, manually recorded them, and then transcribed this data into a spreadsheet. I had selected these terms on the basis that I had recently spent a lot of time looking at likely collocates for these terms, as these binaries represent a high-, neutral-, and low-formality distinction.

Several of my twitter followers asked why I was just looking at wom*n and not also m*n, and the answer is that without a regular expression I was going to get a fair quantity of noise from m*n (including but certainly not limited to man, men, mean, moon, maiden, maintain, morn, mutton…). Wom*n, I had found, was a highly successful use of a wildcard, only picking up woman and women in the corpus. While this category remains somewhat imbalanced, it presents a pretty clear scope of the quantities for more neutral forms. Now that I have a better sense of what my corpus is like beyond “those files in that folder on my computer”, I can always go back and get other information pretty easily.
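Here is a small Python sketch of the difference. I am using the standard library’s `fnmatchcase` as a stand-in for a wildcard search where `*` matches any run of characters, which is an assumption about how the wildcard behaves rather than a description of AntConc’s internals; the word list is just the examples above.

```python
import re
from fnmatch import fnmatchcase

words = ["man", "men", "mean", "moon", "maiden", "maintain",
         "morn", "mutton", "woman", "women"]

# Wildcard search: * stands for any run of characters, so m*n is very noisy.
wildcard_m = [w for w in words if fnmatchcase(w, "m*n")]

# A regular expression can restrict the middle to exactly one of a or e.
regex_m = [w for w in words if re.fullmatch(r"m[ae]n", w)]

# The same wildcard is much better behaved with a longer stem.
wildcard_wom = [w for w in words if fnmatchcase(w, "wom*n")]

print(wildcard_m)
print(regex_m)
print(wildcard_wom)
```

With a regular expression like `m[ae]n` the male pair could be counted on the same footing as wom*n, which would balance the category.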

What can we learn from a corpus of 400 plays?
For starters, there are not actually 400 plays in the 400-play-corpus, but 325 plays. I knew when I started this project that this corpus was less than 400, and that it did not cover everything. It is a representative corpus, but I was a bit surprised at how much less than 400 plays I actually had. These 325 plays cover 53 individual authors from the years 1514-1662,* which looks like this:

dates

Each dot represents a year of publication. You will note that some authors are more represented than others (Shirley, for example, has 33 plays in the corpus, spanning a number of years, whereas someone like Beza has only one play in the corpus.) The average year for a play to be published was in 1613, and an overwhelming majority of these plays have been published in the late 1500s into the first half of the 1600s.

Once I had the raw frequencies for everything, I was curious to see how these terms performed diachronically. For ease I’m going to keep calling it the 400-play corpus, and as you’re reading, remember that this is very quick & dirty. There’s a lot more to say & do with this data, but I think talking about raw data is a useful endeavor in that it speaks volumes about the sample itself.

lady lord (diachronic)-1

man woman diachronic-1

These graphs suggest that the use of lady and wom*n looks more frequent in the corpus from the late 1500s onwards (they’re both almost in a parabola shape) whereas the use of lord and man begins to decline around 1600, creating more of a bell curve effect.

And what about knave and wench? We see there’s a distinct decrease in usage for both just after the early 1600s, though knave was more frequent earlier in the corpus:
knave wench diachronically-1

Two of these three sets of binaries show very similar graphs, but that’s because this is raw data: there are simply more plays in the corpus from around the late 1530s onwards.

This was my first time using R for any graphing ever, so I’m going to dive back in and see what I can do with a more normalized corpus next.


Additionally, I owe a great debt to the following people, who were very selfless and helpful:
Sarah Werner, Julia Flanders, Shawn Moore, Douglas Clark, Simon Davies. Thank you.

Counting gender-specific nouns in 400 plays so you don’t have to: research notes

Those of you following me on Twitter will have noticed I’ve been tweeting bite-sized facts about gender in Early Modern London plays. Here are some research notes on what I’ve been doing.

The Corpus.
I have a corpus of ~400 Early Modern London plays, culled from EEBO by someone who is not me, spanning from 1514 to 1662. This almost certainly does not cover every play written in that time, nor does it cover variant editions of these plays. This is meant to be a largely representative corpus: I have all major playwrights, a number of minor ones, and most (but importantly not all) plays written by them; one edition per play. These files have been labeled by canonical generic description (eg comedy, tragedy, history, tragicomedy), year of publication, abbreviated author surname, and a truncated version of the title. All of this metadata has been collected from EEBO, again by that same someone who is not me.

The files themselves have had everything but the words said by characters stripped out. There are no headers (no scene/act denotations) and no character markers. Each word is on its own line, and all spelling has been modernized. Here is a sample, from Kyd’s The Spanish Tragedy:
The Spanish Tragedy
This, you will note, is not ideal for reading by human eyes. But computers can do some wonderful things with this format.
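To give a sense of why this format suits a computer, here is a minimal Python sketch that counts my terms of interest in a one-word-per-line file. The set of word forms, the function names and the sample lines are all my own illustration, and the file path you would pass to `count_file` is hypothetical.

```python
from collections import Counter

# Singular and plural forms of the terms of interest.
NODE_WORDS = {
    "lord", "lords", "lady", "ladies",
    "man", "men", "woman", "women",
    "knave", "knaves", "wench", "wenches",
}


def count_node_words(lines) -> Counter:
    """Count node words in an iterable of lines holding one word each."""
    tokens = (line.strip().lower() for line in lines)
    return Counter(token for token in tokens if token in NODE_WORDS)


def count_file(path: str) -> Counter:
    """Same count, reading from a one-word-per-line text file."""
    with open(path, encoding="utf-8") as handle:
        return count_node_words(handle)


# Invented sample in the same shape as the corpus files.
sample = ["my\n", "lord\n", "the\n", "lady\n", "and\n", "my\n", "Lord\n"]
print(count_node_words(sample))
```

Because every word sits on its own line, there is no tokenizing to do at all: stripping the line break and lowercasing is enough.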

I’ve been sorting these files into separate folders by author, to get a sense of how many and which plays by which authors I have in my corpus. This is, quite simply, a little more manageable than a running list of plays sorted by genre & date. It also gives me a larger sense of when these authors are working, what generic kinds of plays I have for them, and allows me to have the flexibility to group them in a variety of ways (playhouses associated with specific playwrights, authors who are contemporaries, etc) later on.

My present goal is to get counts of how many times the words lord/lady, man/wom*n, and knave/wench appear in each play in my corpus. Part of the reason I’ve chosen these terms is that they represent a shift from high – neutral – low formality while retaining gender-specific contexts. I could have chosen other ones: I’ve been looking at collocational patterns in Shakespeare using these terms (here are the relevant slides, .pdf) and wanted to get a sense of how these terms are represented in the larger corpus before I do anything else.

I consider this “getting to know my plays” because I’ve been reading as many of these plays as I possibly can, but I have several disadvantages here:
1. I can remember what many of these plays are about, but not the fine level of detail the computer can pull out for me.

2. Some of these plays are very hard to find in print (and, as I’ve shown, they are not in an ideal format for reading). My university no longer subscribes to EEBO, so I don’t actually have access to the original full-text files.

Getting Data on 400 Plays and What To Do With It.
I’ve been running the plays through a concordance program called AntConc to get a visualization of where and how many of these terms appear in each subcorpus of author. Here’s what Dekker looks like in Antconc’s Concordance Plot viewer:
Screen shot 2013-04-28 at 2.22.44
Each black line represents one instance of the search term, and is visualized in a linear way (so, from the beginning to end of each play). This is useful in that the software will give me a number of hits in each play AND shows me where these words appear in the play-texts. For The Honest Whore, Part 1, there are a few instances of “lady” all at once, at the beginning of the play, a few scattered in the middle, another small clump (probably representing a conversation) in the middle, and a few sparse other instances toward the end of the play. I’m doing this mostly to get a sense of where these highly salient words appear and don’t appear in ways that are very hard to keep track of when you’re reading 400 plays in a traditional, linear fashion. These are words you’d (presumably) expect to find in Early Modern plays, so you’re not really paying much attention to them as a reader.
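The idea behind a plot like this is simple enough to sketch in a few lines of Python. This is my own rough imitation, not how AntConc draws its plots: each hit is turned into a position between 0 and 1, and then marked on a strip of fixed width. The function names and the toy token list are made up.

```python
def hit_positions(tokens: list[str], term: str) -> list[float]:
    """Relative position (0 = start, approaching 1 = end) of each hit."""
    n = len(tokens)
    return [i / n for i, token in enumerate(tokens) if token == term]


def plot_line(positions: list[float], width: int = 40) -> str:
    """Crude text version of a concordance plot: one | per occupied cell."""
    cells = ["."] * width
    for position in positions:
        cells[min(int(position * width), width - 1)] = "|"
    return "".join(cells)


# Toy text of 20 tokens with hits at the very start and at the midpoint.
tokens = ["lady"] + ["x"] * 9 + ["lady"] + ["x"] * 9
positions = hit_positions(tokens, "lady")
print(len(positions), "hits")
print(plot_line(positions, width=10))
```

Clumps of hits that fall in the same cell collapse into a single mark, so the strip shows where a word turns up rather than exactly how often.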

I record this data by hand in a notebook by author and then manually copy the information into a csv file. While it would be great to essentially have a spreadsheet of all of this information automatically produced, spreadsheets are also not particularly well-designed for human eyes to read. Eventually this will turn into a very nice graph, I’m sure, but in this format, it’s hard to make much sense of it all:
Screen shot 2013-04-28 at 2.38.55

This is admittedly a little easier:
Scan 4

There is an easier way to do this for my entire corpus at once in R and – presumably – Python, but quite frankly that would become information overload very quickly. So while some of you more computational people may be wondering why I’m moving at such a seemingly glacial pace, the answer is “because I want to be comfortable with the data and familiar in a way that allows me to think and reflect on it as it comes”, rather than having it all at once. I want to get to know my corpus a little bit more first. Eventually, I’ll be moving into R with this data – but not yet.

When I’m done I will be making the csv file available, and will hopefully be posting a write-up here. Thanks for your patience. In the meantime, here’s the csv file for all of Shakespeare (from the Globe Shakespeare, 1841) organized by genre (comedy, history, late plays, tragedies).

How much do female characters in Shakespeare actually say?

Recently I suggested there might be 147 female characters in Shakespeare. If we are to trust that, how do they break down by play? I used the Open Source Shakespeare genre distinctions to categorize each play and the female-character categorizations from WordHoard to produce the following:
Screen shot 2013-02-17 at 9.16.43
In this graph, green represents comedy, black represents history, and red represents tragedy. As you will recall from my previous post, The Winter’s Tale has the most female characters, and 1H4, Julius Caesar, and Tempest have the fewest female characters.

17 out of 37 plays have four female characters. This makes sense, as the Early Modern theatre could hire two boys to cover all female roles, although this would obviously limit the characters who could then speak to each other. More female characters required either more boys, or for each boy-actor to take on more parts (which would again limit the amount these characters could speak to each other).

But how much do these characters talk? Or, in other words, how much of each play is made up of words said by female characters? To find out, I first needed to know how many words were in each play, and how many of those words were said by female characters. I had already noted how many words were said by female characters in each play for my previous post, but I didn’t have the total number of words in each play.

I returned to WordHoard’s find words function to get a word-count according to the software’s own encoded edition of each play:

Screen shot 2013-02-19 at 2.12.20

With this information, I was now able to produce the following graph. Again, green represents comedy, black represents history, and red represents tragedy; the shape of each mark on the graph represents how many female characters are in each play:

Screen shot 2013-02-19 at 3.34.54

Female characters in As You Like It say the most out of all the female characters in Shakespeare (but that number includes Rosalind/Ganymede), with 8,643 words spoken out of 21,298 total words in the play. Female characters in Timon of Athens say the least, with 61 words out of 17,744 total words in the play. On the whole, while there may be slightly more female characters in comedies, the number of words they actually speak is highly variable, whereas the histories seem to show the least variation. I had also taken the average number of female characters per play in each genre and found that comedies had an average of 4.07 female characters; histories, an average of 4.083; and tragedies, an average of 3.72, suggesting that the history plays may be the most stable of the three categories for female characters, which is interesting. If you are interested in which female characters say the most words, please click here for the relevant image.
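For anyone who would rather check the calculator work in code, here is a minimal sketch in Python of the proportion behind the graph: the words spoken by female characters divided by the total words in the play. The two word counts are the ones quoted above; the function and variable names are my own illustrations and have nothing to do with WordHoard itself.

```python
# Word counts as quoted in this post (from WordHoard):
# play title -> (words spoken by female characters, total words in the play)
plays = {
    "As You Like It": (8643, 21298),
    "Timon of Athens": (61, 17744),
}


def female_share(female_words, total_words):
    """Return the percentage of a play's words spoken by female characters."""
    return 100 * female_words / total_words


# Compute the share for each play and print it to one decimal place.
shares = {title: female_share(f, t) for title, (f, t) in plays.items()}
for title, pct in shares.items():
    print(f"{title}: {pct:.1f}% of words are spoken by female characters")
```

Adding the other plays to the dictionary would be enough to reproduce the vertical axis of the graph, assuming the same two counts are recorded for each play.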

A number of people have asked me if Shakespeare passes the Bechdel test: I’m working on it! Stay tuned…

How many female characters are there in Shakespeare?

This was a fairly straightforward question I found myself asking recently for a footnote.  Easy, I thought. I’ll go find a list of characters, count up the female ones, subtract them from the total number of characters, and I’ll have my answer. Though I could have picked up my Complete Works of Shakespeare and started counting from the dramatis personae for each play, I didn’t – because I knew that this information had been encoded before. Gender of characters is something that is often encoded in metadata (there’s a TEI category for gender), and character lists are easy to obtain.

I started with Open Source Shakespeare’s list of characters, which lists 1222 total characters in 37 plays. This list included variations of “all” from many plays:

Screen shot 2013-02-08 at 5.33.00

So, these instances of “all” aren’t really individual characters. However, the rest of this list contained every single character in all the plays, and that was something I could work with. If there are 1222 total “characters”, minus 31 instances of “alls”, there are 1191 individual characters. From there I could either put each of the 1191 individual characters in a box labeled “male”, “female” or “unknown, ambiguous or mixed”, or I could ask another program to do it for me.

I opened WordHoard and asked it to Find Words by Speaker Gender, which would account for those three categories. WordHoard covers all of the same plays as Open Source Shakespeare.
Screen shot 2013-02-08 at 5.25.05
Intuition tells me that it will be an easier task for a computer to isolate female characters than it will be to isolate male characters, so I select “female”, and click “find”. A few minutes later, WordHoard produces the total words spoken by all female characters in each play, and I add the criterion to show “words by speaker name”. My screen looks like this (click to make bigger):
Screen shot 2013-02-08 at 5.48.45

Counting each character, I reach a total of 147 female characters in all of Shakespeare, which of our 1191 characters amounts to about 12%. Winter’s Tale has the most female characters (8); Tempest and Henry IV part 1 have the fewest (2). But the Tempest count includes Ferdinand, whom WordHoard categorizes as female; if he is not counted, Tempest has only one female character. The Young Son in Richard III is also deemed female. Macbeth has 7 female characters, but that includes the Witches:

Screen shot 2013-02-08 at 5.55.58

I don’t particularly think that the Witches count as female; I would have been happier to see them as “unknown, mixed, or ambiguous”. How do we know if a character is really female? I could give the Open Source Shakespeare list to any Shakespeare scholar and they could come up with a different count by gender. WordHoard, though, categorizes Rosalind, Viola, Ferdinand, and the Witches as female characters and treats them as female throughout its system. The benefit of this is that they cannot ever suddenly change categories within the structure of the program, though you may not necessarily agree with the way it has categorized them.

According to my numbers, I had 1044 characters left, covering “male” and “ambiguous”. I was curious as to what counts as “unknown, mixed, or ambiguous” according to WordHoard (again, click to make bigger):

Screen shot 2013-02-08 at 6.07.34

Interestingly, characters who count as “gender-ambiguous”, according to WordHoard, include the actors Mustardseed, Peaseblossom, Cobweb and Moth from A Midsummer Night’s Dream. I disagreed with this distinction: if they are ambiguous, surely the Witches should be as well? Other examples here include the aforementioned “alls” and a number of ghosts or apparitions (“Ghosts of Others Murdered By Richard III” was my personal favorite). This raises more questions: Should apparitions and spirits get their own gender category? Are they gendered? What counts as “gendered”?

Ultimately I counted and removed all the “all”s, which here total 17, in disagreement with the Open Source Shakespeare count. Had I been doing this by hand, I might have counted instances of two or more characters speaking together as “alls”, but WordHoard isn’t counting this information. It merely counts the total number of words for each speaker, here marked as “all”, whereas two characters who say something at the same time may not be marked as “all”.

This left 46 total ambiguous characters, covering characters such as servants, attendants, various apparitions, and the actors from A Midsummer Night’s Dream, which account for about 4% of the characters in the Shakespeare corpus. The 17 Alls accounted for about 1% of the characters, leaving 998 male characters, or about 83% of the characters in the corpus.
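The running tally in this post is nothing more than subtraction and division, so it can be retraced in a few lines. The sketch below uses the figures exactly as quoted above. The variable names are my own, and it assumes that the male count is whatever remains once the female and ambiguous characters are taken out of the 1191, which does reproduce the 998 given above.

```python
# Figures as quoted in this post; the variable names are illustrative only.
oss_listed = 1222   # entries in the Open Source Shakespeare character list
oss_alls = 31       # entries in that list that are variations of "all"
individual = oss_listed - oss_alls   # individual characters

female = 147        # counted from WordHoard's "female" results
remaining = individual - female      # "male" plus "ambiguous"

ambiguous = 46      # WordHoard's ambiguous characters once the "all"s are removed
wordhoard_alls = 17  # "all" entries found in WordHoard's ambiguous list
male = remaining - ambiguous         # assumption: everyone left over is male


def share(count, total=individual):
    """Return count as a percentage of the individual characters."""
    return 100 * count / total


for label, count in [
    ("female", female),
    ("ambiguous", ambiguous),
    ("'all' entries", wordhoard_alls),
    ("male", male),
]:
    print(f"{label}: {count} of {individual} = {share(count):.1f}%")
```

Keeping the counts as named values rather than punching them into a calculator makes it easy to rerun the percentages if a character is moved from one category to another, for example if the Witches were counted as ambiguous rather than female.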

So, in review: how many female characters are there in Shakespeare? It’s hard to say, but one answer is 147.