On Teaching Literature to Computer Science Students

[Previously: On Teaching Coding to English Studies Students]

Recently I wrote about English studies students learning to code in an interdisciplinary computer science and English class. In that post I mentioned that this class (running for the third consecutive year) comes with a variety of challenges – some strictly institutional, some cross-departmental, and some pedagogical. I’ve been collecting a number of these and will be blogging about them in the future. In that post I also mentioned that there are two very oppositional learning curves at play: one is getting the English studies students to think about computers in a critical way and the other is getting the computer science students to read literature.

We have just hit the very exciting point in the course where the computer science students are learning to read and the English studies students are truly hitting their stride, which is a dramatic turnaround from the start of the course. Last week we asked each group to give a very short, informal presentation about their assigned Shakespeare plays in relation to the rest of the Shakespeare corpus. It was no surprise that in every group the English studies students gave an overview of the plot of their play and a few key themes whereas the computer science students reported what they had deemed to be a finding. English studies students have been studying how to analyze texts and the computer science students haven’t done that in the same way.

Here in Scotland, students begin to track either towards arts & humanities or science long before they hit university – they start to track in high school, and take school leaving exams in a number of subjects (“Highers”), from a rather long list which you can read here. Once you choose your track, it’s rare (though not unheard of) to have much overlap between A&H and science in one’s Highers qualifications. Most degree programs will have preferred subjects for applicants which guide students’ decisions about which Highers to take; Strathclyde’s entry requirements for a student wishing to be a Computer Science undergraduate can be found here (pdf). Unless you’re going into a joint honours in Computer Science and Law, English literature or language is not a required Higher for prospective students in computer science at my university. It goes the other way, too – a Higher in Maths is not a requirement for a prospective English studies student unless they plan on taking a joint honours with Mathematics (again, pdf). Some students may take Highers strictly out of interest (or uncertainty about which route to take), but like the SAT II or AP exams, this is not necessarily something you’d do for fun – these are high-stakes exams.

If faced with that choice I would definitely have taken the Arts & Humanities track despite liking science and being (told I was) very bad at math. I suspect that a lot of computer science students may have liked history or media studies but were bad at writing (or told they were…) and that was enough to turn them away from taking an Arts & Humanities track. It might be that students who sign up for a degree in Computer Science are just really passionate about computers, or that they like practical problem solving, or they’ve been told that computer science is a lucrative field. I have no idea – I’m not them*. On the surface, it can look like they have a lot of missing cultural information for not knowing things about literature – but they also know a whole lot more than we do about very different things.

That’s not to say these students don’t read of their own accord or aren’t interested in books. However, by the time they show up to my class, they have rather successfully avoided close reading for a few years, whereas English studies students have been practicing this skill for a while now. This is something English studies students are very comfortable with, and they have now learned enough about the way that computers “think” (or lack thereof) to reach a common ground with the CS students.

However, in the same way the computer science students found the first half of the class easy and the English studies students found it tremendously daunting, the computer science students suddenly feel like they’ve been thrown in the deep end. The English studies students have to teach them how to analyze a text.

In our in-class presentations, each group had to discuss a discovery they had made about their play and explain why it was interesting. Without fail, the computer science students had lots to say about the kinds of words they had found to be more or less frequent in their play compared to all of Shakespeare’s plays. And yet when they were pressed about why they thought it was happening, they weren’t really sure. The English studies students could postulate theories about why there was more or less of a specific kind of feature in their text, because they know how to approach this problem.

Over the next few weeks we’re letting the students guide their own projects and produce explanations for their discoveries, which means the computer science students are on a crash course in close reading from their in-group local expert. They’re learning that data isn’t everything when it comes to understanding what makes their play in some way different from (or similar to) other plays. In fact, they’re learning the limitations of data and ways that close reading is not just supplementary but essential to a model of distant reading with computational methods. And if the student presentations I saw last week are any indication, I suspect I have some very exciting work coming my way in a few weeks’ time.

* I did my dual-major undergrad degree in English lit and Linguistics; I didn’t get involved in computers until my master’s.

(with thanks to Kat Gupta for comments on this post)

CEECing new directions with Digital Humanities

[editor’s note: this post is cross-posted to Kielen kannoilla, the VARIENG blog. I am extremely grateful to Anni Sairio and Tanja Säily for all their organization behind making this visit, and thus this blog post, possible.]

This past week I was talking about the relationships between corpus linguistics and digital humanities as a visiting scholar at VARIENG, a very well known historical sociolinguistics and corpus linguistics working group. Corpus linguistics is a very text-oriented approach to language data, with much interest in curation, collection, annotation, and analysis – all things of much concern to digital humanists. If corpus linguistics is primarily concerned with text, digital humanities can be argued to be primarily concerned with images: how to visualize textual information in a way that helps the user understand and interact with large data sets.

VARIENG has been compiling the Corpus of Early English Correspondence (CEEC) for a number of years, and one of their primary concerns is ‘what else can we do with all this metadata we’ve created’? Together, we discussed three main themes of corpus linguistics and digital humanities: access, ability, and the role of supplementary vs created knowledge. Digital humanities runs on a form of knowledge exchange, but this raises questions of who knows what, how, and how to access them.

Approaching a computer scientist with a bunch of historical letters may raise some “so what” eyebrows, but likewise, a computer scientist approaching a linguist with a software package to pull out lexical relationships might raise similar “so what” eyebrows: why should we care about your work and what can we do with it? Because both groups walk in with very different kinds of expertise, one of the very big challenges of digital work is to be able to reach a common language between the disciplines: both have very established, very theoretically-embedded systems of working.

All of this is to say that the takeaway factor for corpus linguistics research, and indeed any kind of digitally-inflected project, is very high. As Matti Rissanen says, and rightly so, “research begins when counting ends”. The so-what factor of counting requires heavy contextualization, human brainpower, time, funding, systems and communication – and none of these features are unique to corpus linguistics. Digitally-inflected scholarship requires complementary expertise in techniques, working and interacting with data; we need humanistic questions which can be pushed further with digital methods, not digital methods which (we hope) will push humanistic questions further. While it is nice to show what we already understand by condensing lots of information into a pretty picture, there are deeper questions to ask. If digital humanities currently serves mostly to supplement knowledge, rather than create new knowledge, we need to start thinking forward to ask “What else can we do with this data we’ve been curating?”

One thing we can do with this data is view it in new tools and learn to ask different questions, as we did with Docuscope, a piece of rhetorical analysis software developed at Carnegie Mellon University. Digital tools and techniques are question-making machines, not answer-providing packages. Here we may ask ourselves why F_1720-39.txt has a low count of Personal Pronouns in Docuscope, and the answer may be that what we consider to be personal pronouns (grammatically) are categorized otherwise by Docuscope and that other constructions are used instead. This isn’t magic and this can’t be quiet handwaving: we should be pushing ourselves towards asking questions which were previously impossible at the sentence level or lexical level of detail, because suddenly we can.

Resources:
Slides from last week’s workshops (right-click to save as pdf files):

Day one: What is digital humanities?
- Blogpost by Timo Honkela, summarizing Day 1
Day two: What do you do with millions of words?
Day three: Projects

On teaching coding to English studies students

This term I am teaching English studies students how to code. This seemingly goes against quite a lot that I’ve been saying online for a while, so let me back up.

I’ve been involved in an interdisciplinary digital humanities course called Textlab for the past two years, as part of a university-sponsored, research-oriented, cross-faculty project; this time around I’ve stepped into a larger role of co-convening rather than simply being support staff for it. (It’s week 2 and I’m already planning a blog post called Things I Learned From Co-Convening An Interdisciplinary, Interyear Course, so stay tuned for that.) The premise of Textlab is that Computer Science and English studies students will work together in small groups which study a specific Shakespeare play with computational and literary methods (for what this looks like in practice, please see this paper). The goals of knowledge exchange here are pretty clear: English studies students learn hands-on computer skills and the computer science students learn how to practically apply their knowledge on a real-world project. In essence, the computer science students have to revisit Literature and the English studies students learn Computers. The learning curve on both sides is, effectively, massive.

In many ways the first half of this 10-week course is digital literacies for the English studies students, and the latter half is literacy literacies for the Computer Science students. We are obviously covering a lot of ground here; in the past we have had students blogging and wiki-ing and tweeting, but this year we’re scaling back to address textual analysis from the ground up, what that looks like, and how computers can supplement our literary understandings of texts.

Because this is a practical, hands-on, interdisciplinary class, the students all need to be on roughly the same page early on. And that is how I end up back with teaching English studies students how to code – we are starting with really basic Unix commands to understand how computers work, and building on these principles, from commands to programs, to understand how computers can supplement human inquiry. Predictably, the CS students fly through the early Unix stuff and can really struggle with the reading, whereas the English studies students succeed wildly with the reading stuff and can really struggle with the computational stuff.

I’ve been sitting on a blog post for a long time about why I don’t think everyone needs to learn how to code and I continue to think that. This blog post may never see the light of Internet-Day; we’ll see. But the main thing I want to highlight is that I have no “official” training in programming or coding; my degrees are all in literature and linguistics and I have taught myself almost everything I know about computational analysis now. Three years ago I couldn’t have told you what a command line is and now I deal with it on a semi-regular basis. So I totally understand where the English studies students are coming from, and I also understand the potential of what the computer science students are capable of. The problem in much of the Everyone Must Learn to Code Now dogma is that you need a practical problem to care about, otherwise it is essentially meaningless. I don’t really care that Mary has three watermelons and Jane has seven oranges and how many apples does John have? Learning to code is all very well & good but the practical aspect has to be there, otherwise it just feels like our fruit distribution word problem.

So as I sat in the computer lab today with my students, I saw the English studies students struggling with commands like wc (and its -l and -w options), mkdir, uniq, grep, and cat. Part of the problem was that the students had no conception of why they were typing these letters into some computer. “I don’t understand,” they said. So I asked them what they had done so far, and they said they had just typed in some things and now they weren’t sure what to do with it.

We sat down and talked about what each step was, and what each of these commands meant, while walking through parts of the lab together. Sometimes they weren’t sure, and we had to address what to do about that. “Cat” is not a very googleable command: we aren’t looking for furry creatures with pointy ears and whiskers. But as a command, cat does a lot and what it does is not super transparent, so we talked about how to get information about what these letters mean.

The other thing I found a lot of my English studies students doing was feeling self-conscious about not knowing how to go back and fix what they understood to be a mistake. It’s easy enough to go back to the folder we are working in, but we hadn’t given them that command and there’s no easy BACK button in a terminal. Here’s a secret about me: I always have to look this up. I have to look up a lot of information when I want to do something computational. I would never claim to be a programmer, let alone a proficient one, but even after three years of this kind of stuff I have to look it up. One of the big problems they were having is that they didn’t know how, or where, to get that kind of information, so we talked about that too.

I don’t like or support the idea that computational language is like a natural language (it’s not) but the easiest analogy to make here is that learning what these letters mean is a lot like learning a language. If you’re learning French you need to know what the sounds you’re mashing together represent, and what kind of meaning they hold. Je voudrais un ananas may be meaningless to an English speaker, but to a French speaker that phrase holds meaning. Likewise, asking a computer sort -n file | uniq | sort -r > sorted_file holds a specific kind of meaning. If you don’t speak French you probably don’t understand what I just said; if you don’t speak “computer” you don’t understand what you just said either. Simply replicating letters in order doesn’t allow the students to critically engage with a task the way we might want it to. The goal of the course is to get the English studies students to understand how a computer works more fully, but producing replication tasks makes this just another black box: type this, MAGIC HAPPENS, you will have results.
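For readers who would rather read Python than shell, here is a minimal sketch of what that pipeline does, one stage at a time. It assumes every line of the file holds a single number; the function name and the sample values are invented for illustration, and the final `> sorted_file` step (saving the output to a file) is left out.

```python
from itertools import groupby

def pipeline(lines):
    """Mimic `sort -n | uniq | sort -r` on a list of numeric strings."""
    # sort -n : put the lines in order of their numeric value
    numerically_sorted = sorted(lines, key=float)
    # uniq : collapse each run of identical neighbouring lines into one
    deduplicated = [line for line, _ in groupby(numerically_sorted)]
    # sort -r : sort again as plain text (not as numbers), in reverse
    return sorted(deduplicated, reverse=True)

result = pipeline(["10", "9", "2", "10", "2"])
print(result)  # ['9', '2', '10']
```

The result is not in numeric order, because the last stage compares the lines as text. Knowing what each stage means is what lets a student spot a detail like this.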

Next week we are addressing pipelines more fully. My role as a teacher, educator, and mentor is to help my students understand what we’re doing, and one of the many ways this class is challenging is that I don’t want to be standing in front of them lecturing if I can avoid it. Standing in front of my students and telling them how to do things isn’t hands-on learning. But talking about how to find resources and how to ask questions is hands-on learning. In the meantime, I am thinking about how I can support the computer science students when the coin is flipped, and how to get them talking about literature in a way that feels tangible and relevant to them.

Early “English” Books Online?

Early English Books Online, or EEBO, is what might be technically known as “a hot mess”. (If you’re unfamiliar with EEBO and its messiness, I highly recommend Ian Gadd’s “The Use and Misuse of Early English Books Online” which summarizes how we arrived at this hot mess, Sarah Werner’s blogpost on the kinds of things EEBO doesn’t show us well, and Daniel Powell’s roundup of EEBO weirdness). I want to stress that this isn’t necessarily a bad thing, as it’s a product of time and technology from a while ago. It’s being rekeyed by humans (the TCP enterprise), and overall it is just a really big dataset of Early Modern English. When you’re looking at giant datasets like EEBO it doesn’t really matter if parts of it are imperfect. It will always be imperfect.

I’ve been looking at spelling variation for various gender terms and collocational patterns surrounding gender terms in EEBO lately because it is a really big dataset and those tend to be useful for testing our perceptions of language, especially when they contain a number of different kinds of texts. One of the ones I was looking at was hir, a known variable spelling of her. One example of this can be found in Shakespeare’s Merry Wives of Windsor (V.ii.2150); Melchiori’s Arden Shakespeare edition has a note about the phrase “his muffler”: the Folio edition of Wives reads his, but the Quarto edition reads “her muffler”. This “may be Evans’ confusion, but more likely Shakespeare’s slip or a printer’s misreading of ‘hir’, an alternative spelling of her” (2000: 253).

So in looking for examples of hir I found myself suddenly looking at Welsh. Specifically, this text, Ymadroddion bucheddol ynghylch marvvolaeth o waith Dr. Sherlock (all links will go to the Michigan Text Creation Partnership permalinks, for ease of reference. Because I’m based in the UK, my access comes from JISC Historic Ebooks, not the Chadwyck interface, meaning that generated permalinks might not work – further problems!). The below image is from the ‘text’ option on the JISC interface for Ymadroddion:

I don’t speak – or read – Welsh, let alone Early Modern Welsh, so I turned first to Google Translate and secondly to Twitter, where I joyfully found a number of people who either work with or speak/read Welsh (and one person who studied Medieval Welsh in undergrad – officially winning the title of ‘most obscure gen ed ever’. The internet continues to amaze.)

In Welsh, hir means ‘long’, so it’s not a pronoun but an adjective. I was curious about the structures of grammatical gender in Welsh, namely if it would have agreement by gender in ways that Old English, for example, did. This answer was a little bit more complicated to elucidate but it was declared that yes, there is a gender system in Welsh; and no, it should not affect hir. [1] So, that’s good to know. But here’s a question: When we say ‘Early English Books (Online)’, do we really mean English the place, or English the language?

Linguistically, Welsh is rather decidedly not English, as the extremely useful BBC Modern Welsh Grammar will illustrate. But I was rather surprised to find Welsh being considered part of “English” in this set. So, I went back to the EEBO-TCP site, where they say the following about text selection:

Selection is based on the New Cambridge Bibliography of English Literature (NCBEL). Works are eligible to be encoded if the name of their author appears in NCBEL. Anonymous works may also be selected if their titles appear in the bibliography. The NCBEL was chosen as a guideline because it includes foundational works as well as less canonical titles related to a wide variety of fields, not just literary studies.
In general, we prioritize selection of first editions and works in English (although in the past we have also tackled Latin and Welsh texts). Because our funding is limited, we aim to key as many different works as possible, in the language in which our staff has the most expertise. However, exceptions for specific works may be made upon request.
A work will not be passed over for encoding simply because it is available in another electronic collection. Not only is the quality of these collections sometimes uncertain, a text’s presence outside of EEBO will not allow it to be searched through the same interface as the EEBO encoded texts.
Titles requested by users at partner institutions are placed at the head of the production queue.

There is quite a lot of Latin in EEBO, because it was in some ways considered a prestige language in the earlier early modern period. Many early printed books were in Latin, so it is generally unsurprising that there’s a lot of it in the EEBO set. Again, this is not English-the-language but English-The-Place. Curiously, the place of imprinting for Ymadroddion bucheddol ynghylch marwolaeth o waith Dr. Sherlock is listed as “gan Leon Lichfield, i John March yn Cat-Eaten-Street, ag i Charles Walley yn Aldermanbury, […] yn Llundain” [by Leon Lichfield, for John March in Cat-Eaten-Street, and for Charles Walley in Aldermanbury, […] in London], suggesting that “English” refers to place rather than strictly language, and it gets the following metadata:

Publication Country : England
Language : Welsh

Interesting.

Scotland joins with England in 1603 when James VI, King of Scotland inherits the throne to become James I, King of England, but the two countries remain largely independent states until the Acts of Union in 1707. But would we find examples of Scots in EEBO? Scots, like Welsh, is an example of another localized language, though arguably Scots gets more English influence. Kirk is a nice Scots word meaning ‘church’, and here’s an example from William Dunbar’s The tua mariit wemen and the wedo. And other poems from around 1507:

Curiously, this is listed in the records as

Publication Country: Scotland
Language: English

As above, I’m not sure everyone would agree that this is “English”. Nor is it printed in “England”. But these books (and more) are there as part of Early English Books Online.

[1] Thanks to Jonathan Morris (@jonmorris83), a marketing assistant at Palgrave Linguistics, Alun Withey (@DrAlun), Liz Edwards (@eliz_edw) and Sarah Courtney (@sgcourtney)

Counting things in Early Modern Plays So You Don’t Have To: Type/Token Ratios

If you’re just joining me, I’ve been working on word frequencies of six highly-prototypical lexical items in a corpus of slightly less than 400 Early Modern London plays. I recommend starting with my research notes and then looking at some quick & dirty results.

As I noted in my quick & dirty results, these numbers hadn’t been normalized in any way: it was all raw data. In an effort to move beyond just raw data, I compiled the total number of words in each play in the corpus. I initially was interested in how play length might be a variable over time in my corpus, so I graphed that. The bulk of my plays are from the early 1600s, as you can see:

Overall, plays do seem to get longer until about 1600, at which point they start to get shorter again. 1662 looks to be an outlier here, as the plays in a straight line on the far right-hand side are mostly by Margaret Cavendish. (I am currently trying to figure out how to color my graphs by author, so if you have advice on that, please let me know: I’m rather haphazardly teaching myself to graph in R as I go.)

OK, so I have the total number of tokens in each text. What if I treated every instance of my prototypical lexical items as a specific type, and plotted them as type/token ratios? Type/token ratios have a somewhat messy history in corpus linguistics, as they’re mostly used to calculate vocabulary denseness (Type/Token Ratios: what do they really tell us?, Richards 1987 [pdf]), but this would show a ratio of the raw frequency of each lexical item of interest in each play compared to the length of each play, which would normalize my data a bit.
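As a minimal sketch of the normalization described here, the arithmetic is just the number of hits divided by the length of the play, scaled to a rate per 1,000 words so the numbers are easier to read. The function name, the per-1,000 scaling, and the figures for the two plays are invented for illustration; none of them come from the corpus.

```python
def relative_frequency(hits, total_tokens, per=1000):
    """Occurrences of a search term per `per` words of running text."""
    if total_tokens <= 0:
        raise ValueError("total_tokens must be positive")
    return per * hits / total_tokens

# Hypothetical figures, NOT from the corpus: a long play and a short play.
long_play = relative_frequency(hits=44, total_tokens=22000)
short_play = relative_frequency(hits=30, total_tokens=10000)

print(long_play)   # 2.0 hits per 1,000 words
print(short_play)  # 3.0 hits per 1,000 words
```

The long play has more raw hits (44 against 30) but the lower rate, which is the kind of difference that raw counts hide.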


First of all, it’s notable that the lexical-frequency-to-play-length ratios make some pretty clear bell-curve shapes; I haven’t tried to calculate standard deviations of play-length. (I suppose I could do that next.) The average length of an Early-Modern London play in my corpus was 22086.5 words.

It seems that as plays get longer, they’re more likely to use man (and, to some extent, wom*n) in ways that are not true for lord/lady and knave/wench. It’s also worth looking at scales here: there are nearly twice as many lords as ladies, although man/woman and knave/wench are more comparable. Also, there are way fewer instances of knave and wench in my corpus overall, which suggests that maybe these words are not nearly as popular as we might like to think.

Counting things in Early Modern Plays So You Don’t Have To: Some Quick & Dirty Results

I was given a corpus of 400 plays for my PhD on gender in Early Modern London plays. Up to this point I had been focusing largely on Shakespeare, but have recently been moving into the larger corpus. So what does one do with 400 plays? My solution was “get to know them a little bit.” I counted the raw frequencies for lord/lady, man/wom*n, and knave/wench in the entire corpus using AntConc, recorded them by hand, and then transcribed this data into a spreadsheet. I had selected these terms on the basis that I had recently spent a lot of time looking at likely collocates for them, as these binaries represent a high-, neutral-, and low-formality distinction.

Several of my twitter followers asked why I was just looking at wom*n and not also m*n, and the answer is that without a regular expression I was going to get a fair quantity of noise from m*n (including but certainly not limited to man, men, mean, moon, maiden, maintain, morn, mutton…). Wom*n, I had found, was a highly successful use of a wildcard, only picking up woman and women in the corpus. While this category remains somewhat imbalanced, it presents a pretty clear scope of the quantities for more neutral forms. Now that I have a better sense of what my corpus is like beyond “those files in that folder on my computer”, I can always go back and get other information pretty easily.
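The difference between the two wildcards can be sketched in a few lines of Python. This assumes the concordancer's `*` behaves like a shell-style wildcard (any run of characters, including none), which is an assumption about the tool and not something checked against its documentation. The word list is a toy one built from the examples above, and the function name is hypothetical.

```python
import re
from fnmatch import fnmatchcase

# A toy word list, invented for illustration.
words = ["man", "men", "mean", "moon", "maiden", "maintain",
         "morn", "mutton", "woman", "women", "lord"]

def wildcard_hits(pattern, vocabulary):
    """Return the words matched in full by a shell-style wildcard pattern."""
    return [w for w in vocabulary if fnmatchcase(w, pattern)]

print(wildcard_hits("m*n", words))
# ['man', 'men', 'mean', 'moon', 'maiden', 'maintain', 'morn', 'mutton']
print(wildcard_hits("wom*n", words))
# ['woman', 'women']

# A regular expression can pin the middle letter down to a or e.
man_or_men = [w for w in words if re.fullmatch(r"m[ae]n", w)]
print(man_or_men)
# ['man', 'men']
```

The regular expression is one way to get the m*n search down to man and men only; a different corpus might need other spellings added to it.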

What can we learn from a corpus of 400 plays?
For starters, there’s not actually 400 plays in the 400-play-corpus, but 325 plays. I knew when I started this project that this corpus was less than 400, and that it did not cover everything. It is a representative corpus, but I was a bit surprised at how much less than 400 plays I actually had. These 325 plays cover 53 individual authors from the years 1514-1662,* which looks like this:

Each dot represents a year of publication. You will note that some authors are more represented than others (Shirley, for example, has 33 plays in the corpus, spanning a number of years, whereas someone like Beza has only one play in the corpus.) The average year of publication was 1613, and an overwhelming majority of these plays were published from the late 1500s into the first half of the 1600s.

Once I had the raw frequencies for everything, I was curious to see how these terms performed diachronically. For ease I’m going to keep calling it the 400-play corpus, and as you’re reading, remember that this is very quick & dirty. There’s a lot more to say & do with this data, but I think talking about raw data is a useful endeavor in that it speaks volumes about the sample itself.

These graphs suggest that the use of lady and wom*n looks more frequent in the corpus from the late 1500s onwards (they’re both almost in a parabola shape) whereas the use of lord and man begins to decline around 1600, creating more of a bell curve effect.

And what about knave and wench? We see there’s a distinct decrease in usage for both just after the early 1600s, though knave was more frequent earlier in the corpus:

Two of these three sets of binaries show very similar graphs, but that’s because this is raw data: there’s simply more instances of plays occurring around the late 1530s onwards.

This was my first time using R for any graphing ever, so I’m going to dive back in and see what I can do with a more normalized corpus next.

—
Additionally, I owe a great debt to the following people, who were very selfless and helpful:
Sarah Werner, Julia Flanders, Shawn Moore, Douglas Clark, Simon Davies. Thank you.

Counting gender-specific nouns in 400 plays so you don’t have to: research notes

Those of you following me on Twitter will have noticed I’ve been tweeting bite-sized facts about gender in Early Modern London plays. Here are some research notes on what I’ve been doing.

The Corpus.
I have a corpus of ~400 Early Modern London plays, culled from EEBO by someone who is not me, spanning from 1514 to 1662. This almost certainly does not cover every play written in that time, nor does it cover variant editions of these plays. This is meant to be a largely representative corpus: I have all major playwrights, a number of minor ones, and most (but importantly not all) plays written by them; one edition per play. These files have been labeled by canonical generic description (eg comedy, tragedy, history, tragicomedy), year of publication, abbreviated author surname, and a truncated version of the title. All of this metadata has been collected from EEBO, again by that same someone who is not me.

The files themselves have had everything but the words said by characters stripped out. There are no headers (no scene/act denotations) and no character markers. Each word is on its own line, and all spelling has been modernized. Here is a sample, from Kyd’s The Spanish Tragedy:

This, you will note, is not ideal for reading by human eyes. But computers can do some wonderful things with this format.

I’ve been sorting these files into separate folders by author, to get a sense of how many and which plays by which authors I have in my corpus. This is, quite simply, a little more manageable than a running list of plays sorted by genre & date. It also gives me a larger sense of when these authors are working, what generic kinds of plays I have for them, and allows me to have the flexibility to group them in a variety of ways (playhouses associated with specific playwrights, authors who are contemporaries, etc) later on.

My present goal is to get counts of how many times the words lord/lady, man/wom*n, and knave/wench appear in each play in my corpus. Part of the reason I’ve chosen these terms is that they represent a shift from high – neutral – low formality while retaining gender-specific contexts. I could have chosen other ones: I’ve been looking at collocational patterns in Shakespeare using these terms (here are the relevant slides, .pdf) and wanted to get a sense of how these terms are represented in the larger corpus before I do anything else.

I consider this “getting to know my plays” because I’ve been reading as many of these plays as I possibly can, but I have several disadvantages here:
1. I can remember what many of these plays are about, but not the fine level of detail the computer can pull out for me.

2. Some of these plays are very hard to find in print (and, as I’ve shown, they are not in an ideal format for reading). My university no longer subscribes to EEBO, so I don’t actually have access to the original full-text files.

Getting Data on 400 Plays and What To Do With It.
I’ve been running the plays through a concordance program called AntConc to get a visualization of where and how many of these terms appear in each author’s subcorpus. Here’s what Dekker looks like in AntConc’s Concordance Plot viewer:
[Screenshot: AntConc Concordance Plot for the Dekker subcorpus]
Each black line represents one instance of the search term, and is visualized in a linear way (so, from the beginning to end of each play). This is useful in that the software will give me a number of hits in each play AND show me where these words appear in the play-texts. For The Honest Whore, Part 1, there are a few instances of “lady” all at once at the beginning of the play, a few scattered in the middle, another small clump (probably representing a conversation) also in the middle, and a few sparse instances toward the end of the play. I’m doing this mostly to get a sense of where these highly salient words appear and don’t appear, which is very hard to keep track of when you’re reading 400 plays in a traditional, linear fashion. These are words you’d (presumably) expect to find in Early Modern plays, so you’re not really paying much attention to them as a reader.
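For the more computational readers, here is a minimal sketch in Python of the idea behind a concordance plot: find where in the running text each hit falls, then mark those places on a fixed-width line. This is not how AntConc itself is implemented; the function names and the toy "play" are invented for illustration, and I am assuming that a pattern like wom*n can be approximated with shell-style wildcards.

```python
from fnmatch import fnmatchcase


def hit_indices(words, pattern):
    """Return the word offsets at which a (wildcard) pattern matches."""
    return [i for i, w in enumerate(words) if fnmatchcase(w.lower(), pattern)]


def text_plot(indices, total, width=60):
    """Draw a crude one-line plot: '|' marks a hit, '.' marks no hit."""
    cells = ["."] * width
    for i in indices:
        # Scale the word offset onto the fixed-width line.
        cells[i * width // total] = "|"
    return "".join(cells)


# Invented toy text standing in for a one-word-per-line play file.
words = "my lady is here and the lady speaks while the lord waits for his lady".split()
hits = hit_indices(words, "lady")
print(len(hits), text_plot(hits, len(words), width=15))
# prints: 3 .|....|.......|
```

The count gives the number of hits, and the line of marks gives a rough sense of whether they cluster or scatter across the play.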

I record this data by hand in a notebook by author and then manually copy the information into a csv file. While it would be great to essentially have a spreadsheet of all of this information automatically produced, spreadsheets are also not particularly well-designed for human eyes to read. Eventually this will turn into a very nice graph, I’m sure, but in this format, it’s hard to make much sense of it all:
[Screenshot: the csv file of counts]

This is admittedly a little easier:
[Image: scanned notebook page]

There is an easier way to do this for my entire corpus at once in R and – presumably – Python, but quite frankly that would become information overload very quickly. So while some of you more computational people may be wondering why I’m moving at such a seemingly glacial pace, the answer is “because I want to be comfortable with the data and familiar in a way that allows me to think and reflect on it as it comes”, rather than having it all at once. I want to get to know my corpus a little bit more first. Eventually, I’ll be moving into R with this data – but not yet.
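For the curious, the all-at-once version might look something like the following in Python. It is a rough sketch under several assumptions that are mine rather than facts about the corpus: one folder per author, plain-text files ending in .txt, UTF-8 encoding, and whole-word matching with shell-style wildcards standing in for wom*n. The function names and the folder layout are hypothetical.

```python
import csv
from collections import Counter
from fnmatch import fnmatchcase
from pathlib import Path

TERMS = ["lord", "lady", "man", "wom*n", "knave", "wench"]


def count_terms(words, terms=TERMS):
    """Count whole-word matches for each search term in a list of words."""
    counts = Counter()
    for word in words:
        w = word.strip().lower()
        for term in terms:
            if fnmatchcase(w, term):
                counts[term] += 1
    return {term: counts[term] for term in terms}


def read_play(path):
    """Read a one-word-per-line file into a list of words."""
    return Path(path).read_text(encoding="utf-8").split()


def corpus_to_csv(corpus_dir, out_path, terms=TERMS):
    """Write one csv row per play, assuming a corpus_dir/author/play.txt layout."""
    with open(out_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["author", "play"] + list(terms))
        for path in sorted(Path(corpus_dir).glob("*/*.txt")):
            counts = count_terms(read_play(path), terms)
            writer.writerow([path.parent.name, path.stem] + [counts[t] for t in terms])


print(count_terms("the lord and lady saw a woman and two women and a man".split()))
```

Note that whole-word matching means "woman" is not counted as a hit for "man", and plurals such as "lords" are not counted for "lord"; whether that is the right choice depends on the question being asked.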

When I’m done I will be making the csv file available, and will hopefully be posting a write-up here. Thanks for your patience. In the meantime, here’s the csv file for all of Shakespeare (from the Globe Shakespeare, 1841) organized by genre (comedy, history, late plays, tragedies).

Does Shakespeare pass the Bechdel Test?

The Bechdel Test is a measure of how male and female characters are portrayed in cinema and other media. A piece passes the Bechdel test if it:

a) has at least two women in it
b) who talk to each other about something besides a man.

That’s it. Pretty simple, right? Not a lot of contemporary media passes the Bechdel test, rather alarmingly. While I was working out proportions of male and female characters in Shakespeare, I got a number of questions about whether or not Shakespeare will pass. I went looking to see if anyone else had approached this question before. Someone has, but at the time of writing this, their website is down for maintenance.

I have already shown that all of Shakespeare’s plays have 2 or more female characters. But what about “talking to each other about something other than a man”?

I began by searching in WordHoard for all examples of characters with the gender of female who use the lemma form she. In essence I am doing this analysis backwards: I’m asking if there are female characters who talk about something other than a man, then seeing if plays which pass this aspect of the test also feature a female character talking to another female character. If a male character was referred to within a window of 7 words to the left or right, in a way that indisputably linked the discussion of the female character to him, the play failed this part of the test.
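The window check can be roughed out in Python as below. This is only a sketch of the mechanical part: it matches the surface form "she" rather than the lemma (so it would miss "her"), it assumes punctuation has already been stripped, and the list of male references is a hypothetical stand-in. The actual judgment about whether a reference "indisputably" links the two characters was made by reading each example, which no word list can replace.

```python
def windows(tokens, target, span=7):
    """Yield (index, context) for each hit, with up to `span` words on each side."""
    for i, tok in enumerate(tokens):
        if tok.lower() == target:
            left = tokens[max(0, i - span):i]
            right = tokens[i + 1:i + 1 + span]
            yield i, left + right


def mentions_any(context, names):
    """Return the names (lowercased, sorted) that appear in a context window."""
    return sorted({t.lower() for t in context} & names)


# The King Lear example, with punctuation removed by hand.
line = "Why should she write to Edmund".split()
male_refs = {"edmund", "he", "him", "his"}  # hypothetical list, for illustration only

for index, context in windows(line, "she"):
    print(index, mentions_any(context, male_refs))
# prints: 2 ['edmund']
```

A non-empty result only flags a passage for closer reading; an empty one does not prove that no man is being discussed.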

WordHoard highlights the place in the play where each instance of the lemma she appears; these examples can be cross-referenced by clicking each individual example to call them up in the context of the play by act and scene.

King Lear, for example, fails, with “Why should she write to Edmund?” (IV.v.19)

Titus Andronicus might pass the first part of the test, though:
These examples do not show any female character talking about another female character in explicit reference to a man. Male characters (lords) are alluded to, but I read them as not directly implicated in the newborn baby the Nurse speaks to Aaron about – though you may disagree.

The first cull – do female characters in Shakespeare talk about something other than a man? – left me with the following plays:
Winter’s Tale, Pericles, Macbeth, 2 Henry 6, King John, 2 Henry 4, 1 Henry 6, Tempest, Henry 5, and I’m going to include Titus Andronicus.

1 Henry 4, Richard 2 and Julius Caesar had no examples of the lemma form she, so I will address them here as well.

The next question is “do female characters talk to other female characters in the play?”
Open Source Shakespeare allows you to isolate characters’ speeches by name – and gives you the option to show cue speeches and the ability to see these speeches in the context of the play. They have been linked where appropriate.

The Winter’s Tale does not pass the test. Although Emilia and Paulina are talking to each other, they are talking about the king in Act 2 Scene 2.

Pericles does not pass the test: Leonine and Marina are talking to each other, but about Marina’s father (scroll up just slightly from where this link will take you) in Act 4 Scene 1.

Macbeth does not pass the test either, as the Gentlewoman talks about Lady Macbeth, but to the Doctor, who is presumably male, in Act 5 Scene 1.

2 Henry 6 does not pass the test, as the female characters do not talk to each other.

King John does not pass, because of an interchange between Constance and Queen Elinor in Act 2 Scene 1, in which they discuss John, Elinor’s son.

2 Henry 4 also does not pass, for two reasons: one, this interchange between Lady Northumberland and Lady Percy has them talking about the King in Act 2 Scene 3, and two, because of this interchange between Doll Tearsheet and Hostess Quickly from Act 2 Scene 4, in reference to Pistol.

1 Henry 6 does not pass the test because the female characters do not talk to each other.

The Tempest also does not pass the test because the female characters do not talk to each other. (I am considering Ariel a female character here; this is still very much up for debate, and this may automatically disqualify The Tempest overall.) Miranda and Ariel are not in conversation.

Henry V does pass the Bechdel Test, due to this discussion (in French) between Katherine and Alice from Act 3 Scene 4.

Titus Andronicus ultimately does not pass the test due to this conversation between Tamora, Lavinia and Bassianus in Act 2 Scene 3.

1 Henry 4 does not pass because the female characters do not talk to each other.

Richard 2 passes because the Queen and her ladies “are carefully not talking about Richard” as @angevin2 kindly points out; they are instead talking about garden sports in Act 3 Scene 4.

Julius Caesar does not pass because the female characters do not talk to each other.

By and large, Shakespeare does not pass the Bechdel test: but two plays do – and they are not the plays I would ever have expected. However, I should point out I might be wrong here: like I said above, I did this backwards, by finding plays that had female characters talking without mentioning male characters, then checking to see if these plays did show two female characters in conversation. If you have a better solution for finding out if Shakespeare passes the Bechdel test, I am all ears!

EDIT (18 June 2015)
Some recommended further reading:
Selisker, Scott. (2014) “Literary Data and the Bechdel Test”, from the What Is Data in Literary Studies? colloquy, Modern Language Association annual meeting, Chicago, IL.

Mariani, Daniel. (2013) “Visualizing The Bechdel Test”. Ten Chocolate Sundaes blog post, 24 June 2013.

Agarwal et al (2015) “Key Female Characters in Film Have More to Talk About Besides Men: Automating the Bechdel Test”. Human Language Technologies: The 2015 Annual Conference of the North American Chapter of the ACL, pp. 830–840, Denver, Colorado, May 31 – June 5, 2015.

How much do female characters in Shakespeare actually say?

Recently I suggested there might be 147 female characters in Shakespeare. If we are to trust that, how do they break down by play? I used the Open Source Shakespeare genre distinctions to categorize each play and the female-character categorizations from WordHoard to produce the following: In this graph, green represents comedy, black represents history, and red represents tragedy. As you will recall from my previous post, The Winter’s Tale has the most female characters, and 1H4, Julius Caesar, and Tempest have the fewest female characters.

17 out of 37 plays have four female characters. This makes sense, as the Early Modern theatre could hire two boys to cover all female roles, although this would obviously limit the characters who could then speak to each other. More female characters required either more boys, or for each boy-actor to take on more parts (which would again limit the amount these characters could speak to each other).

But how much do these characters talk? Or, in other words, how much of each play is made up of words said by female characters? To do that, I’d first have to find how many words were in each play, and how many of those words were said by female characters. I already had made note of how many words were said by female characters in each play from my previous post, but I didn’t have the total number of words in each play.

I returned to WordHoard’s find words function to get a word-count according to the software’s own encoded edition of each play: With this information, I was now able to produce the following graph. Again, green represents comedy, black represents history, and red represents tragedy; the shape of each mark on the graph represents how many female characters are in each play:

Female characters in As You Like It say the most out of all the female characters in Shakespeare (but that number includes Rosalind/Ganymede) with 8,643 words spoken out of 21,298 total words in the play. Female characters in Timon of Athens say the least, with 61 words out of 17,744 total words in the play. On the whole, while there may be slightly more female characters in comedies, the number of words they actually speak is highly variable, whereas the histories seem to show the least variation. I had also taken the average of all female characters in each genre and found that comedies had an average of 4.07 female characters; histories, an average of 4.083 female characters; and tragedies had an average of 3.72 female characters – suggesting that the history plays may be the most stable out of the three categories for female characters, which is interesting. If you are interested in which female characters say the most words, please click here for the relevant image.
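To put those two extremes on the same scale, the word counts can be turned into a share of each play. The short Python sketch below uses only the four figures quoted above; the function name is my own.

```python
def share(part, whole):
    """Return part as a percentage of whole."""
    return 100 * part / whole


# (words spoken by female characters, total words in the play), as quoted above
plays = {
    "As You Like It": (8643, 21298),
    "Timon of Athens": (61, 17744),
}

for title, (female_words, total_words) in plays.items():
    print(f"{title}: {share(female_words, total_words):.1f}%")
# prints:
# As You Like It: 40.6%
# Timon of Athens: 0.3%
```

So even in the play where female characters say the most, they account for roughly two fifths of the words.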

A number of people have asked me if Shakespeare passes the Bechdel test: I’m working on it! Stay tuned…

How many female characters are there in Shakespeare?

This was a fairly straightforward question I found myself asking recently for a footnote. Easy, I thought. I’ll go find a list of characters, count up the female ones, subtract them from the total number of characters, and I’ll have my answer. Though I could have picked up my Complete Works of Shakespeare and started counting from the dramatis personae for each play, I didn’t – because I knew that this information had been encoded before. Gender of characters is something that is often encoded in metadata (there’s a TEI category for gender), and character lists are easy to obtain.

I started with Open Source Shakespeare’s list of characters, which lists 1222 total characters in 37 plays. This list included variations of “all”, from many plays:

So, these instances of “all” aren’t really individual characters. However, the rest of this list contained every single character in all the plays, and that was something I could work with. If there are 1222 total “characters”, minus 31 instances of “alls”, there are 1191 individual characters. From there I could either put each of the 1191 individual characters in a box labeled “male”, “female” or “unknown, ambiguous or mixed”, or I could ask another program to do it for me.

I opened WordHoard and asked it to Find Words by Speaker Gender, which would account for those three categories. WordHoard covers all of the same plays as Open Source Shakespeare.

Intuition tells me that it will be an easier task for a computer to isolate female characters than it will be to isolate male characters, so I select “female”, and click “find”. A few minutes later, WordHoard produces the total words spoken by all female characters in each play – and I add the criteria to show “words by speaker name”. My screen looks like this (click to make bigger):
Counting each character, I reach a total of 147 female characters in all of Shakespeare, which out of our 1191 characters amounts to about 12% of all the characters in Shakespeare. Winter’s Tale has the most female characters (8); Tempest and Henry IV part 1 have the fewest (2). But that depends on whether or not Ferdinand counts as a female character, as WordHoard has it; if he does not, Tempest has only one female character. The Young Son in Richard III is deemed female. Macbeth has 7 female characters, but that includes the Witches:

I don’t particularly think that the Witches count as female; I would have been happier to see them as “unknown, mixed, or ambiguous”. How do we know if a character is really female? I could give the Open Source Shakespeare list to any Shakespeare scholar and they could come up with a different count by gender. WordHoard, though, classes Rosalind, Viola, Ferdinand, and the Witches as female characters and treats them universally throughout its system as being female. The benefit of this is that they cannot ever suddenly change categories within the structure of the program, though you may not necessarily agree with the way it has categorized them.

According to my numbers, I had 1044 characters left, covering “male” and “ambiguous”. I was curious as to what counts as “unknown, mixed, or ambiguous” according to WordHoard. (again, click to make bigger):

Interestingly, characters who count as “gender-ambiguous”, according to WordHoard, include the actors Mustardseed, Peaseblossom, Cobweb and Moth from A Midsummer Night’s Dream. I disagreed with this distinction: if they are ambiguous, surely the Witches should be as well? A number of examples here include the aforementioned “alls” and a number of ghosts or apparitions (“Ghosts of Others Murdered By Richard III” was my personal favorite). This raises more questions: Should apparitions and spirits get their own gender category? Are they gendered? What counts as “gendered”?

Ultimately I counted and removed all the “all”s – which here totals 17, and is in disagreement with the Open Source Shakespeare count. Had I been doing this by hand, I might have counted instances of two or more characters speaking together as “alls”, but WordHoard isn’t counting this information – WordHoard is merely counting the total number of words for each character, here marked as “all”, whereas if two characters say something at the same time they may not be marked as “all”.

This left 46 total ambiguous characters, covering characters such as servants, attendants, various apparitions, and the actors from A Midsummer Night’s Dream, and accounts for about 4% of the characters in the Shakespeare corpus. The 17 Alls accounted for about 1% of the corpus, leaving 998 male characters or about 83% of the corpus.

So, in review: how many female characters are there in Shakespeare? It’s hard to say, but one answer is 147.