Counting gender-specific nouns in 400 plays so you don’t have to: research notes

Those of you following me on Twitter will have noticed I’ve been tweeting bite-sized facts about gender in Early Modern London plays. Here are some research notes on what I’ve been doing.

The Corpus.
I have a corpus of ~400 Early Modern London plays, culled from EEBO by someone who is not me, spanning from 1514 to 1662. This almost certainly does not cover every play written in that time, nor does it cover variant editions of these plays. It is meant to be a largely representative corpus: I have all major playwrights, a number of minor ones, and most (but importantly not all) of the plays they wrote, with one edition per play. These files have been labeled by canonical generic description (e.g. comedy, tragedy, history, tragicomedy), year of publication, abbreviated author surname, and a truncated version of the title. All of this metadata has been collected from EEBO, again by that same someone who is not me.

The files themselves have had everything but the words said by characters stripped out. There are no headers (no scene/act denotations) and no character markers. Each word is on its own line, and all spelling has been modernized. Here is a sample, from Kyd’s The Spanish Tragedy:
[Image: sample of the one-word-per-line text of The Spanish Tragedy]
This, you will note, is not ideal for reading by human eyes. But computers can do some wonderful things with this format.

I’ve been sorting these files into separate folders by author, to get a sense of how many and which plays by which authors I have in my corpus. This is, quite simply, a little more manageable than a running list of plays sorted by genre & date. It also gives me a larger sense of when these authors are working, what generic kinds of plays I have for them, and allows me to have the flexibility to group them in a variety of ways (playhouses associated with specific playwrights, authors who are contemporaries, etc) later on.

My present goal is to get counts of how many times the words lord/lady, man/wom*n, and knave/wench appear in each play in my corpus. Part of the reason I’ve chosen these terms is that they represent a shift from high – neutral – low formality while retaining gender-specific contexts. I could have chosen other ones, but I’ve been looking at collocational patterns in Shakespeare using these terms (here are the relevant slides, .pdf) and wanted to get a sense of how these terms are represented in the larger corpus before I do anything else.
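A minimal Python sketch of this kind of count, for readers who want to try it on files in the one-word-per-line format described above. The function names are made up for illustration, and it assumes exact, case-insensitive matches (so "lords" or "ladies" would not be counted) with `*` standing for any run of characters, which means "wom*n" picks up both "woman" and "women" but would also pick up any longer word that starts with "wom" and ends in "n".

```python
from collections import Counter
from fnmatch import fnmatchcase

# Search terms from the post; "wom*n" is a wildcard covering woman/women.
PATTERNS = ["lord", "lady", "man", "wom*n", "knave", "wench"]


def read_play(path):
    """Read a play stored one word per line into a list of words."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]


def count_terms(words, patterns=PATTERNS):
    """Count how many words match each pattern (whole word, case-insensitive)."""
    counts = Counter({p: 0 for p in patterns})
    for word in words:
        token = word.strip().lower()
        for p in patterns:
            if fnmatchcase(token, p):
                counts[p] += 1
    return counts


# Tiny made-up word list standing in for a play file.
sample = ["My", "lord", "the", "lady", "and", "the", "women",
          "saw", "a", "woman", "Lord"]
print(count_terms(sample))
```

For a real file, `count_terms(read_play("some_play.txt"))` would do the same thing, where the file name is a placeholder.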

I consider this “getting to know my plays” because I’ve been reading as many of these plays as I possibly can, but I have several disadvantages here:
1. I can remember what many of these plays are about, but not the fine level of detail the computer can pull out for me.

2. Some of these plays are very hard to find in print (and, as I’ve shown, they are not in an ideal format for reading). My university no longer subscribes to EEBO, so I don’t actually have access to the original full-text files.

Getting Data on 400 Plays and What To Do With It.
I’ve been running the plays through a concordance program called AntConc to get a visualization of where and how often these terms appear in each author subcorpus. Here’s what Dekker looks like in AntConc’s Concordance Plot viewer:
[Screenshot: AntConc Concordance Plot for the Dekker subcorpus]
Each black line represents one instance of the search term, plotted linearly from the beginning to the end of each play. This is useful because the software gives me the number of hits in each play AND shows me where these words appear in the play-texts. For The Honest Whore, Part 1, there are a few instances of “lady” all at once at the beginning of the play, a few scattered through the middle, another small clump (probably representing a conversation) also in the middle, and a few sparse instances toward the end. I’m doing this mostly to get a sense of where these highly salient words do and don’t appear, which is very hard to keep track of when you’re reading 400 plays in a traditional, linear fashion. These are words you’d (presumably) expect to find in Early Modern plays, so you’re not really paying much attention to them as a reader.
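The idea behind a plot like this can be sketched in a few lines of Python. This is not how AntConc itself is implemented, just a rough text-only imitation under the assumption that a play is a list of words in order; the function names are invented for the example.

```python
from fnmatch import fnmatchcase


def hit_indices(words, pattern):
    """Return the index of every word that matches the pattern."""
    return [i for i, w in enumerate(words) if fnmatchcase(w.lower(), pattern)]


def plot_line(words, pattern, width=60):
    """Draw hits as '|' on a fixed-width line; the play starts at the left."""
    cells = ["."] * width
    total = len(words)
    for i in hit_indices(words, pattern):
        # Scale the word index down to a column on the line.
        cells[i * width // total] = "|"
    return "".join(cells)


# Toy "play" of 20 words: "lady" twice at the start and once near the end.
toy = ["lady", "lady"] + ["word"] * 15 + ["lady"] + ["word"] * 2
print(len(hit_indices(toy, "lady")), "hits")
print(plot_line(toy, "lady", width=20))
```

Hits that fall in the same column collapse into a single mark, so a dense clump shows up as one bar; the hit count comes from `hit_indices`, not from counting bars.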

I record this data by hand in a notebook by author and then manually copy the information into a csv file. While it would be great to essentially have a spreadsheet of all of this information automatically produced, spreadsheets are also not particularly well-designed for human eyes to read. Eventually this will turn into a very nice graph, I’m sure, but in this format, it’s hard to make much sense of it all:
[Screenshot: the csv file of counts opened as a spreadsheet]

This is admittedly a little easier:
[Scan: notebook page of hand-recorded counts]

There is an easier way to do this for my entire corpus at once in R and, presumably, Python, but quite frankly that would become information overload very quickly. So while some of you more computational people may be wondering why I’m moving at such a seemingly glacial pace, the answer is “because I want to be comfortable with the data, and familiar with it in a way that allows me to think and reflect on it as it comes,” rather than having it all at once. I want to get to know my corpus a little bit more first. Eventually I’ll be moving into R with this data, but not yet.
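For the more computational readers, the all-at-once version might look roughly like the Python sketch below. It is not the workflow used in this post. It assumes a hypothetical layout of one folder per author containing one .txt file per play (one word per line), and the column names and function names are invented for the example.

```python
import csv
from fnmatch import fnmatchcase
from pathlib import Path

PATTERNS = ["lord", "lady", "man", "wom*n", "knave", "wench"]


def count_play(path, patterns=PATTERNS):
    """Count pattern matches in one play file (one word per line)."""
    counts = dict.fromkeys(patterns, 0)
    with open(path, encoding="utf-8") as f:
        for line in f:
            token = line.strip().lower()
            for p in patterns:
                if token and fnmatchcase(token, p):
                    counts[p] += 1
    return counts


def corpus_to_csv(corpus_dir, out_path, patterns=PATTERNS):
    """Write one CSV row per play: author folder, file name, then the counts."""
    with open(out_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.DictWriter(out, fieldnames=["author", "play", *patterns])
        writer.writeheader()
        # Assumed layout: corpus_dir/<author>/<play>.txt
        for path in sorted(Path(corpus_dir).glob("*/*.txt")):
            row = {"author": path.parent.name, "play": path.stem}
            row.update(count_play(path, patterns))
            writer.writerow(row)
```

Calling something like `corpus_to_csv("plays", "counts.csv")` (both names are placeholders) would produce the whole table in one go, which is exactly the information overload described above.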

When I’m done I will be making the csv file available, and will hopefully be posting a write-up here. Thanks for your patience. In the meantime, here’s the csv file for all of Shakespeare (from the Globe Shakespeare, 1841) organized by genre (comedy, history, late plays, tragedies).