A cautionary concordance plot tale

In my previous post I addressed how to produce a view of many concordance plots at once, and presented concordance plots for twelve vocatives which are indicative of social class in Shakespeare and a larger reference corpus of Early Modern Drama.

After double-checking all the concordance plot files using a hand-numbered master sheet, I normalised the files using the command convert plot*.jpg -size 415x47! plot*.jpg (on the off chance that any files weren’t ultimately the same size), created a new folder of the normalised files, and pulled out the examples which matched the numbers I had for Shakespeare’s plays for further analysis. I hadn’t addressed titles, as I wasn’t really aiming to look at individual authors, so each file is named plot1, plot2, plot234, etc. I went on to compile the results for these plays, felt confident about the fact that I had isolated Shakespeare, and wrote up my previous blog post.

This morning I had a nagging thought: What if those weren’t Shakespeare’s plays? After all, I had broken my #1 rule about using computational methods by assuming that everything at every step of the process worked the way I thought it did. I am probably a self-parodying pedant when it comes to computational methods, because when something goes wrong at some stage in the coding process it may *never* be visible or even noticed in the final output, and this gives me reason to seriously distrust automated processes for analysis.

Ultimately, I decided I would double-check the plays I had deemed to be “Shakespeare’s”. Even though I hadn’t done much automated processing with the image files, I had assumed that the normalisation process would only change the file names to represent a modified version: so that plot10 would become plot0-10, plot11 would become plot0-11, plot234 would become plot0-234. I had assumed the information in these files wouldn’t change, and the names would correspond to the original files.

This was not true. Instead, I had isolated a very nice sample of 36 plays which I thought matched Shakespeare’s plays in numbering, but turned out to be sampled from throughout the corpus. Matching the sampled “Shakespeare” concordance plots to the master document of concordance plots, I found that I had at least one Middleton play and at least one Seneca play in addition to some (but not all) Shakespeare plays.[1] At this point I was worried, so I re-created Shakespeare’s concordance plots from the master document of concordance plots. By redoing the concordance plots, I could guarantee that these were at least all Shakespeare’s plays in the first instance. Then I normalised them again for size, and went back to see what happens in that process. The first files were a perfect match, as I had hoped. But once I moved to the second concordance plot, I was in trouble.

Below is an image showing the unmodified concordance plot for The Taming of the Shrew (shx2), outlined in red and on the top left-hand side. The other eight concordance plots in this image are normalised for size, and even without great detail you can tell that none of these match the original file. You don’t even need to see the whole image to see this:

Screen Shot 2015-02-26 at 3.26.03

In other words, as I had suspected, the names of the normalised files didn’t correspond to the original file names, though they were all there.[2] More worryingly, I hadn’t caught it because I had assumed that the files were fine after running a process on them. The files produced results, and if I hadn’t double-checked (really, at this point, triple-checked), I wouldn’t have caught this discrepancy.
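One plausible explanation for a mismatch like this, which is not confirmed here against the actual files, is ordering: a shell usually expands plot*.jpg as a plain string sort, so plot10 comes before plot2, and any tool that then numbers its outputs by position will not reproduce the original numbers. A minimal Python sketch of the effect, using made-up file names:

```python
import re

def glob_order(names):
    """Order names the way a shell glob typically does: a plain string sort."""
    return sorted(names)

def numeric_order(names):
    """Order names by the number embedded in them (plot2 before plot10)."""
    return sorted(names, key=lambda n: int(re.search(r"\d+", n).group()))

def position_to_name(names):
    """Map each output position to the input file that would land there."""
    return dict(enumerate(glob_order(names)))

# Hypothetical file names, not the real corpus listing.
plots = ["plot1.jpg", "plot2.jpg", "plot3.jpg", "plot10.jpg", "plot11.jpg"]

print(glob_order(plots))        # string sort: plot10 and plot11 jump the queue
print(numeric_order(plots))     # the order a hand-numbered master sheet follows
print(position_to_name(plots))  # which file ends up at which position
```

Writing down a mapping like this before running a batch command, or processing one file at a time with an explicit output name, makes it possible to trace every result back to its source.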

So what do concordance plots for Shakespeare’s plays look like in composite for the vocatives attached to a name in a bigram (reminder of search terms: lord [A-Z]|sir [A-Z]|master [A-Z]|duke [A-Z]|earl [A-Z]|king [A-Z]|signior [A-Z]|lady [A-Z]|mistress [A-Z]|madam [A-Z]|queen [A-Z]|dame [A-Z])? Well, surprisingly, not so different from the sample curated previously, which may be less indicative of a specific authorial style:


Remember, we read these from left to right; now there’s a lot of use of vocatives in the very beginning of the plays, which stay quite strong near the rising action until there’s a relative absence just before and around the climax and the start of the falling action. Curiously, the heavy double hit || towards the end is still very visible, as well as a few more dark lines leading up to the conclusion. In some ways, the absence of these vocatives is almost more consistent, and therefore the white bits are more visible.

In the meantime I’m having a fascinating discussion with Lauren Ackerman about how to best address pixel density and depth of detail (especially in the larger EM play corpus), so maybe there will be a third instalment of concordance plots in the future.

[1] Seneca’s plays were published in the 1550s and 1560s, which is why they are included in this data set of printed plays in Early Modern London.

[2] The benefit of working with a smaller set like this is that there is a much smaller, finite number of texts to address: rather than n = 332^332 possible combinations, I was now only looking at a possibility of n = 36^36. So that was an improvement. In case you’re wondering what happened to one play, because previously I had claimed there were 37 Shakespeare plays: one play doesn’t have any instances of the vocatives being addressed in a bigram with a capital letter.

How to address many concordance plots at once

What if you could take many concordance plots and layer them to get a composite view of many concordance plots in one image? I wanted to see if vocatives which mark for high-status individuals attached to a name appear in any particular pattern which resembles Freytag’s model of dramatic structure.[1]

I selected 12 vocatives which clearly illustrate social class attached to a word beginning with a capital letter for analysis, all of which are relatively frequent in the corpus of 332 plays comprising 7,305,366 words. In order to get my concordance plots for vocatives attached to a name, I used regular expressions searching for the vocative in question in a bigram with a capital letter, strung together by pipes, so the resulting search looked like this (signior is spelled incorrectly; this is the spelling which produced hits – I suspect something happened in the spelling normalisation stage):
lord [A-Z]|sir [A-Z]|master [A-Z]|duke [A-Z]|earl [A-Z]|king [A-Z]|signior [A-Z]|lady [A-Z]|mistress [A-Z]|madam [A-Z]|queen [A-Z]|dame [A-Z]
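For anyone who wants to try the same search outside AntConc, here is a minimal sketch using Python’s re module. The sample sentence is invented, and it is an assumption that AntConc treats letter case the same way Python does here. Note that the pattern has no word boundaries, so in Python a string such as “asking Tom” would also match as “king T”.

```python
import re

# The search string from the post, copied verbatim.
VOCATIVES = (
    r"lord [A-Z]|sir [A-Z]|master [A-Z]|duke [A-Z]|earl [A-Z]|king [A-Z]|"
    r"signior [A-Z]|lady [A-Z]|mistress [A-Z]|madam [A-Z]|queen [A-Z]|dame [A-Z]"
)

def find_hits(text):
    """Return (relative position, matched text) for every hit in the text."""
    length = len(text)
    return [(m.start() / length, m.group()) for m in re.finditer(VOCATIVES, text)]

# Invented sample sentence, not taken from the corpus.
sample = "Good sir Toby, my lord Hamlet says the queen I serve is no lady at all."

for position, hit in find_hits(sample):
    # position runs from 0 (start of text) to 1 (end), like a concordance plot
    print(f"{position:.2f}  {hit}")
```

The relative position is what a length-normalised concordance plot draws: each hit becomes one | at that fraction of the way through the text.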

Although the regular expression I used picked up examples of queen I and the like, the examples of a capital letter representing the start of a name were far more frequent overall. In the case of mistress, Alison Findlay’s definition (“usually a first name or surname, is a form of polite address to a married woman, or an unmarried woman or girl” (2010, 271) ) accounts for its inclusion here. Though there are certainly complicated readings of this title, I consider instances of mistress to be at the very least a vocative relating to social class in Early Modern England.

The obvious solution to doing this kind of work is R, as people such as Douglas Duhaime and Ted Underwood have been making some gorgeous composite graphs with R for a number of years. To be honest, I didn’t really want to go through the process of addressing a corpus by writing an entire script to produce something that I know can be done quickly and easily in AntConc‘s concordance plot view: I had one specific need; AntConc is an existing framework for producing concordance plots which are normalised for length, as well as a KWIC viewer and several other statistical analyses. I knew that if I wanted to check anything, I could do it easily. I didn’t feel any real need to reinvent the wheel by scripting to accomplish my task, unlike the general DIY process presented by R or Python.[2] The only real downside is that if you want to do more with the output, you have to move into another software package to do that, but even that is not the end of the world.

Ultimately, what I wanted to do was take concordance plots for 332 plays and layer them for a composite picture of how they appear, rather than address them as individual views on a play-by-play basis. Layering images is a common way of addressing edits in printed books; Chris Forster has done exactly that with magazine page size; he suggested I use ImageMagick, a command line processing tool for image compositing.[3] I have a similarly normalised view of texts at my disposal, as each concordance line is normalised for length. Moreover, Chris and I are of the same mind when it comes to not introducing more complicated software for the sake of using software, so when he told me about this I was willing to give it a try, especially as he has successfully done exactly what I was trying to do. But first I needed concordance plots.

AntConc produces concordance plots but won’t export them, which is annoying but not as annoying as you may think. 38 screen grabs later, I had .pngs of each play’s concordance line. Here they are in AntConc:

womp womp

(If you’re not used to reading concordance lines, you read them from left to right (from “start” to “finish”, in narrative terms); each | = 1 hit; the more hits closer together, the darker the line will look.)

I turned these screenshots into a very large jpg with the help of an open source image editing program, just to have them all in one document together. The most well-known is probably GIMP, but both Lifehacker and Oliver Mason offer Seashore as a more Mac-friendly alternative to GIMP.[4]
Then I broke the master document into individual concordance plots, sized 415×47, using Seashore’s really good select-copy-make new document from pasteboard option, which lets you keep and move the select box around the master document, as seen below.

Screen Shot 2015-02-23 at 12.15.10

So far I have only used regular expressions, command-shift-4, copy, paste, save as .jpg, and pen & paper to record what I was doing. Nothing complicated! It took a while, but in the process I got to know these results really well. Not all the plays in the corpus contain all, or in fact, any, of each vocative: in some instances, there are plays that didn’t use any of the above titles, and aren’t included in this output; some plays only use one vocative out of the twelve investigated or any combination of vocatives which do not represent the full twelve.

As a test, I separated out Shakespeare’s plays to see what a bunch of concordance plots looked like in composite. To do this, I opened a terminal and moved to the correct directory, which meant moving through 6 directories. Then I normalised everything to the same size with
convert plot*.jpg -size 415x47! plot*.jpg, just in case.
I put those in a new folder of normalised images.
Then, from the directory of normalised images: convert plot*.jpg -evaluate-sequence mean average_page.jpg.
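As far as I understand it, -evaluate-sequence mean averages the images pixel by pixel, which is also why every plot has to be the same size first. The idea can be sketched in plain Python with tiny made-up “plots” standing in for the real 415×47 images; mean_composite is a hypothetical helper, not part of ImageMagick.

```python
def mean_composite(plots):
    """Average several same-sized greyscale plots pixel by pixel.

    Each plot is a list of rows; each row is a list of values
    from 0 (black) to 255 (white).
    """
    if not plots:
        raise ValueError("need at least one plot")
    height, width = len(plots[0]), len(plots[0][0])
    for plot in plots:
        if len(plot) != height or any(len(row) != width for row in plot):
            raise ValueError("all plots must have the same dimensions")
    return [
        [sum(plot[y][x] for plot in plots) / len(plots) for x in range(width)]
        for y in range(height)
    ]

# Three made-up one-row "plots", four pixels wide: 0 marks a hit, 255 is blank.
play_a = [[0, 255, 255, 0]]
play_b = [[0, 255, 0, 255]]
play_c = [[0, 255, 255, 255]]

print(mean_composite([play_a, play_b, play_c]))
```

A position where every play has a hit stays black, a position where no play has a hit stays white, and everything in between comes out as a shade of grey.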

Here’s what 12 vocatives for social class in 37 Shakespeare plays look like in composite:
average shx play_monochrome
There are a few things that I notice in this plot: There’s a quick use of naming vocatives near the beginning of the plays, a relative absence immediately after, but during the rising action and climax there are clear sections which use these vocatives quite heavily, especially in the build-up to the climax. Usage drops in the falling action, until just before the denouement; there is a point where vocatives are used quite consistently heavily, marked by || but surrounded by white on both sides. If you can’t see it, here is the concordance plot again, with that point highlighted in red.

If you repeat the above process for the 332 plays, you get the following composite image. Although the amount of information in some ways obfuscates what you’re trying to see, there are darker and lighter bits to this image.
avg EM drama play_monochrome
Most notably, there is a similar cluster of class-status vocative use at the tail end of the introduction and into the rising action, a relative absence until the climax, and then the use of vocatives for social class seems to pick up towards the falling action and end of the plays. Interestingly, the same kind of || notation is visible towards the conclusion, though it reduplicates itself twice. (Again, if you can’t see it, I’ve highlighted it in red here).

Now to address the details of these plots… But you should also read the follow-up post about this.

tl;dr version:

Can you look at many concordance plots at the same time? Yes.

Do vocatives attached to a name which mark for class status have recognizable patterns in dramatic structure? MAYBE.

[1] Matthew Jockers (1, 2) and Benjamin Schmidt have been doing interesting things with regards to computationally analyzing dramatic structure. I’m not going anywhere near their levels of engagement with dramatic arcs in this post, but they are interesting reads nonetheless. (Followup: Annie Swafford’s blog post on Jockers’ analyses is worth a read as well)

[2] If you particularly enjoy using R to achieve relatively simple tasks like concordance plots, Stefan Gries’ 2009 cookbook Quantitative corpus linguistics with R: a practical introduction and Matthew Jockers’ 2014 cookbook Text Analysis with R for Students of Literature both outline how to do this.

[3] Okay, so this required a few more steps of code, most of which were install scripts which require very little work on the human end beyond following directions of ‘type this, wait for computer to return the input command’. If you are on a Mac, you will need to get Xcode to download MacPorts to download ImageMagick, and then X11 to display output. X11 seems optional, especially if you keep your Finder window open nearby. Setting all this up took about two hours.

[4] It transpired that I could have done this with ImageMagick using the command
convert -append plots*.png out.png.
Oh well. Seashore also offers layering capabilities for the more graphic design driven amongst you but perhaps more importantly for me, it looks a lot like my dearly beloved MS Paint, a piece of software I’ve been trying to find a suitable replacement for since I joined The Cult of Mac in 2006.

How much do female characters in Shakespeare actually say?

Recently I suggested there might be 147 female characters in Shakespeare. If we are to trust that, how do they break down by play? I used the Open Source Shakespeare genre distinctions to categorize each play and the female-character categorizations from WordHoard to produce the following:

Screen shot 2013-02-17 at 9.16.43

In this graph, green represents comedy, black represents history, and red represents tragedy. As you will recall from my previous post, The Winter’s Tale has the most female characters, and 1H4, Julius Caesar, and Tempest have the fewest female characters.

17 out of 37 plays have four female characters. This makes sense, as the Early Modern theatre could hire two boys to cover all female roles, although this would obviously limit the characters who could then speak to each other. More female characters required either more boys, or for each boy-actor to take on more parts (which would again limit the amount these characters could speak to each other).

But how much do these characters talk? Or, in other words, how much of each play is made up of words said by female characters? To do that, I’d first have to find how many words were in each play, and how many of those words were said by female characters. I already had made note of how many words were said by female characters in each play from my previous post, but I didn’t have the total number of words in each play.

I returned to WordHoard’s find words function to get a word-count according to the software’s own encoded edition of each play:

Screen shot 2013-02-19 at 2.12.20

With this information, I was now able to produce the following graph. Again, green represents comedy, black represents history, and red represents tragedy; the shape of each mark on the graph represents how many female characters are in each play:

Screen shot 2013-02-19 at 3.34.54

Female characters in As You Like It say the most out of all the female characters in Shakespeare (but that number includes Rosalind/Ganymede) with 8,643 words spoken out of 21,298 total words in the play. Female characters in Timon of Athens say the least, with 61 words out of 17,744 total words in the play. On the whole, while there may be slightly more female characters in comedies, the number of words they actually speak is highly variable, whereas the histories seem to show the least variation. I had also taken the average of all female characters in each genre and found that comedies had an average of 4.07 female characters; histories, an average of 4.083 female characters; and tragedies had an average of 3.72 female characters – suggesting that the history plays may be the most stable out of the three categories for female characters, which is interesting. If you are interested in which female characters say the most words, please click here for the relevant image.
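To put the two extremes on the same scale, the word counts can be turned into a share of each play. This is a small sketch using only the two pairs of figures quoted above; share_of_words is a hypothetical helper name.

```python
def share_of_words(female_words, total_words):
    """Percentage of a play's words spoken by female characters."""
    return 100 * female_words / total_words

# (words spoken by female characters, total words), as quoted above.
plays = {
    "As You Like It": (8643, 21298),
    "Timon of Athens": (61, 17744),
}

for title, (female, total) in plays.items():
    print(f"{title}: {share_of_words(female, total):.1f}%")
```

That works out to roughly 41% of As You Like It and well under 1% of Timon of Athens.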

A number of people have asked me if Shakespeare passes the Bechdel test: I’m working on it! Stay tuned…

How many female characters are there in Shakespeare?

This was a fairly straightforward question I found myself asking recently for a footnote. Easy, I thought. I’ll go find a list of characters, count up the female ones, subtract them from the total number of characters, and I’ll have my answer. Though I could have picked up my Complete Works of Shakespeare and started counting from the dramatis personae for each play, I didn’t – because I knew that this information had been encoded before. Gender of characters is something that is often encoded in metadata (there’s a TEI category for gender), and character lists are easy to obtain.

I started with Open Source Shakespeare’s list of characters, which lists 1222 total characters in 37 plays. This list included variations of “all”, from many plays:

Screen shot 2013-02-08 at 5.33.00

So, these instances of “all” aren’t really individual characters. However, the rest of this list contained every single character in all the plays, and that was something I could work with. If there are 1222 total “characters”, minus 31 instances of “alls”, there are 1191 individual characters. From there I could either put each of the 1191 individual characters in a box labeled “male”, “female” or “unknown, ambiguous or mixed”, or I could ask another program to do it for me.

I opened WordHoard and asked it to Find Words by Speaker Gender, which would account for those three categories. WordHoard covers all of the same plays as Open Source Shakespeare.
Screen shot 2013-02-08 at 5.25.05
Intuition tells me that it will be an easier task for a computer to isolate female characters than it will be to isolate male characters, so I select “female”, and click “find”. A few minutes later, WordHoard produces the total words spoken by all female characters in each play – and I add the criteria to show “words by speaker name”. My screen looks like this (click to make bigger):
Screen shot 2013-02-08 at 5.48.45

Counting each character I reach a total of 147 female characters in all of Shakespeare, which of our 1191 characters amounts to about 12% of all the characters in Shakespeare. Winter’s Tale has the most female characters (8); Tempest and Henry IV part 1 have the fewest (2). But that depends on whether or not Ferdinand counts as a female character; if he does not, Tempest only has one female character. The Young Son in Richard III is deemed female. Macbeth has 7 female characters, but that includes the Witches:

Screen shot 2013-02-08 at 5.55.58

I don’t particularly think that the Witches count as female; I would have been happier to see them as “unknown, mixed, or ambiguous”. How do we know if a character is really female? I could give the Open Source Shakespeare list to any Shakespeare scholar and they could come up with a different count by gender. WordHoard, though, deems Rosalind, Viola, Ferdinand, and the Witches to be female characters and treats them universally throughout its system as being female. The benefit of this is that they cannot ever suddenly change categories within the structure of the program, though you may not necessarily agree with the way it has categorized them.

According to my numbers, I had 1044 characters left, covering “male” and “ambiguous”. I was curious as to what counts as “unknown, mixed, or ambiguous” according to WordHoard. (again, click to make bigger):

Screen shot 2013-02-08 at 6.07.34

Interestingly, characters who count as “gender-ambiguous”, according to WordHoard, include the actors Mustardseed, Peaseblossom, Cobweb and Moth from A Midsummer Night’s Dream. I disagreed with this distinction: if they are ambiguous, surely the Witches should be as well? A number of examples here include the aforementioned “alls” and a number of ghosts or apparitions (“Ghosts of Others Murdered By Richard III” was my personal favorite). This raises more questions: Should apparitions and spirits get their own gender category? Are they gendered? What counts as “gendered”?

Ultimately I counted and removed all the “all”s – which here totals 17, and is in disagreement with the Open Source Shakespeare count. Had I been doing this by hand, I might have counted instances of two or more characters speaking together as “alls”, but WordHoard isn’t counting this information – WordHoard is merely counting the total number of words for each character, here marked as “all”, whereas if two characters say something at the same time they may not be marked as “all”.

This left 46 total ambiguous characters, covering characters such as servants, attendants, various apparitions, and the actors from A Midsummer Night’s Dream, and accounts for about 4% of the characters in the Shakespeare corpus. The 17 Alls accounted for about 1% of the corpus, leaving 998 male characters or about 83% of the corpus.
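The running tally can be retraced with a few lines of arithmetic. This sketch only uses figures quoted in this post, and it assumes that each share is taken against the 1191 individual characters, which may not be exactly how the rounded percentages above were worked out.

```python
total_listed = 1222  # characters in the Open Source Shakespeare list
alls = 31            # entries in that list which are variations of "all"
female = 147         # female characters counted in WordHoard
ambiguous = 46       # "unknown, mixed, or ambiguous", not counting the "all"s

individuals = total_listed - alls  # individual characters
remaining = individuals - female   # "male" plus "ambiguous"
male = remaining - ambiguous       # what is left once the ambiguous ones go

def share(count, whole):
    """Percentage of the whole that count represents."""
    return 100 * count / whole

print(individuals, remaining, male)
print(f"female: {share(female, individuals):.1f}% of individual characters")
```

The three subtractions reproduce the 1191, 1044 and 998 quoted in this post.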

So, in review: how many female characters are there in Shakespeare? It’s hard to say, but one answer is 147.

How to Choose a Postgraduate Degree Abroad

In 2010 I graduated from the University of New Hampshire with degrees in English and Linguistics, and moved to Glasgow to undertake a Masters of Research at Strathclyde for 2010-2011. I’m still here working on my PhD, and every few months, I get a Facebook message from an acquaintance saying something along the lines of “I’m thinking about going to graduate school abroad! You did it, right? Can you tell me about it?” And every time I think, here we go. Time to dig out my response from the last time I answered this question! So, for posterity, here’s my answer. Keep in mind that details on applying for postgraduate education are going to vary from country to country. This took me a little bit by surprise at first, because I had expected all education systems to be like the US’s. They are not. It is easy to forget this! As a result, keep one eye on all deadlines, because they might come sneaking up on you much sooner than you think (or take much longer than you would have anticipated).

Admittedly, my experience has been limited to a humanities track, so there might be some variation between humanities and science or engineering, for instance. This is meant to be a resource for others considering graduate school abroad from the US, but I think it generally works for applying to graduate school anywhere.

First things first.
Great, you’re thinking about doing a graduate degree! What, specifically, do you want to study in your field? What are you interested in? I came to Scotland specifically to work with my supervisors because I was interested in the intersection of literature and linguistics. In undergrad, I would meet with my teachers, explaining that I wanted to write my essays about linguistics in a literature class and vice versa. I was tired of having to explain and sell my ideas every time. I decided that I wanted to do a graduate degree, but only if I could do something on the intersection of literature and linguistics. I didn’t want to keep having to explain it, I wanted to just be able to do it. (Obviously, I still have to explain it, but for different reasons now.)

My best advice about all of this is to figure out what you’re most interested in. A postgraduate degree is a little more specialized than an undergrad degree. You will want to be in a department with people who have research which is similar to the sorts of things you would want to do. Google any possible combination of what you want to be studying and ask mentors or advisers if they have any ideas. I wanted to study literary linguistics, and it turns out that Strathclyde has been heavily involved in literary linguistics and stylistics. These two fields never really caught on in the US, but they’re doing OK here in Europe, which works in my favor.

I hope this is not discouraging, but if you happen to be particularly interested in African American slave narratives along the Mississippi River – the people you will probably want to work with will probably not be in the UK. I’m not saying that you can’t or shouldn’t – but really think about what you want to do. Before you even start looking at schools, look for people you might want to work with. In the end, you’re not coming here for the school name but the expertise that school can offer you over all other institutions.

Once you have made a list of a few people you’d be interested in working with, contact them! Tell them that you’re interested in their program, in what they’re doing, and ask questions. The first ones which spring to mind for me are: What will the course be like? Will my work abroad be transferable back to my home country? Find out if these people are even considering taking on more graduate students – it would be a bit like structuring your entire undergrad career around taking one class you’ve waited to take in your last semester of your last year only to find out that the lecturer is on research leave. It’ll be heartbreaking to hear now, but way less so than finding out after you move specifically to work with someone. You surely will have better, more relevant questions than I will about your field – ask them. Find out if the place you want to go is a good fit for you.

Though I applied to three universities in the UK, and contacted one person at each institution, I actually only contacted one of my now-supervisors initially and had no idea that my now-primary supervisor (or his project) even existed. This has turned out to be quite the happy accident for me; I certainly can’t guarantee this success rate for everyone else. But I wouldn’t have known at all if I hadn’t contacted Nigel and asked questions in the first place.

So! You have a department (or two or three) you want to apply to. Now what? Ask more questions. I had to decode the British university application without much help (“Qualifications? What do you mean by that?”). What degree are you applying for? I did a Masters of Research, which is different than an MPhil and an MA. Again, if you’re not sure: email someone and ask. I don’t think I can stress this enough. I distinctly remember asking a postgraduate coordinator about GREs and getting an email response back of “I don’t know what those are, so I would advise you not to worry about them.” Not Having To Take the GREs was definitely a plus in favor of graduate school abroad, I won’t lie, but trying to figure out how my GPA fit into the UK degree ranking system was a nightmare. In the end, it’s the responsibility of the universities to figure out all these conversions – most transcripts should have an explanation of the grading system.

I still feel guilty about the GRE thing when I talk to my friends who are in graduate school in the US, for what it’s worth.

This is the big one! You are trying to figure out how, exactly, you are going to pay for this. Currently I think postgraduate student tuition in the UK is around £9000. This does not sound like a lot, but when you do the conversions (depending on day, position of the sun, stock markets, whether or not the moon is rising in Aries, etc) it ends up being about $18k. This is the good news, because that is a LOT less expensive than any US program.*

* When you factor in the cost of moving abroad, this number will go up quite a bit. (Cost of living, flights to/from the US and getting settled is an entirely different story.)

Ask someone who you’d want to work with what funding is like in your field and how it might apply to you, especially as a foreign student. The bad news is that as a non-UK/EU citizen you will not be eligible for much funding here (sorry, this is truly the bad news). We have a number of research councils here (the one that would cover me is the AHRC. Google ‘UK Research Councils’ for this information). They often offer a variety of full studentships (scholarships) at different universities. They also fund big projects, so if someone is looking for a Masters student to join them as part of an ESRC-funded project on eye tracking in reading, for instance – again, making these up off the top of my head – they might mention that, and you’ll have to ask how that will work.

The US often sponsors students going abroad (see Rhodes & Fulbright scholarships, among others); there will probably be specific ones for your country. I think the Erasmus Mundus program also sponsors scholars going abroad, but I’m not sure of the details of it. Apply for these before you leave the US, as some of them are not applicable to people who have lived outside the US in their place of research. This turned out to be a problem, as I missed the deadline for Fulbright scholarships the day I applied to Strathclyde. Because I’ve lived in the UK for over a year consecutively now, I can’t apply to the Fulbright grant system. I’m still kicking myself over this, for the record. That does not mean there aren’t other grants for you. My officemate funded 3 years of his PhD through a series of small, private grants from various organizations.

Good luck! If your university has any sort of international or study abroad office, get in touch with them, too. Take all the help you can get, seriously – you’ll be glad you did.