digital humanities

Suggested Ways of Citing Digitized Early Modern Texts

On 1 January 2015, 25,000 hand-keyed Early Modern texts entered the public domain and were publicly posted on the EEBO-TCP project’s GitHub page, with an additional 28,000 or so forthcoming into the public domain in 2020. This project is, to say the least, a massive undertaking and marks a sea change in the scholarly study of the Early Modern period. Moreover, we had nearly worked out how to cite the EEBO texts (the images of the books themselves) just before this happened: Sam Kaislaniemi has an excellent blogpost on how one should cite books in the EEBO interface (May 2014), and his main point is replicated here:

When it comes to digitized sources, many if not most of us probably instinctively cite the original source, rather than the digitized version. This makes sense – the digital version is a surrogate of the original work that we are really consulting. However, digitizing requires a lot of effort and investment, so when you view a book on EEBO but only cite the original work, you are not giving credit where it is due. After all, consider how easy it now is to access thousands of books held in distant repositories, simply by navigating to a website (although only if your institution has paid for access). This kind of facilitation of research should not be taken for granted.

In other words, when you use digitized sources, you should cite them as digitized sources. I do see lots of discussions about how to best access and distribute (linked) open data, but these discussions tend to avoid the question of citation. In my perfect dream world every digital repository would include a suggested citation in their README files and on their website, but alas we do not live in my perfect dream world.

For reasons which seem to be related to the increasingly widespread use of the CC-BY licences, which allow individuals to use, reuse, and “remix” various collections of texts, citation can be a complicated aspect of digital collections, although it doesn’t have to be. For example, this site has a creative commons license, but we have collectively agreed that blog posts etc are due citation; the MLA and APA offer guidelines on how to cite blog posts (and tweets, for that matter). If you use Zotero, for example, you can easily scrape the necessary metadata for citing this blog post in up to 7,819 styles (at the time of writing). This is great, except when you want to give credit where credit is due for digitized text collections, which are less easy to pull into Zotero or other citation managers. And without including this information somewhere in the corpus or documentation, it’s increasingly difficult to properly cite the various digitized sources we often use. As Sam says so eloquently, it is our duty as scholars to do so.

Corpus repositories such as CoRD include documentation such as compiler, collaborators, associated institutions, wordcounts, text counts, and often include a recommended citation, which I would strongly encourage as a best practice to be widely adopted.

Here is a working list of best citation practices outlined for several corpora I am using or have encountered. These have been cobbled together from normative citation practices with input from the collection creators. (Nb. collection creators: please contact me with suggestions to improve these citations).

This is a work in progress, and I will be updating it occasionally where appropriate. Citations below follow MLA style, but should be adaptable into the citation model of choice.

Non-EEBOTCP
Folger Shakespeare Library. Shakespeare’s Plays from Folger Digital Texts. Ed. Barbara Mowat, Paul Werstine, Michael Poston, and Rebecca Niles. Folger Shakespeare Library, dd mm yyyy. http://folgerdigitaltexts.org/

Mueller, M. “Wordhoard Shakespeare”. Northwestern University, 2004-2013. Available online: http://wordhoard.northwestern.edu/userman/index.html

Mueller, M. “Standardized Spelling WordHoard Early Modern Drama corpus, 1514-1662”. Northwestern University, 2010. Available online: http://wordhoard.northwestern.edu.

Mueller, M. “Shakespeare His Contemporaries: a corpus of Early Modern Drama 1550-1650”. Northwestern University, 2015. Available online: https://github.com/martinmueller39/SHC/

EEBO-TCP access points:
There are several access points to the EEBOTCP texts, and one problem is that the text IDs included don’t always correspond to the same texts in all EEBO viewers, as Paul Schaffner describes below.

@heatherfro @Rwelzenb @OxfordEEBOTCP Problem of common ids between TCP instances. TCP ID eg A12345 works everywhere except PQ site #EEBOTCP

— Paul Schaffner (@pfs) August 5, 2015

Benjamin Armintor has been exploring the implications of this on his blog, but in general if you’re using the full-text TCP files, you should be citing which TCP database you are using to access the full-text files. Where appropriate, I’ve included a sample citation as well.

1. For texts from http://quod.lib.umich.edu/e/eebogroup/, follow the formula below:

Author. Title. place: year, Early English Books Online Text Creation Partnership, Phase 1, Oxford, Oxfordshire and Ann Arbor, Michigan, 2015. quod.umich.edu/permalink date accessed: dd mm yyyy

Webster, John. The tragedy of the Dutchesse of Malfy As it was presented priuatly, at the Black-Friers; and publiquely at the Globe, by the Kings Maiesties Seruants. The perfect and exact coppy, with diuerse things printed, that the length of the play would not beare in the presentment. London: 1623, Early English Books Online Text Creation Partnership, Phase 1, Oxford, Oxfordshire and Ann Arbor, Michigan, 2015. Available online: http://name.umdl.umich.edu/A14872.0001.001, accessed 5 August 2015.

2. For the Oxford Text Creation Partnership Repository (http://ota.ox.ac.uk/tcp/) and the searchable database there

Author. Title. Early English Books Online Text Creation Partnership, phase I: Oxford, Oxfordshire and Ann Arbor, Michigan, 2015 [place: year]. Available online at http://ota.ox.ac.uk/tcp/IDNUMBER; Source available at https://github.com/TextCreationPartnership/IDNUMBER/.

Rowley, William. A Tragedy called All’s Lost By Lust. Early English Books Online Text Creation Partnership, phase I: Oxford, Oxfordshire and Ann Arbor, Michigan, 2015 [London: 1633]. Available online: http://tei.it.ox.ac.uk/tcp/Texts-HTML/free/A11/A11155.htm; Source available at: https://github.com/TextCreationPartnership/A11155/

3. The entire EEBO-TCP Github repository

Early English Books Online Text Creation Partnership, Phase I. Early English Books Online Text Creation Partnership, phase I: Oxford, Oxfordshire and Ann Arbor, Michigan, 2015. Available online: https://github.com/textcreationpartnership/Texts

If you are citing bits of the TCP texts as part of the whole corpus of EEBO-TCP, it makes the most sense to parenthetically cite the TCP ID as its identifying characteristic (following corpus linguistic models). So, for example, a citation of a passage from the Dutchesse of Malfy above would include a parenthetical containing the unique TCP ID (A14872).

(Presumably other Text Creation Partnership collections, such as ECCO and EVANS, should be cited in the same manner.)
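The formulas above are regular enough to fill in from a TCP ID. Here is a minimal sketch in Python, following the Oxford/GitHub formula in item 2; the function names and parameters are my own illustrative choices and are not part of any TCP tooling.

```python
def cite_tcp_text(author, title, place, year, tcp_id):
    """Build an MLA-style citation for one EEBO-TCP Phase I text.

    Follows the Oxford/GitHub formula given in the post; the function
    name and its parameters are hypothetical, chosen for illustration.
    """
    return (
        f"{author}. {title}. Early English Books Online Text Creation "
        f"Partnership, phase I: Oxford, Oxfordshire and Ann Arbor, Michigan, "
        f"2015 [{place}: {year}]. "
        f"Available online at http://ota.ox.ac.uk/tcp/{tcp_id}; "
        f"Source available at "
        f"https://github.com/TextCreationPartnership/{tcp_id}/."
    )


def parenthetical(tcp_id):
    """Parenthetical reference identifying a passage by its TCP ID."""
    return f"({tcp_id})"


print(cite_tcp_text("Rowley, William", "A Tragedy called All’s Lost By Lust",
                    "London", 1633, "A11155"))
print(parenthetical("A14872"))
```

Keeping the TCP ID as the only variable part of the URLs means the same ID can be reused for the parenthetical reference in running text.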

Of time, of numbers and due course of things

[This is the text, more or less, of a paper I presented to the audience of the Scottish Digital Humanities Network’s “Getting Started In Digital Humanities” meeting in Edinburgh on 9 June 2014. You can view my slides here (pdf)]

Computers help me ask questions in ways that are much more difficult to achieve as a reader. This may sound obvious: reading a full corpus of plays, or really any text, takes time, and by the time I have closely read all of them, I will either not have noticed the minutiae of all the texts or I will not have remembered some of them. Here, for example, is J. O. Halliwell-Phillipps’s The Works of William Shakespeare; the Text Formed from a New Collation of the Early Editions: to which are Added, All the Original Novels and Tales, on which the Plays are Founded; Copious Archæological Annotations on Each Play; an Essay on the Formation of the Text: and a Life of the Poet, which takes up quite a bit of space on a shelf:

This isn’t a criticism, nor is it an excuse for not reading; it just means that humans are not designed to remember the minutiae of collections of words. We remember the thematic aboutness of them, but perhaps not always the smaller details. Having closely read all these plays (though not in this particular edition: I have read the Arden editions, which were much more difficult to stick on one imposing looking shelf), I remember what they were about, but perhaps not at the level of minutiae I might want to have. So today I’m going to illustrate how I might go from sixteen volumes of Shakespeare to a highly specific research question, and to do that, I’m going to start with a calculator.

A calculator is admittedly a rather old and rather simple piece of technology; it’s one that is not particularly impressive now that we have cluster servers that can crunch thousands of data points for us, but it remains useful nonetheless. Without using technology which is more advanced than our humble calculator, I’m going to show how the simple task of counting and a little bit of basic arithmetic can raise some really interesting questions. Straightforward counting is starting to get a bit of a bad rap in digital humanities discourse (cf Jockers and Mimno 2013, 3 and Goldstone and Underwood 2014, 3-4): yes, we can count, but that is simple. We can also complicate this process with calculation and get even more exciting results! This is, of course, true, and provides many new insights to texts which were otherwise unobtainable. Eventually today I will get to more advanced calculation, but for now, let’s stay simple and count some things.

Except that counting is not actually all that simple: decisions have to be made about what to count and how to decide what to count, and then how you are going to do that. I happen to be interested in gender, which I think is one of the more quantifiable social identity variables in textual objects, though it certainly isn’t the only one. Let’s say I wanted to find three historically relevant gendered noun binaries for Shakespeare’s corpus. Looking at the historical thesaurus of the OED for historical contexts, I can decide on lord/lady, man/woman, and knave/wench, as they show a range of formalities (higher – neutral – lower) and these terms are arguably semantically equivalent. The first question I would have is “how often do these terms actually appear in 38 Shakespeare plays?”

Turns out the answer is “not much”: they are right up there in the little red sliver there. My immediate next question would be “what makes up the rest of this chart?” The obvious answer is, of course, that it covers everything that is not our node words in Shakespeare. However, there are two main categories of words contained therein: the frequency of function words (those tiny boring words that make up much of language) and the frequency of content words (words that make up what each play is about). We have answers, but instantly I have another question: what does the breakdown of that little red sliver look like?

This next chart shows the frequency of both the singular and the plural form of each node word, in total, for all 38 Shakespeare plays. There are two instantly noticeable things in this chart: first, the male terms are far more frequent than the female terms, and second, wench is not used very much (though we may think of wench as being a rather historical term).
individual node word plurals in Shakespeare (full)
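Counting the singular and plural form of each node word together can be sketched in a few lines of Python. This is only an illustration: it assumes modernised spelling and plain-text input, and the hand-written plural mapping and function name are mine, not the tooling used for the charts.

```python
import re
from collections import Counter

# Node words from the post. The singular -> plural mapping is written out
# by hand because "man"/"woman" pluralise irregularly.
NODE_WORDS = {
    "lord": "lords", "lady": "ladies",
    "man": "men", "woman": "women",
    "knave": "knaves", "wench": "wenches",
}


def count_node_words(text):
    """Count singular + plural occurrences of each node word in `text`.

    Tokens are runs of lowercase letters, so a possessive such as
    "lord's" is counted under "lord". Spelling variants are NOT handled.
    """
    tokens = Counter(re.findall(r"[a-z]+", text.lower()))
    return {sg: tokens[sg] + tokens[pl] for sg, pl in NODE_WORDS.items()}


sample = "My lord, the lords and ladies attend; a man, two men, no wench."
print(count_node_words(sample))
```

The decisions the post describes (what counts as a word, which forms are grouped together) show up here as the tokenising pattern and the contents of `NODE_WORDS`; changing either changes the counts.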

There are more male characters than female characters in Shakespeare – by quite a large margin, regardless of how you choose to divide up gender – but surely they are talking about female characters (as they are the driving force of these plays: either a male character wants to marry or kill a female character). This is not to say that male and female characters won’t talk to each other; there just happen to be a lot more male characters. Biber and Burges (2000) have noted that in 19th-century plays, male to male talk is more frequent than male to female talk (and female to female talk). I am not going to claim this is true here, but it seems to be a suggestive model, as male characters dominate speech quantities in the plays. There are lots of questions we can keep asking from this point, and I will return to some of them later, but I want to ask a bigger question: how does Shakespeare’s use of these binaries compare to a larger corpus of his contemporaries, 1512-1662?

It is worth noting that this corpus contains 332 plays, even though it is called the 400 play corpus; some things, I suppose, sound better when rounded up. These terms are still countable, though, and we see a rather different graph for this corpus:

The 400 play corpus includes Shakespeare, so we are now comparing Shakespeare to himself and 54 other dramatists.[1] The male nouns are noticeably more frequent than the female nouns, which suggests that maybe the proportion of male to female characters we saw in Shakespeare holds here too. Interestingly, lord is less frequent than man, which is the opposite of what we saw previously. The y axis is different for this graph, as this is a much larger corpus than Shakespeare’s, but it seems like the female nouns are consistent.

One glaring problem with this comparison is that I am looking at two different-sized objects. A corpus of 332 plays is going to be, generally speaking, larger than a corpus of 38 plays.[2] McEnery and Wilson note that comparisons of corpora often require adjustment: “it is necessary in those cases to normalize the data to some proportion […] Proportional statistics are a better approach to presenting frequencies” (2003, 83). When creating proportions, Adam Kilgarriff notes “the thousands or millions cancel out when we do the division, it makes no difference whether we use thousands or millions” (2009, 1), which follows McEnery and Wilson’s assertion that “it is not crucial which option is selected” (2003, 84). For my proportions, I choose parts per million.
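The normalisation described above is a single line of arithmetic. Here is a minimal sketch in Python; the counts and corpus sizes are invented purely for illustration and are not the figures behind my charts.

```python
import math


def normalise(count, corpus_size, base=1_000_000):
    """Scale a raw frequency to occurrences per `base` words."""
    return count * base / corpus_size


# Invented counts and corpus sizes, for illustration only.
small = {"count": 300, "size": 800_000}
large = {"count": 1_200, "size": 6_000_000}

small_ppm = normalise(small["count"], small["size"])
large_ppm = normalise(large["count"], large["size"])
print(f"small corpus: {small_ppm:.1f} per million")
print(f"large corpus: {large_ppm:.1f} per million")

# The base cancels out when two normalised figures are divided, so the
# comparison is the same per thousand as it is per million.
ratio_million = small_ppm / large_ppm
ratio_thousand = (normalise(small["count"], small["size"], base=1_000)
                  / normalise(large["count"], large["size"], base=1_000))
print(math.isclose(ratio_million, ratio_thousand))
```

In this made-up case the larger corpus has four times the raw count, but once both are expressed per million words the smaller corpus turns out to use the word more often.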

Shakespeare is rather massively overusing lord in his plays compared to his contemporaries, but he is also underusing the female nouns compared to contemporaries. Now we have a few research questions to address, all of which are very interesting:

Why does Shakespeare use lord so much more than the rest of Early Modern dramatists?
Why do the rest of Early Modern dramatists use wench so much more than Shakespeare?
Why is lady more frequent than woman overall in both corpora?

I’m not going to be able to answer all of these today, but let’s talk a little bit about lord. This is a pretty noticeable difference for a term which seems pretty typical of Early Modern drama, which is full of noblemen. If I had to guess, I would say that lord might be more frequent in history plays compared to the tragedies or the comedies. I say this because as a reader I know there are most definitely noblemen, and probably defined as such, in these plays.

So what if we remove the histories from Shakespeare’s corpus, count everything up again, and make a new graph comparing Shakespeare minus the histories to all of Shakespeare? By removing the history plays it is possible to see how Shakespeare’s history plays as a unit compare to his comedy & tragedy plays as a unit. [3]

Female nouns fare better in Shakespeare Without Histories than in Shakespeare Overall, possibly because the female characters are more directly involved in the action of tragedies and comedies than they are in histories (though we know the Henry 4 plays are an exception to that), so that is perhaps not all that interesting. What is interesting, though, is the difference between lord in Shakespeare Without Histories and Shakespeare With Histories. What is going on in the histories? How do Shakespeare’s histories compare to all histories in the 400 play corpus?

Now we have even more questions, especially “what on earth is going on with lord in Shakespeare” and “why is wench more frequent in all of the histories?” I’m going to leave the wench question for now, though: not because it’s uninteresting but because it is less noticeable compared to what I’ve been motioning at with lord, which is clearly showing some kind of generic variation.

Remember, we haven’t done anything more complex than counting and a little bit of arithmetic yet, and we have already created a number of questions to address. Now we can create an admittedly low-tech visualization of where in the history plays these terms show up: each black line is one instance, and you read these from left to right (‘start’ to ‘finish’):
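A plot of this kind only needs the relative position of each hit in the text. Here is a minimal sketch in Python that builds a one-line text version; it is an illustration under my own assumptions (simple lowercase tokenising, hypothetical function names), not the concordancer's implementation.

```python
import re


def dispersion(text, word):
    """Return each hit of `word` as a relative position (0 = start, 1 = end)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if len(tokens) < 2:
        return []
    last = len(tokens) - 1
    return [i / last for i, tok in enumerate(tokens) if tok == word]


def plot_line(positions, width=60):
    """Draw a one-line text plot: '|' marks a hit, '.' marks no hit."""
    cells = ["."] * width
    for p in positions:
        # Clamp so that a hit on the very last token stays inside the line.
        cells[min(int(p * width), width - 1)] = "|"
    return "".join(cells)


sample = "lord and king and queen and lord and lord"
print(plot_line(dispersion(sample, "lord"), width=20))
```

Long empty stretches and tight clusters in a line like this are what prompt the questions about scenes without lord.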

And now I instantly have more questions (why are there entire sections of plays without lord? Why do they cluster only in what clearly are certain scenes? etc.) but what looks most interesting to me is King John, which has the fewest examples. At first glance, King John and Richard 3 appear to be outliers (that is, very noticeably different from the others: 42 instances vs 236 instances). Having read King John, I know that there are definitely nobles in the play: King John, King Philip, the Earls of Salisbury, Pembroke, and Essex, and the excellently named Lord Bigot. And, again, having read the play I know that it is about the relationships between fathers, mothers and brothers – the play centers around Philip the Bastard’s claim to the throne – and also is about the political relationship (or lack thereof) between France and England. From a reader’s perspective, none of that is particularly thematically unique to this play compared to the rest of Shakespeare’s history plays, though.

I can now test my reader’s perspective using a statistical measure of keyness called log likelihood, which asks which words are more or less likely to appear in an analysis text compared to a larger corpus. This process will provide us with words which are positively and negatively ranked overall with a ranking of statistical significance (more stars means more statistically significant). Now I am asking the computer to compare King John to all of Shakespeare’s histories. I have excluded names from this analysis, as a reader definitely knows that hubert, arthur, robert, philip, faulconbridge, and geoffrey are in this play without the help of the computer.

However, you can see that the absence of lord in King John is highly statistically significant (marked with four *s, compared to others with fewer *s). Now, we saw this already with the line plots, though it is nice to know that this is in fact one of the most significant differences between King John and the rest of the histories.
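For readers who want to see what sits behind a keyness score, here is a minimal sketch of the log-likelihood calculation in Python, in the two-corpus form used by the UCREL log-likelihood wizard. The star thresholds are the standard chi-square critical values for one degree of freedom; mapping them to one to four stars is my assumption about how the software displays significance, and the counts are invented.

```python
import math


def log_likelihood(a, b, c, d):
    """G2 for a word seen `a` times in a text of `c` words and `b` times
    in a reference corpus of `d` words (the two are treated as separate)."""
    e1 = c * (a + b) / (c + d)  # expected count in the analysis text
    e2 = d * (a + b) / (c + d)  # expected count in the reference corpus
    g2 = 0.0
    if a > 0:
        g2 += a * math.log(a / e1)
    if b > 0:
        g2 += b * math.log(b / e2)
    return 2 * g2


def stars(g2):
    """One star per threshold passed: p < 0.05, 0.01, 0.001, 0.0001."""
    thresholds = [3.84, 6.63, 10.83, 15.13]
    return "*" * sum(g2 >= t for t in thresholds)


# Invented counts: 5 hits in 10,000 words vs 50 hits in 10,000 words.
score = log_likelihood(5, 50, 10_000, 10_000)
direction = "underused" if 5 / 10_000 < 50 / 10_000 else "overused"
print(f"{score:.2f} {stars(score)} ({direction} in the analysis text)")
```

The score itself says nothing about direction, so the relative frequencies are compared separately to tell an absence from an excess.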

All of this is nice, and very interesting, as it is something we might not have ever noticed as a reader: because it is a history play with lords in it, it is rather safe to assume that it will contain the word lord more often than it actually does. Revisiting E.A.J. Honigmann’s notes on his Arden edition of King John, there have been contentions about the use of king in the First Folio (2007, xxxiii-xliii), most notably around the confusions surrounding King Lewis, King Philip and King John all labeled as ‘king’ in the Folio (see xxxiv-xxxvii for evidence). But none of this is answering our question about lord’s absence. So what is going on with lord? We can identify patterns with a concordancer, and we get a number of my lords:
This is looking like a fairly frequent construction: we might want to see what other words are likely to appear near lord in Shakespeare overall: is my one of them? As readers, we might not notice how often these two words appear together. I should stress that we still have not answered our initial question about lord in King John, though we are trying to.

We can measure how likely one lemma (word) is to appear next to another lemma (word) in a corpus using the Dice coefficient, which is the harmonic mean of two conditional probabilities: P(w2|w1) and P(w1|w2). That is, it combines the probability that the 2nd word in the bigram appears given the 1st word with the probability that the 1st word appears given the 2nd, and this relationship can be computed on a scale from 0 to 1: 0 means there is no relationship; 1 means they always appear together. With this information, you can then show which words are uniquely likely to appear near lord in Shakespeare and contrast that to the kinds of words which are uniquely likely to appear next to lady – and again for the other binaries as well. Interestingly, my only shows up with lord!
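As a rough illustration, the Dice coefficient in its usual form is 2·f(w1,w2) / (f(w1) + f(w2)), which works out to the harmonic mean of the two conditional probabilities. Here is a minimal sketch in Python; it assumes adjacent word forms rather than lemmas, and the function names and toy sentence are mine.

```python
import math
from collections import Counter


def dice(f_w1, f_w2, f_pair):
    """Dice coefficient from the two word counts and the bigram count."""
    if f_w1 + f_w2 == 0:
        return 0.0
    return 2 * f_pair / (f_w1 + f_w2)


def harmonic_mean(p1, p2):
    """Harmonic mean of two probabilities (0 if both are 0)."""
    return 0.0 if p1 + p2 == 0 else 2 * p1 * p2 / (p1 + p2)


def dice_from_tokens(tokens, w1, w2):
    """Dice score for `w1` immediately followed by `w2` in a token list."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return dice(unigrams[w1], unigrams[w2], bigrams[(w1, w2)])


tokens = "my lord and my lady and my lord".split()
print(dice_from_tokens(tokens, "my", "lord"))
print(dice_from_tokens(tokens, "my", "lady"))

# Same value as the harmonic mean of P(w2|w1) and P(w1|w2).
print(math.isclose(dice(20, 5, 4), harmonic_mean(4 / 20, 4 / 5)))
```

A score near 1 means the two words almost never appear apart; a frequent word such as my scores low against most partners simply because it turns up everywhere.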

This is good, because it shows that lord does indeed behave very differently from our other node words in Shakespeare’s corpus, which suggests that there’s something highly specific going on with lord. However, I’m still not sure what is happening with lord in King John. Why are there so few instances of it?

Presumably if there is an absence of one word or concept, there will be more of a presence of a second word or concept. One such example might be king, but the log-likelihood analysis shows that it is ‘this’ which is comparatively more frequent in King John than in the rest of Shakespeare’s histories (note the second entry on this list).

Now we have two questions: why is lord so absent, and why is ‘this’ so present? From here I might go back to our concordance plot visualizations, but the question is addressable at the level of grammar: ‘this’ is a demonstrative pronoun, which Jonathan Hope defines in Shakespeare’s Grammar as “distinguish[ing] number (this/these) and distance (this/these = close; that/those = distant). Distance may be spatial or temporal (for example ‘these days’ and ‘those days’)” (Hope 2003, 24). Now we have a much more nuanced question to address, which a reader would never have noticed: Does King John use abstract, demonstrative pronouns to make up for a lack of the concrete content word lord in the play? I admit I have no idea: does anybody else know?

WORKS CITED
Halliwell-Phillipps, J.O. (1970 [1854]). The works of William Shakespeare, the text formed from a new collation of the early editions: to which are added all the original novels and tales on which the plays are founded; copious archæological annotations on each play; an essay on the formation of the text; and a life of the poet. New York: AMS Press.

“Early English Books Online: Text Creation Partnership”. Available online: http://quod.lib.umich.edu/e/eebogroup/ and http://www.proquest.com/products-services/eebo.html.

“Early English Books Online: Text Creation Partnership”. Text Creation Partnership. Available online: http://www.textcreationpartnership.org/

Anthony, L. (2012). AntConc (3.3.5m) [Computer Software]. Tokyo, Japan: Waseda University. Available from http://www.antlab.sci.waseda.ac.jp/

Biber, Douglas, and Jená Burges. (2000) “Historical Change in the Language Use of Women and Men: Gender Differences in Dramatic Dialogue”. Journal of English Linguistics 28 (1): 21-37.

DEEP: Database of Early English Playbooks. Ed. Alan B. Farmer and Zachary Lesser. Created 2007. Accessed 4 June 2014. Available online: http://deep.sas.upenn.edu.

Froehlich, Heather. (2013) “How many female characters are there in Shakespeare?” Heather Froehlich. 8 February 2013. https://hfroehlich.wordpress.com/2013/02/08/how-many-female-characters-are-there-in-shakespeare/

Froehlich, Heather. (2013). “How much do female characters in Shakespeare actually say?” Heather Froehlich. 19 February 2013. https://hfroehlich.wordpress.com/2013/02/19/how-much-do-female-characters-in-shakespeare-actually-say/

Froehlich, Heather. (2013). “The 400 play corpus (1512-1662)”. Available online: http://db.tt/ZpHCIePB [.csv file]

Goldstone, Andrew, and Ted Underwood. “The Quiet Transformations of Literary Studies: What Thirteen Thousand Scholars Could Tell Us.” New Literary History, forthcoming.

Hope, Jonathan. (2003). Shakespeare’s Grammar. The Arden Shakespeare. London: Thomson Learning.

Jockers, M.L. and Mimno, D. (2013). Significant themes in 19th-century literature. Poetics. http://dx.doi.org/10.1016/j.poetic.2013.08.005

Kay, Christian, Jane Roberts, Michael Samuels, and Irené Wotherspoon (eds.). (2014) The Historical Thesaurus of English. Glasgow: University of Glasgow. http://historicalthesaurus.arts.gla.ac.uk/.

Kilgarriff, Adam. (2009). “Simple Maths for Keywords”. Proceedings of the Corpus Linguistics Conference 2009, University of Liverpool. Ed. Michaela Mahlberg, Victorina González Díaz, and Catherine Smith. Article 171. Available online: http://ucrel.lancs.ac.uk/publications/CL2009/#papers

McEnery, Tony and Wilson, Andrew. (2003). Corpus Linguistics: An Introduction. Edinburgh: Edinburgh University Press, 2nd edition. 81-83.

Mueller, Martin. WordHoard. [Computer Software]. Evanston, Illinois: Northwestern University. http://wordhoard.northwestern.edu/

Shakespeare, William. (2007). King John. Ed. E. A. J. Honigmann. London: Arden Shakespeare / Cengage Learning.

[1] Please see http://db.tt/ZpHCIePB [.csv file] for the details of contents in the corpus.

[2] This is not always necessarily true: counting texts does not say anything about how big the corpus is! A lot of very short texts may actually be the same size as a very small corpus containing a few very long texts.

[3] The generic decisions described in this essay have been lifted from DEEP and applied by Martin Mueller at Northwestern University. I am very slowly updating these generic distinctions from those in DEEP, which uses Annals of English Drama, 975-1700, 3rd edition, ed. Alfred Harbage, Samuel Schoenbaum, and Sylvia Stoler Wagonheim (London: Routledge, 1989) as its source, to those in Martin Wiggins’ more recent British Drama: A Catalogue, volumes 1-3 (Oxford: Oxford UP, 2013a, 2013b, 2013c), for further comparison.

An introductory bibliography to corpus linguistics

This is a short bibliography meant to get you started in corpus linguistics – it is by no means comprehensive, but should serve to be a good introductory overview of the field.

>>This page is updated semi-regularly; if you find any dead links please contact me at hgf5 at psu dot edu. Thanks!<<

1.0 General resources
Froehlich, H. “Intro to Text Analysis”. Penn State University Library Guides. (30 May 2018), http://guides.libraries.psu.edu/textanalysis
Froehlich, H. “Text mining: Web-based resources”. Penn State University Library Guides. (10 October 2018) https://guides.libraries.psu.edu/textmining/web

1.1 Books (and two articles)
Baker, Paul, Andrew Hardie and Tony McEnery. (2006). A Glossary of Corpus Linguistics. Edinburgh, Edinburgh UP.
Atkins, Sue, Jeremy Clear and Nicholas Ostler. (1991) “Corpus Design Criteria”. http://www.natcorp.ox.ac.uk/archive/vault/tgaw02.pdf
Biber, Douglas (1993). “Representativeness in Corpus Design”. Literary and Linguistic Computing, 8 (4): 243-257. http://llc.oxfordjournals.org/content/8/4/243.abstract
Biber, Douglas, Susan Conrad and Randi Reppen (1998). Corpus Linguistics: Investigating Language Structure and Use. Cambridge: Cambridge UP.
Granger, Sylviane, Joseph Hung and Stephanie Petch-Tyson. (2002). Computer Learner Corpora, Second Language Acquisition and Foreign Language Teaching. Amsterdam: John Benjamins.
Hoey, Michael, Michael Stubbs, Michaela Mahlberg, and Wolfgang Teubert. (2011). Text, Discourse and Corpora. London: Continuum.
Hunston, S. (2002). Corpora in applied linguistics. Cambridge: Cambridge University Press.
Mahlberg, Michaela. (2013). Corpus Stylistics and Dickens’ Fiction. London: Routledge.
McEnery, T. and Hardie, A. (2012). Corpus Linguistics: Method, theory and practice. Cambridge: Cambridge UP.
O’Keeffe, Anne and Michael McCarthy, eds. (2010). The Routledge Handbook of Corpus Linguistics. London: Routledge.
Sinclair, John and Ronald Carter. (2004). Trust the Text. London: Routledge.
Sinclair, John. (1991) Corpus Concordance Collocation. Oxford: Oxford UP.
Wynne, M (ed.) (2005). Developing Linguistic Corpora: a Guide to Good Practice. Oxford: Oxbow Books. http://www.ahds.ac.uk/creating/guides/linguistic-corpora/

1.2 Scholarly Journals
Corpora http://www.euppublishing.com/journal/cor
ICAME http://icame.uib.no/journal.html
IJCL https://benjamins.com/#catalog/journals/ijcl
Literary and Linguistic Computing http://llc.oxfordjournals.org/

1.3 Externally compiled bibliographies and resources
David Lee’s Bookmarks for corpus-based linguistics http://www.uow.edu.au/~dlee/CBLLinks.htm
Costas Gabrielatos has been compiling a bibliography of Critical Discourse Analysis using corpora, 1982-present https://www.edgehill.ac.uk/englishhistorycreativewriting/staff/dr-costas-gabrielatos/?tab=docs-bibliography
Members of the corpus linguistics working group UCREL at Lancaster University have compiled some of their many publications here http://ucrel.lancs.ac.uk/pubs.html; see also their LINKS page http://ucrel.lancs.ac.uk/links.html
Michaela Mahlberg is one of the leading figures in corpus stylistics (especially of interest if you want to work on literary texts) http://www.michaelamahlberg.com/publications.shtml; in 2006 she helped compile a corpus stylistics bibliography (pdf) with Martin Wynne.
Lots of work is done on Second Language Acquisition using learner corpora. Here’s a compendium of learner corpora http://www.uclouvain.be/en-cecl-lcworld.html

Corpora-List (mailing list) http://torvald.aksis.uib.no/corpora/
CorpusMOOC https://www.futurelearn.com/courses/corpus-linguistics, run out of Lancaster University, is an amazingly thorough resource. Even if you can’t do everything in their course, there’s lots of step-by-step how-tos, videos, notes, readings, and help available for everyone from experts to absolute beginners.

1.4 Compiled Corpora
Xiao, Z. (2009). Well-Known and Influential Corpora, A Survey http://www.lancaster.ac.uk/staff/xiaoz/papers/corpus%20survey.htm, based on Xiao (2009), “Theory-driven corpus research: using corpora to inform aspect theory”. In A. Lüdeling & M. Kytö (eds) Corpus Linguistics: An International Handbook [Volume 2]. Berlin: Mouton de Gruyter. 987-1007.
Various Historical Corpora http://www.helsinki.fi/varieng/CoRD/corpora/index.html
Oxford Text Archive http://ota.ahds.ac.uk/
Linguistic Data Consortium http://catalog.ldc.upenn.edu/
CQPWeb, a front end to various corpora https://cqpweb.lancs.ac.uk/
BYU Corpora http://corpus.byu.edu/
NLTK Corpora http://nltk.googlecode.com/svn/trunk/nltk_data/index.xml

1.5 DIY Corpora (some work required)
Project Gutenberg http://gutenberg.org
LexisNexis Newspapers https://www.lexisnexis.com/uk/nexis/
LexisNexis Law https://www.lexisnexis.com/uk/legal
BBC Script Library http://www.bbc.co.uk/writersroom/scripts

1.6 Concordance and Other software
No one piece of software is better than another, though some are better at certain things than others. Much here comes down to personal taste, much like Firefox vs Chrome or Android vs iPhone. While AntConc, which is what I use, is great, it is far from the only software available. (Note that these may require a licensing fee.)
AntConc http://www.laurenceanthony.net/software/antconc/
Wordsmith http://lexically.net/
Monoconc http://www.monoconc.com/
CasualConc https://sites.google.com/site/casualconc/
Wmatrix http://ucrel.lancs.ac.uk/wmatrix/
SketchEngine http://www.sketchengine.co.uk/
R http://www.rstudio.com/ide/docs/using/source (for the advanced user)
200+ software resources for corpus analysis https://corpus-analysis.com/
Anthony, Laurence. (2013). “A critical look at software tools in corpus linguistics.” Linguistic Research 30(2), 141-161.

1.7 Annotation
You may want to annotate your corpus for certain features, such as author, location, specific discourse markers, parts of speech, transcription, etc. Some of the compiled corpora might come with included annotation.
Text Encoding Initiative http://www.tei-c.org/index.xml
A Gentle Introduction to XML http://www.tei-c.org/release/doc/tei-p5-doc/en/html/SG.html
Hardie, A. (2014) “Modest XML for Corpora: Not a standard, but a suggestion”. ICAME Journal 38: 73-103.
UAM Corpus Tool does both concordance work and annotation http://www.wagsoft.com/CorpusTool/

1.7.1 Linguistic Annotation
Natural Language Toolkit http://nltk.org & the NLTK book http://www.nltk.org/book/ch01.html
Stanford NLP Parser http://nlp.stanford.edu/software/corenlp.shtml (includes Named Entity Recognition, semantic parser, and grammatical part-of-speech tagging)
CLAWS, a part of speech tagger http://ucrel.lancs.ac.uk/claws/
USAS, a semantic tagger http://ucrel.lancs.ac.uk/usas/

1.8 Statistics Help
1.8.1 Not Advanced
Wikipedia http://wikipedia.com (great for advanced concepts written for the non-mathy type)
Log Likelihood, explained http://ucrel.lancs.ac.uk/llwizard.html
AntConc Videos https://www.youtube.com/user/AntlabJPN
WordSmith Getting Started Files http://www.lexically.net/downloads/version6/HTML/index.html?getting_started.htm
Oakes, M. (1998): Statistics for Corpus Linguistics. Edinburgh: Edinburgh University Press.
Baroni, M. and S. Evert. (2009): “Statistical methods for corpus exploitation”, in A. Lüdeling and M. Kytö (eds.), Corpus Linguistics: An International Handbook Vol. 2. Berlin: de Gruyter. 777-803.

1.8.2 Advanced
Stefan Th. Gries’ publications: http://www.linguistics.ucsb.edu/faculty/stgries/research/overview-research.html
Adam Kilgarriff’s publications: Pre-2009 http://www.kilgarriff.co.uk/publications.htm Post-2009 https://www.sketchengine.co.uk/documentation/wiki/AK/Papers
Baayen, R.H. (2008). Analyzing Linguistic Data: A Practical Introduction to Statistics Using R. Cambridge: Cambridge University Press.

heather froehlich // last updated 13 June 2019

On Teaching Literature to Computer Science Students

[Previously: On Teaching Coding to English Studies Students]

Recently I wrote about English studies students learning to code in an interdisciplinary computer science and English class. In that post I mentioned that this class (running for the third consecutive year) comes with a variety of challenges – some strictly institutional, some cross-departmental, and some pedagogical. I’ve been collecting a number of these and will be blogging about them in the future. In that post I also mentioned that there are two very oppositional learning curves at play: one is getting the English studies students to think about computers in a critical way and the other is getting the computer science students to read literature.

We have just hit the very exciting point in the course where the computer science students are learning to read and the English studies students are truly hitting their stride, which is a dramatic turnaround from the start of the course. Last week we asked each group to give a very short, informal presentation about their assigned Shakespeare plays in relation to the rest of the Shakespeare corpus. It was no surprise that in every group the English studies students gave an overview of the plot of their play and a few key themes whereas the computer science students reported what they had deemed to be a finding. English studies students have been studying how to analyze texts and the computer science students haven’t done that in the same way.

Here in Scotland, students begin to track either towards arts & humanities or science long before they hit university – they start to track in high school, and take school leaving exams in a number of subjects (“Highers”), from a rather long list which you can read here. Once you choose your track, it’s rare (though not unheard of) to have much overlap between A&H and science in one’s Highers qualifications. Most degree programs will have preferred subjects for applicants which guide students’ decisions about which Highers to take; Strathclyde’s entry requirements for a student wishing to be a Computer Science undergraduate can be found here (pdf). Unless you’re going into a joint honours in Computer Science and Law, English literature or language is not a required Higher for prospective students in computer science at my university. It goes the other way, too – a Higher in Maths is not a requirement for a prospective English studies student unless they plan on taking a joint honours with Mathematics (again, pdf). Some students may take Highers strictly out of interest (or uncertainty about which route to take), but like the SAT II or AP exams, this is not necessarily something you’d do for fun – these are high-stakes exams.

If faced with that choice I would definitely have taken the Arts & Humanities track despite liking science and being (told I was) very bad at math. I suspect that a lot of computer science students may have liked history or media studies but were bad at writing (or told they were…) and that was enough to turn them away from taking an Arts & Humanities track. It might be that students who sign up for a degree in Computer Science are just really passionate about computers, or that they like practical problem solving, or they’ve been told that computer science is a lucrative field. I have no idea – I’m not them*. On the surface, it can look like they have a lot of missing cultural information for not knowing things about literature – but they also know a whole lot more than we do about very different things.

That’s not to say these students don’t read of their own accord or aren’t interested in books. However, by the time they show up to my class, they have rather successfully avoided close reading for a few years, whereas English studies students have been practicing this skill for a while now. Close reading is something English studies students are very comfortable with, and they have now learned enough about the way that computers “think” (or don’t) to reach a common ground with the CS students.

However, in the same way the computer science students found the first half of the class easy and the English studies students found it tremendously daunting, the computer science students suddenly feel like they’ve been thrown in the deep end. The English studies students have to teach them how to analyze a text.

In our in-class presentations, each group had to discuss a discovery they’ve made about their play and explain why it was interesting. Without fail, the computer science students had lots to say about various discoveries they had made about kinds of words that were more or less frequent in their play compared to all of Shakespeare’s plays. And yet when they were pressed about why they thought it was happening, they weren’t really sure. The English studies students could postulate theories about why there were more or less of a specific kind of feature in their text, because they know how to approach this problem.

Over the next few weeks we’re letting the students self-guide their own projects and produce explanations for their discoveries, which means the computer science students are on a crash course on close-reading from their in-group local expert. They’re learning that data isn’t everything when it comes to understanding what makes their play in some way different (or similar) from other plays. In fact, they’re learning the limitations of data and ways that close-reading is not just supplementary but essential to a model of distance reading with computational methods. And if the student presentations I saw last week are any indication, I suspect I have some very exciting work coming my way in a few weeks’ time.

* I did my dual-major undergrad degree in English lit and Linguistics; I didn’t get involved in computers until my masters.

(with thanks to Kat Gupta for comments on this post)

CEECing new directions with Digital Humanities

[editor’s note: this post is cross-posted to Kielen kannoilla, the VARIENG blog. I am extremely grateful to Anni Sairio and Tanja Säily for all their organization behind making this visit, and thus this blog post, possible.]

This past week I was talking about the relationships between corpus linguistics and digital humanities as a visiting scholar at VARIENG, a very well known historical sociolinguistics and corpus linguistics working group. Corpus linguistics is a very text-oriented approach to language data, with much interest in curation, collection, annotation, and analysis – all things of much concern to digital humanists. If corpus linguistics is primarily concerned with text, digital humanities can be argued to be primarily concerned with images: how to visualize textual information in a way that helps the user understand and interact with large data sets.

VARIENG has been compiling the Corpus of Early English Correspondence (CEEC) for a number of years, and one of their primary concerns is ‘what else can we do with all this metadata we’ve created’? Together, we discussed three main themes of corpus linguistics and digital humanities: access, ability, and the role of supplementary vs created knowledge. Digital humanities runs on a form of knowledge exchange, but this raises questions of who knows what, how, and how to access them.

Approaching a computer scientist with a bunch of historical letters may raise some “so what” eyebrows, but likewise, a computer scientist approaching a linguist with a software package to pull out lexical relationships might raise similar “so what” eyebrows: why should we care about your work and what can we do with it? Because both groups walk in with very different kinds of expertise, one of the very big challenges of digital work is to be able to reach a common language between the disciplines: both have very established, very theoretically-embedded systems of working.

All of this is to say that the takeaway factor for corpus linguistics research, and indeed any kind of digitally-inflected project, is very high. As Matti Rissanen says, and rightly so, “research begins when counting ends”. The so-what factor of counting requires heavy contextualization, human brainpower, time, funding, systems and communication – and none of these features are unique to corpus linguistics. Digitally-inflected scholarship requires complementary expertise in techniques, working and interacting with data; we need humanistic questions which can be pushed further with digital methods, not digital methods which (we hope) will push humanistic questions further. While it is nice to show what we already understand by condensing lots of information into a pretty picture, there are deeper questions to ask. If digital humanities currently serves mostly to supplement knowledge, rather than create new knowledge, we need to start thinking forward to ask “What else can we do with this data we’ve been curating?”

One thing we can do with this data is view it in new tools and learn to ask different questions, as we did with Docuscope, a rhetorical analysis software developed at Carnegie Mellon University. Digital tools and techniques are question-making machines, not answer-providing packages. Here we may ask ourselves why F_1720-39.txt has a low count of Personal Pronouns in Docuscope, and the answer may be that what we consider to be personal pronouns (grammatically) are categorized otherwise by Docuscope and that other constructions are used instead. This isn’t magic and this can’t be quiet handwaving: we should be pushing ourselves towards asking questions which were previously impossible at the scale of sentence-level or lexical-level of detail, because suddenly we can.

Resources:
Slides from last week’s workshops (right-click to save as pdf files):

Day one: What is digital humanities?
- Blogpost by Timo Honkela, summarizing Day 1
Day two: What do you do with millions of words?
Day three: Projects

On teaching coding to English studies students

This term I am teaching English studies students how to code. This seemingly goes against quite a lot that I’ve been saying online for a while, so let me back up.

I’ve been involved in an interdisciplinary digital humanities course called Textlab for the past two years, as part of a university-sponsored, research-oriented, cross-faculty project; this time around I’ve stepped into a larger role of co-convening rather than simply being support staff for it. (It’s week 2 and I’m already planning a blog post called Things I Learned From Co-Convening An Interdisciplinary, Interyear Course, so stay tuned for that.) The premise of Textlab is that Computer Science and English studies students will work together in small groups which study a specific Shakespeare play with computational and literary methods (for what this looks like in practice, please see this paper). The goals of knowledge exchange here are pretty clear: English studies students learn hands-on computer skills and the computer science students learn how to practically apply their knowledge on a real-world project. In essence, the computer science students have to revisit Literature and the English studies students learn Computers. The learning curve on both sides is, effectively, massive.

In many ways the first half of this 10-week course is digital literacies for the English studies students, and the latter half is literacy literacies for the Computer Science students. We are obviously covering a lot of ground here; in the past we have had students blogging and wiki-ing and tweeting, but this year we’re scaling back to address textual analysis from the ground up, what that looks like, and how computers can supplement our literary understandings of texts.

Because this is a practical, hands-on, interdisciplinary class, the students all need to be on roughly the same page early on. And that is how I end up back with teaching English studies students how to code – we are starting with really basic Unix commands to understand how computers work, and build on these principles of commands to programs to understanding how computers can supplement human inquiry. Predictably, the CS students fly through the early Unix stuff and can really struggle with the reading, whereas the English studies students succeed wildly with the reading stuff and can really struggle with the computational stuff.

I’ve been sitting on a blog post for a long time about why I don’t think everyone needs to learn how to code and I continue to think that. This blog post may never see the light of Internet-Day; we’ll see. But the main thing I want to highlight is that I have no “official” training in programming or coding; my degrees are all in literature and linguistics and I have taught myself almost everything I know about computational analysis now. Three years ago I couldn’t have told you what a command line is and now I deal with it on a semi-regular basis. So I totally understand where the English studies students are coming from, and I also understand the potential of what the computer science students are capable of. The problem in much of the Everyone Must Learn to Code Now dogma is that you need a practical problem to care about, otherwise it is essentially meaningless. I don’t really care that Mary has three watermelons and Jane has seven oranges and how many apples does John have? Learning to code is all very well & good but the practical aspect has to be there, otherwise it just feels like our fruit distribution word problem.

So as I sat in the computer lab today with my students, I saw the English studies students struggling with commands like wc (and its -l and -w options), mkdir, uniq, grep, and cat. Part of the problem was that the students had no conception of why they were typing these letters into some computer. “I don’t understand,” they said. So I asked them what they had done so far, and they said they had just typed in some things and now they weren’t sure what to do with it.

We sat down and talked about what each step was, and what each of these commands meant, while walking through parts of the lab together. Sometimes they weren’t sure, and we had to address what to do about that. “Cat” is not a very googleable command: we aren’t looking for furry creatures with pointy ears and whiskers. But as a command, cat does a lot and what it does is not super transparent, so we talked about how to get information about what these letters mean.

The other thing I found a lot of my English studies students doing was feeling self-conscious about not knowing how to go back and fix what they understood to be a mistake. It’s easy enough to go back to the folder we are working in, but we hadn’t given them that command and there’s no easy BACK button in a terminal. Here’s a secret about me: I always have to look this up. I have to look up a lot of information when I want to do something computational. I would never claim to be a programmer, let alone a proficient one, but even after three years of this kind of stuff I have to look it up. One of the big problems they were having is that they didn’t know how, or where, to get that kind of information, so we talked about that too.

I don’t like or support the idea that computational language is like a natural language (it’s not) but the easiest analogy to make here is that learning what these letters mean is a lot like learning a language. If you’re learning French you need to know what the sounds you’re mashing together represent, and what kind of meaning they hold. Je voudrais un ananas may be meaningless to an English speaker, but to a French speaker that phrase holds meaning. Likewise, asking a computer sort -n file | uniq | sort -r > sorted_file holds a specific kind of meaning. If you don’t speak French you probably don’t understand what I just said; if you don’t speak “computer” you don’t understand what you just said either. Simply replicating letters in order doesn’t allow the students to critically engage with a task the way we might want them to. The goal of the course is to get the English studies students to understand how a computer works more fully, but producing replication tasks makes this just another black box: type this, MAGIC HAPPENS, you will have results.
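One way to open up that black box is to spell out, stage by stage, what the pipeline above is asking for. The following is only a rough Python sketch of my reading of it, not how the Unix tools are actually implemented: the function names and the toy input are made up for illustration, and real sort also takes locale settings into account.

```python
import re

def numeric_key(line):
    # Stand-in for `sort -n`: compare lines by the number they start with;
    # lines that do not start with a number are treated as 0.
    match = re.match(r"\s*-?\d+(\.\d+)?", line)
    return float(match.group()) if match else 0.0

def drop_adjacent_duplicates(lines):
    # Stand-in for `uniq`: a line is dropped only when it repeats the line
    # directly before it, which is why the input gets sorted first.
    kept = []
    for line in lines:
        if not kept or kept[-1] != line:
            kept.append(line)
    return kept

def run_pipeline(lines):
    step1 = sorted(lines, key=numeric_key)      # sort -n file
    step2 = drop_adjacent_duplicates(step1)     # | uniq
    step3 = sorted(step2, reverse=True)         # | sort -r (plain text order, reversed)
    return step3                                # > sorted_file would save this to disk

# Hypothetical file contents, one entry per line
example_lines = ["10", "2", "2", "33", "4"]
result = run_pipeline(example_lines)
print(result)  # ['4', '33', '2', '10']
```

Notice that the last step sorts the lines as text rather than as numbers, so "4" ends up ahead of "33". That is exactly the kind of thing you only catch once you know what each piece of the command means.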

Next week we are addressing pipelines more fully. My role as a teacher, educator, and mentor is to help my students understand what we’re doing, and one of the many ways this class is challenging is that I don’t want to be standing in front of them lecturing if I can avoid it. Standing in front of my students and telling them how to do things isn’t hands-on learning. But talking about how to find resources and how to ask questions is hands-on learning. In the meantime, I am thinking about how I can support the computer science students when the coin is flipped and getting them talking about literature in a way that feels tangible and relevant to them.

Early “English” Books Online?

Early English Books Online, or EEBO, is what might be technically known as “a hot mess”. (If you’re unfamiliar with EEBO and its messiness, I highly recommend Ian Gadd’s “The Use and Misuse of Early English Books Online” which summarizes how we arrived at this hot mess, Sarah Werner’s blogpost on the kinds of things EEBO doesn’t show us well, and Daniel Powell’s roundup of EEBO weirdness). I want to stress that this isn’t necessarily a bad thing, as it’s a product of time and technology from a while ago. It’s being rekeyed by humans (the TCP enterprise), and overall it is just a really big dataset of Early Modern English. When you’re looking at giant datasets like EEBO it doesn’t really matter if parts of it are imperfect. It will always be imperfect.

I’ve been looking at spelling variation for various gender terms and collocational patterns surrounding gender terms in EEBO lately because it is a really big dataset and those tend to be useful for testing our perceptions of language, especially when they contain a number of different kinds of texts. One of the ones I was looking at was hir, a known variable spelling of her. One example of this can be found in Shakespeare’s Merry Wives of Windsor (V.ii.2150); Melchiori’s Arden Shakespeare edition has a note about the phrase “his muffler”: the Folio edition of Wives reads his, but the Quarto edition reads “her muffler”. This “may be Evans’ confusion, but more likely Shakespeare’s slip or a printer’s misreading of ‘hir’, an alternative spelling of her” (2000: 253).

So in looking for examples of hir I found myself suddenly looking at Welsh. Specifically, this text, Ymadroddion bucheddol ynghylch marvvolaeth o waith Dr. Sherlock (all links will go to the Michigan Text Creation Partnership permalinks, for ease of reference. Because I’m based in the UK, my access comes from JISC Historic Ebooks, not the Chadwyck interface, meaning that generated permalinks might not work – further problems!). The below image is from the ‘text’ option on the JISC interface for Ymadroddion:

I don’t speak – or read – Welsh, let alone Early Modern Welsh, so I turned first to google translate and secondly to twitter, where I joyfully found a number of people who either work with or speak/read Welsh (and one person who studied Medieval Welsh in undergrad – officially winning the title of ‘most obscure gen ed ever’. The internet continues to amaze.)

In Welsh, hir means ‘long’, so it’s not a pronoun but an adjective. I was curious about the structures of grammatical gender in Welsh, namely if it would have agreement by gender in ways that Old English, for example, did. This answer was a little bit more complicated to elucidate but it was declared that yes, there is a gender system in Welsh; and no, it should not affect hir. [1] So, that’s good to know. But here’s a question: When we say ‘Early English Books (Online)’, do we really mean English the place, or English the language?

Linguistically, Welsh is rather decidedly not English, as the extremely useful BBC Modern Welsh Grammar will illustrate. But I was rather surprised to find Welsh being considered part of “English” in this set. So, I went back to the EEBO-TCP site, where they say the following about text selection:

Selection is based on the New Cambridge Bibliography of English Literature (NCBEL). Works are eligible to be encoded if the name of their author appears in NCBEL. Anonymous works may also be selected if their titles appear in the bibliography. The NCBEL was chosen as a guideline because it includes foundational works as well as less canonical titles related to a wide variety of fields, not just literary studies.
In general, we prioritize selection of first editions and works in English (although in the past we have also tackled Latin and Welsh texts). Because our funding is limited, we aim to key as many different works as possible, in the language in which our staff has the most expertise. However, exceptions for specific works may be made upon request.
A work will not be passed over for encoding simply because it is available in another electronic collection. Not only is the quality of these collections sometimes uncertain, a text’s presence outside of EEBO will not allow it to be searched through the same interface as the EEBO encoded texts.
Titles requested by users at partner institutions are placed at the head of the production queue.

There is quite a lot of Latin in EEBO, because it was in some ways considered a prestige language in the earlier early modern period. Many early printed books were in Latin, so it is generally unsurprising that there’s a lot of it in the EEBO set. Again, this is not English-the-language but English-The-Place. Curiously, the place of imprinting for Ymadroddion bucheddol ynghylch marwolaeth o waith Dr. Sherlock is listed as “gan Leon Lichfield, i John March yn Cat-Eaten-Street, ag i Charles Walley yn Aldermanbury, […] yn Llundain” [by Leon Lichfield, for John March in Cat-Eaten-Street, and for Charles Walley in Aldermanbury, […] in London], suggesting that “English” refers to place rather than strictly language, and it gets the following metadata:

Publication Country : England
Language : Welsh

Interesting.

Scotland joins with England in 1603 when James VI, King of Scotland, inherits the throne to become James I, King of England, but the two countries remain largely independent states until the Acts of Union in 1707. But would we find examples of Scots in EEBO? Scots, like Welsh, is an example of another localized language, though arguably Scots gets more English influence. Kirk is a nice Scots word meaning ‘church’, and here’s an example from William Dunbar’s The tua mariit wemen and the wedo. And other poems from around 1507:

Curiously, this is listed in the records as

Publication Country: Scotland
Language: English

As above, I’m not sure everyone would agree that this is “English”. Nor is it printed in “England”. But these books (and more) are there as part of Early English Books Online.

[1] Thanks to Jonathan Morris (@jonmorris83), a marketing assistant at Palgrave Linguistics, Alun Withey (@DrAlun), Liz Edwards (@eliz_edw) and Sarah Courtney (@sgcourtney)

Counting things in Early Modern Plays So You Don’t Have To: Type/Token Ratios

If you’re just joining me, I’ve been working on word frequencies of six highly-prototypical lexical items in a corpus of slightly less than 400 Early Modern London plays. I recommend starting with my research notes and then looking at some quick & dirty results.

As I noted in my quick & dirty results, these numbers hadn’t been normalized in any way: it was all raw data. In an effort to move beyond just raw data, I compiled the total number of words in each play in the corpus. I initially was interested in how play length might be a variable over time in my corpus, so I graphed that. The bulk of my plays are from the early 1600s, as you can see:

Overall, plays do seem to get longer until about 1600, at which point they start to get shorter again. 1662 looks to be an outlier here, as the plays in a straight line on the far right-hand side are mostly by Margaret Cavendish. (I am currently trying to figure out how to color my graphs by author, so if you have advice on that, please let me know: I’m rather haphazardly teaching myself to graph in R as I go.)

OK, so I have the total number of tokens in each text. What if I treated every instance of my prototypical lexical items as a specific type, and plotted them as type/token ratios? Type/token ratios have a somewhat messy history in corpus linguistics, as they’re mostly used to calculate vocabulary density (Type/Token Ratios: what do they really tell us?, Richards 1987 [pdf]), but this would show a ratio of the raw frequency of each lexical item of interest in each play compared to the length of each play, which would normalize my data a bit.
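To make that ratio concrete, here is a minimal Python sketch of the calculation as I have described it: the number of hits for a lexical item divided by the total number of tokens in the play. The tokenizer is deliberately crude, and the function names and the toy "play" are invented for illustration; my actual counts came from AntConc, not from this code.

```python
import re

def tokenize(text):
    # Lowercase the text and keep runs of letters and apostrophes as tokens.
    # A real corpus would need more careful handling of early modern spelling.
    return re.findall(r"[a-z']+", text.lower())

def relative_frequency(text, forms):
    # Raw frequency of the chosen word forms divided by the length of the text
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in forms)
    return hits / len(tokens)

# Hypothetical nine-word "play" in which lord appears twice
toy_play = "My lord, the lady waits. The lord is gone."
ratio = relative_frequency(toy_play, {"lord"})
print(f"{ratio:.3f}")  # 0.222
```

Because the result is a proportion of each play’s length, a short play and a long play can be put on the same axis, which raw counts do not allow.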


First of all, it’s notable that the lexical-frequency-to-play-length ratios make some pretty clear bell-curve shapes; I haven’t tried to calculate standard deviations of play-length. (I suppose I could do that next.) The average length of an Early-Modern London play in my corpus was 22086.5 words.

It seems that as plays get longer, they’re more likely to use man (and, to some extent, wom*n) in ways that are not true for lord/lady and knave/wench. It’s also worth looking at scales here: there are nearly twice as many lords as ladies, although man/woman and knave/wench are more comparable. Also, there are way fewer instances of knave and wench in my corpus overall, which suggests that maybe these words are not nearly as popular as we might like to think.

Counting things in Early Modern Plays So You Don’t Have To: Some Quick & Dirty Results

I was given a corpus of 400 plays for my PhD on gender in Early Modern London plays. Up to this point I had been focusing largely on Shakespeare, but have recently been moving into the larger corpus. So what does one do with 400 plays? My solution was “get to know them a little bit.” I was counting the raw frequencies for lord/lady, man/wom*n, and knave/wench in the entire corpus using AntConc, manually recording them, and then transcribing this data into a spreadsheet. I had selected these terms on the basis that I had recently spent a lot of time looking at likely collocates for these terms, as these binaries represent a high-, neutral-, and low-formality distinction.

Several of my twitter followers asked why I was just looking at wom*n and not also m*n, and the answer is that without a regular expression I was going to get a fair quantity of noise from m*n (including but certainly not limited to man, men, mean, moon, maiden, maintain, morn, mutton…). Wom*n, I had found, was a highly successful use of a wildcard, only picking up woman and women in the corpus. While this category remains somewhat imbalanced, it presents a pretty clear scope of the quantities for more neutral forms. Now that I have a better sense of what my corpus is like beyond “those files in that folder on my computer”, I can always go back and get other information pretty easily.
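The difference between the two wildcards is easy to see on a small word list. This is a Python sketch rather than what AntConc does internally: I am assuming its * behaves like the shell-style wildcard in Python’s fnmatch module (zero or more characters), and the word list is just the examples mentioned above plus wench as a control.

```python
import fnmatch
import re

# Toy word list: the noisy matches named above, the two target pairs, and a control
words = ["man", "men", "mean", "moon", "maiden", "maintain",
         "morn", "mutton", "woman", "women", "wench"]

# m*n matches anything that starts with m and ends with n
noisy = [w for w in words if fnmatch.fnmatchcase(w, "m*n")]

# wom*n happens to be narrow enough to catch only the two forms wanted
clean = [w for w in words if fnmatch.fnmatchcase(w, "wom*n")]

# A regular expression can pin down exactly one vowel between m and n
strict = [w for w in words if re.fullmatch(r"m[ae]n", w)]

print(noisy)   # all eight m...n words, from man through mutton
print(clean)   # ['woman', 'women']
print(strict)  # ['man', 'men']
```

The regular expression m[ae]n is the sort of pattern that would have let me count man and men without the noise.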

What can we learn from a corpus of 400 plays?
For starters, there’s not actually 400 plays in the 400-play-corpus, but 325 plays. I knew when I started this project that this corpus was less than 400, and that it did not cover everything. It is a representative corpus, but I was a bit surprised at how much less than 400 plays I actually had. These 325 plays cover 53 individual authors from the years 1514-1662,* which looks like this:

Each dot represents a year of publication. You will note that some authors are more represented than others (Shirley, for example, has 33 plays in the corpus, spanning a number of years, whereas someone like Beza has only one play in the corpus.) The average year for a play to be published was in 1613, and an overwhelming majority of these plays have been published in the late 1500s into the first half of the 1600s.

Once I had the raw frequencies for everything, I was curious to see how these terms performed diachronically. For ease I’m going to keep calling it the 400-play corpus, and as you’re reading, remember that this is very quick & dirty. There’s a lot more to say & do with this data, but I think talking about raw data is a useful endeavor in that it speaks volumes about the sample itself.

These graphs suggest that the use of lady and wom*n looks more frequent in the corpus from the late 1500s onwards (they’re both almost in a parabola shape) whereas the use of lord and man begins to decline around 1600, creating more of a bell curve effect.

And what about knave and wench? We see there’s a distinct decrease in usage for both just after the early 1600s, though knave was more frequent earlier in the corpus:

Two of these three sets of binaries show very similar graphs, but that’s because this is raw data: there’s simply more instances of plays occurring around the late 1530s onwards.

This was my first time using R for any graphing ever, so I’m going to dive back in and see what I can do with a more normalized corpus next.

—
Additionally, I owe a great debt to the following people, who were very selfless and helpful:
Sarah Werner, Julia Flanders, Shawn Moore, Douglas Clark, Simon Davies. Thank you.

Choosing tools, or why your computer is an abacus

If you wanted to put your small basil plant in your garden, would you find a backhoe to do it? Probably not, as you could dig a reasonable sized hole pretty quickly with a small shovel. Just like you don’t really need to bring out heavy machinery to do a simple gardening task, you probably don’t need complex tools to do small bits of text analysis. Is it impressive? Sure. Is it really necessary? Um, no, probably not.

When it comes to choosing and using digital tools for text analysis it’s a bit like gardening: you want to choose the right tool for the task. Some projects, like planting our basil, don’t really require anything complex, and you might be better off doing something the “old-fashioned” way of reading than overcomplicating with things that aren’t actually adding anything to your analysis.

Computers are very good at counting things. There is a nearly endless number of tools which will help you count a variety of things in texts; that link probably doesn’t cover all of them. How do you know if you’re using the right one? Digital projects can be great, and digital analysis can be really useful, but if you can see it with your own eyes you probably don’t need a computer to tell it to you. Thematic elements often come out as distinctive when you compare texts against one another. In The Tempest, words like ‘drown’, ‘island’, ‘isle’, ‘fish’ and ‘sea’ are more likely to appear, but you really don’t need a computer or complex statistics to tell you that, as Jonathan Hope points out. Digital tools that count things are much better suited to larger projects, and to questions that are much less thematic and much more specific.
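For anyone curious what “comparing texts against one another” looks like when a computer does the counting, here is a minimal Python sketch of a log-likelihood (G2) keyness score, the two-corpus form commonly used in corpus linguistics. It is an illustration under stated assumptions, not the calculation any particular tool performs: the function name log_likelihood is my own, and the counts below are invented rather than taken from The Tempest.

```python
import math


def log_likelihood(freq_a, size_a, freq_b, size_b):
    """G2 score for one word: freq_* are its counts, size_* the corpus sizes."""
    total = size_a + size_b
    # Expected counts if the word were spread evenly across both corpora.
    expected_a = size_a * (freq_a + freq_b) / total
    expected_b = size_b * (freq_a + freq_b) / total
    g2 = 0.0
    for observed, expected in ((freq_a, expected_a), (freq_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2 * g2


# Hypothetical counts: a word in one play versus the rest of a corpus.
skewed = log_likelihood(30, 1_000, 10, 10_000)
even = log_likelihood(10, 1_000, 100, 10_000)
print(skewed > even)
```

A word used at the same rate in both samples scores zero, and the score grows as the rates diverge. That is all the statistic knows: it will happily confirm that a play set on an island mentions islands.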

So how do you know if you’re using the right tool for the task at hand? Well, you don’t always. Currently I have at least six tools for straight-up text analysis installed on my computer, and I can access more than a few others from my web browser. I’m compiling one myself. Do I really need all of these? In a word: yes. One is not better than the others. One might be more robustly informative than the others, depending on what I’m looking for.

In my recent research on the Shakespeare corpus I’ve found myself cross-slicing between a concordance program (AntConc), a statistical analysis tool (WordHoard), and the texts themselves (Open Source Shakespeare), and I will pull in others as they’re useful. It’s not that these tools individually aren’t doing enough; it’s that between the three of them, I can get a much clearer picture of what’s actually happening in my texts. Professor Alan Bryman has an excellent paper on triangulation from 2004 (pdf), where he argues for a three-check system “to enhance credibility and persuasiveness of a research account” (2004: 4). In other words: if you can find it once, that’s exciting; if you can find it twice, even better; but if you can find it three times, it’s a truth. Justifiably, it’s even more exciting when someone using entirely different tools and asking an entirely different question can arrive at the same conclusion that you did, albeit on a much larger scale. Of course, I have the unspoken benefit of working on Shakespeare, who is widely digitized, but I’d return to the texts regardless of who I’m working on; I just might have to change my approach a bit.

When it comes to choosing tools for text analysis, “it was there so I used it” is not an acceptable answer. You should know what your tool can and cannot do, its benefits and its limitations, and you should be able to account for them. A tool is just an interpretation of data, as I said previously, and what you can see in one tool might not be enough to justify your claim. Trying a variety of approaches might show you something that you missed the first, second, or third time around: a small detail can lead to much bigger and better questions than simply accepting the first thing you try. A KWIC concordance might not be showing you enough of your data; a log-likelihood analysis might be telling you too much, and your wordcloud might not be showing you anything useful at all. Like anyone else, I have my favorite tools, and I’m likely to turn to them first and recommend them above other text analysis tools. Are they right for your project? In all honesty: I don’t know.
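To show what a KWIC (keyword in context) view actually does, and why it might not be showing you enough, here is a bare-bones Python sketch. It is not how AntConc or any other concordancer is implemented; the function name kwic, the width parameter, and the sample sentence are all made up for illustration.

```python
import re


def kwic(text, keyword, width=3):
    """Return (left context, keyword, right context) for every hit."""
    # Very crude tokenizer: lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines


# Invented sample text, not a quotation from any play.
sample = "The sea was calm. No fish swam in the sea near the island."
hits = kwic(sample, "sea", width=2)
for left, word, right in hits:
    print(f"{left:>10} | {word} | {right}")
```

With a window of two words either side, the second hit never shows you the fish. Widen the window and you see more, but every setting is a decision about what counts as context, and you should be able to account for it.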

But none of this should stop you from using digital tools. I occasionally use KWIC tools as a search engine for a specific corpus, and I will introduce friends and colleagues to them for that purpose, which is probably poor scholarship. But much more interesting things can happen when you break the rules of what the tool should do, which is another blogpost in and of itself.