7 reasons why I think this Hebrew-Latin book from 1683 is really cool

A few years ago I wrote about non-English language printing in EEBO, a post which still gets a fair amount of traffic and a lot of people asking me about Welsh. So when I found a bilingual Latin/Hebrew book in EEBO on Friday night while searching for something else just as I was getting ready to go meet some friends for dinner, I was overjoyed. This is a book printed in Cambridge, England, in 1683 and contains two languages which are very much not English.

JISC’s EEBO portal lists the title as “Komets leshon ha-koresh ve-ha-limudim = Manipulus linguae sanctae & eruditorum : in quo, quasi, manipulatim, congregantur sequentia, I. index generalis difficilorum vocum Hebraeo-Biblicarum, irregularium, & defectivarum, ad suas proprias radices, & radicum conjugationes, tempora, & personas, &c. reductarum” (R1614 Wing), describing it briefly as a Hebrew grammar (with the first four words in the title transliterated from Hebrew). My years of Hebrew school did not leave me a fluent Hebrew speaker or reader; I have no formal Latin or bibliographic training, but this book is really cool. Here are some reasons why…

This isn’t the title page, but it is the introductory material and you can see that it contains Hebrew, Latinate and Greek characters on the same page:

Screen Shot 2015-12-12 at 3.45.54

For starters, this is a bilingual grammar and index to the Old Testament, serving in some ways as a precursor to my digital concordances. But it also is fascinating because it involves several different typefaces representing several different languages, so someone in 1683 had either created a typeface for Hebrew or had access to a Hebrew typeface to print this book. Furthermore, Hebrew has a script form and a block letter form; the block letters are often used in printing whereas script is much more common elsewhere. Torahs are hand-copied onto vellum (even today!), so it is plausible someone may have had to transform each scripted character into block letters for this.

Hebrew is read from right to left, whereas Latin is read from left to right, so this book had to be very carefully typeset to put these two languages back to back. Hebrew also has a vowel system which is optional in print; when the vowel marks are included, they usually sit under the consonants. Torahs often do not use the vowel system, so the inclusion of the marks here (look for the lines, dots and small T’s) is interesting and an extra complication for typesetting.

Screen Shot 2015-12-12 at 3.46.25

The catchwords at the bottom of the page are printed in Hebrew here, but the book uses Latinate numbering. And – as my mother pointed out – entries are listed alphabetically in Hebrew (not in Latin).

Screen Shot 2015-12-12 at 3.46.39

It also includes a list of ambiguities, still written in both languages, and still juxtaposed with a left-to-right and right-to-left language.

Screen Shot 2015-12-12 at 3.46.53

So this is already interesting from a printing perspective, but then there are also grammatical notes and commentaries included, with descriptions of how to use this grammar. And still the juxtaposition of both languages on the same line is really fascinating:

Screen Shot 2015-12-12 at 3.47.06 Screen Shot 2015-12-12 at 3.47.21

From the grammatical guide, here is a table of conjugations in Hebrew, marked with Latin descriptions (active, passive, future, participles, etc.):

Screen Shot 2015-12-12 at 3.45.25

And finally it ends in a two-column translation of Hebrew text into Latin:

Screen Shot 2015-12-12 at 4.06.43

Download the EEBO scan as a PDF for more.


Ways of Accessing EEBO(TCP)

On October 28, 2015, the Renaissance Society of America sent an email to all members announcing the demise of their previous partnership with ProQuest (now in control of ExLibris too). Their email to all of us, in full:

The RSA Executive Committee regrets to announce that ProQuest has canceled our subscription to the Early English Books Online database (EEBO). The basis for the cancellation is that our members make such heavy use of the subscription, this is reducing ProQuest’s potential revenue from library-based subscriptions. We are the only scholarly society that has a subscription to EEBO, and ProQuest is not willing to add more society-based subscriptions or to continue the RSA subscription. We hoped that our special arrangement, which lasted two years, would open the door to making more such arrangements possible, to serve the needs of students and scholars. But ProQuest has decided for the moment not to include any learned societies as subscribers. Our subscription will end a few days from now, on October 31. We realize this is very late notice, but the RSA staff have been engaged in discussions with ProQuest for some weeks, in the hope of negotiating a renewal. If they change their mind, we will be the first to re-subscribe.

This is truly terrible news, especially for anyone whose institution did not/could not subscribe to the ProQuest interface.

**EDIT 29 Oct 8:05pm**: the RSA confirms that access to EEBO via ProQuest will continue:

We are delighted to convey the following statement from ProQuest:

“We’re sorry for the confusion RSA members have experienced about their ability to access Early English Books Online (EEBO) through RSA. Rest assured that access to EEBO via RSA remains in place. We value the important role scholarly societies play in furthering scholarship and will continue to work with RSA — and others — to ensure access to ProQuest content for members and institutions.”

The RSA subscription to EEBO will not be canceled on October 31, and we look forward to a continued partnership with ProQuest.

Perhaps because the first set of TCP editions of the EEBO texts are now part of the public domain, this is supposed to be sufficient for scholars’ use. Of course, this is not true: the TCP texts are a facsimile of the EEBO images (themselves facsimiles of facsimiles). However inadequate the TCP texts are for someone without an EEBO subscription, I have been collecting a number of links for a number of years about how to access and use EEBO(TCP). Although the decision was overturned, the benefit of having all these resources listed together seems to justify their continued existence here. They are also available on my links page, but in the interest of accessibility, here they are replicated:

1 EEBO(TCP) documentation
Text Creation Partnership http://www.textcreationpartnership.org/
EEBO-TCP documentation http://www.textcreationpartnership.org/docs/
EEBO-TCP Tagging Cheatsheet: Alphabetical list of tags with brief descriptions http://www.textcreationpartnership.org/docs/dox/cheat.html
Text Creation Partnership Character Entity List http://www.textcreationpartnership.org/docs/code/charmap.htm
The History of Early English Books Online http://folgerpedia.folger.edu/History_of_Early_English_Books_Online
Using Early English Books Online http://folgerpedia.folger.edu/Using_Early_English_Books_Online

2 Access to EEBO(TCP) full texts (searchable)
Early English Books Online (EEBO): JISC historical books interface (UK, paywall, free access from the British Library Reading Room) http://historicaltexts.jisc.ac.uk/
Early English Books Online (EEBO): Chadwyck-Healey interface (outside UK, paywall; your mileage may vary by country) http://eebo.chadwyck.com/home

The Dutch National Library has off-site access, including full EEB (European books), ECCO (18C), TEMPO (pamphlets), for members, 15€/yr. Register online: inschrijven.kb.nl/index.php

EEBO-TCP Texts on Github https://github.com/textcreationpartnership/Texts
UMichigan TCP repository https://umich.app.box.com/s/nfdp6hz228qtbl2hwhhb/
UMichigan EEBO-TCP full text search http://quod.lib.umich.edu/e/eebogroup/*
University of Oxford Text Archive TCP full text search http://ota.ox.ac.uk/tcp/*
* These sites are mirrors of each other
See also 10 things you can do with EEBOTCP

EEBO-TCP Ngram reader, concordancer, & text counts  http://earlyprint.wustl.edu/
CQPWeb EEBO-TCP, phase I (and many others) https://cqpweb.lancs.ac.uk/
(Video guide to CQPWeb: https://www.youtube.com/watch?v=Yf1KxLOI8z8&list=PL2XtJIhhrHNTxjyZ5VSKUr0-4EuzJJDbe)
BYU Corpora front end to EEBO-TCP (*not completely full text but will be soon*) http://corpus.byu.edu/eebo

3 Other resources
English Short Title Catalogue (ESTC), http://estc.bl.uk/
Universal Short Title Catalogue (USTC) http://www.ustc.ac.uk/
Items in the English Short Title Catalogue (ESTC), via Hathitrust http://babel.hathitrust.org/cgi/mb?a=listis;c=247770968
LUNA, Folger Library Digital Image Collection http://luna.folger.edu/
Internet Archive Books https://archive.org/details/early-european-books
The Folger Digital Anthology of Early Modern English Drama http://digitalanthology.folger.edu/

Sarah Werner’s compendium of resources, incl digitised early books http://sarahwerner.net/blog/a-compendium-of-resources/
Laura Estill’s Digital Renaissance wiki page covers online book catalogues, digitised facsimiles, early modern playtexts online, print and book history, etc http://digitalrenaissance.pbworks.com/w/page/54277828/EarlyModernDigitalResources
(see also her very thorough guide to manuscripts online http://manuscriptresearch.pbworks.com/w/page/48026041/FrontPage)
Claire M. L. Bourne’s Early Modern Plays on Stage & Page resource list: http://www.ofpilcrows.com/resources-early-modern-plays-page-and-stage

Large Digital Libraries of Pre-1800 Printed Books in Western Languages http://archiv.twoday.net/stories/6107864/
The University of Toronto has a large number of Continental Renaissance text-searchable books online http://link.library.utoronto.ca/booksonline/
30+ digitised STC titles at Penn (free to use, from their collection) http://franklin.library.upenn.edu/search.html?filter.library_facet.val=Rare%20Book%20and%20Manuscript%20Library&q=STC%20collection%20%22sceti%22&sort=publication_date_sort%20asc,%20title_sort%20asc

UCSB Broadside Ballads Archive http://ebba.english.ucsb.edu/
Broadside Ballads Online http://ballads.bodleian.ox.ac.uk/

A database of early modern printers & sellers culled from the eMOP source documents https://github.com/Early-Modern-OCR/ImprintDB
(And their mirror of the ECCO-TCP texts: https://github.com/Early-Modern-OCR/TCP-ECCO-texts)

Database of Early English Playbooks (DEEP) http://deep.sas.upenn.edu/

How to save and download pdfs from the Chadwyck EEBO Interface https://www.youtube.com/watch?v=6u2B_MagrPc

And a crucial read from Laura Mandell and Elizabeth Grumbach on the digital existence of ECCO (Eighteenth Century Collections Online): http://src-online.ca/src/index.php/src/article/view/226/448

this page will update with more resources as they are available. email me with links: heathergfroehlich at gmail dot com // 15 Aug 2016

Suggested Ways of Citing Digitized Early Modern Texts

On 1 January 2015, 25,000 hand-keyed Early Modern texts entered the public domain and were publicly posted on the EEBO-TCP project’s GitHub page, with an additional 28,000 or so forthcoming into the public domain in 2020. This project is, to say the least, a massive undertaking and marks a sea change in scholarly study of the Early Modern period. Moreover, we had nearly worked out how to cite the EEBO texts (the images of the books themselves) just before this happened: Sam Kaislaniemi has an excellent blogpost on how one should cite books in the EEBO Interface (May, 2014), but his main point is replicated here:

When it comes to digitized sources, many if not most of us probably instinctively cite the original source, rather than the digitized version. This makes sense – the digital version is a surrogate of the original work that we are really consulting. However, digitizing requires a lot of effort and investment, so when you view a book on EEBO but only cite the original work, you are not giving credit where it is due. After all, consider how easy it now is to access thousands of books held in distant repositories, simply by navigating to a website (although only if your institution has paid for access). This kind of facilitation of research should not be taken for granted.

In other words, when you use digitized sources, you should cite them as digitized sources. I do see lots of discussions about how to best access and distribute (linked) open data, but these discussions tend to avoid the question of citation. In my perfect dream world every digital repository would include a suggested citation in their README files and on their website, but alas we do not live in my perfect dream world.

For reasons which seem to be related to the increasingly widespread use of the CC-BY licences, which allow individuals to use, reuse, and “remix” various collections of texts, citation can be a complicated aspect of digital collections, although it doesn’t have to be. For example, this site has a creative commons license, but we have collectively agreed that blog posts etc are due citation; the MLA and APA offer guidelines on how to cite blog posts (and tweets, for that matter). If you use Zotero, for example, you can easily scrape the necessary metadata for citing this blog post in up to 7,819 styles (at the time of writing). This is great, except when you want to give credit where credit is due for digitized text collections, which are less easy to pull into Zotero or other citation managers. And without including this information somewhere in the corpus or documentation, it’s increasingly difficult to properly cite the various digitized sources we often use. As Sam says so eloquently, it is our duty as scholars to do so.

Corpus repositories such as CoRD include documentation such as compiler, collaborators, associated institutions, wordcounts, text counts, and often include a recommended citation, which I would strongly encourage as a best practice to be widely adopted.

Screen Shot 2015-08-05 at 11.15.04

Here is a working list of best citation practices outlined for several corpora I am using or have encountered. These have been cobbled together from normative citation practices with input from the collection creators. (Nb. collection creators: please contact me with suggestions to improve these citations).

This is a work in progress, and I will be updating it occasionally where appropriate. Citations below follow MLA style, but should be adaptable into the citation model of choice.

Folger Shakespeare Library. Shakespeare’s Plays from Folger Digital Texts. Ed. Barbara Mowat, Paul Werstine, Michael Poston, and Rebecca Niles. Folger Shakespeare Library, dd mm yyyy. http://folgerdigitaltexts.org/

Mueller, M. “Wordhoard Shakespeare”. Northwestern University, 2004- 2013. Available online: http://wordhoard.northwestern.edu/userman/index.html

Mueller, M. “Standardized Spelling WordHoard Early Modern Drama corpus, 1514- 1662”. Northwestern University, 2010. Available online: http://wordhoard.northwestern.edu.

Mueller, M. “Shakespeare His Contemporaries: a corpus of Early Modern Drama 1550-1650”. Northwestern University, 2015. Available online:  https://github.com/martinmueller39/SHC/

EEBO-TCP access points:
There are several access points to the EEBO-TCP texts, and one problem is that the text IDs included don’t always correspond to the same texts in all EEBO viewers, as Paul Schaffner describes below.

Benjamin Armintor has been exploring the implications of this on his blog, but in general if you’re using the full-text TCP files, you should be citing which TCP database you are using to access the full-text files. Where appropriate, I’ve included a sample citation as well.

1. For texts from http://quod.lib.umich.edu/e/eebogroup/, follow the formula below:

EEBOTCP michgan

Author. Title. place: year, Early English Books Online Text Creation Partnership, Phase 1, Oxford, Oxfordshire and Ann Arbor, Michigan, 2015.  quod.umich.edu/permalink date accessed: dd mm yyyy

Webster, John. The tragedy of the Dutchesse of Malfy As it was presented priuatly, at the Black-Friers; and publiquely at the Globe, by the Kings Maiesties Seruants. The perfect and exact coppy, with diuerse things printed, that the length of the play would not beare in the presentment. London: 1623, Early English Books Online Text Creation Partnership, Phase 1, Oxford, Oxfordshire and Ann Arbor, Michigan, 2015. Available online:  http://name.umdl.umich.edu/A14872.0001.001, accessed 5 August 2015.
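
If you are citing many texts from the same access point, the formula above is regular enough to fill in programmatically. The following is only a sketch: the function name `cite_michigan_tcp` and its parameters are hypothetical, the wording is copied from the sample citation above, and the title is shortened for the example.

```python
def cite_michigan_tcp(author, title, place, year, permalink, accessed):
    """Assemble a citation following the formula given for the Michigan interface."""
    return (
        f"{author}. {title}. {place}: {year}, "
        "Early English Books Online Text Creation Partnership, Phase 1, "
        "Oxford, Oxfordshire and Ann Arbor, Michigan, 2015. "
        f"Available online: {permalink}, accessed {accessed}."
    )


citation = cite_michigan_tcp(
    author="Webster, John",
    title="The tragedy of the Dutchesse of Malfy",  # shortened title, for illustration only
    place="London",
    year=1623,
    permalink="http://name.umdl.umich.edu/A14872.0001.001",
    accessed="5 August 2015",
)
print(citation)
```

Adapting this to another citation style should only mean changing the template string, not the fields you record.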

2. For the Oxford Text Creation Partnership Repository (http://ota.ox.ac.uk/tcp/) and the searchable database there

Oxford TCP search page

Author. Title. Early English Books Online Text Creation Partnership, phase I: Oxford, Oxfordshire and Ann Arbor, Michigan, 2015 [place: year]. Available online at http://ota.ox.ac.uk/tcp/IDNUMBER; Source available at https://github.com/TextCreationPartnership/IDNUMBER/.

Rowley, William. A Tragedy called All’s Lost By Lust. Early English Books Online Text Creation Partnership, phase I: Oxford, Oxfordshire and Ann Arbor, Michigan, 2015 [London: 1633]. Available online: http://tei.it.ox.ac.uk/tcp/Texts-HTML/free/A11/A11155.htm; Source available at: https://github.com/TextCreationPartnership/A11155/

3. The entire EEBO-TCP Github repository

Github EEBOTCP

Early English Books Online Text Creation Partnership, Phase I. Early English Books Online Text Creation Partnership, phase I: Oxford, Oxfordshire and Ann Arbor, Michigan, 2015. Available online: https://github.com/textcreationpartnership/Texts

If you are citing bits of the TCP texts as part of the whole corpus of EEBO-TCP, it makes the most sense to parenthetically cite the TCP ID as its identifying characteristic (following corpus linguistic models). So for example, citing a passage from The Duchess of Malfi above would include a parenthetical with the unique TCP ID (A14872).

(Presumably other Text Creation Partnership collections, such as ECCO and EVANS, should be cited in the same manner.)

A cautionary concordance plot tale

In my previous post I addressed how to produce a view of many concordance plots at once, and presented concordance plots for twelve vocatives which are indicative of social class in Shakespeare and a larger reference corpus of Early Modern Drama.

After double-checking all the concordance plot files using a hand-numbered master sheet, I normalised the files using the command convert plot*.jpg -size 415x47! plot*.jpg (on the off chance that any files weren’t ultimately the same size), created a new folder of the normalised files, and pulled out the examples which matched the numbers I had for Shakespeare’s plays for further analysis. I hadn’t addressed titles, as I wasn’t really aiming to look at individual authors, so each file is named plot1, plot2, plot234, etc. I went on to compile the results for these plays, felt confident about the fact that I had isolated Shakespeare, and wrote up my previous blog post.

This morning I had a nagging thought: What if those weren’t Shakespeare’s plays? After all, I had broken my #1 rule about using computational methods – assuming that everything at every step of the process worked the way I thought it did. I am probably a self-parodying pedant when it comes to computational methods, because when something goes wrong at some stage in the coding process it may *never* be visible or even noticed in the final output, and this gives me reason to seriously distrust automated processes for analysis.

Ultimately, I decided I would double-check the plays I had deemed to be “Shakespeare”’s. Even though I hadn’t done much automated processing with the image files, I had assumed that the normalisation process would only change the file names to represent a modified version: so that plot10 would become plot0-10, plot11 would become plot0-11, plot234 would become plot0-234. I had assumed the information in these files wouldn’t change, and the names would correspond to the original files.

This was not true. Instead, I had isolated a very nice sample of 36 plays which I thought matched Shakespeare’s plays in numbering, but turned out to be sampled from throughout the corpus. Matching the sampled “Shakespeare” concordance plots to the master document of concordance plots, I found that I had at least one Middleton play and at least one Seneca play in addition to some (but not all) Shakespeare plays.[1] At this point I was worried, so I re-created Shakespeare’s concordance plots from the master document of concordance plots. By redoing the concordance plots, I could guarantee that these were at least all Shakespeare’s plays in the first instance. Then I normalised them again for size, and went back to see what happens in that process. The first files were a perfect match, as I had hoped. But once I moved to the second concordance plot, I was in trouble.

Below is an image showing the unmodified concordance plot for The Taming of the Shrew (shx2), outlined in red and on the top left-hand side. The other eight concordance plots in this image are normalised for size, and even without great detail you can tell that none of these match the original file. You don’t even need to see the whole image to see this:

Screen Shot 2015-02-26 at 3.26.03

In other words, as I had suspected, the names of the normalised files didn’t correspond to the original file names, though they were all there.[2] More worryingly, I hadn’t caught it because I had assumed that the files were fine after running a process on them. The files produced results, and if I hadn’t double-checked (really, at this point, triple-checked), I wouldn’t have caught this discrepancy.
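
One plausible explanation, offered as an assumption rather than a confirmed diagnosis of what ImageMagick did here: a shell expands `plot*.jpg` in lexicographic order, so `plot10` comes straight after `plot1`, and any output numbered by position in that sequence will drift away from the numbers in the original names. That would fit the first file matching and the second one not. The sketch below uses invented file names in the plot1, plot2, … scheme to show the drift; `natural_key` is a hypothetical helper.

```python
import re


def natural_key(name):
    """Split a file name into text and integer chunks so plot2 sorts before plot10."""
    return [int(chunk) if chunk.isdigit() else chunk
            for chunk in re.split(r"(\d+)", name)]


# Hypothetical file names following the plot1, plot2, ... scheme.
names = [f"plot{i}.jpg" for i in range(1, 13)]

glob_order = sorted(names)                      # lexicographic, like shell glob expansion
numeric_order = sorted(names, key=natural_key)  # the order a hand-numbered sheet implies

# Pair each position in the sequence with the file that actually lands there.
for slot, (got, expected) in enumerate(zip(glob_order, numeric_order)):
    flag = "" if got == expected else "  <-- mismatch"
    print(f"slot {slot}: {got} (expected {expected}){flag}")
```

Zero-padding the names (plot001, plot002, …) before running any batch command makes the two orders agree, which is a cheap way to rule this problem out.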

So what do concordance plots for Shakespeare’s plays look like in composite for the vocatives attached to a name in a bigram (reminder of search terms: lord [A-Z]|sir [A-Z]|master [A-Z]|duke [A-Z]|earl [A-Z]|king [A-Z]|signior [A-Z]|lady [A-Z]|mistress [A-Z]|madam [A-Z]|queen [A-Z]|dame [A-Z])? Well, surprisingly, not so different from the sample curated previously, which may be less indicative of a specific authorial style:


Remember, we read these from left to right; now there’s a lot of use of vocatives in the very beginning of the plays, which stay quite strong near the rising action until there’s a relative absence just before and around the climax and the start of the falling action. Curiously, the heavy double hit || towards the end is still very visible, as well as a few more dark lines leading up to the conclusion. In some ways, the absence of these vocatives is almost more consistent, and therefore the white bits are more visible.

In the meantime I’m having a fascinating discussion with Lauren Ackerman about how to best address pixel density and depth of detail (especially in the larger EM play corpus), so maybe there will be a third instalment of concordance plots in the future.

[1] Seneca’s plays were published in the 1550s and 1560s, which is why they are included in this data set of printed plays in Early Modern London.

[2] The benefit of working with a smaller set like this is that there is a much smaller, finite number of texts to address: rather than n = 332^332 possible combinations, I was now only looking at a possibility of n = 36^36. So that was an improvement. In case you’re wondering what happened to one play, because previously I had claimed there were 37 Shakespeare plays: one play doesn’t have any instances of the vocatives being addressed in a bigram with a capital letter.

How to address many concordance plots at once

What if you could take many concordance plots and layer them to get a composite view of many concordance plots in one image? I wanted to see if vocatives which mark for high-status individuals attached to a name appear in any particular pattern which resembles Freytag’s model of dramatic structure.[1]

I selected 12 vocatives which clearly illustrate social class attached to a word beginning with a capital letter for analysis, all of which are relatively frequent in the corpus of 332 plays comprising 7,305,366 words. In order to get my concordance plots for vocatives attached to a name, I used regular expressions searching for the vocative in question in a bigram with a capital letter, strung together by pipes (|), so the resulting search looked like this (signior is spelled incorrectly; this is the spelling which produced hits – I suspect something happened in the spelling normalisation stage):
lord [A-Z]|sir [A-Z]|master [A-Z]|duke [A-Z]|earl [A-Z]|king [A-Z]|signior [A-Z]|lady [A-Z]|mistress [A-Z]|madam [A-Z]|queen [A-Z]|dame [A-Z]
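
To see what this search does and does not pick up, here is a small sketch using Python’s `re` module rather than AntConc (whose regex engine may behave differently, for instance around case sensitivity). The sample sentence is invented, not taken from the corpus.

```python
import re

# The twelve vocatives from the search above, each followed by a space and a capital letter.
VOCATIVES = ["lord", "sir", "master", "duke", "earl", "king",
             "signior", "lady", "mistress", "madam", "queen", "dame"]
pattern = re.compile("|".join(f"{v} [A-Z]" for v in VOCATIVES))

# Invented sample line for illustration only.
sample = "good morrow sir Toby, said the queen I know and lady Olivia to my lord"

hits = [match.group() for match in pattern.finditer(sample)]
for match in pattern.finditer(sample):
    print(match.start(), repr(match.group()))
```

The final "lord" is not matched because no capital letter follows it, while "queen I" is matched even though "I" is a pronoun rather than a name.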

Although the regular expression I used picked up examples of queen I and the like, examples of a capital letter representing the start of a name were far more frequent overall. In the case of mistress, Alison Findlay’s definition (“usually a first name or surname, is a form of polite address to a married woman, or an unmarried woman or girl” (2010, 271)) accounts for its inclusion here. Though there are certainly complicated readings of this title, I consider instances of mistress to be at the very least a vocative relating to social class in Early Modern England.

The obvious solution to doing this kind of work is R, as people such as Douglas Duhaime and Ted Underwood have been making some gorgeous composite graphs with R for a number of years. To be honest, I didn’t really want to go through the process of addressing a corpus by writing an entire script to produce something that I know can be done quickly and easily in AntConc’s concordance plot view: I had one specific need; AntConc is an existing framework for producing concordance plots which are normalised for length, as well as a KWIC viewer and several other statistical analyses. I knew that if I wanted to check anything, I could do it easily. I didn’t feel any real need to reinvent the wheel by scripting to accomplish my task, unlike the general DIY process presented by R or Python.[2] The only real downside is that if you want to do more with the output, you have to move into another software package to do that, but even that is not the end of the world.

Ultimately, what I wanted to do was take concordance plots for 332 plays and layer them for a composite picture of how they appear, rather than address them as individual views on a play-by-play basis. Layering images is a common way of addressing edits in printed books; Chris Forster has done exactly that with magazine page size; he suggested I use ImageMagick, a command line processing tool for image compositing.[3] I have a similarly normalised view of texts at my disposal, as each concordance line is normalized for length. Moreover, Chris and I are of the same mind when it comes to not introducing more complicated software for the sake of using software, so when he told me about this I was willing to give it a try, especially as he has successfully done exactly what I was trying to do. But first I needed concordance plots.

AntConc produces concordance plots but won’t export them, which is annoying but not as annoying as you may think. 38 screen grabs later, I had .pngs of each play’s concordance line. Here they are in AntConc:

womp womp

(If you’re not used to reading concordance lines, you read them from left to right (from “start” to “finish”, in narrative terms); each | = 1 hit; the more hits closer together, the darker the line will look.)
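
As a rough illustration of how a length-normalised concordance line works, the sketch below maps hit positions in a text onto the columns of a fixed-width strip. It is a simplification under stated assumptions, not AntConc’s actual rendering: the function names are hypothetical, and the offsets and text length are invented.

```python
def plot_columns(hit_offsets, text_length, width=415):
    """Map character offsets of hits onto pixel columns of a fixed-width plot."""
    columns = set()
    for offset in hit_offsets:
        # Scale each position to 0..width-1 so every play spans the same width.
        columns.add(min(width - 1, offset * width // text_length))
    return sorted(columns)


def render(columns, width=60):
    """Draw a crude one-line plot: '|' for a column with a hit, ' ' elsewhere."""
    marked = set(columns)
    return "".join("|" if c in marked else " " for c in range(width))


# Invented offsets in a hypothetical 10,000-character play.
hits = [120, 150, 4000, 9000, 9050]
columns = plot_columns(hits, 10_000, width=60)
print("[" + render(columns, width=60) + "]")
```

Because positions are scaled by text length, a long play and a short play produce strips of the same width, which is what makes layering them meaningful.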

I turned these screenshots into a very large jpg with the help of an open source image editing program, just to have them all in one document together. The most well-known is probably GIMP but both lifehacker and Oliver Mason offer Seashore as a more mac-friendly alternative to the GIMP.[4]
Then I broke the master document into individual concordance plots, sized 415×47, using Seashore’s really good select-copy-make new document from pasteboard option, which lets you keep and move the select box around the master document, as seen below.

Screen Shot 2015-02-23 at 12.15.10

So far I have only used regular expressions, command-shift-4, copy, paste, save as .jpg, and pen & paper to record what I was doing. Nothing complicated! It took a while, but in the process I got to know these results really well. Not all the plays in the corpus contain all, or in fact, any, of each vocative: in some instances, there are plays that didn’t use any of the above titles, and aren’t included in this output; some plays only use one vocative out of the twelve investigated or any combination of vocatives which do not represent the full twelve.

As a test, I separated out Shakespeare’s plays to see what a bunch of concordance plots looked like in composite. To do this, I opened a terminal and moved to the correct directory, which meant moving through 6 directories. Then I normalised everything to the same size with
convert plot*.jpg -size 415x47! plot*.jpg, just in case.
I put those in a new folder of normalised images.
Then, from the directory of normalised images: convert plot*.jpg -evaluate-sequence mean average_page.jpg.
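
Conceptually, the mean composite in the last step is a pixel-by-pixel average of equally sized images. The sketch below shows that idea in plain Python on tiny invented grayscale "plots"; it is not ImageMagick’s implementation, and its handling of rounding and colour channels may differ. `mean_composite` is a hypothetical name.

```python
def mean_composite(images):
    """Average equally sized grayscale images given as rows of 0-255 integers."""
    height, width = len(images[0]), len(images[0][0])
    for img in images:
        if len(img) != height or any(len(row) != width for row in img):
            raise ValueError("all images must share one size before averaging")
    return [[sum(img[y][x] for img in images) // len(images)
             for x in range(width)]
            for y in range(height)]


# Three invented 1x6 plots: 0 is a black hit line, 255 is white background.
plots = [
    [[0, 255, 255, 0, 255, 255]],
    [[0, 255, 255, 255, 255, 255]],
    [[0, 255, 0, 255, 255, 255]],
]
composite = mean_composite(plots)
print(composite)
```

A column where every plot has a hit stays black, a column where only some do comes out grey, and the size check is the reason for normalising everything to 415x47 first.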

Here’s what 12 vocatives for social class in 37 Shakespeare plays look like in composite:
average shx play_monochrome
There are a few things that I notice in this plot: There’s a quick use of naming vocatives near the beginning of the plays, a relative absence immediately after, but during the rising action and climax there are clear sections which use these vocatives quite heavily, especially in the build-up to the climax. Usage drops in the falling action, until just before the denouement; there is a point where vocatives are used quite consistently heavily, marked by || but surrounded by white on both sides. If you can’t see it, here is the concordance plot again, with that point highlighted in red.

If you repeat the above process for the 332 plays, you get the following composite image. Although the amount of information in some ways obfuscates what you’re trying to see, there are darker and lighter bits to this image.
avg EM drama play_monochrome
Most notably, the rising action has a similar cluster of class-status vocative use at the tail end of the introduction and into the rising action, a relative absence until the climax, and then the use of vocatives for social class seems to pick up towards the falling action and end of the plays. Interestingly, the same kind of || notation is visible towards the conclusion, though it reduplicates itself twice. (Again, if you can’t see it, I’ve highlighted it in red here).

Now to address the details of these plots… But you should also read the follow-up post about this, as well.

tl;dr version:

Can you look at many concordance plots at the same time? Yes.

Do vocatives attached to a name which mark for class status have recognizable patterns in dramatic structure? MAYBE.

[1] Matthew Jockers (1, 2) and Benjamin Schmidt have been doing interesting things with regards to computationally analyzing dramatic structure. I’m not going anywhere near their levels of engagement with dramatic arcs in this post, but they are interesting reads nonetheless. (Followup: Annie Swafford’s blog post on Jockers’ analyses is worth a read as well)

[2] If you particularly enjoy using R to achieve relatively simple tasks like concordance plots, Stefan Gries’ 2009 cookbook Quantitative corpus linguistics with R: a practical introduction and Matthew Jockers’ 2014 cookbook Text Analysis with R for Students of Literature both outline how to do this.

[3] Okay, so this required a few more steps of code, most of which were install scripts which require very little work on the human end beyond following directions of ‘type this, wait for computer to return the input command’. If you are on a mac, you will need to get Xcode to download macports to download ImageMagick, and then X11 to display output. X11 seems optional, especially if you keep your finder window open nearby. Setting all this up took about two hours.

[4] It transpired that I could have done this with ImageMagick using the command
convert -append plots*.png out.png.
Oh well. Seashore also offers layering capabilities for the more graphic design driven amongst you but perhaps more importantly for me, it looks a lot like my dearly beloved MS Paint, a piece of software I’ve been trying to find a suitable replacement for since I joined The Cult of Mac in 2006.

Of time, of numbers and due course of things

[This is the text, more or less, of a paper I presented to the audience of the Scottish Digital Humanities Network’s “Getting Started In Digital Humanities” meeting in Edinburgh on 9 June 2014. You can view my slides here (pdf)]

Computers help me ask questions in ways that are much more difficult to achieve as a reader. This may sound obvious: reading a full corpus of plays, or really any text, takes time, and by the time I have closely read all of them, I will either not have noticed the minutiae of all the texts or not have remembered some of them. Here, for example, is J. O. Halliwell-Phillipps’s The Works of William Shakespeare; the Text Formed from a New Collation of the Early Editions: to which are Added, All the Original Novels and Tales, on which the Plays are Founded; Copious Archæological Annotations on Each Play; an Essay on the Formation of the Text: and a Life of the Poet, which takes up quite a bit of space on a shelf:
IMG_20140604_160953

This isn’t a criticism, nor is it an excuse for not reading; it just means that humans are not designed to remember the minutiae of collections of words. We remember the thematic aboutness of them, but perhaps not always the smaller details. Having closely read all these plays (though not in this particular edition: I have read the Arden editions, which were much more difficult to stick on one imposing looking shelf), I remember what they were about, but perhaps not at the level of minutiae I might want to have. So today I’m going to illustrate how I might go from sixteen volumes of Shakespeare to a highly specific research question, and to do that, I’m going to start with a calculator.

A calculator is admittedly a rather old and rather simple piece of technology; it’s one that is not particularly impressive now that we have cluster servers that can crunch thousands of data points for us, but it remains useful nonetheless. Without using technology which is more advanced than our humble calculator, I’m going to show how the simple task of counting and a little bit of basic arithmetic can raise some really interesting questions. Straightforward counting is starting to get a bit of a bad rap in digital humanities discourse (cf Jockers and Mimno 2013, 3 and Goldstone and Underwood 2014, 3-4): yes, we can count, but that is simple. We can also complicate this process with calculation and get even more exciting results! This is, of course, true, and provides many new insights to texts which were otherwise unobtainable. Eventually today I will get to more advanced calculation, but for now, let’s stay simple and count some things.

Except that counting is not actually all that simple: decisions have to be made about what to count and how to decide what to count, and then how you are going to do that. I happen to be interested in gender, which I think is one of the more quantifiable social identity variables in textual objects, though it certainly isn’t the only one. Let’s say I wanted to find three historically relevant gendered noun binaries for Shakespeare’s corpus. Looking at the historical thesaurus of the OED for historical contexts, I can decide on lord/lady, man/woman, and knave/wench, as they show a range of formalities (higher – neutral – lower) and these terms are arguably semantically equivalent. The first question I would have is “how often do these terms actually appear in 38 Shakespeare plays?”

shx minus node words pie chart

Turns out the answer is “not much”: they are right up there in the little red sliver there. My immediate next question would be “what makes up the rest of this chart?” The obvious answer is, of course, that it covers everything that is not our node words in Shakespeare. However, there are two main categories of words contained therein: the frequency of function words (those tiny boring words that make up much of language) and the frequency of content words (words that make up what each play is about). We have answers, but instantly I have another question: what does the breakdown of that little red sliver look like?

This next chart shows the frequency of both the singular and the plural form of each node word, in total, for all 38 Shakespeare plays. There are two instantly noticeable things in this chart: first, the male terms are far more frequent than the female terms, and second, wench is not used very much (though we may think of wench as being a rather historical term).
individual node word plurals in Shakespeare (full)
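The counting behind a chart like this can be sketched in a few lines of Python. This is only an illustration, not the workflow used for the charts in this post: the tokenizer, the `NODE_WORDS` table and the sample sentence are all assumptions made for the example, and it ignores Early Modern spelling variation entirely.

```python
import re
from collections import Counter

# Node words from the post, each mapped to the surface forms we count.
# The plural forms listed here are an assumption for this sketch.
NODE_WORDS = {
    "lord": ("lord", "lords"),
    "lady": ("lady", "ladies"),
    "man": ("man", "men"),
    "woman": ("woman", "women"),
    "knave": ("knave", "knaves"),
    "wench": ("wench", "wenches"),
}


def tokenize(text):
    """Lowercase the text and split it into runs of letters a-z."""
    return re.findall(r"[a-z]+", text.lower())


def count_node_words(text):
    """Return singular + plural frequency for each node word."""
    counts = Counter(tokenize(text))
    return {
        node: sum(counts[form] for form in forms)
        for node, forms in NODE_WORDS.items()
    }


# Invented sample text, not a quotation from any play.
sample = "My lord, the lords and ladies attend; a man, a woman, and two men."
node_counts = count_node_words(sample)
print(node_counts)
```

Even this toy version forces the decisions described above: what counts as a word, which forms belong to which node word, and what to do with possessives (here `lord's` is split and counted as `lord`).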

There are more male characters than female characters in Shakespeare – by quite a large margin, regardless of how you choose to divide up gender – but surely they are talking about female characters (as they are the driving force of these plays: either a male character wants to marry or kill a female character). This is not to say that male and female characters won’t talk to each other; there just happen to be a lot more male characters. Biber and Burges (2000) have noted that in 19th century plays, male to male talk is more frequent than male to female talk (and female to female talk). I am not going to claim this is true here, but it seems to be a suggestive model, as male characters dominate speech quantities in the plays. There are lots of questions we can keep asking from this point, and I will return to some of them later, but I want to ask a bigger question: how does Shakespeare’s use of these binaries compare to a larger corpus of his contemporaries, 1512-1662?

It is worth noting that this corpus contains 332 plays, even though it is called the 400 play corpus; some things, I suppose, sound better when rounded up. These terms are still countable, though, and we see a rather different graph for this corpus:
400 play corpus full node words frequencies

The 400 play corpus includes Shakespeare, so we are now comparing Shakespeare to himself and 54 other dramatists.[1] The male nouns are noticeably more frequent than the female nouns, which suggests that the proportion of male to female characters we saw in Shakespeare may hold here too. Interestingly, lord is less frequent than man, which is the opposite of what we saw previously. The y axis is different for this graph, as this is a much larger corpus than Shakespeare’s, but it seems like the female nouns are consistent.

One glaring problem with this comparison is that I am looking at two different-sized objects. A corpus of 332 plays is going to be, generally speaking, larger than a corpus of 38 plays.[2] McEnery and Wilson note that comparisons of corpora often require adjustment: “it is necessary in those cases to normalize the data to some proportion […] Proportional statistics are a better approach to presenting frequencies” (2003, 83). When creating proportions, Adam Kilgarriff notes “the thousands or millions cancel out when we do the division, it makes no difference whether we use thousands or millions” (2009, 1), which follows McEnery and Wilson’s assertion that “it is not crucial which option is selected” (2003, 84). For my proportions, I choose parts per million.
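The arithmetic of normalization is small enough to show in full. The sketch below is illustrative only: the function name `normalize` and all four numbers are invented for the example and are not the real counts or corpus sizes behind the graphs in this post.

```python
def normalize(raw_count, corpus_size, base=1_000_000):
    """Scale a raw frequency to a rate per `base` tokens."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return raw_count * base / corpus_size


# Hypothetical figures: a small corpus and a much larger one.
small_pm = normalize(3_000, 800_000)      # hits per million tokens
large_pm = normalize(12_000, 6_000_000)   # hits per million tokens

# The same comparison per thousand tokens instead of per million.
small_pt = normalize(3_000, 800_000, base=1_000)
large_pt = normalize(12_000, 6_000_000, base=1_000)

print(small_pm, large_pm)      # rates per million
print(small_pm / large_pm)     # ratio between the two corpora
print(small_pt / large_pt)     # same ratio, whatever the base
```

In this made-up case the larger corpus has four times as many raw hits, but the smaller corpus uses the word at a higher rate once size is taken into account, and the ratio between the two rates is identical per thousand and per million, which is the point of the passages quoted above.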
Shakespeare from Martin's corpus 12.16.10 and Martin's Corpus, normalized plural node words graphed
Shakespeare is rather massively overusing lord in his plays compared to his contemporaries, but he is also underusing the female nouns compared to contemporaries. Now we have a few research questions to address, all of which are very interesting:

  • Why does Shakespeare use lord so much more than the rest of Early Modern dramatists?
  • Why do the rest of Early Modern dramatists use wench so much more than Shakespeare?
  • Why is lady more frequent than woman overall in both corpora?

I’m not going to be able to answer all of these today, but let’s talk a little bit about lord. This is a pretty noticeable difference for a term which seems pretty typical of Early Modern drama, which is full of noblemen. If I had to guess, I would say that lord might be more frequent in history plays compared to the tragedies or the comedies. I say this because as a reader I know there are most definitely noblemen, and probably defined as such, in these plays.

So what if we remove the histories from Shakespeare’s corpus, count everything up again, and make a new graph comparing Shakespeare minus the histories to all of Shakespeare? By removing the history plays it is possible to see how Shakespeare’s history plays as a unit compare to his comedy & tragedy plays as a unit. [3]
Shakespeare minus histories compared to shakespeare with histories per million
Female nouns fare better in Shakespeare Without Histories than in Shakespeare Overall, possibly because the female characters are more directly involved in the action of tragedies and comedies than they are in histories (though we know the Henry 4 plays are an exception to that), so that is perhaps not all that interesting. What is interesting, though, is the difference between lord in Shakespeare Without Histories and Shakespeare With Histories. What is going on in the histories? How do Shakespeare’s histories compare to all histories in the 400 play corpus?
history plays, shx vs history plays from 400 play corpus
Now we have even more questions, especially “what on earth is going on with lord in Shakespeare” and “why is wench more frequent in all of the histories?” I’m going to leave the wench question for now, though: not because it’s uninteresting but because it is less noticeable compared to what I’ve been motioning at with lord, which is clearly showing some kind of generic variation.

Remember, we haven’t done anything more complex than counting and a little bit of arithmetic yet, and we have already created a number of questions to address. Now we can create an admittedly low-tech visualization of where in the history plays these terms show up: each black line is one instance, and you read these from left to right (‘start’ to ‘finish’):
Screen shot 2014-06-06 at 4.30.36
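A plot like this boils down to recording where in the text each hit falls. Here is a rough text-only sketch of that idea in Python; it is not how the screenshot was produced, and the function names, the crude tokenizer and the toy input are all assumptions for illustration.

```python
import re


def hit_positions(text, word):
    """Relative position (0.0 = start, approaching 1.0 = end) of each hit."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return []
    return [i / len(tokens) for i, tok in enumerate(tokens) if tok == word]


def dispersion_line(positions, width=60):
    """Draw one '|' per occupied cell, reading left to right."""
    cells = [" "] * width
    for pos in positions:
        cells[min(int(pos * width), width - 1)] = "|"
    return "[" + "".join(cells) + "]"


# Toy 'play': the word occurs as the first and the last of ten tokens.
toy_text = "lord " + "filler " * 8 + "lord"
positions = hit_positions(toy_text, "lord")
print(dispersion_line(positions, width=10))
```

Several hits that fall into the same cell collapse into a single mark, so a narrow plot understates dense clusters; a wider `width` trades compactness for detail.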
And now I instantly have more questions (why are there entire sections of plays without lord? Why do they cluster only in what clearly are certain scenes? etc) but what looks most interesting to me is King John, which has the fewest examples. At first glance, King John and Richard 3 appear to be outliers (that is, very noticeably different from the others: 42 instances vs 236 instances). Having read King John, I know that there are definitely nobles in the play: King John, King Philip, the Earls of Salisbury, Pembroke, Essex and the excellently named Lord Bigot. And, again, having read the play I know that it is about the relationships between fathers, mothers and brothers – the play centers around Philip the Bastard’s claim to the throne – and also is about the political relationship (or lack thereof) between France and England. From a reader’s perspective, none of that is particularly thematically unique to this play compared to the rest of Shakespeare’s history plays, though.

I can now test my reader’s perspective using a statistical measure of keyness called log likelihood, which asks which words are more or less likely to appear in an analysis text compared to a larger corpus. This process will provide us with words which are positively and negatively ranked overall with a ranking of statistical significance (more stars means more statistically significant). Now I am asking the computer to compare King John to all of Shakespeare’s histories. I have excluded names from this analysis, as a reader definitely knows hubert arthur robert philip faulconbridge geoffrey are in this play without the help of the computer.
Screen shot 2014-06-03 at 10.20.23
However, you can see that the absence of lord in King John is highly statistically significant (marked with four *s, compared to others with fewer *s). Now, we saw this already with the line plots, though it is nice to know that this is in fact one of the most significant differences between King John and the rest of the histories.
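For anyone who wants to see the calculation rather than the screenshot, here is a minimal sketch of a two-corpus log-likelihood score in the form commonly used for keyness. It is an assumption that the software behind the screenshot computes it exactly this way; the star thresholds are the usual chi-square critical values for one degree of freedom, and the frequencies in the example are invented.

```python
import math

# Assumed star thresholds: chi-square critical values (1 d.f.) for
# p < 0.0001, 0.001, 0.01 and 0.05. A given tool may use others.
STAR_THRESHOLDS = [(15.13, "****"), (10.83, "***"), (6.63, "**"), (3.84, "*")]


def log_likelihood(a, b, c, d):
    """a, b: word frequency in the analysis text and the reference corpus.
    c, d: total token counts of the analysis text and the reference corpus."""
    expected_a = c * (a + b) / (c + d)
    expected_b = d * (a + b) / (c + d)
    total = 0.0
    for observed, expected in ((a, expected_a), (b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            total += observed * math.log(observed / expected)
    return 2 * total


def direction(a, b, c, d):
    """'+' if the word's rate is higher in the analysis text, else '-'."""
    return "+" if a / c > b / d else "-"


def stars(score):
    """Turn a log-likelihood score into a string of significance stars."""
    for threshold, label in STAR_THRESHOLDS:
        if score >= threshold:
            return label
    return ""


# Hypothetical counts: a word that is rare in one play and common elsewhere.
score = log_likelihood(42, 2_000, 20_000, 200_000)
print(direction(42, 2_000, 20_000, 200_000), stars(score))
```

The score itself only says how surprising the difference is; the direction has to be read off separately, which is why keyness lists report words as positively or negatively ranked.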

All of this is nice, and very interesting, as it is something we might not have ever noticed as a reader: because it is a history play with lords in it, it is rather safe to assume that it will contain the word lord more often than it actually does. Revisiting E.A.J. Honigmann’s notes on his Arden edition of King John, there have been contentions about the use of king in the First Folio (2007, xxxiii-xliii), most notably around the confusions surrounding King Lewis, King Philip and King John all labeled as ‘king’ in the Folio (see xxxiv-xxxvii for evidence). But none of this is answering our question about lord’s absence. So what is going on with lord? We can identify patterns with a concordancer, and we get a number of my lords:
Screen shot 2014-06-03 at 10.37.59
This is looking like a fairly frequent construction: we might want to see what other words are likely to appear near lord in Shakespeare overall: is my one of them? As readers, we might not notice how often these two words appear together. I should stress that we still have not answered our initial question about lord in King John, though we are trying to.

We can measure how likely one lemma (word) is to appear next to another lemma (word) in a corpus using the Dice coefficient, which is the harmonic mean of two conditional probabilities: P(w2|w1) and P(w1|w2). That is, it combines the probability that the 2nd word in the bigram appears given the 1st word with the probability that the 1st word appears given the 2nd word, and this relationship can be computed on a scale from 0-1. 0 would mean there is no relationship; 1 means they always appear together. With this information, you can then show which words are uniquely likely to appear near lord in Shakespeare and contrast that to the kinds of words which are uniquely likely to appear next to lady – and again for the other binaries as well. Interestingly, my only shows up with lord!

Screen shot 2014-06-03 at 10.49.51
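The Dice coefficient is usually written as 2 × f(w1 w2) / (f(w1) + f(w2)), which works out to the harmonic mean of P(w2|w1) and P(w1|w2). The sketch below shows that calculation on adjacent word pairs. It is only an illustration: it works on surface forms rather than lemmas, the function names are made up, and the token list is an invented toy example, not data from the plays.

```python
from collections import Counter


def dice(tokens, w1, w2):
    """Dice coefficient for the adjacent bigram (w1, w2)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    total = unigrams[w1] + unigrams[w2]
    return 2 * bigrams[(w1, w2)] / total if total else 0.0


def harmonic_mean_of_conditionals(tokens, w1, w2):
    """The same score, built from P(w2|w1) and P(w1|w2) explicitly."""
    unigrams = Counter(tokens)
    pair = Counter(zip(tokens, tokens[1:]))[(w1, w2)]
    if pair == 0:
        return 0.0
    p_w2_given_w1 = pair / unigrams[w1]
    p_w1_given_w2 = pair / unigrams[w2]
    return 2 * p_w2_given_w1 * p_w1_given_w2 / (p_w2_given_w1 + p_w1_given_w2)


# Invented toy sequence: 'my' occurs 4 times, 'lord' 3 times, 'lady' once.
toy_tokens = "my lord my lord my lady good my lord".split()
my_lord = dice(toy_tokens, "my", "lord")
my_lady = dice(toy_tokens, "my", "lady")
print(round(my_lord, 3), round(my_lady, 3))
```

In the toy data every lord is preceded by my, so that pair scores close to 1, while my lady scores much lower because most instances of my belong to something else.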

This is good, because it shows that lord does indeed appear very differently to our other node words in Shakespeare’s corpus, and suggests that there’s something highly specific going on here with lord. However, I’m still not sure what is happening with lord in King John. Why are there so few instances of it?

Presumably if there is an absence of one word or concept, there will be more of a presence of a second word or concept. One such example might be king, but the log-likelihood analysis shows that it is the word this which is comparatively more frequent in King John than in the rest of Shakespeare’s histories (note the second entry on this list)
Screen shot 2014-06-03 at 10.20.23

Now we have two questions: why is lord so absent, and why is this so present? From here I might go back to our concordance plot visualizations, but this is addressable at the level of grammar: this is a demonstrative pronoun, which Jonathan Hope defines in Shakespeare’s Grammar as “distinguish[ing] number (this/these) and distance (this/these = close; that/those = distant). Distance may be spatial or temporal (for example ‘these days’ and ‘those days’)” (Hope 2003, 24). Now we have a much more nuanced question to address, which a reader would never have noticed: Does King John use abstract, demonstrative pronouns to make up for a lack of the concrete content word lord in the play? I admit I have no idea: does anybody else know?


Halliwell-Phillipps, J.O. (1970 [1854]). The works of William Shakespeare, the text formed from a new collation of the early editions: to which are added all the original novels and tales on which the plays are founded; copious archæological annotations on each play; an essay on the formation of the text; and a life of the poet. New York: AMS Press.

“Early English Books Online: Text Creation Partnership”. Available online: http://quod.lib.umich.edu/e/eebogroup/ and http://www.proquest.com/products-services/eebo.html.

“Early English Books Online: Text Creation Partnership”. Text Creation Partnership. Available online: http://www.textcreationpartnership.org/

Anthony, L. (2012). AntConc (3.3.5m) [Computer Software]. Tokyo, Japan: Waseda University. Available from http://www.antlab.sci.waseda.ac.jp/

Biber, Douglas, and Jená Burges. (2000) “Historical Change in the Language Use of Women and Men: Gender Differences in Dramatic Dialogue”. Journal of English Linguistics 28 (1): 21-37.

DEEP: Database of Early English Playbooks. Ed. Alan B. Farmer and Zachary Lesser. Created 2007. Accessed 4 June 2014. Available online: http://deep.sas.upenn.edu.

Froehlich, Heather. (2013) “How many female characters are there in Shakespeare?” Heather Froehlich. 8 February 2013. https://hfroehlich.wordpress.com/2013/02/08/how-many-female-characters-are-there-in-shakespeare/

Froehlich, Heather. (2013). “How much do female characters in Shakespeare actually say?” Heather Froehlich. 19 February 2013. https://hfroehlich.wordpress.com/2013/02/19/how-much-do-female-characters-in-shakespeare-actually-say/

Froehlich, Heather. (2013). “The 400 play corpus (1512-1662)”. Available online: http://db.tt/ZpHCIePB [.csv file]

Goldstone, Andrew, and Ted Underwood. “The Quiet Transformations of Literary Studies: What Thirteen Thousand Scholars Could Tell Us.” New Literary History, forthcoming.

Hope, Jonathan. (2003). Shakespeare’s Grammar. The Arden Shakespeare. London: Thomson Learning.

Jockers, M.L. and Mimno, D. (2013). Significant themes in 19th-century literature. Poetics. http://dx.doi.org/10.1016/j.poetic.2013.08.005

Kay, Christian, Jane Roberts, Michael Samuels, and Irené Wotherspoon (eds.). (2014) The Historical Thesaurus of English. Glasgow: University of Glasgow. http://historicalthesaurus.arts.gla.ac.uk/.

Kilgarriff, Adam. (2009). “Simple Maths for Keywords”. Proceedings of the Corpus Linguistics Conference 2009, University of Liverpool. Ed. Michaela Mahlberg, Victorina González Díaz, and Catherine Smith. Article 171. Available online: http://ucrel.lancs.ac.uk/publications/CL2009/#papers

McEnery, Tony and Wilson, Andrew. (2003). Corpus Linguistics: An Introduction. Edinburgh: Edinburgh University Press, 2nd Edition. 81-83

Mueller, Martin. WordHoard. [Computer Software]. Evanston, Illinois: Northwestern University. http://wordhoard.northwestern.edu/

Shakespeare, William. (2007). King John. Ed. E. A. J. Honigmann. London: Arden Shakespeare / Cengage Learning.

[1] Please see http://db.tt/ZpHCIePB [.csv file] for the details of contents in the corpus.

[2] This is not always necessarily true: counting texts does not say anything about how big the corpus is! A lot of very short texts may actually be the same size as a very small corpus containing a few very long texts.

[3] The generic decisions described in this essay have been lifted from DEEP and applied by Martin Mueller at Northwestern University. I am very slowly compiling an update to these generic distinctions from DEEP, which uses Annals of English Drama, 975-1700, 3rd edition, ed. Alfred Harbage, Samuel Schoenbaum, and Sylvia Stoler Wagonheim (London: Routledge, 1989) as its source to Martin Wiggins’ more recent British Drama: A Catalog, volumes 1-3 (Oxford: Oxford UP, 2013a, 2013b, 2013c) for further comparison.

An introductory bibliography to corpus linguistics

This is a short bibliography meant to get you started in corpus linguistics – it is by no means comprehensive, but should serve to be a good introductory overview of the field.

>>This page is updated semi-regularly; if you find any dead links please contact me at hgf5 at psu dot edu. Thanks!<<

1.0 General resources
Froehlich, H. “Intro to Text Analysis”. Penn State University Library Guides. (30 May 2018), http://guides.libraries.psu.edu/textanalysis
Froehlich, H. “Text mining: Web-based resources”. Penn State University Library Guides. (10 October 2018) https://guides.libraries.psu.edu/textmining/web

1.1 Books (and two articles)
Baker, Paul, Andrew Hardie and Tony McEnery. (2006). A Glossary of Corpus Linguistics. Edinburgh, Edinburgh UP.
Atkins, Sue, Jeremy Clear and Nicholas Ostler. (1991) “Corpus Design Criteria”. http://www.natcorp.ox.ac.uk/archive/vault/tgaw02.pdf
Biber, Douglas (1993). “Representativeness in Corpus Design”. Literary and Linguistic Computing, 8 (4): 243-257. http://llc.oxfordjournals.org/content/8/4/243.abstract
Biber, Douglas, Susan Conrad and Randi Reppen (1998). Corpus Linguistics: Investigating Language Structure and Use. Cambridge: Cambridge UP.
Granger, Sylviane, Joseph Hung and Stephanie Petch-Tyson. (2002). Computer Learner Corpora, Second Language Acquisition and Foreign Language Teaching. Amsterdam: John Benjamins.
Hoey, Michael, Michael Stubbs, Michaela Mahlberg, and Wolfgang Teubert. (2011). Text, Discourse and Corpora. London: Continuum.
Hunston, S. (2002). Corpora in applied linguistics. Cambridge: Cambridge University Press.
Mahlberg, Michaela. (2013). Corpus Stylistics and Dickens’ Fiction. London: Routledge.
McEnery, T. and Hardie, A. (2012). Corpus Linguistics: Method, theory and practice. Cambridge: Cambridge UP.
O’Keeffe, Anne and Michael McCarthy, eds. (2010). The Routledge Handbook of Corpus Linguistics. London: Routledge.
Sinclair, John and Ronald Carter. (2004). Trust the Text. London: Routledge.
Sinclair, John. (1991) Corpus, Concordance, Collocation. Oxford: Oxford UP.
Wynne, M (ed.) (2005). Developing Linguistic Corpora: a Guide to Good Practice. Oxford: Oxbow Books. http://www.ahds.ac.uk/creating/guides/linguistic-corpora/

1.2 Scholarly Journals
Corpora http://www.euppublishing.com/journal/cor
ICAME http://icame.uib.no/journal.html
IJCL https://benjamins.com/#catalog/journals/ijcl
Literary and Linguistic Computing http://llc.oxfordjournals.org/

1.3 Externally compiled bibliographies and resources
David Lee’s Bookmarks for corpus-based linguistics http://www.uow.edu.au/~dlee/CBLLinks.htm
Costas Gabrielatos has been compiling a bibliography of Critical Discourse Analysis using corpora, 1982-present https://www.edgehill.ac.uk/englishhistorycreativewriting/staff/dr-costas-gabrielatos/?tab=docs-bibliography
Members of the corpus linguistics working group UCREL at Lancaster University have compiled some of their many publications here http://ucrel.lancs.ac.uk/pubs.html; see also their LINKS page http://ucrel.lancs.ac.uk/links.html
Michaela Mahlberg is one of the leading figures in corpus stylistics (especially of interest if you want to work on literary texts) http://www.michaelamahlberg.com/publications.shtml; in 2006 she helped compile a corpus stylistics bibliography (pdf) with Martin Wynne.
Lots of work is done on Second Language Acquisition using learner corpora. Here’s a compendium of learner corpora http://www.uclouvain.be/en-cecl-lcworld.html

Corpora-List (mailing list) http://torvald.aksis.uib.no/corpora/
CorpusMOOC https://www.futurelearn.com/courses/corpus-linguistics, run out of Lancaster University, is an amazingly thorough resource. Even if you can’t do everything in their course, there’s lots of step-by-step how-tos, videos, notes, readings, and help available for everyone from experts to absolute beginners.

1.4 Compiled Corpora
Xiao, Z. (2009). Well-Known and Influential Corpora, A Survey http://www.lancaster.ac.uk/staff/xiaoz/papers/corpus%20survey.htm, based on Xiao (2009), “Theory-driven corpus research: using corpora to inform aspect theory”. In A. Lüdeling & M. Kytö (eds) Corpus Linguistics: An International Handbook [Volume 2]. Berlin: Mouton de Gruyter. 987-1007.
Various Historical Corpora http://www.helsinki.fi/varieng/CoRD/corpora/index.html
Oxford Text Archive http://ota.ahds.ac.uk/
Linguistic Data Consortium http://catalog.ldc.upenn.edu/
CQPWeb, a front end to various corpora https://cqpweb.lancs.ac.uk/
BYU Corpora http://corpus.byu.edu/
NLTK Corpora http://nltk.googlecode.com/svn/trunk/nltk_data/index.xml

1.5 DIY Corpora (some work required)
Project Gutenberg http://gutenberg.org
LexisNexis Newspapers https://www.lexisnexis.com/uk/nexis/
LexisNexis Law https://www.lexisnexis.com/uk/legal
BBC Script Library http://www.bbc.co.uk/writersroom/scripts

1.6 Concordance and Other software
No one piece of software is better than another, though some are better at certain things than others. Much here comes down to personal taste, much like Firefox vs Chrome or Android vs iPhone. While AntConc, which is what I use, is great, it is far from the only software available. (Note that these may require a licensing fee.)
AntConc http://www.laurenceanthony.net/software/antconc/
Wordsmith http://lexically.net/
Monoconc http://www.monoconc.com/
CasualConc https://sites.google.com/site/casualconc/
Wmatrix http://ucrel.lancs.ac.uk/wmatrix/
SketchEngine http://www.sketchengine.co.uk/
R http://www.rstudio.com/ide/docs/using/source (for the advanced user)
200+ software resources for corpus analysis https://corpus-analysis.com/
Anthony, Laurence. (2013). “A critical look at software tools in corpus linguistics.” Linguistic Research 30(2), 141-161.

1.7 Annotation
You may want to annotate your corpus for certain features, such as author, location, specific discourse markers, parts of speech, transcription, etc. Some of the compiled corpora might come with included annotation.
Text Encoding Initiative http://www.tei-c.org/index.xml
A Gentle Introduction to XML http://www.tei-c.org/release/doc/tei-p5-doc/en/html/SG.html
Hardie, A (2014) “Modest XML for Corpora: Not a standard, but a suggestion”. ICAME Journal 38: 73-103.
UAM Corpus Tool does both concordance work and annotation http://www.wagsoft.com/CorpusTool/

1.7.1 Linguistic Annotation
Natural Language Toolkit http://nltk.org & the NLTK book http://www.nltk.org/book/ch01.html
Stanford NLP Parser http://nlp.stanford.edu/software/corenlp.shtml (includes Named Entity Recognition, semantic parser, and grammatical part-of-speech tagging)
CLAWS, a part of speech tagger http://ucrel.lancs.ac.uk/claws/
USAS, a semantic tagger http://ucrel.lancs.ac.uk/usas/

1.8 Statistics Help

1.8.1 Not Advanced
Wikipedia http://wikipedia.com (great for advanced concepts written for the non-mathy type)
Log Likelihood, explained http://ucrel.lancs.ac.uk/llwizard.html
AntConc Videos https://www.youtube.com/user/AntlabJPN
WordSmith Getting Started Files http://www.lexically.net/downloads/version6/HTML/index.html?getting_started.htm
Oakes, M. (1998): Statistics for Corpus Linguistics. Edinburgh: Edinburgh University Press.
Baroni, M. and S. Evert. (2009): “Statistical methods for corpus exploitation”, in A. Lüdeling and M. Kytö (eds.), Corpus Linguistics: An International Handbook Vol. 2. Berlin: de Gruyter. 777-803.

1.8.2 Advanced
Stefan Th. Gries’ publications: http://www.linguistics.ucsb.edu/faculty/stgries/research/overview-research.html
Adam Kilgarriff’s publications: Pre-2009 http://www.kilgarriff.co.uk/publications.htm Post-2009 https://www.sketchengine.co.uk/documentation/wiki/AK/Papers
Baayen, R.H. (2008). Analyzing Linguistic Data: A Practical Introduction to Statistics Using R. Cambridge: Cambridge University Press.

heather froehlich // last updated 13 June 2019

Things I Learned From Co-Convening An Interdisciplinary, Interyear Course

I’ve written a bit about Textlab before (here & here), but as my third time teaching it winds down, I’ve been thinking a lot about the various challenges that come with this course. At the risk of repeating myself, this course is sponsored by the Vertically-Integrated Projects initiative at my university, and is one of several (currently 8, I think) projects running. This structure has been borrowed from the Georgia Institute of Technology and has been running at Strathclyde for the past three years.

Textlab was one of the first round of projects to be launched, building off work with Visualizing English Print. Jonathan Hope, Anouk Lang, George Weir, Richard Jason Whitt and I co-designed a course for English studies students and computer science students to collaboratively work as a microcosm of the Visualizing English Print project over the course of a term. Admittedly, this class is kind of a strange beast for all involved, including us.

Over the past three years, some things have changed quite a lot from our initial course plan and some things haven’t changed at all. Some challenges remain challenges and some kinks have been ironed out, but here’s some things I learned from co-convening this course:

    Interdisciplinary work is best done when there is an infrastructure for it. We don’t currently have a general education structure in place for undergraduates. Though I suspect the sciences have more cross-faculty students than English and Computer Science tend to, this turns registration, ensuring all students get the appropriate number of credits, and keeping everyone who teaches affected courses aware of our system into a bit of a nightmare. Furthermore, depending on which department is leading the course (this year it’s CIS; previously it’s been English), at least half of the students on the course are lost at sea because the other department doesn’t know what to do with them or what their course requirements are.
    It’s my understanding that the university is trying to hire someone whose entire job would be doing administration for the Vertically Integrated Projects and taking care of these tasks, but in the meantime, there is no structure and that’s not great for the students (or us).
    Too many modalities are overwhelming. Giving students lots of small tasks to do, like blogging, tweeting, recording, writing, and reading, is great because it requires time management skills and genreblending. But eventually that’s just too much to grapple with. This is a difficult class to begin with! I’m not sure how useful it is for students to have to write performatively for the public alongside getting to grips with new software, new methods, and new approaches to something they are only semifamiliar with AT BEST. This is not to say my students are stupid – quite the opposite – but that giving them too much new stuff to do is just really overwhelming. How do you know what to focus energies on when there’s so much to address? Over time, we’ve scaled back a lot in our course expectations, which is something I think is really good. It gives the students scope to focus on their work, and once they feel they have a grasp on it, THEN they can start disseminating in really exciting and productive ways.
    Speaking of Structure… When we started Textlab, we gave the students minimal instruction but lots of software, with the model of the VIPs being that “students are smart. Give them the tools, and set them off; they’ll do great things.” Our students did, and continue to do, great things – but the minimal instruction is extremely frustrating for the students. We (I hope) make it clear that this is a research-driven course from the beginning, and that they are expected to be self-driven and relatively autonomous. We’re there to help, obviously, but it’s not really our project, it’s theirs, and they have to make their own decisions about what is useful and what is not.
    Similarly: Too many cooks spoil a dish vs a pot without water will never boil. We very quickly realized in the first year of Textlab that having upwards of 5 people roaming around trying to help wasn’t really helpful, because everyone would make very different suggestions and the students were totally lost when it came to figuring out what to address first. To rectify this in the second iteration of Textlab we instituted a “the less we say the better” policy: we showed them software and told them to figure out what to do with it. There are pros and cons to both pedagogical approaches, but as a middle ground this year we tried worksheets with guided exercises to introduce the students to the software. From my perspective as a teacher this seemed pretty successful as long as at least one of the staff members was on hand to address questions, which requires the students to feel comfortable asking us for help.
    Different departments do coursework differently. This comes as no surprise (I hope). Computer science labs generally involve a task or two to solve in a given lab time, but you can do it on your own time if you wish, as long as you practice and understand the concepts discussed. Contrast this with English studies discussion sections, which are often led by someone, very often require the full allocated time, and are less about the hands-on, practical application of a concept and more of an abstract unpacking of a concept or text.
    Combining these two approaches in a lab setting is a challenge: how do you get the students to reach a middle ground? I’m not sure we ever really solved this problem: we can’t make students stay in a classroom, especially not when it’s research-driven and we can’t quite tell them they’re not required to show up as long as they do the work. Our solution was to put students into small groups, assign them a play to work on, and offer them lab space once a week for 2-3 hours to let them work out what needs to be done from there with available staff support. A lot of this course is achieved outside of class, which requires everyone’s schedules to mesh at least a little bit. Even trying to get a class time that didn’t clash with both faculties’ timetables was difficult.
    Groupwork: still not a lot of fun. Look, I hated groupwork as an undergraduate, and I totally cannot blame anyone for not liking it: suddenly your success in a course is dependent on one to four other people who are not you. This sucks. To account for this we had students fill out a sheet where they would independently report on their group’s participation. If a five-person group had labor split evenly, great, they’d get the same mark for participation. If everyone agreed that someone wasn’t pulling their weight, then the participation mark for that person would go down while everyone else’s would go up. Etc.
    Different departments have different kinds of students enrolled. Again, I hope this is a no duh moment, but the kind of student who wants a computer science degree is going to have different interests than a student who wants to get a degree in English. In the first two weeks of Textlab this year, two separate computer science students approached me to say they had done extracurricular tech work: one taught themselves Python and was concerned that Unix was too far removed from their needs whereas another had installed Ubuntu on their computer to feel more comfortable using the bash shell. Maybe I’m not a good English teacher (this is likely) but no student has ever come up to me after teaching a Milton poem to say “I liked that so much I read all of Paradise Lost for fun”. Again, maybe (probably) this is my fault, but I also suspect that the oh factor for humanities students comes much later, after the class is done: “I really liked that class and now I want to know more.”
    The beast factor makes this totally worth it. Our students learn a ton in a very short period of time and can talk in depth about very complex concepts including linguistics and statistics without having to ever make it explicit that’s what’s going on. The technologically-shy leave a lot more confident in their abilities and the technologically-savvy learn about a practical application of skills on a real-life project. Each subset of students gets to learn from the other – and it’s not just content but also communication skills and time management. But most importantly, each student who leaves this class has achieved something very significant, and has something amazing to put on a CV: imagine asking them to explain something they found challenging in a job interview!

On Teaching Literature to Computer Science Students

[Previously: On Teaching Coding to English Studies Students]

Recently I wrote about English studies students learning to code in an interdisciplinary computer science and English class. In that post I mentioned that this class (running for the third consecutive year) comes with a variety of challenges – some strictly institutional, some cross-departmental, and some pedagogical. I’ve been collecting a number of these and will be blogging about them in the future. In that post I also mentioned that there are two very oppositional learning curves at play: one is getting the English studies students to think about computers in a critical way and the other is getting the computer science students to read literature.

We have just hit the very exciting point in the course where the computer science students are learning to read and the English studies students are truly hitting their stride, which is a dramatic turnaround from the start of the course. Last week we asked each group to give a very short, informal presentation about their assigned Shakespeare plays in relation to the rest of the Shakespeare corpus. It was no surprise that in every group the English studies students gave an overview of the plot of their play and a few key themes whereas the computer science students reported what they had deemed to be a finding. English studies students have been studying how to analyze texts and the computer science students haven’t done that in the same way.

Here in Scotland, students begin to track either towards arts & humanities or science long before they hit university – they start to track in high school, and take school leaving exams in a number of subjects (“Highers”), from a rather long list which you can read here. Once you choose your track, it’s rare (though not unheard of) to have much overlap between A&H and science in one’s Highers qualifications. Most degree programs will have preferred subjects for applicants which guide students’ decisions about which Highers to take; Strathclyde’s entry requirements for a student wishing to be a Computer Science undergraduate can be found here (pdf). Unless you’re going into a joint honours in Computer Science and Law, English literature or language is not a required Higher for prospective students in computer science at my university. It goes the other way, too – a Higher in Maths is not a requirement for a prospective English studies student unless they plan on taking a joint honours with Mathematics (again, pdf). Some students may take Highers strictly out of interest (or uncertainty about which route to take), but like the SAT II or AP exams, this is not necessarily something you’d do for fun – these are high-stakes exams.

If faced with that choice I would definitely have taken the Arts & Humanities track despite liking science and being (told I was) very bad at math. I suspect that a lot of computer science students may have liked history or media studies but were bad at writing (or told they were…) and that was enough to turn them away from taking an Arts & Humanities track. It might be that students who sign up for a degree in Computer Science are just really passionate about computers, or that they like practical problem solving, or they’ve been told that computer science is a lucrative field. I have no idea – I’m not them*. On the surface, it can look like they have a lot of missing cultural information for not knowing things about literature – but they also know a whole lot more than we do about very different things.

That’s not to say these students don’t read of their own accord or aren’t interested in books. However, by the time they show up to my class, they have rather successfully avoided close reading for a few years, whereas English studies students have been practicing this skill for a while now. This is something English studies students are very comfortable with, and they have now learned enough about the way that computers “think” (or lack thereof) to reach a common ground with the CS students.

However, in the same way the computer science students found the first half of the class easy and the English studies students found it tremendously daunting, the computer science students suddenly feel like they’ve been thrown in the deep end. The English studies students have to teach them how to analyze a text.

In our in-class presentations, each group had to discuss a discovery they’ve made about their play and explain why it was interesting. Without fail, the computer science students had lots to say about the discoveries they had made about kinds of words that were more or less frequent in their play compared to all of Shakespeare’s plays. And yet when they were pressed about why they thought it was happening, they weren’t really sure. The English studies students could postulate theories about why there was more or less of a specific kind of feature in their text, because they know how to approach this problem.
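
For readers curious what a finding like "this kind of word is more frequent in my play than in the rest of Shakespeare" looks like underneath, here is a minimal sketch in Python. It is not meant to represent the tools used in the course: the function names, the two toy texts and the smoothing constant are all assumptions made up for illustration. It compares how often a word occurs per token in one text against a reference text and reports the difference as a log ratio, where a positive score means the word is relatively more frequent in the play.

```python
import math
import re


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def log_ratio(word, play_tokens, reference_tokens, smoothing=0.5):
    """Compare the relative frequency of a word in a play against a reference.

    Returns log2(play rate / reference rate). The smoothing constant
    (an assumption here) keeps the score finite when a word is missing
    from one of the two texts.
    """
    play_rate = (play_tokens.count(word) + smoothing) / len(play_tokens)
    reference_rate = (reference_tokens.count(word) + smoothing) / len(reference_tokens)
    return math.log2(play_rate / reference_rate)


# Toy stand-ins for "my play" and "everything else"; real input would be
# the full text of a play and the full text of the comparison corpus.
play = tokenize("Blood will have blood, they say; blood!")
reference = tokenize("Love is not love which alters when it alteration finds")

for word in ("blood", "love"):
    print(word, round(log_ratio(word, play, reference), 2))
```

The score says that a word is over- or under-represented; it says nothing about why, which is exactly the part that needs someone to go back and read the play.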

Over the next few weeks we’re letting the students guide their own projects and produce explanations for their discoveries, which means the computer science students are on a crash course on close-reading from their in-group local expert. They’re learning that data isn’t everything when it comes to understanding what makes their play in some way different from (or similar to) other plays. In fact, they’re learning the limitations of data and ways that close-reading is not just supplementary but essential to a model of distant reading with computational methods. And if the student presentations I saw last week are any indication, I suspect I have some very exciting work coming my way in a few weeks’ time.

* I did my dual-major undergrad degree in English lit and Linguistics; I didn’t get involved in computers until my masters.

(with thanks to Kat Gupta for comments on this post)

CEECing new directions with Digital Humanities

[editor’s note: this post is cross-posted to Kielen kannoilla, the VARIENG blog. I am extremely grateful to Anni Sairio and Tanja Säily for all their organization behind making this visit, and thus this blog post, possible.]

This past week I was talking about the relationships between corpus linguistics and digital humanities as a visiting scholar at VARIENG, a very well known historical sociolinguistics and corpus linguistics working group. Corpus linguistics is a very text-oriented approach to language data, with much interest in curation, collection, annotation, and analysis – all things of much concern to digital humanists. If corpus linguistics is primarily concerned with text, digital humanities can be argued to be primarily concerned with images: how to visualize textual information in a way that helps the user understand and interact with large data sets.

VARIENG has been compiling the Corpus of Early English Correspondence (CEEC) for a number of years, and one of their primary concerns is ‘what else can we do with all this metadata we’ve created’? Together, we discussed three main themes of corpus linguistics and digital humanities: access, ability, and the role of supplementary vs created knowledge. Digital humanities runs on a form of knowledge exchange, but this raises questions of who knows what, how they know it, and how to access that knowledge.

Approaching a computer scientist with a bunch of historical letters may raise some “so what” eyebrows, but likewise, a computer scientist approaching a linguist with a software package to pull out lexical relationships might raise similar “so what” eyebrows: why should we care about your work and what can we do with it? Because both groups walk in with very different kinds of expertise, one of the very big challenges of digital work is to be able to reach a common language between the disciplines: both have very established, very theoretically-embedded systems of working.

All of this is to say that the takeaway factor for corpus linguistics research, and indeed any kind of digitally-inflected project, is very high. As Matti Rissanen says, and rightly so, “research begins when counting ends”. The so-what factor of counting requires heavy contextualization, human brainpower, time, funding, systems and communication – and none of these features are unique to corpus linguistics. Digitally-inflected scholarship requires complementary expertise in techniques, working and interacting with data; we need humanistic questions which can be pushed further with digital methods, not digital methods which (we hope) will push humanistic questions further. While it is nice to show what we already understand by condensing lots of information into a pretty picture, there are deeper questions to ask. If digital humanities currently serves mostly to supplement knowledge, rather than create new knowledge, we need to start thinking forward to ask “What else can we do with this data we’ve been curating?”

One thing we can do with this data is view it in new tools and learn to ask different questions, as we did with Docuscope, rhetorical analysis software developed at Carnegie Mellon University.

CEEC sampler of 18thC gentry professionals, as seen in Docuscope's Multiple Text Viewer

F_1720-39.txt, as seen in Docuscope's Single Text Viewer

Digital tools and techniques are question-making machines, not answer-providing packages. Here we may ask ourselves why F_1720-39.txt has a low count of Personal Pronouns in Docuscope, and the answer may be that what we consider to be personal pronouns (grammatically) are categorized otherwise by Docuscope and that other constructions are used instead. This isn’t magic and this can’t be quiet handwaving: we should be pushing ourselves towards asking questions which were previously impossible at the scale of sentence-level or lexical-level of detail, because suddenly we can.

Slides from last week’s workshops (right-click to save as pdf files):