How I use Twitter as an academic

An enormous amount of academic life happens in digital spaces these days. The microblogging service Twitter, which has been around since 2007, had all but replaced the academic listserv by 2017. Despite the various ways that Twitter continues to struggle with content and user management, it has perhaps accidentally become a widely used professional space for academics to network, exchange ideas, and collaborate. When Twitter works like this, it is brilliant. When Twitter does not work like this, it becomes an incredibly dangerous place for people who are already in precarious positions (for any number of reasons: rank, social identity or identities, job market status, any intersection of the above, etc).

It is becoming increasingly clear to me that academics in general – especially early-stage graduate students – are in desperate need of training, support, & guidance on professional social media spaces. Although more senior scholars are often on social media, they are also secure enough in their positions and academic life in general that their experiences of social media and social networking can be very different from those of their graduate students.

I’ve been on Twitter since 2010, and I have seen this play out more than a few times, including as a graduate student myself. In these seven years I have maintained what I hope is a very professional profile and I have accidentally amassed a rather large following (in the 1000s). I would not go so far as to say that I am Internet Famous but certainly it is rare I walk into a room now and I don’t know someone there. I try to be very modest about my internet life but I also recognize that is quite difficult when I occupy this space.

People have asked me for years about how I do this. I tweeted yesterday about how I manage my own social media presence, which unexpectedly got a lot of interest. I thought it might be good to have these up more permanently for reference, as the very nature of Twitter is ephemeral (except for when it’s not). I’ve kept these in mostly 140ch-bites, because brevity is often better than verbosity.

1. Don’t tweet anything that I would not want to see associated with me/my name/my likeness in the international news.

2. Mute people you have to follow for whatever political reason but actively dislike. (it happens.) They don’t know you have them on mute.

3. Mute words that you don’t need to see. I’m not a basketball fan so I have “March Madness” on mute. For example.
I use Tweetdeck to mute individual words. This link explains how you do that on Regular Twitter

4. Everything you say on Twitter is public and reaches lots of people you don’t know.

5. 140 characters (or 280, depending on who you are now) is very, very flattening. Assume the recipient will have the worst interpretation.

6. Twitter is a space for networking and making friends, but also your seniors are watching you. They will write your letters of rec one day. (If you are up for tenure, whatever, they are writing those letters too.)

7. Yelling about politics does not make you a better person, but it does make you feel part of a larger culture of dissatisfaction. If that makes you feel better, good for you. It can be more performative than anything else.

8. There is an art to being quiet about some things. This one is hard and takes practice.

9. You like the thing you study, so tweet about what you are doing. Be generous about what you know.

10. Give yourself regular days away from Twitter. People will still be there when you come back. Go outside, watch a film, have a life.


Heather’s 3 rules of doing digital scholarship

In my new job as Digital Scholarship Fellow in Quantitative Text Analysis, I’m starting to work with students and faculty in the Liberal Arts on the hows and ways of counting words in lots of texts. This is a lot of fun, especially because I get to hear a lot about what excites different scholars from across different disciplines and what they think is fascinating about the data they work with (or want to work with).

One thing that is kind of strange about my job – and there are several aspects of my job that have required some adjustment – is that my background is broadly in corpus linguistics and English literature, so I don’t always think of the work I do as being explicitly “DH”. These distinctions are quite frankly boring unless you are knee-deep in the digital humanities, and even then I am not convinced it is an interesting discussion to have. Ultimately, people have lots of preconceived notions about what DH is and why it matters. I suspect that different disciplines within the Humanities writ large have different ideas about this too – certainly the major disciplines I cut across (English, linguistics, history, computer science, sociology) have very different perspectives on the value and experience of digital scholarship. And of course, doing “digital” work in the humanities is kind of redundant anyway: we don’t talk about “computer physics” or “test tube chemistry”, as Mike Witmore and others have pointed out.

Being mindful of this, I have acquired a few rules for doing digital scholarship over the years, and I find myself saying them a lot these days. They are as follows:

1. “Can you do X” is not a research question.

The answer to “can you do X” is almost always “yes”. Should you do X? That’s another story. Can you observe every time Dickens uses the word ‘poor’? Of course you can. But what does it tell you about poverty in Dickens’ novels? Without more detail, this just tells you that Dickens uses the word ‘poor’ in his books about the working class in 19th century Britain — and you almost certainly didn’t need a computer to tell you that. But should you observe every time Dickens uses the word ‘poor’? Maybe, if it means he uses this over other synonyms for the same concept, or if it tells us something about how characters construct themselves as working-class, or if it tells us how higher status characters understand lower-status individuals, or whatever else. These are all research questions which require further investigation, and tell us something new about Dickens’ writing.
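To make the point concrete, here is a minimal sketch of just how easy the "can you do X" part is. The function name and the sample sentence are invented for illustration (the sentence is not a quotation from Dickens), and the tokenisation is deliberately crude.

```python
import re
from collections import Counter


def count_word(text: str, target: str) -> int:
    """Count whole-word, case-insensitive occurrences of target in text."""
    # Crude tokenisation: runs of letters and apostrophes, lowercased.
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)[target.lower()]


# Hypothetical sample text, made up for this example.
sample = "The poor man was poor in purse but not poorly loved; Poor fellow!"
print(count_word(sample, "poor"))  # "poorly" is a different token and is not counted
```

The number this prints is a count, not a finding: it only becomes interesting once it is attached to a research question like the ones above.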

2. Programming and other computational approaches are not findings.

So you have learned to execute a bunch of scripts (or write your own scripts) to identify something about your object of study! That’s great. Especially if you are in the humanities, this requires a certain kind of mind-bending that requires you to think about logic structures, understand how computers process information we provide, and in some cases overcome the deeply irregular rules which make your computer language of choice work. This is hard to do! You deserve a lot of commendation for having figured out how to do all of this, especially if your background is not in STEM. But – and this is hard to hear – having done this is not specifically a scholarly endeavour. This is a methodological approach. It is a means to an end, not a fact about the object(s) under investigation, and most importantly, it is not a finding. This is intrinsically tied to point #1: Should you use this package or program or script to do something? Maybe, but then you have to be ready to explain why this matters to someone who does not care about programming or computers, but cares very deeply about whatever the object of investigation is.

3. Get your digital work in front of people who want to use its findings.

Digitally inflected people already know you can use computers to do whatever it is you’re doing. It may be very exciting to them to learn that this particular software package or quantitative metric exists and should be used for this exact task, but unless they also care about your specific research question, there is a limited benefit for them beyond “oh, we should use that too”. However, if you tell a bunch of people in your specific field something very new that they couldn’t have seen without your work, that is very exciting! And that encourages new scholarship that explores these new issues among the people your findings matter most to. You can tell all the other digital people about the work you’ve done as much as you want, but if your disciplinary base isn’t aware of it, they can’t cite it, they can’t expand on your research, and the discipline as a whole can’t move forward with this fact. Why wouldn’t you want that?

What’s a “book” in Early English Books Online?

Recently I have been employed by the Visualising English Print project, where one of the things we are doing is looking at improving the machine-readability of the TCP texts. My colleague Deidre has already released a plain-text VARDed version of the TCP corpus, but it is our hope to further improve the machine-readability of these texts.

One of the issues that came up in modernising and using the TCP texts has to do with non-English print. It has been previously documented that there are several non-English languages in EEBO – including Latin, Greek, Dutch, French, Welsh, German, Hebrew, Turkish and Algonquin. Our primary issue is that if a transcription in the corpus is not in English, it will be very difficult for an English-language text parser or word vector model to account for this material.

So our solution has been to isolate the texts which are printed in a non-English language, either monolingually (e.g. a book in Latin) or a bi- or tri-lingual text (e.g. Dutch/English book, with a Latin foreword). Looking at EEBO-the-books is a helpful way to identify languages in print, as there are all sorts of printed cues to suggest linguistic variation, such as different fonts or italics to set a different language off from the primary language. It also means I get a chance to look at many of these non-English texts as they were printed and transcribed initially.
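As an aside, a first pass at flagging candidate texts could also be automated. The sketch below is not the method described above, which relies on looking at the printed books. It is a rough heuristic, and the function names, the short word list, and the threshold are all assumptions made for illustration. It measures what share of a transcription's tokens are very common English function words and flags anything below the threshold for a human to check.

```python
import re

# A tiny, hand-picked list of common English function words (an assumption;
# a usable list would be longer and would cover Early Modern spellings).
ENGLISH_FUNCTION_WORDS = {
    "the", "and", "of", "to", "in", "a", "that",
    "is", "for", "with", "his", "it", "be", "not",
}


def english_ratio(text: str) -> float:
    """Return the share of alphabetic tokens that are English function words."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in ENGLISH_FUNCTION_WORDS)
    return hits / len(tokens)


def looks_non_english(text: str, threshold: float = 0.15) -> bool:
    """Flag a text for manual review; the threshold is a guess, not a tuned value.

    Note that an empty transcription has a ratio of 0.0 and is flagged too.
    """
    return english_ratio(text) < threshold


english = "The king and the queen of England went to the court with his men"
latin = "ad suas proprias radices, & radicum conjugationes, tempora, & personas"
print(looks_non_english(english), looks_non_english(latin))
```

A heuristic like this will struggle with bi- and tri-lingual books, which is one reason the printed cues remain the more reliable guide.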

Three years ago, I wrote a blog post about some Welsh language material that I found in EEBOTCP Phase I. In the intervening time I still have not learned Welsh (though I am endlessly fascinated by it), still get lots of questions and clicks to this site related to Early Modern Welsh (hello Early Modern Welsh fans), and I have since learned quite a lot more about how texts were chosen to be included in EEBO (it involves the STC; Kichuk 2007 is an excellent read on this topic to the previously uninitiated). So while that previous post asked “What makes a text in EEBO an English text”, this post will ask “what makes a text in EEBO a book?”

In general, I think we can agree that in order to be considered a book or booklet or pamphlet, a printed object has to have several pages. These pages can either be created by folding one broadside sheet, or the object will have a collection of these (called gatherings). It may or may not have a cover, but it would be one of several sizes (quarto, folio, duodecimo, etc). To this end, Sarah Werner has an excellent exercise on how to fold a broadside paper into a gathering which builds the basis for many, but probably not all, early books. Here is an example of a broadside that has clearly been folded up; it’s been unfolded for digitization.

folded broadsheet A17754 (TCPID A17754)

So it has been folded in a style that suggests it could be read like a book, but it is not necessarily a book in the sense that there is a distinct sense of each individual page and that some of the verso/recto pages would be rendered unreadable unless they had been cut, etc.

In order to be available for digitization from the original EEBO microfilms, a text needed to be included in a short title catalogue. The British Library English Short Title Catalogue describes itself as

a comprehensive, international union catalogue listing early books, serials, newspapers and selected ephemera printed before 1801. It contains catalogue entries for items issued in Britain, Ireland, overseas territories under British colonial rule, and the United States. Also included is material printed elsewhere which contains significant text in English, Welsh, Irish or Gaelic, as well as any book falsely claiming to have been printed in Britain or its territories.

I select the British Library ESTC here because it covers several short title catalogues (Wing and Pollard & Redgrave are both included) and it’s my go-to short title catalogue database. Including “ephemera” is important, because it allows any number of objects to be considered as items of early print, even if they’re not really ‘books’ per se.

Such as this newspaper (TCPID A85603)…

newspaper A85603

Or this effigy, in Latin, printed on 1 broadside (TCPID A01919)

effigy A01919

Or this proclamation, also printed on 1 broadside (TCPID A09573)

proclamation A09573

Or this sheet of paper, listing locations in Wales (Wales! Again!) (TCPID A14651)

Screen Shot 2016-07-12 at 9.00.48 pm


Or this acrostic (TCPID A96959)


Interestingly, these are all listed as “one page” in the Jisc Historical books metadata, though they are perhaps more accurately “one sheet”. While there’s no definitive definition of “English” in Early English Books Online, it’s becoming increasingly clear to me that there’s no definitive definition of “book” either. And thank god for that, because EEBO is the gift that keeps giving when it comes to Early Modern printed materials.

10 Things You Can Do with EEBO-TCP Phase I

The following is a list of resources I presented at Yale University (New Haven, CT, USA) on 4 May 2016 as part of my visit to the Yale Digital Humanities Lab. Thank you again for having me! This resource list includes work by colleagues of mine from the Visualising English Print project at the University of Strathclyde. You can read more about their work on our website, or read my summary blog post at this link.

You can download the corresponding slides at this link and the corresponding worksheet at this link.

EEBO and the TCP initiative
Official page
The History of Early English Books Online
Transcription guidelines & other documentation
EEBO-TCP Tagging Cheatsheet: Alphabetical list of tags with brief descriptions
Text Creation Partnership Character Entity List

Access the EEBOTCP-1 corpus
#1 – Download the XML files
#2 – Search the full text transcription repository online *
or *
* these are mirrors of each other

#3 – Find a specific transcription
STC number vs ESTC number vs TCPID number
STC = specific book a transcription is from
ESTC number = “English Short Title Catalogue” (see
TCPID = specific transcription (A00091)

#4 – Search the full text corpus of EEBOTCP*
(*Can include EEBO-TCP phase I, Phase I and Phase II; read documentation carefully)

EEBO-TCP Ngram reader, concordancer & text counts (big picture)
CQPWeb EEBO-TCP, phase I
#5 – identify variant spellings BYU Corpora front end to EEBO-TCP (*not completely full text but will be soon*) (potential variant spellings; see also EEBO NGram Viewer, above)

Find specific information in the TCP texts…
#6 – Find a specific language
e.g. Welsh: *ddg*
#7 a specific term or concept using the Historical Thesaurus of the OED, trace with resources listed above (

#8 Curate corpora
Alan Hogarth’s Super Science corpus uses EEBO-TCP provided metadata + disciplinary knowledge to curate texts about scientific writing

#9 Clean up transcriptions – Shota Kikuchi’s work
Using the VARD spelling moderniser + PoS tagging (Stanford PoS tagger and TregEx → improvements for tagging accuracy, syntactic parsing)

#10 Teach with it!
Language of Shakespeare’s plays web resource led by Rebecca Russell, undergraduate student (University of Strathclyde Vertically Integrated Project, English Studies / Computer Science)

Introducing “Social Identity In Early Modern Print: Shakespeare, Drama (1514-1662) and EEBO-TCP Phase I”

Today I am enormously pleased to introduce my PhD thesis, entitled “Social Identity In Early Modern Print: Shakespeare, Drama (1514-1662), and EEBO-TCP Phase I”. It investigates how Shakespeare’s use of terms marking social identity compares to that of his dramatist contemporaries and of Early Modern print more generally, using the Historical Thesaurus of the Oxford English Dictionary and the Early English Books Online Text Creation Partnership Phase I corpus.

Here are some facts about it: It ended up being 267 pages long; without the bibliography/front matter/appendices it’s 51,399 words long, which is about the length of 2 ⅓ Early Modern plays. There are 5 chapters and 3 appendices. It’s dedicated to Lisa Jardine, whose 1989 book Still Harping on Daughters: Women and Drama in the Age of Shakespeare changed my life, getting me interested in social identity in the Early Modern period in the first place. I am so grateful for Jonathan Hope and Nigel Fabb’s guidance. My examination is scheduled to be within the next two months, which is relatively soon. I’m excited to have it out in the world.

In lieu of an abstract I have topic modelled most of the content chapters for you (I had to take out tables and graphs, and some non-alphanumeric characters; I’m sure you will forgive me). For the unfamiliar, topic modelling looks for weighted lexical co-occurrence in text. You can read more about it at this link.
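If "lexical co-occurrence" is unfamiliar, the sketch below shows the raw ingredient. It is not a topic model and it is not how the topics in this post were produced: it only counts how many documents each pair of words shares, which is the kind of signal a topic model then weights. The function name and the three toy "documents" are made up for illustration.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(documents):
    """Count how many documents each unordered word pair appears in together."""
    counts = Counter()
    for doc in documents:
        # Use each word once per document; sort so pairs have a stable order.
        vocab = sorted(set(doc.lower().split()))
        counts.update(combinations(vocab, 2))
    return counts


# Toy documents, invented for this example.
docs = ["mad sanity reason", "mad reason claim", "lord king court"]
counts = cooccurrence_counts(docs)
print(counts.most_common(1))
```

Words that keep turning up in the same documents end up with high pair counts, and a topic model groups such words together into lists of related terms.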

A note on abbreviations: The SSC (“Standardized Spelling WordHoard Early Modern Drama corpus, 1514-1662”, Martin Mueller, ed. 2010) is a subcorpus of Early Modern drama from EEBO-TCP phase I which I use to set up comparisons between Shakespeare and his dramatist contemporaries. It generally serves as a prototype for Mueller’s later Shakespeare his Contemporaries corpus (2015). EEBOTCP is Early English Books Online Text Creation Partnership phase I and HTOED is the Historical Thesaurus of the Oxford English Dictionary. Wordhoard is a complete set of Shakespeare’s plays, deeply tagged for quantitative analysis.

1. play act suggests titus scene time bigram tragic iv strategy naming change setting hamlet shift iii downfall beginning serves construction established lack direct tamora revenge hero moment modes cleopatra andronicus
2. forms similar considered general categories surrounding directly position multiple common found listed contexts illustrates examples plays contrast concept variant sample kinds small included covering past perceived suggested thinking variety investigated
3. concordance search query view plot title instance queen lines cqpweb plots image string process line antconc average regular aggregated queries images normalized imagemagick size page length produce expressions bible adjacent
4. gentry status figures table models classes social high members nevalainen raumolin upper archer brunberg model nobility part identifying explicitly servants references presented clergy commoner account commoners noble nobles system ranking
5. england make acts paul writing true thou evidence back primary conversion festus thee threat noble art long sermon doctor christianity madde effect john audience speak verse church apostles truth argument
6. man lady generally native understood men latinate positive good beauteous virtuous character contrast register night proper adjectives poor takes lost love stylistic day raising formal gentle sweet honourable qualities dream
7. social class vocatives status titles specific marking vocative comedies identify tragedies histories indicative feature busse absence broadly pattern courteous moments middle start markers visualisation names individuals world finish amount presence
8. characters plays character title necessarily number total play mistress high roles list speaking rank comedy considered minor remains duke romeo ephesus syracuse juliet marriage analysis attendants lords measure merchant taming
9. social historical study literary language identity linguistic features studies culture evidence primarily feminist sociolinguistic criticism thesis claims community cultural approach methods scholars personal identify research previously renaissance understand central jardine
10. part discuss question bedlam stage represent order shown socially shakespearean describe honest invoking earlier ford duchess run discussion instability neely husband alongside ridge acting issues bellamont put accounts considers whore
11. woman collocates lady register knave nouns lower wench man terms unique term binary negative low lord high status variation show describe gendered collocate lexical semantic higher describes formality individuals noun
12. individual lord ways include reference sir politeness set result king henry applied strategies referenced parts address othello court philaster semantically marker due master wife introduce kind cardinal correlation question structures
13. female male characters gender women means dialogue patterns feminine find fewer marked idea role objects masculine grammatical race dependent skewed understood heavily ambiguous distinctions body rackin finally century negation interest
14. shows figure compared million authors instances unlike difference clear difficult contemporary specifically previous usage frequency comparatively ssc picture rest consistently shakespeare begin strong scenes times john place possibility england thomas
15. shakespeare plays show contemporaries larger chapter variation compare history dramatists relationship dramatist genre attention observe writing compares style divide edited internal exceptional assigned studied cover challenge correlate finding worthy lear
16. mad sanity actions phrase claim religious reason threatening face gram suggest establish dangerous prove suggesting potentially claims orleans biting attempts state explored evidence accusations grams justify corvino recurring circumstances accused
17. corpus corpora section ssc spelling included including making full range data meaning genres author frequently identify drama represents wider investigation comparisons nature side discussions observe metadata wide separate dialogues smaller
18. present hand http made oed malvolio works project head makes ii complete issue end fool ability thesaurus www requires illustrate kay dictionary org gutenberg hope oxford referencing initially house uk
19. early modern tcp eebo drama english phase period print online books creation widely playwrights partnership focus limited handful short jonson pronouns catalogue publication diachronic selected initiative recently years markedly norm
20. words examples frequent function highly word potential table left phrases node analysis ranked span pronoun content unclear multi lexical keyword context sense luna verbs collocate count finds unit highlights large
21. dramatic structure narrative action characterization model outlined freytag shakespearean visible work understanding point argues schmidt end tragedy jockers genre introduced constructions construct plausible description sudden suggest neuberg fiske climax reasons
22. texts text information based digital lists wordhoard personae dramatis mueller editions printed view folger selection designed source finally print includes burns ultimately discussing machine nameless edition readable time critical provided
23. madness terms mental htoed synonyms relevant illness form crazy melancholy historically health person related synonymous brained symptoms insanity scientific physical medical dement linked term relating semantic crazed variations definition discussed
24. test lemma collocates probability results likelihood collocational information analysis log dice lemmata top specific tests coefficient collocation based squared method frequency mutual relationships conditional chi appears phi produce methods wordhoard
25. form texts speaker culpeper discourse speech world context category spoken constructed kyt hearer addressee al biber analyses interaction writing written lutzky interactions fictional imaginary relevant circumstances addition person people real

7 reasons why I think this Hebrew-Latin book from 1683 is really cool

A few years ago I wrote about non-English language printing in EEBO, a post which still gets a fair amount of traffic and a lot of people asking me about Welsh. So when I found a bilingual Latin/Hebrew book in EEBO on Friday night while searching for something else just as I was getting ready to go meet some friends for dinner, I was overjoyed. This is a book printed in Cambridge, England, in 1683 and contains two languages which are very much not English.

JISC’s EEBO portal lists the title as “Komets leshon ha-koresh ve-ha-limudim = Manipulus linguae sanctae & eruditorum : in quo, quasi, manipulatim, congregantur sequentia, I. index generalis difficilorum vocum Hebraeo-Biblicarum, irregularium, & defectivarum, ad suas proprias radices, & radicum conjugationes, tempora, & personas, &c. reductarum” (R1614 Wing), describing it briefly as a Hebrew grammar (with the first four words in the title transliterated from Hebrew). My years of Hebrew school did not leave me a fluent Hebrew speaker or reader; I have no formal Latin or bibliographic training, but this book is really cool. Here are some reasons why…

This isn’t the title page, but it is the introductory material and you can see that it contains Hebrew, Latinate and Greek characters on the same page:

Screen Shot 2015-12-12 at 3.45.54

For starters, this is a bilingual grammar and index to the Old Testament, serving in some ways as a precursor to my digital concordances. But it also is fascinating because it involves several different typefaces representing several different languages, so someone in 1683 had either created a typeface for Hebrew or had access to a Hebrew typeface to print this book. Furthermore, Hebrew has a script form and a block letter form; the block letters are often used in printing whereas script is much more common elsewhere. Torahs are hand-copied onto vellum (even today!), so it is plausible someone may have had to transform each scripted character into block letters for this.

Hebrew is read from right to left, whereas Latin is read from left to right, so this book had to be very carefully typeset to put these two languages back to back. Hebrew also has a vowel system which is optional in print; when the vowel marks are used, they are usually found under the consonants. Torahs often do not use the vowel system so the inclusion of the marks here (look for the lines, dots and small T’s) is interesting and an extra complication for typesetting.

Screen Shot 2015-12-12 at 3.46.25

The catchwords at the bottom of the page are printed in Hebrew here, but the book uses Latinate numbering. And – as my mother pointed out – entries are listed alphabetically in Hebrew (not in Latin).

Screen Shot 2015-12-12 at 3.46.39

It also includes a list of ambiguities, still written in both languages, and still juxtaposed with a left-to-right and right-to-left language.

Screen Shot 2015-12-12 at 3.46.53

So this is already interesting from a printing perspective, but then there are also grammatical notes and commentaries included, with descriptions of how to use this grammar. And still the juxtaposition of both languages on the same line is really fascinating:

Screen Shot 2015-12-12 at 3.47.06 Screen Shot 2015-12-12 at 3.47.21

From the grammatical guide, here is a table of conjugations in Hebrew, marked with Latin descriptions (active, passive, future, participles, etc):

Screen Shot 2015-12-12 at 3.45.25

And finally it ends in a two-column translation of Hebrew text into Latin:

Screen Shot 2015-12-12 at 4.06.43

Download the EEBO scan as a PDF for more.

Ways of Accessing EEBO(TCP)

On October 28, 2015, the Renaissance Society of America sent an email to all members announcing the demise of their previous partnership with ProQuest (now in control of ExLibris too). Their email to all of us, in full:

The RSA Executive Committee regrets to announce that ProQuest has canceled our subscription to the Early English Books Online database (EEBO). The basis for the cancellation is that our members make such heavy use of the subscription, this is reducing ProQuest’s potential revenue from library-based subscriptions. We are the only scholarly society that has a subscription to EEBO, and ProQuest is not willing to add more society-based subscriptions or to continue the RSA subscription. We hoped that our special arrangement, which lasted two years, would open the door to making more such arrangements possible, to serve the needs of students and scholars. But ProQuest has decided for the moment not to include any learned societies as subscribers. Our subscription will end a few days from now, on October 31. We realize this is very late notice, but the RSA staff have been engaged in discussions with ProQuest for some weeks, in the hope of negotiating a renewal. If they change their mind, we will be the first to re-subscribe.

This is truly terrible news, especially for anyone whose institution did not/could not subscribe to the ProQuest interface.

**EDIT 29 Oct 8:05pm**: the RSA confirms that access to EEBO via ProQuest will continue:

We are delighted to convey the following statement from ProQuest:

“We’re sorry for the confusion RSA members have experienced about their ability to access Early English Books Online (EEBO) through RSA. Rest assured that access to EEBO via RSA remains in place. We value the important role scholarly societies play in furthering scholarship and will continue to work with RSA — and others — to ensure access to ProQuest content for members and institutions.”

The RSA subscription to EEBO will not be canceled on October 31, and we look forward to a continued partnership with ProQuest.

Perhaps because the first set of TCP editions of the EEBO texts is now part of the public domain, this is supposed to be sufficient for scholars’ use. Of course, this is not true: the TCP texts are a facsimile of the EEBO images (themselves facsimiles of facsimiles). However inadequate the TCP texts are for someone without an EEBO subscription, I have been collecting a number of links for a number of years about how to access and use EEBO(TCP). Although the decision to cancel has been overturned, the benefit of having all these resources listed together seems to justify their continued existence here. They are also available on my links page, but in the interest of accessibility, here they are replicated:

1 EEBO(TCP) documentation
Text Creation Partnership
EEBO-TCP documentation
EEBO-TCP Tagging Cheatsheet: Alphabetical list of tags with brief descriptions
Text Creation Partnership Character Entity List
The History of Early English Books Online
Using Early English Books Online

2 Access to EEBO(TCP) full texts (searchable)
Early English Books Online (EEBO): JISC historical books interface (UK, paywall, free access from the British Library Reading Room)
Early English Books Online (EEBO): Chadwyck-Healey interface (outside UK, paywall; your mileage may vary by country)

The Dutch National Library has off-site access, including full EEB (European books), ECCO (18C), TEMPO (pamphlets), for members, 15€/yr. Register online:

EEBO-TCP Texts on Github
UMichigan TCP repository
UMichigan EEBO-TCP full text search*
University of Oxford Text Archive TCP full text search*
* These sites are mirrors of each other
See also 10 things you can do with EEBOTCP

EEBO-TCP Ngram reader, concordancer, & text counts
CQPWeb EEBO-TCP, phase I (and many others)
(Video guide to CQPWeb:
BYU Corpora front end to EEBO-TCP (*not completely full text but will be soon*)

3 Other resources
English Short Title Catalogue (ESTC),
Universal Short Title Catalogue (USTC)
Items in the English Short Title Catalogue (ESTC), via Hathitrust
LUNA, Folger Library Digital Image Collection
Internet Archive Books
The Folger Digital Anthology of Early Modern English Drama

Sarah Werner’s compendium of resources, incl digitised early books
Laura Estill’s Digital Renaissance wiki page covers online book catalogues, digitised facsimiles, early modern playtexts online, print and book history, etc
(see also her very thorough guide to manuscripts online
Claire M. L. Bourne’s Early Modern Plays on Stage & Page resource list:

Large Digital Libraries of Pre-1800 Printed Books in Western Languages
The University of Toronto has a large number of Continental Renaissance text-searchable books online
30+ digitised STC titles at Penn (free to use, from their collection)

UCSB Broadside Ballads Archive
Broadside Ballads Online

A database of early modern printers & sellers culled from the eMOP source documents
(And their mirror of the ECCO-TCP texts:

Database of Early English Playbooks (DEEP)

How to save and download pdfs from the Chadwyck EEBO Interface

And a crucial read from Laura Mandell and Elizabeth Grumbach on the digital existence of ECCO (Eighteenth Century Collections Online):

This page will be updated with more resources as they become available. Email me with links: heathergfroehlich at gmail dot com // 15 Aug 2016

Suggested Ways of Citing Digitized Early Modern Texts

On 1 January 2015, 25,000 hand-keyed Early Modern texts entered the public domain and were publicly posted on the EEBO-TCP project’s GitHub page, with an additional 28,000 or so forthcoming into the public domain in 2020. This project is, to say the least, a massive undertaking and marks a sea change in the scholarly study of the Early Modern period. Moreover, we had nearly worked out how to cite the EEBO texts (the images of the books themselves) just before this happened: Sam Kaislaniemi has an excellent blogpost on how one should cite books in the EEBO Interface (May, 2014), and his main point is replicated here:

When it comes to digitized sources, many if not most of us probably instinctively cite the original source, rather than the digitized version. This makes sense – the digital version is a surrogate of the original work that we are really consulting. However, digitizing requires a lot of effort and investment, so when you view a book on EEBO but only cite the original work, you are not giving credit where it is due. After all, consider how easy it now is to access thousands of books held in distant repositories, simply by navigating to a website (although only if your institution has paid for access). This kind of facilitation of research should not be taken for granted.

In other words, when you use digitized sources, you should cite them as digitized sources. I do see lots of discussions about how to best access and distribute (linked) open data, but these discussions tend to avoid the question of citation. In my perfect dream world every digital repository would include a suggested citation in their README files and on their website, but alas we do not live in my perfect dream world.

For reasons which seem to be related to the increasingly widespread use of the CC-BY licences, which allow individuals to use, reuse, and “remix” various collections of texts, citation can be a complicated aspect of digital collections, although it doesn’t have to be. For example, this site has a creative commons license, but we have collectively agreed that blog posts etc are due citation; the MLA and APA offer guidelines on how to cite blog posts (and tweets, for that matter). If you use Zotero, for example, you can easily scrape the necessary metadata for citing this blog post in up to 7,819 styles (at the time of writing). This is great, except when you want to give credit where credit is due for digitized text collections, which are less easy to pull into Zotero or other citation managers. And without including this information somewhere in the corpus or documentation, it’s increasingly difficult to properly cite the various digitized sources we often use. As Sam says so eloquently, it is our duty as scholars to do so.

Corpus repositories such as CoRD include documentation such as compiler, collaborators, associated institutions, wordcounts, text counts, and often include a recommended citation, which I would strongly encourage as a best practice to be widely adopted.


Here is a working list of best citation practices outlined for several corpora I am using or have encountered. These have been cobbled together from normative citation practices with input from the collection creators. (Nb. collection creators: please contact me with suggestions to improve these citations).

This is a work in progress, and I will be updating it occasionally where appropriate. Citations below follow MLA style, but should be adaptable into the citation model of choice.

Folger Shakespeare Library. Shakespeare’s Plays from Folger Digital Texts. Ed. Barbara Mowat, Paul Werstine, Michael Poston, and Rebecca Niles. Folger Shakespeare Library, dd mm yyyy.

Mueller, M. “WordHoard Shakespeare”. Northwestern University, 2004-2013. Available online:

Mueller, M. “Standardized Spelling WordHoard Early Modern Drama corpus, 1514-1662”. Northwestern University, 2010. Available online:

Mueller, M. “Shakespeare His Contemporaries: a corpus of Early Modern Drama 1550-1650”. Northwestern University, 2015. Available online:

EEBO-TCP access points:
There are several access points to the EEBO-TCP texts, and one problem is that the text IDs don’t always correspond to the same texts in all EEBO viewers, as Paul Schaffner describes below.

Benjamin Armintor has been exploring the implications of this on his blog, but in general if you’re using the full-text TCP files, you should be citing which TCP database you are using to access the full-text files. Where appropriate, I’ve included a sample citation as well.

1. For texts from the University of Michigan EEBO-TCP site, follow the formula below:

Author. Title. place: year, Early English Books Online Text Creation Partnership, Phase 1, Oxford, Oxfordshire and Ann Arbor, Michigan, 2015. date accessed: dd mm yyyy

Webster, John. The tragedy of the Dutchesse of Malfy As it was presented priuatly, at the Black-Friers; and publiquely at the Globe, by the Kings Maiesties Seruants. The perfect and exact coppy, with diuerse things printed, that the length of the play would not beare in the presentment. London: 1623, Early English Books Online Text Creation Partnership, Phase 1, Oxford, Oxfordshire and Ann Arbor, Michigan, 2015. Available online:, accessed 5 August 2015.

2. For the Oxford Text Creation Partnership Repository and the searchable database there:

Author. Title. Early English Books Online Text Creation Partnership, phase I: Oxford, Oxfordshire and Ann Arbor, Michigan, 2015 [place: year]. Available online at; Source available at

Rowley, William. A Tragedy called All’s Lost By Lust. Early English Books Online Text Creation Partnership, phase I: Oxford, Oxfordshire and Ann Arbor, Michigan, 2015 [London: 1633]. Available online:; Source available at:

3. The entire EEBO-TCP GitHub repository:

Early English Books Online Text Creation Partnership, Phase I. Early English Books Online Text Creation Partnership, phase I: Oxford, Oxfordshire and Ann Arbor, Michigan, 2015. Available online:

If you are citing bits of the TCP texts as part of the whole corpus of EEBO-TCP, it makes the most sense to parenthetically cite the TCP ID as its identifying characteristic (following corpus linguistic models). So, for example, citing a passage from the Duchess of Malfi above would include a parenthetical with the unique TCP ID (A14872).

(Presumably other Text Creation Partnership collections, such as ECCO and EVANS, should be cited in the same manner.)
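
As a rough illustration, the formula given under item 1 and the parenthetical TCP ID could be filled in by a small helper. This is only a sketch: the function names `cite_eebo_tcp` and `cite_passage` are hypothetical and not part of any TCP tooling, and the wording of the fixed part of the citation is copied from the formula above.

```python
def cite_eebo_tcp(author, title, place, year, accessed):
    """Fill in the item 1 formula (hypothetical helper, not TCP tooling)."""
    return (
        f"{author}. {title}. {place}: {year}, "
        "Early English Books Online Text Creation Partnership, Phase 1, "
        "Oxford, Oxfordshire and Ann Arbor, Michigan, 2015. "
        f"date accessed: {accessed}"
    )


def cite_passage(tcp_id):
    """Parenthetical reference to a passage, keyed on the TCP ID."""
    return f"({tcp_id})"


print(cite_eebo_tcp("Webster, John", "The tragedy of the Dutchesse of Malfy",
                    "London", "1623", "5 August 2015"))
print(cite_passage("A14872"))
```

Keeping the fixed wording in one place means that a change to the recommended citation only has to be made once.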

A cautionary concordance plot tale

In my previous post I addressed how to produce a view of many concordance plots at once, and presented concordance plots for twelve vocatives which are indicative of social class in Shakespeare and a larger reference corpus of Early Modern Drama.

After double-checking all the concordance plot files using a hand-numbered master sheet, I normalised the files using the command convert plot*.jpg -size 415x47! plot*.jpg (on the off chance that any files weren’t ultimately the same size), created a new folder of the normalised files, and pulled out the examples which matched the numbers I had for Shakespeare’s plays for further analysis. I hadn’t addressed titles, as I wasn’t really aiming to look at individual authors, so each file is named plot1, plot2, plot234, etc. I went on to compile the results for these plays, felt confident about the fact that I had isolated Shakespeare, and wrote up my previous blog post.

This morning I had a nagging thought: What if those weren’t Shakespeare’s plays? After all, I had broken my #1 rule about using computational methods – assuming that everything at every step of the process worked the way I thought it did. I am probably a self-parodying pedant when it comes to computational methods, because when something goes wrong at some stage in the coding process it may *never* be visible or even noticed in the final output, and this gives me reason to seriously distrust automated processes for analysis.

Ultimately, I decided I would double-check the plays I had deemed to be “Shakespeare’s”. Even though I hadn’t done much automated processing with the image files, I had assumed that the normalisation process would only change the file names to represent a modified version: so that plot10 would become plot0-10, plot11 would become plot0-11, plot234 would become plot0-234. I had assumed the information in these files wouldn’t change, and the names would correspond to the original files.

This was not true. Instead, I had isolated a very nice sample of 36 plays which I thought matched Shakespeare’s plays in numbering, but turned out to be sampled from throughout the corpus. Matching the sampled “Shakespeare” concordance plots to the master document of concordance plots, I found that I had at least one Middleton play and at least one Seneca play in addition to some (but not all) Shakespeare plays.[1] At this point I was worried, so I re-created Shakespeare’s concordance plots from the master document of concordance plots. By redoing the concordance plots, I could guarantee that these were at least all Shakespeare’s plays in the first instance. Then I normalised them again for size, and went back to see what happens in that process. The first files were a perfect match, as I had hoped. But once I moved to the second concordance plot, I was in trouble.

Below is an image showing the unmodified concordance plot for The Taming of the Shrew (shx2), outlined in red and on the top left-hand side. The other eight concordance plots in this image are normalised for size, and even without great detail you can tell that none of these match the original file. You don’t even need to see the whole image to see this:

[Screenshot: the unmodified concordance plot for The Taming of the Shrew alongside eight normalised concordance plots]

In other words, as I had suspected, the names of the normalised files didn’t correspond to the original file names, though they were all there.[2] More worryingly, I hadn’t caught it because I had assumed that the files were fine after running a process on them. The files produced results, and if I hadn’t double-checked (really, at this point, triple-checked), I wouldn’t have caught this discrepancy.
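
One plausible explanation for this kind of mismatch, which is an assumption on my part rather than something confirmed here, is ordering: a pattern like plot*.jpg is usually expanded in string order, not numeric order, so any output that is numbered by input position will not line up with a hand-numbered master sheet. The Python sketch below uses made-up file names to show the difference between the two orderings.

```python
import re

# Hypothetical file names in the plotN.jpg style used above.
names = [f"plot{n}.jpg" for n in (1, 2, 10, 11, 234)]

# String (lexicographic) order, which is how a pattern such as plot*.jpg
# is typically expanded: "plot10.jpg" sorts before "plot2.jpg".
glob_order = sorted(names)


def plot_number(name):
    """Pull the integer out of a name such as 'plot234.jpg'."""
    return int(re.search(r"\d+", name).group())


# Numeric order, which is what a hand-numbered master sheet follows.
numeric_order = sorted(names, key=plot_number)

# Compare which file lands at each position under the two orderings.
for position, (by_glob, by_number) in enumerate(zip(glob_order, numeric_order)):
    flag = "" if by_glob == by_number else "  <-- mismatch"
    print(position, by_glob, by_number, flag)
```

Whatever the actual cause, an explicit table of original name against new name, written out before any batch process runs, makes this sort of drift much easier to spot.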

So what do concordance plots for Shakespeare’s plays look like in composite for the vocatives attached to a name in a bigram (reminder of search terms: lord [A-Z]|sir [A-Z]|master [A-Z]|duke [A-Z]|earl [A-Z]|king [A-Z]|signior [A-Z]|lady [A-Z]|mistress [A-Z]|madam [A-Z]|queen [A-Z]|dame [A-Z])? Well, surprisingly, not so different from the sample curated previously, which may be less indicative of a specific authorial style:


Remember, we read these from left to right; now there’s a lot of use of vocatives in the very beginning of the plays, which stay quite strong near the rising action until there’s a relative absence just before and around the climax and the start of the falling action. Curiously, the heavy double hit || towards the end is still very visible, as well as a few more dark lines leading up to the conclusion. In some ways, the absence of these vocatives is almost more consistent, and therefore the white bits are more visible.

In the meantime I’m having a fascinating discussion with Lauren Ackerman about how to best address pixel density and depth of detail (especially in the larger EM play corpus), so maybe there will be a third instalment of concordance plots in the future.

[1] Seneca’s plays were published in the 1550s and 1560s, which is why they are included in this data set of printed plays in Early Modern London.

[2] The benefit of working with a smaller set like this is that there is a much smaller, finite number of texts to address: rather than n = 332^332 possible combinations, I was now only looking at a possibility of n = 36^36. So that was an improvement. In case you’re wondering what happened to one play, because previously I had claimed there were 37 Shakespeare plays, one play doesn’t have any instances of the vocatives being addressed in a bigram with a capital letter.

How to address many concordance plots at once

What if you could take many concordance plots and layer them to get a composite view of many concordance plots in one image? I wanted to see if vocatives which mark for high-status individuals attached to a name appear in any particular pattern which resembles Freytag’s model of dramatic structure.[1]

I selected 12 vocatives which clearly illustrate social class attached to a word beginning with a capital letter for analysis, all of which are relatively frequent in the corpus of 332 plays comprising 7,305,366 words. In order to get my concordance plots for vocatives attached to a name, I used regular expressions searching for the vocative in question in a bigram with a capital letter, strung together by pipes, so the resulting search looked like this (signior is spelled incorrectly; this is the spelling which produced hits – I suspect something happened in the spelling normalisation stage):
lord [A-Z]|sir [A-Z]|master [A-Z]|duke [A-Z]|earl [A-Z]|king [A-Z]|signior [A-Z]|lady [A-Z]|mistress [A-Z]|madam [A-Z]|queen [A-Z]|dame [A-Z]

Although the regular expression I used picked up examples of queen I and the like, the examples of a capital letter representing the start of a name were far more frequent overall. In the case of mistress, Alison Findlay’s definition (“usually a first name or surname, is a form of polite address to a married woman, or an unmarried woman or girl” (2010, 271)) accounts for its inclusion here. Though there are certainly complicated readings of this title, I consider instances of mistress to be at the very least a vocative relating to social class in Early Modern England.
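
To make the behaviour of that search concrete, here is a minimal sketch using Python’s re module rather than AntConc. The sample sentence is invented for illustration, and AntConc’s own regex engine and case settings may differ from Python’s, so treat this as an approximation of the search rather than a reproduction of it.

```python
import re

# The same alternation as the search above, split over two lines for width.
PATTERN = re.compile(
    r"lord [A-Z]|sir [A-Z]|master [A-Z]|duke [A-Z]|earl [A-Z]|king [A-Z]|"
    r"signior [A-Z]|lady [A-Z]|mistress [A-Z]|madam [A-Z]|queen [A-Z]|dame [A-Z]"
)

# Invented sample text, not taken from the corpus.
sample = "good morrow lord Hamlet and sir Toby , the queen I serve greets lady Macbeth"

# Each match is the vocative plus the capital letter that follows it.
hits = [m.group() for m in PATTERN.finditer(sample)]
print(hits)

# Character offsets of each hit, which is what a concordance plot draws.
offsets = [m.start() for m in PATTERN.finditer(sample)]
print(offsets)
```

The "queen I" match is the kind of false positive mentioned above: the pattern only checks for a capital letter, not for a name.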

The obvious solution to doing this kind of work is R, as people such as Douglas Duhaime and Ted Underwood have been making some gorgeous composite graphs with R for a number of years. To be honest, I didn’t really want to go through the process of addressing a corpus by writing an entire script to produce something that I know can be done quickly and easily in AntConc’s concordance plot view: I had one specific need; AntConc is an existing framework for producing concordance plots which are normalised for length, as well as a KWIC viewer and several other statistical analyses. I knew that if I wanted to check anything, I could do it easily. I didn’t feel any real need to reinvent the wheel by scripting to accomplish my task, unlike the general DIY process presented by R or Python.[2] The only real downside is that if you want to do more with the output, you have to move into another software package to do that, but even that is not the end of the world.

Ultimately, what I wanted to do was take concordance plots for 332 plays and layer them for a composite picture of how they appear, rather than address them as individual views on a play-by-play basis. Layering images is a common way of addressing edits in printed books; Chris Forster has done exactly that with magazine page size; he suggested I use ImageMagick, a command line processing tool for image compositing.[3] I have a similarly normalised view of texts at my disposal, as each concordance line is normalized for length. Moreover, Chris and I are of the same mind when it comes to not introducing more complicated software for the sake of using software, so when he told me about this I was willing to give it a try, especially as he has successfully done exactly what I was trying to do. But first I needed concordance plots.

AntConc produces concordance plots but won’t export them, which is annoying but not as annoying as you may think. 38 screen grabs later, I had .pngs of each play’s concordance line. Here they are in AntConc:

(If you’re not used to reading concordance lines, you read them from left to right (from “start” to “finish”, in narrative terms); each | = 1 hit; the more hits closer together, the darker the line will look.)
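
For anyone who has not seen one, the idea of a length-normalised concordance line can be sketched in a few lines of Python. This is not how AntConc draws its plots internally; the function name `plot_strip` and the strip width are hypothetical, chosen only to show how hit positions in texts of different lengths end up on the same scale.

```python
def plot_strip(hit_offsets, text_length, width=40):
    """Draw a one-line text 'concordance plot': one | per hit position.

    hit_offsets are positions in the text; text_length is the total length.
    Every text is squeezed into the same width, so plots are comparable.
    """
    strip = [" "] * width
    for offset in hit_offsets:
        # Scale the position into the fixed-width strip.
        column = min(width - 1, offset * width // text_length)
        strip[column] = "|"
    return "".join(strip)


# Hits at the start, middle, and end of a 100-unit text, drawn 10 wide.
print("[" + plot_strip([0, 50, 99], 100, width=10) + "]")
```

Hits that fall in the same column collapse into a single mark here, whereas in the image version they make the line look darker.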

I turned these screenshots into a very large jpg with the help of an open source image editing program, just to have them all in one document together. The most well-known is probably GIMP but both lifehacker and Oliver Mason offer Seashore as a more mac-friendly alternative to the GIMP.[4]
Then I broke the master document into individual concordance plots, sized 415×47, using Seashore’s really good select-copy-make new document from pasteboard option, which lets you keep and move the select box around the master document, as seen below.

[Screenshot: Seashore’s select box on the master document of concordance plots]

So far I have only used regular expressions, command-shift-4, copy, paste, save as .jpg, and pen & paper to record what I was doing. Nothing complicated! It took a while, but in the process I got to know these results really well. Not all the plays in the corpus contain all, or in fact, any, of each vocative: in some instances, there are plays that didn’t use any of the above titles, and aren’t included in this output; some plays only use one vocative out of the twelve investigated or any combination of vocatives which do not represent the full twelve.

As a test, I separated out Shakespeare’s plays to see what a bunch of concordance plots looked like in composite. To do this, I opened a terminal and moved to the correct directory, which meant moving through 6 directories. Then I normalised everything to the same size with
convert plot*.jpg -size 415x47! plot*.jpg, just in case.
I put those in a new folder of normalised images.
Then, from the directory of normalised images: convert plot*.jpg -evaluate-sequence mean average_page.jpg.
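
To show what the averaging step is doing conceptually, here is a small pure-Python sketch of a pixel-by-pixel mean over images of identical size. It is an illustration of the idea, not ImageMagick’s implementation: images are represented as nested lists of grayscale values (0 = black, 255 = white), and the function name `mean_composite` is hypothetical.

```python
def mean_composite(images):
    """Average same-sized grayscale images pixel by pixel.

    Each image is a list of rows, each row a list of values from 0 to 255.
    """
    height, width = len(images[0]), len(images[0][0])
    for image in images:
        # Averaging only makes sense if every image has the same shape.
        if len(image) != height or any(len(row) != width for row in image):
            raise ValueError("all images must have the same dimensions")
    count = len(images)
    return [
        [sum(image[r][c] for image in images) / count for c in range(width)]
        for r in range(height)
    ]


# Two tiny 1x3 'plots': a dark pixel (0) is a hit, 255 is white space.
plot_a = [[0, 255, 0]]
plot_b = [[0, 255, 255]]
print(mean_composite([plot_a, plot_b]))
```

A position where every plot has a hit stays dark, a position where none does stays white, and everything else comes out as a shade of grey in between, which is why the size of every input has to match first.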

Here’s what 12 vocatives for social class in 37 Shakespeare plays look like in composite:
[Image: composite concordance plot for Shakespeare’s plays, monochrome]
There are a few things that I notice in this plot: There’s a quick use of naming vocatives near the beginning of the plays, a relative absence immediately after, but during the rising action and climax there are clear sections which use these vocatives quite heavily, especially in the build-up to the climax. Usage drops in the falling action, until just before the denouement; there is a point where vocatives are used quite consistently heavily, marked by || but surrounded by white on both sides. If you can’t see it, here is the concordance plot again, with that point highlighted in red.

If you repeat the above process for the 332 plays, you get the following composite image. Although the amount of information in some ways obfuscates what you’re trying to see, there are darker and lighter bits to this image.
[Image: composite concordance plot for the 332 Early Modern plays, monochrome]
Most notably, there is a similar cluster of class-status vocative use at the tail end of the introduction and into the rising action, a relative absence until the climax, and then the use of vocatives for social class seems to pick up towards the falling action and end of the plays. Interestingly, the same kind of || notation is visible towards the conclusion, though it appears twice. (Again, if you can’t see it, I’ve highlighted it in red here).

Now to address the details of these plots… But you should also read the follow-up post about this, as well.

tl;dr version:

Can you look at many concordance plots at the same time? Yes.

Do vocatives attached to a name which mark for class status have recognizable patterns in dramatic structure? MAYBE.

[1] Matthew Jockers (1, 2) and Benjamin Schmidt have been doing interesting things with regards to computationally analyzing dramatic structure. I’m not going anywhere near their levels of engagement with dramatic arcs in this post, but they are interesting reads nonetheless. (Followup: Annie Swafford’s blog post on Jockers’ analyses is worth a read as well)

[2] If you particularly enjoy using R to achieve relatively simple tasks like concordance plots, Stefan Gries’ 2009 cookbook Quantitative corpus linguistics with R: a practical introduction and Matthew Jockers’ 2014 cookbook Text Analysis with R for Students of Literature both outline how to do this.

[3] Okay, so this required a few more steps of code, most of which were install scripts which require very little work on the human end beyond following directions of ‘type this, wait for computer to return the input command’. If you are on a mac, you will need to get Xcode to download macports to download ImageMagick, and then X11 to display output. X11 seems optional, especially if you keep your finder window open nearby. Setting all this up took about two hours.

[4] It transpired that I could have done this with ImageMagick using the command
convert -append plots*.png out.png.
Oh well. Seashore also offers layering capabilities for the more graphic design driven amongst you but perhaps more importantly for me, it looks a lot like my dearly beloved MS Paint, a piece of software I’ve been trying to find a suitable replacement for since I joined The Cult of Mac in 2006.