On 1 January 2015, 25,000 hand-keyed Early Modern texts entered the public domain and were publicly posted on the EEBO-TCP project’s GitHub page, with a further 28,000 or so due to enter the public domain in 2020. This project is, to say the least, a massive undertaking, and it marks a sea change in the scholarly study of the Early Modern period. Moreover, we had nearly worked out how to cite the EEBO texts (the images of the books themselves) just before this happened: Sam Kaislaniemi has an excellent blog post on how one should cite books in the EEBO interface (May 2014), and his main point is replicated here:
When it comes to digitized sources, many if not most of us probably instinctively cite the original source, rather than the digitized version. This makes sense – the digital version is a surrogate of the original work that we are really consulting. However, digitizing requires a lot of effort and investment, so when you view a book on EEBO but only cite the original work, you are not giving credit where it is due. After all, consider how easy it now is to access thousands of books held in distant repositories, simply by navigating to a website (although only if your institution has paid for access). This kind of facilitation of research should not be taken for granted.
In other words, when you use digitized sources, you should cite them as digitized sources. I do see lots of discussions about how best to access and distribute (linked) open data, but these discussions tend to avoid the question of citation. In my perfect dream world every digital repository would include a suggested citation in its README files and on its website, but alas we do not live in my perfect dream world.
Citation can be a complicated aspect of digital collections, although it doesn’t have to be. The complication seems to be related to the increasingly widespread use of CC-BY licences, which allow individuals to use, reuse, and “remix” various collections of texts. For example, this site has a Creative Commons license, but we have collectively agreed that blog posts and the like are due citation; the MLA and APA offer guidelines on how to cite blog posts (and tweets, for that matter). If you use Zotero, for example, you can easily scrape the necessary metadata for citing this blog post in up to 7,819 styles (at the time of writing). This is great, except when you want to give credit where credit is due for digitized text collections, which are less easy to pull into Zotero or other citation managers. Unless this information is included somewhere in the corpus or its documentation, it is increasingly difficult to properly cite the various digitized sources we often use. As Sam says so eloquently, it is our duty as scholars to do so.
Corpus repositories such as CoRD include documentation such as compiler, collaborators, associated institutions, wordcounts, text counts, and often include a recommended citation, which I would strongly encourage as a best practice to be widely adopted.
Here is a working list of best citation practices outlined for several corpora I am using or have encountered. These have been cobbled together from normative citation practices with input from the collection creators. (Nb. collection creators: please contact me with suggestions to improve these citations).
This is a work in progress, and I will be updating it occasionally where appropriate. Citations below follow MLA style, but should be adaptable into the citation model of choice.
Folger Shakespeare Library. Shakespeare’s Plays from Folger Digital Texts. Ed. Barbara Mowat, Paul Werstine, Michael Poston, and Rebecca Niles. Folger Shakespeare Library, dd mm yyyy. http://folgerdigitaltexts.org/
Mueller, M. “Wordhoard Shakespeare”. Northwestern University, 2004-2013. Available online: http://wordhoard.northwestern.edu/userman/index.html
Mueller, M. “Standardized Spelling WordHoard Early Modern Drama corpus, 1514-1662”. Northwestern University, 2010. Available online: http://wordhoard.northwestern.edu.
Mueller, M. “Shakespeare His Contemporaries: a corpus of Early Modern Drama 1550-1650”. Northwestern University, 2015. Available online: https://github.com/martinmueller39/SHC/
EEBO-TCP access points:
There are several access points to the EEBO-TCP texts, and one problem is that the text IDs included don’t always correspond to the same texts in all EEBO viewers, as Paul Schaffner describes below.
@heatherfro @Rwelzenb @OxfordEEBOTCP Problem of common ids between TCP instances. TCP ID eg A12345 works everywhere except PQ site #EEBOTCP
— Paul Schaffner (@pfs) August 5, 2015
Benjamin Armintor has been exploring the implications of this on his blog, but in general if you’re using the full-text TCP files, you should be citing which TCP database you are using to access the full-text files. Where appropriate, I’ve included a sample citation as well.
1. For texts from http://quod.lib.umich.edu/e/eebogroup/, follow the below formula:
Author. Title. place: year, Early English Books Online Text Creation Partnership, Phase 1, Oxford, Oxfordshire and Ann Arbor, Michigan, 2015. quod.umich.edu/permalink date accessed: dd mm yyyy
Webster, John. The tragedy of the Dutchesse of Malfy As it was presented priuatly, at the Black-Friers; and publiquely at the Globe, by the Kings Maiesties Seruants. The perfect and exact coppy, with diuerse things printed, that the length of the play would not beare in the presentment. London: 1623, Early English Books Online Text Creation Partnership, Phase 1, Oxford, Oxfordshire and Ann Arbor, Michigan, 2015. Available online: http://name.umdl.umich.edu/A14872.0001.001, accessed 5 August 2015.
2. For the Oxford Text Creation Partnership Repository (http://ota.ox.ac.uk/tcp/) and the searchable database there
Author. Title. Early English Books Online Text Creation Partnership, phase I: Oxford, Oxfordshire and Ann Arbor, Michigan, 2015 [place: year]. Available online at http://ota.ox.ac.uk/tcp/IDNUMBER; Source available at https://github.com/TextCreationPartnership/IDNUMBER/.
Rowley, William. A Tragedy called All’s Lost By Lust. Early English Books Online Text Creation Partnership, phase I: Oxford, Oxfordshire and Ann Arbor, Michigan, 2015 [London: 1633]. Available online: http://tei.it.ox.ac.uk/tcp/Texts-HTML/free/A11/A11155.htm; Source available at: https://github.com/TextCreationPartnership/A11155/
3. The entire EEBO-TCP GitHub repository
Early English Books Online Text Creation Partnership, Phase I. Early English Books Online Text Creation Partnership, phase I: Oxford, Oxfordshire and Ann Arbor, Michigan, 2015. Available online: https://github.com/textcreationpartnership/Texts
If you are citing bits of the TCP texts as part of the whole corpus of EEBO-TCP, it makes the most sense to parenthetically cite the TCP ID as its identifying characteristic (following corpus linguistic models). So, for example, a citation of a passage from the Duchess of Malfi text above would include a parenthetical with the unique TCP ID (A14872).
(Presumably other Text Creation Partnership collections, such as ECCO and EVANS, should be cited in the same manner.)
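The Oxford repository formula (item 2 above) and the parenthetical TCP ID convention can be sketched as a small helper. This is only an illustration under stated assumptions: the function names `cite_oxford_tcp` and `parenthetical_ref` are hypothetical, and the two URL patterns are copied from the formula as written above, not from any official TCP tool. Note that the Rowley sample citation above uses a different HTML address than the formula does, so check the generated link before relying on it.

```python
# URL patterns taken verbatim from the Oxford formula above (IDNUMBER -> tcp_id).
OTA_URL = "http://ota.ox.ac.uk/tcp/{tcp_id}"
GITHUB_URL = "https://github.com/TextCreationPartnership/{tcp_id}/"


def cite_oxford_tcp(author, title, place, year, tcp_id):
    """Fill in the Oxford TCP repository formula for a single text.

    Hypothetical helper: it only does string substitution and does not
    check that the TCP ID or the resulting URLs actually exist.
    """
    return (
        f"{author}. {title}. "
        "Early English Books Online Text Creation Partnership, phase I: "
        f"Oxford, Oxfordshire and Ann Arbor, Michigan, 2015 [{place}: {year}]. "
        f"Available online at {OTA_URL.format(tcp_id=tcp_id)}; "
        f"Source available at {GITHUB_URL.format(tcp_id=tcp_id)}."
    )


def parenthetical_ref(tcp_id):
    """In-text reference to a passage, identified by TCP ID alone."""
    return f"({tcp_id})"


if __name__ == "__main__":
    print(cite_oxford_tcp(
        "Rowley, William",
        "A Tragedy called All's Lost By Lust",
        "London",
        1633,
        "A11155",
    ))
    print(parenthetical_ref("A14872"))
```

Keeping the TCP ID as the single variable part of both URLs means the same ID can be reused for the full citation and for the short parenthetical.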
Not exactly a matter of citation form, except insofar as it
relates to identifying what is being cited, and associating
it with other copies of the same, I would point to the
importance of identifiers.
— The most important one to EEBO-TCP is the TCP ID, which
consists of five digits prefixed with an A- or B-. This
number is the key to unambiguously identifying a TCP
text, whether it is obtained from our Box.com site, or
from github, or from the OTA, or from the Michigan TCP
site, or from the Oxford (ODL) mirror, or even from the
Chicago Philologic site. I can’t speak to the Northwestern
Philologic site (http://philologic.northwestern.edu/philologic/)
since it sends me away with ‘access restricted’
or to the JISC Historic Books site, which is restricted to
users in the UK.
For example, your Duchess of Malfi example has TCP ID A14872,
which means that it may be found on the ODL site at
and can be searched under that number as a bibliographic identifier
on the Chicago Philologic site here:
or filtered for on the Oxford Text Archive site
by that number here:
or downloaded in TEI P5 form from GitHub using that number:
or found on the Michigan TCP site using that number
or located among the bulk downloads from our box.com
site using that number
— A second important number is that which the ProQuest EEBO
site calls the “ID” and we refer to as the citation ID.
It identifies the ProQuest bibliographic record for the work
(which may be linked to more than one physical copy or
image set). It is sometimes but not always the same as the
OCLC accession number. The citation ID for this book is 99854798
(a non-OCLC example), so the ProQuest URL for the book is
I suspect that the JISC portal uses this number as its
primary identifier, thus
but I can’t test that.
— A third important number is the identifier for the
image set, which ProQuest calls the “VID”. The VID
for this book is 20242. Which means that the images
can be found on the ProQuest site at:
or more briefly, at
We assigned different TCP IDs basically for *every unique
combination of citation ID and VID*. I.e., if a given work
(citation ID) is linked to two or more copies (VIDs), each copy
(each ID-VID combination) is given a separate TCP ID; conversely,
if a single image set contains images from more than one work, again,
each portion of the image set (each ID-VID combo) is given its
own TCP ID. The tracking file eebodat?.sgm that we distribute with the
files on box.com contains a list of all of these IDs and
how they map to one another.
— There are of course also ‘reference’ IDs that point to
STC, Wing, Evans, and ESTC. These are important too,
and I would include them in any citation I made. The
STC/Wing numbers are the more reliable ones in our data;
ESTC numbers rather less so. STC numbers can be searched
for on most online instances, but never form the basis
of a URL.
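The identifier scheme described in this comment can be sketched in a few lines. This is a minimal illustration, not the TCP's own tooling: the names `is_tcp_id`, `TcpRecord`, and `tcp_id_for` are hypothetical, the pattern covers only the "A or B plus five digits" form described above, and the real layout of the eebodat tracking file is not assumed here. The single record uses the three numbers given above for the Duchess of Malfi.

```python
import re
from typing import NamedTuple, Optional

# EEBO-TCP IDs as described above: the letter A or B followed by five digits.
TCP_ID_PATTERN = re.compile(r"[AB]\d{5}")


def is_tcp_id(value: str) -> bool:
    """Return True if value has the EEBO-TCP ID shape, e.g. 'A14872'."""
    return TCP_ID_PATTERN.fullmatch(value) is not None


class TcpRecord(NamedTuple):
    tcp_id: str       # TCP ID, one per citation ID / VID combination
    citation_id: str  # ProQuest bibliographic record ("ID")
    vid: str          # ProQuest image set ("VID")


# One record, using the numbers quoted in the comment above.
RECORDS = [TcpRecord(tcp_id="A14872", citation_id="99854798", vid="20242")]

# Each unique (citation ID, VID) pair maps to exactly one TCP ID.
INDEX = {(r.citation_id, r.vid): r.tcp_id for r in RECORDS}


def tcp_id_for(citation_id: str, vid: str) -> Optional[str]:
    """Look up the TCP ID for a citation ID / VID pair, or None if unknown."""
    return INDEX.get((citation_id, vid))


if __name__ == "__main__":
    print(is_tcp_id("A14872"))
    print(tcp_id_for("99854798", "20242"))
```

Keying the lookup on the pair rather than on either number alone reflects the point made above: one citation ID may be linked to several image sets, and one image set may contain more than one work.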
Paul, this is really helpful, thank you. As I’m in the UK I can confirm that https://historicaltexts.jisc.ac.uk/eebo-99854798e does lead to the same Dutchess of Malfy text I discuss above (TCP ID A14872). And thank you for sending me the NU Philologic page, I hadn’t known about that before.
Would it be useful, then, to suggest including the relevant TCP ID in the EEBO-TCP citations? Or would you prefer keeping an STC/Wing/Evans/ESTC number as a citation method, perhaps alongside a TCP ID? (if so: it sounds like you would prefer an STC or Wing number)
I am more than happy to follow your lead here, and will happily amend this post to reflect this: as I say above, this is intended to be a work in progress. And if you’d prefer to move to a less public space to discuss, you can find my email address in the footer of this website.
Paul, can you elaborate what you mean by “citation id” above (“We assigned different TCP IDs basically for *every unique
combination of citation ID and VID*”)?
Heather- Great post. I suspect that over time there is going to be a versioning consideration (especially with regard to the GitHub references, which will be taggable as they change), and ultimately a need to be able to cite by location in long documents. I know Hugh Cayless is looking at something to do with that this Fall along the lines of xlink; we should pester him.
As is James Cummings – apparently the underlying XML files can be cited by speech if not by line, though the XSLT’d files (xml -> html files on the oxford page listed above) don’t necessarily reflect that at the very moment. We have talked about the potential for doing this with the html files, but ultimately, someone needs to normalise it across all entrypoints.
Citable links as a standard for long documents bringing you to the exact moment under discussion as a norm would be a dream come true – the Folger Digital Texts are doing a great job of this; I can only hope that others will follow this model.