EEBO-TCP

What’s a “book” in Early English Books Online?

Recently I have been employed by the Visualising English Print project, where one of the things we are doing is looking at improving the machine-readability of the TCP texts. My colleague Deidre has already released a plain-text VARDed version of the TCP corpus, but it is our hope to further improve the machine-readability of these texts.

One of the issues that came up in modernising and using the TCP texts has to do with non-English print. It has been previously documented that there are several non-English languages in EEBO – including Latin, Greek, Dutch, French, Welsh, German, Hebrew, Turkish and Algonquin. Our primary issue is that if a transcription in the corpus is not in English, it will be very difficult for an English-language text parser or word vector model to account for that material.

So our solution has been to isolate the texts which are printed in a non-English language, whether monolingual (e.g. a book in Latin) or bi- or tri-lingual (e.g. a Dutch/English book with a Latin foreword). Looking at EEBO-the-books is a helpful way to identify languages in print, as there are all sorts of printed cues to suggest linguistic variation, such as different fonts or italics to set a different language off from the primary language. It also means I get a chance to look at many of these non-English texts as they were printed and transcribed initially.

Three years ago, I wrote a blog post about some Welsh language material that I found in EEBOTCP Phase I. In the intervening time I still have not learned Welsh (though I am endlessly fascinated by it), still get lots of questions and clicks to this site related to Early Modern Welsh (hello Early Modern Welsh fans), and I have since learned quite a lot more about how texts were chosen to be included in EEBO (it involves the STC; Kichuk 2007 is an excellent read on this topic for the previously uninitiated). So while that previous post asked “What makes a text in EEBO an English text”, this post will ask “what makes a text in EEBO a book?”

In general, I think we can agree that in order to be considered a book or booklet or pamphlet, a printed object has to have several pages. These pages can either be created by folding one broadside sheet, or the object will have a collection of these folded sheets (called gatherings). It may or may not have a cover, but it would be one of several sizes (quarto, folio, duodecimo, etc). To this end, Sarah Werner has an excellent exercise on how to fold a broadside paper into a gathering, which forms the basis for many, but probably not all, early books. Here is an example of a broadside that has clearly been folded up; it’s been unfolded for digitization.

Folded broadsheet (TCPID A17754)

So it has been folded in a style that suggests it could be read like a book, but it is not necessarily a book: there is no distinct sense of each individual page, and some of the verso/recto pages would be unreadable unless the sheet had been cut.

In order to be available for digitization from the original EEBO microfilms, a text needed to be included in a short title catalogue. The British Library English Short Title Catalogue describes itself as

a comprehensive, international union catalogue listing early books, serials, newspapers and selected ephemera printed before 1801. It contains catalogue entries for items issued in Britain, Ireland, overseas territories under British colonial rule, and the United States. Also included is material printed elsewhere which contains significant text in English, Welsh, Irish or Gaelic, as well as any book falsely claiming to have been printed in Britain or its territories.

I select the British Library ESTC here because it covers several short title catalogues (Wing and Pollard & Redgrave are both included) and it’s my go-to short title catalogue database. Including “ephemera” is important, because it allows any number of objects to be considered as items of early print, even if they’re not really ‘books’ per se.

Such as this newspaper (TCPID A85603)…

Or this effigy, in Latin, printed on 1 broadside (TCPID A01919)

Or this proclamation, also printed on 1 broadside (TCPID A09573)

Or this sheet of paper, listing locations in Wales (Wales! Again!) (TCPID A14651)

Or this acrostic (TCPID A96959)

Interestingly, these are all listed as “one page” in the Jisc Historical books metadata, though they are perhaps more accurately “one sheet”. While there’s no definitive definition of “English” in Early English Books Online, it’s becoming increasingly clear to me that there’s no definitive definition of “book” either. And thank god for that, because EEBO is the gift that keeps giving when it comes to Early Modern printed materials.

Ways of Accessing EEBO(TCP)

On October 28, 2015, the Renaissance Society of America sent an email to all members announcing the demise of their previous partnership with ProQuest (now in control of ExLibris too). Their email to all of us, in full:

The RSA Executive Committee regrets to announce that ProQuest has canceled our subscription to the Early English Books Online database (EEBO). The basis for the cancellation is that our members make such heavy use of the subscription, this is reducing ProQuest’s potential revenue from library-based subscriptions. We are the only scholarly society that has a subscription to EEBO, and ProQuest is not willing to add more society-based subscriptions or to continue the RSA subscription. We hoped that our special arrangement, which lasted two years, would open the door to making more such arrangements possible, to serve the needs of students and scholars. But ProQuest has decided for the moment not to include any learned societies as subscribers. Our subscription will end a few days from now, on October 31. We realize this is very late notice, but the RSA staff have been engaged in discussions with ProQuest for some weeks, in the hope of negotiating a renewal. If they change their mind, we will be the first to re-subscribe.

This is truly terrible news, especially for anyone whose institution did not/could not subscribe to the ProQuest interface.

**EDIT 29 Oct 8:05pm**: the RSA confirms that access to EEBO via ProQuest will continue:

We are delighted to convey the following statement from ProQuest:

“We’re sorry for the confusion RSA members have experienced about their ability to access Early English Books Online (EEBO) through RSA. Rest assured that access to EEBO via RSA remains in place. We value the important role scholarly societies play in furthering scholarship and will continue to work with RSA — and others — to ensure access to ProQuest content for members and institutions.”

The RSA subscription to EEBO will not be canceled on October 31, and we look forward to a continued partnership with ProQuest.

Perhaps because the first set of TCP editions of the EEBO texts is now in the public domain, this is supposed to be sufficient for scholars’ use. Of course, this is not true: the TCP texts are a facsimile of the EEBO images (themselves facsimiles of facsimiles). However inadequate the TCP texts are for someone without an EEBO subscription, I have been collecting links for a number of years about how to access and use EEBO(TCP). Although the decision has been overturned, the benefit of having all these resources listed together seems to justify their continued existence here. They are also available on my links page, but in the interest of accessibility, here they are replicated:

1 EEBO(TCP) documentation
Text Creation Partnership http://www.textcreationpartnership.org/
EEBO-TCP documentation http://www.textcreationpartnership.org/docs/
EEBO-TCP Tagging Cheatsheet: Alphabetical list of tags with brief descriptions http://www.textcreationpartnership.org/docs/dox/cheat.html
Text Creation Partnership Character Entity List http://www.textcreationpartnership.org/docs/code/charmap.htm
The History of Early English Books Online http://folgerpedia.folger.edu/History_of_Early_English_Books_Online
Using Early English Books Online http://folgerpedia.folger.edu/Using_Early_English_Books_Online

2 Access to EEBO(TCP) full texts (searchable)
Early English Books Online (EEBO): JISC historical books interface (UK, paywall, free access from the British Library Reading Room) http://historicaltexts.jisc.ac.uk/
Early English Books Online (EEBO): Chadwyck-Healey interface (outside UK, paywall; your mileage may vary by country) http://eebo.chadwyck.com/home

The Dutch National Library has off-site access, including full EEB (European books), ECCO (18C), TEMPO (pamphlets), for members, 15€/yr. Register online: inschrijven.kb.nl/index.php

EEBO-TCP Texts on Github https://github.com/textcreationpartnership/Texts
UMichigan TCP repository https://umich.app.box.com/s/nfdp6hz228qtbl2hwhhb/
UMichigan EEBO-TCP full text search http://quod.lib.umich.edu/e/eebogroup/*
University of Oxford Text Archive TCP full text search http://ota.ox.ac.uk/tcp/*
* These sites are mirrors of each other
See also 10 things you can do with EEBOTCP

EEBO-TCP Ngram reader, concordancer, & text counts  http://earlyprint.wustl.edu/
CQPWeb EEBO-TCP, phase I (and many others) https://cqpweb.lancs.ac.uk/
(Video guide to CQPWeb: https://www.youtube.com/watch?v=Yf1KxLOI8z8&list=PL2XtJIhhrHNTxjyZ5VSKUr0-4EuzJJDbe)
BYU Corpora front end to EEBO-TCP (*not completely full text but will be soon*) http://corpus.byu.edu/eebo

3 Other resources
English Short Title Catalogue (ESTC), http://estc.bl.uk/
Universal Short Title Catalogue (USTC) http://www.ustc.ac.uk/
Items in the English Short Title Catalogue (ESTC), via Hathitrust http://babel.hathitrust.org/cgi/mb?a=listis;c=247770968
LUNA, Folger Library Digital Image Collection http://luna.folger.edu/
Internet Archive Books https://archive.org/details/early-european-books
The Folger Digital Anthology of Early Modern English Drama http://digitalanthology.folger.edu/

Sarah Werner’s compendium of resources, incl digitised early books http://sarahwerner.net/blog/a-compendium-of-resources/
Laura Estill’s Digital Renaissance wiki page covers online book catalogues, digitised facsimiles, early modern playtexts online, print and book history, etc http://digitalrenaissance.pbworks.com/w/page/54277828/EarlyModernDigitalResources
(see also her very thorough guide to manuscripts online http://manuscriptresearch.pbworks.com/w/page/48026041/FrontPage)
Claire M. L. Bourne’s Early Modern Plays on Stage & Page resource list: http://www.ofpilcrows.com/resources-early-modern-plays-page-and-stage

Large Digital Libraries of Pre-1800 Printed Books in Western Languages http://archiv.twoday.net/stories/6107864/
The University of Toronto has a large number of Continental Renaissance text-searchable books online http://link.library.utoronto.ca/booksonline/
30+ digitised STC titles at Penn (free to use, from their collection) http://franklin.library.upenn.edu/search.html?filter.library_facet.val=Rare%20Book%20and%20Manuscript%20Library&q=STC%20collection%20%22sceti%22&sort=publication_date_sort%20asc,%20title_sort%20asc

UCSB Broadside Ballads Archive http://ebba.english.ucsb.edu/
Broadside Ballads Online http://ballads.bodleian.ox.ac.uk/

A database of early modern printers & sellers culled from the eMOP source documents https://github.com/Early-Modern-OCR/ImprintDB
(And their mirror of the ECCO-TCP texts: https://github.com/Early-Modern-OCR/TCP-ECCO-texts)

Database of Early English Playbooks (DEEP) http://deep.sas.upenn.edu/

How to save and download pdfs from the Chadwyck EEBO Interface https://www.youtube.com/watch?v=6u2B_MagrPc

And a crucial read from Laura Mandell and Elizabeth Grumbach on the digital existence of ECCO (Eighteenth Century Collections Online): http://src-online.ca/src/index.php/src/article/view/226/448

This page will be updated with more resources as they become available. Email me with links: heathergfroehlich at gmail dot com // 15 Aug 2016

Suggested Ways of Citing Digitized Early Modern Texts

On 1 January 2015, 25,000 hand-keyed Early Modern texts entered the public domain and were publicly posted on the EEBO-TCP project’s GitHub page, with an additional 28,000 or so due to enter the public domain in 2020. This project is, to say the least, a massive undertaking and marks a sea change in scholarly study of the Early Modern period. Moreover, we nearly worked out how to cite the EEBO texts (the images of the books themselves) just before this happened: Sam Kaislaniemi has an excellent blogpost on how one should cite books in the EEBO Interface (May, 2014), but his main point is replicated here:

When it comes to digitized sources, many if not most of us probably instinctively cite the original source, rather than the digitized version. This makes sense – the digital version is a surrogate of the original work that we are really consulting. However, digitizing requires a lot of effort and investment, so when you view a book on EEBO but only cite the original work, you are not giving credit where it is due. After all, consider how easy it now is to access thousands of books held in distant repositories, simply by navigating to a website (although only if your institution has paid for access). This kind of facilitation of research should not be taken for granted.

In other words, when you use digitized sources, you should cite them as digitized sources. I do see lots of discussions about how to best access and distribute (linked) open data, but these discussions tend to avoid the question of citation. In my perfect dream world every digital repository would include a suggested citation in their README files and on their website, but alas we do not live in my perfect dream world.

For reasons which seem to be related to the increasingly widespread use of the CC-BY licences, which allow individuals to use, reuse, and “remix” various collections of texts, citation can be a complicated aspect of digital collections, although it doesn’t have to be. For example, this site has a creative commons license, but we have collectively agreed that blog posts etc are due citation; the MLA and APA offer guidelines on how to cite blog posts (and tweets, for that matter). If you use Zotero, for example, you can easily scrape the necessary metadata for citing this blog post in up to 7,819 styles (at the time of writing). This is great, except when you want to give credit where credit is due for digitized text collections, which are less easy to pull into Zotero or other citation managers. And without including this information somewhere in the corpus or documentation, it’s increasingly difficult to properly cite the various digitized sources we often use. As Sam says so eloquently, it is our duty as scholars to do so.

Corpus repositories such as CoRD include documentation such as compiler, collaborators, associated institutions, wordcounts, text counts, and often include a recommended citation, which I would strongly encourage as a best practice to be widely adopted.

Here is a working list of best citation practices outlined for several corpora I am using or have encountered. These have been cobbled together from normative citation practices with input from the collection creators. (Nb. collection creators: please contact me with suggestions to improve these citations).

This is a work in progress, and I will be updating it occasionally where appropriate. Citations below follow MLA style, but should be adaptable into the citation model of choice.

Non-EEBOTCP
Folger Shakespeare Library. Shakespeare’s Plays from Folger Digital Texts. Ed. Barbara Mowat, Paul Werstine, Michael Poston, and Rebecca Niles. Folger Shakespeare Library, dd mm yyyy. http://folgerdigitaltexts.org/

Mueller, M. “Wordhoard Shakespeare”. Northwestern University, 2004-2013. Available online: http://wordhoard.northwestern.edu/userman/index.html

Mueller, M. “Standardized Spelling WordHoard Early Modern Drama corpus, 1514-1662”. Northwestern University, 2010. Available online: http://wordhoard.northwestern.edu.

Mueller, M. “Shakespeare His Contemporaries: a corpus of Early Modern Drama 1550-1650”. Northwestern University, 2015. Available online: https://github.com/martinmueller39/SHC/

EEBO-TCP access points:
There are several access points to the EEBOTCP texts, and one problem is that the text IDs included don’t always correspond to the same texts in all EEBO viewers, as Paul Schaffner has described.

Benjamin Armintor has been exploring the implications of this on his blog, but in general if you’re using the full-text TCP files, you should be citing which TCP database you are using to access the full-text files. Where appropriate, I’ve included a sample citation as well.

1. For texts from http://quod.lib.umich.edu/e/eebogroup/, follow the formula below:

Author. Title. place: year, Early English Books Online Text Creation Partnership, Phase 1, Oxford, Oxfordshire and Ann Arbor, Michigan, 2015. Available online: quod.umich.edu/permalink, accessed dd mm yyyy.

Webster, John. The tragedy of the Dutchesse of Malfy As it was presented priuatly, at the Black-Friers; and publiquely at the Globe, by the Kings Maiesties Seruants. The perfect and exact coppy, with diuerse things printed, that the length of the play would not beare in the presentment. London: 1623, Early English Books Online Text Creation Partnership, Phase 1, Oxford, Oxfordshire and Ann Arbor, Michigan, 2015. Available online: http://name.umdl.umich.edu/A14872.0001.001, accessed 5 August 2015.
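For anyone producing these citations in bulk, the pattern can be filled in programmatically. What follows is only a minimal Python sketch modelled on the shape of the Webster sample citation; the function name `cite_umich_tcp` and its parameters are hypothetical and are not part of any TCP tooling.

```python
from datetime import date

# Fixed imprint text shared by every Phase 1 citation in this pattern.
TCP_PHASE1_IMPRINT = (
    "Early English Books Online Text Creation Partnership, Phase 1, "
    "Oxford, Oxfordshire and Ann Arbor, Michigan, 2015"
)

# Spelled out by hand so the output does not depend on the system locale.
MONTHS = (
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December",
)


def cite_umich_tcp(author, title, place, year, permalink, accessed):
    """Build a citation string for a text read via the Michigan interface.

    All parameter names are illustrative; `accessed` is a datetime.date.
    """
    accessed_on = f"{accessed.day} {MONTHS[accessed.month - 1]} {accessed.year}"
    return (
        f"{author}. {title}. {place}: {year}, {TCP_PHASE1_IMPRINT}. "
        f"Available online: {permalink}, accessed {accessed_on}."
    )


# Usage: the Webster example, with the title shortened here for space.
citation = cite_umich_tcp(
    author="Webster, John",
    title="The tragedy of the Dutchesse of Malfy",
    place="London",
    year=1623,
    permalink="http://name.umdl.umich.edu/A14872.0001.001",
    accessed=date(2015, 8, 5),
)
print(citation)
```

Keeping the imprint in one constant means a future Phase 2 variant would only need a second constant, not a second function.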

2. For the Oxford Text Creation Partnership Repository (http://ota.ox.ac.uk/tcp/) and the searchable database there:

Author. Title. Early English Books Online Text Creation Partnership, phase I: Oxford, Oxfordshire and Ann Arbor, Michigan, 2015 [place: year]. Available online at http://ota.ox.ac.uk/tcp/IDNUMBER; Source available at https://github.com/TextCreationPartnership/IDNUMBER/.

Rowley, William. A Tragedy called All’s Lost By Lust. Early English Books Online Text Creation Partnership, phase I: Oxford, Oxfordshire and Ann Arbor, Michigan, 2015 [London: 1633]. Available online: http://tei.it.ox.ac.uk/tcp/Texts-HTML/free/A11/A11155.htm; Source available at: https://github.com/TextCreationPartnership/A11155/

3. The entire EEBO-TCP Github repository

Early English Books Online Text Creation Partnership, Phase I. Early English Books Online Text Creation Partnership, phase I: Oxford, Oxfordshire and Ann Arbor, Michigan, 2015. Available online: https://github.com/textcreationpartnership/Texts

If you are citing bits of the TCP texts as part of the whole corpus of EEBO-TCP, it makes the most sense to parenthetically cite the TCP ID as its identifying characteristic (following corpus linguistic models). So for example, citing a passage from the Dutchesse of Malfy above would include a parenthetical giving the unique TCPID (A14872).
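If you are working from file names or permalinks rather than a metadata table, the TCP ID can be pulled out with a short script. This is a rough Python sketch, not an official tool: the helper names are hypothetical, and the pattern assumes that a TCP ID is one capital letter followed by five digits, which matches the IDs quoted in this post but may not hold for every TCP collection.

```python
import re

# Assumption: a TCP ID looks like one capital letter plus five digits,
# as in the IDs quoted in this post (A14872, A11155, ...).
TCP_ID_PATTERN = re.compile(r"\b[A-Z]\d{5}\b")


def find_tcp_id(source):
    """Return the first TCP-ID-shaped token in a URL or filename, or None."""
    match = TCP_ID_PATTERN.search(source)
    return match.group(0) if match else None


def quote_with_tcp_id(passage, source):
    """Append a parenthetical TCP ID to a quoted passage."""
    tcp_id = find_tcp_id(source)
    if tcp_id is None:
        raise ValueError(f"no TCP ID found in {source!r}")
    return f'"{passage}" ({tcp_id})'


# Usage with the Michigan permalink from the sample citation.
print(quote_with_tcp_id(
    "quoted passage",
    "http://name.umdl.umich.edu/A14872.0001.001",
))
```

The word-boundary anchors keep the pattern from matching shorter directory names such as `A11` in the Oxford HTML paths.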

(Presumably other Text Creation Partnership collections, such as ECCO and EVANS, should be cited in the same manner.)