Recently I have been employed by the Visualising English Print project, where one of the things we are doing is looking at improving the machine-readability of the TCP texts. My colleague Deidre has already released a plain-text VARDed version of the TCP corpus, but it is our hope to further improve the machine-readability of these texts.
One of the issues that came up in modernising and using the TCP texts has to do with non-English print. It has been previously documented that there are several non-English languages in EEBO, including Latin, Greek, Dutch, French, Welsh, German, Hebrew, Turkish and Algonquin. Our primary issue is that if a transcription in the corpus is not in English, it will be very difficult for an English-language text parser or word vector model to account for this material.
So our solution has been to isolate the texts which are printed in a non-English language, whether they are monolingual (e.g. a book in Latin) or bi- or trilingual (e.g. a Dutch/English book with a Latin foreword). Looking at EEBO-the-books is a helpful way to identify languages in print, as there are all sorts of printed cues to suggest linguistic variation, such as different fonts or italics to set a different language off from the primary language. It also means I get a chance to look at many of these non-English texts as they were printed and transcribed initially.
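To be clear, our identification is done by looking at the printed books, not by a script. But for anyone wondering how a plain-text transcription might be pre-screened before a human looks at it, here is a minimal sketch of one possible heuristic: measure what share of a text's tokens are very common English function words, and flag anything below a cut-off for manual review. Everything here is an assumption for illustration: the word list, the 0.15 threshold, the function names and the sample identifiers are all hypothetical, and early modern spelling variation would need a much longer list than this.

```python
import re

# A tiny, illustrative set of common English function words.
# Assumption: a real list would be longer and include early modern spellings.
ENGLISH_FUNCTION_WORDS = frozenset({
    "the", "and", "of", "to", "in", "that", "is", "for",
    "with", "be", "not", "it", "as", "his", "by", "this",
})


def english_ratio(text: str) -> float:
    """Return the share of ASCII-letter tokens that are common English function words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        # No Latin-alphabet tokens at all (e.g. Greek or Hebrew type): treat as non-English.
        return 0.0
    hits = sum(1 for token in tokens if token in ENGLISH_FUNCTION_WORDS)
    return hits / len(tokens)


def flag_for_review(texts: dict[str, str], threshold: float = 0.15) -> list[str]:
    """Return the ids of texts whose English ratio falls below the (hypothetical) threshold."""
    return sorted(
        text_id for text_id, text in texts.items()
        if english_ratio(text) < threshold
    )
```

A heuristic like this would only ever produce a list of candidates: a mostly English book with a Latin foreword would sail straight past it, which is exactly why the printed cues on the page still matter.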
Three years ago, I wrote a blog post about some Welsh language material that I found in EEBO-TCP Phase I. In the intervening time I still have not learned Welsh (though I am endlessly fascinated by it), and I still get lots of questions and clicks to this site related to Early Modern Welsh (hello Early Modern Welsh fans). I have also learned quite a lot more about how texts were chosen to be included in EEBO (it involves the STC; Kichuk 2007 is an excellent read on this topic for the uninitiated). So while that previous post asked “What makes a text in EEBO an English text?”, this post will ask “What makes a text in EEBO a book?”
In general, I think we can agree that in order to be considered a book or booklet or pamphlet, a printed object has to have several pages. These pages can either be created by folding one broadside sheet, or the object will have a collection of these folded sheets (called gatherings). It may or may not have a cover, but it would be one of several sizes (quarto, folio, duodecimo, etc). To this end, Sarah Werner has an excellent exercise on how to fold a broadside paper into a gathering, which forms the basis for many, but probably not all, early books. Here is an example of a broadside that has clearly been folded up; it’s been unfolded for digitization.
So it has been folded in a way that suggests it could be read like a book, but it is not necessarily a book: there is no distinct sense of each individual page, and some of the verso/recto pages would be unreadable unless the folds had been cut.
In order to be available for digitization from the original EEBO microfilms, a text needed to be included in a short title catalogue. The British Library English Short Title Catalogue describes itself as
a comprehensive, international union catalogue listing early books, serials, newspapers and selected ephemera printed before 1801. It contains catalogue entries for items issued in Britain, Ireland, overseas territories under British colonial rule, and the United States. Also included is material printed elsewhere which contains significant text in English, Welsh, Irish or Gaelic, as well as any book falsely claiming to have been printed in Britain or its territories.
I select the British Library ESTC here because it covers several short title catalogues (Wing and Pollard & Redgrave are both included) and it’s my go-to short title catalogue database. Including “ephemera” is important, because it allows any number of objects to be considered as items of early print, even if they’re not really ‘books’ per se.
Such as this newspaper (TCPID A85603)…
Or this proclamation, also printed on one broadside (TCPID A09573)
Or this sheet of paper, listing locations in Wales (Wales! Again!) (TCPID A14651)
Or this acrostic (TCPID A96959)
Interestingly, these are all listed as “one page” in the Jisc Historical books metadata, though they are perhaps more accurately “one sheet”. While there’s no definitive definition of “English” in Early English Books Online, it’s becoming increasingly clear to me that there’s no definitive definition of “book” either. And thank god for that, because EEBO is the gift that keeps giving when it comes to Early Modern printed materials.