Moby Dick is About Whales, or Why Should We Count Words?

Why are we interested in counting words? The immediate payoff is not always clear. Many of us are familiar with what I like to call the Moby Dick is About Whales model of quantitative work, wherein we generate some kind of word-frequency chart and the most dominant words turn out to be terms central to the overall story being told.

In the case made by Moby Dick is About Whales, we get words like WHALE, BOAT, CAPTAIN, and SEA presented as hugely important terms. Great! There is no doubt that these terms are important to Moby Dick. However, and this is crucial: there is nothing terribly groundbreaking about discovering these words are central to the world of Moby Dick. In fact, it is nothing we couldn’t have discovered if we sat down and read the book ourselves. (Another example of this phenomenon is ‘Shakespeare’s plays are about kings and queens’, lest it sound like I am picking on the 19c Americanists.)

One of the reasons the Moby Dick is About Whales model is so popular is that both humans and computers can handle the saturation of these words. WHALE, BOAT, CAPTAIN, and SEA are indeed very high-frequency words in the novel. But so are the tiny boring words like THE, OF, I, IS, ARE, WHO, DO, FOR, ON, WITH, YOU, SHE, HE, HIS, HER, BUT, WHICH, THAT, FROM. We tend not to notice these terms so much as readers, because they serve a specific function rather than delivering specific content-driven meaning. However, they are by far the most frequent terms in any given English-language document. The overall distribution of these terms often fluctuates based on the kinds of documents we are writing or reading.[1] We can contrast these function words – which have some sort of grammatical purpose, first and foremost – with words that have some kind of content-driven purpose. These content words are often the words that make up what a document is about, which makes them easier to keep track of and care about.

In 1935 and 1949, G. K. Zipf formulated a now-famous postulation, called Zipf’s law: within a group or corpus of documents, the frequency of any word is inversely proportional to its rank in a frequency table.[2] Thus, the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, and so on. Zipf’s law, which can be a bit hard to visualize from that description alone, is often presented as a logarithmic distribution that looks like Figure 1.

Figure 1. Zipf’s law, visualized.
zipf.png

Our very high-frequency function words are way up at the top of the y-axis – they are everywhere, to the point of over-saturating. (You could go back and take a highlighter to every one of them in this essay, and you would find that the majority of the document is covered in highlighter.) As readers, writers, and speakers, we simply don’t notice them because there are too many of them to keep track of with our puny human minds. We do, however, pay a lot more attention to lower-frequency terms, in part because there is so much variation on the far right-hand side of this graph. These contentful words are much less likely to occur, so the variation in these terms is ultimately much more meaningful to readers. In the middle, where the curve bends into its L-shape, is what I will call the sweet spot, where medium-high-frequency words start to creep into the domain of the very noticeable. Here we have high-saliency content words like WHALES and BOATS that are hugely obvious to readers: they are frequent enough that we notice them, but not so frequent that they basically become invisible.
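
If you want to see this distribution for yourself, a few lines of Python will produce it. This is a minimal sketch rather than anyone’s canonical pipeline: it assumes a plain-text copy of the novel saved locally as moby_dick.txt (the filename and the very rough tokenizer are illustrative, e.g. the Project Gutenberg text), counts every word, and plots frequency against rank on log-log axes, which is where Zipf’s near-straight line shows up.

```python
import re
from collections import Counter

import matplotlib.pyplot as plt

# Illustrative filename: any plain-text copy of Moby Dick (e.g. from Project Gutenberg).
with open("moby_dick.txt", encoding="utf-8") as f:
    text = f.read().lower()

# A very rough tokenizer: runs of letters, optionally with an internal apostrophe.
words = re.findall(r"[a-z]+(?:'[a-z]+)?", text)
freqs = Counter(words)

# Rank the words from most to least frequent and plot frequency against rank.
ranked = freqs.most_common()
ranks = range(1, len(ranked) + 1)
counts = [count for _, count in ranked]

plt.loglog(ranks, counts)
plt.xlabel("rank")
plt.ylabel("frequency")
plt.title("Rank-frequency distribution, a la Zipf")
plt.show()

print(ranked[:10])  # the usual suspects: the, of, and, a, to ...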

So, the interpretive argument that Moby Dick is About Whales is pretty dependent on words that are of high enough frequency to be noticeable to a linear reader, but low-frequency enough to be content-driven. The challenge in doing quantitative literary analysis, then, is pushing past the obvious stuff in the output. This is not to say that Moby Dick is About Whales is a waste of time – but it should be a way to guide a more complex analysis. Part of the reason it is so easy to criticize the Moby Dick is About Whales model of digital scholarship is that Moby Dick is a long book, and a rather complicated one at that, so it makes sense to want to get the big picture. But there is much more going on in this novel: there are big questions surrounding the ideas of isolation, homosociality, and self-discovery. One of the joys of working with literary language is that we have to offer interpretations of what is happening beyond the most obvious level. When we accept the Moby Dick is About Whales model of scholarship, we are accepting the C-student interpretation of the novel. Most of us who teach literature expect our students to be able to read and interpret beyond the most obvious models of what a book is ‘about’: it is fine if your biggest takeaway from reading the novel is that it is about boats and whales, because the book is indeed about boats and whales. But this is the start of the conversation, not the end of it. When it comes to word counting, we have a quite accessible way of discussing language, style, and variation.

For example, I looked up the use of ‘ship’ or ‘ships’ in Moby Dick. ‘Ship(s)’ appears 607 times in total across the entire novel; ‘boat’ or ‘boats’ appears 484 times. Now, if you were anything like me, you spent most of your school years despairing over math problems about Mary having 64 oranges and Tom having 33 apples. Who cares? Why do these people have so much fruit? But in the context of a literary analysis, there is a fascinating question to ask: why does Melville use ‘ship’ so much more frequently than ‘boat’? Doing a survey of keyword-in-context hits for both sets of terms, I observed some general patterns in their usage. ‘Ship’ is largely used as part of a phrase: “that ship”, “the ship”, “whale-ship”, whereas ‘boat’ is largely used in a way that shows possession: X’s boat, her boats, his boats, the boat, (a number of) boats, whale boat. In other words, ‘boat’ shows much more variation than ‘ship’ does in this novel.
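
Getting those totals requires nothing fancier than adding two dictionary lookups together. A minimal sketch along the same lines as above (again assuming a local plain-text moby_dick.txt; the exact counts you get will depend on which edition and which tokenizer you use):

```python
import re
from collections import Counter

with open("moby_dick.txt", encoding="utf-8") as f:
    freqs = Counter(re.findall(r"[a-z]+", f.read().lower()))

# Add the singular and plural forms together before comparing the two piles.
ship_total = freqs["ship"] + freqs["ships"]
boat_total = freqs["boat"] + freqs["boats"]
print(f"ship(s): {ship_total}, boat(s): {boat_total}")
```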

Here’s another example, culled from a word cloud of the most frequent contentful words in Moby Dick. ‘Old’ appears 450 times; ‘new’ appears 99 times. ‘Old’ is often used as part of the phrase ‘old man’ (and, to a lesser extent, in reference to Ahab, who is called ‘old Ahab’). Are these the same person? That’s a research question. Meanwhile, ‘new’ largely applies to places, like New England, New Bedford, and New Zealand. Although old and new are theoretically antonyms, one references a person and the other references a place. Are they serving the same purpose? Something similar happens for ‘sea’ compared to ‘ocean’ (taken from the same word cloud): ‘sea’ appears 455 times, whereas ‘ocean’ appears 81 times. The use of ‘ocean’ is more specific, showing up in constructions like ‘the ocean’, ‘Pacific Ocean’, and ‘Indian Ocean’. Meanwhile, ‘sea’ suggests more movement: a sea(-something), at sea, the sea.

One final point to make: computers are very good at keeping track of presence and absence for us. There are around 15 chapters in Moby Dick in which nobody talks about whales at all and boats are discussed in great detail; this is something we may not notice as linear readers, but it is something that computers are very good at showing us. As readers we continue to understand that the book remains about whales and boats. But at the level of less-obvious lexical variation, what does this look like? ‘Captain Ahab’ appears 62 times in Moby Dick; ‘Ahab’ appears 517 times, and ‘captain’ appears 329 times. When does Ahab get to be called Captain Ahab? When is he just called by his first name? Who calls him by which name? Why does this matter to our understanding of the novel overall?

To develop these kinds of research questions we haven’t done anything more complicated than simple arithmetic. We did a little bit of addition (we added together the overall frequencies of ‘boat’ and ‘boats’), and then we compared the size of several groups of things against each other, by asking which pile was bigger than the others. Other than that, I simply returned to the text using a concordance’s keyword-in-context viewer to find examples of our terms and observed the ways our terms worked in practice. Computers are good at finding patterns and people are good at interpreting patterns. Moreover, without doing anything much more complicated we can ask much more interesting questions. My favourite example of this is that the word ‘she’ is comparatively underused in Macbeth compared with other Shakespeare plays, a fact that my friends who study Shakespeare are always shocked by. Lady Macbeth is the most important figure! She is the driving force behind the whole play! Yes, this is true, but also nobody speaks about her in great detail until she falls ill in Act 5, Scene 1. We could have found this fact out by sitting down with highlighters and a copy of the text, but computers make this whole process so much easier, allowing us to get to more interesting questions that linear readers of a text may not always be able to observe.
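
The keyword-in-context step does not need a dedicated concordancer either. Here is a minimal sketch of a KWIC viewer – the window size and tokenizer are arbitrary choices of mine, not any tool’s canonical behaviour – which prints each hit for a keyword with a few words of context on either side.

```python
import re

def kwic(path, keyword, window=5):
    """Print every occurrence of keyword with `window` words of context on each side."""
    with open(path, encoding="utf-8") as f:
        tokens = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", f.read())
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + window + 1])
            print(f"{left:>40}  [{token}]  {right}")

# e.g. every 'boat' in Moby Dick, or every 'she' in a plain-text Macbeth
kwic("moby_dick.txt", "boat")
```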

“Quantification”, as Morris Halle says, “is not everything”. Bearing that in mind, we must consider what we can do with the ability to look at texts in a non-linear fashion: the ability to move between close reading and a more bird’s-eye view of a corpus is truly the most powerful thing counting words can offer us.

_____

[1] See https://www.wordfrequency.info/free.asp for a full list of the overall most frequent terms in English; genre is an important feature to consider when it comes to these fluctuations!

[2] See George K. Zipf (1935) The Psychobiology of Language. Houghton-Mifflin, and George K. Zipf (1949) Human Behavior and the Principle of Least Effort. Cambridge, Massachusetts: Addison-Wesley. The Wikipedia page is generally quite good for explaining this, too: https://en.wikipedia.org/wiki/Zipf%27s_law

Call for Papers – Revealing Meaning: Feminist Methods in Digital Scholarship

Posted August 2019 

This volume, tentatively titled Revealing Meaning: Feminist Methods in Digital Scholarship, will gather chapters in which digital methodologies engage directly with intersectional feminist scholarship. The sheer volume of data available online (primary and secondary source material; social, political, environmental, personal, etc.) offers the opportunity to investigate previously inaccessible or otherwise understudied research topics. As white, female, academic editors of this book, we recognize that the web is biased towards white, middle-to-upper class men (Wagner et al., 2015; Sengupta and Graham, 2017; Wellner and Rothman, 2019). Thus, we are interested in how digital research, broadly conceived, makes room for alternative methods and approaches and opens the conversation to new voices. 

The true promise of digital methods is being able to tell stories that were previously understudied or otherwise ignored on the basis of class, race, gender, etc. Choices of language, data, platform, and variables reveal researchers’ biases and experiences. By sharing stories about the way we choose to do our work, we can offer a more robustly intersectional approach to data-driven humanities research. We foreground method as our unifying theme, seeking discussions of how methodological choices shape the ways meaning is produced. 

The telling of these stories is itself feminist in nature. We are specifically interested in stories about research methods that describe the lived experience of doing the work rather than creating instructional guides for replicating a specific study. We hope this volume will create conversations amongst practitioners of digital scholarship. We intend to structure our collection around topics such as the ethics of working with data, critical views of technology, interoperability, maintenance and costs of human labour, and evolving research methodologies.

We invite contributions on topics such as: 

  • Data collection methods & collections as data
  • Privacy on the web
  • Difficulties locating data/information
  • Changing research questions based on data availability
  • Choices about digital representations of things/people
  • Adapting to existing platforms/programs vs. reinventing the wheel
  • Openness of methods/platforms/programs
  • Collaboration
  • Digital infrastructure, or lack thereof
  • Development of standards (ontologies/taxonomies)
  • Failure
  • When to walk away from a digital project

 

Important dates

Abstracts (500 words) & 3-pg CVs for all contributors – November 22nd

Full draft (5000 words) – June 2020

Peer review of relevant section/chapters – Summer 2020

Final drafts – November 2020

Junior scholars, people of colour, and members of other under-represented communities in academia are particularly encouraged to send proposals. Please send all materials to Kim Martin (kmarti20@uoguelph.ca) and Heather Froehlich (hgf5@psu.edu) by November 22, 2019 with the subject line ‘Revealing Meaning’. Feel free to contact us about any questions you may have.

 

Works Cited

Sengupta, A., and Graham, M. “We’re All Connected Now, So Why Is The Internet So White And Western?” The Guardian, October 2017.

Wagner, C., Garcia, D., Jadidi, M. and Strohmaier, M., 2015, April. It’s a man’s Wikipedia? Assessing gender inequality in an online encyclopedia. Proceedings of the Ninth International AAAI conference on Web and Social Media.

Wellner, G. and Rothman, T., 2019. Feminist AI: Can We Expect Our AI Systems to Become Feminist? Philosophy & Technology, pp. 1-15.

Using Voyant Tools in the Undergraduate Research Classroom

[note: this post is cross-posted to the Early Modern Recipes Online Collective (EMROC) blog. see it here.]

This semester I have partnered with Dr Marissa Nicosia (Penn State Abington) on an undergraduate research course she runs on Early Modern recipes, in collaboration with my colleague Christina Riehman-Murphy, as part of the larger Early Modern Recipes Online Collective initiative. In this course, students transcribe recipes from a 17th-century recipe book using Dromio (transcribe.folger.edu), learn about Early Modern food culture and history, and develop a lot of hands-on research experience with Marissa, Christina and me. Together, we and our students have been focusing on a medicinal and cookery book associated with Anne Western, owned by the Folger Shakespeare Library and affectionately called MS v.b.380.

This course has a pretty serious transcription element: one of the requirements was that each student would transcribe 40 openings using Dromio over the course of the semester. Since each student was responsible for submitting a Word file of their transcriptions to Marissa for grading, we ended up with substantial (but not complete) coverage of the volume to work with – and it could easily be loaded into Voyant Tools for some linguistic exploration.

Over the course of the semester, students have also grown increasingly comfortable with the differences between contemporary and early modern recipes with regard to both genre and format, so we wanted to get them to think about the language of recipes more pointedly. The students were already experts in the language and style of the author they were working with, and because they were so intimately familiar with the work they had already done, it was less of a hard sell to get them to step back to a more birds-eye view and consider what the language of their recipes looked like in aggregate.

Before class met, we asked the students to read three contemporary chocolate chip cookie recipes.[1] We had a brief discussion about the overall form of these contemporary recipes before dropping them into Voyant to practice reading from a birds-eye view. Chocolate chip cookie recipes, as you may have guessed, do not have a whole lot of variation, making this a pretty low-stakes way to introduce the various features of Voyant: word cloud, reading pane, general trends over time (not particularly useful for this genre), concordance, and some basic statistics.

three chocolate chip cookie recipes in Voyant

Once they were comfortable with the idea, we ramped up the stakes a little. In small groups, our students looked at their own transcriptions in Voyant, taking notes on what made their sections of the corpus similar to and different from each other. This process was designed to get the students to think about what their recipes were doing not just stylistically but linguistically, too: what are the lexical ingredients of their recipes? This primed them for discussions about polysemy (a pound of something vs. pound the ingredient) and words marking measurement (spoon). One group even discussed the importance of the verb mingling as a way to describe mixing things together in one student’s particular section of the recipe book.

Now that everyone was used to the software and the process, we looked at the full class corpus (a compiled file consisting of all the students’ submissions). Students and faculty partners practiced some close reading, identifying terms of interest and looking at them using Voyant’s concordance feature, including a variety of ingredients (sugar, water, butter, mace, rosemary) and verbs for actions chefs may use (again, back to ‘mingling’ and ‘stir’).

looking at the VB380 class corpus in Voyant Tools

And though we had been discussing the role of VB 380 as a medicinal and cookery book throughout the semester, this was thrown into sharp relief as we all thought about the language of the class corpus. Certainly one big surprise was the relative importance of ‘sugar’ compared to ‘water’ in VB380. We were very struck by the lack of fixed vocabulary across the recipes, even though we all had a pretty clear sense of expectation for the full-class corpus based on our earlier exploration. And, finally, we had a brief discussion about variation and the affordances of changing some of the data to deal with questions of spelling, authorship and authenticity.

While this was set up to be a discussion of linguistic features (nouns versus verbs, variation, etc.), many students commented on how salient certain terms had been to them as transcribers – yet these were deemed less important in the overall big picture provided by Voyant. Ultimately, this left our brilliant students thinking about the strengths and weaknesses of both Voyant’s birds-eye view and linear start-to-finish reading – which was an even better outcome than I could have asked for.

***

Additional notes:
This is related to another undergraduate classroom activity I have done a few times, where students read several online articles related to a theme, try out the birds-eye-view approach to the topic using Voyant, and then move on to trying it out on some of their own writing (forum posts, assignments, etc.) to think about their own style. It works pretty well as a way to get undergrads to think about their style and editing in a way that is a little less obvious than other formats.

Please also see Miriam Posner’s very excellent Investigating Texts With Voyant workshop (10 April 2019) for more ideas https://github.com/miriamposner/voyant-workshop/blob/master/investigating-texts-with-voyant.md

[1] For the curious: One was from Smitten Kitchen, one from Martha Stewart, and a recipe of their choice from the NYT’s giant chocolate chip cookie compendium. (If I was doing this again, I would maybe not use NYT, as it kept asking us to log in. Also if I was doing this again, I’d be bringing cookies into class!)

How I use Twitter as an academic

An enormous amount of academic life happens in digital spaces these days. The microblogging service Twitter, which has been around since 2007, has all but replaced the academic listserv in 2017. Despite the various ways that Twitter continues to struggle with content and user management, it has perhaps accidentally become a widely used professional space for academics to network, exchange ideas, and collaborate. When Twitter works like this, it is brilliant. When it does not, it becomes an incredibly dangerous place for people who are already in precarious positions (for any number of reasons: rank, social identity or identities, job market status, any intersection of the above, etc.).

It is becoming increasingly clear to me that academics in general – especially early-stage graduate students – are in desperate need of training, support, and guidance on professional social media spaces. Although more senior scholars are often on social media, they are also secure enough in their positions and in academic life in general that their experiences of social media and social networking can be very different from those of their graduate students.

I’ve been on Twitter since 2010, and I have seen this play out more than a few times, including as a graduate student myself. In these seven years I have maintained what I hope is a very professional profile and I have accidentally amassed a rather large following (in the 1000s). I would not go so far as to say that I am Internet Famous, but it is certainly rare now that I walk into a room and don’t know someone there. I try to be very modest about my internet life, but I also recognize that this is quite difficult when I occupy this space.

People have asked me for years how I do this. I tweeted yesterday about how I manage my own social media presence, which unexpectedly got a lot of interest. I thought it might be good to have these up more permanently for reference, as the very nature of Twitter is ephemeral (except for when it’s not). I’ve kept these in mostly 140-character bites, because brevity is often better than verbosity.

1. Don’t tweet anything that I would not want to see associated with me/my name/my likeness in the international news.

2. Mute people you have to follow for whatever political reason but actively dislike. (It happens.) They don’t know you have them on mute.

3. Mute words that you don’t need to see. I’m not a basketball fan, so I have “March Madness” on mute, for example.
I use Tweetdeck to mute individual words. This link explains how you do that on regular Twitter.

4. Everything you say on Twitter is public and reaches lots of people you don’t know.

5. 140 characters (or 280, depending on who you are now) is very, very flattening. Assume the recipient will have the worst interpretation.

6. Twitter is a space for networking and making friends, but also your seniors are watching you. They will write your letters of rec one day. (If you are up for tenure, whatever, they are writing those letters too.)

7. Yelling about politics does not make you a better person, but it does make you feel part of a larger culture of dissatisfaction. If that makes you feel better, good for you. It can be more performative than anything else.

8. There is an art to being quiet about some things. This one is hard and takes practice.

9. You like the thing you study, so tweet about what you are doing. Be generous about what you know.

10. Give yourself regular days away from Twitter. People will still be there when you come back. Go outside, watch a film, have a life.

Heather’s 3 rules of doing digital scholarship

In my new job as Digital Scholarship Fellow in Quantitative Text Analysis, I’m starting to work with students and faculty in the Liberal Arts on the hows and whys of counting words in lots of texts. This is a lot of fun, especially because I get to hear a lot about what excites different scholars from across different disciplines and what they think is fascinating about the data they work with (or want to work with).

One thing that is kind of strange about my job – and there are several aspects of my job that have required some adjustment – is that my background is broadly in corpus linguistics and English literature, so I don’t always think of the work I do as being explicitly “DH”. These distinctions are quite frankly boring unless you are knee-deep in the digital humanities, and even then I am not convinced it is an interesting discussion to have. Ultimately, people have lots of preconceived notions about what DH is and why it matters. I suspect that different disciplines within the Humanities writ large have different ideas about this too – certainly the major disciplines I cut across (English, linguistics, history, computer science, sociology) have very different perspectives on the value and experience of digital scholarship. And of course, doing “digital” work in the humanities is kind of redundant anyway: we don’t talk about “computer physics” or “test tube chemistry”, as Mike Witmore and others have pointed out.

Being mindful of this, I have acquired a few rules for doing digital scholarship over the years, and I find myself saying them a lot these days. They are as follows:

1. “Can you do X” is not a research question.

The answer to “can you do X” is almost always “yes”. Should you do X? That’s another story. Can you observe every time Dickens uses the word ‘poor’? Of course you can. But what does it tell you about poverty in Dickens’ novels? Without more detail, this just tells you that Dickens uses the word ‘poor’ in his books about the working class in 19th century Britain — and you almost certainly didn’t need a computer to tell you that. But should you observe every time Dickens uses the word ‘poor’? Maybe, if it means he uses this over other synonyms for the same concept, or if it tells us something about how characters construct themselves as working-class, or if it tells us how higher status characters understand lower-status individuals, or whatever else. These are all research questions which require further investigation, and tell us something new about Dickens’ writing.

2. Programming and other computational approaches are not findings.

So you have learned to execute a bunch of scripts (or write your own) to identify something about your object of study! That’s great. Especially if you are in the humanities, this requires a certain kind of mind-bending: you have to think about logic structures, understand how computers process the information we provide, and in some cases overcome the deeply irregular rules which make your programming language of choice work. This is hard to do! You deserve a lot of commendation for having figured out how to do all of this, especially if your background is not in STEM. But – and this is hard to hear – having done this is not specifically a scholarly endeavour. This is a methodological approach. It is a means to an end, not a fact about the object(s) under investigation, and most importantly, it is not a finding. This is intrinsically tied to point #1: should you use this package or program or script to do something? Maybe, but then you have to be ready to explain why this matters to someone who does not care about programming or computers, but cares very deeply about whatever the object of investigation is.

3. Get your digital work in front of people who want to use its findings.

Digitally inflected people already know you can use computers to do whatever it is you’re doing. It may be very exciting to them to learn that this particular software package or quantitative metric exists and should be used for this exact task, but unless they also care about your specific research question, there is a limited benefit for them beyond “oh, we should use that too”. However, if you tell a bunch of people in your specific field something very new that they couldn’t have seen without your work, that is very exciting! And that encourages new scholarship, exploring these new issues to those people your findings matter most to. You can tell all the other digital people about the work you’ve done as much as you want, but if your disciplinary base isn’t aware of it, they can’t cite it, they can’t expand on your research, and the discipline as a whole can’t move forward with this fact. Why wouldn’t you want that?

What’s a “book” in Early English Books Online?

Recently I have been employed by the Visualising English Print project, where one of the things we are doing is looking at improving the machine-readability of the TCP texts. My colleague Deidre has already released a plain-text VARDed version of the TCP corpus, but it is our hope to improve the machine-readability of this material even further.

One of the issues that came up in modernising and using the TCP texts has to do with non-English print. It has been previously documented that there are several non-English languages in EEBO – including Latin, Greek, Dutch, French, Welsh, German, Hebrew, Turkish and Algonquin. Our primary issue is that if a transcription in the corpus is not in English, it will be very difficult for an English-language text parser or word vector model to account for that material.

So our solution has been to isolate the texts which are printed in a non-English language, whether monolingually (e.g. a book in Latin) or bi- or tri-lingually (e.g. a Dutch/English book with a Latin foreword). Looking at EEBO-the-books is a helpful way to identify languages in print, as there are all sorts of printed cues to suggest linguistic variation, such as different fonts or italics to set a different language off from the primary language. It also means I get a chance to look at many of these non-English texts as they were printed and transcribed initially.
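
If you wanted to triage a large pile of transcriptions automatically before doing that kind of by-eye work, something like the sketch below is one rough way in. This is not our project workflow – just a minimal illustration, assuming plain-text versions of the transcriptions sitting in a local tcp_plaintext folder and the off-the-shelf langdetect package, which will be unreliable on Early Modern spelling and on short or mixed-language texts.

```python
from pathlib import Path

from langdetect import DetectorFactory, detect

DetectorFactory.seed = 0  # langdetect is probabilistic; fix the seed for repeatable guesses

# Assumed layout: one plain-text transcription per file in a local "tcp_plaintext" folder.
for path in sorted(Path("tcp_plaintext").glob("*.txt")):
    text = path.read_text(encoding="utf-8")
    if not text.strip():
        continue
    guess = detect(text[:5000])  # a sample of the text is enough for a rough guess
    if guess != "en":
        # Flag anything the detector does not read as English for a human to check.
        print(path.name, guess)
```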

Three years ago, I wrote a blog post about some Welsh-language material that I found in EEBO-TCP Phase I. In the intervening time I still have not learned Welsh (though I am endlessly fascinated by it), I still get lots of questions and clicks to this site related to Early Modern Welsh (hello, Early Modern Welsh fans), and I have since learned quite a lot more about how texts were chosen for inclusion in EEBO (it involves the STC; Kichuk 2007 is an excellent read on this topic for the previously uninitiated). So while that previous post asked “what makes a text in EEBO an English text?”, this post will ask “what makes a text in EEBO a book?”

In general, I think we can agree that in order to be considered a book or booklet or pamphlet, a printed object has to have several pages. These pages can be created either by folding one broadside sheet, or by collecting several such folded sheets together (these collections are called gatherings). It may or may not have a cover, but it will be one of several sizes (quarto, folio, duodecimo, etc.). To this end, Sarah Werner has an excellent exercise on how to fold a broadside sheet into a gathering, which forms the basis for many, but probably not all, early books. Here is an example of a broadside that has clearly been folded up; it has been unfolded for digitization.

folded broadsheet, TCPID A17754

So it has been folded in a style that suggests it could be read like a book, but it is not necessarily a book: there is no distinct sense of each individual page, and some of the verso/recto pages would be unreadable unless they had been cut open.

In order to be available for digitization from the original EEBO microfilms, a text needed to be included in a short title catalogue. The British Library English Short Title Catalogue describes itself as

a comprehensive, international union catalogue listing early books, serials, newspapers and selected ephemera printed before 1801. It contains catalogue entries for items issued in Britain, Ireland, overseas territories under British colonial rule, and the United States. Also included is material printed elsewhere which contains significant text in English, Welsh, Irish or Gaelic, as well as any book falsely claiming to have been printed in Britain or its territories.

I select the British Library ESTC here because it covers several short title catalogues (Wing and Pollard & Redgrave are both included) and it’s my go-to short title catalogue database. Including “ephemera” is important, because it allows any number of objects to be considered as items of early print, even if they’re not really ‘books’ per se.

Such as this newspaper (TCPID A85603)…

newspaper A85603

Or this effigy, in Latin, printed on 1 broadside (TCPID A01919); click to see full-size

effigy A01919

Or this proclamation, also printed on 1 broadside (TCPID A09573)

proclamation A09573

Or this sheet of paper, listing locations in Wales (Wales! Again!) (TCPID A14651); click to see full-size

Welsh locations sheet, TCPID A14651

 

Or this acrostic (TCPID A96959); click to see full-size

acrostic

Interestingly, these are all listed as “one page” in the Jisc Historical Books metadata, though they are perhaps more accurately “one sheet”. While there’s no definitive definition of “English” in Early English Books Online, it’s becoming increasingly clear to me that there’s no definitive definition of “book” either. And thank god for that, because EEBO is the gift that keeps on giving when it comes to Early Modern printed materials.

10 Things You Can Do with EEBO-TCP Phase I

The following is a list of resources I presented at Yale University (New Haven, CT, USA) on 4 May 2016 as part of my visit to the Yale Digital Humanities Lab. Thank you again for having me! This resource list includes work by colleagues of mine from the Visualising English Print project at the University of Strathclyde. You can read more about their work on our website, or read my summary blog post at this link.

You can download the corresponding slides at this link and the corresponding worksheet at this link.

EEBO and the TCP initiative
Official page http://www.textcreationpartnership.org/tcp-eebo/
The History of Early English Books Online http://folgerpedia.folger.edu/History_of_Early_English_Books_Online
Transcription guidelines & other documentation http://www.textcreationpartnership.org/docs/
EEBO-TCP Tagging Cheatsheet: Alphabetical list of tags with brief descriptions http://www.textcreationpartnership.org/docs/dox/cheat.html
Text Creation Partnership Character Entity List http://www.textcreationpartnership.org/docs/code/charmap.htm

Access the EEBOTCP-1 corpus
#1 – Download the XML files https://umich.app.box.com/s/nfdp6hz228qtbl2hwhhb/ (see the plain-text extraction sketch after this list)
#2 – Search the full text transcription repository online
http://quod.lib.umich.edu/e/eebogroup/ *
or http://eebo.odl.ox.ac.uk/e/eebo/ *
or http://ota.ox.ac.uk/tcp/
* these are mirrors of each other
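
For #1, once the XML files are on disk you will probably want plain text out of them for counting. A minimal, hedged sketch (A00091.xml is just an example TCPID; this strips all markup indiscriminately, so it throws away the TEI structure, and files that use the TCP character entities from the charmap linked above may need those resolved before parsing):

```python
import xml.etree.ElementTree as ET

# Example TCPID only; substitute any downloaded EEBO-TCP XML file.
tree = ET.parse("A00091.xml")

# itertext() walks every element and yields its text content,
# flattening the markup into a single plain-text stream.
plain_text = " ".join(tree.getroot().itertext())
plain_text = " ".join(plain_text.split())  # collapse runs of whitespace

with open("A00091.txt", "w", encoding="utf-8") as out:
    out.write(plain_text)

print(plain_text[:500])
```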

#3 – Find a specific transcription
STC number vs ESTC number vs TCPID number
STC = specific book a transcription is from
ESTC number = “English Short Title Catalogue” (see http://estc.bl.uk/F/?func=file&file_name=login-bl-estc)
TCPID = specific transcription (A00091)

#4 – Search the full text corpus of EEBOTCP*
(*Can include EEBO-TCP Phase I only, or Phase I and Phase II; read documentation carefully)

EEBO-TCP Ngram reader, concordancer & text counts http://earlyprint.wustl.edu/ (big picture)
CQPWeb EEBO-TCP, phase I https://cqpweb.lancs.ac.uk/eebov3
#5 – Identify variant spellings
BYU Corpora front end to EEBO-TCP (*not completely full text but will be soon*) http://corpus.byu.edu/eebo (potential variant spellings; see also the EEBO-TCP Ngram reader, above)

Find specific information in the TCP texts…
#6 – Find a specific language
e.g. Welsh: *ddg*
#7 – Find a specific term or concept using the Historical Thesaurus of the OED (http://historicalthesaurus.arts.gla.ac.uk/), then trace it with the resources listed above

#8 Curate corpora
Alan Hogarth’s Super Science corpus uses EEBO-TCP provided metadata + disciplinary knowledge to curate texts about scientific writing

#9 Clean up transcriptions – Shota Kikuchi’s work
Using the VARD spelling moderniser + PoS tagging (Stanford PoS tagger and Tregex → improvements for tagging accuracy, syntactic parsing)

#10 Teach with it!
Language of Shakespeare’s plays web resource led by Rebecca Russell, undergraduate student (University of Strathclyde Vertically Integrated Project, English Studies / Computer Science)

Introducing “Social Identity In Early Modern Print: Shakespeare, Drama (1514-1662) and EEBO-TCP Phase I”

Today I am enormously pleased to introduce my PhD thesis, entitled “Social Identity In Early Modern Print: Shakespeare, Drama (1514-1662), and EEBO-TCP Phase I”. It investigates how Shakespeare’s use of terms marking social identity compares to that of his dramatist contemporaries and to Early Modern print more generally, using the Historical Thesaurus of the Oxford English Dictionary and the Early English Books Online Text Creation Partnership Phase I corpus.

Here are some facts about it: It ended up being 267 pages long; without the bibliography/front matter/appendices it’s 51,399 words long, which is about the length of 2 ⅓ Early Modern plays. There are 5 chapters and 3 appendices. It’s dedicated to Lisa Jardine, whose 1989 book Still Harping on Daughters: Women and Drama in the Age of Shakespeare changed my life, getting me interested in social identity in the Early Modern period in the first place. I am so grateful for Jonathan Hope and Nigel Fabb’s guidance. My examination is scheduled to be within the next two months, which is relatively soon. I’m excited to have it out in the world.

In lieu of an abstract I have topic modelled most of the content chapters for you (I had to take out tables and graphs, and some non-alphanumeric characters; I’m sure you will forgive me). For the unfamiliar, topic modelling looks for weighted lexical co-occurrence in text. You can read more about it at this link.
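
For the curious, the mechanics look roughly like the sketch below. It uses gensim’s LDA implementation, which is not necessarily the tool or the settings I used on the thesis chapters; the chapter filenames, the minimal preprocessing, and the choice of 25 topics are all illustrative.

```python
from pathlib import Path

from gensim import corpora, models

# Illustrative input: pre-cleaned chapter texts (lowercased, stopwords and
# non-alphanumeric characters already stripped), one token list per chapter.
chapter_files = sorted(Path(".").glob("chapter*.txt"))
chapters = [path.read_text(encoding="utf-8").split() for path in chapter_files]

dictionary = corpora.Dictionary(chapters)
bow_corpus = [dictionary.doc2bow(chapter) for chapter in chapters]

# 25 topics to match the list below; the number of topics is a modelling choice.
lda = models.LdaModel(bow_corpus, num_topics=25, id2word=dictionary, passes=10)

for topic_id, top_words in lda.print_topics(num_topics=25, num_words=10):
    print(topic_id, top_words)
```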

A note on abbreviations: The SSC (“Standardized Spelling WordHoard Early Modern Drama corpus, 1514-1662”, Martin Mueller, ed. 2010) is a subcorpus of Early Modern drama from EEBO-TCP Phase I which I use to set up comparisons between Shakespeare and his dramatist contemporaries. It generally serves as a prototype for Mueller’s later Shakespeare His Contemporaries corpus (2015). EEBO-TCP is the Early English Books Online Text Creation Partnership Phase I corpus, and HTOED is the Historical Thesaurus of the Oxford English Dictionary. WordHoard is a complete set of Shakespeare’s plays, deeply tagged for quantitative analysis.

1. play act suggests titus scene time bigram tragic iv strategy naming change setting hamlet shift iii downfall beginning serves construction established lack direct tamora revenge hero moment modes cleopatra andronicus
2. forms similar considered general categories surrounding directly position multiple common found listed contexts illustrates examples plays contrast concept variant sample kinds small included covering past perceived suggested thinking variety investigated
3. concordance search query view plot title instance queen lines cqpweb plots image string process line antconc average regular aggregated queries images normalized imagemagick size page length produce expressions bible adjacent
4. gentry status figures table models classes social high members nevalainen raumolin upper archer brunberg model nobility part identifying explicitly servants references presented clergy commoner account commoners noble nobles system ranking
5. england make acts paul writing true thou evidence back primary conversion festus thee threat noble art long sermon doctor christianity madde effect john audience speak verse church apostles truth argument
6. man lady generally native understood men latinate positive good beauteous virtuous character contrast register night proper adjectives poor takes lost love stylistic day raising formal gentle sweet honourable qualities dream
7. social class vocatives status titles specific marking vocative comedies identify tragedies histories indicative feature busse absence broadly pattern courteous moments middle start markers visualisation names individuals world finish amount presence
8. characters plays character title necessarily number total play mistress high roles list speaking rank comedy considered minor remains duke romeo ephesus syracuse juliet marriage analysis attendants lords measure merchant taming
9. social historical study literary language identity linguistic features studies culture evidence primarily feminist sociolinguistic criticism thesis claims community cultural approach methods scholars personal identify research previously renaissance understand central jardine
10. part discuss question bedlam stage represent order shown socially shakespearean describe honest invoking earlier ford duchess run discussion instability neely husband alongside ridge acting issues bellamont put accounts considers whore
11. woman collocates lady register knave nouns lower wench man terms unique term binary negative low lord high status variation show describe gendered collocate lexical semantic higher describes formality individuals noun
12. individual lord ways include reference sir politeness set result king henry applied strategies referenced parts address othello court philaster semantically marker due master wife introduce kind cardinal correlation question structures
13. female male characters gender women means dialogue patterns feminine find fewer marked idea role objects masculine grammatical race dependent skewed understood heavily ambiguous distinctions body rackin finally century negation interest
14. shows figure compared million authors instances unlike difference clear difficult contemporary specifically previous usage frequency comparatively ssc picture rest consistently shakespeare begin strong scenes times john place possibility england thomas
15. shakespeare plays show contemporaries larger chapter variation compare history dramatists relationship dramatist genre attention observe writing compares style divide edited internal exceptional assigned studied cover challenge correlate finding worthy lear
16. mad sanity actions phrase claim religious reason threatening face gram suggest establish dangerous prove suggesting potentially claims orleans biting attempts state explored evidence accusations grams justify corvino recurring circumstances accused
17. corpus corpora section ssc spelling included including making full range data meaning genres author frequently identify drama represents wider investigation comparisons nature side discussions observe metadata wide separate dialogues smaller
18. present hand http made oed malvolio works project head makes ii complete issue end fool ability thesaurus www requires illustrate kay dictionary org gutenberg hope oxford referencing initially house uk
19. early modern tcp eebo drama english phase period print online books creation widely playwrights partnership focus limited handful short jonson pronouns catalogue publication diachronic selected initiative recently years markedly norm
20. words examples frequent function highly word potential table left phrases node analysis ranked span pronoun content unclear multi lexical keyword context sense luna verbs collocate count finds unit highlights large
21. dramatic structure narrative action characterization model outlined freytag shakespearean visible work understanding point argues schmidt end tragedy jockers genre introduced constructions construct plausible description sudden suggest neuberg fiske climax reasons
22. texts text information based digital lists wordhoard personae dramatis mueller editions printed view folger selection designed source finally print includes burns ultimately discussing machine nameless edition readable time critical provided
23. madness terms mental htoed synonyms relevant illness form crazy melancholy historically health person related synonymous brained symptoms insanity scientific physical medical dement linked term relating semantic crazed variations definition discussed
24. test lemma collocates probability results likelihood collocational information analysis log dice lemmata top specific tests coefficient collocation based squared method frequency mutual relationships conditional chi appears phi produce methods wordhoard
25. form texts speaker culpeper discourse speech world context category spoken constructed kyt hearer addressee al biber analyses interaction writing written lutzky interactions fictional imaginary relevant circumstances addition person people real

7 reasons why I think this Hebrew-Latin book from 1683 is really cool

A few years ago I wrote about non-English language printing in EEBO, a post which still gets a fair amount of traffic and a lot of people asking me about Welsh. So when I found a bilingual Latin/Hebrew book in EEBO on Friday night while searching for something else, just as I was getting ready to go meet some friends for dinner, I was overjoyed. This is a book printed in Cambridge, England, in 1683, and it contains two languages which are very much not English.

JISC’s EEBO portal lists the title as “Komets leshon ha-koresh ve-ha-limudim = Manipulus linguae sanctae & eruditorum : in quo, quasi, manipulatim, congregantur sequentia, I. index generalis difficilorum vocum Hebraeo-Biblicarum, irregularium, & defectivarum, ad suas proprias radices, & radicum conjugationes, tempora, & personas, &c. reductarum” (R1614 Wing), describing it briefly as a Hebrew grammar (with the first four words of the title transliterated from Hebrew). My years of Hebrew school did not leave me a fluent Hebrew speaker or reader, and I have no formal Latin or bibliographic training, but this book is really cool. Here are some reasons why…

This isn’t the title page, but it is the introductory material and you can see that it contains Hebrew, Latinate and Greek characters on the same page:

Screen Shot 2015-12-12 at 3.45.54

For starters, this is a bilingual grammar and index to the Old Testament, serving in some ways as a precursor to my digital concordances. But it is also fascinating because it involves several different typefaces representing several different languages, so someone in 1683 had either created a typeface for Hebrew or had access to a Hebrew typeface to print this book. Furthermore, Hebrew has a script form and a block letter form; the block letters are often used in printing, whereas script is much more common elsewhere. Torahs are hand-copied onto vellum (even today!), so it is plausible someone may have had to transform each scripted character into block letters for this.

Hebrew is read from right to left, whereas Latin is read from left to right, so this book had to be very carefully typeset to put these two languages back to back. Hebrew also has a vowel system which is optional in print; the vowel marks are usually found under the consonants. Torahs often do not use the vowel system, so their inclusion here (look for the lines, dots and small T’s) is interesting and an extra complication for typesetting.

Screen Shot 2015-12-12 at 3.46.25

The catchwords at the bottom of the page are printed in Hebrew here, but the book uses Latinate numbering. And – as my mother pointed out – entries are listed alphabetically in Hebrew (not in Latin).

Screen Shot 2015-12-12 at 3.46.39

It also includes a list of ambiguities, still written in both languages, and still juxtaposed with a left-to-right and right-to-left language.

Screen Shot 2015-12-12 at 3.46.53

So this is already interesting from a printing perspective, but then there are also grammatical notes and commentaries included, with descriptions of how to use this grammar. And still the juxtaposition of both languages on the same line is really fascinating:

Screen Shot 2015-12-12 at 3.47.06 Screen Shot 2015-12-12 at 3.47.21

From the grammatical guide, here is a table of conjugations in Hebrew, marked with Latin descriptions (active, passive, future, participles, etc):

Screen Shot 2015-12-12 at 3.45.25

And finally it ends in a two-column translation of Hebrew text into Latin:

Screen Shot 2015-12-12 at 4.06.43

Download the EEBO scan as a PDF for more.

Ways of Accessing EEBO(TCP)

On October 28, 2015, the Renaissance Society of America sent an email to all members announcing the demise of their previous partnership with ProQuest (now in control of ExLibris too). Their email to all of us, in full:

The RSA Executive Committee regrets to announce that ProQuest has canceled our subscription to the Early English Books Online database (EEBO). The basis for the cancellation is that our members make such heavy use of the subscription, this is reducing ProQuest’s potential revenue from library-based subscriptions. We are the only scholarly society that has a subscription to EEBO, and ProQuest is not willing to add more society-based subscriptions or to continue the RSA subscription. We hoped that our special arrangement, which lasted two years, would open the door to making more such arrangements possible, to serve the needs of students and scholars. But ProQuest has decided for the moment not to include any learned societies as subscribers. Our subscription will end a few days from now, on October 31. We realize this is very late notice, but the RSA staff have been engaged in discussions with ProQuest for some weeks, in the hope of negotiating a renewal. If they change their mind, we will be the first to re-subscribe.

This is truly terrible news, especially for anyone whose institution did not/could not subscribe to the ProQuest interface.

**EDIT 29 Oct 8:05pm**: the RSA confirms that access to EEBO via ProQuest will continue:

We are delighted to convey the following statement from ProQuest:

“We’re sorry for the confusion RSA members have experienced about their ability to access Early English Books Online (EEBO) through RSA. Rest assured that access to EEBO via RSA remains in place. We value the important role scholarly societies play in furthering scholarship and will continue to work with RSA — and others — to ensure access to ProQuest content for members and institutions.”

The RSA subscription to EEBO will not be canceled on October 31, and we look forward to a continued partnership with ProQuest.

Perhaps because the first set of TCP editions of the EEBO texts are now part of the public domain, this is supposed to be sufficient for scholars’ use. Of course, this is not true: the TCP texts are a facsimile of the EEBO images (themselves facsimiles of facsimiles). However inadequate the TCP texts are for someone without an EEBO subscription, I have been collecting links about how to access and use EEBO(TCP) for a number of years. Although the cancellation has since been overturned, the benefit of having all these resources listed together seems to justify their continued existence here. They are also available on my links page, but in the interest of accessibility, here they are replicated:

1 EEBO(TCP) documentation
Text Creation Partnership http://www.textcreationpartnership.org/
EEBO-TCP documentation http://www.textcreationpartnership.org/docs/
EEBO-TCP Tagging Cheatsheet: Alphabetical list of tags with brief descriptions http://www.textcreationpartnership.org/docs/dox/cheat.html
Text Creation Partnership Character Entity List http://www.textcreationpartnership.org/docs/code/charmap.htm
The History of Early English Books Online http://folgerpedia.folger.edu/History_of_Early_English_Books_Online
Using Early English Books Online http://folgerpedia.folger.edu/Using_Early_English_Books_Online

2 Access to EEBO(TCP) full texts (searchable)
Early English Books Online (EEBO): JISC historical books interface (UK, paywall, free access from the British Library Reading Room) http://historicaltexts.jisc.ac.uk/
Early English Books Online (EEBO): Chadwyck-Healey interface (outside UK, paywall; your mileage may vary by country) http://eebo.chadwyck.com/home

The Dutch National Library has off-site access, including full EEB (European books), ECCO (18C), TEMPO (pamphlets), for members, 15€/yr. Register online: inschrijven.kb.nl/index.php

EEBO-TCP Texts on Github https://github.com/textcreationpartnership/Texts
UMichigan TCP repository https://umich.app.box.com/s/nfdp6hz228qtbl2hwhhb/
UMichigan EEBO-TCP full text search http://quod.lib.umich.edu/e/eebogroup/*
University of Oxford Text Archive TCP full text search http://ota.ox.ac.uk/tcp/*
* These sites are mirrors of each other
See also 10 things you can do with EEBOTCP

EEBO-TCP Ngram reader, concordancer, & text counts  http://earlyprint.wustl.edu/
CQPWeb EEBO-TCP, phase I (and many others) https://cqpweb.lancs.ac.uk/
(Video guide to CQPWeb: https://www.youtube.com/watch?v=Yf1KxLOI8z8&list=PL2XtJIhhrHNTxjyZ5VSKUr0-4EuzJJDbe)
BYU Corpora front end to EEBO-TCP (*not completely full text but will be soon*) http://corpus.byu.edu/eebo

3 Other resources
English Short Title Catalogue (ESTC), http://estc.bl.uk/
Universal Short Title Catalogue (USTC) http://www.ustc.ac.uk/
Items in the English Short Title Catalogue (ESTC), via Hathitrust http://babel.hathitrust.org/cgi/mb?a=listis;c=247770968
LUNA, Folger Library Digital Image Collection http://luna.folger.edu/
Internet Archive Books https://archive.org/details/early-european-books
The Folger Digital Anthology of Early Modern English Drama http://digitalanthology.folger.edu/

Sarah Werner’s compendium of resources, incl digitised early books http://sarahwerner.net/blog/a-compendium-of-resources/
Laura Estill’s Digital Renaissance wiki page covers online book catalogues, digitised facsimiles, early modern playtexts online, print and book history, etc http://digitalrenaissance.pbworks.com/w/page/54277828/EarlyModernDigitalResources
(see also her very thorough guide to manuscripts online http://manuscriptresearch.pbworks.com/w/page/48026041/FrontPage)
Claire M. L. Bourne’s Early Modern Plays on Stage & Page resource list: http://www.ofpilcrows.com/resources-early-modern-plays-page-and-stage

Large Digital Libraries of Pre-1800 Printed Books in Western Languages http://archiv.twoday.net/stories/6107864/
The University of Toronto has a large number of Continental Renaissance text-searchable books online http://link.library.utoronto.ca/booksonline/
30+ digitised STC titles at Penn (free to use, from their collection) http://franklin.library.upenn.edu/search.html?filter.library_facet.val=Rare%20Book%20and%20Manuscript%20Library&q=STC%20collection%20%22sceti%22&sort=publication_date_sort%20asc,%20title_sort%20asc

UCSB Broadside Ballads Archive http://ebba.english.ucsb.edu/
Broadside Ballads Online http://ballads.bodleian.ox.ac.uk/

A database of early modern printers & sellers culled from the eMOP source documents https://github.com/Early-Modern-OCR/ImprintDB
(And their mirror of the ECCO-TCP texts: https://github.com/Early-Modern-OCR/TCP-ECCO-texts)

Database of Early English Playbooks (DEEP) http://deep.sas.upenn.edu/

How to save and download pdfs from the Chadwyck EEBO Interface https://www.youtube.com/watch?v=6u2B_MagrPc

And a crucial read from Laura Mandell and Elizabeth Grumbach on the digital existence of ECCO (Eighteenth Century Collections Online): http://src-online.ca/src/index.php/src/article/view/226/448

this page will update with more resources as they are available. email me with links: heathergfroehlich at gmail dot com // 15 Aug 2016