Against Cultural Heritage Wastelands

This is a version of remarks I prepared for The Future of Early Modern Marginalia roundtable at the 2023 Renaissance Society of America conference.

I want to use my 10 minutes today to discuss digital collections, an area of activity that is rapidly growing in academic libraries. Over the course of the pandemic especially, many of you will have encountered digital collections from major libraries (these include the Folger’s LUNA database, the Clark Library’s digital holdings, and the Digital Bodleian) but also from sources like HathiTrust in the US or vendor-provided databases like Adam Matthew, ProQuest, and Gale. These were lifelines in a scary time, providing access to tons of content we couldn’t otherwise access from our desks at home. The Collections as Data movement has been a major player in encouraging the rise of such collections with multiple uses beyond the reading room; university libraries in particular have undertaken major digitization projects with hopes of serving a variety of needs in response to their initial funding cycles.

For university-driven digital collections, we are especially eager to provide access to materials that will reach key stakeholders, such as student researchers. Moreover, locally-focused collections are more able to serve local needs, again putting us back into the realm of accessibility: not everyone can get to the library or wants to be in the rare books room, but we can serve this content up on the web and more people can use it. Or, you don’t have to go miles away to see one or two items. One of libraries’ major interests is to provide researchers access to materials, so this presents a good outcome. This is also especially desirable for folks at teaching-focused institutions, including small liberal arts colleges and regional-serving institutions (such as post-92s in the UK system), that might not have the resources available at their campus to really support a robust Special Collections experience for their students. Someone in Australia can go to Digital Bodleian and encounter important aspects of print history without stepping on a plane.

Digitization creates facsimiles of our holdings; this is also a process of preservation, ultimately. So, on the one hand, we want to prioritize materials that will get a lot of use, maximizing their access; on the other hand, performing high-quality digital preservation is expensive and computationally heavy. This means we have to be strategic about what gets digitized, how, when, and why. These digital collections need champions or else they slip into a condition I would like to describe as “cultural heritage wasteland” – a circumstance where we have digitized tons of things that nobody wants to use, or is otherwise able to use, or where vendors are able to hoover up huge collections of materials and sell them back to us at tremendous markup.

There is, however, a world of fragility inherent here: the internet turns over (at minimum) every 5-7 years. Huge swaths of internet history are gone, and we haven’t come up with a meaningful sustainable strategy for that beyond the Wayback Machine, which requires a lot of human intervention and incredible volumes of server space.[1] NISO and the National Digital Stewardship Alliance have discussed best practices in a variety of formats but nothing will ever match the sustaining power of “book time”, where the book is published, it sits there, and it persists.

Investment in longer-term digital preservation becomes a question of vulnerability: we want to maximize our uses of our collections as per the Santa Barbara statement and the many use case scenarios Collections as Data has produced, but also we are at a loss if all these collections turn over to never be used again. In conclusion, then, libraries need champions who can provide value – people who will create stories and afterlives of projects to prevent them from falling prey to cultural heritage wasteland, never to be seen again.

[1] There isn’t a good answer here, as Ben Goldman helpfully summarizes: “ideally, sustainability would promote a balance between environmental protection, economic development, and social equity. Solutions that disproportionately benefit one community at the expense of another would not be considered sustainable” (p. 275).


A Gentle Introduction to Excel and Spreadsheets for Humanities People

I understand that a lot of Humanities researchers do not use spreadsheets or Excel very often in their lives. It’s not something we’re really trained in, nor is it really that much of a necessity in academic humanities research. You often can get away without really having to use it for most of an academic career. That said, sometimes you get 10 Word doc pages deep into a table you’ve built and realize that this is maybe not the best way to deal with your information. Or maybe you are starting a data-driven project for the first time and you realize you need to learn something about this now, or you are suddenly in charge of a budget and need to manage it, or maybe you are seeking employment outside straight-up Humanities research that requires that you have a basic knowledge of Excel. In all these situations, you don’t normally use spreadsheets and you are now in a situation that requires you to engage with them.

Part of my job supporting quantitative digital scholarship regularly involves explaining how spreadsheets work to the reluctant Humanities-trained person. And often, people don’t want to admit they don’t know how something like spreadsheets work or that they are intimidated by them because they think they should know this already. It’s never been something we’ve been required to know. This isn’t a criticism — I had to learn Excel too! I remember having to do a little with it in computer class in, maybe, 5th grade and had to get more hands on and dirty with it in graduate school. To be clear, I’m not judging here — everyone starts somewhere.

So when a colleague working with me on a digital project that relied heavily on using data asked for a training to understand what spreadsheets do, I realized I had given versions of this a few times in an ad-hoc way and I could formalize it a little. I wrote a workshop that takes about 40 min to go through that helps you understand what spreadsheets are for, and talks a little about how we can use Excel to manipulate them in a variety of ways.

When I tweeted about this, the interest was huge and totally surprising. As a result I have made my whole workshop available here, including my comments, for you to download, learn, and teach with. My workshop is based on some data from Daniel Oehm’s SurvivoR package for R (which includes a link to an xlsx file of all his data, at the very bottom). I like this data set because I think it is relatively accessible to someone not very used to looking at data.

View and download the whole presentation as a .pptx file from Dropbox here. My notes do not appear to render in the preview provided by Dropbox, so if you want or need those you will have to download this file to your computer and open it in PowerPoint from there.


I use spreadsheets already but want some more experience with Excel. Is this for me? Honestly, probably not. This starts with the very basics. If you have experience working with data (in any sense) you’ll probably be better served by Data Carpentry’s lessons and trainings. I’ve borrowed pieces from them, and I like their general approach for teaching beyond the basics. I particularly like their Ecology lesson, if you want a recommendation.

Can I borrow these slides and adapt them for my audience? Absolutely! I would love to have people use and adapt this workshop for their own needs but please cite me when you use it. There’s a CC-BY license on the slide deck itself, too.

Duh, everyone knows what a cell and a row and a column are. We’re not THAT bad. OK, you might know this but someone else might not. Also, establishing a shared baseline vocabulary before getting to more complicated things is always a good idea!

Can I use Google Sheets instead of Excel? Sure! The main learning outcomes should translate well. The exact directions provided that show you how to do some of these tasks might be a little different, though, so be aware of that.

Why aren’t you doing this in R? R is great if you are used to looking at and thinking about data. But if you’re not super comfortable with using and reading spreadsheets in the first place, I think considering them as dataframes will be conceptually more difficult.

Excel is bad and you should use something else for this. There are plenty of limitations with every piece of software; Excel is definitely not a one-size-fits-all environment. But I’m trying to meet my people where they are, which is to say they’ve probably heard of Excel and probably have access to it through a Microsoft Office subscription. Google Sheets is similar enough that you should be able to transfer this to its interface too (though it is a little different, be forewarned!).

Why doesn’t this cover every aspect of best practices for data and management? This is an introduction to spreadsheets as a form of structuring data for people who aren’t used to thinking about this. While this is an important issue, there are lots of other opportunities to raise these issues that don’t fit into a less-than-an-hour long introductory workshop.

What about macros? Excel includes the ability to program some recurring tasks; they call this a “macro”. This presentation doesn’t cover them, but there are many other resources available to learn more about them if you google “excel macro tutorial”. If you’re anticipating doing a lot of repetitive data tasks like cleaning or standardizing your data, you might be better off using something like OpenRefine for this. (Here’s a great guided tour of OpenRefine from Scotty Carlson).

I would like to learn more about making good choices about organizing my data. Great! There’s a lot of ‘best practices’-style advice out there on the web but I think Hadley Wickham’s discussion of “tidy data” is particularly helpful. His paper introducing his theory of tidy data gets a little in the weeds towards the end, especially for beginners, but it clearly introduces the concepts he is thinking about otherwise.

I want more information or help on using spreadsheets?! I would love to help you – really honestly I would love to – but my time is tied to my institution and if you also aren’t there I probably can’t do that. Reach out to your local librarian and ask them to direct you towards support at your institution.

One of your links is broken. Thanks for catching that! Please email me at froehlich at arizona dot edu with this information and title your message “BROKEN LINK”.

Moby Dick is About Whales, or Why Should We Count Words?

Why are we interested in counting words? The immediate payoff is not always clear. Many of us are familiar with what I like to call the Moby Dick is About Whales model of quantitative work, wherein we generate some kind of word-frequency chart and the most dominant words turn out to be terms that are central to the overall story being presented.

In the case made by Moby Dick is About Whales we get words like WHALE, BOAT, CAPTAIN, SEA presented as hugely important terms. Great! There is no doubt that these terms are important to Moby Dick. However, and this is crucial: there is nothing terribly groundbreaking about discovering these words are central to the world of Moby Dick. In fact, it is nothing we couldn’t have discovered if we sat down and read the book ourselves. (Another example of this phenomenon is ‘Shakespeare’s plays are about kings and queens’, lest it sound like I am picking on the 19c Americanists.)

One of the reasons the Moby Dick is About Whales model is so popular is because both humans and computers can handle the saturation of these words. WHALE, BOAT, CAPTAIN, and SEA are indeed very high-frequency words in the novel. But so are the tiny boring words like THE, OF, I, IS, ARE, WHO, DO, FOR, ON, WITH, YOU, SHE, HE, HIS, HER, BUT, WHICH, THAT, FROM. We tend not to notice these terms so much as readers, because they serve a specific function rather than delivering specific content-driven meaning. However, these are by far the most frequent terms in any given English-language document. The overall distribution of these terms often fluctuates based on the kinds of documents we are writing/reading.[1] We can contrast these function words – which have some sort of grammatical purpose, first and foremost – against words that have some kind of content-driven purpose. These content words are often the words that make up what a document is about, which makes it easier to keep track of and care about these things.
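To make the function-word versus content-word contrast concrete, here is a minimal sketch in Python. It is not the tool I used for the counts in this post. The `FUNCTION_WORDS` list is a small hand-picked set for illustration only (it is an assumption, not a standard stopword list), and the sample sentence is invented rather than quoted from the novel.

```python
import re
from collections import Counter

# Hand-picked function words for illustration only (an assumption, not a
# standard list); real stopword lists are much longer.
FUNCTION_WORDS = {
    "the", "of", "i", "is", "are", "who", "do", "for", "on", "with",
    "you", "she", "he", "his", "her", "but", "which", "that", "from",
    "a", "and", "in", "to",
}

def tokenize(text):
    """Lowercase the text and return a list of alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def top_words(text, n=5, drop_function_words=False):
    """Return the n most frequent words, optionally skipping function words."""
    tokens = tokenize(text)
    if drop_function_words:
        tokens = [t for t in tokens if t not in FUNCTION_WORDS]
    return Counter(tokens).most_common(n)

# Invented sample sentence, not a quotation from Moby Dick.
sample = (
    "The whale rose from the sea and the captain of the boat "
    "called to the crew that the whale was near the boat."
)

print(top_words(sample, n=3))                            # function words dominate
print(top_words(sample, n=3, drop_function_words=True))  # content words surface
```

With the function words left in, "the" swamps everything else; once they are filtered out, the whale and the boat come to the top, which is the Moby Dick is About Whales effect in miniature.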

In 1935 and 1949, G. K. Zipf formulated a now-famous postulation, known as Zipf’s law, that within a group or corpus of documents, the frequency of any word is inversely proportional to its rank in a frequency table.[2] Thus, the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, etc. Zipf’s law, which feels a bit hard to visualize as a reader, is often presented as a logarithmic distribution that looks like Figure 1.

Figure 1. Zipf’s law, visualized.
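If the inverse relationship is hard to picture, a small worked example may help. The sketch below is in Python and uses a toy token list that I constructed to follow the idealized pattern exactly; real texts only approximate it. The function names are my own, for illustration. If frequency is inversely proportional to rank, then rank multiplied by frequency should stay roughly constant.

```python
from collections import Counter

def rank_frequency_table(tokens):
    """Return (rank, word, count, rank * count) rows, most frequent first."""
    counts = Counter(tokens).most_common()
    return [
        (rank, word, count, rank * count)
        for rank, (word, count) in enumerate(counts, start=1)
    ]

def zipf_expected(top_count, rank):
    """Idealized Zipf count at a given rank, given the count of the top word."""
    return top_count / rank

# Toy data built to fit the idealized pattern: 60, 60/2, 60/3, 60/4.
tokens = ["the"] * 60 + ["of"] * 30 + ["and"] * 20 + ["whale"] * 15

for rank, word, count, product in rank_frequency_table(tokens):
    # In this toy example the observed count matches the idealized one exactly.
    print(rank, word, count, product, zipf_expected(60, rank))
```

In a real corpus the rank-times-count column will wobble rather than hold perfectly steady, but it tends to stay in the same neighbourhood, which is what the curve in Figure 1 is showing.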

Our very high-frequency function words are way up on the top of the y axis – they are everywhere, to the point of over-saturating. (You could go back and take a highlighter to every one of them in this essay and you’ll find that the majority of your document is covered in highlighter.) As readers, writers, and speakers, we simply don’t notice them because there are too many examples of them to keep track of with our puny human minds. We do, however, pay a lot more attention to lower-frequency terms, in part because there is so much variation on the far right-hand side of this graph. These contentful words are much less likely to occur, so the variation in these terms is ultimately much more meaningful to readers. In the middle, where there is an L-shaped curve, is what I will call the sweet spot where medium-high-frequency words start to creep into the domain of very noticeable. Over here we have high-saliency content words like WHALES and BOATS that are hugely obvious to readers: they are approaching a high enough frequency that we notice them, but not so high-frequency as to basically become invisible.

So, the interpretive argument that Moby Dick is About Whales is pretty dependent on words that are high-enough frequency to be noticeable to a linear reader, but low-frequency enough to be content-driven. The challenge in doing quantitative literary analysis, then, is pushing past the obvious stuff in the output. This is not to say that Moby Dick is About Whales is a waste of time – but it should be a way to guide a more complex analysis. Part of the reason it is so easy to criticize the Moby Dick is About Whales model of digital scholarship is that Moby Dick is a long book, and a rather complicated one at that, so it makes sense to want to get the big picture. But there is much more going on in this novel: there are big questions surrounding the ideas of isolation, homosociality, and self-discovery. One of the joys of working with literary language is that we have to offer interpretations of what is happening beyond the most obvious level. When we accept the Moby Dick is About Whales model of scholarship, we are accepting the C-student interpretation of the novel. Most of us who teach literature expect our students to be able to read and interpret beyond the most obvious models of what a book is ‘about’: it is fine if your biggest takeaway from reading the novel is that it is about boats and whales, because the book is indeed about boats and whales. But this is the start of the conversation, not the end of it. When it comes to word counting, we have a quite accessible way of discussing language, style, and variation.

For example, I looked up the use of ‘ship’ or ‘ships’ in Moby Dick. ‘Ship’ or ‘ships’ appears 607 times in total across the entire novel. ‘Boat’ or ‘boats’ appears 484 times. Now, if you were anything like me, you spent most of your school years despairing over math problems about Mary having 64 oranges and Tom having 33 apples. Who cares? Why do these people have so much fruit? But in the context of a literary analysis, there is a fascinating question to ask: why does Melville use ‘ship’ so much more frequently than ‘boat’? Doing a survey of keyword in context hits for both sets of terms, I observed some general patterns in their usage. ‘Ship’ is largely used as part of a phrase: “that ship”, “the ship”, “whale-ship”, whereas ‘boat’ is largely used in a way that shows possession: X’s boat, her boats, his boats, the boat, (a number of) boats, whale boat. In other words, ‘boat’ shows much more variation than ‘ship’ does in this novel.
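The two steps described here, adding together the counts for the forms of a word and then reading each hit in its context, can be sketched in a few lines of Python. This is an illustration under my own assumptions, not the concordance software I actually used: the function names are hypothetical, the tokenizer splits hyphenated words like "whale-ship" into two tokens, and the sample sentence is invented rather than taken from the novel.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase and split into alphabetic tokens (hyphenated words split in two)."""
    return re.findall(r"[a-z]+", text.lower())

def combined_count(tokens, forms):
    """Add together the counts of every form in `forms` (e.g. boat + boats)."""
    counts = Counter(tokens)
    return sum(counts[form] for form in forms)

def kwic(tokens, forms, window=3):
    """Return (left context, keyword, right context) for each hit."""
    hits = []
    for i, token in enumerate(tokens):
        if token in forms:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, token, right))
    return hits

# Invented sample text, not a quotation from Moby Dick.
sample = "The ship sailed on. Ahab's boat was lowered, and his boats followed the whale-ship."
tokens = tokenize(sample)

print(combined_count(tokens, {"ship", "ships"}))
print(combined_count(tokens, {"boat", "boats"}))
for left, keyword, right in kwic(tokens, {"boat", "boats"}, window=2):
    print(f"{left:>15} [{keyword}] {right}")
```

Lining the hits up this way is what makes patterns like "X's boat" versus "the ship" visible: the counting gets you to the examples, and the reading is still up to you.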

Here’s another example, culled from a word cloud of most frequent contentful words in Moby Dick. ‘Old’ appears 450 times; ‘new’ appears 99 times. ‘Old’ is often used as part of the phrase ‘old man’ (and to a lesser extent in reference to Ahab, who is called ‘old Ahab’). Are these the same person? That’s a research question. Meanwhile, ‘new’ largely applies to places, like New England, New Bedford, and New Zealand. Although old and new are theoretically antonyms, one references a person and the other references a place. Are they serving the same purpose? Something similar happens for ‘sea’ compared to ‘ocean’ (taken from the same word cloud): ‘sea’ appears 455 times, whereas ‘ocean’ appears 81 times. The use of ‘ocean’ is more specific, showing up in constructions like ‘the ocean’, ‘Pacific Ocean’, and ‘Indian ocean’. Meanwhile, ‘sea’ suggests more movement: a sea(-something), at sea, the sea.

One final point to make: Computers are very good at keeping track of presence and absence for us. There are around 15 chapters in Moby Dick in which nobody talks about whales at all and we discuss boats in great detail; this is something we may not notice as linear readers but it is something that computers are very good at showing us. As readers we continue to understand that the book remains about whales and boats. But at the level of less-obvious lexical variation, what does this look like? ‘Captain Ahab’ appears 62 times in Moby Dick; ‘Ahab’ appears 517 times, and ‘captain’ appears 329 times. When does Ahab get to be called Captain Ahab? When is he just called by his first name? Who calls him by which name? Why does this matter to our understanding of the novel?
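Checking for presence and absence chapter by chapter is also simple to sketch. The Python below is a hedged illustration, not the method behind the chapter count above: it assumes a plain-text file in which each chapter begins on a line starting with "CHAPTER" and a number (as in many public-domain editions, though yours may differ), and the function names and toy text are my own inventions.

```python
import re

def split_chapters(text):
    """Split on lines that start with 'CHAPTER' followed by a number."""
    parts = re.split(r"(?m)^CHAPTER \d+.*$", text)
    # parts[0] is whatever precedes the first chapter heading, so skip it.
    return [part.strip() for part in parts[1:]]

def chapters_missing(chapters, forms):
    """Return the 1-based numbers of chapters containing none of `forms`."""
    missing = []
    for number, body in enumerate(chapters, start=1):
        vocabulary = set(re.findall(r"[a-z]+", body.lower()))
        if not vocabulary & forms:
            missing.append(number)
    return missing

# Invented three-chapter toy text, not the novel itself.
toy = """CHAPTER 1. Loomings
The whale is out there.
CHAPTER 2. The Boat
We mend the boat and its oars.
CHAPTER 3. The Chase
The boats chase the whales.
"""

chapters = split_chapters(toy)
print(chapters_missing(chapters, {"whale", "whales"}))  # chapters with no whales
print(chapters_missing(chapters, {"boat", "boats"}))    # chapters with no boats
```

The output is just a list of chapter numbers, but that list is exactly the kind of thing a linear reader is unlikely to keep track of unaided.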

To develop these kinds of research questions we haven’t done anything more complicated than simple arithmetic. We did a little bit of addition (we added together the overall frequencies of ‘boat’ and ‘boats’), and then we compared the size of several groups of things against each other, by asking which pile was bigger than the others. Other than that, I simply returned to the text using a concordance’s Keyword In Context viewer to find examples of our terms and observed ways our terms worked in practice. Computers are good at finding patterns and people are good at interpreting patterns. Moreover, without anything too much more complicated we can ask much more interesting questions. My favourite example of this is that the word ‘she’ is underused in Macbeth compared to other Shakespeare plays, a fact that my friends who study Shakespeare are always shocked by. Lady Macbeth is the most important figure! She is the driving force behind the whole play! Yes, this is true, but also nobody speaks about her in great detail until she falls ill in Act 5 Scene 1. We could have found this fact out by sitting down with highlighters and a copy of the text, but computers make this whole process so much easier, allowing us to get to more interesting questions that linear readers of a text may not always be able to observe.
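One small step beyond which-pile-is-bigger is worth sketching, because comparing a word across texts of different lengths (as with ‘she’ across plays) usually means dividing by the size of each text first. The Python below is a minimal illustration of that normalization, assuming a rate per 1,000 words; the function name and both toy texts are invented and say nothing about the actual figures for any play.

```python
import re

def per_thousand(text, word):
    """Occurrences of `word` per 1,000 tokens of `text`."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    return 1000 * tokens.count(word) / len(tokens)

# Two invented toy "texts" of different lengths.
text_a = "she said she would go " * 10      # 50 tokens, 20 of them 'she'
text_b = "he said she would go now " * 10   # 60 tokens, 10 of them 'she'

print(per_thousand(text_a, "she"))
print(round(per_thousand(text_b, "she"), 1))
```

This is still only division, but it lets you put a long text and a short one side by side without the longer one winning just by being longer.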

“Quantification”, as Morris Halle says, “is not everything”. In counting words, we must consider what we can do with the ability to look at texts in a non-linear fashion: the ability to move between close reading and a more bird’s eye view of a corpus is truly the most powerful thing counting words can offer us.


[1] See https://www.wordfrequency.info/free.asp for a full list of the overall most frequent terms in English; genre is an important feature to consider when it comes to these fluctuations!

[2] See George K. Zipf (1935) The Psychobiology of Language. Houghton-Mifflin, and George K. Zipf (1949) Human Behavior and the Principle of Least Effort. Cambridge, Massachusetts: Addison-Wesley. The Wikipedia page is generally quite good for explaining this, too: https://en.wikipedia.org/wiki/Zipf%27s_law

Call for Papers – Revealing Meaning: Feminist Methods in Digital Scholarship

Posted August 2019 

This volume, tentatively titled Revealing Meaning: Feminist Methods in Digital Scholarship, will gather chapters in which digital methodologies can engage directly with intersectional feminist scholarship. The sheer volume of data available online (primary and secondary source material, social, political, environmental, personal, etc) offers the opportunity to investigate previously inaccessible or otherwise understudied research topics. As white, female, academic editors of this book, we recognize that the web is biased towards white, middle-to-upper class men (Wagner et al, 2015, Sengupta and Graham, 2017, Wellner and Rothman, 2019). Thus, we are interested in how digital research, broadly conceived, makes room for alternative methods and approaches and opens the conversation to new voices. 

The true promise of digital methods is being able to tell stories which were previously understudied or otherwise ignored on the basis of class, race, gender, etc.  Choices of language, data, platform, and variables reveal researchers’ biases and experiences. By sharing stories about the way we choose to do our work, we can offer a more robustly intersectional approach to data-driven humanities research. We foreground method as our unifying theme, seeking discussions of how methodological choices impact the ways of producing meaning. 

The telling of these stories is itself feminist in nature. We are specifically interested in stories about research methods that describe the lived experience of doing the work rather than creating instructional guides for replicating a specific study. We hope this volume will create conversations amongst practitioners of digital scholarship. We intend to structure our collection around topics such as the ethics of working with data, critical views of technology, interoperability, maintenance and costs of human labour, and evolving research methodologies.

We invite contributions on topics such as: 

  • Data collection methods & collections as data
  • Privacy on the web
  • Difficulties locating data/information
  • Changing research questions based on data availability
  • Choices about digital representations of things/people
  • Adapting to existing platforms/programs vs. reinventing the wheel
  • Openness of methods/platforms/programs
  • Collaboration
  • Digital infrastructure, or lack thereof
  • Development of standards (ontologies/taxonomies)
  • Failure
  • When to walk away from a digital project


Important dates

Abstracts (500 words) & 3-pg CVs for all contributors – November 22nd

Full draft (5000 words) – June 2020

Peer review of relevant section/chapters – Summer 2020

Final drafts – November 2020

Junior scholars, people of colour, and other under-represented communities in academia are particularly encouraged to send proposals. Please send all materials to Kim Martin (kmarti20@uoguelph.ca) and Heather Froehlich (hgf5@psu.edu) by November 22, 2019 with the headline ‘Revealing Meaning’. Feel free to contact us about any questions you may have.


Works Cited

Sengupta, A., and Graham, M. “We’re All Connected Now, So Why Is The Internet So White And Western?” The Guardian, October 2017.

Wagner, C., Garcia, D., Jadidi, M. and Strohmaier, M., 2015, April. It’s a man’s Wikipedia? Assessing gender inequality in an online encyclopedia. Proceedings of the Ninth International AAAI conference on Web and Social Media.

Wellner, G. and Rothman, T., 2019. Feminist AI: Can We Expect Our AI Systems to Become Feminist?. Philosophy & Technology, pp.1-15.

Using Voyant Tools in the Undergraduate Research Classroom

[note: this post is cross-posted to the Early Modern Recipes Online Collective (EMROC) blog. see it here.]

This semester I have partnered with Dr Marissa Nicosia (Penn State Abington) on an undergraduate research course she runs on Early Modern recipes in collaboration with my colleague Christina Riehman-Murphy as part of the larger Early Modern Recipes Online Collective initiative. In this course, students transcribe recipes from a 17th century recipe book using Dromio (transcribe.folger.edu), learn about Early Modern food culture and history, and develop a lot of hands-on research experience with Marissa, Christina and me. This semester we and our students were focusing on a medicinal and cookery book associated with Anne Western, owned by Folger Shakespeare Library and affectionately called MS v.b.380.

This course has a pretty serious transcription element, where one of the requirements was that each student would transcribe 40 openings using Dromio throughout the semester. Since each student was responsible for submitting a Word file of their transcriptions to Marissa for grading, we then had substantial (but not complete) coverage of the volume to work with. And it could easily be loaded into Voyant Tools for some linguistic exploration.

Over the course of the semester, students have also grown increasingly comfortable with the differences between contemporary and early modern recipes with regards to both genre and format, so we wanted to get them to think about the language of recipes more pointedly. The students were already experts in the language and style of the author they were working with. And since the students were so intimately familiar with the work they had already done, it was a little less of a hard sell to get them to think about their work from a more birds-eye view and think about what the language of their recipes looked like in aggregate.

Before class met, we asked the students to read three contemporary chocolate chip cookie recipes.[1] We had a brief discussion about the overall form of these contemporary recipes before dropping them in Voyant to practice reading from a birds-eye view. Chocolate chip cookie recipes, as you may have guessed, do not have a whole lot of variation, making this a pretty low-stakes way to introduce the various features of Voyant: word cloud, reading pane, general trends over time (not particularly useful for this genre), concordance, and some basic statistics.

three chocolate chip cookie recipes in Voyant

Once they were comfortable with the idea, we ramped up the stakes a little. In small groups, our students looked at their own transcriptions in Voyant, taking notes on what made their sections of the corpus similar to and different from each other. This process was designed to get the students to think about what their recipes were doing not just stylistically but linguistically, too: what are the lexical ingredients of their recipes? This primed them for discussions about polysemy (a pound of something vs pound the ingredient) and words marking measurement (spoon). One group even discussed the importance of the verb ‘mingling’ as a way to describe mixing things together in one student’s particular section of the recipe book.

Now used to the software and the process, we looked at the full class corpus (a compiled file consisting of all the students’ submissions). Students and faculty partners practiced some close reading, identifying terms of interest and looking at them using Voyant’s concordance feature, including a variety of ingredients (sugar, water, butter, mace, rosemary) and verbs for actions chefs may use (again, back to ‘mingling’ and ‘stir’).

looking at the class corpus in Voyant Tools

And though we had been discussing the role of VB 380 as a medicinal and cookery book throughout the semester, this was thrown into sharp relief while we all thought about the language of the class corpus. Certainly one big surprise was the relative importance of ‘sugar’ compared to ‘water’ in VB380. We were very struck by the lack of fixed vocabulary for the recipes, though we all had a pretty clear sense of expectation for the full-class corpus based on our earlier exploration. And, finally, we had a brief discussion about variation and the affordances of changing some of the data to deal with the question of spelling, authorship and authenticity.

While this was set up to be a discussion of linguistic features (nouns versus verbs; variation; etc) many students commented on how salient some terms were as transcribers – yet these were deemed less important in the overall big picture provided by Voyant. Ultimately, this left our brilliant students thinking about the strengths and weaknesses of both Voyant’s birds-eye view and linear start-to-finish reading – which was an even better outcome than I could have asked for.


Additional notes:
This is related to another undergraduate classroom activity I have done a few times where students read several online articles related to a theme, try out the birds-eye-view approach to the topic using Voyant, and then move into trying it out on some of their own writing (forum posts, assignments, etc) to think about their own style. It works pretty well as a way to get undergrads to think about their own style and editing in a way that is a little less obvious than other formats.

Please also see Miriam Posner’s very excellent Investigating Texts With Voyant workshop (10 April 2019) for more ideas https://github.com/miriamposner/voyant-workshop/blob/master/investigating-texts-with-voyant.md

[1] For the curious: One was from Smitten Kitchen, one from Martha Stewart, and a recipe of their choice from the NYT’s giant chocolate chip cookie compendium. (If I was doing this again, I would maybe not use NYT, as it kept asking us to log in. Also if I was doing this again, I’d be bringing cookies into class!)

How I use Twitter as an academic

An enormous amount of academic life happens in digital spaces these days. The microblogging service Twitter, which has been around since 2006, has, as of 2017, all but replaced the academic listserv. Despite the various ways that Twitter continues to struggle with content and user management, it has perhaps accidentally become a widely used professional space for academics to network, exchange ideas, and collaborate. When Twitter works like this, it is brilliant. When Twitter does not work like this, it becomes an incredibly dangerous place for people who are already in precarious positions (for any number of reasons: rank, social identity or identities, job market status, any intersection of the above, etc).

It is becoming increasingly clear to me that academics in general – especially early-stage graduate students – are in desperate need of training, support, & guidance on professional social media spaces. Although more senior scholars are often on social media, they are also secure enough in their positions and academic life in general that their experiences of social media and social networking can be very different from those of their graduate students.

I’ve been on Twitter since 2010, and I have seen this play out more than a few times, including as a graduate student myself. In these seven years I have maintained what I hope is a very professional profile and I have accidentally amassed a rather large following (in the 1000s). I would not go so far as to say that I am Internet Famous but certainly it is rare I walk into a room now and I don’t know someone there. I try to be very modest about my internet life but I also recognize that is quite difficult when I occupy this space.

People have asked me for years about how I do this. I tweeted yesterday about how I manage my own social media presence, which unexpectedly got a lot of interest. I thought it might be good to have these up more permanently for reference, as the very nature of Twitter is ephemeral (except for when it’s not). I’ve kept these in mostly 140ch-bites, because brevity is often better than verbosity.

1. Don’t tweet anything that you would not want to see associated with you/your name/your likeness in the international news.

2. Mute people you have to follow for whatever political reason but actively dislike. (It happens.) They don’t know you have them on mute.

3. Mute words that you don’t need to see. I’m not a basketball fan so I have “March Madness” on mute. For example.
I use Tweetdeck to mute individual words. This link explains how you do that on Regular Twitter.

4. Everything you say on Twitter is public and reaches lots of people you don’t know.

5. 140 characters (or 280, depending on who you are now) is very, very flattening. Assume the recipient will have the worst interpretation.

6. Twitter is a space for networking and making friends, but also your seniors are watching you. They will write your letters of rec one day. (If you are up for tenure, whatever, they are writing those letters too.)

7. Yelling about politics does not make you a better person, but it does make you feel part of a larger culture of dissatisfaction. If that makes you feel better, good for you. It can be more performative than anything else.

8. There is an art to being quiet about some things. This one is hard and takes practice.

9. You like the thing you study, so tweet about what you are doing. Be generous about what you know.

10. Give yourself regular days away from Twitter. People will still be there when you come back. Go outside, watch a film, have a life.

Heather’s 3 rules of doing digital scholarship

In my new job as Digital Scholarship Fellow in Quantitative Text Analysis, I’m starting to work with students and faculty in the Liberal Arts on the hows and whys of counting words in lots of texts. This is a lot of fun, especially because I get to hear a lot about what excites different scholars from across different disciplines and what they think is fascinating about the data they work with (or want to work with).

One thing that is kind of strange about my job – and there are several aspects of my job that have required some adjustment – is that my background is broadly in corpus linguistics and English literature, so I don’t always think of the work I do as being explicitly “DH”. These distinctions are quite frankly boring unless you are knee-deep in the digital humanities, and even then I am not convinced it is an interesting discussion to have. Ultimately, people have lots of preconceived notions about what DH is and why it matters. I suspect that different disciplines within the Humanities writ large have different ideas about this too – certainly the major disciplines I cut across (English, linguistics, history, computer science, sociology) have very different perspectives on the value and experience of digital scholarship. And of course, doing “digital” work in the humanities is kind of redundant anyway: we don’t talk about “computer physics” or “test tube chemistry”, as Mike Witmore and others have pointed out.

Being mindful of this, I have acquired a few rules for doing digital scholarship over the years, and I find myself saying them a lot these days. They are as follows:

1. “Can you do X” is not a research question.

The answer to “can you do X” is almost always “yes”. Should you do X? That’s another story. Can you observe every time Dickens uses the word ‘poor’? Of course you can. But what does it tell you about poverty in Dickens’ novels? Without more detail, this just tells you that Dickens uses the word ‘poor’ in his books about the working class in 19th century Britain — and you almost certainly didn’t need a computer to tell you that. But should you observe every time Dickens uses the word ‘poor’? Maybe, if it means he uses this over other synonyms for the same concept, or if it tells us something about how characters construct themselves as working-class, or if it tells us how higher status characters understand lower-status individuals, or whatever else. These are all research questions which require further investigation, and tell us something new about Dickens’ writing.
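To show how little the "can you" part involves, here is a minimal Python sketch of a keyword-in-context listing for 'poor'. The function name kwic, the width parameter, and the sample sentence are placeholders for illustration (the sentence is invented, not a quotation from Dickens). A real project would more likely use a concordancer such as AntConc or CQPweb.

```python
import re

def kwic(text, keyword, width=4):
    """Return (left, hit, right) context triples for each occurrence of keyword."""
    # Very rough tokenisation: runs of letters and apostrophes.
    tokens = re.findall(r"[A-Za-z']+", text)
    rows = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            rows.append((left, tok, right))
    return rows

# Invented placeholder sentence, not a quotation from any novel.
sample = "The poor man had no coat, and the poor child had no shoes."

for left, hit, right in kwic(sample, "poor", width=3):
    print(f"{left:>20} | {hit} | {right}")
```

Producing the listing is the easy part. Deciding what those contexts tell you about poverty in the novels is the research question.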

2. Programming and other computational approaches are not findings.

So you have learned to execute a bunch of scripts (or write your own scripts) to identify something about your object of study! That’s great. Especially if you are in the humanities, this takes a certain kind of mind-bending: you have to think about logic structures, understand how computers process the information we provide, and in some cases overcome the deeply irregular rules which make your computer language of choice work. This is hard to do! You deserve a lot of commendation for having figured out how to do all of this, especially if your background is not in STEM. But – and this is hard to hear – having done this is not specifically a scholarly endeavour. This is a methodological approach. It is a means to an end, not a fact about the object(s) under investigation, and most importantly, it is not a finding. This is intrinsically tied to point #1: Should you use this package or program or script to do something? Maybe, but then you have to be ready to explain why this matters to someone who does not care about programming or computers, but cares very deeply about whatever the object of investigation is.

3. Get your digital work in front of people who want to use its findings.

Digitally inflected people already know you can use computers to do whatever it is you’re doing. It may be very exciting to them to learn that this particular software package or quantitative metric exists and should be used for this exact task, but unless they also care about your specific research question, there is a limited benefit for them beyond “oh, we should use that too”. However, if you tell a bunch of people in your specific field something very new that they couldn’t have seen without your work, that is very exciting! And that encourages new scholarship exploring these new issues among the people your findings matter most to. You can tell all the other digital people about the work you’ve done as much as you want, but if your disciplinary base isn’t aware of it, they can’t cite it, they can’t expand on your research, and the discipline as a whole can’t move forward with this fact. Why wouldn’t you want that?

What’s a “book” in Early English Books Online?

Recently I have been employed by the Visualising English Print project, where one of the things we are doing is looking at improving the machine-readability of the TCP texts. My colleague Deidre has already released a plain-text VARDed version of the TCP corpus, but it is our hope to further improve the machine-readability of these texts.

One of the issues that came up in modernising and using the TCP texts has to do with non-English print. It has been previously documented that there are several non-English languages in EEBO – including Latin, Greek, Dutch, French, Welsh, German, Hebrew, Turkish and Algonquin. Our primary issue is that if there is a transcription in the corpus that is not in English, it will be very difficult for an English-language text parser or word vector model to account for this material.

So our solution has been to isolate the texts which are printed in a non-English language, either monolingually (e.g. a book in Latin) or a bi- or tri-lingual text (e.g. Dutch/English book, with a Latin foreword). Looking at EEBO-the-books is a helpful way to identify languages in print, as there are all sorts of printed cues to suggest linguistic variation, such as different fonts or italics to set a different language off from the primary language. It also means I get a chance to look at many of these non-English texts as they were printed and transcribed initially.
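Looking at the printed page is the approach described here. As a complement, a rough automated first pass could flag transcriptions that contain few common English function words, so that a person knows which books to open first. The sketch below is an illustration under stated assumptions, not the project's method: the word list, the 0.15 threshold, and the function names are all hypothetical, and unmodernised Early Modern spelling would need a longer list or standardised text.

```python
import re

# A small hand-picked set of English function words. This list is an
# assumption for illustration, not one used by the project.
ENGLISH_FUNCTION_WORDS = {
    "the", "and", "of", "to", "a", "in", "that", "is", "it",
    "for", "with", "as", "was", "his", "be", "not", "by",
}

def english_share(text):
    """Proportion of tokens that are common English function words."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t in ENGLISH_FUNCTION_WORDS)
    return hits / len(tokens)

def flag_for_review(text, threshold=0.15):
    """True if the text looks unlikely to be mostly English (hypothetical cut-off)."""
    return english_share(text) < threshold

english = "In the beginning was the Word, and the Word was with God"
latin = "In principio erat Verbum, et Verbum erat apud Deum"

print(flag_for_review(english))  # English sample
print(flag_for_review(latin))    # Latin sample
```

A check like this only produces a list of candidates. A bilingual book with a short Latin foreword would pass unnoticed, which is one reason looking at the printed cues still matters.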

Three years ago, I wrote a blog post about some Welsh language material that I found in EEBOTCP Phase I. In the intervening time I still have not learned Welsh (though I am endlessly fascinated by it), still get lots of questions and clicks to this site related to Early Modern Welsh (hello Early Modern Welsh fans), and I have since learned quite a lot more about how texts were chosen to be included in EEBO (it involves the STC; Kichuk 2007 is an excellent read on this topic to the previously uninitiated). So while that previous post asked “What makes a text in EEBO an English text”, this post will ask “what makes a text in EEBO a book?”

In general, I think we can agree that in order to be considered a book or booklet or pamphlet, a printed object has to have several pages. These pages can either be created by folding one broadside sheet, or the object will have a collection of folded sheets (called gatherings). It may or may not have a cover, but it would be one of several sizes (quarto, folio, duodecimo, etc). To this end, Sarah Werner has an excellent exercise on how to fold a broadside paper into a gathering which builds the basis for many, but probably not all, early books. Here is an example of a broadside that has clearly been folded up; it’s been unfolded for digitization.

Folded broadsheet (TCPID A17754)

So it has been folded in a style that suggests it could be read like a book, but it is not necessarily a book in the sense that there is a distinct sense of each individual page and that some of the verso/recto pages would be rendered unreadable unless they had been cut, etc.

In order to be available for digitization from the original EEBO microfilms, a text needed to be included in a short title catalogue. The British Library English Short Title Catalogue describes itself as

a comprehensive, international union catalogue listing early books, serials, newspapers and selected ephemera printed before 1801. It contains catalogue entries for items issued in Britain, Ireland, overseas territories under British colonial rule, and the United States. Also included is material printed elsewhere which contains significant text in English, Welsh, Irish or Gaelic, as well as any book falsely claiming to have been printed in Britain or its territories.

I select the British Library ESTC here because it covers several short title catalogues (Wing and Pollard & Redgrave are both included) and it’s my go-to short title catalogue database. Including “ephemera” is important, because it allows any number of objects to be considered as items of early print, even if they’re not really ‘books’ per se.

Such as this newspaper (TCPID A85603)…


Or this effigy, in Latin, printed on 1 broadside (TCPID A01919)

Or this proclamation, also printed on 1 broadside (TCPID A09573)


Or this sheet of paper, listing locations in Wales (Wales! Again!) (TCPID A14651)

Or this acrostic (TCPID A96959)


Interestingly, these are all listed as “one page” in the Jisc Historical books metadata, though they are perhaps more accurately “one sheet”. While there’s no definitive definition of “English” in Early English Books Online, it’s becoming increasingly clear to me that there’s no definitive definition of “book” either. And thank god for that, because EEBO is the gift that keeps giving when it comes to Early Modern printed materials.

10 Things You Can Do with EEBO-TCP Phase I

The following is a list of resources I presented at Yale University (New Haven, CT, USA) on 4 May 2016 as part of my visit to the Yale Digital Humanities Lab. Thank you again for having me! This resource list includes work by colleagues of mine from the Visualising English Print project at the University of Strathclyde. You can read more about their work on our website, or read my summary blog post at this link.

You can download the corresponding slides at this link and the corresponding worksheet at this link.

EEBO and the TCP initiative
Official page http://www.textcreationpartnership.org/tcp-eebo/
The History of Early English Books Online http://folgerpedia.folger.edu/History_of_Early_English_Books_Online
Transcription guidelines & other documentation http://www.textcreationpartnership.org/docs/
EEBO-TCP Tagging Cheatsheet: Alphabetical list of tags with brief descriptions http://www.textcreationpartnership.org/docs/dox/cheat.html
Text Creation Partnership Character Entity List http://www.textcreationpartnership.org/docs/code/charmap.htm

Access the EEBOTCP-1 corpus
#1 – Download the XML files https://umich.app.box.com/s/nfdp6hz228qtbl2hwhhb/
#2 – Search the full text transcription repository online
http://quod.lib.umich.edu/e/eebogroup/ *
or http://eebo.odl.ox.ac.uk/e/eebo/ *
or http://ota.ox.ac.uk/tcp/
* these are mirrors of each other

#3 – Find a specific transcription
STC number vs ESTC number vs TCPID number
STC = specific book a transcription is from
ESTC number = “English Short Title Catalogue” (see http://estc.bl.uk/F/?func=file&file_name=login-bl-estc)
TCPID = specific transcription (A00091)

#4 – Search the full text corpus of EEBOTCP*
(*Can include EEBO-TCP Phase I alone, or Phase I and Phase II; read the documentation carefully)

EEBO-TCP Ngram reader, concordancer & text counts http://earlyprint.wustl.edu/ (big picture)
CQPWeb EEBO-TCP, phase I https://cqpweb.lancs.ac.uk/eebov3
#5 – Identify variant spellings
BYU Corpora front end to EEBO-TCP (*not completely full text but will be soon*) http://corpus.byu.edu/eebo (potential variant spellings; see also EEBO-TCP Ngram reader, above)

Find specific information in the TCP texts…
#6 – Find a specific language
e.g. Welsh: *ddg*
#7 – Find a specific term or concept using the Historical Thesaurus of the OED (http://historicalthesaurus.arts.gla.ac.uk/), then trace it with the resources listed above

#8 Curate corpora
Alan Hogarth’s Super Science corpus uses EEBO-TCP provided metadata + disciplinary knowledge to curate texts about scientific writing

#9 Clean up transcriptions – Shota Kikuchi’s work
Using the VARD spelling moderniser + PoS tagging (Stanford PoS tagger and TregEx → improvements for tagging accuracy, syntactic parsing)

#10 Teach with it!
Language of Shakespeare’s plays web resource led by Rebecca Russell, undergraduate student (University of Strathclyde Vertically Integrated Project, English Studies / Computer Science)
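Items #1 and #6 above can be combined in a few lines of Python: read a downloaded XML transcription, pull out its words, and run a wildcard search such as *ddg*. This is a minimal sketch under loud assumptions. SAMPLE_XML is a tiny invented stand-in, not a real TCP file (real files are far larger and include a metadata header that this sketch does not separate out). The function names are hypothetical, and the sketch assumes that * means "any run of characters", as in shell-style wildcards.

```python
import fnmatch
import re
import xml.etree.ElementTree as ET

# Invented stand-in for one downloaded transcription file. Real TCP XML
# has a header and much richer markup; this is only for illustration.
SAMPLE_XML = "<doc><p>The road from <hi>Beddgelert</hi> to Carnarvon</p></doc>"

def tokens_from_xml(xml_string):
    """Strip all markup and return lower-cased word tokens."""
    root = ET.fromstring(xml_string)
    # Join without spaces so a word split by a tag (e.g. a decorated
    # initial) is not broken in two.
    text = "".join(root.itertext())
    return re.findall(r"[^\W\d_]+", text.lower())

def wildcard_hits(tokens, pattern):
    """Return tokens matching a shell-style wildcard pattern."""
    return [t for t in tokens if fnmatch.fnmatchcase(t, pattern)]

tokens = tokens_from_xml(SAMPLE_XML)
print(wildcard_hits(tokens, "*ddg*"))
```

For a file on disk, ET.parse(path).getroot() would take the place of ET.fromstring. The online search interfaces listed above may treat wildcards differently, so check their documentation before relying on a pattern.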

Introducing “Social Identity In Early Modern Print: Shakespeare, Drama (1514-1662) and EEBO-TCP Phase I”

Today I am enormously pleased to introduce my PhD thesis, entitled “Social Identity In Early Modern Print: Shakespeare, Drama (1514-1662), and EEBO-TCP Phase I”. It investigates how Shakespeare’s use of terms marking social identity compares to that of his dramatist contemporaries and to Early Modern print more generally, using the Historical Thesaurus of the Oxford English Dictionary and the Early English Books Online Text Creation Partnership Phase I corpus.

Here are some facts about it: It ended up being 267 pages long; without the bibliography/front matter/appendices it’s 51,399 words long, which is about the length of 2 ⅓ Early Modern plays. There are 5 chapters and 3 appendices. It’s dedicated to Lisa Jardine, whose 1989 book Still Harping on Daughters: Women and Drama in the Age of Shakespeare changed my life, getting me interested in social identity in the Early Modern period in the first place. I am so grateful for Jonathan Hope and Nigel Fabb’s guidance. My examination is scheduled to be within the next two months, which is relatively soon. I’m excited to have it out in the world.

In lieu of an abstract I have topic modelled most of the content chapters for you (I had to take out tables and graphs, and some non-alphanumeric characters; I’m sure you will forgive me). For the unfamiliar, topic modelling looks for weighted lexical co-occurrence in text. You can read more about it at this link.

A note on abbreviations: The SSC (“Standardized Spelling WordHoard Early Modern Drama corpus, 1514-1662”, Martin Mueller, ed. 2010) is a subcorpus of Early Modern drama from EEBO-TCP phase I which I use to set up comparisons between Shakespeare and his dramatist contemporaries. It generally serves as a prototype for Mueller’s later Shakespeare his Contemporaries corpus (2015). EEBOTCP is Early English Books Online Text Creation Partnership phase I and HTOED is the Historical Thesaurus of the Oxford English Dictionary. Wordhoard is a complete set of Shakespeare’s plays, deeply tagged for quantitative analysis.

1. play act suggests titus scene time bigram tragic iv strategy naming change setting hamlet shift iii downfall beginning serves construction established lack direct tamora revenge hero moment modes cleopatra andronicus
2. forms similar considered general categories surrounding directly position multiple common found listed contexts illustrates examples plays contrast concept variant sample kinds small included covering past perceived suggested thinking variety investigated
3. concordance search query view plot title instance queen lines cqpweb plots image string process line antconc average regular aggregated queries images normalized imagemagick size page length produce expressions bible adjacent
4. gentry status figures table models classes social high members nevalainen raumolin upper archer brunberg model nobility part identifying explicitly servants references presented clergy commoner account commoners noble nobles system ranking
5. england make acts paul writing true thou evidence back primary conversion festus thee threat noble art long sermon doctor christianity madde effect john audience speak verse church apostles truth argument
6. man lady generally native understood men latinate positive good beauteous virtuous character contrast register night proper adjectives poor takes lost love stylistic day raising formal gentle sweet honourable qualities dream
7. social class vocatives status titles specific marking vocative comedies identify tragedies histories indicative feature busse absence broadly pattern courteous moments middle start markers visualisation names individuals world finish amount presence
8. characters plays character title necessarily number total play mistress high roles list speaking rank comedy considered minor remains duke romeo ephesus syracuse juliet marriage analysis attendants lords measure merchant taming
9. social historical study literary language identity linguistic features studies culture evidence primarily feminist sociolinguistic criticism thesis claims community cultural approach methods scholars personal identify research previously renaissance understand central jardine
10. part discuss question bedlam stage represent order shown socially shakespearean describe honest invoking earlier ford duchess run discussion instability neely husband alongside ridge acting issues bellamont put accounts considers whore
11. woman collocates lady register knave nouns lower wench man terms unique term binary negative low lord high status variation show describe gendered collocate lexical semantic higher describes formality individuals noun
12. individual lord ways include reference sir politeness set result king henry applied strategies referenced parts address othello court philaster semantically marker due master wife introduce kind cardinal correlation question structures
13. female male characters gender women means dialogue patterns feminine find fewer marked idea role objects masculine grammatical race dependent skewed understood heavily ambiguous distinctions body rackin finally century negation interest
14. shows figure compared million authors instances unlike difference clear difficult contemporary specifically previous usage frequency comparatively ssc picture rest consistently shakespeare begin strong scenes times john place possibility england thomas
15. shakespeare plays show contemporaries larger chapter variation compare history dramatists relationship dramatist genre attention observe writing compares style divide edited internal exceptional assigned studied cover challenge correlate finding worthy lear
16. mad sanity actions phrase claim religious reason threatening face gram suggest establish dangerous prove suggesting potentially claims orleans biting attempts state explored evidence accusations grams justify corvino recurring circumstances accused
17. corpus corpora section ssc spelling included including making full range data meaning genres author frequently identify drama represents wider investigation comparisons nature side discussions observe metadata wide separate dialogues smaller
18. present hand http made oed malvolio works project head makes ii complete issue end fool ability thesaurus www requires illustrate kay dictionary org gutenberg hope oxford referencing initially house uk
19. early modern tcp eebo drama english phase period print online books creation widely playwrights partnership focus limited handful short jonson pronouns catalogue publication diachronic selected initiative recently years markedly norm
20. words examples frequent function highly word potential table left phrases node analysis ranked span pronoun content unclear multi lexical keyword context sense luna verbs collocate count finds unit highlights large
21. dramatic structure narrative action characterization model outlined freytag shakespearean visible work understanding point argues schmidt end tragedy jockers genre introduced constructions construct plausible description sudden suggest neuberg fiske climax reasons
22. texts text information based digital lists wordhoard personae dramatis mueller editions printed view folger selection designed source finally print includes burns ultimately discussing machine nameless edition readable time critical provided
23. madness terms mental htoed synonyms relevant illness form crazy melancholy historically health person related synonymous brained symptoms insanity scientific physical medical dement linked term relating semantic crazed variations definition discussed
24. test lemma collocates probability results likelihood collocational information analysis log dice lemmata top specific tests coefficient collocation based squared method frequency mutual relationships conditional chi appears phi produce methods wordhoard
25. form texts speaker culpeper discourse speech world context category spoken constructed kyt hearer addressee al biber analyses interaction writing written lutzky interactions fictional imaginary relevant circumstances addition person people real