This past week I was at DHWI, the inaugural Digital Humanities Winter Institute at the University of Maryland, modeled on the highly successful DHSI. I was taking a class on R, the statistical computing language and environment, and applying it to large-scale text analysis. Large-scale text analysis can be, and often is, stylistic in nature: questions of authorship, questions of generic form, specific modes of language in use. I know this as “corpus stylistics” and “corpus linguistics”.
I left the US to pursue my postgraduate education in literary linguistics and stylistics, a field that never quite caught on in North America the way it did in Europe. It’s been really interesting to hear about the resurgent interest in stylistics and corpus studies in North America over the past few years, but under different names (‘text mining’, ‘big data’, ‘humanities computing’ – the list goes on; take your pick). The overlap between corpus linguistics and what North American academics have been calling ‘text analysis’, in all its variant forms, is huge. And yet this cross-pollination is rarely discussed: the Humanist list has recently been discussing the role of XML markup and its uses, a topic that is an ongoing conversation on corpora-list.
When I attend events like this I’m always quietly alarmed that there are digital humanists who are, in essence, participating in a mode of linguistic inquiry which has been around for ~30 years, but who seem to have absolutely zero conception of this early work: the existing concordance software and the ways it has been applied to a number of textual objects (newspapers, letters, books, English as a foreign language, representative samples of language in a specific time frame, etc.). Scholars and colleagues have admitted to me that they had no idea about ethics and good practice in corpus linguistics. Compare that to conversations I’ve had with colleagues who call themselves corpus linguists but not strictly digital humanists: they have more of a conception of what’s happening in digital humanities than DHers do of corpus linguistics. This is a publicity problem: the digital humanities are new and sexy and buzzwordily exciting on both sides of the pond, whereas corpus linguistics is a more European tradition, albeit with some notable exceptions in North America and China (which leads to some interesting questions about privileging one higher education system over another in the so-called ‘citizen of the globe’ ethos being pushed). While corpus linguistics and corpus stylistics have historical precedence, they don’t have quite the public presence that digital humanities does, nor, indeed, the wide-ranging ‘humanities’ attached. As a result, the DH crowd is getting a lot of attention from the higher education community.
A number of people this past week have asked me about corpus linguistics and corpus stylistics, the tools we use, and how they could use them for their own projects, which has been exciting. If I wanted to do x, where would I begin? What can you do with this kind of information? I am constantly directing them to the same sources, which are largely bibliographies of corpus stylistic work. This page from PALA is somewhat outdated (from 2006) but is a good starting point with regard to who is doing what in what is essentially the same field, with different names on different continents. And the response is always “Oh! I had no idea.”
I personally want to see the gap between corpus studies and digital humanities close, and I don’t think I’m the only one who feels that way. How can we fix this? My modest suggestion is to start using John McHardy Sinclair’s Corpus, Concordance, Collocation (1991) and Trust the Text (2004) as digital humanities reference material, as both are short, major works on corpus linguistics as a methodology and as a theory. There is already a model for an issue that one half of this community is currently wrestling with: the other half solved it quite a while ago.