A Text Analysis By Any Other Name

This past week I was at DHWI, the inaugural Digital Humanities Winter Institute at the University of Maryland, modeled on the highly successful DHSI. I was taking a class on R, the statistical analysis package and coding language, and applying it to large-scale text analysis. Large-scale text analysis can be, and often is, stylistic in nature: questions of authorship, questions of generic form, specific modes of language in use. I know this as “corpus stylistics” and “corpus linguistics”.

I left the US to pursue my postgraduate education in literary linguistics and stylistics, a notion that never quite caught on in North America in the same way that it caught on in Europe. It’s been really interesting to hear about this sudden resurgent interest in stylistics and corpus studies in North America over the past few years, but with a different name (‘text mining’, ‘big data’, ‘humanities computing’ – the list goes on; take your pick). The overlap between corpus linguistics and what North American academics have been calling ‘text analysis’, in its variant forms, is huge. And yet this cross-pollination is so rarely discussed: the Humanist list has recently been discussing the role of XML markup and its uses, a topic that is an ongoing conversation on corpora-list.

When I attend events like this I’m always quietly alarmed that there are digital humanists who are, in essence, participating in a mode of linguistic inquiry which has been around for ~30 years, but seem to have absolutely zero conception of this early work: the use of existing concordance software and the ways it has been applied to a number of textual objects (newspapers, letters, books, English as a foreign language, representative samples of language in a specific time frame, etc.). Scholars and colleagues have admitted to me that they had no idea about ethics and good practice in corpus linguistics. Compare that to conversations I’ve had with colleagues who call themselves corpus linguists but not strictly digital humanists: they have more of a conception of what’s happening in digital humanities than DHers do of corpus linguistics. This is a publicity problem, wherein the digital humanities are new and sexy and buzzwordily exciting on both sides of the pond. This is in contrast to the more European tradition of corpus linguistics, albeit with some notable exceptions in North America and China (which leads to some interesting questions about privileging one higher education system over another in the so-called ‘citizen of the globe’ ethos being pushed). Where corpus linguistics and corpus stylistics have historical precedence, they don’t have quite the public presence that digital humanities does, nor, indeed, the wide-ranging ‘humanities’ attached. As a result, the DH crowd is getting a lot of attention from the higher education community.

A number of people this past week have asked me about corpus linguistics/corpus stylistics, the tools we use, and how they could possibly use them for their projects, which has been exciting. If I wanted to do x, where would I begin? What can you do with this kind of information? I am constantly directing them to the same sources, which are largely bibliographies of corpus stylistic work. This page from PALA is somewhat outdated (from 2006) but will prove a good starting point with regard to who is doing what in what is essentially the same field but with different names on different continents. And the response is always “Oh! I had no idea.”

I personally want to see the gap between corpus studies and digital humanities close, and I don’t think I’m the only one to feel that way. How can we fix this? My modest suggestion is to start using John McHardy Sinclair’s works Corpus, Concordance, Collocation (1991) and Trust the Text (2004) as digital humanities reference material, as they are both short and both major works on corpus linguistics as a methodology and as a theory. There’s a model already for an issue that one half of a community is currently wrestling with, as solved by the other half quite a while ago.



  1. Some of the lower profile you observe for corpus linguistics and corpus stylistics, relative to DH, is the byproduct of a strategic choice made by some: the choice against openness.

    I will not claim it is a representative case, but it is a telling one, that the purveyors of WordNet Affect would not share their corpus with a nonprofit organization with an NLP & AI working group, because it is not an accredited academic institution. What harm could be done by such sharing is entirely occult to comprehension, as far as I can tell, but it drove this group directly in the AI direction, where access to closely held corpora would not hold them back.

    No one in the NLP / AI working group was in any way imagining themselves to be a digital humanist, but the WordNet Affect project is part of corpus linguistics. The anecdote of exclusion may not — or _may_ — be representative. It may be well worth rebutting; but no rebuttal can erase the unfortunate instance to which it refers.

    Your “modest suggestion” I will, personally, take full readerly advantage of, to the extent and in the time that I am able, because there is something valuable in it for me. It is worth noticing something salient in its allusive resonances, though, while unloading its rhetorical freight into my warehouse of future insight. When I see the formation “modest – NP” & the NP is a close synonym to “proposal,” I unpack that in a section of floor devoted to Jonathan Swift and his rhetorical descendants. His Modest Proposal was for Irish parents to sell their children as food to the rich in order to alleviate their own poverty, was it not? In sympathy with the author of the present piece, I will hope the discoveries of corpus linguists will not require as much of their DH audience as Swift’s Proposal, taken literally, would have asked of Irish parents; and it is a given that the author cannot have meant it in a similarly arch way. But the resonance is there, and it hits on the readerly reflex to probe for indirect discourse as surely as a rubber hammer probes at the patellar tendon. “My modest suggestion”: Is it modest? Is it immodest?

    If it is immodest, that is not grounds for rejection, either. If it’s worth saying, say it out loud.

    Same for what a reader takes from it.

    In radio systems, a signal is in parts reflected, transmitted, and absorbed. In reading, I look for something to absorb. I tell myself in this case to read literally, not automatically to reflect the suggestion, however the adjective tinctures it.

    It is a difficult thing, to see one’s work fully in the process of reinvention in another field, all the pitfalls and lessons and hard, sometimes lifelong work, without reference to the work one has already done, the work one’s colleagues and teachers and mentors have already done. It is worthwhile to transmit that work, in its best form, into the newly developing tradition. It is worthwhile to solicit attention and it is work well spent to entertain this kind of suggestion.

    At all points of the compass, reciprocity: reader’s investment for writer’s, elder field for younger, younger for elder… And so, I will do what I can to profit from the suggestions offered. And I will ask for something in exchange.

    What I will ask is that, when NLP and corpus methods are applied in new domains by new and potentially naive practitioners, corpus linguists and stylists see the new work at least in part for its scholarly merits, independent of the familiarity of the methods it applies. I.e., if someone applies an old method to a new scholarly problem or domain, please, please, don’t insist that “this is a solved problem.” To say so is to miss the point. Once a method is developed, it deserves to be applied. Digital Humanities, to the extent that it is a field at all, seems to me to arise, at least in part, from a collision between disciplines that have invested heavily in developing methods and a variety of disciplines that would apply them, as extensions to ways of working already developed. The domain of application (e.g. issues in European history, questions of colonialism, any topic you can conceive) will have value of its own.

    The author advises: “There’s a model already for an issue that one half of a community is currently wrestling with, as solved by the other half quite a while ago.”


    But for many, the question of _how_ to apply a methodology (here, corpus linguistics or stylistics) is only the “technical” side of equally deep questions in the domains to which it is applied. It must be appreciated for its subtleties and its implications for one’s conclusions (some would say its ideological non-neutrality, quite the same thing, viewed from a different angle). But the scholarly domains into which digital and corpus methodologies are introduced manifest subtleties and structure peculiarly their own, equally worth attention.

    There is a model already for the response I have offered here, as well. Actually there are two families of models: (1) The model of closure, evidenced by the corpus linguists aforementioned, contradistinct to the programmatic openness of digital humanities; (2) the model of misattributed closure of problems “already solved,” where methodology is hoped to be made available to a new domain of study.

    Somewhere in a corpus, by some method, the kind of reader response I am manifesting in print may well be evident. Everywhere there are echoes. There is, everywhere as well, a reciprocity of obligations. You can see here in writing a sketch of the reflexes elicited by the present article; I am not sure whether they are modest or not, and I leave it to you to choose which are sufficiently representative to absorb, which to reflect, and which to transmit.

    I expect to make time to acquire the works cited, and to benefit from them. I hope sufficient corpus resources will be available, so that their lessons may be applied.

    1. Hi Phil,

      Thank you very much for these thoughts, and you’re absolutely correct that it is a specifically strategic move by some. The WordNet example is a good one: access and openness are certainly not tenets of corpus linguistics and stylistics in the same way they have been for NLP, AI and DH scholars, and I agree that it is a problem. I have a forthcoming post planned where I will be addressing some of these issues.

      As for my modest proposal: it is indeed modest, as I would like to see others follow suit and consider the suggestion of reaching back a little bit farther into another field’s (or subfield’s) history. Similarly, I feel strongly that we should be considering earlier work from the social sciences as a large-scale digital movement picks up speed in the US and abroad. I can’t guarantee that others will do so, but I hope that, by providing links and references, I can encourage you and others like you to consider how these fields can inform each other in productive ways.

      It is indeed a difficult thing to watch a reinvention of one field into another, and when I say that the problem has been solved I simply mean that the issue of “are we theory or are we methodology” has been addressed for corpus linguistics, and there are two largely understood avenues: the corpus as a form of theory or the corpus as a form of methodology. (Please see Corpus Linguistics at Work, Tognini-Bonelli 2001: chapter 5, available here http://books.google.co.uk/books?id=z0TZmK1YWTIC&lpg=PA84&ots=er-Leip1yp&dq=corpus%20driven%20corpus%20based&pg=PA84#v=onepage&q=corpus%20driven%20corpus%20based&f=false for a discussion of corpus theory vs method).

