An introductory bibliography to corpus linguistics

This is a short bibliography meant to get you started in corpus linguistics – it is by no means comprehensive, but should serve to be a good introductory overview of the field.

>>This page is updated semi-regularly for link rot; if you find any dead links please contact me at heathergfroehlich at gmail dot com. Thanks!<<

1.1 Books (and one article)
Baker, Paul, Andrew Hardie and Tony McEnery. (2006). A Glossary of Corpus Linguistics. Edinburgh, Edinburgh UP.
Biber, Douglas (1993). “Representativeness in Corpus Design”. Literary and Linguistic Computing, 8 (4): 243-257.
Biber, Douglas, Susan Conrad and Randi Reppen (1998). Corpus Linguistics: Investigating Language Structure and Use. Cambridge: Cambridge UP.
Granger, Sylviane, Joseph Hung and Stephanie Peych-Tyson. (2002). Computer Learner Corpora, Second Language Acquisition and Foreign Language Teaching. Amsterdam: John Benjamins.
Hoey, Michael, Michaela Stubbs, Michaela Mahlberg, and Wolfgang Teubert. (2011). Text, Discourse and Corpora. London: Continuum.
Hunston, S. (2002). Corpora in applied linguistics. Cambridge: Cambridge University Press.
Mahlberg, Michaela. (2013). Corpus Stylistics and Dickens’ Fiction. London: Routledge.
McEnery, T. and Hardie, A. (2012). Corpus Linguistics: Method, theory and practice. Cambridge: Cambridge UP.
O’Keefe, Anne and Michael McCarthy, eds. (2010).The Routledge Handbook of Corpus Linguistics. London: Routledge.
Sinclair, John and Ronald Carter. (2004). Trust the Text. London: Routledge.
Sinclair, John. (1991) Corpus Concordance Collocation. Oxford: Oxford UP.
Wynne, M (ed.) (2005). Developing Linguistic Corpora: a Guide to Good Practice. Oxford: Oxbow Books.

1.2 Scholarly Journals
Literary and Linguistic Computing

1.3 Externally compiled bibliographies and resources
David Lee’s Bookmarks for corpus-based linguistics
Costas Gabrielatos has been compiling a bibliography of Critical Discourse Analysis using corpora, 1982-present
Members of the corpus linguistics working group UCREL at Lancaster University have compiled some of their many publications here; see also their LINKS page
Michaela Mahlberg is one of the leading figures in corpus stylistics (especially of interest if you want to work on literary texts); in 2006 she helped compile a corpus stylistics bibliography (pdf) with Martin Wynne.
Lots of work is done on Second Language Acquisition using learner corpora. Here’s a compendium of learner corpora

Corpora-List (mailing list)
CorpusMOOC, run out of Lancaster University, is an amazingly thorough resource. Even if you can’t do everything in their course, there’s lots of step-by-step how-tos, videos, notes, readings, and help available for everyone from experts to absolute beginners.

1.4 Compiled Corpora
Xiao, Z. (2009). Well-Known and Influential Corpora,  A Survey, based on Xiao (2009), “Theory-driven corpus research: using corpora to inform aspect theory”. In A. Lüdeling & M. Kyto (eds) Corpus Linguistics: An International Handbook [Volume 2]. Berlin: Mouton de Gruyter. 987-1007.
Various Historical Corpora
Oxford Text Archive
Linguistic Data Consortium
CQPWeb, a front end to various corpora
BYU Corpora
NLTK Corpora
1.5 DIY Corpora (some work required)
Project Gutenberg
LexisNexis Newspapers
LexisNexis Law
BBC Script Library

1.6 Concordance software
No one software is better than another, though some are better at certain things than others. Much here comes down to personal taste, much like Firefox vs Chrome or Android vs iPhone. While AntConc, which is what I use, is great it is far from the only software available. (Note that these may require a licencing fee.)
R (for the advanced user)
Anthony, Laurence. (2013). “A critical look at software tools in corpus linguistics.” Linguistic Research 30(2), 141-161.

1.7 Annotation You may want to annotate your corpus for certain features, such as author, location, specific discourse markers, parts of speech, transcription, etc. Some of the compiled corpora might come with included annotation.
Text Encoding Initiative
A Gentle Introduction to XML
Hardie, A (2014) ““Modest XML for Corpora: Not a standard, but a suggestion”. ICAME Journal 38: 73-103.
UAM Corpus Tool does both concordance work and annotation

1.7.1 Linguistic Annotation
Natural Language Toolkit the NLTK book
Stanford NLP Parser (includes Named Entity Recognition, semantic parser, and grammatical part-of-speech tagging)
CLAWS, a part of speech tagger
USAS, a semantic tagger

1.8 Statistics Help 1.8.1 Not Advanced
Wikipedia (great for advanced concepts written for the non-mathy type)
Log Likelihood, explained
AntConc Videos
WordSmith Getting Started Files
Oakes, M. (1998): Statistics for Corpus Linguistics. Edinburgh: Edinburgh University Press. Baroni, M. and S. Evert. (2009): “Statistical methods for corpus exploitation”, in A. Lüdeling and M. Kytö (eds.), Corpus Linguistics: An International Handbook Vol. 2. Berlin: de Gruyter. 777-803. 1.8.2 Advanced Stefan Th. Gries’ publications: Adam Kilgarriff’s publications: Pre-2009 Post-2009
Baayen, R.H. (2008). Analyzing Linguistic Data: A Practical Introduction to Statistics Using R. Cambridge: Cambridge University Press.

heather froehlich // last updated 06 Feb 2016