An introductory bibliography to corpus linguistics

This is a short bibliography meant to get you started in corpus linguistics – it is by no means comprehensive, but it should serve as a good introductory overview of the field.

>>This page is updated semi-regularly for link rot; if you find any dead links please contact me at heathergfroehlich at gmail dot com. Thanks!<<

1.1 Books (and one article)
1.2 Scholarly Journals
Literary and Linguistic Computing

1.3 Externally compiled bibliographies and resources
David Lee’s Bookmarks for corpus-based linguistics
Costas Gabrielatos has been compiling a bibliography of Critical Discourse Analysis using corpora, 1982-present
Members of the corpus linguistics working group UCREL at Lancaster University have compiled some of their many publications here; see also their LINKS page
Michaela Mahlberg is one of the leading figures in corpus stylistics (especially of interest if you want to work on literary texts); in 2006 she helped compile a corpus stylistics bibliography (pdf) with Martin Wynne.
Lots of work is done on Second Language Acquisition using learner corpora. Here’s a compendium of learner corpora

Corpora-List (mailing list)
CorpusMOOC, run out of Lancaster University, is an amazingly thorough resource. Even if you can’t do everything in their course, there’s lots of step-by-step how-tos, videos, notes, readings, and help available for everyone from experts to absolute beginners.

1.4 Compiled Corpora
Xiao, Z. (2009). Well-Known and Influential Corpora: A Survey, based on Xiao (2009), “Theory-driven corpus research: using corpora to inform aspect theory”. In A. Lüdeling & M. Kytö (eds) Corpus Linguistics: An International Handbook [Volume 2]. Berlin: Mouton de Gruyter. 987-1007.
Various Historical Corpora
Oxford Text Archive
Linguistic Data Consortium
CQPWeb, a front end to various corpora
BYU Corpora
NLTK Corpora

1.5 DIY Corpora (some work required)
Project Gutenberg
LexisNexis Newspapers
LexisNexis Law
BBC Script Library

1.6 Concordance software
No single piece of software is better than the rest, though some are better at certain things than others. Much here comes down to personal taste, like Firefox vs Chrome or Android vs iPhone. AntConc, which is what I use, is great, but it is far from the only software available. (Note that these may require a licensing fee.)
R (for the advanced user)
Anthony, Laurence. (2013). “A critical look at software tools in corpus linguistics.” Linguistic Research 30(2), 141-161.
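
To give a sense of what concordance software does at its core, here is a minimal keyword-in-context (KWIC) sketch in plain Python. It is not how AntConc or any other tool listed here is implemented; the function name `kwic`, the `width` parameter, and the sample sentence are all made up for illustration.

```python
import re

def kwic(text, keyword, width=30):
    """Return one keyword-in-context line per whole-word match of keyword."""
    lines = []
    # Whole-word, case-insensitive match (an assumption; real tools offer more options).
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    for m in pattern.finditer(text):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # Right-align the left context so the keywords line up in a column.
        lines.append(f"{left:>{width}} [{m.group(0)}] {right:<{width}}")
    return lines

# Hypothetical sample text, for illustration only.
sample = "The cat sat on the mat. A cat is not a dog, but the cat does not mind."
for line in kwic(sample, "cat", width=20):
    print(line)
```

Dedicated concordancers add sorting, regular-expression search, collocation tables and much more on top of this basic idea.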

1.7 Annotation
You may want to annotate your corpus for certain features, such as author, location, specific discourse markers, parts of speech, transcription, etc. Some of the compiled corpora might come with included annotation.
Text Encoding Initiative
A Gentle Introduction to XML
Hardie, A. (2014). “Modest XML for Corpora: Not a standard, but a suggestion”. ICAME Journal 38: 73-103.
UAM Corpus Tool does both concordance work and annotation
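
As a rough illustration of the kind of annotation described above, the sketch below builds a tiny XML-annotated text with Python's standard library and then queries it. The element and attribute names (`text`, `w`, `author`, `location`, `pos`) and the sample values are invented for this example; they do not follow TEI or any other scheme listed here.

```python
import xml.etree.ElementTree as ET

# Text-level metadata stored as attributes (hypothetical values).
text_el = ET.Element("text", attrib={"author": "Anonymous", "location": "London"})

# Word-level annotation: one <w> element per token, with a part-of-speech attribute.
for word, pos in [("The", "DET"), ("cat", "NOUN"), ("sat", "VERB")]:
    w = ET.SubElement(text_el, "w", attrib={"pos": pos})
    w.text = word

xml_string = ET.tostring(text_el, encoding="unicode")
print(xml_string)

# Once annotated, features can be queried, e.g. all tokens tagged as nouns.
nouns = [w.text for w in text_el.iter("w") if w.get("pos") == "NOUN"]
print(nouns)
```

In practice the part-of-speech tags would come from a tagger rather than being typed in by hand.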

1.7.1 Linguistic Annotation
Natural Language Toolkit; the NLTK book
Stanford NLP Parser (includes Named Entity Recognition, semantic parser, and grammatical part-of-speech tagging)
CLAWS, a part of speech tagger
USAS, a semantic tagger

1.8 Statistics Help

1.8.1 Not Advanced
Wikipedia (great for advanced concepts written for the non-mathy type)
Log Likelihood, explained
AntConc Videos
WordSmith Getting Started Files
Oakes, M. (1998). Statistics for Corpus Linguistics. Edinburgh: Edinburgh University Press.
Baroni, M. and S. Evert. (2009). “Statistical methods for corpus exploitation”, in A. Lüdeling and M. Kytö (eds.), Corpus Linguistics: An International Handbook Vol. 2. Berlin: de Gruyter. 777-803.

1.8.2 Advanced
Stefan Th. Gries’ publications
Adam Kilgarriff’s publications: Pre-2009, Post-2009
Baayen, R.H. (2008). Analyzing Linguistic Data: A Practical Introduction to Statistics Using R. Cambridge: Cambridge University Press.
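
For readers who like to see a formula as code, here is a minimal sketch of the log-likelihood statistic mentioned above, in the two-corpus form commonly used for comparing word frequencies: expected frequencies are worked out from the two corpus sizes, and the score is 2 × Σ observed × ln(observed / expected). The function and parameter names are my own, and the numbers in the usage line are made up.

```python
import math

def log_likelihood(freq1, size1, freq2, size2):
    """Log-likelihood score for one word's frequency in two corpora.

    freq1, freq2: how often the word occurs in corpus 1 and corpus 2
    size1, size2: total number of words in corpus 1 and corpus 2
    """
    total = size1 + size2
    # Expected frequencies if the word were spread evenly across both corpora.
    expected1 = size1 * (freq1 + freq2) / total
    expected2 = size2 * (freq1 + freq2) / total
    ll = 0.0
    for observed, expected in ((freq1, expected1), (freq2, expected2)):
        if observed > 0:  # a zero count contributes nothing to the sum
            ll += observed * math.log(observed / expected)
    return 2 * ll

# Hypothetical counts: 20 hits in a 1,000-word corpus vs 10 hits in another 1,000-word corpus.
print(log_likelihood(20, 1000, 10, 1000))
```

A higher score means the difference in frequency between the two corpora is less likely to be down to chance; the statistics references above explain how to interpret it.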

