text analysis | heather froehlich

This is a short bibliography meant to get you started in corpus linguistics. It is by no means comprehensive, but it should serve as a good introductory overview of the field.

>>This page is updated semi-regularly; if you find any dead links please contact me at hgf5 at psu dot edu. Thanks!<<

1.0 General resources
Froehlich, H. “Intro to Text Analysis”. Penn State University Library Guides. (30 May 2018), http://guides.libraries.psu.edu/textanalysis
Froehlich, H. “Text mining: Web-based resources”. Penn State University Library Guides. (10 October 2018) https://guides.libraries.psu.edu/textmining/web

1.1 Books (and two articles)
Baker, Paul, Andrew Hardie and Tony McEnery. (2006). A Glossary of Corpus Linguistics. Edinburgh, Edinburgh UP.
Atkins, Sue, Jeremy Clear and Nicholas Ostler. (1991) “Corpus Design Criteria”. http://www.natcorp.ox.ac.uk/archive/vault/tgaw02.pdf
Biber, Douglas (1993). “Representativeness in Corpus Design”. Literary and Linguistic Computing, 8 (4): 243-257. http://llc.oxfordjournals.org/content/8/4/243.abstract
Biber, Douglas, Susan Conrad and Randi Reppen (1998). Corpus Linguistics: Investigating Language Structure and Use. Cambridge: Cambridge UP.
Granger, Sylviane, Joseph Hung and Stephanie Petch-Tyson. (2002). Computer Learner Corpora, Second Language Acquisition and Foreign Language Teaching. Amsterdam: John Benjamins.
Hoey, Michael, Michael Stubbs, Michaela Mahlberg, and Wolfgang Teubert. (2011). Text, Discourse and Corpora. London: Continuum.
Hunston, S. (2002). Corpora in applied linguistics. Cambridge: Cambridge University Press.
Mahlberg, Michaela. (2013). Corpus Stylistics and Dickens’ Fiction. London: Routledge.
McEnery, T. and Hardie, A. (2012). Corpus Linguistics: Method, theory and practice. Cambridge: Cambridge UP.
O’Keeffe, Anne and Michael McCarthy, eds. (2010). The Routledge Handbook of Corpus Linguistics. London: Routledge.
Sinclair, John and Ronald Carter. (2004). Trust the Text. London: Routledge.
Sinclair, John. (1991) Corpus Concordance Collocation. Oxford: Oxford UP.
Wynne, M (ed.) (2005). Developing Linguistic Corpora: a Guide to Good Practice. Oxford: Oxbow Books. http://www.ahds.ac.uk/creating/guides/linguistic-corpora/

1.2 Scholarly Journals
Corpora http://www.euppublishing.com/journal/cor
ICAME http://icame.uib.no/journal.html
IJCL https://benjamins.com/#catalog/journals/ijcl
Literary and Linguistic Computing http://llc.oxfordjournals.org/

1.3 Externally compiled bibliographies and resources
David Lee’s Bookmarks for corpus-based linguistics http://www.uow.edu.au/~dlee/CBLLinks.htm
Costas Gabrielatos has been compiling a bibliography of Critical Discourse Analysis using corpora, 1982-present https://www.edgehill.ac.uk/englishhistorycreativewriting/staff/dr-costas-gabrielatos/?tab=docs-bibliography
Members of the corpus linguistics working group UCREL at Lancaster University have compiled some of their many publications here http://ucrel.lancs.ac.uk/pubs.html; see also their LINKS page http://ucrel.lancs.ac.uk/links.html
Michaela Mahlberg is one of the leading figures in corpus stylistics (especially of interest if you want to work on literary texts) http://www.michaelamahlberg.com/publications.shtml; in 2006 she helped compile a corpus stylistics bibliography (pdf) with Martin Wynne.
Lots of work is done on Second Language Acquisition using learner corpora. Here’s a compendium of learner corpora http://www.uclouvain.be/en-cecl-lcworld.html

Corpora-List (mailing list) http://torvald.aksis.uib.no/corpora/
CorpusMOOC https://www.futurelearn.com/courses/corpus-linguistics, run out of Lancaster University, is an amazingly thorough resource. Even if you can’t do everything in their course, there’s lots of step-by-step how-tos, videos, notes, readings, and help available for everyone from experts to absolute beginners.

1.4 Compiled Corpora
Xiao, Z. (2009). Well-Known and Influential Corpora, A Survey http://www.lancaster.ac.uk/staff/xiaoz/papers/corpus%20survey.htm, based on Xiao (2009), “Theory-driven corpus research: using corpora to inform aspect theory”. In A. Lüdeling & M. Kytö (eds) Corpus Linguistics: An International Handbook [Volume 2]. Berlin: Mouton de Gruyter. 987-1007.
Various Historical Corpora http://www.helsinki.fi/varieng/CoRD/corpora/index.html
Oxford Text Archive http://ota.ahds.ac.uk/
Linguistic Data Consortium http://catalog.ldc.upenn.edu/
CQPWeb, a front end to various corpora https://cqpweb.lancs.ac.uk/
BYU Corpora http://corpus.byu.edu/
NLTK Corpora http://nltk.googlecode.com/svn/trunk/nltk_data/index.xml

1.5 DIY Corpora (some work required)
Project Gutenberg http://gutenberg.org
LexisNexis Newspapers https://www.lexisnexis.com/uk/nexis/
LexisNexis Law https://www.lexisnexis.com/uk/legal
BBC Script Library http://www.bbc.co.uk/writersroom/scripts

1.6 Concordance and Other software
No one piece of software is better than another, though some are better at certain things than others. Much comes down to personal taste, like Firefox vs Chrome or Android vs iPhone. AntConc, which is what I use, is great, but it is far from the only software available. (Note that these may require a licensing fee.)
AntConc http://www.laurenceanthony.net/software/antconc/
Wordsmith http://lexically.net/
Monoconc http://www.monoconc.com/
CasualConc https://sites.google.com/site/casualconc/
Wmatrix http://ucrel.lancs.ac.uk/wmatrix/
SketchEngine http://www.sketchengine.co.uk/
R http://www.rstudio.com/ide/docs/using/source (for the advanced user)
200+ software resources for corpus analysis https://corpus-analysis.com/
Anthony, Laurence. (2013). “A critical look at software tools in corpus linguistics.” Linguistic Research 30(2), 141-161.

1.7 Annotation
You may want to annotate your corpus for certain features, such as author, location, specific discourse markers, parts of speech, transcription, etc. Some of the compiled corpora might come with included annotation.
Text Encoding Initiative http://www.tei-c.org/index.xml
A Gentle Introduction to XML http://www.tei-c.org/release/doc/tei-p5-doc/en/html/SG.html
Hardie, A (2014) “Modest XML for Corpora: Not a standard, but a suggestion”. ICAME Journal 38: 73-103.
UAM Corpus Tool does both concordance work and annotation http://www.wagsoft.com/CorpusTool/
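
As a rough illustration of what this kind of annotation can look like, here is a minimal Python sketch using only the standard library. It wraps a few words in XML, attaching text-level metadata (author, location) and a part-of-speech label to each word. The element and attribute names (text, w, author, location, pos), the sample sentence, and the hand-assigned tags are all made up for illustration; they are not TEI, CLAWS, or any other standard scheme.

```python
import xml.etree.ElementTree as ET


def annotate(tokens_with_pos, author, location):
    """Wrap (word, pos) pairs in a <text> element carrying metadata attributes."""
    # Text-level metadata goes on the root element (hypothetical attribute names)
    text = ET.Element("text", {"author": author, "location": location})
    for word, pos in tokens_with_pos:
        # Each word gets its own <w> element with a part-of-speech attribute
        w = ET.SubElement(text, "w", {"pos": pos})
        w.text = word
    return text


# Hypothetical sample data, tagged by hand for the sake of the example
sample = [("Trust", "VERB"), ("the", "DET"), ("text", "NOUN")]
doc = annotate(sample, author="anon", location="unknown")

# Serialise the annotated text to an XML string
xml_string = ET.tostring(doc, encoding="unicode")
print(xml_string)

# Once annotated, features become searchable: collect every word tagged NOUN
nouns = [w.text for w in doc.iter("w") if w.get("pos") == "NOUN"]
print(nouns)
```

In practice a tagger such as the ones listed below would assign the part-of-speech labels, and an established scheme such as TEI would supply the element names.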

1.7.1 Linguistic Annotation
Natural Language Toolkit http://nltk.org & the NLTK book http://www.nltk.org/book/ch01.html
Stanford NLP Parser http://nlp.stanford.edu/software/corenlp.shtml (includes Named Entity Recognition, semantic parser, and grammatical part-of-speech tagging)
CLAWS, a part of speech tagger http://ucrel.lancs.ac.uk/claws/
USAS, a semantic tagger http://ucrel.lancs.ac.uk/usas/

1.8 Statistics Help
1.8.1 Not Advanced
Wikipedia http://wikipedia.com (great for advanced concepts written for the non-mathy type)
Log Likelihood, explained http://ucrel.lancs.ac.uk/llwizard.html
AntConc Videos https://www.youtube.com/user/AntlabJPN
WordSmith Getting Started Files http://www.lexically.net/downloads/version6/HTML/index.html?getting_started.htm
Oakes, M. (1998): Statistics for Corpus Linguistics. Edinburgh: Edinburgh University Press.
Baroni, M. and S. Evert. (2009): “Statistical methods for corpus exploitation”, in A. Lüdeling and M. Kytö (eds.), Corpus Linguistics: An International Handbook Vol. 2. Berlin: de Gruyter. 777-803.

1.8.2 Advanced
Stefan Th. Gries’ publications: http://www.linguistics.ucsb.edu/faculty/stgries/research/overview-research.html
Adam Kilgarriff’s publications: Pre-2009 http://www.kilgarriff.co.uk/publications.htm Post-2009 https://www.sketchengine.co.uk/documentation/wiki/AK/Papers
Baayen, R.H. (2008). Analyzing Linguistic Data: A Practical Introduction to Statistics Using R. Cambridge: Cambridge University Press.

heather froehlich // last updated 13 June 2019