By Joyce Lim, Faculty of Education, University of Cambridge

As a PhD student, one of the first questions I get asked is “What is your research on?” When I respond with “corpus linguistics,” many people either raise their eyebrows or ask me what that is. The brief explanation is this: a corpus is a collection of written or spoken texts that offers systematic insight into how a language is used by the population the corpus represents. Corpus linguistics, however, encompasses much more than scrutinizing the use of certain words and expressions. It is a powerful tool that can offer new insight into both language teaching and learning, and it has made a tremendous impact not only in linguistics but also in various areas of education and material development.

So, I am going to share a bit of what I have learned about corpora and corpus linguistics, and about the many ways in which they can be used to improve language learning and teaching.

What is a corpus?

Simply put, a corpus is a systematic collection of machine-readable written and/or spoken texts, as used in the real world, stored in digital form. It is a sample of texts that represents a given language or language variety (Sinclair, 2005). A corpus can also be seen as a tool for testing hypotheses about how a language is used, identifying language patterns, and comparing different users of a language. Its main purpose is to help us understand how sounds, words, constructions, and sentences are actually used. One example is the British National Corpus (BNC), a 100-million-word corpus containing written and spoken language samples from a wide range of sources. It was designed to represent British English, both spoken and written, as used in the late 20th century. Hence, the BNC contains only British English texts and not other varieties of English (e.g. American English).

What is corpus linguistics?

Corpus linguistics is concerned with the ways in which language is used in real-world contexts. In other words, it is the study of a language through a sample of its natural uses. Through corpora, patterns of lexical and grammatical features can be identified, and these patterns can tell us something about learners’ language performance. Corpora are commonly used to compare the language of native and non-native speakers, to better understand how second language learners develop in comparison with those who speak the language as their first language. Because of their systematic nature, corpora have become one of the most robust tools available to researchers. For instance, researchers no longer need to rely solely on native speakers’ intuitions about the language. Rather, they can turn to a corpus, which can either confirm or refute their hypotheses about features of the language.

How can we use a corpus in teaching?

Not only does a corpus have its merits in research, it has many great uses in teaching as well. Below are three major uses of corpora in teaching:

Material development

Lexicographers, the people who compile dictionaries, have long been proponents of corpora. They observe how language is used in real-life data (e.g. newspapers, tweets, blogs) in order to provide an accurate account of vocabulary use. By scrutinizing recent trends in the use of words and expressions, lexicographers are able to make suitable amendments (e.g. adding new words). In a similar vein, textbook writers have also favoured corpora because they contain authentic language. When textbooks are built from corpora, learners are exposed to activities that contain language as used in the real world, rather than in a textbook world.

Syllabus design

When designing a syllabus, teachers must decide what to include in the curriculum. This can be daunting, especially when there is so much to teach and so little time. Information about frequency and register can help in deciding what is relevant and important. For instance, an instructor can conduct a corpus analysis to identify the language items that are most relevant to the target register. Teachers can thus use a corpus as a reference when deciding which language features to focus on.

Example of frequency: Word count of verbs ending in -ed in the BNC (Source:
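To make the idea of a frequency count concrete, here is a minimal Python sketch. It is not how the BNC figures were produced. The sample text is a hypothetical mini-corpus invented for illustration, and matching on the spelling "-ed" is only a rough proxy for regular past-tense verbs (it would also catch words such as "hundred").

```python
import re
from collections import Counter


def ed_word_frequencies(text):
    """Count word forms ending in -ed (a rough spelling-based proxy for verbs)."""
    # Lowercase the text and keep alphabetic tokens only.
    tokens = re.findall(r"[a-z]+", text.lower())
    # Skip very short words such as "bed" or "red".
    return Counter(t for t in tokens if len(t) > 3 and t.endswith("ed"))


# Hypothetical mini-corpus, invented for illustration only.
sample = (
    "She walked to school and talked to her friend. "
    "They walked home and played chess. He talked, she listened."
)

freqs = ed_word_frequencies(sample)
for word, count in freqs.most_common():
    print(word, count)
```

A teacher could run the same kind of count over texts from the target register to see which forms are frequent enough to deserve class time.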

Classroom activities

Corpora can serve as a powerful tool in the classroom as well. For instance, students can use a corpus together with a concordancing program, which produces a concordance (a list of instances of a given word along with its immediate context), to discover for themselves how language is used. This promotes autonomous learning, in which students no longer rely on the teacher’s knowledge but discover language patterns on their own. Furthermore, teachers can use a concordance to create activities that teach learners the different ways in which a word is used.

Example of concordance: Instances of the conjunction ‘and’ in the English Web 2015 Corpus (Source:

The application of corpora extends beyond analysing language patterns: corpora can be a powerful tool in both teaching and material development. In particular, corpus linguistics promotes a bottom-up approach to teaching in which learners are given the responsibility to discover patterns themselves. Using a corpus provides more interesting, relevant, and goal-oriented materials for learners. Aside from language teaching, corpora can also be used to analyse political language (see Partington, 2012) or even forensic texts (see Coulthard, 2013). In a world where the use of language constantly changes the ways in which people think, interact, and express themselves, corpora can act as a powerful tool for exploring how language shapes our society today.


Coulthard, M. (2013). On the use of corpora in the analysis of forensic texts. International Journal of Speech, Language, and the Law, 1(1), 27-43.

Partington, A. (2012). Corpus analysis of political language. In C. Chapelle (Ed.), The Encyclopedia of Applied Linguistics (pp. 1-8). Oxford: Blackwell.

Sinclair, J. (2005). Corpus and Text: Basic Principles. In M. Wynne (Ed.), Developing Linguistic Corpora: A Guide to Good Practice (pp. 1-16). Oxford: Oxbow Books.

Joyce Lim is a third-year PhD student at the Faculty of Education and Hughes Hall. Her research project involves observing syntactic properties of second language (English) writing across different language proficiencies of Korean students. Her research interests include corpus linguistics, second language writing development, as well as vocabulary development. She’s been involved in a number of research projects as well as conferences including Kaleidoscope 2018 and 2019 and Cambridge Language Sciences Symposium 2018 and 2019.

Posted by:fersacambridge
