A corpus is a collection, or body, of language. Though usually text-based, corpora (the plural of corpus) can include collections of spoken language as well. In fact, some of the most popular examples of corpora include TV news and U.S. Supreme Court transcripts. Other collections include religious texts, academic papers, Wikipedia, and, definitely the largest of all corpora, the Internet. Using a corpus to learn vocabulary can be a much more active experience than traditional, passive approaches.
The Advantages of Corpora
One of the great advantages of a corpus is that it presents language in context. A listing of every occurrence of a word together with the text surrounding it is known as a concordance, and it allows learners to recognize relationships among words, phrases, sentences, and paragraphs. In particular, this extended context allows us to see collocations, or the connections between words in the various ways they may be used. For example, we can get a better idea of which adjectives are commonly associated with a particular noun and which prepositions are associated with a particular verb.
If you think of how you may use a dictionary to learn new words, you realize that there is typically a single sentence that serves as the example for any particular word. With a corpus, you may have dozens or even hundreds of examples. Further, these are likely to be authentic language rather than the one contrived sentence that is likely to be included in a dictionary. Having access to multiple authentic examples provides learners with lexical as well as grammatical models. Corpora may be most useful for encouraging learners to experiment with different sentence constructions.
Traditionally, corpora have been very expensive and time-consuming to construct, and this has limited their accessibility for learning purposes. As a result, corpora have been used primarily by researchers rather than language teachers. However, technology has made it easier to gather, code, and archive large bodies of text, and institutions and instructors have created numerous new collections of corpora, including collections of their own students’ work.
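For readers who like to see the idea in code, here is a minimal sketch of what a concordance tool does under the hood. It is not how COCA or any particular corpus site is implemented; the sample sentences, the function names (tokenize, concordance, collocates), and the window sizes are all made up for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def concordance(tokens, keyword, window=3):
    """Return keyword-in-context lines with `window` words on each side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines


def collocates(tokens, keyword, window=2):
    """Count the words that appear within `window` tokens of the keyword."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == keyword:
            neighbours = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
            counts.update(neighbours)
    return counts


# Hypothetical three-sentence "corpus", invented for this example.
SAMPLE = (
    "She made a strong argument for change. "
    "He drank strong coffee before the strong argument began. "
    "They rely on strong coffee every morning."
)

tokens = tokenize(SAMPLE)
for line in concordance(tokens, "strong"):
    print(line)
print(collocates(tokens, "strong", window=1).most_common(3))
```

Even on this tiny sample, the collocate counts show that strong pairs with argument and coffee, which is the kind of pattern a learner can spot far more reliably in a corpus with thousands of examples.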
The Corpus of Contemporary American English
The Corpus of Contemporary American English (COCA) is easy to use through a freely available website, and it is a good example of conventional corpus-driven concordance tools. With a larger corpus like COCA or iWeb, users can find more examples of any given word, including numerous examples of context, collocations, and phraseology. This allows learners to observe varied, authentic examples of a word in order to develop a more sophisticated understanding of the diverse uses of that word or word root. Users can search for the various morphological forms of a word root by placing an asterisk at the end of the root. For example, you can see the results when I search for reach*:
The results allow me to select the form of the word I would like to explore further, so I select reached.
I can see 55,905 examples of this word in context.
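The asterisk search described above can be approximated in a few lines of Python. This is only a sketch over a made-up sample text, not the COCA interface or its query engine; the function name wildcard_forms and the sample sentences are assumptions for illustration, and a trailing asterisk is treated here simply as "any ending."

```python
import re
from collections import Counter


def wildcard_forms(text, pattern):
    """Count each word form that matches a pattern; a trailing * means any ending."""
    if pattern.endswith("*"):
        regex = re.escape(pattern[:-1]) + r"[a-z]*"
    else:
        regex = re.escape(pattern)
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if re.fullmatch(regex, w))


# Hypothetical sample text, invented for this example.
SAMPLE = (
    "We reached the summit at noon. Reaching it took hours, "
    "and few climbers reach the top. She reaches further than most, "
    "but the rope was out of reach when they reached camp."
)

forms = wildcard_forms(SAMPLE, "reach*")
for form, count in forms.most_common():
    print(form, count)
```

The output groups the hits by word form, much as the search results let you pick reached from among the forms of reach before looking at its examples in context.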
Recently, COCA has launched a new English Corpora website that combines COCA with a number of other corpora, including a corpus of TV and movies, and the new Intelligent Web corpus (iWeb), which allows you to create a “virtual corpus” that is customized and still retains these powerful functions. Teachers and learners can gather a variety of texts into customized collections based on their own interests or around a particular academic topic. This can be useful if a class is organized around thematic topics or if students are preparing for a particular academic discipline. It can be particularly useful for disciplines that have unique writing conventions or incorporate a lot of technical jargon. These virtual corpora can be saved for continued use, and users can also save a history of their previous activity for future reference. The iWeb corpus includes 14 billion words that were systematically selected from across the Internet. The site offers users a lot of functionality for free as long as they make fewer than 250 queries per day. Additional searching and features are available as part of a paid individual or institutional site license.
Google can also be used as a basic concordance tool with the entire Internet as a corpus. However, a plain web search lacks the robust and sophisticated features of a tagged corpus. In a future entry, I will share some practical suggestions for using Google in this way.
Here are some additional resources:
- Corpora as an Authentic Resource of Language and Beyond
- CorpusEye is another large collection of different corpora
- TESOL Press Resource: Using Corpora for Language Learning and Teaching
- The Use of Corpora in the Vocabulary Classroom
- Corpora in English Language Teaching
- Giampieri, P. (2019). The web as corpus in ESL classes: A case study. International Journal of Language Studies, 13(2), 91–108.
How do you use corpora in your language classroom? Please share in the comments below.