Basics of Using Corpora

A corpus is a collection, or body, of language. Though usually text-based, corpora (the plural of corpus) can include collections of spoken language as well. In fact, some of the most popular examples of corpora include TV news and U.S. Supreme Court transcripts. Other collections include religious texts, academic papers, Wikipedia, and, arguably the largest corpus of all, the Internet. Using a corpus to learn vocabulary can be a much more active experience than traditional, passive approaches.

The Advantages of Corpora

One of the great advantages of a corpus is that it presents language in context. A display of a word together with its surrounding context is known as a concordance, and it allows learners to recognize relationships among words, phrases, sentences, and paragraphs. In particular, this extended context allows us to see collocations, or the connections between words, in the various ways they may be used. For example, we can get a better idea of which adjectives are commonly associated with a particular noun and which prepositions are associated with a particular verb.

If you think of how you may use a dictionary to learn new words, you realize that there is typically a single sentence that serves as the example for any particular word. With a corpus, you may have dozens or even hundreds of examples. Further, these are likely to be authentic language rather than the one contrived sentence that is likely to be included in a dictionary. Having access to multiple authentic examples provides learners with lexical as well as grammatical models. Corpora may be most useful for encouraging learners to experiment with different sentence constructions.
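To make the ideas of a concordance and a collocation more concrete, here is a minimal Python sketch. It is not how any particular corpus tool works. The sample sentences are invented for illustration, and the function names (tokenize, concordance, right_collocates) are hypothetical. The tokenizer also ignores sentence boundaries, which a real tool would probably handle more carefully.

```python
import re
from collections import Counter

# Tiny invented sample; a real corpus would hold millions of words.
SAMPLE_TEXT = (
    "She reached a decision after a long debate. "
    "The committee reached a decision on Monday. "
    "They finally reached an agreement with the union. "
    "He reached for the book on the top shelf."
)

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def concordance(tokens, keyword, width=3):
    """Return keyword-in-context lines with `width` words on each side."""
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{token}] {right}")
    return lines

def right_collocates(tokens, keyword, span=2):
    """Count the words that appear within `span` words after the keyword."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == keyword:
            counts.update(tokens[i + 1:i + 1 + span])
    return counts

tokens = tokenize(SAMPLE_TEXT)
for line in concordance(tokens, "reached"):
    print(line)
print(right_collocates(tokens, "reached").most_common(3))
```

Even in four sentences, the counts suggest that "decision" tends to follow "reached," which is the kind of pattern learners can notice far more reliably across hundreds of authentic examples.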

Traditionally, corpora have been very expensive and time-consuming to construct, and this has limited their accessibility for learning purposes. As a result, corpora have been used primarily by researchers rather than language teachers. However, technology has made it easier to gather, code, and archive large bodies of text, and institutions and instructors have created numerous new corpora, including collections of their own students’ work.

The Corpus of Contemporary American English

The Corpus of Contemporary American English (COCA) is easy to use through a freely available website, and it is a good example of a conventional corpus-driven concordance tool. With a larger corpus like COCA or iWeb, users can find more examples of any given word, including numerous examples of context, collocations, and phraseology. This allows learners to observe a range of authentic examples in order to develop a more sophisticated understanding of the varied uses of a word or word root. Users can search for the various morphological forms of a word root by adding an asterisk at the end of the root. For example, I can search for reach* to see the forms that begin with that root.

The results allow me to select the form of the word I would like to explore further, so I select reached. I can then see 55,905 examples of this word in context.
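For readers curious about what an asterisk search is doing, the following Python sketch imitates it with a regular expression. This only approximates the idea and is not COCA's actual query engine. The sample text and the function names (wildcard_to_regex, count_forms) are assumptions made for illustration.

```python
import re
from collections import Counter

# Invented sample text; "Preaching" is included to show it is NOT matched.
SAMPLE_TEXT = (
    "The team reached the summit. We reach out to alumni every year. "
    "Her voice reaches the back row. They are reaching for new markets. "
    "Nobody had reached him by noon. Preaching is a different word."
)

def wildcard_to_regex(pattern):
    """Turn a query such as 'reach*' into a whole-word regular expression."""
    # Escape the query, then let each asterisk match any run of word characters.
    escaped = re.escape(pattern).replace(r"\*", r"\w*")
    return re.compile(rf"\b{escaped}\b", re.IGNORECASE)

def count_forms(text, pattern):
    """Count each distinct word form that matches the wildcard query."""
    regex = wildcard_to_regex(pattern)
    return Counter(match.group(0).lower() for match in regex.finditer(text))

for form, count in count_forms(SAMPLE_TEXT, "reach*").most_common():
    print(form, count)
```

The list of forms and counts plays the same role as the results page described above: it shows which forms exist so that one of them can be chosen for a closer look.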

Recently, COCA launched a new English Corpora website that combines COCA with a number of other corpora, including a corpus of TV and movies and the new Intelligent Web corpus (iWeb), which allows you to create a “virtual corpus” that is customized and still retains these powerful functions. Teachers and learners can gather a variety of texts into customized collections based on their own interests or around a particular academic topic. This can be useful if a class is organized around thematic topics or if students are preparing for a particular academic discipline. It can be particularly useful for disciplines that have unique writing conventions or incorporate a lot of technical jargon. These virtual corpora can be saved for continued use, and users can also save a history of their previous activity for future reference.

The iWeb corpus includes 14 billion words that were systematically selected from across the Internet. The site offers a lot of functionality for free as long as you make fewer than 250 queries per day. Additional searching and features are available as part of a paid individual or institutional site license.
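The idea of a customized collection can also be sketched in a few lines of Python. This is only an illustration of the concept, not a description of how iWeb builds virtual corpora. The document names, texts, and function names (build_virtual_corpus, word_frequencies) are all hypothetical.

```python
import re
from collections import Counter

# Invented mini-library of texts, keyed by a made-up document name.
DOCUMENTS = {
    "biology_notes": "The cell membrane regulates transport. Each cell contains enzymes.",
    "economics_notes": "Market demand affects price. Supply and demand set the price.",
    "lab_report": "The enzymes were added to each cell culture before transport.",
}

def build_virtual_corpus(documents, topic_words):
    """Keep only the documents that mention at least one topic word."""
    selected = {}
    for name, text in documents.items():
        words = set(re.findall(r"[a-z]+", text.lower()))
        if words & set(topic_words):
            selected[name] = text
    return selected

def word_frequencies(corpus):
    """Count how often each word occurs across the selected documents."""
    counts = Counter()
    for text in corpus.values():
        counts.update(re.findall(r"[a-z]+", text.lower()))
    return counts

biology = build_virtual_corpus(DOCUMENTS, {"cell", "enzymes"})
print(sorted(biology))
print(word_frequencies(biology).most_common(5))
```

A class preparing for a biology course could search only the selected texts, so the examples and frequencies reflect the vocabulary of that discipline rather than general English.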

Google can also be used as a basic concordance tool with the entire Internet as a corpus. However, this approach lacks the robust and sophisticated functions of a tagged corpus. In a future entry, I will share some practical suggestions for such use.

How do you use corpora in your language classroom? Please share in the comments below.

About Greg Kessler

Greg Kessler is professor of instructional technology in the Patton College of Education at Ohio University. He has written numerous books, articles, book chapters, and other publications. He has delivered keynote and featured talks around the world. His research addresses technology, learning, and language use with an emphasis on teacher preparation. He has held numerous leadership positions, including as Ohio TESOL president, CALICO president, and TESOL CALL IS chair. He is the editor of the CALICO book series, Advances in CALL Practice & Research, the Language Learning & Technology journal forum, Language Teaching & Technology, and many other comprehensive collections.
This entry was posted in TESOL Blog.

2 Responses to Basics of Using Corpora

  1. Greg Kessler says:

    Hi Ayanna,

    I can’t find any upcoming webinars about using corpora, but it is an increasingly popular topic, so you can certainly find opportunities to learn more. Browsing the program from TESOL 2019, I found 42 sessions about using corpora. You may want to check those out and see if the presenters have anything to share. IATEFL is also offering a webinar on September 7th that focuses on corpus-driven error correction.

  2. Ayanna Cooper says:

    Hi Greg,
    Thanks for writing such an informative post. Do you know of any webinars on this topic?
