What Is a Corpus?
Corpse, marine corps, corporation, and corpulent all derive from the Latin word corpus, meaning body. That Latin word also exists, intact, in English, but rather than an anatomical body, it refers to a body of language. A corpus is a large collection of language samples, traditionally written, though corpora (the Latinate plural) of spoken language are now available as well.
Corpora in Language Teaching
The big benefit of using a corpus is that it’s data-driven, and that data is based on actual language usage. It’s pretty much descriptivist heaven.
When a professor was first explaining to me the value of corpus data, he used this example: if you look up the phrase par for the course, you’ll find a definition like “what is normal or expected in any given circumstances.” But if students depend only on definitions like that, from textbooks or dictionaries or teachers, they’re likely to miss some crucial information. Though that definition may be accurate, a quick corpus search reveals that the phrase is almost always used with a negative connotation, as in, “These tantrums are par for the course.”
As corpora have become more readily available and more representative of spoken, day-to-day language, they have become valuable tools for those in the field of TESOL. Most often it’s researchers and materials developers who use them, but there are some classroom applications for the corpus as well. Let’s look at the basics of how to use a corpus and check out a couple of introductory techniques for incorporating corpus data into your classroom.
Using a Corpus
When you use a corpus, you’re generally performing a search, just like you would in Google. The difference is that when you Google “kitten in a tree” you’re most likely looking for pictures of kittens in trees or advice on getting kittens out of trees. If you search for the phrase “kitten in a tree” in a corpus, what you’re looking for is instances of that actual phrase in use. The language is your end goal.
The Corpus of Contemporary American English (COCA) is my go-to. Search for a word or phrase just like you would anywhere else:
If you search for the word tree, your results will be every instance of tree found in the corpus. You could do the same for a string of words.
Along the left you have the year, type of source, and specific source, and then along the right is the actual context in which the word was found. This isn’t terribly helpful yet, though.
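If it helps to see what a results list like this boils down to, here is a minimal Python sketch of a keyword-in-context search. It is not COCA’s code or data: the three-sentence corpus, the source labels, and the function name concordance are all invented for illustration.

```python
import re

# Toy stand-in for a corpus: (year, source type, text).
# These entries are invented for illustration, not taken from COCA.
TOY_CORPUS = [
    (2012, "spoken", "The kitten in a tree would not come down."),
    (2015, "newspaper", "Firefighters rescued a kitten in a tree on Elm Street."),
    (2019, "fiction", "She planted a tree behind the house."),
]


def concordance(corpus, phrase, width=30):
    """Return one row per occurrence of `phrase`, with its surrounding context."""
    # \b keeps the match to whole words, so "tree" does not match "street".
    pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
    rows = []
    for year, source, text in corpus:
        for match in pattern.finditer(text):
            left = text[max(0, match.start() - width):match.start()]
            right = text[match.end():match.end() + width]
            rows.append((year, source, left, match.group(0), right))
    return rows


# Metadata on the left, context on the right, as in the results described above.
for year, source, left, hit, right in concordance(TOY_CORPUS, "tree"):
    print(f"{year}  {source:<10} {left:>30} [{hit}] {right}")
```

Swapping "tree" for "kitten in a tree" in the last loop searches for the whole string of words instead of a single word.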
Let’s say I’m an English learner, struggling with prepositions. I want to know which prepositions commonly precede the word table. Let’s select the Collocates tab:
We’re also going to select “prep.ALL” from the [POS] dropdown next to “Collocates.” What this means is we’re searching for the prepositions that most commonly occur with the word table.
The scale of numbers below tells the engine how far from the search word to look. By selecting the 2 on the left, I’m searching only for prepositions that occur one or two words before the word table. Any prepositions outside of that range or after table will be ignored.
Here are the results:
That’s just a very brief primer, but corpora are extremely powerful tools for getting loads of language data. There are some tutorial videos out there to get you more familiar with using corpora.
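To make the collocate search described above a little more concrete, here is a rough Python sketch of the same idea: count the prepositions that appear one or two words before table. It is only an approximation under stated assumptions. The sentences are invented, the short preposition list is a hand-picked stand-in for the corpus’s part-of-speech tagging, and the function name preceding_collocates is hypothetical.

```python
import re
from collections import Counter

# Hand-picked prepositions; a real corpus relies on part-of-speech tags instead.
PREPOSITIONS = {"on", "at", "under", "around", "across", "from", "to", "in", "off"}

# Invented sentences standing in for corpus text.
SENTENCES = [
    "She left her keys on the table.",
    "We sat at the table for hours.",
    "The cat hid under the kitchen table.",
    "He put the plates on the table.",
    "The table in the hall is new.",
]


def preceding_collocates(sentences, node, span=2, candidates=PREPOSITIONS):
    """Count candidate words found up to `span` words before `node`."""
    counts = Counter()
    for sentence in sentences:
        tokens = re.findall(r"[a-z']+", sentence.lower())
        for i, token in enumerate(tokens):
            if token == node:
                # Look only at the words before the node, never after it.
                window = tokens[max(0, i - span):i]
                counts.update(w for w in window if w in candidates)
    return counts


print(preceding_collocates(SENTENCES, "table").most_common())
```

With these five toy sentences, the script prints [('on', 2), ('at', 1)]. The under in the third sentence sits three words before table, so it falls outside the window, and the in that follows table in the last sentence is ignored as well.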
Introducing ELs to the Corpus
With beginners, I recommend doing the work for them. When presenting students with new vocabulary words, print out the results you get, and help them to notice important patterns related to syntax and collocation.
As students progress, show them how the corpus works, perhaps using an LCD projector while you search, narrating the process as you go.
Once students can do some limited searching on their own, give them assignments they can complete using the corpus. For instance, design a cloze (fill-in-the-blank) activity based on simple corpus searches that you have performed.
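If you are comfortable with a little scripting, a cloze activity can be drafted from saved search results in a few lines. The Python sketch below is one possible approach, not a standard tool: the example lines are invented, and the function name make_cloze is hypothetical. In practice you would paste in lines copied from your own corpus searches.

```python
import re


def make_cloze(sentences, target, blank="_____"):
    """Blank out each whole-word occurrence of `target`, skipping lines without it."""
    pattern = re.compile(r"\b" + re.escape(target) + r"\b", re.IGNORECASE)
    items = []
    for sentence in sentences:
        if pattern.search(sentence):
            items.append(pattern.sub(blank, sentence))
    return items


# Invented example lines standing in for copied corpus results.
lines = [
    "She left her keys on the table.",
    "We sat at the table for hours.",
    "On Sundays we visit my aunt.",
]

# Number the items so they can go straight onto a handout.
for number, item in enumerate(make_cloze(lines, "on"), start=1):
    print(f"{number}. {item}")
```

Lines that do not contain the target word are left out, so the same list of results can be reused to build a separate exercise for each preposition.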