Designing corpus-based materials

Learning how to connect, classify, categorise and analyse vocabulary are vital skills for vocabulary development. Using corpora allows learners to look at the topic areas which interest them.

The greatest advantage of corpus-based language teaching is that students can choose what they want to study. Every search option enables them to observe and analyse language across a wide range of contexts. Freely available corpora of specific registers, such as the Corpus of Contemporary American English (COCA), MICUSP, MICASE and the BNC, can make the search specific to a context. However, except in a few technology-enabled higher education institutions, corpus-based teaching is not practised consistently. Many teachers are aware that their textbooks use ‘authentic’ instances, but they are not able to relate this data to their local contexts of teaching. This is probably because many teachers are still unaware of how to design corpus-based materials. In this paper, I share some activities which can be useful in motivating learners to become language researchers.

Introduction

In a conventional language classroom, depending upon course requirements we design our activities around specific contexts of language use, such as newspapers, spoken texts and research articles. For example, in using newspaper registers, we attempt to expose our students to the ways (genres) the columnists or reporters produce meaningful discourses that focus on reporting, arguing, comparing and advertising. Since newspaper texts are ‘authentic’, they help the learners understand how people construct meaning in different social locations. More specifically, the students will be able to notice the similarities and differences between the genres, such as opinions and reports. Although this exposure to ‘complete texts’ enables learners to analyse the schematic structure of the moves in the genre and answer comprehension questions, it may not adequately help them notice and learn the register choices at the micro level in terms of collocations and patterns across a wide range of similar texts. It is important that students receive enough practice for the micro aspects, which mainly include the lexico-grammatical choices, to be able to produce meaningful texts. Corpus-based materials can be very useful in this context.

Corpus and context

The definition of a corpus (plural: corpora) provided by contemporary learners’ dictionaries, ‘a collection of spoken or written texts that are stored electronically for language research’ (Lea & Bradbery, 2020), is misleading. It tells us little about the nature of the texts included, except that they can be either spoken or written. We now have access to multi-million-word corpora such as COCA, MICUSP, and British Academic Written English (BAWE) (available on Sketch Engine) that classify the ‘texts’ into specific registers (spoken, written, classroom lectures, medical or engineering texts, textbooks, legal texts) or genres (narrative, persuasive essays, case studies, critiques, poems, letters). We also have access to some freely available corpus analysis tools such as AntConc (https://www.laurenceanthony.net/software/antconc/). If students or teachers wish to examine specific registers, they can build their own discipline-specific corpora using Sketch Engine or AntConc.

How can we use corpus data in language teaching/learning?

There are two pedagogical approaches to corpus-based teaching: direct and indirect. While the indirect application of corpora in pedagogy informs syllabus designers and materials producers about what to teach and when, direct application refers to the use of concordances (Data-Driven Learning) in actual teaching and learning (Römer, 2011). A concordance, as shown in Figure 1, is a collection of all occurrences of a word within a given corpus. Corpus analysis tools such as AntConc and WordSmith Tools can organize all the instances of a search word paradigmatically so that we can notice collocations and syntactic patterns easily (Rundell & Granger, 2007).

Figure 1 Concordances of the search term ‘concordance’

As can be seen, if we are interested in administering an ‘input processing task’ where the learner is pushed to make connections between a word and its associative meanings in context – for example, the use of academic vocabulary (Gardner & Davies, 2014) or modal auxiliaries like can, may, will – we can administer a concordance-based activity. This activity displays the examples vertically, which makes the pattern easier to see. The students read through the instances and guess the meaning(s) of the target word with the help of the co-text (words surrounding the target word) and the context of the instance. Similarly, in situations where our introspection fails to provide ‘a satisfactory account of word meaning and word behaviour’ (Rundell & Granger, 2007: 15), we can give students some example sentences (concordances) showing how that word is used in real-life contexts. The point is that when students are curious to learn, it is important that we provide them with opportunities to learn rather than straightforward answers. This is the crux of corpus-based teaching, which is popularly known as Data-Driven Learning (DDL). In DDL, we encourage learners to become researchers who find answers to their questions through a discovery process.
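The vertical display described above can be approximated in a few lines of Python. The following is only a minimal sketch of a keyword-in-context view, not the way AntConc or WordSmith Tools work internally; the function name `kwic`, the `width` parameter and the sample sentences are all made up for illustration.

```python
import re

def kwic(text, node, width=30):
    """Return one keyword-in-context line per occurrence of `node` in `text`."""
    text = " ".join(text.split())  # collapse line breaks and repeated spaces
    # Match the search word as a whole word, ignoring case.
    pattern = re.compile(r"\b" + re.escape(node) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so the search word lines up vertically.
        lines.append(f"{left:>{width}} [{match.group()}] {right}")
    return lines

# Invented sample text; a real activity would load texts from a corpus file.
sample = (
    "Patients may be discharged soon. The results may suggest a trend. "
    "You may not agree, but the data may help."
)

for line in kwic(sample, "may", width=20):
    print(line)
```

Because the search word sits in the same column on every line, learners can scan what comes immediately before and after it, which is the kind of noticing a concordance-based activity relies on.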

Data-driven learning

Direct application of corpora does not necessarily mean using raw concordances to discover rules in the language classroom. It refers to the extent of our reliance on real data and the way we use corpora to achieve our learning goals. According to O’Sullivan, the direct application of corpus instances can help learners with their cognitive skills such as ‘predicting, observing, noticing, thinking, reasoning, analysing, interpreting, reflecting, exploring, making inferences (inductively or deductively), focusing, guessing, comparing, differentiating, theorising, hypothesising, and verifying’ (O’Sullivan, 2007: 277). In a typical corpus-based task, our role as teachers is to help students frame their search focus and approach the corpora.

Design and use of corpus-based activities

The search word

The first stage in designing a corpus-based activity is to identify the search words. We can depend on frequency-based word lists, learner dictionaries or grammar books (Biber et al., 1999; Coxhead, 2000; Gardner & Davies, 2014; Simpson-Vlach & Ellis, 2010). One might ask why we should examine the most frequent words in a corpus. The answer is that the most frequently used words constitute at least 75-85% of the lexical words in a text and can act as a window through which to explore lexical relations and grammatical usage across registers. Moreover, the most frequently used words such as give, take, put, make, and walk tend to have multiple meanings when used in specific contexts, e.g. give up, give in, give into, put out, put up with (see Table 1). It is important that we make our learners notice the patterns associated with these simple and frequently used words, in order to help them develop fluency in speech as well as in writing.

 

take into account the
take advantage of the
take the form of
take part in the
take place in the
take the time to
take a closer look at
take on the role
take the lead in
take the place of
take account of the
take into consideration
take it for granted
take responsibility for the
take pride in their
take action against the
take an active role
take a long time
take precedence over the
take care of themselves
take control of the
take note of the
take control of their
take a back seat

Table 1 Concordances with the search word ‘take’
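A list like the one in Table 1 could be approximated by counting the word sequences that begin with a search word. The sketch below is an illustration under stated assumptions, not the procedure used to produce Table 1: the function name `clusters`, the four-word span and the sample sentences are invented, and the simple tokeniser ignores sentence boundaries.

```python
import re
from collections import Counter

def clusters(text, node, size=4):
    """Count every `size`-word sequence in `text` that begins with `node`."""
    # Very simple tokeniser: lower-case runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for i, token in enumerate(tokens):
        # Only count sequences that fit completely inside the text.
        if token == node and i + size <= len(tokens):
            counts[" ".join(tokens[i:i + size])] += 1
    return counts

# Invented sample text standing in for a much larger corpus.
sample = (
    "We take into account the cost. They take part in the debate. "
    "Teachers take into account the level of the class."
)

for phrase, freq in clusters(sample, "take").most_common():
    print(freq, phrase)
```

With a real corpus, the most frequent sequences at the top of the output are the ones worth turning into noticing activities.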

Search focus

There are several types of activities we can devise using these search words. Table 2 provides a classification of search words and their focus for corpus activities.

 

Classification | Sub-classification | Search words/phrases
Usage-based | British vs American | hunting vs shooting (meaning-focused); color vs colour (form-related)
Usage-based | Spoken vs written | guy vs child; look up vs examine
Usage-based | Contextual usage | almost – nearly; alone – lonely; aloud – loudly; also – too – as well; anyone – anybody
Vocabulary-based | Word formation: affixation (prefixes and suffixes) | anti-, ex-, inter-, dis-, bio-, kilo-, vis-; -ful, -ment, -tion, -ion, -ity, -ness, -ance/-ence, -ist
Vocabulary-based | Meaning-based (multiple meanings of a word) | Synonym-based: allow – permit – let – enable; register-specific: baby – infant
Vocabulary-based | Collocation | adjective + noun; noun + noun
Vocabulary-based | Lexical chunks and patterns | it is important that; on the other hand; as can be seen
Grammar-based | Nominalization (the clause that contains the nouns and the words that accompany them) | Example: factor, important factor, most important factor, single most important factor, the single most important factor
Grammar-based | Questions | Wh-words; primary and modal verbs + nouns/pronouns (did + noun; be-form + noun)
Grammar-based | Tense | Different verb phrases (was, had, have been, had been)
Discourse & genre-based | Cohesion: anaphoric (identify the referent); cataphoric (referent to be mentioned later) | Personal pronouns such as she, he, it, and they
Discourse & genre-based | Hedging | perhaps, probably, certainly, usually, always; verbs: believe, suggest, assume; modal verbs: may, could, might
Discourse & genre-based | | Compare and contrast: on the one hand, on the other hand, while, whereas
Discourse & genre-based | | Argumentative: counterargument: however, nonetheless, and nevertheless; addition: besides, moreover, in addition; conclusion: therefore, thus, hence; process: firstly, secondly, finally; then, followed by

Table 2 Classification of search words

Corpus-based activities

Basic corpus analysis activities aim at introducing the search database and its functions. Frequency and distribution information, range of the search word (across genres and registers), and the most frequent collocations of the node can be explored under basic search activities (see Table 3).

 

1. Parts of speech activity: Many nouns are uncountable and cannot have a plural form (e.g. gold, information). Using COCA, look up the following words and find out whether any of them has a plural form ending in ‘s’ or ‘es’. Words from AWL (Sub-list 1) (Coxhead, 2000)

Attitude, Capacity, Challenge, Analogy, Alternative, Academy, Author, Authority, Chart, Chapter, Colleague, Commission, Complement, Conduct, Conflict, Construct, Clause, Area

2. Word formation activity: Each of the following prefixes helps you form new nouns without changing their grammatical word class. Use the following prefixes with an asterisk (e.g., anti*) on the COCA search engine and find some new nouns.

Note: Do you consider the ‘under’ in understand and ‘ex’ in experience as ‘prefixes’? Identify only those words where they are prefixes.

Anti-, Arch-, Auto-, Bio-, Counter-, Ex-, Hyper-, In-, Kilo-, Mis-, Neo-, Re-, Sub-, Tele-, Under-

3. Collocation activity: The following five nouns are frequently used in academic texts. Using the BAWE corpus, list the most frequent collocates of each one in academic genres. Do any of them share collocates?

benchmark       framework       guideline       measure       criteria

Table 3 Corpus-based activities introducing the search database and its functions
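For teachers who build their own small corpus, the collocate search in Activity 3 can be imitated with a short script. This is a rough sketch under several assumptions: the function name `collocates`, the two-word window, the tiny stop-word list and the sample sentences are all invented for illustration, and raw frequency is used where corpus tools usually also offer statistical association measures.

```python
import re
from collections import Counter

# A tiny, hand-picked stop-word list so that grammatical words do not dominate.
STOPWORDS = {"a", "an", "the", "of", "for", "to", "in", "and", "is", "as"}

def collocates(text, node, window=2):
    """Count words occurring within `window` words to the left or right of `node`."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != node:
            continue
        start = max(0, i - window)
        neighbours = tokens[start:i] + tokens[i + 1:i + 1 + window]
        counts.update(word for word in neighbours if word not in STOPWORDS)
    return counts

# Invented sample text; a real activity would read a discipline-specific corpus.
sample = (
    "The report proposes a theoretical framework for analysis. "
    "A conceptual framework guides the study. "
    "The theoretical framework is outlined in chapter two."
)

print(collocates(sample, "framework").most_common(3))
```

Running the same function for each of the five nouns and comparing the resulting lists is one way for students to check whether the nouns share collocates.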

While these activities help students understand the search platform and list out the many options, they will not enable learning per se. We need to provide contexts and push learners to consult the corpus with a purpose, e.g. to revise a written draft of an essay, or to identify and use register-specific linguistic choices (see Table 4).

 

1. Look up the underlined words in COCA and decide whether each usage is acceptable. (Based on Collins COBUILD English Usage)

a. There was a number of/ number of chairs in the room.

b. They arrange things better in another/other countries.

c. We need many/more information.

d. We arrived at home/arrived home and I carried the suitcase up the stairs to my room.

e. That is a good/very good answer indeed.

2. Only one of the three modal verbs (may, should, could) is appropriate in all three contexts. (contexts from COCA)

a. Seven patients were in a critical condition, while the others were stable and two ____ be discharged soon, the commission said.

b. According to a report from the UN children’s agency UNICEF earlier this year, some 63 percent of children across the Middle East already ____ not read or understand a simple text by the age of 10.

c. The immediate effect on the cruise industry _____ have been far worse had the outbreak occurred during the Chinese summer period, when the number of cruise ships operating out of China more than doubles.

3. Using the surrounding co-text, fill in each blank with an appropriate grammatical form of the word ‘suggest’ (suggests, suggested, suggesting, and suggestions). For example, the first blank uses the adjectival form ‘suggested’.

Table 4 Activities for searching the corpus with a purpose

As can be seen, all three grammar activities rely on the corpus in some way, but they also differ in terms of their requirements: Activity 1 uses the ‘common errors’ (Rundell & Granger, 2007) listed in reference material to provide opportunities to consult a corpus, whereas Activity 2 draws on specific instances from a corpus to encourage learners to notice the differences between using different modal verbs. Activity 3 uses the traditional DIY (Do It Yourself) KWIC (Key-Word-In-Context) concordances to teach students how different members of a word family take on different roles in language use. We can administer Activities 2 and 3 as paper-based or classroom-based activities; however, Activity 1 needs students to access a corpus on a device. What is common in all three activities is that they provide opportunities to learn rather than supply answers.
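A paper-based version of Activity 3 can be prepared by blanking out the target forms in a set of corpus sentences. The sketch below shows one possible way to do this; the function name `make_gap_fill` is hypothetical, and the three sentences are invented stand-ins rather than actual COCA concordance lines.

```python
import re

def make_gap_fill(sentences, forms):
    """Blank out the first target form in each sentence.

    Returns a list of (gapped sentence, answer) pairs.
    """
    # Try longer forms first so that a longer form is never cut short.
    ordered = sorted(forms, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(form) for form in ordered) + r")\b",
        re.IGNORECASE,
    )
    items = []
    for sentence in sentences:
        match = pattern.search(sentence)
        if match:
            gapped = sentence[:match.start()] + "_____" + sentence[match.end():]
            items.append((gapped, match.group(1)))
    return items

forms = ["suggests", "suggested", "suggesting", "suggestions"]
# Invented sentences; in class these would be copied from concordance lines.
sentences = [
    "The suggested reading list is on the course page.",
    "This finding suggests a link between the two variables.",
    "Students offered several suggestions for the next unit.",
]

for number, (item, answer) in enumerate(make_gap_fill(sentences, forms), start=1):
    print(f"{number}. {item}   (answer: {answer})")
```

Printing the items without the answers gives the student worksheet, while the full output serves as the teacher's key.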

In other contexts, the entire course could be based on corpus analysis, such as the EAP writing course by Maggie Charles (Charles, 2012). In direct application ‘learners have direct access to concordances to find language rules for themselves’ (Yoon & Jo, 2014: 97). Language rules here refer to a wide range of aspects such as frequent collocates, register-specific bundles and academic genres.

Dictionary-based corpus activities

Reference materials such as usage guides, grammar books, activators and learner dictionaries are usually context independent. The entries in all these resources are alphabetically organized to facilitate quick access. We need to find ways to contextualize language use through meaning-focused and form-focused activities. Since modern learner’s dictionaries emphasize the use of real-life/corpus examples to explain and exemplify word usage, we can use the same principle to contextualize language learning. One activity we can consider is ‘match the dictionary definition with the contextual definition’: many of the 3000 most frequently used words in English have more than three definitions. Some online dictionaries, such as the Longman Dictionary of Contemporary English (LDOCE) and the Collins COBUILD English Dictionary (CCED), have skilfully utilized the virtual space by providing multiple instances of word use across contexts. This access to multiple comprehensible real-life instances can be fruitfully utilized in this activity (for activities see Adrian Underhill’s <http://www.macmillandictionaries.com/resources/e-lessons/e-lessons-archive>).

Conclusion

The process of compiling a corpus was once thought to be expensive, challenging and cumbersome. However, it has become so common these days that many language educators and EAP and ESP practitioners across the world are encouraging their students to build ad hoc, small, specialized corpora for their courses. This is not just a trend towards using ‘real’ or ‘authentic’ data in language education, but the need of the hour. Many corpus-based courses are redefining the role, nature and scope of language in meaning-making processes. Learners are encouraged to discover language use across the strata of linguistic choices, schematic structures (genres), and registers.

References

Biber D, Johansson S, Leech G, Conrad S & Finegan E (1999) Longman Grammar of Spoken and Written English. Pearson Education Limited.

Charles M (2012) “Proper vocabulary and juicy collocations”: EAP students evaluate do-it-yourself corpus-building. English for Specific Purposes 31 (2) 93–102. https://doi.org/10.1016/j.esp.2011.12.003

Coxhead A (2000) A New Academic Word List. TESOL Quarterly 34 (2) 213–238

Gardner D & Davies M (2014) A New Academic Vocabulary List. Applied Linguistics 35 (3) 305–327. https://doi.org/10.1093/applin/amt015

Lea D & Bradbery J (2020) Oxford Advanced Learner’s Dictionary (10th ed.). Oxford: Oxford University Press

O’Sullivan I (2007) Enhancing a process-oriented approach to literacy and language learning: The role of corpus consultation literacy. ReCALL 19 (3) 269–286.

Römer U (2011) Corpus research applications in second language teaching. Annual Review of Applied Linguistics 31 205–225. https://doi.org/10.1017/S0267190511000055

Rundell M & Granger S (2007) From corpora to confidence. English Teaching Professional 50 15–18

Simpson-Vlach R & Ellis NC (2010) An Academic Formulas List: New Methods in Phraseology Research. Applied Linguistics 31 (4) 487–512. https://doi.org/10.1093/applin/amp058

Yoon H & Jo JW (2014) Direct and indirect access to corpora: An exploratory case study comparing students’ error correction and learning strategy use in L2 writing. Language Learning and Technology 18 (1) 96–117

 

Vijayakumar Chintalapalli is currently a faculty member in BITS Pilani, India. He has a PhD in ELT from EFL University, India. He has designed and taught EGP and EAP courses in India and Saudi Arabia. His major interests include corpus-based language teaching, pedagogic lexicography, and English for academic purposes.

Email: c.vijayakumar@pilani.bits-pilani.ac.in