Cleaning the Output

As already mentioned, the output of an automated term extraction (ATE) tool contains term candidates. They are not terms per se; that is, not all of them will be useful to include in your termbase. You have to check them and make that determination. Some term candidates are not useful at all and should be ignored; others will be very useful and should be retained. The process of discarding some items and keeping others is referred to as cleaning. It is important to keep the purpose of the termbase in mind when cleaning the output. (Refer to topic 1.2.2 for a review of some key criteria for determining termhood in your organization.)

The term candidates that are deemed to be of little use and are subsequently discarded are referred to as noise. Examples of noise include function words (articles, prepositions, pronouns, etc.), general lexicon words (common words such as person, information, date, etc.), and adjectives or adverbs of little use (important, new, quickly, again, etc.). Numbers and other non-terms, such as code strings, are also considered noise.

All term extraction tools produce noise. The question is how much? Some tools produce so much noise that cleaning the output takes more time than manually identifying terms without the tool! These tools are clearly not worth using at all. Many of the first term extraction tools to emerge on the market some years ago were of this kind, so the first experience that some terminologists had with ATE was very negative.

Today, there are better term extraction tools. In particular, the use of grammatical rules to exclude function words and the use of a reference corpus to statistically measure a term’s saliency have greatly improved the output. Still, to minimize the cleaning effort, you should use a good stopword list. Another effective method is to add your own items to the stopword list, such as the noise you cleaned from a previously extracted output. If you remove an item from the output, why not add it to your stopword list so that it will be automatically removed the next time you run the tool?
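The stopword feedback loop described above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the candidates and stopwords are plain lists of strings, matching is case-insensitive, and the function names clean_candidates and update_stopwords are hypothetical (they do not belong to TermoStat or any other tool).

```python
def clean_candidates(candidates, stopwords):
    """Return the candidates that are not on the stopword list (case-insensitive)."""
    blocked = {w.lower() for w in stopwords}
    return [c for c in candidates if c.lower() not in blocked]


def update_stopwords(stopwords, removed_items):
    """Add manually removed items to the stopword list for the next run."""
    return sorted({w.lower() for w in stopwords} | {w.lower() for w in removed_items})


# Illustrative data only; a real stopword list would be much longer.
stopwords = ["the", "of", "important"]
candidates = ["wicket", "the", "key difference", "bowler", "Important"]

# First run: only items already on the stopword list are filtered out.
kept = clean_candidates(candidates, stopwords)

# Suppose you then remove "key difference" by hand and record that decision.
stopwords = update_stopwords(stopwords, ["key difference"])

# Next run: the same noise is removed automatically.
kept_next_run = clean_candidates(candidates, stopwords)
print(kept)           # ['wicket', 'key difference', 'bowler']
print(kept_next_run)  # ['wicket', 'bowler']
```

In practice you would load the stopword list from a file and save it after each cleaning session, in whatever format your extraction tool accepts.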

From the TermoStat extraction of our cricket text, the following term candidates could be considered members of what we call the general lexicon and are therefore of little interest. They should probably NOT be retained for the termbase:

  • key difference
  • painted line
  • bad weather
  • single day
  • flat surface

Sometimes the distinction between words from the general lexicon and real terms is not clear-cut. We have listed painted line above as a general lexicon expression to be deleted from the term candidate list. It’s just a line that has been painted. Another, similar extracted candidate is sticky wicket. Like painted line, it has the pattern adjective + noun, and although wicket is a domain-specific term, sticky is rather ordinary. Besides, wicket is already in our list, so maybe we don’t need sticky wicket. However, sticky wicket is indeed an interesting term because the unithood (strength of association between the words) is much stronger for sticky wicket than it is for painted line. By that we mean that this is an “expression” frequently used by cricket enthusiasts. What is also interesting about sticky wicket is that an examination of the context sentences reveals two synonyms: sticky dog and glue pot. This further attests to the likelihood that these terms refer to a distinct concept (whereas there are no synonyms for painted line). Any time you find synonyms, it is very important to include them in the concept entry in the termbase. Further research reveals that sticky wicket refers to a wicket whose surface is in a glutinous condition (after rain, for example), which affects the behaviour of the ball.
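To give a rough idea of how unithood can be quantified, the sketch below computes pointwise mutual information (PMI) for two adjacent words. PMI is used here only as one common association measure; the lesson does not say which measure TermoStat or any other tool applies. The token list is an invented toy text, not the cricket corpus, and the function name pmi is hypothetical.

```python
import math
from collections import Counter


def pmi(bigram, tokens):
    """Pointwise mutual information of an adjacent word pair in a token list."""
    w1, w2 = bigram
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_unigrams = len(tokens)
    n_bigrams = max(len(tokens) - 1, 1)

    p_pair = bigrams[(w1, w2)] / n_bigrams
    if p_pair == 0:
        # The pair never occurs together, so there is no association to measure.
        return float("-inf")
    p1 = unigrams[w1] / n_unigrams
    p2 = unigrams[w2] / n_unigrams
    return math.log2(p_pair / (p1 * p2))


# Invented toy text: "sticky" always precedes "wicket", whereas "painted"
# and "line" also occur with other words.
tokens = (
    "the sticky wicket slowed play . a sticky wicket helps spinners . "
    "the painted line and the painted fence and a white line marked "
    "the boundary line"
).split()

sticky_score = pmi(("sticky", "wicket"), tokens)
painted_score = pmi(("painted", "line"), tokens)
print(sticky_score > painted_score)  # True
```

A higher score means the two words occur together more often than their individual frequencies would predict. On a real corpus, such scores are only one piece of evidence and should be weighed alongside the context sentences and your knowledge of the subject field.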

Although determining what is a term and what is just a general expression can be quite subjective, having an understanding of the terminology of the subject field, coupled with statistics and contexts shown by good corpus analysis tools, can raise the reliability of this decision process.

Be aware of the possibility that a general lexicon word has a specialized meaning in your corpus. Consider the following terms taken from the cricket text. They all have special usage within the sport and should be retained as interesting terms:

  • century
  • delivery
  • over
  • bye
  • run
  • dinner
  • gardening

Cleaning involves not only removing unwanted term candidates but also consolidating families of related terms (so that you can later add relations), as well as adding new terms by resetting the boundaries of some multi-word term candidates. The idea is to get what you need, and this involves more than just deleting the noise. While cleaning, you should check the context sentences to verify that the boundaries of multi-word terms have been properly set. Sometimes you’ll need to remove, for example, a common premodifier (adjective) from a multi-word term. Adjusting the boundaries of a multi-word term by removing, or sometimes even adding, a word can result in a retained term that has a much higher frequency than the original one found by the tool. Sometimes you will need to do a quick search of the corpus to determine the frequency of the adjusted term candidate.
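The quick corpus search mentioned above can be approximated with a short script when the corpus is available as plain text. The sketch below counts whole-word, case-insensitive occurrences of a phrase. The function name phrase_frequency and the sample sentences are invented for illustration; a concordancer would normally do this job and would also handle inflected forms, which this sketch does not.

```python
import re


def phrase_frequency(phrase, text):
    """Count case-insensitive whole-word occurrences of a phrase in the text."""
    words = [re.escape(w) for w in phrase.split()]
    # Allow any whitespace (including line breaks) between the words.
    pattern = r"\b" + r"\s+".join(words) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


# Invented sample text standing in for the corpus.
corpus = (
    "The umpire checked the white popping crease. "
    "The batter stayed behind the popping crease. "
    "The popping crease was repainted."
)

original = phrase_frequency("white popping crease", corpus)
adjusted = phrase_frequency("popping crease", corpus)
print(original, adjusted)  # 1 3
```

Here, dropping the premodifier white yields a candidate that occurs three times instead of once, which is the kind of evidence that supports resetting the term boundary.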

Here are some examples, taken from our TermoStat extraction of the cricket text, of how adjusting the boundaries of a multi-word term can result in a better term for the termbase.

Adjusting term boundaries

Note innings above. Normally when preparing terms for the termbase, we reduce terms to their canonical (base) form, which is usually singular in the case of nouns. However, the context sentences clearly show that innings is used with the “s” in both singular and plural form.

It is also recommended to include families of terms that form a system. For example, we found the following context sentence in the cricket term extraction:

“The pitch is marked at each end with four white painted lines:
a bowling crease, a popping crease and two return creases”

These three types of creases are important to include in the termbase as well as the term crease itself.

Similarly, the following sentence contains four important terms.

“Bowlers are classified according to their style,
generally as fast bowlers, seam bowlers or spinners.”

The above example shows why it is important to look at the context sentence, because otherwise, assuming that fast is an unnecessary premodifier in fast bowler, we might inadvertently delete it, when in reality fast bowler denotes a specific type of bowler. Similarly, in the extracted term limited overs cricket, we might assume that limited is an unnecessary premodifier, but the contexts clearly show that this trigram is a stable term.

Consider also the word leg, which occurred 13 times. An examination of the context sentences shows that it occurs frequently in the phrase leg before wicket. Furthermore, that term also occurs as an acronym: LBW. When a term candidate is also expressed as an acronym, it is a valuable term and should be retained (along with the acronym).
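A simple way to check whether a multi-word candidate also appears as an acronym is to build its initialism and search the corpus for it. The sketch below assumes that the acronym consists of the first letter of each word, which is true for LBW but not for every acronym. The function names and the sample sentence are invented for illustration.

```python
import re


def initialism(term):
    """Build an upper-case initialism from the first letter of each word."""
    return "".join(word[0] for word in term.split()).upper()


def acronym_in_text(term, text):
    """Return the initialism if it occurs as a whole word in the text, else None."""
    acronym = initialism(term)
    if re.search(r"\b" + re.escape(acronym) + r"\b", text):
        return acronym
    return None


# Invented sample sentence standing in for a context sentence.
context = "The batter was given out leg before wicket (LBW)."

print(acronym_in_text("leg before wicket", context))  # LBW
print(acronym_in_text("sticky wicket", context))      # None
```

A match is only a hint: short initialisms can coincide with unrelated abbreviations, so confirm each one against the context sentences.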

To clean the output, you need some way to delete and change items. The method depends on what options the term extraction tool offers for editing the output. Usually there is a way to export the output to a file that you can subsequently work with, such as a plain text file or a spreadsheet. We will show how to work with text and spreadsheet files in the next topic.

Aside from the noise, there is another problem with term extraction tools: silence. Silence refers to the important terms that the ATE tool did not extract. Dealing with the noise is more straightforward because you can see it. But you can’t see the silence. The only way to find these missing terms is to perform some deep investigation into the corpus using concordancing software such as Sketch Engine or WordSmith Tools. The procedure is beyond the scope of this lesson; however, you can get some ideas from the last two items listed in the readings below.

Further Reading