Working with Terms. Course 3 – Cleaning the output

After you run a term extraction tool, you are presented with a list of term candidates. We call them term candidates, and not terms, because we still have to decide whether they are useful to us, that is, whether they are terms for the purposes of our work. Some of these candidates will not be useful or interesting for a termbase. Those unwanted candidates should be removed from the list before you import the terms into the termbase. Because they are not wanted, they are referred to as noise.

The noise can be addressed by “cleaning” the raw output of a term extraction tool.

The opposite problem is appropriately referred to as silence: really interesting terms that, for one reason or another, were not picked up by the term extraction tool. Silence is more difficult to address because it involves terms that we can’t “see” in the tool’s output. One way to address some of the silence is a more complex research process: you search your corpus for salient unigrams (i.e. keywords) using concordancing software such as WordSmith, and identify multi-word terms that have those keywords as their headword. This process is outside the scope of the current course. If you are interested in learning more about it, send an enquiry to Kara.

While silence is more difficult to address, it is also a more serious problem than noise: having unnecessary terms in your termbase has less of an impact on its use and benefits than missing important terms does.

In this lesson, we will combine the raw output of soccer terms from TermoStat and from Sketch Engine, and then remove the noise. We will finish by importing the file into our Sports termbase.
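
The combine-and-clean step can be sketched in a few lines of Python. This is only a minimal illustration, not the procedure used in the lesson: it assumes the candidates from each tool have already been loaded as plain lists of strings, the function names are hypothetical, and the sample candidates are made up rather than real TermoStat or Sketch Engine output.

```python
def normalize(candidate: str) -> str:
    """Lowercase and collapse whitespace so duplicates can be matched."""
    return " ".join(candidate.lower().split())


def combine_candidates(*candidate_lists):
    """Merge several candidate lists, keeping the first occurrence of each."""
    seen = set()
    merged = []
    for candidates in candidate_lists:
        for candidate in candidates:
            key = normalize(candidate)
            if key and key not in seen:
                seen.add(key)
                merged.append(key)
    return merged


def remove_noise(candidates, noise):
    """Drop candidates that a reviewer has marked as noise."""
    noise_keys = {normalize(item) for item in noise}
    return [c for c in candidates if normalize(c) not in noise_keys]


# Made-up sample data (assumption): not real output from either tool.
termostat_candidates = ["penalty kick", "Offside", "last week", "goalkeeper"]
sketch_engine_candidates = ["offside", "corner kick", "penalty  kick", "good game"]

# Candidates a human reviewer decided are not useful for the termbase.
noise = ["last week", "good game"]

combined = combine_candidates(termostat_candidates, sketch_engine_candidates)
cleaned = remove_noise(combined, noise)
print(cleaned)
```

Deciding what counts as noise remains a human judgment; the code only applies that decision consistently. Note that lowercasing every candidate is a simplification and would need adjusting for terms where capitalization matters.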
