
2.2 Annotation Rules


Annotations


For a very long time, only manually annotated corpora were deemed acceptable: the use of rules was frowned upon because of their sulphurous origins in the ancient world of symbolic approaches. Yet anyone who has had the rare pleasure of annotating corpora, a task that advantageously replaces flogging as penance, knows how frustrating it is to annotate, again and again, the same pattern that a simple rule could have identified almost every time. In the early days, admittedly, mixing genres was still acceptable: the trees of the Penn Treebank, for example, were first produced with a "shallow parser" and then hand-corrected by armies of students.

If you ponder it for a moment, claiming that manual human annotation is superior to a few rules is like saying that the best way to describe ℕ is to list all the elements that belong to it. Similarly, if every "and" in a text must be annotated as a coordination, the task quickly becomes repetitive and tedious. The solution we propose is not intended to reignite the long-forgotten battle between rule-based systems and machine learning. What we do claim, however, is that rules make it possible to automate the annotation of documents, particularly of recurring items, and thus to lighten the annotator's workload so that he or she can focus on the items that are hardest to analyze.

Tamgu

Tamgu offers a particular rule formalism, close to regular expressions, that blends discreetly into your programs. These rules are combined with general or user lexicons to detect the presence and position of recurring expressions.

In addition, these rules can include capsules, i.e. calls to external modules such as word embeddings or classifiers.

Lexicons

Tamgu can handle general lexicons combined with user lexicons. In particular, these lexicons are used to tokenize the text into words or multi-word expressions.

A lexical rule is composed of a label associated with a word or a regular expression. A lexical rule always begins with an "@".

@food <- meat.
@food <- "candied chestnuts".
@food <- "Milanese escalope".
@food <- "fish(es)".
@food <- meal.

In the above example, the rule "fish(es)" should be understood as a rule that matches both "fish" and "fishes".
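The same parenthesis notation should also work inside quoted multi-word expressions. The entry below is a hypothetical sketch, assuming the notation applies within quotes as it does for single words:

@food <- "candied chestnut(s)".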

All these words and regular expressions are then compiled on the fly into a transducer object. The same type is also used to compile general lexicons of the language.

Rules

The rules are written directly in the code or in a character string, according to the user's needs. A rule is composed of a label associated with a complex regular expression in which the elements are separated by commas.

foody <- {the,some}, #food. 

Let's note a few things right away:

  • The braces introduce a disjunction between elements.
  • Labels defined in the lexicon are preceded by a "#".
  • Words are written as-is or in the form of a regular expression (see the sketch after this list).
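These notations can be combined freely within a single rule. The sketch below is hypothetical: the label "snacky" and the entry "portion(s)" are invented for illustration, and the rule assumes the "#food" entries defined earlier.

//A disjunction in braces, a regular expression, a plain word,
//and a lexical label, all in one rule.
snacky <- {a, the, some}, "portion(s)", of, #food.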

annotator

The rules are written directly into the code. All you need is an annotator object to access them:

annotator r;

ustring u="The lady eats candied chestnuts after a meal with some fish and some meat." 

vector v = r.parse(u);

To apply our rules, we use the parse method.

This method first uses the lexicon to split the text into words, according to the user lexicon. Thus, "candied chestnuts" is recognized as a single element, while the remaining words are split on spaces and punctuation.
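As a quick check, here is a hedged sketch reusing the "foody" rule above; the expected output is an assumption based on the span format shown further below:

ustring w = "the candied chestnuts";
annotator check;
println(check.parse(w));
//Expected: [['foody',[0,3],[4,21]]]
//"the" yields one span, and the multi-word expression a single second span.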

The program

ustring u="The lady eats candied chestnuts after a meal with some fish and some meat.";

@food <- meat.
@food <- "candied chestnuts".
@food <- "Milanese escalope".
@food <- "fish(es)".
@food <- meal.

foody <- {the, some}, #food.

annotator r;


vector v = r.parse(u);

//Displaying the vector content
println(v);

//Then each of the detected sections.

for (self e in v)
    //e[1][0] is the start of the first word, e[-1][-1] the end of the last
    println(u[e[1][0]: e[-1][-1]]);

Which, after execution, gives us the following (each annotation is a vector containing the rule label followed by the [start, end] character offsets of each word the rule consumed):

[['foody',[50,54],[55,59]],['foody',[64,68],[69,73]]]

some fish
some meat

Language Lexicon

An English lexicon can be added to this example in order to make the rule more general. In this case, "#" is also used to refer to a trait or category detected by the lexicon.

//We replace our disjunction with a check on the category of the first word

foody <- #Det, #food.

//We add a lexicon
transducer lex(_current+"english.tra");

annotator r;

//We load it into our annotator
r.lexicon(lex);

vector v = r.parse(u);

The execution will give us a slightly different result than before:

[['foody',[15,18],[19,33]],['foody',[40,42],[43,48]],['foody',[54,56],[57,64]],['foody',[71,73],[74,80]]]

candied chestnuts
a meal //meal occurs here now because "a" is recognised as a determiner...
fish
the meat

Application of the rules

Of course, we can integrate as many rules as we want. When a rule applies, the cursor moves past the last word consumed by that rule, and all the rules are applied again from this new position. If a rule fails, on the other hand, we move on to the next rule, still from the current position in the sentence. If all the rules fail, the cursor moves one word forward and the whole grammar is applied again.
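To illustrate this traversal, here is a hedged sketch: the rule "solo" is invented for the example and reuses the "#food" entries and the "foody" rule defined above.

//A one-element rule that can fire where "foody" cannot.
solo <- #food.

ustring t = "The lady eats fish and some meat.";
annotator a;
println(a.parse(t));
//At "fish", "foody" fails (no "the"/"some" at the current position),
//so "solo" is tried from the same position and matches "fish".
//At "some", "foody" matches "some meat" and the cursor moves past
//"meat"; "solo" is therefore never tried on "meat".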

Grammars

It is also possible to compile the grammar from a character string, which makes it possible, in particular, to have several annotators.

//Our rule is no longer declared in the code
string rule = "foody <- {#Det, some}, #food.";

transducer lex(_current+"english.tra");

annotator r;
//We load the lexicon, then compile our rule
r.lexicon(lex);
r.compile(rule);

string other = "drink <- #Det, #drinking.";

annotator rr;

rr.compile(other);

//We apply our grammar on "u"
r.parse(u);

//Then we take this result always in "r" and we apply "rr" on it
rr.apply(r);

In the example above, the grammar is no longer defined in the code but as a character string, which makes it possible to have several annotators. These annotators can even be chained. Indeed, the result of applying "r" to "u" is kept in "r". We can therefore apply the second grammar to this structure and refer, for example, to the labels produced by the first grammar. Note that in a multi-threaded context, there can be several parallel executions of "r", each with access to its own environment.

Conclusion

The implementation of such a grammar can be done gradually: each new recurring pattern can be translated into a rule as soon as the need arises.

We have also used this rule mechanism to inject noise into the documents used to train machine translation systems (https://www.aclweb.org/anthology/D19-5617.pdf). With these rules, we isolated the words we wanted to replace with a noisy version in the documents. Thanks to these rules, detection and replacement take only two lines of code.
