Java Interface for Phrase Highlighting
Beagle phrase highlighting exposes options to control:
- case sensitivity,
- ASCII folding,
- stemming support for various languages,
- phrase slop,
- defining synonymous phrases,
- tokenization,
- assigning metadata,
- combining all the options.
The examples below use the Beagle library to process text snippets about Lyndon B. Johnson, one of the most famous beagle owners.
Beagle is deployed on Maven Central. Add an entry to your favourite dependency manager's configuration, e.g. pom.xml:
<dependency>
<groupId>lt.tokenmill</groupId>
<artifactId>beagle</artifactId>
<version>0.3.1</version>
</dependency>
Text: "Lyndon Baines Johnson (/ˈlɪndən ˈbeɪnz/; August 27, 1908 – January 22, 1973), often referred to as LBJ, was an American politician who served as the 36th president of the United States from 1963 to 1969."
Phrase: "Lyndon Baines Johnson"
import lt.tokenmill.beagle.phrases.Annotation;
import lt.tokenmill.beagle.phrases.Annotator;
import lt.tokenmill.beagle.phrases.DictionaryEntry;
import java.util.Arrays;
import java.util.Collection;
public class Main {
public static void main(String[] args) {
DictionaryEntry dictionaryEntry = new DictionaryEntry("Lyndon Baines Johnson");
Annotator annotator = new Annotator(Arrays.asList(dictionaryEntry));
Collection<Annotation> annotations = annotator.annotate("Lyndon Baines Johnson (/ˈlɪndən ˈbeɪnz/; August 27, 1908 – January 22, 1973), often referred to as LBJ, was an American politician who served as the 36th president of the United States from 1963 to 1969.");
annotations.forEach(s -> System.out.println("Annotated: \'" + s.text() + "\' at offset: " + s.beginOffset() + ":" + s.endOffset()));
}
}
// => Annotated: 'Lyndon Baines Johnson' at offset: 0:21
Other examples will not include class definitions and imports for conciseness.
Say your list of phrases comes from some system where everything is stored in upper-case.
Text: "Lyndon Baines Johnson (/ˈlɪndən ˈbeɪnz/; August 27, 1908 – January 22, 1973), often referred to as LBJ, was an American politician who served as the 36th president of the United States from 1963 to 1969."
Phrase: "LYNDON BAINES JOHNSON"
DictionaryEntry dictionaryEntry = new DictionaryEntry("LYNDON BAINES JOHNSON");
dictionaryEntry.setCaseSensitive(false);
Annotator annotator = new Annotator(Arrays.asList(dictionaryEntry));
Collection<Annotation> annotations = annotator.annotate("Lyndon Baines Johnson (/ˈlɪndən ˈbeɪnz/; August 27, 1908 – January 22, 1973), often referred to as LBJ, was an American politician who served as the 36th president of the United States from 1963 to 1969.");
annotations.forEach(s -> System.out.println("Annotated: \'" + s.text() + "\' at offset: " + s.beginOffset() + ":" + s.endOffset()));
// => Annotated: 'Lyndon Baines Johnson' at offset: 0:21
Some words have variant spellings, e.g. "naïve" and "naive", which creates problems for phrase matching. ASCII folding solves some of these cases.
Text: "Lyndon Baines Johnson was a naïve kid from Stonewall, Texas."
Phrase: "Lyndon Baines Johnson was a naive"
DictionaryEntry dictionaryEntry = new DictionaryEntry("Lyndon Baines Johnson was a naive");
dictionaryEntry.setAsciiFold(true);
Annotator annotator = new Annotator(Arrays.asList(dictionaryEntry));
Collection<Annotation> annotations = annotator.annotate("Lyndon Baines Johnson was a naïve kid from Stonewall, Texas.");
annotations.forEach(s -> System.out.println("Annotated: \'" + s.text() + "\' at offset: " + s.beginOffset() + ":" + s.endOffset()));
// => Annotated: 'Lyndon Baines Johnson was a naïve' at offset: 0:33
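To build intuition for what ASCII folding does, here is a minimal standalone sketch that does not use Beagle at all. It approximates folding with `java.text.Normalizer` by decomposing accented characters and dropping the combining marks. The class and method names are made up for illustration, and a full ASCII-folding filter handles more characters than this approach does.

```java
import java.text.Normalizer;

class AsciiFoldSketch {
    // Decompose accented characters (NFD) and drop the combining marks,
    // so an i with diaeresis becomes a plain "i".
    // This only approximates ASCII folding; it is not Beagle's implementation.
    static String fold(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // The i with diaeresis is written as a Unicode escape to avoid
        // depending on the source file encoding.
        String text = "Lyndon Baines Johnson was a na\u00efve kid from Stonewall, Texas.";
        String phrase = "Lyndon Baines Johnson was a naive";
        // After folding both sides, the phrase is a prefix of the text.
        System.out.println(fold(text).startsWith(fold(phrase))); // prints: true
    }
}
```

Folding both the dictionary phrase and the text before comparing them is what lets the unaccented phrase match the accented text.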
Other word forms are language dependent, e.g. singular, plural, and possessive forms. Stemming addresses this.
Text: "Johnson's presidency marked the peak of modern liberalism after the New Deal era."
Phrase: "Johnson presidency"
DictionaryEntry dictionaryEntry = new DictionaryEntry("Johnson presidency");
dictionaryEntry.setStem(true);
dictionaryEntry.setStemmer("english");
Annotator annotator = new Annotator(Arrays.asList(dictionaryEntry));
Collection<Annotation> annotations = annotator.annotate("Johnson's presidency marked the peak of modern liberalism after the New Deal era.");
annotations.forEach(s -> System.out.println("Annotated: \'" + s.text() + "\' at offset: " + s.beginOffset() + ":" + s.endOffset()));
// => Annotated: 'Johnson's presidency' at offset: 0:20
When a phrase may appear with extra words between its terms, e.g. "FIRST_NAME MIDDLE_NAME LAST_NAME" versus "FIRST_NAME LAST_NAME", we can set the phrase slop parameter.
Text: "Lyndon Baines Johnson (/ˈlɪndən ˈbeɪnz/; August 27, 1908 – January 22, 1973), often referred to as LBJ, was an American politician who served as the 36th president of the United States from 1963 to 1969."
Phrase: "Lyndon Johnson"
DictionaryEntry dictionaryEntry = new DictionaryEntry("Lyndon Johnson");
dictionaryEntry.setSlop(1);
Annotator annotator = new Annotator(Arrays.asList(dictionaryEntry));
Collection<Annotation> annotations = annotator.annotate("Lyndon Baines Johnson (/ˈlɪndən ˈbeɪnz/; August 27, 1908 – January 22, 1973), often referred to as LBJ, was an American politician who served as the 36th president of the United States from 1963 to 1969.");
annotations.forEach(s -> System.out.println("Annotated: \'" + s.text() + "\' at offset: " + s.beginOffset() + ":" + s.endOffset()));
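The idea behind slop can be sketched without Beagle. The following standalone example treats slop as the total number of extra tokens allowed between the phrase terms, which is a simplification: phrase-slop implementations in search libraries may also tolerate reordered terms. All names here are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

class SlopSketch {
    // Returns the index of the first token at which the phrase matches, or -1.
    static int find(List<String> text, List<String> phrase, int slop) {
        for (int start = 0; start < text.size(); start++) {
            if (matchesAt(text, phrase, start, slop)) {
                return start;
            }
        }
        return -1;
    }

    // The first phrase token must sit exactly at `start`; later tokens may be
    // preceded by extra tokens, as long as the total skipped stays within `slop`.
    static boolean matchesAt(List<String> text, List<String> phrase, int start, int slop) {
        int pos = start;
        int skipped = 0;
        for (int i = 0; i < phrase.size(); i++) {
            String wanted = phrase.get(i);
            while (pos < text.size() && !text.get(pos).equals(wanted)) {
                if (i == 0 || skipped == slop) {
                    return false;
                }
                skipped++;
                pos++;
            }
            if (pos >= text.size()) {
                return false;
            }
            pos++;
        }
        return true;
    }

    public static void main(String[] args) {
        List<String> text = Arrays.asList("Lyndon", "Baines", "Johnson");
        List<String> phrase = Arrays.asList("Lyndon", "Johnson");
        System.out.println(find(text, phrase, 1)); // prints: 0
        System.out.println(find(text, phrase, 0)); // prints: -1
    }
}
```

With a slop of 1 the middle name is skipped and the phrase matches; with a slop of 0 only adjacent tokens match.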
When an entity is known by more than one name, e.g. "Lyndon Baines Johnson" and "LBJ", we need synonymous phrase matching.
Text: "Lyndon Baines Johnson (/ˈlɪndən ˈbeɪnz/; August 27, 1908 – January 22, 1973), often referred to as LBJ, was an American politician who served as the 36th president of the United States from 1963 to 1969."
Phrase: "Lyndon Baines Johnson" with a synonym "LBJ"
DictionaryEntry dictionaryEntry = new DictionaryEntry("Lyndon Baines Johnson");
dictionaryEntry.setId("lyndon-baines-johnson");
dictionaryEntry.setSynonyms(Arrays.asList("LBJ"));
Annotator annotator = new Annotator(Arrays.asList(dictionaryEntry));
Collection<Annotation> annotations = annotator.annotate("Lyndon Baines Johnson (/ˈlɪndən ˈbeɪnz/; August 27, 1908 – January 22, 1973), often referred to as LBJ, was an American politician who served as the 36th president of the United States from 1963 to 1969.");
annotations.forEach(s -> System.out.println("Annotated: dictionaryEntryId=" + s.dictionaryEntryId() + ": \'" + s.text() + "\' at offset: " + s.beginOffset() + ":" + s.endOffset()));
// => Annotated: dictionaryEntryId=lyndon-baines-johnson: 'Lyndon Baines Johnson' at offset: 0:21
// => Annotated: dictionaryEntryId=lyndon-baines-johnson: 'LBJ' at offset: 99:102
Text: "\"War on Poverty\" and healthcare reform"
Phrases:
A) "\"War on Poverty\"" with a standard tokenizer
B) "\"War on Poverty\"" with a whitespace tokenizer
DictionaryEntry dictionaryEntryA = new DictionaryEntry("\"War on Poverty\"");
dictionaryEntryA.setId("A");
dictionaryEntryA.setTokenizer("standard");
DictionaryEntry dictionaryEntryB = new DictionaryEntry("\"War on Poverty\"");
dictionaryEntryB.setId("B");
dictionaryEntryB.setTokenizer("whitespace");
Annotator annotator = new Annotator(Arrays.asList(dictionaryEntryA, dictionaryEntryB));
Collection<Annotation> annotations = annotator.annotate("\"War on Poverty\" and healthcare reform");
annotations.forEach(s -> System.out.println("Annotated: \' dictionaryEntryId=\"" + s.dictionaryEntryId() + "\" " + s.text() + "\' at offset: " + s.beginOffset() + ":" + s.endOffset()));
//=> Annotated: ' dictionaryEntryId="A" War on Poverty' at offset: 1:15
//=> Annotated: ' dictionaryEntryId="B" "War on Poverty"' at offset: 0:16
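The difference in offsets above comes from how each tokenizer treats punctuation. The standalone sketch below illustrates this with two hypothetical helpers: one splits on whitespace only, the other keeps only runs of letters and digits, which roughly approximates what a standard word tokenizer does. It is not the tokenizer code Beagle uses.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TokenizerSketch {
    private static final Pattern WORD = Pattern.compile("[\\p{L}\\p{N}]+");

    // Splits on whitespace only, so punctuation stays attached to the tokens.
    static List<String> whitespaceTokens(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    // Keeps runs of letters and digits, dropping punctuation such as quotes.
    // A rough stand-in for a standard word tokenizer.
    static List<String> wordTokens(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = WORD.matcher(text);
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "\"War on Poverty\" and healthcare reform";
        System.out.println(whitespaceTokens(text));
        System.out.println(wordTokens(text));
    }
}
```

Because the whitespace variant keeps the quotes inside the first and third tokens, a match on those tokens spans the quotes (0:16), while the word-based variant yields a match that starts after the opening quote (1:15).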
When we need to store some data with the query, we can use metadata (java.util.Map<String, String>).
Text: "Lyndon Baines Johnson (/ˈlɪndən ˈbeɪnz/; August 27, 1908 – January 22, 1973), often referred to as LBJ, was an American politician who served as the 36th president of the United States from 1963 to 1969."
Phrase: "Lyndon Baines Johnson" with a metadata map {"email": "demo@example.com"}
DictionaryEntry dictionaryEntry = new DictionaryEntry("Lyndon Baines Johnson");
HashMap<String, String> meta = new HashMap<>();
meta.put("email", "demo@example.com");
dictionaryEntry.setMeta(meta);
Annotator annotator = new Annotator(Arrays.asList(dictionaryEntry));
Collection<Annotation> annotations = annotator.annotate("Lyndon Baines Johnson (/ˈlɪndən ˈbeɪnz/; August 27, 1908 – January 22, 1973), often referred to as LBJ, was an American politician who served as the 36th president of the United States from 1963 to 1969.");
annotations.forEach(s -> System.out.println("Annotated: \'" + s.text() + "\' at offset: " + s.beginOffset() + ":" + s.endOffset() + " with meta: " + s.meta()));
When your dictionary contains entries that are sub-phrases of other dictionary entries, the resulting highlights overlap. Setting the "merge-annotations?" parameter to true in the annotation options makes the highlighter merge overlapping annotations.
Text: "Lyndon Baines Johnson (/ˈlɪndən ˈbeɪnz/; August 27, 1908 – January 22, 1973), often referred to as LBJ, was an American politician who served as the 36th president of the United States from 1963 to 1969."
Phrases: "Baines" and "Lyndon Baines Johnson"
DictionaryEntry dictionaryEntry1 = new DictionaryEntry("Baines");
DictionaryEntry dictionaryEntry2 = new DictionaryEntry("Lyndon Baines Johnson");
Annotator annotator = new Annotator(Arrays.asList(dictionaryEntry1, dictionaryEntry2));
HashMap<String, Object> annotationOptions = new HashMap<>();
annotationOptions.put("merge-annotations?", false);
Collection<Annotation> annotations = annotator.annotate("Lyndon Baines Johnson (/ˈlɪndən ˈbeɪnz/; August 27, 1908 – January 22, 1973), often referred to as LBJ, was an American politician who served as the 36th president of the United States from 1963 to 1969.",
annotationOptions);
annotations.forEach(s -> System.out.println("Annotated: \'" + s.text() + "\' at offset: " + s.beginOffset() + ":" + s.endOffset() + " with meta: " + s.meta()));
//=> Annotated: 'Baines' at offset: 7:13 with meta: {}
//=> Annotated: 'Lyndon Baines Johnson' at offset: 0:21 with meta: {}
annotationOptions.put("merge-annotations?", true);
annotations = annotator.annotate("Lyndon Baines Johnson (/ˈlɪndən ˈbeɪnz/; August 27, 1908 – January 22, 1973), often referred to as LBJ, was an American politician who served as the 36th president of the United States from 1963 to 1969.",
annotationOptions);
annotations.forEach(s -> System.out.println("Annotated: \'" + s.text() + "\' at offset: " + s.beginOffset() + ":" + s.endOffset() + " with meta: " + s.meta()));
//=> Annotated: 'Lyndon Baines Johnson' at offset: 0:21 with meta: {}
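Conceptually, merging overlapping annotations is an interval-merging problem. The sketch below shows one plausible way to do it on plain offsets, assuming that overlapping spans are replaced by their union; whether Beagle computes the union or keeps the longest span is not stated here, and for the example above both give the same result. The `Span` class is hypothetical and stands in for an annotation's begin and end offsets.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class MergeSketch {
    // Hypothetical stand-in for an annotation: begin inclusive, end exclusive.
    static final class Span {
        final int begin;
        final int end;

        Span(int begin, int end) {
            this.begin = begin;
            this.end = end;
        }

        @Override
        public String toString() {
            return begin + ":" + end;
        }
    }

    // Sorts spans by begin offset and folds every overlapping span
    // into the previously kept one.
    static List<Span> merge(List<Span> spans) {
        List<Span> sorted = new ArrayList<>(spans);
        sorted.sort(Comparator.comparingInt((Span s) -> s.begin));
        List<Span> merged = new ArrayList<>();
        for (Span span : sorted) {
            int lastIndex = merged.size() - 1;
            if (lastIndex >= 0 && span.begin < merged.get(lastIndex).end) {
                Span last = merged.remove(lastIndex);
                merged.add(new Span(last.begin, Math.max(last.end, span.end)));
            } else {
                merged.add(span);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        // "Baines" at 7:13 lies inside "Lyndon Baines Johnson" at 0:21.
        List<Span> spans = Arrays.asList(new Span(7, 13), new Span(0, 21));
        System.out.println(merge(spans)); // prints: [0:21]
    }
}
```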
Options for dictionary entries are independent: every dictionary entry can have its own set of options enabled. E.g. two dictionary entries can use stemmers for different languages. Likewise, some phrases need exact matching while others need much more relaxed matching using phrase slop and stemming. All of these requirements can be satisfied by maintaining one dictionary with a few per-entry flags and parameters.
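The independence of per-entry options can be illustrated with a small standalone model. In this sketch, a hypothetical `Entry` class carries its own case-sensitivity flag, and two entries with different settings are matched against the same text. It only mirrors the idea and does not use or reproduce Beagle's classes.

```java
import java.util.Arrays;
import java.util.List;

class PerEntryOptionsSketch {
    // Hypothetical dictionary entry that carries its own matching option.
    static final class Entry {
        final String phrase;
        final boolean caseSensitive;

        Entry(String phrase, boolean caseSensitive) {
            this.phrase = phrase;
            this.caseSensitive = caseSensitive;
        }

        // Returns the begin offset of the first match in the text, or -1.
        int findIn(String text) {
            for (int i = 0; i + phrase.length() <= text.length(); i++) {
                if (text.regionMatches(!caseSensitive, i, phrase, 0, phrase.length())) {
                    return i;
                }
            }
            return -1;
        }
    }

    public static void main(String[] args) {
        String text = "Lyndon Baines Johnson, often referred to as LBJ";
        List<Entry> dictionary = Arrays.asList(
                new Entry("LYNDON BAINES JOHNSON", false), // relaxed: ignores case
                new Entry("lbj", true));                   // strict: exact case only
        for (Entry entry : dictionary) {
            System.out.println(entry.phrase + " -> " + entry.findIn(text));
        }
    }
}
```

The first entry matches at offset 0 because it ignores case, while the second finds nothing because it requires an exact-case match. Both live in the same dictionary.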