Language-specific tokenisers seem hard to use properly #7

Open

jryans opened this issue Nov 5, 2019 · 6 comments

Comments

@jryans
Contributor

jryans commented Nov 5, 2019

It looks like seshat supports activating different tokenisers depending on the language set at index creation time:

seshat/src/index.rs, lines 162 to 176 in 71e17fa:

match language {
    Language::Unknown => (),
    Language::Japanese => {
        index
            .tokenizers()
            .register(&tokenizer_name, TinySegmenterTokenizer::new());
    }
    _ => {
        let tokenizer = tv::tokenizer::SimpleTokenizer
            .filter(tv::tokenizer::RemoveLongFilter::limit(40))
            .filter(tv::tokenizer::LowerCaser)
            .filter(tv::tokenizer::Stemmer::new(language.as_tantivy()));
        index.tokenizers().register(&tokenizer_name, tokenizer);
    }
}
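
For context, a registered tokenizer only takes effect on fields whose indexing options reference it by name, which is why the whole index ends up bound to a single language. A minimal sketch of that wiring in tantivy, with illustrative field and tokenizer names rather than seshat's actual schema:

use tantivy::schema::{IndexRecordOption, Schema, TextFieldIndexing, TextOptions};

// Illustrative only: "message_tokenizer" must match the name passed to
// index.tokenizers().register(...) above.
let indexing = TextFieldIndexing::default()
    .set_tokenizer("message_tokenizer")
    .set_index_option(IndexRecordOption::WithFreqsAndPositions);
let options = TextOptions::default()
    .set_indexing_options(indexing)
    .set_stored();

let mut schema_builder = Schema::builder();
// Every message body is analysed by that single tokenizer.
schema_builder.add_text_field("body", options);
let schema = schema_builder.build();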

My concern with this is that you can easily be receiving events in several languages, so no single language setting can cover all of your events.

Maybe instead we should apply tokenisers per event and try to detect the language of each event somehow? I am not quite sure what the right model is, but the current one seems hard to use with its assumption of a single language.
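
Not a concrete proposal, but to illustrate the per-event idea: a language-detection crate such as whatlang could pick a tokenizer name per message and fall back to the default one when detection is inconclusive. The crate choice, tokenizer names, and threshold below are assumptions, not anything seshat does today.

use whatlang::{detect, Lang};

// Sketch: choose a registered tokenizer per message based on the detected
// language, falling back to the default when detection fails or is weak.
fn tokenizer_for(body: &str) -> &'static str {
    match detect(body) {
        Some(info) if info.confidence() > 0.5 => match info.lang() {
            Lang::Jpn => "japanese_tokenizer",
            Lang::Nld => "dutch_tokenizer",
            _ => "default_tokenizer",
        },
        _ => "default_tokenizer",
    }
}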

@poljar
Collaborator

poljar commented Nov 6, 2019

We seem to have a couple of options here:

  1. Tokenize all messages with both the default tokenizer and a language-specific one.
  2. Allow users to specify multiple languages and then tokenize messages using all of the configured tokenizers.
  3. Configure our schema to have a field for every language and use something to guess the language. This would probably require us to guess the language at query time as well.

We could also do a combination of 2 and 3, where 3 would be some "auto" language setting. That way, if language detection fails, we fall back to the configured language(s). The same would apply at search time, or we may want to have a language setting for the search query as well.
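
To make 2 and 3 a bit more concrete, the schema side of such a combination could look roughly like this: one text field per configured language, each bound to its own registered tokenizer. The names below are made up for illustration; at query time the parser would then be given all of the body_* fields so a single search covers every configured language.

use tantivy::schema::{IndexRecordOption, Schema, TextFieldIndexing, TextOptions};

// Assumed: the user configured these languages and a tokenizer has been
// registered for each one under the matching name.
let configured = ["en", "de", "nl"];

let mut schema_builder = Schema::builder();
for lang in &configured {
    let indexing = TextFieldIndexing::default()
        .set_tokenizer(&format!("tokenizer_{}", lang))
        .set_index_option(IndexRecordOption::WithFreqsAndPositions);
    let options = TextOptions::default().set_indexing_options(indexing);
    // e.g. "body_en", "body_de", "body_nl": a message is written to the
    // field of its detected language, or to all of them if detection fails.
    schema_builder.add_text_field(&format!("body_{}", lang), options);
}
let schema = schema_builder.build();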

poljar mentioned this issue Nov 6, 2019
@poljar
Collaborator

poljar commented Nov 15, 2019

I thought about this a bit more, and I think we shouldn't introduce NLP magic into Seshat.

If clients wish to use NLP they are free to do so. Letting users define multiple languages sounds sensible.

@bwindels

Just my 5 cents, but I think the language usually differs per room; it's generally considered bad netiquette to speak languages other than the one established by the room creator(s). So one possibility would be to allow setting the language of the room with a state event, which seshat could then pick up to use the correct tokenizer.
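
If the room language did come from a state event, the mapping on seshat's side could be as small as translating the event's language code into the language to tokenize with. The event type and content shown below are purely hypothetical (nothing like this is specified in Matrix today), and the Language variants are assumed to exist in seshat's enum.

// Hypothetical state event, e.g.
//   { "type": "org.example.room.language", "content": { "language": "nl" } }
// Map its ISO 639-1 code onto seshat's Language enum (variant names assumed).
fn language_from_room_state(code: &str) -> Language {
    match code {
        "en" => Language::English,
        "ja" => Language::Japanese,
        "nl" => Language::Dutch,
        _ => Language::Unknown,
    }
}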

@mvgorcum

mvgorcum commented Feb 7, 2020

Having multiple languages in a single room is probably not that uncommon, but having multiple languages in a single room that has end-to-end encryption enabled (i.e. is not a public room) sounds like it should be less common.

The best-effort approach, it seems to me, is to try to detect the language and use the tokenizer for that language.

I ran into this issue today because Dutch combines words into a single word without a space, and partial matches don't work with seshat, so it took longer for me to find what I was looking for.
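
On the partial-match side, tantivy also ships an n-gram tokenizer; something along these lines could let a query match inside Dutch compounds, at the cost of a noticeably larger index. This is only a sketch in the style of the snippet quoted above, against the tantivy API of that era; newer releases construct tokenizers differently.

// Sketch: index 3- to 8-character ngrams of the text so a query such as
// "boot" can match inside a compound like "zeilboot".
let tokenizer = tv::tokenizer::NgramTokenizer::new(3, 8, false)
    .filter(tv::tokenizer::LowerCaser);
index.tokenizers().register("ngram_tokenizer", tokenizer);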

@jryans
Contributor Author

jryans commented Feb 1, 2021

We have requests for Chinese as well. (Let me know if each language of interest should be filed separately; for now I'm treating this as a general "make languages better" issue.)

@poljar
Collaborator

poljar commented Feb 1, 2021

At least a tokenizer for Chinese does exist: https://crates.io/crates/cang-jie.

This falls into the same category as Japanese, which requires a separate tokenizer as well.
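
For reference, registering it would mirror the Japanese branch quoted above. This sketch follows the cang-jie README; exact type and field names may differ between versions.

use cang_jie::{CangJieTokenizer, TokenizerOption};
use jieba_rs::Jieba;
use std::sync::Arc;

// Roughly what a Language::Chinese branch could register, analogous to the
// TinySegmenterTokenizer branch for Japanese.
let tokenizer = CangJieTokenizer {
    worker: Arc::new(Jieba::new()),
    option: TokenizerOption::Unicode,
};
index.tokenizers().register(&tokenizer_name, tokenizer);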
