Language-specific tokenisers seem hard to use properly #7

Open

jryans opened this issue Nov 5, 2019 · 6 comments

Comments

@jryans
Contributor

jryans commented Nov 5, 2019

It looks like seshat supports activating different tokenisers depending on the language set at index creation time:

seshat/src/index.rs, lines 162 to 176 in 71e17fa:

match language {
    Language::Unknown => (),
    Language::Japanese => {
        index
            .tokenizers()
            .register(&tokenizer_name, TinySegmenterTokenizer::new());
    }
    _ => {
        let tokenizer = tv::tokenizer::SimpleTokenizer
            .filter(tv::tokenizer::RemoveLongFilter::limit(40))
            .filter(tv::tokenizer::LowerCaser)
            .filter(tv::tokenizer::Stemmer::new(language.as_tantivy()));
        index.tokenizers().register(&tokenizer_name, tokenizer);
    }
}
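
For context, a registered tokenizer only takes effect on fields whose indexing options reference it by name, which is why the whole index ends up bound to a single language. A minimal sketch of that wiring in tantivy, with illustrative field and tokenizer names rather than seshat's actual schema:

use tantivy::schema::{IndexRecordOption, Schema, TextFieldIndexing, TextOptions};

// Illustrative only: "message_tokenizer" must match the name passed to
// index.tokenizers().register(...) above.
let indexing = TextFieldIndexing::default()
    .set_tokenizer("message_tokenizer")
    .set_index_option(IndexRecordOption::WithFreqsAndPositions);
let options = TextOptions::default()
    .set_indexing_options(indexing)
    .set_stored();

let mut schema_builder = Schema::builder();
// Every message body is analysed by that single tokenizer.
schema_builder.add_text_field("body", options);
let schema = schema_builder.build();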

My concern with this is that you can easily be receiving events in several languages, so no single language setting can cover all of your events.

Maybe instead we should apply tokenisers per event and try to detect the language of each event somehow? I am not quite sure what the right model is, but the current one seems hard to use with its assumption of a single language.
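
Not a concrete proposal, but to illustrate the per-event idea: a language-detection crate such as whatlang could pick a tokenizer name per message and fall back to the default one when detection is inconclusive. The crate choice, tokenizer names, and threshold below are assumptions, not anything seshat does today.

use whatlang::{detect, Lang};

// Sketch: choose a registered tokenizer per message based on the detected
// language, falling back to the default when detection fails or is weak.
fn tokenizer_for(body: &str) -> &'static str {
    match detect(body) {
        Some(info) if info.confidence() > 0.5 => match info.lang() {
            Lang::Jpn => "japanese_tokenizer",
            Lang::Nld => "dutch_tokenizer",
            _ => "default_tokenizer",
        },
        _ => "default_tokenizer",
    }
}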

@poljar
Collaborator

poljar commented Nov 6, 2019

We seem to have a couple of options here:

  1. Tokenize all messages with both the default tokenizer and a language-specific one.
  2. Allow users to specify multiple languages and then tokenize messages using all of the configured tokenizers.
  3. Configure our schema to have a field for every language and use something to guess the language. This would probably require us to guess the language at query time as well.

We could also do a combination of 2 and 3, where 3 would be some "auto" language setting. That way, if language detection fails, we fall back to the configured language(s). The same would apply at search time, or we may want to have a language setting for the search query as well.
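
To make 2 and 3 a bit more concrete, the schema side of such a combination could look roughly like this: one text field per configured language, each bound to its own registered tokenizer. The names below are made up for illustration; at query time the parser would then be given all of the body_* fields so a single search covers every configured language.

use tantivy::schema::{IndexRecordOption, Schema, TextFieldIndexing, TextOptions};

// Assumed: the user configured these languages and a tokenizer has been
// registered for each one under the matching name.
let configured = ["en", "de", "nl"];

let mut schema_builder = Schema::builder();
for lang in &configured {
    let indexing = TextFieldIndexing::default()
        .set_tokenizer(&format!("tokenizer_{}", lang))
        .set_index_option(IndexRecordOption::WithFreqsAndPositions);
    let options = TextOptions::default().set_indexing_options(indexing);
    // e.g. "body_en", "body_de", "body_nl": a message is written to the
    // field of its detected language, or to all of them if detection fails.
    schema_builder.add_text_field(&format!("body_{}", lang), options);
}
let schema = schema_builder.build();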

poljar mentioned this issue Nov 6, 2019
@poljar
Collaborator

poljar commented Nov 15, 2019

I thought about this a bit more, and I think we shouldn't introduce NLP magic into Seshat.

If clients wish to use NLP they are free to do so. Letting users define multiple languages sounds sensible.

@bwindels

Just my 5 cents, but I think the language usually differs per room; it's generally considered bad netiquette to speak languages other than the one established by the room creator(s). So one possibility would be to allow setting the language of the room with a state event, which seshat could then pick up to use the correct tokenizer.
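
If the room language did come from a state event, the mapping on seshat's side could be as small as translating the event's language code into the language to tokenize with. The event type and content shown below are purely hypothetical (nothing like this is specified in Matrix today), and the Language variants are assumed to exist in seshat's enum.

// Hypothetical state event, e.g.
//   { "type": "org.example.room.language", "content": { "language": "nl" } }
// Map its ISO 639-1 code onto seshat's Language enum (variant names assumed).
fn language_from_room_state(code: &str) -> Language {
    match code {
        "en" => Language::English,
        "ja" => Language::Japanese,
        "nl" => Language::Dutch,
        _ => Language::Unknown,
    }
}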

@mvgorcum

mvgorcum commented Feb 7, 2020

Having multiple languages in a single room is probably not that uncommon, but having multiple languages in a single room that has end-to-end encryption enabled (i.e. is not a public room) sounds like it should be less common.

The best-effort approach, it seems to me, is to try to detect the language and use the tokenizer for that language.

I ran into this issue today because Dutch combines words into a single word without a space, and partial matches don't work with seshat, so it took longer for me to find what I was looking for.
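
On the partial-match side, tantivy also ships an n-gram tokenizer; something along these lines could let a query match inside Dutch compounds, at the cost of a noticeably larger index. This is only a sketch in the style of the snippet quoted above, against the tantivy API of that era; newer releases construct tokenizers differently.

// Sketch: index 3- to 8-character ngrams of the text so a query such as
// "boot" can match inside a compound like "zeilboot".
let tokenizer = tv::tokenizer::NgramTokenizer::new(3, 8, false)
    .filter(tv::tokenizer::LowerCaser);
index.tokenizers().register("ngram_tokenizer", tokenizer);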

@jryans
Contributor Author

jryans commented Feb 1, 2021

We have requests for Chinese as well. (Let me know if each language of interest should be filed separately; for now I'm treating this as a general "make languages better" issue.)

@poljar
Collaborator

poljar commented Feb 1, 2021

At least a tokenizer for Chinese does exist: https://crates.io/crates/cang-jie.

This falls into the same category as Japanese, which requires a separate tokenizer as well.
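
For reference, registering it would mirror the Japanese branch quoted above. This sketch follows the cang-jie README; exact type and field names may differ between versions.

use cang_jie::{CangJieTokenizer, TokenizerOption};
use jieba_rs::Jieba;
use std::sync::Arc;

// Roughly what a Language::Chinese branch could register, analogous to the
// TinySegmenterTokenizer branch for Japanese.
let tokenizer = CangJieTokenizer {
    worker: Arc::new(Jieba::new()),
    option: TokenizerOption::Unicode,
};
index.tokenizers().register(&tokenizer_name, tokenizer);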
