Language-specific tokenisers seem hard to use properly #7
We seem to have a couple of options here:
We could also do a combination of 2 and 3, where 3 would be an "auto" language setting. That way, if language detection fails, we fall back to the configured language(s). The same would apply at search time, or we may want a language setting for the search query as well.
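The combination described above could be sketched as a config enum with an "auto" variant that falls back to the configured languages. The `IndexLanguage` type and `resolve` helper below are hypothetical, not Seshat's actual API:

```rust
// Hypothetical sketch: a language setting with an "auto" variant that
// falls back to a configured list when per-event detection fails.
// None of these names exist in Seshat; they only illustrate the idea.

#[derive(Debug, Clone, PartialEq)]
enum IndexLanguage {
    /// Always tokenize with this fixed language.
    Fixed(String),
    /// Try per-event detection; fall back to the listed languages.
    Auto { fallback: Vec<String> },
}

/// Pick the language(s) to tokenize with, given an optional detection result.
fn resolve(setting: &IndexLanguage, detected: Option<&str>) -> Vec<String> {
    match setting {
        IndexLanguage::Fixed(lang) => vec![lang.clone()],
        IndexLanguage::Auto { fallback } => match detected {
            Some(lang) => vec![lang.to_string()],
            None => fallback.clone(),
        },
    }
}

fn main() {
    let setting = IndexLanguage::Auto {
        fallback: vec!["en".to_string(), "nl".to_string()],
    };
    // Detection succeeded: use the detected language.
    println!("{:?}", resolve(&setting, Some("de")));
    // Detection failed: fall back to the configured languages.
    println!("{:?}", resolve(&setting, None));
}
```

The same `resolve` step could run at query time, with the search request's language setting taking the place of `detected`.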
I thought about this a bit more and I think that we shouldn't introduce NLP magic into Seshat. If clients wish to use NLP they are free to do so. Letting users define multiple languages sounds sensible.
Just my 5 cents, but I think the language usually differs per room; it's generally considered bad netiquette to speak languages other than the one determined by the room creator(s). So one possibility would be to allow setting the language of the room with a state event, which Seshat could then pick up to use the correct tokenizer.
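No such state event exists in the Matrix spec today, but the idea could look roughly like the sketch below, where a hypothetical `org.example.room.language` event carries a language tag that Seshat reads when building a room's index. Every name here is invented for illustration:

```rust
// Hypothetical sketch of picking a tokenizer from a room-level language
// state event. The event type and struct are invented for illustration;
// nothing like this exists in the Matrix spec or in Seshat today.

struct StateEvent {
    event_type: String, // e.g. "org.example.room.language" (hypothetical)
    language: String,   // a language tag, e.g. "nl" or "ja"
}

/// Map a room's declared language to a tokenizer name, falling back
/// to a generic tokenizer when the room doesn't declare anything.
fn tokenizer_for_room(state: &[StateEvent]) -> &'static str {
    let declared = state
        .iter()
        .find(|e| e.event_type == "org.example.room.language")
        .map(|e| e.language.as_str());
    match declared {
        Some("ja") => "japanese",
        Some("zh") => "chinese",
        Some("nl") => "dutch",
        _ => "default",
    }
}

fn main() {
    let state = vec![StateEvent {
        event_type: "org.example.room.language".to_string(),
        language: "ja".to_string(),
    }];
    println!("{}", tokenizer_for_room(&state)); // "japanese"
}
```

One catch with this approach: changing the state event after the room has been indexed would require re-tokenizing the room's existing history.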
Having multiple languages in a single room is probably not that uncommon, but having multiple languages in a single room that has end-to-end encryption enabled (i.e. is not a public room) should be less common. The best effort, it seems to me, is to try to detect the language and use the tokenizer for that language. I ran into this issue today: Dutch combines words into a single word without spaces, and partial matches don't work with Seshat, so it took me longer to find what I was looking for.
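The Dutch problem is easy to reproduce: a whitespace-style tokenizer keeps a compound like `ziektekostenverzekering` ("health insurance") as one token, so a query for `verzekering` never matches. A minimal illustration in plain Rust (the tokenizer and matcher here are simplified stand-ins, not Seshat's actual code):

```rust
// Minimal stand-in for a whitespace tokenizer, to show why partial
// matches fail on Dutch compound words. Not Seshat's actual tokenizer.

fn whitespace_tokens(text: &str) -> Vec<&str> {
    text.split_whitespace().collect()
}

/// Exact-token match, as a simple inverted index would do.
fn matches(text: &str, query: &str) -> bool {
    whitespace_tokens(text).iter().any(|t| *t == query)
}

fn main() {
    let msg = "mijn ziektekostenverzekering is duur";
    // The compound is indexed as a single token, so the embedded word
    // "verzekering" is not found by an exact-token lookup...
    println!("{}", matches(msg, "verzekering")); // false
    // ...even though a raw substring search would find it.
    println!("{}", msg.contains("verzekering")); // true
}
```

A compound-splitting or decompounding filter for Dutch (and German, which has the same property) would make the embedded word searchable.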
We also have requests for Chinese as well. (Let me know if each language of interest should be filed separately; for now I'm treating this as a general "make languages better" issue.)
At least a tokenizer for Chinese does exist: https://crates.io/crates/cang-jie. This falls into the same category as Japanese, which requires a separate tokenizer as well.
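cang-jie is a dictionary-based segmenter for tantivy (built on jieba-rs, as I understand it). A simpler, dictionary-free fallback for CJK text is overlapping character bigrams, the approach used by Lucene's CJKAnalyzer. Sketched here in plain Rust to show the idea; this is not cang-jie's API:

```rust
// Dictionary-free fallback for CJK text: index overlapping character
// bigrams instead of dictionary words. This mirrors the approach of
// Lucene's CJKAnalyzer; it is NOT how the (dictionary-based) cang-jie
// crate works, and the function name is invented for illustration.

fn cjk_bigrams(text: &str) -> Vec<String> {
    let chars: Vec<char> = text.chars().collect();
    chars
        .windows(2)
        .map(|pair| pair.iter().collect::<String>())
        .collect()
}

fn main() {
    // A four-character phrase yields three overlapping bigrams, so any
    // two-character query can still match inside a longer run of text.
    for token in cjk_bigrams("中文分词") {
        println!("{}", token);
    }
}
```

Bigrams over-generate tokens compared to real segmentation, but they need no dictionary and degrade gracefully on mixed or unknown vocabulary, which is why they are a common default for CJK search.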
It looks like Seshat supports activating different tokenisers depending on the language set at index creation time:
seshat/src/index.rs
Lines 162 to 176 in 71e17fa
My concern with this is that you can easily be receiving events in several languages, so there's no way to specify a single language for all your events.
Maybe instead we should apply tokenisers per event and try to detect the language per event somehow? I'm not quite sure what the right model is, but the current one seems hard to use with its assumption of a single language.
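Per-event selection could be sketched as below. A real implementation would use a proper language-detection crate (whatlang, for example); here the detector is a crude Unicode-range heuristic, just to show the shape. All names are hypothetical, not Seshat code:

```rust
// Sketch of per-event tokenizer selection. The "detector" is a crude
// Unicode-range heuristic standing in for a real language detector
// (e.g. the whatlang crate). All names are hypothetical, not Seshat's.

fn detect_script(text: &str) -> &'static str {
    for c in text.chars() {
        match c {
            // CJK Unified Ideographs
            '\u{4E00}'..='\u{9FFF}' => return "cjk",
            // Hiragana and Katakana
            '\u{3040}'..='\u{30FF}' => return "kana",
            // Cyrillic
            '\u{0400}'..='\u{04FF}' => return "cyrillic",
            _ => {}
        }
    }
    "latin"
}

/// Choose a tokenizer name per event instead of once per index.
fn tokenizer_for_event(body: &str) -> &'static str {
    match detect_script(body) {
        "cjk" => "cjk-bigram",
        "kana" => "japanese",
        "cyrillic" => "russian-stemmer",
        _ => "default",
    }
}

fn main() {
    println!("{}", tokenizer_for_event("hello world")); // "default"
    println!("{}", tokenizer_for_event("こんにちは")); // "japanese"
}
```

The open question this raises is the one from the comment above: whichever tokenizer is chosen at index time would also have to be chosen consistently at query time, or the query terms won't line up with the indexed tokens.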