
Improve Language Processing Logic #110

Open
wants to merge 12 commits into main

Conversation

@Cedric-Boucher commented May 26, 2024

Goal of this pull request

I used chatanalytics to process an entire DM history I have with one person. I know that we have had multiple entire conversations in French, but we mostly use English. I was surprised to see that chatanalytics did not detect any French at all! The goal of this pull request was to improve the language processing logic so that French would show up correctly in the language list, while not allowing other (incorrect) languages to show up.

Changes made

  • Minimum word count for language detection raised from 0 to 1
  • One typo fixed in a user-facing string

I have also added some code in the MessageProcessor to allow changing the minimum language detection model accuracy/confidence threshold, as well as a necessary minimum word count threshold for language detection.

Reasoning for changes made

Before the changes in this PR, the language detection was run on all message groups, even ones that had no words in them (for example only emojis or a URL). The changes in this PR require that the tokenization recognizes at least one word in the message group before running language detection, otherwise the language of the message group is marked as "unreliable to detect".
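
To make that gate concrete, here is a minimal sketch of the flow; all identifiers (`MIN_WORDS_FOR_DETECTION`, `MIN_CONFIDENCE`, `detectGroupLanguage`) are hypothetical, and the actual MessageProcessor code is organized differently:

```ts
// Minimal sketch of the gating logic described above. All identifiers are
// hypothetical; the actual MessageProcessor is organized differently.
const MIN_WORDS_FOR_DETECTION = 1; // raised from 0 in this PR
const MIN_CONFIDENCE = 0.7;        // model confidence threshold (assumed value)

interface DetectionResult {
    language: string;   // e.g. "fr"
    confidence: number; // 0..1, as reported by the fastText model
}

function detectGroupLanguage(
    wordTokens: string[],
    runModel: (text: string) => DetectionResult
): string {
    // Groups with no recognized words (emoji-only, URL-only, ...) never
    // reach the model; it would otherwise guess a language from noise.
    if (wordTokens.length < MIN_WORDS_FOR_DETECTION) return "unreliable";

    const { language, confidence } = runModel(wordTokens.join(" "));
    return confidence >= MIN_CONFIDENCE ? language : "unreliable";
}
```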

@Cedric-Boucher (Author)

This pull request addresses issue #97

@hopperelec (Contributor) commented May 26, 2024

This is a very thoughtfully put together PR!

@Cedric-Boucher: I encourage moderators to do additional testing with their own multi-language datasets to verify my changes or make minor adjustments to the values I have set.

Sadly, I don't have any multi-language datasets, and of course the obfuscator can't maintain the language. Does anyone know of any large, public channels which could be used? Most large servers are either just a single language or have dedicated channels for each language. I could just use multiple different channels, but I think channels where multiple languages are used simultaneously would be more likely to cause issues.

@Cedric-Boucher (Author)

I found this server which seems to have different channels for specific languages (once you give yourself the roles to see the channels).

Additionally, it has an "all languages" channel which has a whole mix of languages in it!

https://disboard.org/server/585488260621402133

@hopperelec (Contributor)

The word count and proportion seem perfect, but I think the minimum accuracy should be lowered a tiny bit. It is quite sensitive: all the guesses between 0.67 and 0.7 are completely accurate, but most of the ones between 0.6 and 0.65 are a mix of two languages (where neither is the most prominent). Of course, going right to the edge of the "completely accurate" section is risky. From my testing (based on ~14000 messages), 0.68 is a good balance between riskiness and how many more messages it is able to reliably detect. Despite being less than a 3% change to the value, it took the number of "Unreliable to detect" messages from 4400 to 4000, which is about a 10% difference. Of course, this was very eyeballed, but I'll try and do some more testing tomorrow.

@hopperelec (Contributor) commented May 28, 2024

I've now downloaded the entire channel, so I'm gonna do some more thorough testing shortly. The original file was 361MB, but I've removed some unimportant details such as roles (down to 74.5MB) and then zipped it (5.6MB). It's probably not a good idea for me to upload it publicly, but if anyone else wants a copy of it to do testing, I can send it over Discord or something.

@hopperelec (Contributor) commented May 28, 2024

I haven't finished figuring out the "true accuracy", but here's all the data I've collected so far. I must have made a mistake last time, because I'm no longer noticing as much of an impact on "Unreliable to detect", but I'd personally still reduce it a bit. Most of the incorrect guesses were from it mixing up Portuguese and Galician, which I don't know anything about, but they are apparently very similar languages, so it is understandable.

Sorry that the table is so tall, I don't know why it has made the second column so thin when there's already a horizontal scrollbar...

| accuracy_threshold | True accuracy of detected languages (rated out of 10) | Unreliable to detect | English | Spanish | Portuguese | French | German | Italian | Korean | Russian | Galician |
|---|---|---|---|---|---|---|---|---|---|---|---|
| N/A (pre-PR) | 0 (not because it's never accurate; this is just a minimum control) | 13986 | 62042 | 9734 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0.7 | 10 (I obviously didn't look through all messages with an accuracy higher than 0.7, but all the messages I did look at were accurate. This is a maximum control) | 45978 | 36675 | 5147 | 1098 | 620 | 503 | 149 | 148 | 122 | 0 |
| 0.69 | 10 (2/85 guesses with an accuracy of 0.69...0.7 were wrong, and only because those messages mixed 3 different languages together) | 45815 | 36778 | 5189 | 1112 | 624 | 503 | 149 | 148 | 122 | 0 |
| 0.68 | 9 (7/72 guesses with an accuracy of 0.68...0.69 were wrong. Two of those guesses did have a bit of the guessed language, but it wasn't the majority of the message) | 45701 | 36853 | 5214 | 1119 | 628 | 505 | 149 | 149 | 122 | 0 |
| 0.67 | 8 (16/75 guesses with an accuracy of 0.67...0.68 were wrong. Five of those guesses did have a bit of the guessed language. Five were for messages which mixed 3+ different languages together, and four of those were duplicate messages saying "Hello" in 10 different languages) | 45602 | 36927 | 5234 | 1120 | 628 | 508 | 149 | 149 | 123 | 0 |
| 0.66 | 8 (5/63 guesses with an accuracy of 0.66...0.67 were wrong. Three of those guesses did have a bit of the guessed language) | 45503 | 36996 | 5256 | 1123 | 631 | 509 | 150 | 149 | 123 | 0 |
| 0.65 | 8 (6/65 guesses with an accuracy of 0.65...0.66 were wrong) | 45395 | 37080 | 5273 | 1129 | 631 | 510 | 150 | 149 | 123 | 0 |
| 0.64 | | 45169 | 37174 | 5296 | 1136 | 633 | 515 | 151 | 151 | 123 | 92 |
| 0.63 | | 45047 | 37239 | 5329 | 1141 | 633 | 520 | 159 | 155 | 124 | 93 |
| 0.62 | | 44911 | 37332 | 5357 | 1146 | 635 | 523 | 159 | 155 | 124 | 98 |
| 0.61 | | 44814 | 37383 | 5386 | 1146 | 642 | 526 | 161 | 155 | 124 | 103 |
| 0.6 | | 44704 | 37455 | 5412 | 1148 | 643 | 530 | 163 | 157 | 125 | 103 |

@Cedric-Boucher (Author)

I'll try to do some of my own testing soon, and will post my findings as well.

@mlomb (Owner) commented May 29, 2024

I'm really on-board with this PR! Well put together!

@hopperelec: I'm no longer noticing as much of an impact

I've been doing some testing too, and I came to the same conclusion: accuracy_threshold does not affect the results that much. Using the demo export with 0.68 and a minimum of 5 words (also 4) instead of 6 makes French appear in the list, which is about 0.05% of all messages. There are very small language-specific channels; Russian, the biggest of them (0.13%), only appears when a lot of noise comes through too. It depends a lot on the language.
I think the accuracy param is OK; we should be trying to fine-tune the minimum words param, which has a way bigger effect. It should become more sensitive the lower we set the minimum words, right?

Also, fastText is a bit outdated now and almost 90% of the processing time is spent in this library; eventually we should change it to something better 😬 (not now)


Btw, I approved the workflows, but tests are failing as expected, since they have hardcoded languages to detect regressions. We can fix that before merging.

@hopperelec (Contributor)

Hm, the reason I thought 6 was a good minimum was mostly because of duplicate words (there are a lot of short messages which repeat the same words in the dataset), but it might be good to actually count only unique words, as in the sketch below.
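
For illustration, a minimal sketch of what counting only unique words could look like; this is a hypothetical helper, not code from the PR:

```ts
// Hypothetical helper: count each distinct word once, so a short message
// repeating one word ("go go go go go go") doesn't clear the minimum.
function uniqueWordCount(wordTokens: string[]): number {
    return new Set(wordTokens.map((w) => w.toLowerCase())).size;
}

uniqueWordCount(["go", "go", "go", "go", "go", "go"]); // 1, not 6
```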

@Cedric-Boucher (Author)

I feel like it would be pretty cool to give users sliders on the language page to change the minimum words and confidence threshold and see the stats change in realtime. Just an idea I had but that is probably somewhat difficult to do.

Anyway, I am working on gathering data to try to determine the best values for the parameters. So far it looks like we could reduce the minimum word count to somewhere in the 1-3 range if we increase the minimum language proportion slightly, to ~0.5%. Some languages differentiate from others very well, and the model doesn't seem to confuse them for anything else, whereas some languages are sometimes confused for others. For example, if the minimum word count is 3 or lower, the English channel starts to see German as one of the languages (which I saw in my own testing of a known English-only DM). Another example is Italian starting to show up for Hindi-Urdu and Indonesian (but again, in a very low total proportion). I haven't played around with the model accuracy threshold yet; I've only had it at 0.7 for my testing so far.

@Cedric-Boucher (Author)

chatanalytics testing.xlsx

Here's what I've got so far

@Cedric-Boucher (Author) commented Jun 1, 2024

It looks like the minimum word count has a large effect on the detection of Mandarin Chinese and Japanese. With the minimum word count at 1, those channels have just over 30% of their respective language detected, but with the minimum word count even at just 2, that drops about 10% for both of them, and at minimum 3, it drops about another 7%. Minimum word count has a much smaller effect on most other languages. The English channel has 61% English with a minimum of 6 words and 78% English with a minimum of 1 word. That's still a big difference, but the drop per word added is much smaller. (Edit: to be fair, this issue is mostly because the way I did word detection was counting the number of spaces, but there are no spaces between words in Chinese and Japanese.)

Still, it looks like it might be best to keep the minimum word count at 1, and just adjust the language model confidence threshold and minimum language proportion.

Now that I've done testing with minimum word count with the confidence threshold at 0.7, I will test the confidence threshold while keeping the minimum word count at 1.

By the way, for all of my testing I am keeping the minimum language proportion at 0.1%, because this allows me to see more languages for testing purposes. Increasing it doesn't change the numbers, it only cuts off the low proportions, so I don't need to test changing this value, and a good value can be determined based on the data I collect.

@Cedric-Boucher (Author)

chatanalytics testing.xlsx

Here's the data with all my minimum word count testing.

@Cedric-Boucher (Author)

I noticed that the way I was counting words for the minimum word count (by splitting the text by spaces and counting the number of items in that array) was incorrect for some languages, as well as for space-separated message content that isn't words. I have now fixed this by using the tokenizations of the message groups to correctly count the number of word tokens in each message group. This means I should redo my minimum word count testing, since the earlier numbers are incorrect.
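
A sketch of the difference, assuming a simplified token shape (the project's real tokenizer output may differ):

```ts
// Space-splitting undercounts scripts that don't separate words with
// spaces, such as Chinese and Japanese.
const naiveWordCount = (text: string): number =>
    text.split(" ").filter((w) => w.length > 0).length;

naiveWordCount("今日は良い天気ですね"); // 1, despite containing several words

// Token-based counting; the Token shape here is an assumption.
type Token = { tag: "word" | "url" | "emoji" | "mention"; text: string };

const tokenWordCount = (tokens: Token[]): number =>
    tokens.filter((t) => t.tag === "word").length;
```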

@Cedric-Boucher (Author)

Looks like with my corrected word counting, the detection of languages has increased slightly and "unreliable to detect" has decreased, so that's good. With the minimum word count going from 1 to 2, I'm still seeing a significant (9-12%) reduction in Mandarin Chinese and Japanese in their channels. So I'll still keep the minimum word count at 1 for my round of confidence threshold testing.

@Cedric-Boucher (Author)

chatanalytics testing.xlsx

Here's the data from the minimum word count testing with the corrected word count (only minimums 1, 2 and 6).

@Cedric-Boucher (Author)

By the way, it looks like when the language model guesses the language of a message group that contains no words (maybe URLs, emojis, etc., but no words), it guesses English 100% of the time, with 0.12 confidence. I realized this when comparing minimum word count = 0 to minimum word count = 1. Since the code before the changes in this PR did not check word count, I believe that means this was technically a bug? This PR will fix that bug so long as the minimum word count is >= 1.

@Cedric-Boucher (Author)

When playing around with 0 and 1 minimum word count, I noticed a bug in the code I had originally committed, having to do with the "Unreliable to Detect" counter. It was adding the languages that were getting cut off by the proportion threshold to the wrong language, rather than to "Unreliable to Detect"; I had misinterpreted how the indices worked for the list. This has now been fixed.
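
For clarity, a sketch of the corrected bucketing; placing the "Unreliable to Detect" bucket at a fixed index 0 is an assumption for illustration, and the actual list layout in the code may differ:

```ts
// Sketch of the corrected bucketing: languages falling below the
// proportion cutoff are folded into the "Unreliable to Detect" bucket
// instead of some unrelated language's slot. Index 0 as the unreliable
// bucket is an assumption for illustration only.
const UNRELIABLE_INDEX = 0;

function applyProportionCutoff(counts: number[], minProportion: number): number[] {
    const total = counts.reduce((a, b) => a + b, 0);
    const result = [...counts];
    for (let i = 0; i < result.length; i++) {
        if (i === UNRELIABLE_INDEX) continue;
        if (result[i] / total < minProportion) {
            result[UNRELIABLE_INDEX] += result[i]; // fold into unreliable
            result[i] = 0;
        }
    }
    return result;
}
```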

@Cedric-Boucher (Author)

Oh yeah, since message groups that have no words no longer pass through the language model (and get detected as English), they now go into the "Unreliable to Detect" category. Should we create a new category specifically for "no words in message" or something? Just a thought, though this could be done at a later time.

@Cedric-Boucher (Author)

chatanalytics testing.xlsx

Here's the data for language model confidence threshold testing, with all fixes described above.

@Cedric-Boucher (Author)

My conclusion from that testing (language total proportion minimum cutoff on the left, language model confidence minimum threshold on the right):

  • 0.4% cutoff would be good for 0.8
  • 0.5% cutoff would be good for 0.7
  • 0.7% cutoff would be good for 0.6
  • 1.4% cutoff would be good for 0.5
  • 2.5% cutoff would be good for 0.4
  • 3.0% cutoff would be good for 0.0

@hopperelec (Contributor)

All this testing looks awesome!

@Cedric-Boucher: Should we create a new category specifically for "no words in message" or something?

I definitely think they should be distinct. We could consider just not including them at all (at least in the language section), but that could be confusing.

@Cedric-Boucher (Author)

[image]
To visualize the numbers in my last comment.

@Cedric-Boucher (Author)

Apologies for the extremely long pause in my work on this PR.
I have updated the description of this PR up at the top to reflect my new goals. I have decided not to change the language detection model's minimum confidence level or the minimum language proportion. My only changes now, as described in the PR description, are raising the minimum word count in a message group from 0 to 1 and fixing a typo in a related user-facing string.

@Cedric-Boucher (Author)

I have run all tests using npm test and all tests have passed. Apparently changing the minimum word count didn't affect any of the tests, which slightly surprised me.

Anyway, I guess this PR is good to go now, if you'd like to take another look at it @hopperelec

@Cedric-Boucher Cedric-Boucher marked this pull request as draft January 14, 2025 21:25
@Cedric-Boucher Cedric-Boucher marked this pull request as ready for review January 14, 2025 21:25