Create an overview of cleaning taggers #207
Comments
@KennethEnevoldsen @peterbjorgensen These are taggers that might be relevant for cleaning:
|
Seems like we are missing the Gopher filters, PII, and C4. I will also add this table as a PR with a brief introduction on how to use the dolma tagger (it can just be a reference to their documentation). I would also like to check which existing taggers were ignored (e.g. we discussed stopwords). Will you also add the taggers implemented in our GitHub (see codebase)? |
Yes, we have these taggers implemented: Maybe we should also include the remaining Germanic languages (nl, de) |
This issue is stale because it has been open for 14 days with no activity. Feel free to either 1) remove the stale label or 2) comment. If nothing happens, this will be closed in 7 days. |
@TTTTao725 How did you measure the processing times of the taggers in the table. Did you create a script to do this? |
@peterbjorgensen No, they have a built-in timer, so you can check it right after one execution of a tagger. I'll make a PR in the next few days; I have tested more taggers, including ones for Scandinavian languages :) |
@TTTTao725 Cool, I don't see that anywhere? If I do |
I believe if you just run default you will get a time at the end. However this seems like a valid reason to use time:
|
And in case you need stats of more taggers:
|
Thanks @TTTTao725. If you have time for the PR in the next few days, it would be great to get it merged so that you are not sitting with multiple tasks. |
If I understand what you are saying correctly, you are running the tagger with only one tagger at a time and recording the time it took to run that tagger? I wrote a small bash loop to do this and here are some numbers I got:
I think I should remove all the taggers that took more than 10 seconds in my example (1000 documents only), maybe except for fasttext lang id. |
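For reference, the "one tagger at a time" measurement can also be sketched in Python rather than bash. This is only an illustration, not the loop used above: `time_commands` and `too_slow` are hypothetical helper names, and the command lists are placeholders that would have to be replaced with the actual tagger invocation. The 10-second cutoff mirrors the threshold mentioned in this comment.

```python
import subprocess
import sys
import time


def time_commands(commands):
    """Run each command once and return {name: wall-clock seconds}."""
    timings = {}
    for name, argv in commands.items():
        start = time.perf_counter()
        # check=True raises if the command fails, so a crash is not
        # silently recorded as a fast run.
        subprocess.run(
            argv,
            check=True,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        timings[name] = time.perf_counter() - start
    return timings


def too_slow(timings, limit_seconds=10.0):
    """Return the names whose run time exceeded the limit, sorted."""
    return sorted(name for name, secs in timings.items() if secs > limit_seconds)


if __name__ == "__main__":
    # Placeholder command: swap in one real tagger invocation per entry.
    demo = {"noop": [sys.executable, "-c", "pass"]}
    results = time_commands(demo)
    print(results)
    print("candidates to drop:", too_slow(results))
```

Note that wall-clock time includes process start-up, so for very fast taggers the numbers mostly reflect overhead.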
^yep exactly, we just did it to get an overview of which taggers were too slow to run in practice. |
A small update. |
Another update: |
This seems very odd. The function is quite simple. Can you identify the examples by just running the python implementation? |
I agree it's very odd. But it works when I exclude this tagger from the set. The command line tool is pure python when used for tagging. I can make a minimum working example to find the data examples it chokes on. |
The regex isn't really that simple, it looks like it could definitely involve a lot of backtracking and probably take forever on some edge case. I think that's likely what happens, can't you find the document it happens on? |
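To illustrate what backtracking can do, here is a small sketch using a textbook pathological pattern. It is not the tagger's regex (which is not reproduced in this thread); `(a+)+$` is just a well-known example where nested quantifiers make a failing match take exponentially longer as the input grows.

```python
import re
import time

# Textbook example of catastrophic backtracking, NOT the tagger's pattern:
# the nested quantifiers allow exponentially many ways to split the run of "a".
PATTERN = re.compile(r"(a+)+$")


def time_failed_match(n):
    """Match against n letters 'a' followed by 'b', which can never succeed."""
    text = "a" * n + "b"
    start = time.perf_counter()
    result = PATTERN.match(text)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    # Each extra character roughly doubles the work for the failing match.
    for n in (10, 14, 18):
        matched, seconds = time_failed_match(n)
        print(n, matched, f"{seconds:.4f}s")
```

Timing the real pattern on progressively longer slices of a suspect document in the same way should show whether it behaves like this.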
It looks like long sequences of emojis stall the tagger forever:
InputSpec(id='7', text='😠 😡', source='hplt1.2', version=None)
InputSpec(id='4', text='😠 😡 😤 😋 😎 🌦 🌧 🌜 🌈 🏝 🎅\n\nAnti-Spam: *\nSpørgsmål: Hvad er summen af (total sum of) 9+3\n\nWarning:', source='hplt1.2', version=None)
InputSpec(id='11', text='😠 😡 😤 😋 😎 😴 😈 😇 😕 😏 😑 👲 👮 💂 👶 ❤ 💔 💕 💘 💌 💋 🎁 💰 💍 👍 👎 👌 ✌️ 🤘 👏 🎵 ☕️ 🍵 Anti-Spam: *\nSpørgsmål: Hvad er summen af (total sum of) 9+3\n\nWarning:', source='hplt1.2', version=None)
InputSpec(id='5', text='😠 😡 😤 😋 😎 😴 😈 😇 😕 😏 😑 👲 👮 💂 👶 ❤ 💔 💕 💘 💌 💋 🎁 💰 💍 👍 👎 👌 ✌️ 🤘 👏 🎵 ☕️ 🍵 🍺 🍷 🍼 ☀️ 🌤 🌦 🌧 🌜 🌈 🏝 🎅\n\nAnti-Spam: *\nSpørgsmål: Hvad er summen af (total sum of) 9+3\n\nWarning:', source='hplt1.2', version=None)
InputSpec(id='3', text='\nGæstebogs indlæg: *\n😄 😃 😊 😉 😍 😚 😗 😜 😛 😳 😁 😬 😌 😞 😢 😂 😭 😅 😓 😩 😮 😱 😠 😡 😤 😋 😎 😴 😈 😇 😕 😏 😑 👲 👮 💂 👶 ❤ 💔 💕 💘 💌 💋 🎁 💰 💍 👍 👎 👌 ✌️ 🤘 👏 🎵 ☕️ 🍵 🍺 🍷 🍼 ☀️ 🌤 🌦 🌧 🌜 🌈 🏝 🎅\n\nAnti-Spam: *\nSpørgsmål: Hvad er summen af (total sum of) 9+3\n\nWarning:', source='hplt1.2', version=None) |
Hmm, that is odd. A solution might be to wrap it in a timeout (https://stackoverflow.com/questions/492519/timeout-on-a-function-call) and then simply have it return NaN? |
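A minimal sketch of that timeout idea, assuming a Unix system and that the tagger function is called from the main thread (the only place `signal` handlers can be installed). `call_with_timeout` and `TaggerTimeout` are hypothetical names for illustration, not part of dolma. A call stuck deep inside C code may not be interrupted promptly by this approach; running the tagger in a separate process and killing it would be the more robust variant.

```python
import math
import signal


class TaggerTimeout(Exception):
    """Raised by the alarm handler when the time budget is used up."""


def _raise_timeout(signum, frame):
    raise TaggerTimeout()


def call_with_timeout(func, text, seconds=1.0, default=math.nan):
    """Call func(text); return `default` (NaN) if it runs longer than `seconds`."""
    previous = signal.signal(signal.SIGALRM, _raise_timeout)
    signal.setitimer(signal.ITIMER_REAL, seconds)
    try:
        return func(text)
    except TaggerTimeout:
        return default
    finally:
        # Always cancel the timer and restore the previous handler.
        signal.setitimer(signal.ITIMER_REAL, 0)
        signal.signal(signal.SIGALRM, previous)
```

Usage would look like `score = call_with_timeout(some_tagger_function, doc_text, seconds=5.0)`, where `some_tagger_function` stands in for whatever callable computes the tagger's score.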
I think it would be better to either fix this tagger or add one that works. I don't understand why it is so slow. It basically checks if the string has any alphanumeric character in it and if not it checks if all the characters are punctuation. Seems like a bug in the |
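The check described here (no alphanumeric character, then all remaining characters are punctuation) can be written without a regex, so it runs in time linear in the text length. The sketch below is an assumption about the intended behavior, not the existing tagger's code: `only_punctuation` is a hypothetical name, and whether symbols such as emojis should count as punctuation is exposed as the hypothetical `include_symbols` flag because the thread does not settle it.

```python
import unicodedata


def only_punctuation(text, include_symbols=True):
    """True if text has no alphanumeric character and every non-space
    character is punctuation (Unicode category P*), optionally also
    symbols (category S*, which covers most emojis)."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars or any(ch.isalnum() for ch in chars):
        return False
    allowed = ("P", "S") if include_symbols else ("P",)
    # One category lookup per character: no backtracking is possible.
    return all(unicodedata.category(ch).startswith(allowed) for ch in chars)
```

Characters outside these categories, such as the variation selectors that follow some emojis, make the function return `False`, so a production version would need to decide how to treat them.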
Agreed with @peterbjorgensen that it would be a great idea to create an overview of which taggers might be relevant for cleaning.
Outlining