-
@w0rsti Thank you for the kind words!
What you most likely get is indeed clusters/topics about signatures or greetings. By themselves, they shouldn't be an issue since you can simply ignore those topics. However, if you do not split your data into sentences, there is a chance that a very short email will be assigned to a "greeting" topic, since the embedding is created from the entire email. Although preprocessing is generally not needed, I do not think it would hurt to remove the greetings from these documents. If all you are interested in is the content of these documents, and not who sent or received them, then it might be helpful. I wouldn't be worried about the topic representation so much as about whether documents end up in "greeting" topics.
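As a rough illustration of the two options mentioned above (removing greetings and splitting emails into sentences), here is a minimal sketch using only the Python standard library. The regular expressions and the function names `strip_greeting_and_signature` and `split_sentences` are assumptions made up for this example, not part of BERTopic; the patterns would need to be adapted to the greetings and sign-offs that actually occur in your tickets.

```python
import re

# Hypothetical patterns: a greeting on the first line, and a sign-off line
# followed by everything after it (name, company, phone number, ...).
GREETING_RE = re.compile(r"^\s*(?:hi|hello|hey|dear)\b[^\n]*\n", re.IGNORECASE)
SIGNOFF_RE = re.compile(
    r"\n[ \t]*(?:best regards|kind regards|regards|thanks|thank you|cheers)"
    r"[ \t]*[,.!]?[ \t]*(?:\n.*)?$",
    re.IGNORECASE | re.DOTALL,
)


def strip_greeting_and_signature(text: str) -> str:
    """Remove a leading greeting line and a trailing sign-off block, if present."""
    text = GREETING_RE.sub("", text, count=1)
    text = SIGNOFF_RE.sub("", text, count=1)
    return text.strip()


def split_sentences(text: str) -> list[str]:
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text)
    return [part.strip() for part in parts if part.strip()]


email = (
    "Hi support team,\n"
    "The export button crashes the app. It happens on every PDF export.\n"
    "Best regards,\n"
    "Jane Doe\n"
    "ACME Corp"
)

body = strip_greeting_and_signature(email)
sentences = split_sentences(body)
print(sentences)
```

The resulting sentences (or the cleaned bodies) could then be passed to the topic model as documents instead of the raw emails. A simple regex splitter like this will mis-split abbreviations, so a proper sentence tokenizer may be a better choice on real data.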
-
Hey @MaartenGr! After playing around with the library, I am more and more impressed by the capabilities of BERTopic! I've trained a base model that contains a lot of good topics, although some of them are still not quite satisfying:
To address these issues, I've come up with the following strategy:
The problem I have is how to combine the results into one model. I can think of two ways:
The problems/questions I have:
What is your recommended approach? Are there any pros and cons to either of the two ways?
-
Hey @MaartenGr,
first of all - thanks a lot for this framework. It is incredibly powerful and easy to use in my opinion, and the docs are insanely good.
I am currently writing my master's thesis, which uses BERTopic.
The goal is to find topics in customer support tickets and use them to identify pain points or indicators of where the software or documentation needs some work.
My current problem is that - by nature - these tickets are highly unstructured.
Some of them are just plain descriptions of problems that the customer is experiencing. These are often raised directly via a portal and are actually pretty clean. Problems arise when the customer uses a second method to create a ticket: sending an email to a special email address, which automatically "converts" the email into a ticket. Now signatures, greetings etc. might end up in the description. Also, some customers send their problem description to a different email address of our customer support team, who then answer these problems manually (via email) and send their response to the email address that automatically creates a ticket, for documentation purposes.
The main problems that result from this behavior:
I was able to "identify" and tackle the latter by simply removing those tickets. I identified them with several techniques, e.g. checking whether the title contains tags like "Re:", "Fwd:" ..., or checking the user who raised the issue: if it is a user who is part of our company, it is almost certainly a manually answered ticket that was just "archived".
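To make the filtering described above concrete, here is a minimal sketch of how it could look. The ticket layout (a dict with `title` and `reporter_email` keys), the domain `example.com`, and the function names are assumptions for illustration only; the real ticket system will have its own fields, and the prefix pattern only covers the "Re:" and "Fwd:"/"Fw:" tags.

```python
import re

# Hypothetical values: adapt the prefix list and the company domain to your data.
REPLY_PREFIX_RE = re.compile(r"^\s*(?:re|fwd?)\s*:", re.IGNORECASE)
INTERNAL_DOMAIN = "example.com"


def is_archived_reply(ticket: dict) -> bool:
    """True if the ticket looks like a manually answered email that was only archived."""
    title = ticket.get("title", "")
    reporter = ticket.get("reporter_email", "").lower()
    has_reply_tag = REPLY_PREFIX_RE.match(title) is not None
    raised_internally = reporter.endswith("@" + INTERNAL_DOMAIN)
    return has_reply_tag or raised_internally


def keep_customer_tickets(tickets: list[dict]) -> list[dict]:
    """Drop tickets that are replies, forwards, or raised by company staff."""
    return [ticket for ticket in tickets if not is_archived_reply(ticket)]


tickets = [
    {"title": "Export to PDF fails", "reporter_email": "jane@customer.org"},
    {"title": "Re: Export to PDF fails", "reporter_email": "jane@customer.org"},
    {"title": "FWD: Login problem", "reporter_email": "bob@customer.org"},
    {"title": "Login problem", "reporter_email": "agent@example.com"},
]

customer_tickets = keep_customer_tickets(tickets)
print([ticket["title"] for ticket in customer_tickets])
```

Only the remaining tickets would then be used as documents for the topic model.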
I am wondering whether I should preprocess the tickets in some way, even though it is not necessary according to the docs/FAQ, since the clustering might group documents based on content/patterns found in signatures or greetings. Or is this a wrong assumption, and I do not need to worry about that?
I know I can easily remove those words from the topic representation, but I am rather worried that they influence the clustering.
Thanks in advance!