Should I preprocess emails? #2227

w0rsti · 2024-11-27T14:55:09Z

w0rsti
Nov 27, 2024

First of all, thanks a lot for this framework. It is incredibly powerful and easy to use in my opinion, and the docs are insanely good.

I am currently writing my master's thesis, which uses BERTopic.
The goal is to find topics in customer support tickets and use them to identify pain points or indicators of where the software or documentation needs work.

My current problem is that, by nature, these tickets are highly unstructured.
Some of them are just plain descriptions of problems that the customer is experiencing. These are often raised directly via a portal and are actually pretty clean. Problems arise when the customer uses a second method to create a ticket: sending an email to a special email address, which automatically "converts" the email into a ticket. Now signatures, greetings etc. might end up in the description. Also, some customers send their problem description to a different email address of our customer support team, who then answer these problems separately and manually (via email) and send their response to the email address that automatically creates a ticket, for documentation purposes.

The main problems that result from this behavior:

- Some tickets contain signatures, greetings and other repetitive patterns when they are created via email
- Some tickets contain answers/email conversations (+ the problems above)

I was able to identify and tackle the latter by just removing those tickets. They can be identified by several techniques, e.g. checking if the title contains tags like "Re:", "Fwd:" ..., or checking the user who raised the issue: if it is a user who is part of our company, it is almost certainly a (manually) answered ticket that was just "archived".
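The filtering described above could look roughly like the sketch below. The ticket fields (`title`, `reporter`), the prefix list and the internal domain are hypothetical placeholders, not the actual ticket schema.

```python
import re

# Hypothetical ticket records; the field names are assumptions for illustration.
tickets = [
    {"title": "Cannot log in", "reporter": "alice@customer.example"},
    {"title": "Re: Cannot log in", "reporter": "bob@customer.example"},
    {"title": "FWD: Export broken", "reporter": "carol@customer.example"},
    {"title": "Export broken", "reporter": "support@ourcompany.example"},
    {"title": "Report export is slow", "reporter": "dave@customer.example"},
]

# Matches "Re:", "Fw:" and "Fwd:" at the start of a title, case-insensitively.
REPLY_PREFIX = re.compile(r"^\s*(re|fwd?)\s*:", re.IGNORECASE)
INTERNAL_DOMAIN = "@ourcompany.example"  # assumed company domain


def is_archived_conversation(ticket):
    """Return True if the ticket looks like a reply/forward or was raised internally."""
    has_reply_prefix = bool(REPLY_PREFIX.match(ticket["title"]))
    raised_internally = ticket["reporter"].lower().endswith(INTERNAL_DOMAIN)
    return has_reply_prefix or raised_internally


kept = [t for t in tickets if not is_archived_conversation(t)]
print([t["title"] for t in kept])
```

Note that the prefix must be followed by a colon, so a title such as "Report export is slow" is not mistaken for a reply.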

I am wondering if I should preprocess the tickets in some way, although it is not necessary (according to the docs/FAQ), since the clustering might group documents based on content/patterns found in signatures or greetings. Or is this a wrong assumption and I do not need to worry about that?

I know I can easily remove those words from the topic representation, but I am rather worried that they influence the clustering.

Thanks in advance!

MaartenGr · 2024-11-29T07:29:57Z

MaartenGr
Nov 29, 2024
Maintainer

@w0rsti Thank you for the kind words!

> I am wondering if I should preprocess the tickets in some way, although it is not necessary (according to the docs/FAQ), since the clustering might group documents based on content/patterns found in signatures or greetings. Or is this a wrong assumption and I do not need to worry about that?

What you most likely get is indeed clusters/topics about signatures or greetings. By themselves, they shouldn't be an issue since you can simply ignore those topics. However, if you do not split your data into sentences, then there is a chance that a very short email will be categorized into a "greeting" topic since the embedding for the mail is created based on the entire email.
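To illustrate the sentence-splitting idea, here is a minimal sketch that uses a naive regex splitter (an assumption; a real pipeline would likely use a proper sentence tokenizer). It keeps a `doc_ids` list so that sentence-level topics can be mapped back to the original emails.

```python
import re


def split_into_sentences(text):
    """Naively split on whitespace that follows '.', '!' or '?'."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


emails = [
    "Hello team. The login page shows a blank screen! Can you help?",
    "Export to CSV fails.",
]

sentences = []  # what would be passed to the topic model instead of whole emails
doc_ids = []    # index of the email each sentence came from
for doc_id, email in enumerate(emails):
    for sentence in split_into_sentences(email):
        sentences.append(sentence)
        doc_ids.append(doc_id)

print(sentences)
print(doc_ids)
```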

Although preprocessing is generally not needed, I do not think it would hurt to remove the greetings from these documents. If all you are interested in is the content of these documents, and not who sent or received them, then it might be helpful.
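As a rough sketch of such a cleanup step, the snippet below strips a leading greeting line and everything from a closing phrase onwards. The greeting and sign-off patterns are assumptions and would need to be adapted to the actual tickets (and their language).

```python
import re

# Assumed patterns; real data will need its own, probably longer, lists.
GREETING = re.compile(r"^\s*(hi|hello|hey|dear)\b[^\n]*\n+", re.IGNORECASE)
SIGN_OFF = re.compile(
    r"\n\s*(best regards|kind regards|regards|thanks in advance|thank you|cheers)\b.*\Z",
    re.IGNORECASE | re.DOTALL,
)


def strip_email_boilerplate(text):
    """Remove a leading greeting line and a trailing sign-off/signature block."""
    text = text.replace("\r\n", "\n")
    text = GREETING.sub("", text, count=1)
    text = SIGN_OFF.sub("", text, count=1)
    return text.strip()


email = (
    "Hello support team,\n\n"
    "I cannot reset my password. The reset link returns an error.\n\n"
    "Best regards,\nJane Doe\nExample Corp\n"
)
cleaned = strip_email_boilerplate(email)
print(cleaned)
```

The sign-off pattern only fires when the closing phrase starts a line, so a "thank you" in the middle of a sentence is left alone.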

I think I wouldn't be worried about the topic representation but more about whether documents end up in "greeting" topics.

9 replies

MaartenGr Dec 10, 2024
Maintainer

> What would your approach be to prevent this behaviour?

So, if I'm not mistaken the clusters look good but the topic representation doesn't? If so, you might be able to improve the representation with something like KeyBERTInspired or using an LLM to extract a more accurate description/label.

> Another question I have is: How exactly does BERTopic handle new documents? Does it assign the new document a topic based on "the cluster embedding" or the c-TF-IDF of a cluster?

That depends on whether you used pickle or pytorch/safetensors to save and load the model. With pickle, and this might also depend on whether you are using zero-shot topic modeling, it uses the underlying dimensionality reduction and cluster model to perform the predictions.

With pytorch/safetensors it calculates the cosine similarity between the document embedding and the topic embeddings to make its prediction.
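The similarity-based prediction can be illustrated without the library. The sketch below uses made-up three-dimensional embeddings and hypothetical helper names (`cosine_similarity`, `assign_topic`); it shows the idea only and is not BERTopic's internal code.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy topic embeddings (topic id -> vector); the values are invented.
topic_embeddings = {
    0: [0.9, 0.1, 0.0],  # e.g. an "authentication" topic
    1: [0.1, 0.9, 0.1],  # e.g. a "performance" topic
}


def assign_topic(doc_embedding, topic_embeddings):
    """Return the most similar topic and all per-topic similarities."""
    similarities = {
        topic: cosine_similarity(doc_embedding, embedding)
        for topic, embedding in topic_embeddings.items()
    }
    best_topic = max(similarities, key=similarities.get)
    return best_topic, similarities


topic, similarities = assign_topic([0.8, 0.2, 0.1], topic_embeddings)
print(topic, similarities)
```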

w0rsti Dec 10, 2024
Author

It's not about the representation of the cluster, but rather that the cluster is "not desired".

Imagine the model has successfully found "generic" topics/clusters, e.g. "Authentication Issues", "Performance Issues", "Feature XY Issues".

Due to the strong presence of the word "hotel" in one of our customers' tickets, the model also creates a topic around "hotel". That customer's tickets all get put into this topic, although I want them to end up in the other topics. It seems like it focuses too much on the word "hotel" and disregards other information in the ticket about e.g. issues with the authentication.

MaartenGr Dec 10, 2024
Maintainer

> Due to the strong presence of the word "hotel" in one of our customers' tickets, the model also creates a topic around "hotel".

Hmmm, if it's just the appearance of the word, then it shouldn't affect the cluster creation that much since the features are created with an embedding model and not a bag-of-words in your case, right? Perhaps by switching to a better-performing embedding model, it can more accurately capture the semantics of the document.

w0rsti Dec 11, 2024
Author

> Hmmm, if it's just the appearance of the word, then it shouldn't affect the cluster creation that much since the features are created with an embedding model and not a bag-of-words in your case, right? Perhaps by switching to a better-performing embedding model, it can more accurately capture the semantics of the document.

Maybe it's not about the actual word but more about the "structure" of how the customer creates the ticket? Nonetheless, I have a topic that doesn't fit my use case, and I want to prevent further tickets from being put into that cluster/topic since it's not meaningful to us. Is it fine to merge it with the -1 topic and then call reduce outliers to try to fit the new outliers into the other topics?

> That depends on whether you used pickle or pytorch/safetensors to save and load the model. With pickle, and this might also depend on whether you are using zero-shot topic modeling, it uses the underlying dimensionality reduction and cluster model to perform the predictions.
> With pytorch/safetensors it calculates the cosine similarity between the document embedding and the topic embeddings to make its prediction.

For now I am not saving the model as we are still in the development phase, but later on we would use pytorch/safetensors. By topic embeddings, you mean the mean embedding of all documents within that topic, right?

Also, we've noticed that there are some topics that are too generic or that we think can be split up further (e.g. an "authentication" topic could be split into topics like "login", "registration", "forgot credentials" and so on). I saw your answer on #1889 and wanted to ask what you think works best: creating submodels and merging them, or keeping them separate, with several instances of a BERTopic model in a kind of pipeline that passes a document to the submodel if the first model puts it into a generic topic. Are there any differences? The only advantage I see in keeping them separate is that if the submodel classifies the document as an outlier, we can still assign it to the generic topic, whereas with the merged model it might end up as an outlier overall.

Once again, thanks a lot for this great library and especially the insanely good maintenance/support you put into this project!

MaartenGr Dec 15, 2024
Maintainer

> Maybe it's not about the actual word but more about the "structure" of how the customer creates the ticket? Nonetheless, I have a topic that doesn't fit my use case, and I want to prevent further tickets from being put into that cluster/topic since it's not meaningful to us. Is it fine to merge it with the -1 topic and then call reduce outliers to try to fit the new outliers into the other topics?

Yes, if there are topics that you want to remove, merging them with the -1 topic is generally a good strategy.
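In BERTopic itself this would go through `merge_topics` and `reduce_outliers`. The sketch below is a library-free illustration of the same idea on plain lists: relabel the unwanted topic as -1, then move each outlier to its most similar remaining topic if the similarity clears a threshold. The helper names, the toy embeddings and the 0.5 threshold are all assumptions.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def drop_topic(topics, unwanted):
    """Relabel every document of the unwanted topic as an outlier (-1)."""
    return [-1 if topic == unwanted else topic for topic in topics]


def reassign_outliers(topics, doc_embeddings, topic_embeddings, threshold=0.5):
    """Move outliers to their most similar topic if the similarity reaches the threshold."""
    new_topics = []
    for topic, embedding in zip(topics, doc_embeddings):
        if topic != -1:
            new_topics.append(topic)
            continue
        best_topic, best_similarity = -1, threshold
        for candidate, centroid in topic_embeddings.items():
            similarity = cosine_similarity(embedding, centroid)
            if similarity >= best_similarity:
                best_topic, best_similarity = candidate, similarity
        new_topics.append(best_topic)
    return new_topics


# Topic 2 plays the role of the undesired "hotel" topic.
topics = [0, 1, 2, 2, -1]
doc_embeddings = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.2], [0.2, 0.9], [-1.0, -1.0]]
topic_embeddings = {0: [1.0, 0.0], 1: [0.0, 1.0]}  # remaining topics only

without_hotel = drop_topic(topics, unwanted=2)
final_topics = reassign_outliers(without_hotel, doc_embeddings, topic_embeddings)
print(without_hotel, final_topics)
```

The last document stays at -1 because it is not similar enough to any remaining topic, which is the behaviour one would want for genuine outliers.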

> For now I am not saving the model as we are still in the development phase, but later on we would use pytorch/safetensors. By topic embeddings, you mean the mean embedding of all documents within that topic, right?

That is correct.
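As a small illustration of what "mean embedding of all documents within that topic" means, the sketch below averages toy two-dimensional document embeddings per topic. The data and the helper name `mean_embedding` are made up for this example.

```python
def mean_embedding(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


# Toy document embeddings and the topic assigned to each document.
doc_embeddings = [[1.0, 0.0], [0.0, 1.0], [3.0, 2.0], [0.0, 3.0]]
doc_topics = [0, 1, 0, 1]

# Group the document embeddings by topic.
by_topic = {}
for embedding, topic in zip(doc_embeddings, doc_topics):
    by_topic.setdefault(topic, []).append(embedding)

# One embedding per topic: the average of its documents' embeddings.
topic_embeddings = {topic: mean_embedding(vectors) for topic, vectors in by_topic.items()}
print(topic_embeddings)
```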

> Also, we've noticed that there are some topics that are too generic or that we think can be split up further (e.g. an "authentication" topic could be split into topics like "login", "registration", "forgot credentials" and so on). I saw your answer on #1889 and wanted to ask what you think works best: creating submodels and merging them, or keeping them separate, with several instances of a BERTopic model in a kind of pipeline that passes a document to the submodel if the first model puts it into a generic topic. Are there any differences? The only advantage I see in keeping them separate is that if the submodel classifies the document as an outlier, we can still assign it to the generic topic, whereas with the merged model it might end up as an outlier overall.

I would generally advise merging them and using calculate_probabilities=True to generate the probabilities per topic. That way, you can still have that flexibility of ignoring the -1 topic and still assigning it to a non-outlier topic.
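A sketch of how the per-topic probabilities could be used for this: documents labeled -1 are moved to their most probable topic when that probability is high enough. It assumes that row i of the matrix belongs to document i and column j to topic j (outlier topic excluded); the 0.3 cut-off and the function name are arbitrary choices for illustration.

```python
def assign_from_probabilities(topics, probabilities, min_probability=0.3):
    """Replace -1 with the most probable topic when its probability is high enough."""
    resolved = []
    for topic, probs in zip(topics, probabilities):
        if topic == -1:
            best = max(range(len(probs)), key=probs.__getitem__)
            if probs[best] >= min_probability:
                topic = best
        resolved.append(topic)
    return resolved


# Toy output of a model with two non-outlier topics.
topics = [0, -1, -1, 1]
probabilities = [
    [0.90, 0.05],
    [0.10, 0.60],  # outlier, but clearly closest to topic 1
    [0.05, 0.10],  # outlier with no convincing topic, stays -1
    [0.20, 0.70],
]

resolved = assign_from_probabilities(topics, probabilities)
print(resolved)
```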

w0rsti · 2025-01-29T13:51:12Z

w0rsti
Jan 29, 2025
Author

Hey @MaartenGr!

After playing around with the library I am getting more and more impressed by the capabilities of BERTopic!
I was able to preprocess my data so that I am able to get good results. I just wanted to ask for your opinion/advice on the following:

I've trained a base model that contains a lot of good topics, although some of them are still not quite satisfying:

- As already mentioned, some topics are too broad and we want to discover subtopics within them. As we already discussed, I want to follow your advice of taking the documents within a broad topic and training a submodel on them to get more granular topics.
- For some topics we are not able to find a good semantic relation.
- Using high min_topic_size values results in mostly good clusters but also a lot of outliers that might have interesting, smaller topics in them.

In order to approach these issues I've come up with the following strategy:

1. Train a model on the unclassified dataset
2. Evaluate the topics found and classify each into one of three states: "Good", "Bad", "Split"
   - Good topics will be saved in a separate dataset, containing the label/id and the relevant documents
   - Bad topics and the outlier topic will stay in the unlabeled dataset for further training
   - Split topics will be removed from the unlabeled dataset and handled separately by training a submodel on this data
3. Continue at step 1: iteratively train a new model (possibly with a smaller min_topic_size in order to find more topics within the unlabeled dataset) and keep iterating until no more useful topics are found or there is no more unlabeled data.
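The triage in step 2 can be sketched as a plain function that routes documents by the verdict given to their topic. The function name, the `verdicts` mapping and the toy data are hypothetical; the model training itself is left out.

```python
def triage(documents, topics, verdicts):
    """Route documents by the manual verdict ("Good", "Bad", "Split") of their topic."""
    good, to_split, unlabeled = {}, {}, []
    for doc, topic in zip(documents, topics):
        # Outliers and unreviewed topics are treated like "Bad" topics.
        verdict = "Bad" if topic == -1 else verdicts.get(topic, "Bad")
        if verdict == "Good":
            good.setdefault(topic, []).append(doc)
        elif verdict == "Split":
            to_split.setdefault(topic, []).append(doc)
        else:
            unlabeled.append(doc)
    return good, to_split, unlabeled


documents = [
    "login fails",
    "password reset mail missing",
    "page loads slowly",
    "asdf",
    "hello",
]
topics = [0, 0, 1, 2, -1]                     # output of one training round
verdicts = {0: "Split", 1: "Good", 2: "Bad"}  # manual evaluation of that round

good, to_split, unlabeled = triage(documents, topics, verdicts)
print(good, to_split, unlabeled)
```

Here `unlabeled` would feed the next training round, while each entry of `to_split` would be handed to its own submodel.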

The problem I have is how to combine the results into one model. I can think of two ways:

1. Use the (now) labeled dataset to train a model using manual topic modeling (basically just a classifier)
2. Keep merging the submodels (either "smaller base models" or "split topic models") and their topics back into the base model.

The problems/questions I have:

[Only relevant for approach 2] How can I safely merge the submodel (and its topics) back into the main model? I know there is the merge_models function, but the problem is that I somehow need to delete/remove the main topic. (E.g. the "Authentication" topic will be split up into "Login" and "Register" -> I don't want "Authentication", "Login" and "Register" in the main model, but only the latter two.) I thought about merging the main topic with the -1 (outlier) topic, but I am afraid that this will influence the actual "outlier" topic and have an impact on further classification of new documents.

What is your recommended approach? Are there any pros/cons between the two ways?

1 reply

MaartenGr Jan 29, 2025
Maintainer

> After playing around with the library I am getting more and more impressed by the capabilities of BERTopic!

Thank you for the kind words!

> [Only relevant for approach 2] How can I safely merge the submodel (and its topics) back into the main model? I know there is the merge_models function, but the problem is that I somehow need to delete/remove the main topic. (E.g. the "Authentication" topic will be split up into "Login" and "Register" -> I don't want "Authentication", "Login" and "Register" in the main model, but only the latter two.) I thought about merging the main topic with the -1 (outlier) topic, but I am afraid that this will influence the actual "outlier" topic and have an impact on further classification of new documents.

It's quite alright to merge that with the outlier topic, I think, as long as you indeed keep track of which documents are actual outliers and which aren't. The manual topic model is based purely on the y that you create, so you only need to make sure that the proper document has the "Login" topic and not the "Authentication" topic. Moreover, I don't think you will need to use the classification function in the manual topic modeling here, as the default way to perform inference is to find the cosine similarity between the topic embeddings and the document embeddings. As such, you can also easily filter out the -1 results when you perform any transformation/prediction, as the result is a document/topic matrix of distances.

Personally, I think I would prefer the manual approach, as that would easily allow me to filter and select the topics I want before finally creating them from "scratch" (as the manual approach will recalculate the topic embeddings and representations).
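For the manual approach, the label vector y has to carry the subtopic labels instead of the broad topic. A minimal sketch of building such a vector is shown below; the function name, the id-offset scheme and the toy labels are assumptions, and the resulting ids are not renumbered to be consecutive.

```python
def build_manual_labels(base_topics, split_topic, sub_topics):
    """Replace the labels of one broad topic with the labels found by its submodel.

    base_topics: base-model label per document.
    sub_topics:  submodel label for each document of `split_topic`, in document order.
    """
    next_id = max(base_topics) + 1  # first free topic id
    sub_labels = iter(sub_topics)
    labels = []
    for topic in base_topics:
        if topic != split_topic:
            labels.append(topic)
            continue
        sub = next(sub_labels)
        # Submodel outliers become overall outliers; other subtopics get fresh ids.
        labels.append(-1 if sub == -1 else next_id + sub)
    return labels


# Topic 1 plays the role of "Authentication"; the submodel found
# subtopic 0 ("Login"), subtopic 1 ("Register") and one outlier.
base_topics = [0, 1, 1, 2, 1, -1]
sub_topics = [0, 1, -1]

y = build_manual_labels(base_topics, split_topic=1, sub_topics=sub_topics)
print(y)
```

In this toy run the "Authentication" id (1) no longer appears in `y`, so only the subtopics would be recreated when the labels are used for manual topic modeling.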
