partial_fit with hierarchical_topics #837

ajdaling · 2022-11-12T03:09:38Z

First off, let me say that this module is amazing and the work you are doing is awesome.

Short version, I used the suggested setup for partial_fit on a reasonably small dataset and it works perfectly. The issue is that a model trained using the partial_fit method does not seem to work when I call hierarchical_topics(). Is partial_fit not compatible with hierarchical_topics? I'm not sure if this is a bug, user error, or if I am simply not using it the way it was intended, but I am out of ideas so any help is appreciated.

Long version:

My setup (basically copied from the readme):

IncrementalPCA
MiniBatchKMeans
OnlineCountVectorizer
30,000 document dataset with allenai/specter embeddings
ample memory and GPUs

When I call hierarchical_topics on the trained topic_model, it throws the following error:

  File "/users/PYS1027/ajdaling/work/munch/compare_models/generate_bertopic_model.py", line 266, in generate_model
    h = model.hierarchical_topics(list(doc_df.text))
  File "/users/PYS1027/ajdaling/.conda/envs/munch/lib/python3.9/site-packages/bertopic/_bertopic.py", line 860, in hierarchical_topics
    documents = pd.DataFrame({"Document": docs,
  File "/users/PYS1027/ajdaling/.conda/envs/munch/lib/python3.9/site-packages/pandas/core/frame.py", line 662, in __init__
    mgr = dict_to_mgr(data, index, columns, dtype=dtype, copy=copy, typ=manager)
  File "/users/PYS1027/ajdaling/.conda/envs/munch/lib/python3.9/site-packages/pandas/core/internals/construction.py", line 493, in dict_to_mgr
    return arrays_to_mgr(arrays, columns, index, dtype=dtype, typ=typ, consolidate=copy)
  File "/users/PYS1027/ajdaling/.conda/envs/munch/lib/python3.9/site-packages/pandas/core/internals/construction.py", line 118, in arrays_to_mgr
    index = _extract_index(arrays)
  File "/users/PYS1027/ajdaling/.conda/envs/munch/lib/python3.9/site-packages/pandas/core/internals/construction.py", line 666, in _extract_index
    raise ValueError("All arrays must be of the same length")
ValueError: All arrays must be of the same length

Upon inspection, it seems that self.topics_ has the length of the last batch of documents passed into partial_fit, so when I pass in my full list of documents, the documents list is longer than the topics list in the DataFrame constructor.
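The mismatch described here can be reproduced without any libraries. The sketch below is only an illustration: StandInModel is a hypothetical stand-in, not BERTopic, and it mimics just the one behaviour reported in this thread (partial_fit overwrites topics_ with the topics of the latest batch). The chunked helper and the batch size of 1000 are likewise assumptions made for the example.

```python
from typing import Iterator, List, Sequence


class StandInModel:
    """Hypothetical stand-in (NOT BERTopic): remembers topics of the last batch only."""

    def __init__(self) -> None:
        self.topics_: List[int] = []

    def partial_fit(self, docs: Sequence[str]) -> "StandInModel":
        # Overwrite instead of extend, mirroring the behaviour reported above.
        # The topic ids are fake; they only need to exist, one per document.
        self.topics_ = [len(doc) % 3 for doc in docs]
        return self


def chunked(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


docs = [f"document number {i}" for i in range(2500)]
model = StandInModel()
for batch in chunked(docs, 1000):  # batches of 1000, 1000 and 500 documents
    model.partial_fit(batch)

# After the loop, topics_ covers only the final batch, not the full corpus.
last_batch_size = len(model.topics_)
mismatch = len(docs) != last_batch_size
print(f"documents: {len(docs)}, topics kept: {last_batch_size}, mismatch: {mismatch}")
```

Pairing the full document list with a topics list of this shorter length is what makes the DataFrame constructor in the traceback fail.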

As a shot in the dark, I tried modifying the source code for the hierarchical_topics function to take the full list of topics (obtained calling .transform() on the full list of documents) as an input parameter but that led to other errors.

More generally, I am using partial_fit because I have some datasets that are simply too large for the standard umap/hdbscan setup (30M+ documents). I am open to any suggestions/configuration, not necessarily partial_fit if it is not a viable option, that would get me to a hierarchical set of clusters.

Thank you in advance.


MaartenGr · 2022-11-12T09:05:31Z

> First off, let me say that this module is amazing and the work you are doing is awesome.

Thank you for your kind words!

I am quite sure that your problem should be resolved with the following:

# Incrementally fit the topic model by training on 1000 documents at a time and track the topics in each iteration
topics = []
for docs in doc_chunks:
    topic_model.partial_fit(docs)
    topics.extend(topic_model.topics_)

topic_model.topics_ = topics

As you mentioned, .partial_fit only keeps track of the topics created in that specific step. The last part of the example in the documentation mentions that you need to keep updating the internal topic_model.topics_ yourself in order to use it for hierarchical topic modeling. You can skip that update, but hierarchical topic modeling will then only work for the most recent set of documents the model was trained on.
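As a minimal sketch of that bookkeeping, the accumulation loop can be wrapped in a helper that also checks the final length before overwriting topics_. Everything named here is an assumption for illustration: fit_in_batches and chunked are hypothetical helpers, and StandInModel is a dependency-free stand-in (not BERTopic) that only imitates the partial_fit and topics_ behaviour discussed in this thread.

```python
from typing import Iterator, List, Sequence


class StandInModel:
    """Hypothetical stand-in (NOT BERTopic): partial_fit keeps the latest batch's topics."""

    def __init__(self) -> None:
        self.topics_: List[int] = []

    def partial_fit(self, docs: Sequence[str]) -> "StandInModel":
        # Fake topic ids so the example runs without any dependencies.
        self.topics_ = [len(doc) % 3 for doc in docs]
        return self


def chunked(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fit_in_batches(model, docs: Sequence[str], batch_size: int) -> List[int]:
    """Call partial_fit per batch and collect every batch's topics in order."""
    all_topics: List[int] = []
    for batch in chunked(docs, batch_size):
        model.partial_fit(batch)
        # Copy the per-batch topics before the next call replaces them.
        all_topics.extend(model.topics_)
    if len(all_topics) != len(docs):
        raise ValueError(
            f"expected {len(docs)} topics, collected {len(all_topics)}"
        )
    # One topic per document, so later steps can pair topics with the full corpus.
    model.topics_ = all_topics
    return all_topics


docs = [f"document number {i}" for i in range(2500)]
model = StandInModel()
topics = fit_in_batches(model, docs, batch_size=1000)
print(len(topics), len(model.topics_))
```

The length check is a design choice rather than a requirement: it surfaces a mismatch at training time instead of later inside the DataFrame constructor. With a real model, the same helper would only rely on the two members used in the snippet above, partial_fit and topics_.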

ajdaling · 2022-11-12T21:35:46Z

Oh, wow. I swear that paragraph explicitly documenting and answering my exact question was not there before. You just added that to make me look bad...

Thank you for responding so quickly. While I am embarrassed that the answer to my problem was just "read the extremely well-written and easy-to-follow documentation" and I am sincerely sorry for wasting your time on something you so clearly answered, I take solace only in the hope that someday, someone as lazy and oblivious as I will make the same mistake and share in my embarrassment.

Thanks again for the help.

MaartenGr · 2022-11-14T04:41:43Z

No problem! Any and all questions are welcome. Reading through the documentation, it does seem like it's rather hidden away. I'll make sure it gets a bit clearer in the next release 😄

ajdaling closed this as completed Nov 12, 2022

MaartenGr added a commit that referenced this issue Nov 29, 2022

Fix #837 by updating the documentation

d183bb1

MaartenGr mentioned this issue Nov 29, 2022

v0.13 #840

Merged
