partial_fit with hierarchical_topics #837
Comments
Thank you for your kind words! I am quite sure that your problem should be resolved with the following:

    # Incrementally fit the topic model by training on 1000 documents at a time and track the topics in each iteration
    topics = []
    for docs in doc_chunks:
        topic_model.partial_fit(docs)
        topics.extend(topic_model.topics_)
    topic_model.topics_ = topics

As you mentioned, when you run
Oh, wow. I swear that paragraph explicitly documenting and answering my exact question was not there before. You just added that to make me look bad... Thank you for responding so quickly. While I am embarrassed that the answer to my problem was just "read the extremely well-written and easy-to-follow documentation" and I am sincerely sorry for wasting your time on something you so clearly answered, I take solace only in the hope that someday, someone as lazy and oblivious as I will make the same mistake and share in my embarrassment. Thanks again for the help.
No problem! Any and all questions are welcome. Reading through the documentation, it does seem like it's rather hidden away. I'll make sure it gets a bit clearer in the next release 😄
First off, let me say that this module is amazing and the work you are doing is awesome.
Short version, I used the suggested setup for partial_fit on a reasonably small dataset and it works perfectly. The issue is that a model trained using the partial_fit method does not seem to work when I call hierarchical_topics(). Is partial_fit not compatible with hierarchical_topics? I'm not sure if this is a bug, user error, or if I am simply not using it the way it was intended, but I am out of ideas so any help is appreciated.
Long version:
My setup (basically copied from the readme):
When I call hierarchical_topics on the trained topic_model, it throws the following error:
  File "/users/PYS1027/ajdaling/work/munch/compare_models/generate_bertopic_model.py", line 266, in generate_model
    h = model.hierarchical_topics(list(doc_df.text))
  File "/users/PYS1027/ajdaling/.conda/envs/munch/lib/python3.9/site-packages/bertopic/_bertopic.py", line 860, in hierarchical_topics
    documents = pd.DataFrame({"Document": docs,
  File "/users/PYS1027/ajdaling/.conda/envs/munch/lib/python3.9/site-packages/pandas/core/frame.py", line 662, in __init__
    mgr = dict_to_mgr(data, index, columns, dtype=dtype, copy=copy, typ=manager)
  File "/users/PYS1027/ajdaling/.conda/envs/munch/lib/python3.9/site-packages/pandas/core/internals/construction.py", line 493, in dict_to_mgr
    return arrays_to_mgr(arrays, columns, index, dtype=dtype, typ=typ, consolidate=copy)
  File "/users/PYS1027/ajdaling/.conda/envs/munch/lib/python3.9/site-packages/pandas/core/internals/construction.py", line 118, in arrays_to_mgr
    index = _extract_index(arrays)
  File "/users/PYS1027/ajdaling/.conda/envs/munch/lib/python3.9/site-packages/pandas/core/internals/construction.py", line 666, in _extract_index
    raise ValueError("All arrays must be of the same length")
ValueError: All arrays must be of the same length
Upon inspection, it seems that self.topics_ only has the length of the last batch of documents passed into partial_fit, so when I pass in my full list of documents, the documents list is longer than the topics list in the DataFrame constructor.
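For anyone hitting the same thing, here is a minimal sketch of that length mismatch and of the accumulate-and-reassign workaround from the reply. It does not use BERTopic: `StubTopicModel` and its labelling rule are hypothetical stand-ins that only mimic the one behaviour described above (each `partial_fit` call overwrites `topics_` with labels for the current batch). The corpus size and the chunk size of 1000 are likewise just illustrative.

```python
from typing import List


class StubTopicModel:
    """Hypothetical stand-in, NOT BERTopic: partial_fit overwrites topics_ per batch."""

    def __init__(self) -> None:
        self.topics_: List[int] = []

    def partial_fit(self, docs: List[str]) -> "StubTopicModel":
        # Placeholder labelling rule (document length modulo 3), for illustration only.
        self.topics_ = [len(doc) % 3 for doc in docs]
        return self


# Illustrative corpus, split into chunks of at most 1000 documents.
all_docs = [f"document number {i}" for i in range(2500)]
doc_chunks = [all_docs[i:i + 1000] for i in range(0, len(all_docs), 1000)]

model = StubTopicModel()
topics: List[int] = []
for docs in doc_chunks:
    model.partial_fit(docs)
    topics.extend(model.topics_)  # keep the labels from every batch

# Before reassignment, topics_ only covers the final batch.
last_batch_len = len(model.topics_)

# After reassignment, there is one label per document in the full corpus.
model.topics_ = topics
print(last_batch_len, len(all_docs), len(model.topics_))
```

With the real model, the same idea means the list assigned to `topics_` has to line up one-to-one with the documents later passed to `hierarchical_topics`, in the same order as the chunks were fed in.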
As a shot in the dark, I tried modifying the source code for the hierarchical_topics function to take the full list of topics (obtained calling .transform() on the full list of documents) as an input parameter but that led to other errors.
More generally, I am using partial_fit because I have some datasets that are simply too large for the standard umap/hdbscan setup (30M+ documents). I am open to any suggestions/configuration, not necessarily partial_fit if it is not a viable option, that would get me to a hierarchical set of clusters.
Thank you in advance.