Hello!
I am working on analyzing research funding trends using data collected from NIH ExPORTER.
My team and I have run into a problem: there are no duplicate documents in our raw data (no repeated ABSTRACT, APPLICATION_ID, or PI_names), but the documents extracted from the BERTopic results via get_document_info(corpus) contain repeated rows with the exact same ABSTRACT, APPLICATION_ID, and PI_names. Even the budget start and end dates are identical.
I don't think the problem lies in preprocessing, as our corpus creation is rather simple:
```python
import re

def clean_text(text):
    text = text.lower().strip()               # Lowercase
    text = re.sub(r'\s+', ' ', text)          # Collapse extra whitespace
    text = re.sub(r'[^\w\s]', '', text)       # Remove punctuation
    text = re.sub(r'\b\d+\b', '', text)       # Remove standalone numbers
    return text

data['CLEANED_ABSTRACT'] = data['ABSTRACT_TEXT'].astype(str).apply(clean_text)
corpus = data['CLEANED_ABSTRACT'].astype(str).tolist()
```
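For reference, a check along these lines (a minimal sketch, assuming `data` is the same pandas DataFrame used above) is how we confirm there are no exact repeats in the raw data or in the cleaned corpus:

```python
# Sanity check: count exact duplicates in the raw data and in the cleaned corpus.
# Assumes `data` is the DataFrame used in the snippet above.
print("Duplicate APPLICATION_IDs:   ", data['APPLICATION_ID'].duplicated().sum())
print("Duplicate raw abstracts:     ", data['ABSTRACT_TEXT'].duplicated().sum())
print("Duplicate cleaned abstracts: ", data['CLEANED_ABSTRACT'].duplicated().sum())
```

All three counts come back as zero for us.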
There shouldn't be repeated documents: resubmissions or renewals would show up as rows with the same abstract and PI names but a different budget start date, and there are no such repeats in our raw data file.
The repeated documents only appear in doc_info, the DataFrame returned by get_document_info(corpus). Our current guess is that duplicates may be appearing because some documents are assigned to two or more topics at once, displacing other documents that were assigned to those topics with lower probabilities. However, we don't see these repeated documents across different topics, only within the same one. In our manual validation, where we sampled 15 documents from each of 15 topics, we find anywhere from 2 to 8 repeated documents per topic (a rough version of the check we use is sketched below).
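For completeness, this is roughly how we count the repeats per topic. It is only a sketch and assumes that doc_info has the Document and Topic columns that get_document_info normally returns:

```python
# Count exact duplicate documents within each topic of doc_info.
# `doc_info` is the DataFrame returned by topic_model.get_document_info(corpus).
dupes_per_topic = (
    doc_info.groupby('Topic')['Document']
    .apply(lambda docs: docs.duplicated().sum())
)

# Show only the topics that contain at least one repeated document.
print(dupes_per_topic[dupes_per_topic > 0])
```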
Any ideas are appreciated. Please let me know if more information is needed to answer this question. Thank you.