Hello!
I am working on analyzing research funding trends using data collected from NIH ExPORTER.
My team and I have run into a problem: there are no duplicate documents in our raw data (no repeated ABSTRACT, APPLICATION_ID, or PI_names), but the documents extracted from the BERTopic results via get_document_info(corpus) contain repeated rows with the exact same ABSTRACT, APPLICATION_ID, and PI_names. Even the budget start and end dates are identical.
I don't think the problem lies in preprocessing, as our corpus creation is rather simple:
```python
import re

def clean_text(text):
    text = text.lower().strip()               # Lowercase
    text = re.sub(r'\s+', ' ', text)          # Collapse extra whitespace
    text = re.sub(r'[^\w\s]', '', text)       # Remove punctuation
    text = re.sub(r'\b\d+\b', '', text)       # Remove standalone numbers
    return text

data['CLEANED_ABSTRACT'] = data['ABSTRACT_TEXT'].astype(str).apply(clean_text)
corpus = data['CLEANED_ABSTRACT'].astype(str).tolist()
```
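For reference, a check along these lines (a minimal sketch, assuming `data` is the same pandas DataFrame used above) is how we confirm there are no exact repeats in the raw data or in the cleaned corpus:

```python
# Sanity check: count exact duplicates in the raw data and in the cleaned corpus.
# Assumes `data` is the DataFrame used in the snippet above.
print("Duplicate APPLICATION_IDs:   ", data['APPLICATION_ID'].duplicated().sum())
print("Duplicate raw abstracts:     ", data['ABSTRACT_TEXT'].duplicated().sum())
print("Duplicate cleaned abstracts: ", data['CLEANED_ABSTRACT'].duplicated().sum())
```

All three counts come back as zero for us.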
There shouldn't be repeated documents: resubmissions or renewals would show up as rows with the same abstract and PI names but a different budget start date, and there are no such repeats in our raw data file.
The repeated documents only appear in doc_info, the DataFrame returned by get_document_info(corpus). Our current guess is that duplicates may be appearing because some documents are assigned to two or more topics at once, displacing other documents that were assigned to those topics with lower probabilities. However, we don't see these repeated documents across different topics, only within the same one. In our manual validation, where we sampled 15 documents from each of 15 topics, we find anywhere from 2 to 8 repeated documents per topic (a rough version of the check we use is sketched below).
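For completeness, this is roughly how we count the repeats per topic. It is only a sketch and assumes that doc_info has the Document and Topic columns that get_document_info normally returns:

```python
# Count exact duplicate documents within each topic of doc_info.
# `doc_info` is the DataFrame returned by topic_model.get_document_info(corpus).
dupes_per_topic = (
    doc_info.groupby('Topic')['Document']
    .apply(lambda docs: docs.duplicated().sum())
)

# Show only the topics that contain at least one repeated document.
print(dupes_per_topic[dupes_per_topic > 0])
```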
Any ideas are appreciated. Please let me know if more information is needed to answer this question. Thank you.