updated mine_hard_negatives method to include a separate corpus for mining hard negatives #2818
Conversation
Extended the mine_hard_negatives method in sentence_transformers/util.py to enable the retrieval of negative document candidates from a separate corpus instead of using the positives as the corpus.
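For context, a rough usage sketch of what this extension enables (illustrative only; the model name and the keyword for the new corpus argument are assumptions here, not the final API):

```python
from datasets import Dataset
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import mine_hard_negatives

model = SentenceTransformer("all-MiniLM-L6-v2")

pairs_dataset = Dataset.from_dict({
    "query": ["What is the capital of France?"],
    "positive": ["Paris"],
})

# A separate pool of candidate documents to mine negatives from,
# instead of reusing the positives as the corpus
corpus = ["Paris", "Berlin", "Rome", "Madrid", "Lisbon"]

dataset_with_negatives = mine_hard_negatives(
    pairs_dataset,
    model,
    corpus=corpus,      # assumed name of the new parameter added by this PR
    num_negatives=3,
)
```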
(Related to #2818) Thanks for opening this! Will have a look at the details shortly.
sentence_transformers/util.py (Outdated)
# positive_indices = indices == torch.arange(len(queries), device=indices.device).unsqueeze(-1)
# scores[positive_indices] = -float("inf")
positive_indices = []
for positive in positives:
    if positive in corpus:
        index = corpus.index(positive)
        positive_indices.append(index)
    else:
        positive_indices.append(-1)
positive_indices = torch.Tensor(positive_indices).to(torch.int32).unsqueeze(-1)
positive_indices = indices == positive_indices
I did a few tests with this, and I ran into a bit of an issue (one that already existed, too!): the same text can occur multiple times in corpus/positives. For example:
from datasets import Dataset

pairs_dataset = Dataset.from_dict({
    "query": ["What is the capital of France?", "What is the capital of Germany?", "What is the capital of Italy?", "What is Paris Hilton's first name?"],
    "positive": ["Paris", "Berlin", "Rome", "Paris"],
})
If this is your input dataset, and you don't specify a separate corpus, then your scores before the topk might be:
tensor([[0.6274, 0.3247, 0.3938, 0.6274],
[0.3650, 0.6055, 0.3545, 0.3650],
[0.3440, 0.3384, 0.5845, 0.3440],
[0.5039, 0.1982, 0.2269, 0.5039]], dtype=torch.float16)
This makes sense: the 1st and 4th columns are identical because the 1st and 4th corpus elements are identical. After topk, we get:
tensor([[0.6274, 0.6274, 0.3938, 0.3247],
[0.6055, 0.3650, 0.3650, 0.3545],
[0.5845, 0.3440, 0.3440, 0.3384],
[0.5039, 0.5039, 0.2269, 0.1982]], dtype=torch.float16)
And then the positive_indices become:
tensor([[ True, False, False, False],
[ True, False, False, False],
[ True, False, False, False],
[ True, False, False, False]])
because this only finds the first occurrence, not all occurrences. We have this same problem with the new
positive_indices = []
for positive in positives:
    if positive in corpus:
        index = corpus.index(positive)
        positive_indices.append(index)
    else:
        positive_indices.append(-1)
positive_indices = torch.Tensor(positive_indices).to(torch.int32).unsqueeze(-1)
positive_indices = indices == positive_indices
as well as the old
positive_indices = indices == torch.arange(len(queries), device=indices.device).unsqueeze(-1)
I think the best solution is to remove duplicates from corpus (and positives?)
- Tom Aarsen
@tomaarsen
Oh nice, great catch!
I agree that the easiest is to remove duplicates from the corpus. I think assuming/forcing the corpus to contain only unique entries is also reasonable. As an alternative, we could try to find all occurrences of a positive in the corpus and thus keep the corpus intact, but that might be computationally less efficient? In general, I guess it makes sense that during retrieval you retrieve different documents and no duplicates, so I'd advocate the remove-duplicates option for the corpus.
Regarding removing duplicates of the positives: in some parts we require that positives and queries have the same size (e.g. when calculating similarity_pairwise), so it would be good to keep them intact?
- Chris
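For illustration, here is a small sketch (not from the PR) of what the "find all occurrences" alternative could look like; corpus, positives, and the retrieved indices are toy stand-ins:

```python
import torch

corpus = ["Paris", "Berlin", "Rome", "Paris"]
positives = ["Paris", "Berlin", "Rome", "Paris"]

# list.index only reports the first match, so a duplicate positive later in the
# corpus is never flagged; instead, collect every corpus position per text
occurrences = {}
for idx, text in enumerate(corpus):
    occurrences.setdefault(text, []).append(idx)

# toy (num_queries, k) tensor of retrieved corpus indices after topk
indices = torch.tensor([[0, 3, 2, 1],
                        [1, 0, 3, 2],
                        [2, 0, 3, 1],
                        [0, 3, 2, 1]])

positive_mask = torch.zeros_like(indices, dtype=torch.bool)
for row, positive in enumerate(positives):
    for corpus_idx in occurrences.get(positive, []):
        positive_mask[row] |= indices[row] == corpus_idx

print(positive_mask)
# rows whose positive is the duplicated "Paris" now mark both occurrences
```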
As an alternative, we could try to find all occurrences of a positive in the corpus and thus keep the corpus intact, but that might be computationally less efficient?
Beyond being computationally less efficient, I now realise that keeping duplicates can hurt the quality of the mined negatives. In particular, the script searches for the range_max + 1 most similar embeddings, and then does filtering based on e.g. margin, max_score, etc. Lastly, we take the num_negatives negatives from the range_max + 1 that were not filtered away. If we have duplicates, then it'll happen more often that there are not num_negatives non-filtered negative candidates. In other words: we'll be able to find (slightly) more useful negatives if we remove duplicates.
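A schematic sketch of that candidate flow (hypothetical shapes and variable names, not the actual util.py internals):

```python
import torch

num_queries, corpus_size = 4, 100
range_max, num_negatives = 10, 3
margin, max_score = 0.0, 0.9

scores = torch.rand(num_queries, corpus_size)   # query-document similarities
positive_scores = torch.rand(num_queries)       # each query's similarity to its own positive

# 1) keep the range_max + 1 most similar corpus entries per query
top_scores, top_indices = scores.topk(range_max + 1, dim=1)

# 2) filter candidates that violate the margin or exceed max_score
keep = (top_scores < positive_scores.unsqueeze(1) - margin) & (top_scores <= max_score)

# 3) take up to num_negatives surviving candidates per query; duplicated corpus
#    entries would use up slots here without adding genuinely new negatives
for q in range(num_queries):
    negatives_q = top_indices[q][keep[q]][:num_negatives]
```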
Regarding removing duplicates of the positives: in some parts we require that positives and queries have the same size (e.g. when calculating similarity_pairwise), so it would be good to keep them intact?
This is indeed a tricky part. One idea that I had was to remove duplicates in corpus, but not in positives. Of course, this is a bit tricky to do efficiently, but I think I can make this work. Should I get started on this?
In other words: we'll be able to find (slightly) more useful negatives if we remove duplicates.
Yes, I totally agree!
This is indeed a tricky part. One idea that I had was to remove duplicates in corpus, but not in positives. Of course, this is a bit tricky to do efficiently, but I think I can make this work. Should I get started on this?
Yes, for me that also feels like the way to go now! :) Great to discuss with you Tom!
I believe I solved it via 46ae230. The patch is smaller than expected, but I believe it works well. In essence, I:
- deduplicate the corpus naively (via list & set)
- create a mapping from positive indices to corpus indices
- use the aforementioned mapping to get the full positive embeddings from the deduplicated corpus embeddings
- use the aforementioned mapping to recognize when a candidate negative from the corpus is equal to a positive, so I can filter those positives away
I think this works pretty nicely: fewer embeddings required, too.
If you can, please verify if this works for you too (just a quick glance to see if the outputs are reasonable is fine).
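For reference, a minimal sketch of the deduplicate-and-map idea described above (hypothetical names and toy data, not the actual 46ae230 patch):

```python
import torch

corpus = ["Paris", "Berlin", "Rome", "Paris"]
positives = ["Paris", "Berlin", "Rome", "Paris"]

# 1) deduplicate the corpus naively while preserving order
deduped_corpus = list(dict.fromkeys(corpus))

# 2) map each positive onto its index in the deduplicated corpus
corpus_to_index = {text: idx for idx, text in enumerate(deduped_corpus)}
positive_indices = torch.tensor([corpus_to_index[text] for text in positives])

# 3) only the deduplicated corpus needs to be embedded; positive embeddings
#    are then gathered through the mapping (torch.rand stands in for model.encode)
corpus_embeddings = torch.rand(len(deduped_corpus), 384)
positive_embeddings = corpus_embeddings[positive_indices]

# 4) a retrieved candidate is the query's own positive exactly when its corpus
#    index matches positive_indices[q], which makes filtering positives easy
retrieved = torch.tensor([[0, 2], [1, 0], [2, 0], [0, 1]])  # toy top-k corpus indices
is_positive = retrieved == positive_indices.unsqueeze(1)
```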
This also allows us to simplify the rescoring filtering for positive pairs: aba1fb7
I think this might be good to go now. What do you think?
Thanks for the explanation! I just had a look at the code and couldn't find any issues, also love how neat the code is :)
I tested it on my end with my corpus data and it worked without problems, output looks reasonable too!
For me this is good to go! I will utilize this method a lot next week and can also let you know in case I find something else!
I made some slight adjustments that I'd like to get your thoughts on. I also have some more thoughts that you might be able to help brainstorm with.
Sure, anytime! How would you like to communicate, just via comments here? @tomaarsen
Yes! I think I'll try and work on removing the duplicates from the corpus.
Add a positive-to-corpus-indices mapping, useful to get non-deduplicated positives and to filter away positives taken from the corpus
Hey, I was about to open a PR with something similar from here: /~https://github.com/ArthurCamara/sentence-transformers/tree/hard_negatives_from_corpus. This means including a relevance dictionary as a parameter too.
Hello @ArthurCamara! Beyond your relevance-related changes, I like the [...]. I think your variant is certainly more powerful, but [...]. Curious about both of your stances on this, @ChrisGeishauser @ArthurCamara.
You can look at the trec-covid task from MTEB for an example. [...]
Makes sense. Easy to fix. =)
I agree that both settings should be supported. I also don't like the idea that [...]. The non-faiss option is something I was planning on doing today. I was a bit sick the last couple of days and finished this way too late at night. I can do some work on this PR later today with some of the changes, and then we can see what it looks like.
Hey @tomaarsen and @ChrisGeishauser, I've merged some of my changes on this PR: #2848.
Hi @ArthurCamara and @tomaarsen! Thanks a lot for keeping me in the loop and sorry for the late reply, I had quite a busy week :) Just one question regarding Tom's suggestion to have two functions: is this still relevant, or is everything now put into a single extended method?
Hi @tomaarsen and @ArthurCamara, just wanted to ask whether there is any update on your side regarding the MR :)
Apologies for the delay, I fell sick in July and have been recovering since. Much better now luckily.