
updated mine_hard_negatives method to include a separate corpus for mining hard negatives #2818

Conversation

@ChrisGeishauser (Author)

Extended the mine_hard_negatives method in sentence_transformers/util.py to enable the retrieval of negative document candidates from a separate corpus instead of using the positives as the corpus.
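
For context, a hedged sketch of what a call with a separate corpus might look like (the model name and toy data here are illustrative, and the new argument is assumed to be named corpus and to accept a plain list of strings, as in the diff discussed below):

    from datasets import Dataset
    from sentence_transformers import SentenceTransformer
    from sentence_transformers.util import mine_hard_negatives

    model = SentenceTransformer("all-MiniLM-L6-v2")  # any SentenceTransformer model
    pairs_dataset = Dataset.from_dict({
        "query": ["What is the capital of France?", "What is the capital of Germany?"],
        "positive": ["Paris", "Berlin"],
    })
    # Negative candidates are drawn from this corpus instead of from the positives.
    corpus = ["Paris", "Berlin", "Rome", "Madrid", "Lisbon", "Vienna"]

    mined = mine_hard_negatives(pairs_dataset, model, corpus=corpus, num_negatives=2)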

@tomaarsen (Collaborator)

(Related to #2818)

Thanks for opening this! Will have a look at the details shortly.

  • Tom Aarsen

Comment on lines 718 to 728
    # positive_indices = indices == torch.arange(len(queries), device=indices.device).unsqueeze(-1)
    # scores[positive_indices] = -float("inf")
    positive_indices = []
    for positive in positives:
        if positive in corpus:
            index = corpus.index(positive)
            positive_indices.append(index)
        else:
            positive_indices.append(-1)
    positive_indices = torch.Tensor(positive_indices).to(torch.int32).unsqueeze(-1)
    positive_indices = indices == positive_indices
@tomaarsen (Collaborator) · Jul 9, 2024

I did a few tests with this, and I ran into a bit of an issue (one that already existed, too!):
The same text can occur multiple times in corpus/positives. For example:

from datasets import Dataset

pairs_dataset = Dataset.from_dict({
    "query": ["What is the capital of France?", "What is the capital of Germany?", "What is the capital of Italy?", "What is Paris Hilton's first name?"],
    "positive": ["Paris", "Berlin", "Rome", "Paris"],
})

If this is your input dataset, and you don't specify a separate corpus, then your scores before the topk might be:

tensor([[0.6274, 0.3247, 0.3938, 0.6274],
        [0.3650, 0.6055, 0.3545, 0.3650],
        [0.3440, 0.3384, 0.5845, 0.3440],
        [0.5039, 0.1982, 0.2269, 0.5039]], dtype=torch.float16)

This makes sense: the 1st and 4th columns are identical because the 1st and 4th corpus elements are identical. After topk, we get:

tensor([[0.6274, 0.6274, 0.3938, 0.3247],
        [0.6055, 0.3650, 0.3650, 0.3545],
        [0.5845, 0.3440, 0.3440, 0.3384],
        [0.5039, 0.5039, 0.2269, 0.1982]], dtype=torch.float16)

And then the positive_indices become:

tensor([[ True, False, False, False],
        [ True, False, False, False],
        [ True, False, False, False],
        [ True, False, False, False]])

because this only finds the first occurrence, not all occurrences. We have this same problem with the new

    positive_indices = []
    for positive in positives:
        if positive in corpus:
            index = corpus.index(positive)
            positive_indices.append(index)
        else:
            positive_indices.append(-1)
    positive_indices = torch.Tensor(positive_indices).to(torch.int32).unsqueeze(-1)
    positive_indices = indices == positive_indices

as well as the old

    positive_indices = indices == torch.arange(len(queries), device=indices.device).unsqueeze(-1)
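
For illustration only (this is not the route taken in the end), one way to mark every occurrence rather than just the first is to compare texts instead of indices; a self-contained sketch using the example above:

    import torch

    corpus = ["Paris", "Berlin", "Rome", "Paris"]
    positives = ["Paris", "Berlin", "Rome", "Paris"]
    # `indices` as produced by the topk call above; hard-coded here to keep the sketch self-contained.
    indices = torch.tensor([[0, 3, 2, 1],
                            [1, 0, 3, 2],
                            [2, 0, 3, 1],
                            [0, 3, 2, 1]])

    # True wherever the retrieved candidate's text equals the query's positive,
    # so duplicated positives in the corpus are all masked out.
    positive_mask = torch.tensor([
        [corpus[idx] == positive for idx in row]
        for positive, row in zip(positives, indices.tolist())
    ])
    # tensor([[ True,  True, False, False],
    #         [ True, False, False, False],
    #         [ True, False, False, False],
    #         [ True,  True, False, False]])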

I think the best solution is to remove duplicates from corpus (and positives?)

  • Tom Aarsen

@ChrisGeishauser (Author) · Jul 10, 2024

@tomaarsen
Oh nice, great catch!

I agree that the easiest option is to remove duplicates from the corpus. I also think it is reasonable to assume (or enforce) that the corpus contains no duplicates. As an alternative, we could try to find all occurrences of a positive in the corpus and thus keep the corpus intact, but that might be computationally less efficient? In general, though, during retrieval you want to retrieve distinct documents rather than duplicates, so I'd advocate the remove-duplicates option for the corpus.

Regarding removing duplicates from the positives: in some places we require that positives and queries have the same size (e.g. when calculating similarity_pairwise), so it is probably best to keep them intact?

  • Chris

@tomaarsen (Collaborator)

As an alternative, we could try to find all occurrences of a positive in the corpus and thus keep the corpus intact, but that might be computationally less efficient?

Beyond being computationally less efficient, I now realise that keeping duplicates can hurt the quality of the mined negatives. In particular, the script searches for the range_max + 1 most similar embeddings and then filters them based on e.g. margin, max_score, etc. Lastly, we take num_negatives negatives from the range_max + 1 candidates that were not filtered away. If we have duplicates, it will happen more often that fewer than num_negatives non-filtered negative candidates remain. In other words: we'll be able to find (slightly) more useful negatives if we remove duplicates.
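
As a rough, self-contained sketch of that pipeline with toy numbers (the variable names are illustrative, not the actual implementation):

    import torch

    torch.manual_seed(0)
    num_queries, corpus_size = 4, 10
    range_max, num_negatives = 5, 3
    max_score, margin = 0.9, 0.05

    similarity = torch.rand(num_queries, corpus_size)  # query-vs-corpus similarities
    positive_scores = torch.rand(num_queries)          # query-vs-positive similarities

    # 1. Retrieve the range_max + 1 most similar corpus entries per query.
    scores, indices = similarity.topk(range_max + 1, dim=1)

    # 2. Filter candidates, e.g. by max_score and margin.
    keep = (scores <= max_score) & (scores <= positive_scores.unsqueeze(1) - margin)

    # 3. Take up to num_negatives surviving candidates per query. Duplicate corpus
    #    entries occupy slots in step 1 without adding new candidates, so with
    #    duplicates it happens more often that fewer than num_negatives survive.
    negatives_per_query = keep.sum(dim=1).clamp(max=num_negatives)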

Regarding removing duplicates from the positives: in some places we require that positives and queries have the same size (e.g. when calculating similarity_pairwise), so it is probably best to keep them intact?

This is indeed a tricky part. One idea that I had was to remove duplicates in corpus, but not in positives. Of course, this is a bit tricky to do efficiently, but I think I can make this work. Should I get started on this?

@ChrisGeishauser (Author) · Jul 10, 2024

In other words: we'll be able to find (slightly) more useful negatives if we remove duplicates.

Yes, I totally agree!

This is indeed a tricky part. One idea that I had was to remove duplicates in corpus, but not in positives. Of course, this is a bit tricky to do efficiently, but I think I can make this work. Should I get started on this?

Yes, for me that also feels like the way to go now! :) Great to discuss this with you, Tom!

@tomaarsen (Collaborator) · Jul 10, 2024

I believe I solved it via 46ae230. The patch is smaller than expected, but it seems to work well. In essence, I:

  1. deduplicate the corpus naively (via list & set)
  2. create a mapping from positive indices to corpus indices
  3. use the aforementioned mapping to get the full positive embeddings from the deduplicated corpus embeddings
  4. use the aforementioned mapping to recognize when a candidate negative from the corpus is equal to a positive, so I can filter those positives away

I think this works pretty nicely: fewer embeddings are required, too.
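
A minimal sketch of that idea (illustrative only, not the actual patch in 46ae230):

    corpus = ["Paris", "Berlin", "Rome", "Paris"]
    positives = ["Paris", "Berlin", "Rome", "Paris"]

    # 1. Deduplicate the corpus (order-preserving here for readability).
    deduped_corpus = list(dict.fromkeys(corpus))
    text_to_idx = {text: idx for idx, text in enumerate(deduped_corpus)}

    # 2. Map each positive to its index in the deduplicated corpus.
    positive_to_corpus_idx = [text_to_idx[positive] for positive in positives]

    # 3. Only deduped_corpus needs to be embedded; the positive embeddings are
    #    looked up via positive_to_corpus_idx.
    # 4. A mined candidate whose corpus index equals positive_to_corpus_idx[q] is
    #    query q's own positive and can be filtered away.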

If you can, please verify if this works for you too (just a quick glance to see if the outputs are reasonable is fine)

@tomaarsen (Collaborator) · Jul 10, 2024

This also allows us to simplify the rescoring filtering for positive pairs: aba1fb7

I think this might be good to go now. What do you think?

@ChrisGeishauser (Author)

Thanks for the explanation! I just had a look at the code and couldn't find any issues; I also love how neat the code is :)
I tested it on my end with my corpus data and it worked without problems; the output looks reasonable too!

For me this is good to go! I will utilize this method a lot next week and can also let you know in case I find something else!

@tomaarsen (Collaborator)

I made some slight adjustments that I'd like to get your thoughts on. I also have some more thoughts that you might be able to help brainstorm with.

@ChrisGeishauser (Author)

I made some slight adjustments that I'd like to get your thoughts on. I also have some more thoughts that you might be able to help brainstorm with.

Sure, anytime! How would you like to communicate, just via comments here? @tomaarsen

@tomaarsen (Collaborator)

Yes! I think I'll try and work on removing the duplicates from corpus based on our thread here: #2818 (comment)

Add a positive to corpus indices mapping, useful to get non-deduplicated positives and to filter away positives taken from the corpus
@ArthurCamara (Contributor)

Hey, I was about to open a PR with something similar from here: /~https://github.com/ArthurCamara/sentence-transformers/tree/hard_negatives_from_corpus
I think the biggest difference is that I considered the case where there could be multiple positives as well (e.g., TREC-Covid has ~500 positives per query).

This means including a relevance dictionary as a parameter too.
I can try bringing some of my changes to this PR if you think it is a meaningful use case as well, @tomaarsen @ChrisGeishauser

@tomaarsen (Collaborator)

Hello @ArthurCamara!
I haven't worked much with datasets that have multiple positives per query, but I can imagine that they're fairly common in IR datasets indeed (please correct me if I'm wrong). In your branch, what would a normal relevance_dataset look like? A dataset with ["query-id", "corpus-id", "score"] columns? Also, if I'm looking at it right, then your function requires that both the queries_dataset and corpus_dataset have id columns right?

Beyond your relevance-related changes, I like the chunk_size option: I can imagine that you could OOM with model.encode otherwise. As for keep_title, perhaps that map should only trigger if the dataset has a title column.

I think your variant is certainly more powerful, but mine_hard_negatives from this PR is also a lot simpler to use in terms of data preprocessing. Ideally, I'd like to keep the option for the simple usage, so I think it might be best to support 2 functions. To avoid duplication, it might be possible to rewrite mine_hard_negatives to use mine_hard_negatives_from_corpus (or mine_hard_negatives_from_relevance?) under the hood. One tricky component is that mine_hard_negatives_from_corpus must then support all behavior from mine_hard_negatives, which would require some extra work (e.g. a non-FAISS implementation, and perhaps the corpus deduplication that this PR adds).

Curious about both of your stances on this @ChrisGeishauser @ArthurCamara

  • Tom Aarsen

@ArthurCamara (Contributor) · Jul 12, 2024

Hello @ArthurCamara! I haven't worked much with datasets that have multiple positives per query, but I can imagine that they're fairly common in IR datasets indeed (please correct me if I'm wrong).

They are somewhat common, but I'm mostly coming from the MTEB/BEIR side of things, where some of the datasets may have multiple positives with different scores (hence the positive_score_threshold argument).

In your branch, what would a normal relevance_dataset look like? A dataset with ["query-id", "corpus-id", "score"] columns? Also, if I'm looking at it right, then your function requires that both the queries_dataset and corpus_dataset have id columns right?

You can look at the trec-covid from MTEB for an example. The default subset is a dataset with three columns: query-id, corpus-id and scores. This is also why we need ids for the queries and documents, so we can map the relevance back into the texts.
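
For readers unfamiliar with that layout, a tiny qrels-style example (the ids and scores are made up; the column names follow the ["query-id", "corpus-id", "score"] convention mentioned above):

    from datasets import Dataset

    relevance_dataset = Dataset.from_dict({
        "query-id": ["q1", "q1", "q2"],
        "corpus-id": ["d3", "d7", "d1"],
        "score": [2, 1, 1],  # graded relevance; a query can have many positives
    })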

Beyond your relevance-related changes, I like the chunk_size option: I can imagine that you could OOM with model.encode otherwise. As for keep_title, perhaps that map should only trigger if the dataset has a title column.

Makes sense. Easy to fix. =)

I think your variant is certainly more powerful, but mine_hard_negatives from this PR is also a lot simpler to use in terms of data preprocessing. Ideally, I'd like to keep the option for the simple usage, so I think it might be best to support 2 functions. To avoid duplication, it might be possible to rewrite mine_hard_negatives to use mine_hard_negatives_from_corpus (or mine_hard_negatives_from_relevance?) under the hood. One tricky component is that mine_hard_negatives_from_corpus must then support all behavior from mine_hard_negatives, which would require some extra work (e.g. a non-FAISS implementation, and perhaps the corpus deduplication that this PR adds).

Curious about both of your stances on this @ChrisGeishauser @ArthurCamara

  • Tom Aarsen

I agree that both settings should be supported. I also don't like the idea that mine_hard_negatives_from_corpus needs such specific Dataset formats. It's probably better to support passing a list of texts for the corpus and a mapping for the scores. Under the hood this can be transformed into whatever format mine_hard_negatives_from_corpus ends up using, including creating some fake document ids.

The non-FAISS option is something I was planning on doing today. I was a bit sick the last couple of days and finished this way too late at night.

I can do some work on this PR later today with some of the changes and then we can see what it looks like.

@ArthurCamara (Contributor)

Hey @tomaarsen and @ChrisGeishauser, I've merged some of my changes into PR #2848.
It allows both passing a separate corpus and having multiple positives per query, using the same interface as before. Let me know if you prefer me to merge my changes into the branch instead.

@ChrisGeishauser (Author)

Hi @ArthurCamara and @tomaarsen! Thanks a lot for keeping me in the loop, and sorry for the late reply, I had quite a busy week :)
@ArthurCamara, thanks for suggesting these advancements; the option to have multiple positives would definitely round out this function! I haven't had time yet to look at the code in detail, but I can probably do that next Tuesday (if necessary).

Just one question regarding Tom's suggestion to have two functions:

I think your variant is certainly more powerful, but mine_hard_negatives from this PR is also a lot simpler to use in terms of data preprocessing. Ideally, I'd like to keep the option for the simple usage, so I think it might be best to support 2 functions.

Is this still relevant, or is everything now put into a single extended method?

@ChrisGeishauser (Author)

Hi @tomaarsen and @ArthurCamara, just wanted to ask whether there is any update on your side regarding this PR :)

@tomaarsen (Collaborator)

Apologies for the delay, I fell sick in July and have been recovering since. Much better now, luckily.
I'll be moving forward with #2848 instead, which is a bit of a superset of the changes from this PR. I do want to explicitly thank you both for assisting with this! I'm looking forward to seeing the datasets that people prepare & then train with.

  • Tom Aarsen

@tomaarsen closed this Sep 11, 2024