
updated mine_hard_negatives method to include a separate corpus for mining hard negatives #2818

Conversation

@ChrisGeishauser (Author)

Extended the mine_hard_negatives method in sentence_transformers/util.py to enable the retrieval of negative document candidates from a separate corpus instead of using the positives as the corpus.
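
For context, a hedged sketch of what a call with a separate corpus might look like (the model name and toy data here are illustrative, and the new argument is assumed to be named corpus and to accept a plain list of strings, as in the diff discussed below):

    from datasets import Dataset
    from sentence_transformers import SentenceTransformer
    from sentence_transformers.util import mine_hard_negatives

    model = SentenceTransformer("all-MiniLM-L6-v2")  # any SentenceTransformer model
    pairs_dataset = Dataset.from_dict({
        "query": ["What is the capital of France?", "What is the capital of Germany?"],
        "positive": ["Paris", "Berlin"],
    })
    # Negative candidates are drawn from this corpus instead of from the positives.
    corpus = ["Paris", "Berlin", "Rome", "Madrid", "Lisbon", "Vienna"]

    mined = mine_hard_negatives(pairs_dataset, model, corpus=corpus, num_negatives=2)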

@tomaarsen (Collaborator)

(Related to #2818)

Thanks for opening this! Will have a look at the details shortly.

  • Tom Aarsen

Comment on lines 718 to 728
    # positive_indices = indices == torch.arange(len(queries), device=indices.device).unsqueeze(-1)
    # scores[positive_indices] = -float("inf")
    positive_indices = []
    for positive in positives:
        if positive in corpus:
            index = corpus.index(positive)
            positive_indices.append(index)
        else:
            positive_indices.append(-1)
    positive_indices = torch.Tensor(positive_indices).to(torch.int32).unsqueeze(-1)
    positive_indices = indices == positive_indices
@tomaarsen (Collaborator) · Jul 9, 2024

I did a few tests with this, and I ran into a bit of an issue (one that already existed, too!):
The same text can occur multiple times in corpus/positives. For example:

from datasets import Dataset

pairs_dataset = Dataset.from_dict({
    "query": ["What is the capital of France?", "What is the capital of Germany?", "What is the capital of Italy?", "What is Paris Hilton's first name?"],
    "positive": ["Paris", "Berlin", "Rome", "Paris"],
})

If this is your input dataset, and you don't specify a separate corpus, then your scores before the topk might be:

tensor([[0.6274, 0.3247, 0.3938, 0.6274],
        [0.3650, 0.6055, 0.3545, 0.3650],
        [0.3440, 0.3384, 0.5845, 0.3440],
        [0.5039, 0.1982, 0.2269, 0.5039]], dtype=torch.float16)

This makes sense: the 1st and 4th columns are identical because the 1st and 4th corpus elements are identical. After topk, we get:

tensor([[0.6274, 0.6274, 0.3938, 0.3247],
        [0.6055, 0.3650, 0.3650, 0.3545],
        [0.5845, 0.3440, 0.3440, 0.3384],
        [0.5039, 0.5039, 0.2269, 0.1982]], dtype=torch.float16)

And then the positive_indices become:

tensor([[ True, False, False, False],
        [ True, False, False, False],
        [ True, False, False, False],
        [ True, False, False, False]])

because this only finds the first occurrence, not all occurrences. We have this same problem with the new

    positive_indices = []
    for positive in positives:
        if positive in corpus:
            index = corpus.index(positive)
            positive_indices.append(index)
        else:
            positive_indices.append(-1)
    positive_indices = torch.Tensor(positive_indices).to(torch.int32).unsqueeze(-1)
    positive_indices = indices == positive_indices

as well as the old

    positive_indices = indices == torch.arange(len(queries), device=indices.device).unsqueeze(-1)
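
For illustration only (this is not the route taken in the end), one way to mark every occurrence rather than just the first is to compare texts instead of indices; a self-contained sketch using the example above:

    import torch

    corpus = ["Paris", "Berlin", "Rome", "Paris"]
    positives = ["Paris", "Berlin", "Rome", "Paris"]
    # `indices` as produced by the topk call above; hard-coded here to keep the sketch self-contained.
    indices = torch.tensor([[0, 3, 2, 1],
                            [1, 0, 3, 2],
                            [2, 0, 3, 1],
                            [0, 3, 2, 1]])

    # True wherever the retrieved candidate's text equals the query's positive,
    # so duplicated positives in the corpus are all masked out.
    positive_mask = torch.tensor([
        [corpus[idx] == positive for idx in row]
        for positive, row in zip(positives, indices.tolist())
    ])
    # tensor([[ True,  True, False, False],
    #         [ True, False, False, False],
    #         [ True, False, False, False],
    #         [ True,  True, False, False]])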

I think the best solution is to remove duplicates from corpus (and positives?)

  • Tom Aarsen

@ChrisGeishauser (Author) · Jul 10, 2024

@tomaarsen
Oh nice, great catch!

I agree that the easiest option is to remove duplicates from the corpus. I also think it is reasonable to assume (or enforce) that the corpus contains no duplicates. As an alternative, we could try to find all occurrences of a positive in the corpus and thus keep the corpus intact, but that might be computationally less efficient? In general, though, during retrieval you want to retrieve distinct documents rather than duplicates, so I'd advocate the remove-duplicates option for the corpus.

Regarding removing duplicates from the positives: in some places we require that positives and queries have the same size (e.g. when calculating similarity_pairwise), so it is probably best to keep them intact?

  • Chris

@tomaarsen (Collaborator)

As an alternative, we could try to find all occurrences of a positive in the corpus and thus keep the corpus intact, but that might be computationally less efficient?

Beyond being computationally less efficient, I now realise that keeping duplicates can hurt the quality of the mined negatives. In particular, the script searches for the range_max + 1 most similar embeddings and then filters them based on e.g. margin, max_score, etc. Lastly, we take num_negatives negatives from the range_max + 1 candidates that were not filtered away. If we have duplicates, it will happen more often that fewer than num_negatives non-filtered negative candidates remain. In other words: we'll be able to find (slightly) more useful negatives if we remove duplicates.
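
As a rough, self-contained sketch of that pipeline with toy numbers (the variable names are illustrative, not the actual implementation):

    import torch

    torch.manual_seed(0)
    num_queries, corpus_size = 4, 10
    range_max, num_negatives = 5, 3
    max_score, margin = 0.9, 0.05

    similarity = torch.rand(num_queries, corpus_size)  # query-vs-corpus similarities
    positive_scores = torch.rand(num_queries)          # query-vs-positive similarities

    # 1. Retrieve the range_max + 1 most similar corpus entries per query.
    scores, indices = similarity.topk(range_max + 1, dim=1)

    # 2. Filter candidates, e.g. by max_score and margin.
    keep = (scores <= max_score) & (scores <= positive_scores.unsqueeze(1) - margin)

    # 3. Take up to num_negatives surviving candidates per query. Duplicate corpus
    #    entries occupy slots in step 1 without adding new candidates, so with
    #    duplicates it happens more often that fewer than num_negatives survive.
    negatives_per_query = keep.sum(dim=1).clamp(max=num_negatives)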

Regarding removing duplicates from the positives: in some places we require that positives and queries have the same size (e.g. when calculating similarity_pairwise), so it is probably best to keep them intact?

This is indeed a tricky part. One idea that I had was to remove duplicates in corpus, but not in positives. Of course, this is a bit tricky to do efficiently, but I think I can make this work. Should I get started on this?

@ChrisGeishauser (Author) · Jul 10, 2024

In other words: we'll be able to find (slightly) more useful negatives if we remove duplicates.

Yes, I totally agree!

This is indeed a tricky part. One idea that I had was to remove duplicates in corpus, but not in positives. Of course, this is a bit tricky to do efficiently, but I think I can make this work. Should I get started on this?

Yes, for me that also feels like the way to go now! :) Great to discuss this with you, Tom!

@tomaarsen (Collaborator) · Jul 10, 2024

I believe I solved it via 46ae230. The patch is smaller than expected, but it seems to work well. In essence, I:

  1. deduplicate the corpus naively (via list & set)
  2. create a mapping from positive indices to corpus indices
  3. use the aforementioned mapping to get the full positive embeddings from the deduplicated corpus embeddings
  4. use the aforementioned mapping to recognize when a candidate negative from the corpus is equal to a positive, so I can filter those positives away

I think this works pretty nicely: fewer embeddings are required, too.
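
A minimal sketch of that idea (illustrative only, not the actual patch in 46ae230):

    corpus = ["Paris", "Berlin", "Rome", "Paris"]
    positives = ["Paris", "Berlin", "Rome", "Paris"]

    # 1. Deduplicate the corpus (order-preserving here for readability).
    deduped_corpus = list(dict.fromkeys(corpus))
    text_to_idx = {text: idx for idx, text in enumerate(deduped_corpus)}

    # 2. Map each positive to its index in the deduplicated corpus.
    positive_to_corpus_idx = [text_to_idx[positive] for positive in positives]

    # 3. Only deduped_corpus needs to be embedded; the positive embeddings are
    #    looked up via positive_to_corpus_idx.
    # 4. A mined candidate whose corpus index equals positive_to_corpus_idx[q] is
    #    query q's own positive and can be filtered away.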

If you can, please verify if this works for you too (just a quick glance to see if the outputs are reasonable is fine)

@tomaarsen (Collaborator) · Jul 10, 2024

This also allows us to simplify the rescoring filtering for positive pairs: aba1fb7

I think this might be good to go now. What do you think?

@ChrisGeishauser (Author)

Thanks for the explanation! I just had a look at the code and couldn't find any issues; I also love how neat the code is :)
I tested it on my end with my corpus data and it worked without problems; the output looks reasonable too!

For me this is good to go! I will utilize this method a lot next week and can also let you know in case I find something else!

@tomaarsen (Collaborator)

I made some slight adjustments that I'd like to get your thoughts on. I also have some more thoughts that you might be able to help brainstorm with.

@ChrisGeishauser (Author)

I made some slight adjustments that I'd like to get your thoughts on. I also have some more thoughts that you might be able to help brainstorm with.

Sure, anytime! How would you like to communicate, just via comments here? @tomaarsen

@tomaarsen (Collaborator)

Yes! I think I'll try and work on removing the duplicates from corpus based on our thread here: #2818 (comment)

Add a positive to corpus indices mapping, useful to get non-deduplicated positives and to filter away positives taken from the corpus
@ArthurCamara (Contributor)

Hey, I was about to open a PR with something similar from here: /~https://github.com/ArthurCamara/sentence-transformers/tree/hard_negatives_from_corpus
I think the biggest difference is that I considered the case where there could be multiple positives as well (e.g., TREC-Covid has ~500 positives per query).

This means including a relevance dictionary as a parameter too.
I can try bringing some of my changes to this PR if you think it is a meaningful use case as well, @tomaarsen @ChrisGeishauser

@tomaarsen (Collaborator)

Hello @ArthurCamara!
I haven't worked much with datasets that have multiple positives per query, but I can imagine that they're fairly common in IR datasets indeed (please correct me if I'm wrong). In your branch, what would a normal relevance_dataset look like? A dataset with ["query-id", "corpus-id", "score"] columns? Also, if I'm looking at it right, then your function requires that both the queries_dataset and corpus_dataset have id columns right?

Beyond your relevance-related changes, I like the chunk_size option: I can imagine that you could OOM with model.encode otherwise. As for keep_title, perhaps that map should only trigger if the dataset has a title column.

I think your variant is certainly more powerful, but mine_hard_negatives from this PR is also a lot simpler to use in terms of data preprocessing. Ideally, I'd like to keep the option for the simple usage, so I think it might be best to support 2 functions. To avoid duplication, it might be possible to rewrite mine_hard_negatives to use mine_hard_negatives_from_corpus (or mine_hard_negatives_from_relevance?) under the hood. One tricky component is that mine_hard_negatives_from_corpus must then support all behavior from mine_hard_negatives, which would require some extra work (e.g. a non-FAISS implementation, and perhaps the corpus deduplication that this PR adds).

Curious about both of your stances on this @ChrisGeishauser @ArthurCamara

  • Tom Aarsen

@ArthurCamara (Contributor) · Jul 12, 2024

Hello @ArthurCamara! I haven't worked much with datasets that have multiple positives per query, but I can imagine that they're fairly common in IR datasets indeed (please correct me if I'm wrong).

They are somewhat common, but I'm mostly coming from the MTEB/BEIR side of things, where some of the datasets may have multiple positives with different scores (hence the positive_score_threshold argument).

In your branch, what would a normal relevance_dataset look like? A dataset with ["query-id", "corpus-id", "score"] columns? Also, if I'm looking at it right, then your function requires that both the queries_dataset and corpus_dataset have id columns right?

You can look at the trec-covid from MTEB for an example. The default subset is a dataset with three columns: query-id, corpus-id and scores. This is also why we need ids for the queries and documents, so we can map the relevance back into the texts.
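
For readers unfamiliar with that layout, a tiny qrels-style example (the ids and scores are made up; the column names follow the ["query-id", "corpus-id", "score"] convention mentioned above):

    from datasets import Dataset

    relevance_dataset = Dataset.from_dict({
        "query-id": ["q1", "q1", "q2"],
        "corpus-id": ["d3", "d7", "d1"],
        "score": [2, 1, 1],  # graded relevance; a query can have many positives
    })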

Beyond your relevance-related changes, I like the chunk_size option: I can imagine that you could OOM with model.encode otherwise. As for keep_title, perhaps that map should only trigger if the dataset has a title column.

Makes sense. Easy to fix. =)

I think your variant is certainly more powerful, but mine_hard_negatives from this PR is also a lot simpler to use in terms of data preprocessing. Ideally, I'd like to keep the option for the simple usage, so I think it might be best to support 2 functions. To avoid duplication, it might be possible to rewrite mine_hard_negatives to use mine_hard_negatives_from_corpus (or mine_hard_negatives_from_relevance?) under the hood. One tricky component is that mine_hard_negatives_from_corpus must then support all behavior from mine_hard_negatives, which would require some extra work (e.g. a non-FAISS implementation, and perhaps the corpus deduplication that this PR adds).

Curious about both of your stances on this @ChrisGeishauser @ArthurCamara

  • Tom Aarsen

I agree that both settings should be supported. I also don't like the idea that mine_hard_negatives_from_corpus needs such specific Dataset formats. It's probably better to support passing a list of texts for the corpus and a mapping for the scores. Under the hood this can be transformed into whatever format mine_hard_negatives_from_corpus ends up using, including creating some fake document ids.

The non-FAISS option is something I was planning on doing today. I was a bit sick the last couple of days and finished this way too late at night.

I can do some work on this PR later today with some of the changes and then we can see what it looks like.

@ArthurCamara (Contributor)

Hey @tomaarsen and @ChrisGeishauser, I've merged some of my changes into PR #2848.
It allows both passing a separate corpus and having multiple positives per query, using the same interface as before. Let me know if you prefer me to merge my changes into the branch instead.

@ChrisGeishauser (Author)

Hi @ArthurCamara and @tomaarsen! Thanks a lot for keeping me in the loop, and sorry for the late reply, I had quite a busy week :)
@ArthurCamara, thanks for suggesting these advancements; the option to have multiple positives would definitely round out this function! I haven't had time yet to look at the code in detail, but I can probably do that next Tuesday (if necessary).

Just one question regarding Tom's suggestion to have two functions:

I think your variant is certainly more powerful, but mine_hard_negatives from this PR is also a lot simpler to use in terms of data preprocessing. Ideally, I'd like to keep the option for the simple usage, so I think it might be best to support 2 functions.

Is this still relevant, or is everything now put into a single extended method?

@ChrisGeishauser (Author)

Hi @tomaarsen and @ArthurCamara, just wanted to ask whether there is any update on your side regarding this PR :)

@tomaarsen (Collaborator)

Apologies for the delay, I fell sick in July and have been recovering since. Much better now, luckily.
I'll be moving forward with #2848 instead, which is a bit of a superset of the changes from this PR. I do want to explicitly thank you both for assisting with this! I'm looking forward to seeing the datasets that people prepare & then train with.

  • Tom Aarsen

@tomaarsen closed this Sep 11, 2024