Support BM25 parameters customization #163

Closed
rayhsieh opened this issue Jun 19, 2022 · 6 comments · Fixed by #186

Comments

@rayhsieh

Would you consider supporting customization of the BM25 parameters? It would be very helpful for optimizing search relevance.

var k = 1.2; // Term frequency saturation point. Recommended values are between 1.2 and 2.
var b = 0.7; // Length normalization impact. Recommended values are around 0.75.
var d = 0.5; // BM25+ frequency normalization lower bound. Recommended values are between 0.5 and 1.
@lucaong
Owner

lucaong commented Jun 20, 2022

Hi @rayhsieh ,
it would definitely be possible to support customization of the BM25 parameters, and it might be offered as a feature in an upcoming release. That said, beware that tweaking these parameters tends to be tricky, and should almost never be necessary. Users needing to fine-tune results can usually rely on boosting, which has a more predictable effect.

See the relevant discussion, where @rolftimmermans correctly notes:

I agree, but to be honest most users should not attempt to tune this. The effect is fairly subtle and requires a thorough understanding of the BM25 scoring model. Having such a knob, people will probably want to tune it. While I think in practice a field boost is a much easier way to tune a custom dataset, if the underlying scoring model is solid enough (which I hope BM25 is).

That said, there is no hard technical reason why this cannot be done. It would just need to be properly documented, also explaining that these kinds of tweaks are normally discouraged. I tend to prefer keeping the API surface as small as it can be, but if there is a use case for this, it can be done.

@rayhsieh
Author

Hi @lucaong ,
I totally agree that tweaking these parameters is tricky, and in most cases not necessary. But in my experience, the default parameters might not be the best choice when dealing with a special dataset. I'm actually working on a limited Chinese dataset, and I found that raising the impact of length normalization by tweaking b does make the search experience much better than with the default settings.

@lucaong
Owner

lucaong commented Jun 21, 2022

That sounds like an interesting case. I personally would like to know more about how MiniSearch performs on non-Latin scripts, and I am very interested in hearing feedback on what could be added or changed to improve the experience in those cases. MiniSearch follows the principle of not including any language-specific utilities (such as stemming or normalization), while making it easy to plug those in whenever necessary.
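
To illustrate what "plugging in" a language-specific utility could look like, here is a minimal sketch of a tokenizer for Chinese text. It is not part of MiniSearch: the function name `tokenizeCjkBigrams` is hypothetical, and splitting runs of Han characters into overlapping bigrams is just one simple strategy, assumed here for illustration. A real project may prefer a proper word segmenter.

```javascript
// Matches either a run of Han characters, or a run of other letters/digits.
const RUN = /\p{Script=Han}+|(?:(?!\p{Script=Han})[\p{L}\p{N}])+/gu;

// Hypothetical helper (not a MiniSearch API): Han runs become overlapping
// bigrams, every other run becomes a single lowercased token.
function tokenizeCjkBigrams(text) {
  const runs = text.match(RUN) || [];
  const tokens = [];
  for (const run of runs) {
    if (!/^\p{Script=Han}/u.test(run)) {
      tokens.push(run.toLowerCase());
      continue;
    }
    const chars = Array.from(run); // split by code point, not UTF-16 unit
    if (chars.length === 1) tokens.push(run);
    for (let i = 0; i + 1 < chars.length; i++) {
      tokens.push(chars[i] + chars[i + 1]);
    }
  }
  return tokens;
}

console.log(tokenizeCjkBigrams("搜索引擎 MiniSearch"));
```

A function with this shape (string in, array of strings out) is what one would pass as the `tokenize` option when constructing the index. Whether bigrams give good relevance on a given dataset is something to measure rather than assume.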

I think I can soon prepare an update that makes BM25 parameters configurable, possibly initially as a beta feature.

@rolftimmermans
Contributor

rolftimmermans commented Jun 21, 2022

Here are some of my notes that may help in documenting the parameters, if they're exposed. May need a bit of a rewrite :)

This article is also helpful for understanding k and b: https://opensourceconnections.com/blog/2015/10/16/bm25-the-next-generation-of-lucene-relevation/

k is the BM25 term frequency saturation point.

  • A higher term frequency means that a document has higher relevance, but BM25 makes sure that the increase in relevance quickly diminishes. Mental model: a document with 3 occurrences of a term is a better match than a document with 1 occurrence, but it's not 3x better.
  • Higher values increase the relevance difference between documents with higher/lower term frequencies.
  • Lower values reduce the relevance difference between documents with higher/lower term frequencies.
  • Default is 1.2.
  • Recommended values are between 1.2 and 2.
  • Setting this to 0 or a negative number is invalid (could be validated automatically?).
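
As a rough illustration of the saturation described above, here is a minimal sketch of the term-frequency part of the standard BM25 formula, with length normalization left out (as if b were 0). The function name `saturation` is made up for this example and is not a MiniSearch API.

```javascript
// Term-frequency part of BM25, ignoring field length (as if b = 0).
// Hypothetical helper for illustration only.
function saturation(termFreq, k) {
  if (!(k > 0)) throw new RangeError("k must be greater than 0");
  return (termFreq * (k + 1)) / (termFreq + k);
}

// How much better do 3 occurrences score than 1, for two values of k?
for (const k of [1.2, 2]) {
  const ratio = saturation(3, k) / saturation(1, k);
  console.log(`k=${k}: ratio ${ratio.toFixed(2)}`);
}
```

With k = 1.2 the ratio comes out at about 1.57, and with k = 2 at 1.8. So three occurrences are better than one, but nowhere near 3x better, and a higher k widens the gap.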

b is the BM25 length normalization impact.

  • A document with a longer field length needs to have a slightly higher term frequency to achieve the same relevance as a document with a shorter field length.
  • Higher values increase the weight that field length has on scoring.
  • Lower values decrease the weight that field length has on scoring.
  • Setting this to 0 disables the field length having an effect on scoring altogether (not recommended).
  • Default value is 0.7.
  • Recommended values are around 0.75.
  • Setting this to negative values is invalid (could be validated automatically?).
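
To make the effect of b concrete, here is a small sketch of the length normalization factor from the standard BM25 formula and how it feeds into the term-frequency component. The function names and the sample field lengths are assumptions made for this example, not MiniSearch internals.

```javascript
// Length normalization factor: 1 for an average-length field,
// above 1 for longer fields, below 1 for shorter ones. With b = 0 it is always 1.
function lengthNormalization(fieldLength, avgFieldLength, b) {
  if (!(b >= 0)) throw new RangeError("b must not be negative");
  return 1 - b + b * (fieldLength / avgFieldLength);
}

// Term-frequency component of BM25 including length normalization.
function termFrequencyComponent(termFreq, fieldLength, avgFieldLength, k, b) {
  const norm = lengthNormalization(fieldLength, avgFieldLength, b);
  return (termFreq * (k + 1)) / (termFreq + k * norm);
}

// Same term frequency (2), average field length 100, short vs. long field.
for (const b of [0, 0.7, 1]) {
  const short = termFrequencyComponent(2, 50, 100, 1.2, b);
  const long = termFrequencyComponent(2, 200, 100, 1.2, b);
  console.log(`b=${b}: short=${short.toFixed(3)} long=${long.toFixed(3)}`);
}
```

With b = 0 both fields get the same value. As b grows, the shorter field pulls ahead of the longer one for the same term frequency.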

d (actually δ) is the BM25+ frequency normalization lower bound.

  • Addresses a deficiency in BM25. Long fields which do match the query term are scored unfairly by BM25.
  • Increasing this parameter increases the minimum relevance of one occurrence of a search term regardless of its (very long) field length.
  • Decreasing this parameter effectively penalises long fields with only one or a few term occurrences.
  • Default value is 0.5.
  • Recommended values are between 0.5 and 1.0.
  • Setting this to 0 disables this feature (not recommended).
  • Setting this to a negative number is invalid (could be validated automatically?).
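
Putting the three parameters together, here is a minimal sketch of a per-term BM25+ score with the automatic validation suggested in these notes. It follows the textbook BM25+ formula with a commonly used IDF variant. The function names, the shape of the `stats` object, and the sample numbers are assumptions for illustration, and this is not necessarily how MiniSearch implements it.

```javascript
// Defaults as quoted in this thread.
const DEFAULT_BM25 = { k: 1.2, b: 0.7, d: 0.5 };

// Hypothetical validator: k must be positive, b and d must not be negative.
function validateBm25Params({ k, b, d }) {
  if (!(k > 0)) throw new RangeError("k must be greater than 0");
  if (!(b >= 0)) throw new RangeError("b must not be negative");
  if (!(d >= 0)) throw new RangeError("d must not be negative");
}

// BM25+ score of one term in one field of one document.
function bm25PlusTermScore(stats, params = DEFAULT_BM25) {
  validateBm25Params(params);
  const { termFreq, fieldLength, avgFieldLength, matchingDocs, totalDocs } = stats;
  const { k, b, d } = params;
  // Rarer terms get a higher inverse document frequency.
  const idf = Math.log(1 + (totalDocs - matchingDocs + 0.5) / (matchingDocs + 0.5));
  const lengthNorm = 1 - b + b * (fieldLength / avgFieldLength);
  // d is added to every matching term, so it acts as a lower bound.
  return idf * (d + (termFreq * (k + 1)) / (termFreq + k * lengthNorm));
}

// One occurrence in a field 100x longer than average.
const veryLong = {
  termFreq: 1, fieldLength: 10000, avgFieldLength: 100,
  matchingDocs: 10, totalDocs: 1000,
};
console.log("d=0:  ", bm25PlusTermScore(veryLong, { k: 1.2, b: 0.7, d: 0 }));
console.log("d=0.5:", bm25PlusTermScore(veryLong, DEFAULT_BM25));
```

With d = 0 the single match in the very long field scores close to zero, which is the deficiency BM25+ addresses. With d = 0.5 the match keeps a floor of 0.5 × IDF regardless of field length.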

@rayhsieh
Author

@lucaong Thank you for planning to add this. While working on search for my dataset, I found it very easy to add a language-specific tokenizer. I don't have other recommendations at this point, since MiniSearch already meets my needs.

@lucaong
Owner

lucaong commented Nov 23, 2022

@rayhsieh this feature is now released along with other features as part of v6.0.0-beta.1 (soon to be released as v6.0.0).
