-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Validate max_seq_length
in SquadProcessor
#2740
Conversation
Hi @francescocastelli thank you for making this pull request! We will probably need a few days until we can have a look at it and give feedback. Stay tuned! 🙂 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hello @francescocastelli! I have a small style comment on this PR. I see you've been using the .format()
method of strings, while we generally prefer to use f-strings, mostly for readability and consistency. Could you modify the code to use f-strings instead of .format()
? Thanks 🙂
In addition, it would be great if you could add a small test for the new asserts. Reach out if you need help testing, unfortunately our test suite is not the most intuitive to deal with 😄
@ZanSara Sure! I will fix everything next week :) |
427be5e
to
6d7e3e5
Compare
@ZanSara done! I've also added a test for invalid parameters at the construction of a FARMReader |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Really nice! I will merge it soon. Thank you for the contribution! 😊
* added max_len_seq validation in SquadProcessor * fixed string formatting * added tests for invalid max_seq_len * Update Documentation & Code Style Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Related Issue(s):
#2319
Proposed changes:
What we need to validate
max_seq_len
isprocessor.tokenizer.model_max_length
. I've put the assert in theSquadProcessor
class because inFARMReader
we would have to place it in multiple different methods. Let me know if I'm missing something!Should we validate the max length also on other processors? for instance
TextSimilarityProcessor
uses two tokenizers and has bothmax_seq_len_query
andmax_seq_len_passage
.Pre-flight checklist