Run checklist suites in AllenNLP #5065
Conversation
I have not looked at the tests yet, but here is a first review. It seems like there is a lot of copy-and-paste code that's not following the guidelines (as opposed to enforced rules) of AllenNLP code quality, like missing type annotations, etc.?
if capabilities:
    self._print_summary_args["capabilities"] = capabilities
Do you want this to print when capabilities == []?
_print_summary_args are just passed to the summary function. When capabilities == [], the summaries for tests of all capabilities are printed.
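A minimal sketch of the forwarding behavior described here, assuming a summary function in the style of the checklist library's suite summary (the summary body below is a simplified stand-in, not the real implementation):

    from typing import Dict, List, Optional

    def summary(capabilities: Optional[List[str]] = None) -> None:
        # Simplified stand-in: an empty or missing filter means "all capabilities".
        all_capabilities = ["Vocabulary", "Robustness", "Fairness"]
        for cap in capabilities or all_capabilities:
            print(f"Summary for capability: {cap}")

    _print_summary_args: Dict[str, List[str]] = {}
    capabilities: List[str] = []
    if capabilities:
        _print_summary_args["capabilities"] = capabilities

    # With capabilities == [], no filter is stored, so summaries for
    # tests of all capabilities are printed.
    summary(**_print_summary_args)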
ret = []
if s != data:
    ret.append(s)
if s + "." != data:
    ret.append(s + ".")
return ret
Why not just return [s] or return [s + "."]?
Here's what it's doing:

data = "This was great!"
Returns ["This was great", "This was great."]

data = "The movie was good"
Returns ["The movie was good."]
I'm having trouble getting this to work with a QA model. I tried running

allennlp checklist \
    https://storage.googleapis.com/allennlp-public-models/transformer-qa-2020-10-03.tar.gz \
    question-answering

but got no output (no errors, either). I also tried the above with the --checklist-suite parameter pointed to a downloaded and extracted version of /~https://github.com/marcotcr/checklist/blob/master/release_suites/squad_suite.tar.gz, but I got errors from dill:
allennlp/commands/checklist.py

subparser.add_argument("task", type=str, help="the name of the task suite")
subparser.add_argument("--checklist-suite", type=str, help="the checklist suite path")
Where do we find these? Is this /~https://github.com/marcotcr/checklist/tree/master/release_suites?
We will add our default suites to the cloud, and people can also write their own suites.
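For example, pointing the flag at a local suite would look something like this (the local path below is a hypothetical placeholder; the flag and model archive are the ones shown above):

    allennlp checklist \
        https://storage.googleapis.com/allennlp-public-models/transformer-qa-2020-10-03.tar.gz \
        question-answering \
        --checklist-suite /path/to/extracted/squad_suite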
@epwalsh That was because the tests were not being added by default. I've now switched the flag, so you should be able to run the command now. (It will only run 2 tests right now; I didn't want this PR to have even more lines of code.)
Just a few "cosmetic" comments. I'm working my way through some examples next, so I'll follow up in a bit with another review.
This is really cool!
The only other comment I have is about the format of the summary. It's just a little difficult to tell "capabilities" apart from specific test names. I guess if you know what you're looking at, it may be obvious; I just had a little trouble the first time I looked at it.
If it's difficult to override the default summary, I wouldn't worry about it.
@epwalsh Thanks for reviewing a mountain of code! I've added custom formatting functions now. Here's a preview of what that looks like (task-specific):
Wow, that looks really great! Much easier to read!
I'm just curious: is there a built-in way to get a structured version of the output, like in JSON format, for example? That would be a cool feature to have, especially for people doing continuous deployment of models, because they could programmatically run the checklist suites and get a pass/fail result.
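A rough sketch of what a structured export could look like; nothing below is an existing checklist or AllenNLP API, and the `results` mapping and field names are hypothetical stand-ins for whatever per-test statistics the suite collects:

    import json

    def results_to_json(results: dict, output_file: str) -> None:
        # `results` is assumed to map test name -> number of failing cases.
        report = {
            name: {"failures": failures, "passed": failures == 0}
            for name, failures in results.items()
        }
        with open(output_file, "w") as f:
            json.dump(report, f, indent=2)

    # A CI job could then gate deployment on every entry having "passed": true.
    results_to_json({"add typos": 3, "negation": 0}, "checklist_report.json")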
Squashed commit message:

* run checklist suites from command line
* specify output file
* separate task from checklist suite
* qa task
* adding describe, misc updates
* fix docs, TE suite
* update changelog
* bug fix
* adding default tests
* qa defaults
* typing, docs, minor updates
* more updates
* set add_default_tests to True
* remove commented lines
* capitalizing help strings
* does this work
* adding start_method to test
* skipping test
* oops, actually fix
* temp fix to check memory issues
* Skip more memory hungry tests
* fix
* fixing professions
* Update setup.py
* Update CHANGELOG.md
* Update allennlp/sanity_checks/task_checklists/task_suite.py
* formatting functions

Co-authored-by: Pete <petew@allenai.org>
Co-authored-by: Evan Pete Walsh <petew@allenai.org>
* Suites for 3 tasks (pkl files and scripts)
* Run suites for models
Note: At first, it may seem like a lot of this functionality could simply be part of the Predictor class. But predictors can be more generic: e.g., a text-classification predictor can be used for sentiment analysis as well as textual entailment. The tests in the task suites, however, are meant to be task-oriented and cannot be used interchangeably.
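A hypothetical illustration of that distinction (the names below are made up, not the actual AllenNLP registry): the same generic predictor can back several tasks, while each task gets its own suite of tests.

    # One predictor type serves two tasks...
    PREDICTOR_FOR_TASK = {
        "sentiment-analysis": "text-classification-predictor",
        "textual-entailment": "text-classification-predictor",
    }

    # ...but the behavioral tests are task-specific and not interchangeable.
    SUITE_FOR_TASK = {
        "sentiment-analysis": "SentimentAnalysisSuite",
        "textual-entailment": "TextualEntailmentSuite",
    }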