Update categories and domains for tasks 029 - 058 #459

Merged
merged 23 commits into from
Oct 22, 2021

Conversation

garyhlai
Contributor

@garyhlai garyhlai commented Oct 19, 2021

Update categories and domains for tasks 029 - 058

@garyhlai
Contributor Author

@danyaljj could you link me to the dataset paper for task 043 and task 044? I can't seem to find a dataset named "essential" or "essential terms" anywhere.

@danyaljj
Contributor

danyaljj commented Oct 19, 2021

Oh good point. The data was extracted from the experiments in this paper:
https://aclanthology.org/K17-1010/

Please add the pointer to the task files.

@danyaljj danyaljj requested a review from swarooprm October 19, 2021 03:11
@swarooprm
Contributor

swarooprm commented Oct 19, 2021

Looks good. Some comments:

  • task031 and task032: these also implicitly involve answer generation along with question generation; wondering if we should have the category "Question Answering -> Contextual Question Answering -> Extractive", which you had for task033
  • task036-042 (QASC tasks)- I don't think "Reasoning -> Multihop Reasoning" should be part of categories. QASC dataset is multihop, but most (all?) tasks here are not multihop.
  • task 039- I am wondering if we should drop verification or make it a new category? I initially kept it as verification tasks because this step is typically used to verify if the generated data serves the intended purpose (multihop reasoning) or not. But, now I am wondering if we can make the "verification" category more granular.

@danyaljj
Contributor

danyaljj commented Oct 19, 2021

  • task 039- I am wondering if we should drop verification or make it a new category? I initially kept it as verification tasks because this step is typically used to verify if the generated data serves the intended purpose (multihop reasoning) or not. But, now I am wondering if we can make the "verification" category more granular.

Yeah, agreed that we should do something about it. There is a discussion on this here: #443 (comment)

@garyhlai
Contributor Author

garyhlai commented Oct 19, 2021

Looks good. Some comments:

  • task031 and task032: these also implicitly involve answer generation along with question generation; wondering if we should have the category "Question Answering -> Contextual Question Answering -> Extractive", which you had for task033

I don't think we should have "Question Answering -> Contextual Question Answering -> Extractive" for task031 and task032 because the output is a question, not an answer, unlike task033 (output is an answer).

  • task036-042 (QASC tasks)- I don't think "Reasoning -> Multihop Reasoning" should be part of categories. QASC dataset is multihop, but most (all?) tasks here are not multihop.

While most tasks themselves are not multihop per se, they require multihop reasoning ability to carry out.

  • task 039- I am wondering if we should drop verification or make it a new category? I initially kept it as verification tasks because this step is typically used to verify if the generated data serves the intended purpose (multihop reasoning) or not. But, now I am wondering if we can make the "verification" category more granular.

I agree that "Verification" should be made more granular, but for task 039, I think "Verification" should just be removed (I don't see how this is "verification") -- I only kept it there because I thought there was a good reason that you guys put it there. I will keep annotating the remaining tasks, keeping these discussions in mind, and massage subcategories of "Verification" accordingly.

@garyhlai
Contributor Author

@danyaljj @swarooprm could you also link me to the dataset paper for task 045 - 047? (Dataset "miscellaneous")

@danyaljj
Contributor

task046_miscellaneous_question_typing is based on this work: https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.84.9681&rep=rep1&type=pdf

The other two are not tied to any particular works, unfortunately.

@garyhlai
Contributor Author

What are their domains? Or do I need to look at the instances to guess?

@garyhlai
Contributor Author

@swarooprm @danyaljj I updated "Verification" categories in 6445613 based on discussion #443. Some of the multirc tasks fit under these new "Verification" categories. Let me know what you think.

@swarooprm
Contributor

swarooprm commented Oct 19, 2021

Looks good. Some comments:

  • task031 and task032: these also implicitly involve answer generation along with question generation; wondering if we should have the category "Question Answering -> Contextual Question Answering -> Extractive", which you had for task033

I don't think we should have "Question Answering -> Contextual Question Answering -> Extractive" for task031 and task032 because the output is a question, not an answer, unlike task033 (output is an answer).

I was actually referring to task029 and task030 (instead of task031 and 032). Sorry for the typo!

  • task036-042 (QASC tasks)- I don't think "Reasoning -> Multihop Reasoning" should be part of categories. QASC dataset is multihop, but most (all?) tasks here are not multihop.

While most tasks themselves are not multihop per se, they require multihop reasoning ability to carry out.

I am not sure. E.g., task038 is about combining 2 facts in the input; can we call it multihop?

  • task 039- I am wondering if we should drop verification or make it a new category? I initially kept it as verification tasks because this step is typically used to verify if the generated data serves the intended purpose (multihop reasoning) or not. But, now I am wondering if we can make the "verification" category more granular.
    I agree that "Verification" should be made more granular, but for task 039, I think "Verification" should just be removed (I don't see how this is "verification") -- I only kept it there because I thought there was a good reason that you guys put it there. I will keep annotating the remaining tasks, keeping these discussions in mind, and massage subcategories of "Verification" accordingly.

Well, when we created the Natural Instructions v1 dataset by covering the various intermediate steps involved in a data creation process, we decided to have the 'verification' category because verification is an important part of that process. Also, we did not have enough tasks to create a more granular category that better represents task039. But now that we have lots of tasks and better task categories, I think it's fine to drop the 'verification' category from task039.

@garyhlai
Contributor Author

I was actually referring to task029 and task030 (instead of task031 and 032). Sorry for the typo!

So definitely not task030 for the same reasons I mentioned: output is a cloze-styled question. I had the same hesitation about task029 but decided to not put "Question Answering" because it is not answering a question from the input -- it is given a context word (not a question), and asked to generate questions (with answers attached to them).

  • task036-042 (QASC tasks)- I don't think "Reasoning -> Multihop Reasoning" should be part of categories. QASC dataset is multihop, but most (all?) tasks here are not multihop.

While most tasks themselves are not multihop per se, they require multihop reasoning ability to carry out.

I am not sure. E.g., task038 is about combining 2 facts in the input; can we call it multihop?

That looks like multihop to me? What would be the definition of multihop?

  • task 039- I am wondering if we should drop verification or make it a new category? I initially kept it as verification tasks because this step is typically used to verify if the generated data serves the intended purpose (multihop reasoning) or not. But, now I am wondering if we can make the "verification" category more granular.
    I agree that "Verification" should be made more granular, but for task 039, I think "Verification" should just be removed (I don't see how this is "verification") -- I only kept it there because I thought there was a good reason that you guys put it there. I will keep annotating the remaining tasks, keeping these discussions in mind, and massage subcategories of "Verification" accordingly.

Well, when we created the Natural Instructions v1 dataset by covering the various intermediate steps involved in a data creation process, we decided to have the 'verification' category because verification is an important part of that process. Also, we did not have enough tasks to create a more granular category that better represents task039. But now that we have lots of tasks and better task categories, I think it's fine to drop the 'verification' category from task039.

Sounds good to me!

@swarooprm
Contributor

I am fine if we don't add the categories for task029 and task030.
Let's stick to the definition of adding the category "Question Answering -> Contextual Question Answering -> Extractive" only if the input is a question in some form. This sounds reasonable.

Re: multihop:
@danyaljj can answer better.
The setups in multihop datasets like MultiRC, HotpotQA, and DROP are very different from what we have here in task038. We can have "multihop" in the categories; we just need to define it clearly and apply it consistently across all tasks.

@danyaljj
Contributor

danyaljj commented Oct 20, 2021

I am actually not sure what the discussion is about. But my understanding is that:

  • I define Reasoning -> Multihop Reasoning as the ability to make conclusions by "hopping" over multiple pieces of knowledge (similar, but not limited, to deductive reasoning). Note that this definition does NOT specify QA anywhere. Perhaps we should mention this in the task-hierarchy if we agree with it.
  • Therefore, answering questions in datasets such as MultiRC, HotPotQA, and possibly DROP should be labeled as Reasoning -> Multihop Reasoning (in addition to whatever applicable category related to QA).
  • According to the above definition, Reasoning -> Multihop Reasoning is applicable to non-QA tasks, as long as it fits the definition of "multi-hop-ness" (i.e., making conclusions through digesting multiple sentences).

Let me know if this helps.
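
As a rough illustration of the point above (a reasoning label can sit alongside whatever QA label applies), here is a minimal sketch that groups a task's category labels by their top-level branch. The helper names `split_label` and `group_by_branch` and the example label list are assumptions made for illustration; they are not part of the repository.

```python
from collections import defaultdict
from typing import Dict, List


def split_label(label: str) -> List[str]:
    """Split 'A -> B -> C' into ['A', 'B', 'C'] (hypothetical helper)."""
    return [part.strip() for part in label.split("->")]


def group_by_branch(categories: List[str]) -> Dict[str, List[str]]:
    """Group labels by the first level of the hierarchy they belong to."""
    grouped = defaultdict(list)
    for label in categories:
        grouped[split_label(label)[0]].append(label)
    return dict(grouped)


# Illustrative labels only: a QA label plus a reasoning label on one task.
example_categories = [
    "Question Answering -> Contextual Question Answering",
    "Reasoning -> Multihop Reasoning",
]
branches = group_by_branch(example_categories)
```

Under this reading, a non-QA task could carry `Reasoning -> Multihop Reasoning` on its own, since nothing in the label path refers to QA.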

@garyhlai
Contributor Author

Thanks @danyaljj I think we have the same understanding. I also updated the rest of the tasks in the PR (it's no longer WIP) now.

@swarooprm What do you think? Can this be merged now?

@garyhlai garyhlai changed the title from "WIP: Update categories and domains for tasks 029 - 058" to "Update categories and domains for tasks 029 - 058" Oct 20, 2021
@swarooprm
Contributor

"Scientific reasoning" is a good addition. "Text Comparison" is the appropriate category for task039.
Some comments:

  • task036: Drop "Reasoning -> Multihop Reasoning" (Input is a single sentence)
  • task 045: Drop "Question Answering"
  • task 046: Drop "Question Answering -> Contextual Question Answering" and "Reasoning -> Qualitative Reasoning" (Qualitative reasoning example https://arxiv.org/pdf/1811.08048.pdf)
  • task 054: The answer is not always extractive from the passage (e.g. the 1st example of positive examples), so modify the category "Question Answering -> Contextual Question Answering -> Extractive" to "Question Answering -> Contextual Question Answering"
  • task 055: Also add "Reasoning -> Multihop Reasoning" (multihop reasoning is necessary to make sure that the generated answer is not another correct answer).
  • task 056: For the same reason as task054, the category should be changed from "Question Answering -> Contextual Question Answering -> Extractive" to "Question Answering -> Contextual Question Answering". Also we need to have "Reasoning -> Multihop Reasoning" here
  • task 057: Also add "Reasoning -> Multihop Reasoning" and "Reasoning -> Commonsense Reasoning" (since we have this one in task055)
  • task058: Add "Reasoning -> Multihop Reasoning" and change "Question Answering -> Contextual Question Answering -> Extractive" to "Question Answering -> Contextual Question Answering" (for same reasons as task 054 and 056)

The task hierarchy will help design our experiments and will be a very sensitive component in our paper, so I think the discussions at this point will be very helpful later :)
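
The kinds of changes requested in the list above (drop a label, add a label, swap a label for its parent) can be sketched as a small function over a list of category strings. This is only an illustration: the `revise` helper is hypothetical, and the starting list for task058 is an assumed state, not the actual contents of the task file.

```python
from typing import Dict, Iterable, List, Optional

EXTRACTIVE = "Question Answering -> Contextual Question Answering -> Extractive"
CONTEXTUAL = "Question Answering -> Contextual Question Answering"
MULTIHOP = "Reasoning -> Multihop Reasoning"


def revise(
    categories: Iterable[str],
    drop: Iterable[str] = (),
    add: Iterable[str] = (),
    rename: Optional[Dict[str, str]] = None,
) -> List[str]:
    """Return a new label list with drops, renames, and additions applied."""
    dropped = set(drop)
    renames = rename or {}
    result: List[str] = []
    for label in categories:
        if label in dropped:
            continue
        label = renames.get(label, label)
        if label not in result:  # avoid duplicates after renaming
            result.append(label)
    for label in add:
        if label not in result:
            result.append(label)
    return result


# Assumed starting state for task058, for illustration only.
before = [EXTRACTIVE]
after = revise(before, add=[MULTIHOP], rename={EXTRACTIVE: CONTEXTUAL})
```

Keeping the edit as a pure function makes it easy to review the before and after lists side by side before touching any task file.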

@@ -65,6 +76,7 @@
- `Reasoning -> Reasoning with Symbols`: Tasks where symbols represent various things e.g. if X is the number of apples in the freeze today morning and Y is the number remaining after I ate a few apples, X-Y is the number of apples I ate.
- `Reasoning -> Spatial Reasoning`
- `Reasoning -> Temporal Reasoning`
- `Reasoning -> Scientific Reasoning`
Contributor

Not sure what this means. Could you define it?

Contributor Author

The type of reasoning required to be able to answer questions related to science

Contributor

Right, but I am still not sure what that means.
Are you sure this is not a "domain" (rather than a reasoning type)?

Contributor Author

So in order to answer a question like "Question: A student riding a bicycle observes that it moves faster on a smooth road than on a rough road. This happens because the smooth road has (A) less gravity(B) more gravity(C) less friction(D) more friction?", what do you think would be the type of reasoning, if not scientific?

Contributor Author

Ok, talked with @swarooprm and decided that this would be Reasoning -> Qualitative Reasoning. Science would be a domain like you said (it's already included). I've updated everything and this PR should be good to go. @danyaljj what do you think?

@garyhlai
Contributor Author

garyhlai commented Oct 21, 2021

"Scientific reasoning" is a good addition. "Text Comparison" is the appropriate category for task039. Some comments:

  • task036: Drop "Reasoning -> Multihop Reasoning" (Input is a single sentence)
  • task 045: Drop "Question Answering"
  • task 046: Drop "Question Answering -> Contextual Question Answering" and "Reasoning -> Qualitative Reasoning" (Qualitative reasoning example https://arxiv.org/pdf/1811.08048.pdf)
  • task 054: The answer is not always extractive from the passage (e.g. the 1st example of positive examples), so modify the category "Question Answering -> Contextual Question Answering -> Extractive" to "Question Answering -> Contextual Question Answering"
  • task 055: Also add "Reasoning -> Multihop Reasoning" (multihop reasoning is necessary to make sure that the generated answer is not another correct answer).
  • task 056: For the same reason as task054, the category should be changed from "Question Answering -> Contextual Question Answering -> Extractive" to "Question Answering -> Contextual Question Answering". Also we need to have "Reasoning -> Multihop Reasoning" here
  • task 057: Also add "Reasoning -> Multihop Reasoning" and "Reasoning -> Commonsense Reasoning" (since we have this one in task055)
  • task058: Add "Reasoning -> Multihop Reasoning" and change "Question Answering -> Contextual Question Answering -> Extractive" to "Question Answering -> Contextual Question Answering" (for same reasons as task 054 and 056)

Done.

The task hierarchy will help design our experiments and will be a very sensitive component in our paper, so I think the discussions at this point will be very helpful later :)

Yeah for sure @swarooprm. I guess it's a bit hard for me to determine whether to put down a category sometimes without having gone through the modeling stage. In particular, I'm not sure how we will use the multiple categories for each task and the category hierarchy (as I understand it, for the first paper, each task only had one category and there was no "category hierarchy", so the train-valid split seemed more straightforward). More discussions about how these new categories & hierarchy will be used would be really helpful in carving out some heuristics for the annotation process.

- `Classification -> Verification -> Sufficient Information Verification`: Verify whether a text contains sufficient information to answer a question
- `Classification -> Verification -> Grammar Verification`: Verify whether a text is grammatical
- `Classification -> Verification -> Relevance Verification`
- `Classification -> Verification -> Answer Verification`: Verify whether a text answers the question
Contributor

nice!

@swarooprm
Contributor

swarooprm commented Oct 21, 2021

@ghlai9665 Very good question. Will you be available sometime to chat? @danyaljj also can join if he is available during that time. I think it will be easier to discuss over chat than here.

@garyhlai
Contributor Author

Sure! Anytime after 5 pm California time works for me. I'm also wide open this weekend. Let me know if neither of those works.

@swarooprm
Contributor

Let's chat at 5:30 pm PST today. I will send you an invite.

@garyhlai
Contributor Author

garyhlai commented Oct 21, 2021 via email

@swarooprm
Contributor

Looks good!
Final comment:

  • Drop "qualitative reasoning" from task044 and 045 (as those tasks can be solved without qualitative reasoning).

From my side, this PR is good to go after the minor fix above @danyaljj

@danyaljj danyaljj requested a review from yeganehkordi October 22, 2021 15:46
Daniel Khashabi added 3 commits October 22, 2021 10:19
# Conflicts:
#	tasks/task044_essential_terms_identifying_essential_words.json
#	tasks/task045_miscellaneous_sentence_paraphrasing.json
#	tasks/task046_miscellaneous_question_typing.json
#	tasks/task048_multirc_question_generation.json
#	tasks/task049_multirc_questions_needed_to_answer.json
#	tasks/task054_multirc_write_correct_answer.json
#	tasks/task055_multirc_write_incorrect_answer.json
#	tasks/task056_multirc_classify_correct_answer.json
#	tasks/task057_multirc_classify_incorrect_answer.json
#	tasks/task058_multirc_question_answering.json
@danyaljj
Contributor

Looks great! Thanks to both of you!
I resolved a pretty large merge conflict. Please double-check the PR, just in case I have missed anything.
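
For anyone double-checking the files after a large conflict resolution, one possible sanity check is to look for leftover Git conflict markers and confirm that each task file still parses as JSON. The sketch below works on file contents passed in as strings; the function names are hypothetical and this is not a check the repository is known to run.

```python
import json
from typing import List

# Line prefixes Git writes around unresolved conflict regions.
CONFLICT_PREFIXES = ("<<<<<<<", "=======", ">>>>>>>")


def conflict_marker_lines(text: str) -> List[int]:
    """Return 1-based line numbers that start with a conflict marker."""
    return [
        number
        for number, line in enumerate(text.splitlines(), start=1)
        if line.startswith(CONFLICT_PREFIXES)
    ]


def check_task_text(text: str) -> List[str]:
    """Collect human-readable problems found in one task file's contents."""
    problems: List[str] = []
    marker_lines = conflict_marker_lines(text)
    if marker_lines:
        problems.append(f"conflict markers on lines {marker_lines}")
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        problems.append(f"invalid JSON: {err.msg}")
    return problems
```

To use it, read each file listed under "# Conflicts:" and pass its contents to `check_task_text`; an empty list means neither problem was detected.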

@danyaljj danyaljj merged commit f955fb2 into allenai:master Oct 22, 2021
3 participants