-
Notifications
You must be signed in to change notification settings - Fork 2.3k
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks ok to me, though I would probably make the difference in fields explicit with an option.
fields["premise"] = TextField(premise_tokens, self._token_indexers) | ||
fields["hypothesis"] = TextField(hypothesis_tokens, self._token_indexers) | ||
|
||
if isinstance(self._tokenizer, PretrainedTransformerTokenizer): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Slight preference for having this be a configuration option instead of relying on this type check.
9af93b7
to
e75f48e
Compare
@@ -29,14 +29,29 @@ class SnliReader(DatasetReader): | |||
We use this `Tokenizer` for both the premise and the hypothesis. See :class:`Tokenizer`. | |||
token_indexers : `Dict[str, TokenIndexer]`, optional (default=`{"tokens": SingleIdTokenIndexer()}`) | |||
We similarly use this for both the premise and the hypothesis. See :class:`TokenIndexer`. | |||
tokenize_separately : `bool`, optional |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like having the default be a best guess like this, though the name is a bit confusing to me. Maybe combine_fields
or use_single_field
, or something, with the default being False
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The issue is this isn't necessarily about tokenization, it's more about how the data is represented.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean the default being None
? Because if it's False
we can no longer take a guess.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, sorry, I guess I just meant negating the logic in the guess.
@@ -29,14 +29,27 @@ class SnliReader(DatasetReader): | |||
We use this `Tokenizer` for both the premise and the hypothesis. See :class:`Tokenizer`. | |||
token_indexers : `Dict[str, TokenIndexer]`, optional (default=`{"tokens": SingleIdTokenIndexer()}`) | |||
We similarly use this for both the premise and the hypothesis. See :class:`TokenIndexer`. | |||
combine_input_fields : `bool`, optional | |||
(default=`not isinstance(tokenizer, PretrainedTransformerTokenizer)`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line is now wrong.
(default=`not isinstance(tokenizer, PretrainedTransformerTokenizer)`) | |
(default=`isinstance(tokenizer, PretrainedTransformerTokenizer)`) |
After this, looks great!
}, | ||
"dropout": 0.1 | ||
}, | ||
"iterator": { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry I didn't notice this earlier - this needs to be updated to match #3700. We can probably also just remove the sorting_keys
key here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, thanks for the catch. Will update this (and a new coref config as well).
"embedding_dim": transformer_dim, | ||
"cls_is_last_token": cls_is_last_token | ||
}, | ||
"feedforward": { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was just checking Anthony's results versus the numbers you're reporting. You're about 1.5 points below Anthony's MNLI result, and one point below the published number. I looked at Anthony's code, and it looks like the only difference is that he doesn't have this feedforward layer in there. What happens if you remove this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
RoBERTa actually has this feedforward layer (using the same model as BERT) /~https://github.com/huggingface/transformers/blob/908fa43b543cf52a3238129624f502240725a6a6/src/transformers/modeling_bert.py#L426-L438
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And I tried w/ vs. w/o this on SST -- didn't make a difference.
"cut_frac": 0.06 | ||
}, | ||
"optimizer": { | ||
"type": "huggingface_adamw", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is another difference with Anthony's code, though I'm not sure how much difference it makes: /~https://github.com/anthonywchen/generative-nli/blob/b81d4e3e7f635fcc1e5ca09305707e5df983ec88/configs/roberta_large_config.json#L37-L43
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Weight decay is different but I took the value from the RoBERTa paper (https://arxiv.org/pdf/1907.11692.pdf, last page last table last column). I doubt 1e-5 vs. 2e-5 LR is making 1.5 difference in F1? Anthony also didn't regularize bias which is probably the right thing to do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah in experiments I tried, the differences in different learning rates are basically negligible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One other difference is in the beta parameters in the adamw optimizer. I set mine to be [0.9, 0.98], which is recommended on their examples.
/~https://github.com/pytorch/fairseq/blob/master/examples/roberta/README.glue.md
SNLI: dev 92.47, test 91.65, URL https://storage.googleapis.com/allennlp-public-models/snli-roberta-large-2020.02.27.tar.gz
MNLI: dev-matched 89.10, dev-mismatched 88.71, URL https://storage.googleapis.com/allennlp-public-models/mnli-roberta-large-2020.02.27.tar.gz