-
Notifications
You must be signed in to change notification settings - Fork 2.3k
Multilingual parser and Cross-lingual ELMo #2628
Multilingual parser and Cross-lingual ELMo #2628
Conversation
* alignment addition to elmo embedder * dataset reader for multiple sources * iterator that groups samples from the same language
* Inherit unmodified functions from the biaffine parser. * Also using defaultdict for same_lang_iterator.
* Jsonnet format
The previous one caused overflow sometimes. e-10 is good enough
Hi @TalSchuster, thanks for the PR! It looks like this could be a nice contribution. Any particular reason you closed it? |
Hi @matt-gardner, thanks for the feedback! Sorry for that, the first PR here was by mistake (instead of the fork) and then planned to get back to it to add some description but got a bit busy. Will add it now. |
Ok, great, let us know when you think this is ready for a look (I'm gathering from your comment that you're still working on something). |
It is ready for review now. |
Ok, great, we'll get on this (I'm totally booked tomorrow, and then I'm out next week, though - can someone else take this PR?). Let us know if you have questions about any of the pylint or other stuff you see. |
I did some formatting that solved most of the comments but there are still some things that pylint is not happy with, like changing the signature of a function (passing a dictionary instead of a string). I understand that it is not following the OOP conventions but I think that it should be a valid pythonic way (but I'm not an expert in this). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alright, I made a first pass. I had a few important high-level questions that need to be figured out.
allennlp/data/dataset_readers/universal_dependencies_multilang.py
Outdated
Show resolved
Hide resolved
allennlp/data/dataset_readers/universal_dependencies_multilang.py
Outdated
Show resolved
Hide resolved
allennlp/data/dataset_readers/universal_dependencies_multilang.py
Outdated
Show resolved
Hide resolved
allennlp/data/dataset_readers/universal_dependencies_multilang.py
Outdated
Show resolved
Hide resolved
allennlp/modules/token_embedders/elmo_token_embedder_multilang.py
Outdated
Show resolved
Hide resolved
allennlp/modules/token_embedders/elmo_token_embedder_multilang.py
Outdated
Show resolved
Hide resolved
allennlp/modules/text_field_embedders/basic_text_field_embedder.py
Outdated
Show resolved
Hide resolved
…-1 into multilingual_parser
Thanks for the comments. I addressed them and I think that it looks a bit cleaner now. |
allennlp/modules/text_field_embedders/basic_text_field_embedder.py
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for all of your work @TalSchuster, this is looking close to ready! The main thing missing at this point is tests.
These are the classes you've added that need tests:
- the dataset reader (maybe add some fixtures with a couple of instances per language, and test that
read()
gives you the instances you expect) - the data iterator (a simple test that ensures that you're splitting by language as you expect)
- the model (this can just be a simple
ensure_model_can_train_save_and_load
- there are lots of examples of this in the existing model tests; be sure to use very small hyperparameter values in the test experiment config, so the test is fast) - the multi-lingual elmo token embedder. For this one, it'd be annoying to give a detailed test that the vectors are what you expect, and I'm not sure if you have tiny elmo files that can be loaded. The main things we want to be sure of are that (1) this works inside of a
BasicTextFieldEmbedder
when you pass alang
key, and (2) the different languages get routed to the right underlying embedders. It might be possible to stub out the underlying embedders somehow to test (2), e.g., by giving no options or weight files when you construct it, and adding fake modules afterwards.
You'll also need to add docs for each of these classes that you added. If you have questions about how to do any of this, let me know.
allennlp/data/dataset_readers/universal_dependencies_multilang.py
Outdated
Show resolved
Hide resolved
allennlp/modules/text_field_embedders/basic_text_field_embedder.py
Outdated
Show resolved
Hide resolved
allennlp/modules/token_embedders/elmo_token_embedder_multilang.py
Outdated
Show resolved
Hide resolved
allennlp/modules/token_embedders/elmo_token_embedder_multilang.py
Outdated
Show resolved
Hide resolved
Thanks for the great work over the last couple of weeks in bringing this forward! In relation to the case of using this parser with the default token embedder Even though we will need the results from the tests which Matt mentioned, it was encouraging to be able to run the parser and it seems there were some favorable results from multi-treebank training on English treebanks: I compared the default parser with the multilingual one (though I realized I used a smaller LSTM In any case, the results of this comparison on the dev sets are here if you wish to view them: https://drive.google.com/drive/u/2/folders/1NsYJ0jhM79NpG8CWNPMHxHodymsL08Sl |
Thanks for the info, @Jbar-ry! Just to be clear, the point of the tests isn't just to be sure that the code works right now - it's pretty likely that it works in its current state, otherwise it wouldn't be submitted. Instead, it's to (1) ensure that as the library evolves, we don't break this code on accident, and (2) give example expected usage of this code, in the tests. When we accept code into the library, we are implicitly accepting maintenance responsibility of that code forever, so we want to be sure that it's in a state that we can maintain it, with tests to make sure we don't break it. |
Thanks @Jbar-ry ! It's nice to see more settings that it is effective for. Thank you for sharing the results! |
No worries on the time; good luck on submissions, if that's what you're working on. |
* Added dependencies data for es, fr and it + json configuration for tests * Tests for multilingual UD reader and same-language iterator * Tests for multilingual dependency parser * fixed some of the tests - not final * use not lazy option in the parser test * better doc * test basic text field emb * pylint
@TalSchuster, is this ready to look at again? |
@matt-gardner, I think that the only missing test is the second one from the last point that you wrote (verify that the input is routed to the right encoder by the language). |
Could the pieces from #2369 be used here to avoid code duplication? @TalSchuster @matt-gardner |
ok, I've added the last test for the embedder. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just some minor readability issues to fix, and then I think this looks good to merge.
allennlp/tests/data/dataset_readers/universal_dependencies_multilingual_dataset_reader_test.py
Outdated
Show resolved
Hide resolved
allennlp/tests/data/dataset_readers/universal_dependencies_multilingual_dataset_reader_test.py
Outdated
Show resolved
Hide resolved
|
||
def test_read_from_files_first_pass_false(self): | ||
''' | ||
Note: assumes that each data file contains two trees |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You might consider increasing the count threshold to 20 or something, just to make the test more robust to any future changes. It also makes it more obvious that it's going to iterate forever.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I kept the mini ELMo to 32 but reduced all other dimensions. The test is pretty fast (few seconds)
allennlp/tests/fixtures/biaffine_dependency_parser_multilang/experiment.json
Outdated
Show resolved
Hide resolved
allennlp/tests/fixtures/elmo_multilingual/config/multilang_token_embedder.json
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks for all of your work on this!
* Multilingual version of biaffine dependency parser with elmo alignment * alignment addition to elmo embedder * dataset reader for multiple sources * iterator that groups samples from the same language * Clean multilang biaffine * Inherit unmodified functions from the biaffine parser. * Also using defaultdict for same_lang_iterator. * Multi-lang dep config example * Jsonnet format * Larger softmax value The previous one caused overflow sometimes. e-10 is good enough * formating * reorganize multilang dataset reader to work with a pathname * factoring biaffine parser to prevent duplicating code * multilangTokenEmbedder interface * fix parser factorization and pylint staff * more pylint * inspect embedder and fix params_test * make mypy happy * cr comments and doc * doc * fix doc * Multilingual tests (allenai#4) * Added dependencies data for es, fr and it + json configuration for tests * Tests for multilingual UD reader and same-language iterator * Tests for multilingual dependency parser * fixed some of the tests - not final * use not lazy option in the parser test * better doc * test basic text field emb * pylint * multilingual embedder test * cr comments * new link
Support for cross-lingual alignment of ELMo and multilingual biaffine dependency parser as described in Cross-Lingual Alignment of Contextual Word Embeddings, with Applications to Zero-shot Dependency Parsing.
Modifications: