Multilingual parser and Cross-lingual ELMo #2628

TalSchuster · 2019-03-18T20:01:56Z

Support for cross-lingual alignment of ELMo and multilingual biaffine dependency parser as described in Cross-Lingual Alignment of Contextual Word Embeddings, with Applications to Zero-shot Dependency Parsing.

Modifications:

Multilingual parser - supports inputs from different files and uses the matching embedder. Also, stores per language metrics.
Universal dependencies reader for multiple languages - alternates between the sources to read batches from each one.
same language iterator - breaks batches, if necessary, to contain only a single language.
ELMo multilang embedder - loads the parameters for each lanugage and an optional alignment matrix. On forward, applies the matching embedder by the batch language.
Configuration example

* alignment addition to elmo embedder * dataset reader for multiple sources * iterator that groups samples from the same language

* Inherit unmodified functions from the biaffine parser. * Also using defaultdict for same_lang_iterator.

* Jsonnet format

The previous one caused overflow sometimes. e-10 is good enough

matt-gardner · 2019-03-18T23:06:07Z

Hi @TalSchuster, thanks for the PR! It looks like this could be a nice contribution. Any particular reason you closed it?

TalSchuster · 2019-03-29T03:24:31Z

Hi @matt-gardner, thanks for the feedback!

Sorry for that, the first PR here was by mistake (instead of the fork) and then planned to get back to it to add some description but got a bit busy. Will add it now.

matt-gardner · 2019-03-29T03:55:31Z

Ok, great, let us know when you think this is ready for a look (I'm gathering from your comment that you're still working on something).

TalSchuster · 2019-03-29T04:04:44Z

It is ready for review now.
I only see some pylint issues that I'm less familiar with but I'll try to figure that out.

matt-gardner · 2019-03-29T04:12:46Z

Ok, great, we'll get on this (I'm totally booked tomorrow, and then I'm out next week, though - can someone else take this PR?). Let us know if you have questions about any of the pylint or other stuff you see.

…-1 into multilingual_parser

TalSchuster · 2019-03-31T21:10:08Z

I did some formatting that solved most of the comments but there are still some things that pylint is not happy with, like changing the signature of a function (passing a dictionary instead of a string). I understand that it is not following the OOP conventions but I think that it should be a valid pythonic way (but I'm not an expert in this).
If someone can look at it and help to solve them it will be great. Thanks!

matt-gardner

Alright, I made a first pass. I had a few important high-level questions that need to be figured out.

allennlp/data/dataset_readers/universal_dependencies_multilang.py

allennlp/modules/token_embedders/elmo_token_embedder_multilang.py

allennlp/modules/text_field_embedders/basic_text_field_embedder.py

training_config/multilang_dependency_parser.jsonnet

…-1 into multilingual_parser

TalSchuster · 2019-04-15T16:32:32Z

Thanks for the comments. I addressed them and I think that it looks a bit cleaner now.
Only thing I wasn't sure about is how the configuration tester is supposed to deal with the extvars. Can you help there?

allennlp/tests/common/params_test.py

allennlp/modules/text_field_embedders/basic_text_field_embedder.py

allennlp/models/biaffine_dependency_parser_multilang.py

training_config/multilang_dependency_parser.jsonnet

matt-gardner

Thanks for all of your work @TalSchuster, this is looking close to ready! The main thing missing at this point is tests.

These are the classes you've added that need tests:

the dataset reader (maybe add some fixtures with a couple of instances per language, and test that read() gives you the instances you expect)
the data iterator (a simple test that ensures that you're splitting by language as you expect)
the model (this can just be a simple ensure_model_can_train_save_and_load - there are lots of examples of this in the existing model tests; be sure to use very small hyperparameter values in the test experiment config, so the test is fast)
the multi-lingual elmo token embedder. For this one, it'd be annoying to give a detailed test that the vectors are what you expect, and I'm not sure if you have tiny elmo files that can be loaded. The main things we want to be sure of are that (1) this works inside of a BasicTextFieldEmbedder when you pass a lang key, and (2) the different languages get routed to the right underlying embedders. It might be possible to stub out the underlying embedders somehow to test (2), e.g., by giving no options or weight files when you construct it, and adding fake modules afterwards.

You'll also need to add docs for each of these classes that you added. If you have questions about how to do any of this, let me know.

allennlp/data/iterators/same_lang_iterator.py

allennlp/data/dataset_readers/universal_dependencies_multilang.py

allennlp/models/biaffine_dependency_parser_multilang.py

allennlp/modules/text_field_embedders/basic_text_field_embedder.py

allennlp/modules/token_embedders/elmo_token_embedder_multilang.py

training_config/multilang_dependency_parser.jsonnet

allennlp/data/iterators/same_lang_iterator.py

training_config/multilang_dependency_parser.jsonnet

jbrry · 2019-04-30T13:25:26Z

Thanks for the great work over the last couple of weeks in bringing this forward!

In relation to the case of using this parser with the default token embedder embedding; I was able to use the multilingual parser for multiple treebank training on the English UD treebanks of UD v2.3 without ELMo. Here is the config file I used.

Even though we will need the results from the tests which Matt mentioned, it was encouraging to be able to run the parser and it seems there were some favorable results from multi-treebank training on English treebanks: I compared the default parser with the multilingual one (though I realized I used a smaller LSTM hidden_size of 200 with the default parser vs. 400 for the multilingual parser).

In any case, the results of this comparison on the dev sets are here if you wish to view them: https://drive.google.com/drive/u/2/folders/1NsYJ0jhM79NpG8CWNPMHxHodymsL08Sl

matt-gardner · 2019-04-30T14:48:46Z

Thanks for the info, @Jbar-ry! Just to be clear, the point of the tests isn't just to be sure that the code works right now - it's pretty likely that it works in its current state, otherwise it wouldn't be submitted. Instead, it's to (1) ensure that as the library evolves, we don't break this code on accident, and (2) give example expected usage of this code, in the tests. When we accept code into the library, we are implicitly accepting maintenance responsibility of that code forever, so we want to be sure that it's in a state that we can maintain it, with tests to make sure we don't break it.

TalSchuster · 2019-04-30T23:04:59Z

Thanks @Jbar-ry ! It's nice to see more settings that it is effective for. Thank you for sharing the results!
@matt-gardner, this is totally fine. Sorry that it is delayed... @oriram and I are working on the tests but we are a bit busy right now with some other commitments. I hope that we will get to finish it this week.

matt-gardner · 2019-05-01T01:43:04Z

No worries on the time; good luck on submissions, if that's what you're working on.

* Added dependencies data for es, fr and it + json configuration for tests * Tests for multilingual UD reader and same-language iterator * Tests for multilingual dependency parser * fixed some of the tests - not final * use not lazy option in the parser test * better doc * test basic text field emb * pylint

matt-gardner · 2019-06-07T17:43:41Z

@TalSchuster, is this ready to look at again?

TalSchuster · 2019-06-07T20:52:57Z

@matt-gardner, I think that the only missing test is the second one from the last point that you wrote (verify that the input is routed to the right encoder by the language).
I think that I have an idea of how to do it so I'll try to get to it next week.
All of the rest of the code should be independent of that test so if you can go over it, it would be great

MaksymDel · 2019-06-11T00:06:17Z

Could the pieces from #2369 be used here to avoid code duplication? @TalSchuster @matt-gardner

TalSchuster · 2019-06-11T21:09:31Z

ok, I've added the last test for the embedder.
@matt-gardner, it is ready on my end.
@MaxDel, we've already discussed the read of multiple files earlier in this thread. There are some subtle things here, like alternating between the readers but going over all of the dataset for vocabulary creation. I didn't read the multitask code in details so there might be some things that could be better organized. Unfortunately, I don't think that I will have enough time to do it for this PR.

matt-gardner

Just some minor readability issues to fix, and then I think this looks good to merge.

allennlp/tests/data/dataset_readers/universal_dependencies_multilingual_dataset_reader_test.py

matt-gardner · 2019-06-12T04:22:02Z

allennlp/tests/data/dataset_readers/universal_dependencies_multilingual_dataset_reader_test.py

+
+    def test_read_from_files_first_pass_false(self):
+        '''
+        Note: assumes that each data file contains two trees


You might consider increasing the count threshold to 20 or something, just to make the test more robust to any future changes. It also makes it more obvious that it's going to iterate forever.

I kept the mini ELMo to 32 but reduced all other dimensions. The test is pretty fast (few seconds)

allennlp/tests/data/iterators/same_language_iterator_test.py

allennlp/tests/fixtures/biaffine_dependency_parser_multilang/experiment.json

allennlp/tests/fixtures/elmo_multilingual/config/multilang_token_embedder.json

matt-gardner

LGTM, thanks for all of your work on this!

* Multilingual version of biaffine dependency parser with elmo alignment * alignment addition to elmo embedder * dataset reader for multiple sources * iterator that groups samples from the same language * Clean multilang biaffine * Inherit unmodified functions from the biaffine parser. * Also using defaultdict for same_lang_iterator. * Multi-lang dep config example * Jsonnet format * Larger softmax value The previous one caused overflow sometimes. e-10 is good enough * formating * reorganize multilang dataset reader to work with a pathname * factoring biaffine parser to prevent duplicating code * multilangTokenEmbedder interface * fix parser factorization and pylint staff * more pylint * inspect embedder and fix params_test * make mypy happy * cr comments and doc * doc * fix doc * Multilingual tests (allenai#4) * Added dependencies data for es, fr and it + json configuration for tests * Tests for multilingual UD reader and same-language iterator * Tests for multilingual dependency parser * fixed some of the tests - not final * use not lazy option in the parser test * better doc * test basic text field emb * pylint * multilingual embedder test * cr comments * new link

TalSchuster added 3 commits March 15, 2019 17:04

Multilingual version of biaffine dependency parser with elmo alignment

a4a6728

* alignment addition to elmo embedder * dataset reader for multiple sources * iterator that groups samples from the same language

Clean multilang biaffine

02660d6

* Inherit unmodified functions from the biaffine parser. * Also using defaultdict for same_lang_iterator.

Multi-lang dep config example

f62a0f6

* Jsonnet format

TalSchuster closed this Mar 18, 2019

Larger softmax value

2101636

The previous one caused overflow sometimes. e-10 is good enough

TalSchuster reopened this Mar 29, 2019

Merge branch 'master' into multilingual_parser

d8ab96f

TalSchuster changed the title ~~Multilingual parser~~ Multilingual parser and Cross-lingual ELMo Mar 29, 2019

TalSchuster and others added 3 commits March 30, 2019 02:49

formating

7884b86

Merge branch 'multilingual_parser' of github.com:TalSchuster/allennlp…

c985456

…-1 into multilingual_parser

Merge branch 'master' into multilingual_parser

797489f

matt-gardner reviewed Apr 10, 2019

View reviewed changes

TalSchuster added 5 commits April 12, 2019 03:37

reorganize multilang dataset reader to work with a pathname

4c3ade3

Merge branch 'multilingual_parser' of github.com:TalSchuster/allennlp…

5c77a40

…-1 into multilingual_parser

factoring biaffine parser to prevent duplicating code

8b6094f

multilangTokenEmbedder interface

a374561

fix parser factorization and pylint staff

3e7ce6d

more pylint

84de432