Skip to content
This repository has been archived by the owner on Dec 16, 2022. It is now read-only.

Multilingual parser and Cross-lingual ELMo #2628

Merged
merged 25 commits into from
Jun 12, 2019

Conversation

TalSchuster
Copy link
Contributor

@TalSchuster TalSchuster commented Mar 18, 2019

Support for cross-lingual alignment of ELMo and multilingual biaffine dependency parser as described in Cross-Lingual Alignment of Contextual Word Embeddings, with Applications to Zero-shot Dependency Parsing.

Modifications:

  • Multilingual parser - supports inputs from different files and uses the matching embedder. Also, stores per language metrics.
  • Universal dependencies reader for multiple languages - alternates between the sources to read batches from each one.
  • same language iterator - breaks batches, if necessary, to contain only a single language.
  • ELMo multilang embedder - loads the parameters for each lanugage and an optional alignment matrix. On forward, applies the matching embedder by the batch language.
  • Configuration example

* alignment addition to elmo embedder

* dataset reader for multiple sources

* iterator that groups samples from the same language
* Inherit unmodified functions from the biaffine parser.

* Also using defaultdict for same_lang_iterator.
The previous one caused overflow sometimes. e-10 is good enough
@matt-gardner
Copy link
Contributor

Hi @TalSchuster, thanks for the PR! It looks like this could be a nice contribution. Any particular reason you closed it?

@TalSchuster TalSchuster reopened this Mar 29, 2019
@TalSchuster
Copy link
Contributor Author

TalSchuster commented Mar 29, 2019

Hi @matt-gardner, thanks for the feedback!

Sorry for that, the first PR here was by mistake (instead of the fork) and then planned to get back to it to add some description but got a bit busy. Will add it now.

@TalSchuster TalSchuster changed the title Multilingual parser Multilingual parser and Cross-lingual ELMo Mar 29, 2019
@matt-gardner
Copy link
Contributor

Ok, great, let us know when you think this is ready for a look (I'm gathering from your comment that you're still working on something).

@TalSchuster
Copy link
Contributor Author

It is ready for review now.
I only see some pylint issues that I'm less familiar with but I'll try to figure that out.

@matt-gardner
Copy link
Contributor

Ok, great, we'll get on this (I'm totally booked tomorrow, and then I'm out next week, though - can someone else take this PR?). Let us know if you have questions about any of the pylint or other stuff you see.

@TalSchuster
Copy link
Contributor Author

I did some formatting that solved most of the comments but there are still some things that pylint is not happy with, like changing the signature of a function (passing a dictionary instead of a string). I understand that it is not following the OOP conventions but I think that it should be a valid pythonic way (but I'm not an expert in this).
If someone can look at it and help to solve them it will be great. Thanks!

Copy link
Contributor

@matt-gardner matt-gardner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright, I made a first pass. I had a few important high-level questions that need to be figured out.

@TalSchuster
Copy link
Contributor Author

Thanks for the comments. I addressed them and I think that it looks a bit cleaner now.
Only thing I wasn't sure about is how the configuration tester is supposed to deal with the extvars. Can you help there?

Copy link
Contributor

@matt-gardner matt-gardner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for all of your work @TalSchuster, this is looking close to ready! The main thing missing at this point is tests.

These are the classes you've added that need tests:

  • the dataset reader (maybe add some fixtures with a couple of instances per language, and test that read() gives you the instances you expect)
  • the data iterator (a simple test that ensures that you're splitting by language as you expect)
  • the model (this can just be a simple ensure_model_can_train_save_and_load - there are lots of examples of this in the existing model tests; be sure to use very small hyperparameter values in the test experiment config, so the test is fast)
  • the multi-lingual elmo token embedder. For this one, it'd be annoying to give a detailed test that the vectors are what you expect, and I'm not sure if you have tiny elmo files that can be loaded. The main things we want to be sure of are that (1) this works inside of a BasicTextFieldEmbedder when you pass a lang key, and (2) the different languages get routed to the right underlying embedders. It might be possible to stub out the underlying embedders somehow to test (2), e.g., by giving no options or weight files when you construct it, and adding fake modules afterwards.

You'll also need to add docs for each of these classes that you added. If you have questions about how to do any of this, let me know.

allennlp/data/iterators/same_lang_iterator.py Outdated Show resolved Hide resolved
allennlp/models/biaffine_dependency_parser_multilang.py Outdated Show resolved Hide resolved
allennlp/models/biaffine_dependency_parser_multilang.py Outdated Show resolved Hide resolved
training_config/multilang_dependency_parser.jsonnet Outdated Show resolved Hide resolved
allennlp/data/iterators/same_lang_iterator.py Outdated Show resolved Hide resolved
@jbrry
Copy link
Contributor

jbrry commented Apr 30, 2019

Thanks for the great work over the last couple of weeks in bringing this forward!

In relation to the case of using this parser with the default token embedder embedding; I was able to use the multilingual parser for multiple treebank training on the English UD treebanks of UD v2.3 without ELMo. Here is the config file I used.

Even though we will need the results from the tests which Matt mentioned, it was encouraging to be able to run the parser and it seems there were some favorable results from multi-treebank training on English treebanks: I compared the default parser with the multilingual one (though I realized I used a smaller LSTM hidden_size of 200 with the default parser vs. 400 for the multilingual parser).

In any case, the results of this comparison on the dev sets are here if you wish to view them: https://drive.google.com/drive/u/2/folders/1NsYJ0jhM79NpG8CWNPMHxHodymsL08Sl

@matt-gardner
Copy link
Contributor

Thanks for the info, @Jbar-ry! Just to be clear, the point of the tests isn't just to be sure that the code works right now - it's pretty likely that it works in its current state, otherwise it wouldn't be submitted. Instead, it's to (1) ensure that as the library evolves, we don't break this code on accident, and (2) give example expected usage of this code, in the tests. When we accept code into the library, we are implicitly accepting maintenance responsibility of that code forever, so we want to be sure that it's in a state that we can maintain it, with tests to make sure we don't break it.

@TalSchuster
Copy link
Contributor Author

TalSchuster commented Apr 30, 2019

Thanks @Jbar-ry ! It's nice to see more settings that it is effective for. Thank you for sharing the results!
@matt-gardner, this is totally fine. Sorry that it is delayed... @oriram and I are working on the tests but we are a bit busy right now with some other commitments. I hope that we will get to finish it this week.

@matt-gardner
Copy link
Contributor

No worries on the time; good luck on submissions, if that's what you're working on.

* Added dependencies data for es, fr and it + json configuration for tests

* Tests for multilingual UD reader and same-language iterator

* Tests for multilingual dependency parser

* fixed some of the tests - not final

* use not lazy option in the parser test

* better doc

* test basic text field emb

* pylint
@matt-gardner
Copy link
Contributor

@TalSchuster, is this ready to look at again?

@TalSchuster
Copy link
Contributor Author

@matt-gardner, I think that the only missing test is the second one from the last point that you wrote (verify that the input is routed to the right encoder by the language).
I think that I have an idea of how to do it so I'll try to get to it next week.
All of the rest of the code should be independent of that test so if you can go over it, it would be great

@MaksymDel
Copy link
Contributor

Could the pieces from #2369 be used here to avoid code duplication? @TalSchuster @matt-gardner

@TalSchuster
Copy link
Contributor Author

TalSchuster commented Jun 11, 2019

ok, I've added the last test for the embedder.
@matt-gardner, it is ready on my end.
@MaxDel, we've already discussed the read of multiple files earlier in this thread. There are some subtle things here, like alternating between the readers but going over all of the dataset for vocabulary creation. I didn't read the multitask code in details so there might be some things that could be better organized. Unfortunately, I don't think that I will have enough time to do it for this PR.

Copy link
Contributor

@matt-gardner matt-gardner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just some minor readability issues to fix, and then I think this looks good to merge.


def test_read_from_files_first_pass_false(self):
'''
Note: assumes that each data file contains two trees
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You might consider increasing the count threshold to 20 or something, just to make the test more robust to any future changes. It also makes it more obvious that it's going to iterate forever.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I kept the mini ELMo to 32 but reduced all other dimensions. The test is pretty fast (few seconds)

Copy link
Contributor

@matt-gardner matt-gardner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for all of your work on this!

@matt-gardner matt-gardner merged commit da16ad1 into allenai:master Jun 12, 2019
@DeNeutoy DeNeutoy mentioned this pull request Nov 12, 2019
reiyw pushed a commit to reiyw/allennlp that referenced this pull request Nov 12, 2019
* Multilingual version of biaffine dependency parser with elmo alignment

* alignment addition to elmo embedder

* dataset reader for multiple sources

* iterator that groups samples from the same language

* Clean multilang biaffine

* Inherit unmodified functions from the biaffine parser.

* Also using defaultdict for same_lang_iterator.

* Multi-lang dep config example

* Jsonnet format

* Larger softmax value

The previous one caused overflow sometimes. e-10 is good enough

* formating

* reorganize multilang dataset reader to work with a pathname

* factoring biaffine parser to prevent duplicating code

* multilangTokenEmbedder interface

* fix parser factorization and pylint staff

* more pylint

* inspect embedder and fix params_test

* make mypy happy

* cr comments and doc

* doc

* fix doc

* Multilingual tests (allenai#4)

* Added dependencies data for es, fr and it + json configuration for tests

* Tests for multilingual UD reader and same-language iterator

* Tests for multilingual dependency parser

* fixed some of the tests - not final

* use not lazy option in the parser test

* better doc

* test basic text field emb

* pylint

* multilingual embedder test

* cr comments

* new link
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants