This repository has been archived by the owner on Dec 18, 2020. It is now read-only.
Switch to CoNLL-U format
The most visible change is that from version 0.3.0 onwards, sticker2 uses the CoNLL-U format. Besides that, there are many other improvements:
- Switch from CoNLL-X to CoNLL-U as the file format.
- Much-improved error messages.
- Add `TdzLemmaEncoder`. This encoder uses the edit tree encoder, but performs the necessary pre- and postprocessing to produce TüBa-D/Z style lemmas.
- Add an option to ℓ2-normalize sinusoidal embeddings and make it the default. This improves model convergence (suggested by @twuebi).
- Support encoding of the full features column as a string (rather than individual attributes/values).
- Permit setting a default value for features. This is useful for using features that are not annotated on every token.
- Add the `filter-len` subcommand. This filters a corpus by sentence length, measured in word or sentence pieces.
- Improvements to serialization of encoders: remove phantom data and no longer store the feature <-> number bijection twice.
- Update to libtorch 1.5.0.
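
To illustrate the ℓ2-normalization item in the list above, here is a minimal Python sketch. It is not sticker2's actual implementation: it assumes the interleaved sine/cosine formulation with base 10000 known from the Transformer literature, and the function names `sinusoidal_embedding` and `l2_normalize` are hypothetical.

```python
import math


def sinusoidal_embedding(position, dims):
    """Sinusoidal embedding for one position (assumed interleaved sin/cos layout)."""
    embedding = []
    for i in range(dims):
        # Dimensions 2k and 2k+1 share one frequency (assumed base: 10000).
        angle = position / (10000 ** (2 * (i // 2) / dims))
        embedding.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return embedding


def l2_normalize(vector, eps=1e-12):
    """Scale a vector to unit Euclidean length; eps guards against division by zero."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / max(norm, eps) for x in vector]


raw = sinusoidal_embedding(position=3, dims=8)
normalized = l2_normalize(raw)
```

With an even number of dimensions, each sine/cosine pair contributes 1 to the squared norm, so the unnormalized vector has norm sqrt(dims / 2) at every position. Normalization rescales it to length 1.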
Models trained with versions prior to 0.3.0 are not compatible with this version. At the moment, we only provide model compatibility within each version y in 0.y.z, that is, across releases that share the same y.
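
The compatibility rule can be sketched as a small check. This is only an illustration under the assumption that version strings have the plain form `x.y.z`; the helper `models_compatible` is hypothetical and not part of sticker2.

```python
def models_compatible(trained_with, running):
    """Hypothetical check: a model is usable when major and minor versions match."""
    trained = [int(part) for part in trained_with.split(".")]
    current = [int(part) for part in running.split(".")]
    # Only the patch component (z in 0.y.z) may differ.
    return trained[:2] == current[:2]


same_minor = models_compatible("0.3.0", "0.3.1")
older_minor = models_compatible("0.2.4", "0.3.0")
```

Here `same_minor` is `True`, because both versions share y = 3, while `older_minor` is `False`, because the model predates 0.3.0.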