This repository has been archived by the owner on Dec 18, 2020. It is now read-only.

Switch to CoNLL-U format

@danieldk danieldk released this 19 May 12:01
· 30 commits to master since this release

The most visible change is that from version 0.3.0 onwards, sticker2 uses the CoNLL-U format. Besides that, there were many other improvements:

  • Switch from CoNLL-X to CoNLL-U as the file format.
  • Much-improved error messages.
  • Add TdzLemmaEncoder. This encoder uses the edit tree encoder, but performs the necessary pre- and postprocessing to produce TüBa-D/Z style lemmas.
  • Add an option to ℓ2-normalize sinusoidal embeddings and make it the default. This improves model convergence (suggested by @twuebi).
  • Support encoding of the full features column as a string (rather than individual attributes/values).
  • Permit setting a default value for features. This is useful for using features that are not annotated on every token.
  • Add the filter-len subcommand. This filters a corpus by sentence length, measured in word pieces or sentence pieces.
  • Improvements to serialization of encoders: remove phantom data and avoid storing the feature <-> number bijection twice.
  • Update to libtorch 1.5.0.
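
To illustrate the ℓ2-normalization item above, here is a minimal Python sketch of a sinusoidal position embedding with optional ℓ2 normalization. It is not sticker2's implementation: the function name, the interleaved sin/cos layout, and the base of 10000 are assumptions taken from the common transformer formulation.

```python
import math


def sinusoidal_embedding(position, dims, normalize=True):
    """Sinusoidal embedding for one position (hypothetical helper).

    Assumes the interleaved sin/cos layout and the base 10000 of the
    common transformer formulation; sticker2 may differ in detail.
    """
    emb = []
    for i in range(0, dims, 2):
        angle = position / (10000 ** (i / dims))
        emb.append(math.sin(angle))
        if i + 1 < dims:
            emb.append(math.cos(angle))

    if normalize:
        # Scale the vector to unit Euclidean (l2) length.
        norm = math.sqrt(sum(x * x for x in emb))
        if norm > 0.0:
            emb = [x / norm for x in emb]

    return emb


def l2_norm(vec):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in vec))
```

With an even number of dimensions, each sin/cos pair contributes 1 to the squared norm, so the unnormalized vector has length sqrt(dims / 2) at every position. Normalizing rescales every position embedding to length 1.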

Models trained with versions prior to 0.3.0 are not compatible with this version. At the moment, we only guarantee model compatibility between releases that share the same y in 0.y.z.
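
The compatibility rule can be read as a simple version check. The Python sketch below is illustrative only and is not part of sticker2. The helper name is hypothetical, and the sketch assumes plain `major.minor.patch` version strings with major version 0.

```python
def models_compatible(trained_with, running):
    """Return True if both 0.y.z versions share the same y.

    Hypothetical helper; only versions of the form 0.y.z are handled,
    since the release notes say nothing about later major versions.
    """
    t_major, t_minor, _ = (int(part) for part in trained_with.split("."))
    r_major, r_minor, _ = (int(part) for part in running.split("."))
    if t_major != 0 or r_major != 0:
        raise ValueError("only 0.y.z versions are covered by this rule")
    return t_minor == r_minor
```

For example, a model trained with 0.3.0 would be usable with 0.3.1 under this reading, but a model trained with 0.2.0 would not.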