한국어 | English

Wordpiece Vocabulary

For a Korean tokenizer, the main alternatives are Sentencepiece, Mecab, and Wordpiece.

Since BERT, ELECTRA, etc. use Wordpiece by default, the tokenizer provided out of the box is also written to be compatible with it. In other words, using Sentencepiece or Mecab requires writing a customized tokenizer, which makes it inconvenient to use the model directly from the library.

The main goal of this project was to make the model usable in the library immediately, without creating an additional tokenization file, so Wordpiece was chosen.
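For example (a minimal sketch, not part of the project code), a Wordpiece vocab.txt built this way can be loaded directly by the BERT-style tokenizer in the transformers library; the file path and sample sentence below are illustrative.

from transformers import BertTokenizer

# Illustrative only: load a Wordpiece vocab file directly with the stock BERT tokenizer
tokenizer = BertTokenizer(vocab_file="vocab.txt", do_lower_case=False)

print(tokenizer.tokenize("한국어 워드피스 예시"))  # word-internal pieces are prefixed with "##"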

The original Wordpiece code is NOT available!

The Wordpiece vocab builder used for the official BERT has not been released. Alternatives are suggested on the official BERT Github, but they do not produce exactly the same Wordpiece vocab.

Several open-source implementations of a Wordpiece vocab builder exist, but they often run into memory and speed problems when the input file is very large.

Huggingface Tokenizers

In the end, the Wordpiece vocabulary was built with the Tokenizers library recently released by Huggingface.

With this library there are no memory issues even when the corpus is very large, and since it is implemented in Rust it is also faster than pure-Python implementations.

Code for building Wordpiece vocab

(This code is based on tokenizers==0.7.0. The API may change in later versions.)

import argparse
from tokenizers import BertWordPieceTokenizer

# Command-line options for the training corpus and vocab settings
parser = argparse.ArgumentParser()

parser.add_argument("--corpus_file", type=str)
parser.add_argument("--vocab_size", type=int, default=32000)
parser.add_argument("--limit_alphabet", type=int, default=6000)

args = parser.parse_args()

# BERT-compatible Wordpiece tokenizer (cased: no lowercasing, no accent stripping)
tokenizer = BertWordPieceTokenizer(
    vocab_file=None,
    clean_text=True,
    handle_chinese_chars=True,
    strip_accents=False,  # Must be False for a cased model
    lowercase=False,
    wordpieces_prefix="##"
)

# Train the Wordpiece vocab on the corpus
tokenizer.train(
    files=[args.corpus_file],
    limit_alphabet=args.limit_alphabet,
    vocab_size=args.vocab_size
)

# Save the resulting vocab file to the current directory
tokenizer.save("./", "ch-{}-wpm-{}".format(args.limit_alphabet, args.vocab_size))
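Once training finishes, the saved vocab can be loaded back to sanity-check the output. The snippet below is a minimal sketch under the same tokenizers==0.7.0 assumption; the vocab file name is illustrative (it corresponds to the default limit_alphabet and vocab_size above).

from tokenizers import BertWordPieceTokenizer

# Illustrative file name; reuse the same casing options as during training
tokenizer = BertWordPieceTokenizer(
    "ch-6000-wpm-32000-vocab.txt",
    strip_accents=False,
    lowercase=False
)

output = tokenizer.encode("한국어 워드피스 토크나이저")
print(output.tokens)  # subword pieces, with "##" marking word-internal pieces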
  • Note that when lowercase=False, strip_accents=False must also be set.

  • To keep the [UNK] ratio as low as possible, the vocab should be trained to cover as many characters as possible (limit_alphabet).

  • After the vocab was created, 200 [unused] tokens were additionally added to it.

  • Assuming the same preprocessing, the [UNK] ratio was much lower than with Sentencepiece (a rough way to check this is sketched after this list).

  • For the vocab, please see vocab.txt
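A rough way to check the [UNK] ratio mentioned above, as a hypothetical sketch (the vocab file and evaluation corpus paths are illustrative):

from tokenizers import BertWordPieceTokenizer

# Illustrative paths; count how many produced tokens are [UNK]
tokenizer = BertWordPieceTokenizer("vocab.txt", strip_accents=False, lowercase=False)

total, unk = 0, 0
with open("eval_corpus.txt", encoding="utf-8") as f:
    for line in f:
        tokens = tokenizer.encode(line.strip()).tokens
        total += len(tokens)
        unk += tokens.count("[UNK]")

if total > 0:
    print("UNK ratio: {:.4%}".format(unk / total))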
