
Generalise re_replacement_seq to deal with special symbols
This PR is similar to #90 and generalises the regex to deal with all the previous cases, and hopefully all future ones as well.

The new special cases not covered by the previous approach are the `�?`
and `�,` tokens, used by Salamandra models. Since all these special
tokens (new and old) consist of one or more � symbols, with an optional
single-character prefix and/or suffix, we can simplify and generalise
the pattern to `r"^.?�+.?$"`.
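
To make the change concrete, here is a quick comparison of the two patterns against the tokens named above (a minimal sketch; old_re and new_re are ad hoc names, not identifiers from the codebase):

import re

# Pattern before this commit: explicit per-tokenizer prefixes/suffixes.
old_re = re.compile(r"^▁*\.*�+\.*s*$")
# Pattern after this commit: optional one-character prefix, one or more
# replacement characters, optional one-character suffix.
new_re = re.compile(r"^.?�+.?$")

for token in ["▁�", "�.", ".�", "�s", "�?", "�,"]:
    print(f"{token!r}: old={bool(old_re.match(token))}, new={bool(new_re.match(token))}")

# The first four tokens match both patterns; the Salamandra tokens "�?"
# and "�," are only matched by the new, generalised pattern.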
saattrupdan authored Jan 8, 2025
1 parent 78eb908 commit cad6344
Showing 2 changed files with 12 additions and 10 deletions.
10 changes: 5 additions & 5 deletions python/outlines_core/fsm/regex.py
@@ -342,11 +342,11 @@ def make_deterministic_fsm(fsm: FSM) -> Tuple[BetterFSM, Dict[int, int]]:
 
 re_llama_byte_token = re.compile(r"^<0x[0-9A-F]{2}>$")
 
-# The "▁*" prefix is required to handle Gemma and GPT-SW3 tokenizers.
-# The "\.*" suffix is required to handle the NorwAI tokenizer.
-# The "\.*" prefix is required to handle the Salamandra tokenizer.
-# The "s*$" suffix is required to handle the OpenCoder tokenizer.
-re_replacement_seq = re.compile(r"^▁*\.*�+\.*s*$")
+# The ".?" prefix and suffix is to handle special cases in some model vocabularies. This
+# includes Gemma models (which use "▁�" as a token), NorwAI models (which use ".�" as a
+# token), Salamandra models (which use ".�" and "�?" as tokens) and OpenCoder models
+# (which use "�s" as a token).
+re_replacement_seq = re.compile(r"^.?�+.?$")
 
 
 # Copied from transformers.models.gpt2.tokenization_gpt2.bytes_to_unicode
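
For readers outside the codebase: the pattern flags tokens whose string rendering collapses into U+FFFD replacement characters, which typically means the token is a byte-level fragment with no faithful string form. A rough, hypothetical illustration of that idea (not the library's actual code path):

import re

re_replacement_seq = re.compile(r"^.?�+.?$")

def has_faithful_string_form(token_str: str) -> bool:
    # Hypothetical helper: a token rendered (almost) entirely as replacement
    # characters cannot be matched against literally as a string.
    return re_replacement_seq.match(token_str) is None

assert not has_faithful_string_form("�s")  # OpenCoder-style fallback token
assert has_faithful_string_form("hello")   # ordinary token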
12 changes: 7 additions & 5 deletions tests/fsm/test_regex.py
@@ -542,12 +542,14 @@ def convert_token_to_string(self, token):
"�",
"��",
"�.",
"�..",
".�",
".�.",
"▁�",
"▁▁�",
"▁�.",
"▁�.",
"▁▁�..",
"�▁",
"▁�▁",
"?�",
"�?",
"?�?",
],
)
def test_reduced_vocabulary_with_rare_tokens(rare_token):
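
The merged view above interleaves removed and added cases. The essence of what the updated parametrisation checks can be reproduced standalone as below — a sketch only: the real test drives each rare token through the vocabulary-reduction code with a mock tokenizer (note the convert_token_to_string stub in the hunk header), and the case list here is reconstructed as the tokens that match the new pattern.

import re

import pytest

re_replacement_seq = re.compile(r"^.?�+.?$")

@pytest.mark.parametrize(
    "rare_token",
    ["�", "��", "�.", ".�", ".�.", "▁�", "▁�.", "�▁", "▁�▁", "?�", "�?", "?�?"],
)
def test_rare_token_is_recognised(rare_token):
    # Each rare token is one or more � characters with at most a single
    # character on either side, so the generalised pattern matches it.
    assert re_replacement_seq.match(rare_token)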
