BPE Tokenizer: Multiple newlines don't merge into a single token #6809

Closed
@Lyrcaxis

Description

So, I found out that \n\n, when followed by another character, tokenizes as ['\n', '\n'] ([198, 198]) instead of ['\n\n'] ([271]).
(I'm using Llama3 for this example, but this extends to other models as well)

Here's an example prompt:

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You're Psy, user's assistant, and a master of concise replies.<|eot_id|><|start_header_id|>user<|end_header_id|>

Write a short poem<|eot_id|><|start_header_id|>assistant<|end_header_id|>


And the tokenized text:
[image: screenshot of the tokenized prompt]

If I switch the template to use \n\n\n\n (1038), it tokenizes as ['\n\n\n', '\n'] ([1432, 198]):
[image: screenshot of the tokenized prompt with \n\n\n\n]

(Note: I know there have been efforts to make special tokens render, but as I understand it they currently have no textual representation, so you can ignore tokens like 128000, 128006 and 128007 in the sequences above.)

In C# I patch the issue like so:

var tokensCount = NativeApi.llama_tokenize(model, bytesPtr, bytes.Length, tokensPtr, tokenBuffer.Length, add_bos, special);
var list = new List<LLamaToken>();
for (int i = 0; i < tokensCount; i++) {
    // Hack: ['\n', '\n'] (198, 198) --> ['\n\n'] (271).
    // The i + 1 < tokensCount check avoids reading past the last valid token.
    if (i + 1 < tokensCount && tokenBuffer[i] == 198 && tokenBuffer[i + 1] == 198) { list.Add(271); i++; }
    else { list.Add(tokenBuffer[i]); }
}
return list.ToArray();

(This ignores every \n merge except \n\n, which is common in the template.)
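
For anyone who wants to try the workaround without the native bindings, here is a minimal standalone sketch of the same post-processing step on plain int arrays. The class and method names (NewlineMerge, MergeDoubleNewlines) are hypothetical and not part of llama.cpp or LLamaSharp. The token ids 198 and 271 are the Llama 3 values quoted above, so other vocabularies would need different constants.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration only; not an API of llama.cpp or LLamaSharp.
public static class NewlineMerge
{
    // Token ids quoted in this issue for the Llama 3 vocabulary (assumed, model-specific).
    public const int SingleNewline = 198; // "\n"
    public const int DoubleNewline = 271; // "\n\n"

    // Returns a copy of the first `count` tokens with each adjacent
    // (198, 198) pair collapsed into a single 271.
    public static int[] MergeDoubleNewlines(int[] tokens, int count)
    {
        // Guard against a count that is negative or larger than the buffer.
        int n = Math.Max(0, Math.Min(count, tokens.Length));
        var merged = new List<int>(n);

        for (int i = 0; i < n; i++)
        {
            // Only look ahead when a next token actually exists.
            bool isPair = i + 1 < n
                && tokens[i] == SingleNewline
                && tokens[i + 1] == SingleNewline;

            if (isPair)
            {
                merged.Add(DoubleNewline);
                i++; // skip the second "\n" of the pair
            }
            else
            {
                merged.Add(tokens[i]);
            }
        }

        return merged.ToArray();
    }
}
```

Calling NewlineMerge.MergeDoubleNewlines(new[] { 128007, 198, 198 }, 3) returns [128007, 271]. Like the patch above, this only handles pairs: three consecutive 198 tokens become [271, 198], not the single \n\n\n token (1432), so it is a stopgap and not a fix for the tokenizer itself.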
