BPE Tokenizer: Multiple newlines don't merge into a single token #6809
Closed
Description
So, I found out that \n\n, when followed by another character, tokenizes as ['\n', '\n'] ([198, 198]) instead of ['\n\n'] ([271]).
(I'm using Llama3 for this example, but this extends to other models as well)
Here's an example prompt:
<|begin_of_text|><|start_header_id|>system<|end_header_id|>
You're Psy, user's assistant, and a master of concise replies.<|eot_id|><|start_header_id|>user<|end_header_id|>
Write a short poem<|eot_id|><|start_header_id|>assistant<|end_header_id|>
If I switch the template to use \n\n\n\n (1038), it tokenizes as ['\n\n\n', '\n'] ([1432, 198]) instead.
(Note: I know there have been efforts to make special tokens render, but right now I understand they don't have a textual representation, so you can ignore tokens like 128000, 128006 and 128007 in the sequences above.)
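For reference, the raw ids above can be reproduced with a thin wrapper around the same binding used in the patch below. This is only a minimal sketch: the Tokenize helper is hypothetical, it mirrors the NativeApi.llama_tokenize argument shape quoted in this issue, and the exact handle type may differ across LLamaSharp versions:

unsafe static LLamaToken[] Tokenize(SafeLlamaModelHandle model, string text, bool add_bos, bool special)
{
    var bytes = System.Text.Encoding.UTF8.GetBytes(text);
    var tokenBuffer = new LLamaToken[bytes.Length + 8]; // generous upper bound on token count
    fixed (byte* bytesPtr = bytes)
    fixed (LLamaToken* tokensPtr = tokenBuffer)
    {
        var tokensCount = NativeApi.llama_tokenize(model, bytesPtr, bytes.Length, tokensPtr, tokenBuffer.Length, add_bos, special);
        return tokenBuffer[..tokensCount];
    }
}

// Tokenize(model, "\n\n", false, false)       -> [271]            (merged)
// Tokenize(model, "\n\nHi", false, false)     -> [198, 198, ...]  (unmerged)
// Tokenize(model, "\n\n\n\nHi", false, false) -> [1432, 198, ...] (partially merged)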
In C# I patch the issue like so:
var tokensCount = NativeApi.llama_tokenize(model, bytesPtr, bytes.Length, tokensPtr, tokenBuffer.Length, add_bos, special);
var list = new List<LLamaToken>();
for (int i = 0; i < tokensCount; i++)
{
    // Hack: merge adjacent '\n' tokens ['\n', '\n'] (198, 198) into '\n\n' (271);
    // the i + 1 bounds check avoids reading past the last token
    if (i + 1 < tokensCount && tokenBuffer[i] == 198 && tokenBuffer[i + 1] == 198) { list.Add(271); i++; }
    else { list.Add(tokenBuffer[i]); }
}
return list.ToArray();
(This ignores all other \n merges and only handles \n\n, which is the one that appears in the chat template.)
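If the longer runs mattered too, the same pass could greedily collapse them using the other newline ids quoted above (198 = '\n', 271 = '\n\n', 1432 = '\n\n\n', 1038 = '\n\n\n\n'). A sketch only, using plain ints for brevity: MergeNewlines is a hypothetical helper, the ids are the Llama 3 ones from this issue, and greedy collapsing only approximates what a correct BPE pass would emit:

using System;
using System.Collections.Generic;

static class NewlineMerger
{
    // Llama 3 newline ids quoted in this issue, indexed by run length.
    static readonly int[] IdByRunLength = { 0, 198, 271, 1432, 1038 };

    // Hypothetical helper: collapse runs of '\n' (198) greedily, longest merge first.
    public static List<int> MergeNewlines(IReadOnlyList<int> tokens)
    {
        var result = new List<int>(tokens.Count);
        for (int i = 0; i < tokens.Count; )
        {
            if (tokens[i] != 198) { result.Add(tokens[i]); i++; continue; }

            int run = 0; // length of the consecutive '\n' run starting at i
            while (i + run < tokens.Count && tokens[i + run] == 198) run++;
            i += run;

            while (run > 0) // e.g. a run of 5 becomes 1038 ('\n\n\n\n') + 198 ('\n')
            {
                int take = Math.Min(run, IdByRunLength.Length - 1);
                result.Add(IdByRunLength[take]);
                run -= take;
            }
        }
        return result;
    }
}

// e.g. MergeNewlines(new[] { 198, 198 }) -> [271]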