BPE Tokenizer: Multiple newlines don't merge into a single token #6809
Closed
Description
So, I found out that \n\n, when followed by another character, tokenizes as ['\n', '\n'] ([198, 198]) instead of ['\n\n'] ([271]).
(I'm using Llama3 for this example, but this extends to other models as well)
Here's an example prompt:
<|begin_of_text|><|start_header_id|>system<|end_header_id|>
You're Psy, user's assistant, and a master of concise replies.<|eot_id|><|start_header_id|>user<|end_header_id|>
Write a short poem<|eot_id|><|start_header_id|>assistant<|end_header_id|>
If I switch the template to use \n\n\n\n (1038), it tokenizes as ['\n\n\n', '\n'] ([1432, 198]) instead.
(Note: I know there have been efforts to make special tokens render, but right now I understand they don't have a textual representation, so you can ignore tokens like 128000, 128006 and 128007 in the sequences above.)
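For reference, the raw ids above can be reproduced with a thin wrapper around the same binding used in the patch below. This is only a minimal sketch: the Tokenize helper is hypothetical, it mirrors the NativeApi.llama_tokenize argument shape quoted in this issue, and the exact handle type may differ across LLamaSharp versions:

unsafe static LLamaToken[] Tokenize(SafeLlamaModelHandle model, string text, bool add_bos, bool special)
{
    var bytes = System.Text.Encoding.UTF8.GetBytes(text);
    var tokenBuffer = new LLamaToken[bytes.Length + 8]; // generous upper bound on token count
    fixed (byte* bytesPtr = bytes)
    fixed (LLamaToken* tokensPtr = tokenBuffer)
    {
        var tokensCount = NativeApi.llama_tokenize(model, bytesPtr, bytes.Length, tokensPtr, tokenBuffer.Length, add_bos, special);
        return tokenBuffer[..tokensCount];
    }
}

// Tokenize(model, "\n\n", false, false)       -> [271]            (merged)
// Tokenize(model, "\n\nHi", false, false)     -> [198, 198, ...]  (unmerged)
// Tokenize(model, "\n\n\n\nHi", false, false) -> [1432, 198, ...] (partially merged)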
In C# I patch the issue like so:
var tokensCount = NativeApi.llama_tokenize(model, bytesPtr, bytes.Length, tokensPtr, tokenBuffer.Length, add_bos, special);
var list = new List<LLamaToken>();
for (int i = 0; i < tokensCount; i++)
{
    // Hack: merge adjacent '\n' tokens ['\n', '\n'] (198, 198) into '\n\n' (271);
    // the i + 1 bounds check avoids reading past the last token
    if (i + 1 < tokensCount && tokenBuffer[i] == 198 && tokenBuffer[i + 1] == 198) { list.Add(271); i++; }
    else { list.Add(tokenBuffer[i]); }
}
return list.ToArray();
(This ignores all other \n merges and only handles \n\n, which is the one that appears in the chat template.)
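If the longer runs mattered too, the same pass could greedily collapse them using the other newline ids quoted above (198 = '\n', 271 = '\n\n', 1432 = '\n\n\n', 1038 = '\n\n\n\n'). A sketch only, using plain ints for brevity: MergeNewlines is a hypothetical helper, the ids are the Llama 3 ones from this issue, and greedy collapsing only approximates what a correct BPE pass would emit:

using System;
using System.Collections.Generic;

static class NewlineMerger
{
    // Llama 3 newline ids quoted in this issue, indexed by run length.
    static readonly int[] IdByRunLength = { 0, 198, 271, 1432, 1038 };

    // Hypothetical helper: collapse runs of '\n' (198) greedily, longest merge first.
    public static List<int> MergeNewlines(IReadOnlyList<int> tokens)
    {
        var result = new List<int>(tokens.Count);
        for (int i = 0; i < tokens.Count; )
        {
            if (tokens[i] != 198) { result.Add(tokens[i]); i++; continue; }

            int run = 0; // length of the consecutive '\n' run starting at i
            while (i + run < tokens.Count && tokens[i + run] == 198) run++;
            i += run;

            while (run > 0) // e.g. a run of 5 becomes 1038 ('\n\n\n\n') + 198 ('\n')
            {
                int take = Math.Min(run, IdByRunLength.Length - 1);
                result.Add(IdByRunLength[take]);
                run -= take;
            }
        }
        return result;
    }
}

// e.g. MergeNewlines(new[] { 198, 198 }) -> [271]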