Fix llama2 decode bug in tokenizer #22
Conversation
jetstream/engine/token_utils.py
Outdated
```python
    If IdToPiece returns a hex string (e.g., '<0x0A>') for a token within these 344,
    utilize IdToPiece to convert it into a string, likely with a space placeholder (' ') for the corresponding tokens.
    """
    p_token = vocab.tokenizer.IdToPiece([tok_id])
```
It looks like tests are failing because there is a discrepancy between the real `IdToPiece` and the mocked version: they expect different parameter types, `int` vs `[int]`.
Thanks Pate! There were some weird type errors when I passed an `int` to `IdToPiece`, so I changed it to `[int]` to satisfy type checking, but the `int` type is the right one. Fixed it now.
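For context on the `int` vs `[int]` mismatch: SentencePiece's `IdToPiece` accepts either a single id or a list of ids, returning a `str` or a `list[str]` respectively, so a mock wired for one calling convention will not match code that uses the other. A stdlib-only sketch (the `piece_...` naming is made up for illustration):

```python
from unittest.mock import MagicMock

# Hypothetical mock that mirrors both SentencePiece calling conventions:
# a single int returns a str, a list of ints returns a list of str.
tokenizer = MagicMock()
tokenizer.IdToPiece.side_effect = (
    lambda ids: [f"piece_{i}" for i in ids]
    if isinstance(ids, list)
    else f"piece_{ids}"
)

print(tokenizer.IdToPiece(13))    # piece_13
print(tokenizer.IdToPiece([13]))  # ['piece_13']
```

A mock that only handles one of these two shapes will fail as soon as the code under test switches between them, which matches the test failure described above.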
jetstream/engine/token_utils.py
Outdated
```python
    utilize IdToPiece to convert it into a string, likely with a space placeholder (' ') for the corresponding tokens.
    """
    p_token = vocab.tokenizer.IdToPiece([tok_id])
    p_token = p_token[0].replace('▁', ' ').replace('_', ' ')
```
Nit: you might want a clarifying comment here, because these two characters look the same; the difference is very subtle. It looks like you had that comment originally but removed it in this change.
Yes, the current space-handling logic is not correct. I first made the change in this PR, but then planned to do it in a new PR since this PR focuses on the hex-string part. Below are some impacted token examples FYI:
```
903;; ; _
['▁']
4770;; ; __
['▁']
7471;$; $ ; $
['▁$']
7722;{; {; {
['▁{']
9423;(; ( ; (
['▁(']
11119;"; " ; "_
['▁"']
17117;,; ,; ,
['▁,']
19392;_;\n; _
['▁\n']
21326;__;\n; __
['▁\n__']
22868;'; ' ; '_
["▁'"]
23160;[; [ ; [_
['▁[']
24459;); ); )
['▁)']
```
(Literal newlines inside the affected tokens are shown as `\n` so each record stays on its own line.)
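For reference on the nit above: the SentencePiece word-boundary marker and the ASCII underscore render almost identically but are distinct codepoints, which is easy to verify with plain Python:

```python
# '▁' (U+2581, LOWER ONE EIGHTH BLOCK) is SentencePiece's word-boundary
# marker; '_' (U+005F) is the ordinary ASCII underscore.
sp_marker = "\u2581"
underscore = "_"

print(hex(ord(sp_marker)))        # 0x2581
print(hex(ord(underscore)))       # 0x5f
print(sp_marker == underscore)    # False
```

This is why replacing both `'▁'` and `'_'` with a space, as in the quoted line above, conflates a tokenizer-internal marker with a real underscore in the text.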
```python
#decode
d_t = self.sp_tokenizer.decode([n])
#IdToPiece
p_t = self.jt_tokenizer.decode([n])
```
Nit: rather than comments here use more descriptive variable names.
Changed, thanks Pate!
LGTM
This PR fixes a llama2 decode bug in the tokenizer. Before this PR, 344 token ids were decoded to hex strings; with this PR, all 344 of these token ids are decoded to the right string values.
Below are the decode results for token_id 13 (before vs with this PR):

```
before:       <0x0A>
with this PR: \n
```
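The conversion behind this fix can be sketched as follows. This is an illustrative helper, not the PR's actual code: byte pieces of the form `<0xNN>` are mapped back to the raw byte, and the `'▁'` marker becomes a space.

```python
import re

# Hypothetical helper, for illustration only.
def piece_to_text(piece: str) -> str:
    """Convert a SentencePiece piece to text: byte pieces like '<0x0A>'
    become the corresponding character, and the '▁' word-boundary marker
    becomes a space."""
    m = re.fullmatch(r"<0x([0-9A-Fa-f]{2})>", piece)
    if m:
        return chr(int(m.group(1), 16))
    return piece.replace("\u2581", " ")

print(repr(piece_to_text("<0x0A>")))        # '\n'
print(repr(piece_to_text("\u2581hello")))   # ' hello'
```

Token id 13 is one of the 344 byte-fallback tokens: its piece is `<0x0A>`, which should decode to a newline rather than the literal string `<0x0A>`.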
Here is an example of output with this PR: