Fix llama2 decode bug in tokenizer #22
Conversation
jetstream/engine/token_utils.py
Outdated
```python
    If IdToPiece returns a hex string (e.g., '<0x0A>') for a token within these 344,
    utilize IdToPiece to convert it into a string, likely with a space placeholder (' ') for the corresponding tokens.
    """
    p_token = vocab.tokenizer.IdToPiece([tok_id])
```
It looks like tests are failing because there is a discrepancy between the real `IdToPiece` and the mocked version: they expect different parameter types, `int` vs `[int]`.
Thanks Pate! There were some weird type errors when I passed an `int` to `IdToPiece`, so I changed it to `[int]` to satisfy type checking, but the `int` type is the right one. Fixed it now.
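For context on the `int` vs `[int]` mismatch: SentencePiece's `IdToPiece` accepts either a single id or a list of ids, returning a `str` or a `list[str]` respectively, so a mock wired for one calling convention will not match code that uses the other. A stdlib-only sketch (the `piece_...` naming is made up for illustration):

```python
from unittest.mock import MagicMock

# Hypothetical mock that mirrors both SentencePiece calling conventions:
# a single int returns a str, a list of ints returns a list of str.
tokenizer = MagicMock()
tokenizer.IdToPiece.side_effect = (
    lambda ids: [f"piece_{i}" for i in ids]
    if isinstance(ids, list)
    else f"piece_{ids}"
)

print(tokenizer.IdToPiece(13))    # piece_13
print(tokenizer.IdToPiece([13]))  # ['piece_13']
```

A mock that only handles one of these two shapes will fail as soon as the code under test switches between them, which matches the test failure described above.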
jetstream/engine/token_utils.py
Outdated
```python
    utilize IdToPiece to convert it into a string, likely with a space placeholder (' ') for the corresponding tokens.
    """
    p_token = vocab.tokenizer.IdToPiece([tok_id])
    p_token = p_token[0].replace('▁', ' ').replace('_', ' ')
```
Nit: you might want a clarifying comment here, because these two characters look the same; the difference is very subtle. It looks like you had that comment originally but removed it in this change.
Yes, the current space-handling logic is not correct. I first made the change in this PR, but then planned to do it in a new PR since this PR focuses on the hex-string part. Below are some impacted token examples FYI:
```
903;; ; _
['▁']
4770;; ; __
['▁']
7471;$; $ ; $
['▁$']
7722;{; {; {
['▁{']
9423;(; ( ; (
['▁(']
11119;"; " ; "_
['▁"']
17117;,; ,; ,
['▁,']
19392;_;\n; _
['▁\n']
21326;__;\n; __
['▁\n__']
22868;'; ' ; '_
["▁'"]
23160;[; [ ; [_
['▁[']
24459;); ); )
['▁)']
```
(Literal newlines inside the affected tokens are shown as `\n` so each record stays on its own line.)
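For reference on the nit above: the SentencePiece word-boundary marker and the ASCII underscore render almost identically but are distinct codepoints, which is easy to verify with plain Python:

```python
# '▁' (U+2581, LOWER ONE EIGHTH BLOCK) is SentencePiece's word-boundary
# marker; '_' (U+005F) is the ordinary ASCII underscore.
sp_marker = "\u2581"
underscore = "_"

print(hex(ord(sp_marker)))        # 0x2581
print(hex(ord(underscore)))       # 0x5f
print(sp_marker == underscore)    # False
```

This is why replacing both `'▁'` and `'_'` with a space, as in the quoted line above, conflates a tokenizer-internal marker with a real underscore in the text.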
```python
#decode
d_t = self.sp_tokenizer.decode([n])
#IdToPiece
p_t = self.jt_tokenizer.decode([n])
```
Nit: rather than comments here use more descriptive variable names.
Changed, thanks Pate!
LGTM
This PR fixes a llama2 decode bug in the tokenizer. Before this PR, 344 token ids were decoded to hex strings; with this PR, all 344 of these token ids are decoded to the right string values.
Below are the decode results for token_id 13 (before vs with this PR):

```
before:       <0x0A>
with this PR: \n
```
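The conversion behind this fix can be sketched as follows. This is an illustrative helper, not the PR's actual code: byte pieces of the form `<0xNN>` are mapped back to the raw byte, and the `'▁'` marker becomes a space.

```python
import re

# Hypothetical helper, for illustration only.
def piece_to_text(piece: str) -> str:
    """Convert a SentencePiece piece to text: byte pieces like '<0x0A>'
    become the corresponding character, and the '▁' word-boundary marker
    becomes a space."""
    m = re.fullmatch(r"<0x([0-9A-Fa-f]{2})>", piece)
    if m:
        return chr(int(m.group(1), 16))
    return piece.replace("\u2581", " ")

print(repr(piece_to_text("<0x0A>")))        # '\n'
print(repr(piece_to_text("\u2581hello")))   # ' hello'
```

Token id 13 is one of the 344 byte-fallback tokens: its piece is `<0x0A>`, which should decode to a newline rather than the literal string `<0x0A>`.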
Here is an example of output with this PR: