
Fix llama2 decode bug in tokenizer #22

Merged
merged 5 commits into main on Mar 29, 2024

Conversation

FanhaiLu1
Contributor

This PR fixes the llama2 decode bug in the tokenizer. Before this PR, 344 token ids were decoded to hex strings; with this PR, all 344 of those token ids are decoded to their correct string values.

Below are the decode results for token_id 13 (before vs. with this PR):

  • Before: <0x0A>

  • With this PR: \n
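The fix can be sketched as follows. This is a minimal illustration of converting SentencePiece byte-fallback pieces (e.g. '<0x0A>') into their raw characters, not the PR's exact code; `piece_to_text` is a hypothetical helper name:

```python
import re

def piece_to_text(piece: str) -> str:
    """Convert a SentencePiece piece into readable text (hypothetical helper)."""
    # Byte-fallback tokens are emitted as '<0xNN>'; decode them to the raw byte.
    m = re.fullmatch(r"<0x([0-9A-Fa-f]{2})>", piece)
    if m:
        return chr(int(m.group(1), 16))
    # SentencePiece marks a leading space with U+2581 ('▁'), not a plain '_'.
    return piece.replace("\u2581", " ")
```

With this helper, the token_id 13 example above decodes as `piece_to_text("<0x0A>")`, which yields `"\n"` rather than the hex string.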

Here is an example of output with this PR:

to find purpose and fulfillment. Here are some reasons why:\n\n1. Purpose gives life meaning: Having a purpose in life gives it meaning and direction. It helps us to focus on what is important and to make decisions that align with our values and goals.\n2. Fulfillment comes from pursuing purpose: When we pursue our purpose, we experience fulfillment and satisfaction. This is because we are doing something that is meaningful and important to us, and we are making progress towards our goals.\n3. Purpose is unique to each person: Everyone has their own unique purpose in life, and it is important to discover and pursue our own purpose. This is because when we are doing something that is meaningful and important to us, we are more likely to be happy and fulfilled.\n4. Purpose can be found in many areas of life: Our purpose can be found in many areas of life, such as our work, relationships, hobbies, and personal growth. It is important to find ways to incorporate our purpose into all of these areas of life.\n5.

If IdToPiece returns a hex string (e.g., '<0x0A>') for a token within these 344,
utilize IdToPiece to convert it into a string, likely with a space placeholder (' ') for the corresponding tokens.
"""
p_token = vocab.tokenizer.IdToPiece([tok_id])
Collaborator


It looks like tests are failing because of a discrepancy between the real IdToPiece and the mocked version: they expect different parameter types (int vs. [int]).
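A minimal sketch of the discrepancy, assuming a `unittest.mock` stand-in for the tokenizer (the real sentencepiece IdToPiece accepts both an int and a list of ints; a mock pinned to one call shape breaks callers using the other):

```python
from unittest import mock

tokenizer = mock.Mock()
# Hypothetical mock mirroring both call shapes the real IdToPiece supports:
# an int returns a single piece, a list of ints returns a list of pieces.
tokenizer.IdToPiece.side_effect = lambda ids: (
    ["<0x0A>" for _ in ids] if isinstance(ids, list) else "<0x0A>"
)

assert tokenizer.IdToPiece(13) == "<0x0A>"      # int call shape
assert tokenizer.IdToPiece([13]) == ["<0x0A>"]  # [int] call shape
```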

Contributor Author


Thanks Pate! There were some weird type errors when I passed an int to IdToPiece, so I had changed it to [int] for type checking, but int is the right type. Fixed it now.

utilize IdToPiece to convert it into a string, likely with a space placeholder (' ') for the corresponding tokens.
"""
p_token = vocab.tokenizer.IdToPiece([tok_id])
p_token = p_token[0].replace('▁', ' ').replace('_', ' ')
Collaborator


Nit: You might want a clarifying comment here because it looks like these two characters are the same since the difference is so subtle. It looks like you had that comment originally but removed it in this change.
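For reference, the two characters really are distinct codepoints, which is why a clarifying comment helps. A quick check (pure Python, no tokenizer needed):

```python
# '▁' is U+2581 (LOWER ONE EIGHTH BLOCK), SentencePiece's word-boundary marker;
# '_' is U+005F, a plain ASCII underscore. They render almost identically.
sp_marker = "\u2581"
underscore = "_"
assert sp_marker != underscore
assert ord(sp_marker) == 0x2581 and ord(underscore) == 0x5F
# Replacing only one of them silently leaves the other in place:
assert "\u2581foo_bar".replace(sp_marker, " ") == " foo_bar"
```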

Contributor Author


Yes, the current logic for spaces is not correct. I first made that change in this PR, but then planned to do it in a new PR since this PR focuses on the hex str part. Below are some impacted token examples FYI:
(literal newlines inside the decoded strings are shown as \n)

903;; ; _ ['▁\n']
4770;; ; __ ['▁\n']
7471;$; $ ; $ ['▁$']
7722;\n{; {; { ['▁\n{']
9423;(; ( ; ( ['▁(']
11119;"\n; " ; "_ ['▁"']
17117;\n,; ,; , ['▁\n,']
19392;_; ; _ ['▁']
21326;__; ; __ ['▁__']
22868;'\n; ' ; '_ ["▁'"]
23160;[\n; [ ; [_ ['▁[']
24459;\n); ); ) ['▁\n)']

Comment on lines 61 to 64
#decode
d_t = self.sp_tokenizer.decode([n])
#IdToPiece
p_t = self.jt_tokenizer.decode([n])
Collaborator


Nit: rather than comments here use more descriptive variable names.
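The reviewer's suggestion might look like the following sketch. The tokenizers are stubbed here, since the real `sp_tokenizer`/`jt_tokenizer` need a model file, and the variable names are illustrative:

```python
class _StubTokenizer:
    """Stand-in for a tokenizer exposing decode(); returns a fixed string."""
    def __init__(self, output: str):
        self._output = output

    def decode(self, token_ids):
        return self._output

sp_tokenizer = _StubTokenizer("\n")
jt_tokenizer = _StubTokenizer("\n")

# Descriptive names replace the '#decode' / '#IdToPiece' comments.
sp_decoded_text = sp_tokenizer.decode([13])
jt_decoded_text = jt_tokenizer.decode([13])
assert sp_decoded_text == jt_decoded_text
```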

Contributor Author


Changed, thanks Pate!

Collaborator

@patemotter patemotter left a comment


LGTM

@FanhaiLu1 FanhaiLu1 merged commit 9376d50 into AI-Hypercomputer:main Mar 29, 2024
3 checks passed
@FanhaiLu1 FanhaiLu1 deleted the decode_fix branch March 29, 2024 20:02