Fix the bug that tokenize_and_concatenate function not working for small dataset #725

xiuyuz · 2024-09-19T20:54:04Z

Description

As stated in the function's docstring ("Note: There is a bug when inputting very small datasets (eg, <1 batch per process) where it just outputs nothing. I'm not super sure why"), tokenize_and_concatenate currently cannot handle small datasets. The cause is that the tokenize_function defined inside tokenize_and_concatenate returns no tokens for such inputs, so the resulting dataset has no 'tokens' column.

Issue with small datasets:

When the dataset is small, full_text may be too short to yield at least seq_len tokens.
num_batches can become zero if num_tokens is less than seq_len.
When num_batches is zero, slicing and rearranging operations result in empty arrays.
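
The failure mode described above can be reproduced in isolation. The following is a minimal sketch rather than the library's code: `chunk_tokens` is a hypothetical helper that uses plain Python lists in place of the arrays the real function works with, and it mirrors only the "integer-divide, slice, rearrange" steps.

```python
from typing import List


def chunk_tokens(tokens: List[int], seq_len: int) -> List[List[int]]:
    """Split a flat token list into full rows of seq_len, dropping the remainder.

    Hypothetical helper for illustration only; not the TransformerLens code.
    """
    # Integer division gives 0 whenever len(tokens) < seq_len.
    num_batches = len(tokens) // seq_len
    # With num_batches == 0 this slice keeps nothing.
    tokens = tokens[: seq_len * num_batches]
    # Rearranging zero tokens into rows yields an empty result.
    return [tokens[i * seq_len : (i + 1) * seq_len] for i in range(num_batches)]


rows_small = chunk_tokens(list(range(5)), seq_len=8)   # fewer tokens than seq_len
rows_large = chunk_tokens(list(range(20)), seq_len=8)  # enough for two full rows
print(rows_small, len(rows_large))
```

With five tokens and `seq_len=8` the result is empty, which matches the "outputs nothing" behaviour noted in the docstring.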

Fixes # (issue)

Check if num_tokens is greater than or equal to seq_len before proceeding with rearrangement.
If not, set num_batches to one and handle the tokens differently to ensure they're still returned.
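
One way the check described above could look is sketched here. The description does not say how the short token sequence is "handled differently", so padding it up to seq_len is an assumption, as are the helper name `chunk_tokens_with_fallback` and the `pad_token_id` parameter. Plain Python lists stand in for arrays.

```python
from typing import List


def chunk_tokens_with_fallback(
    tokens: List[int], seq_len: int, pad_token_id: int
) -> List[List[int]]:
    """Split tokens into rows of seq_len, keeping short inputs as one padded row.

    Hypothetical sketch of the described fix; padding is an assumed strategy.
    """
    num_tokens = len(tokens)
    if num_tokens >= seq_len:
        # Usual path: keep only as many tokens as fill whole rows.
        num_batches = num_tokens // seq_len
        tokens = tokens[: seq_len * num_batches]
    else:
        # Small-dataset path: force a single row and pad it to seq_len
        # (assumption: padding is acceptable to downstream consumers).
        num_batches = 1
        tokens = tokens + [pad_token_id] * (seq_len - num_tokens)
    return [tokens[i * seq_len : (i + 1) * seq_len] for i in range(num_batches)]


padded_rows = chunk_tokens_with_fallback(list(range(5)), seq_len=8, pad_token_id=-1)
full_rows = chunk_tokens_with_fallback(list(range(20)), seq_len=8, pad_token_id=-1)
print(padded_rows, len(full_rows))
```

Note that in this sketch a completely empty input would produce one row made entirely of padding, so a caller may want to treat that case separately.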

Type of change

Please delete options that are not relevant.

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
This change requires a documentation update

Screenshots

Please attach before and after screenshots of the change if applicable.

Checklist:

I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes
I have not rewritten tests relating to key interfaces which would affect backward compatibility

xiuyuz · 2024-09-19T21:03:24Z

The failing checks appear to be caused by a connection issue with Hugging Face.

fix the bug that tokenize_and_concatenate function not working for sm…

d1d7a3d

…all datasets

bryce13950 merged commit 336df99 into TransformerLensOrg:main Oct 15, 2024
12 checks passed

bryce13950 mentioned this pull request Oct 16, 2024

Upstream update #755

Merged

10 tasks
