
Fix bug where tokenize_and_concatenate does not work for small datasets #725

Merged

Conversation


@xiuyuz commented Sep 19, 2024

Description

As stated in the docstring of the function ("Note: There is a bug when inputting very small datasets (eg, <1 batch per process) where it just outputs nothing. I'm not super sure why"), tokenize_and_concatenate currently cannot handle small datasets. The cause is that tokenize_function inside tokenize_and_concatenate produces no output for small datasets, so the resulting dataset is missing its 'tokens' column.

Issue with small datasets:

  • When the dataset is small, full_text may tokenize to fewer than seq_len tokens.
  • num_batches becomes zero whenever num_tokens is less than seq_len.
  • When num_batches is zero, the slicing and rearranging operations produce empty arrays, as sketched below.
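
To make the failure concrete, here is a minimal sketch of the batching arithmetic (seq_len and the token values are hypothetical stand-ins, and a plain reshape stands in for the einops.rearrange call in the real function):

```python
import numpy as np

seq_len = 1024            # hypothetical context length
tokens = np.arange(100)   # tiny dataset: num_tokens = 100 < seq_len
num_tokens = len(tokens)

num_batches = num_tokens // seq_len       # 100 // 1024 == 0
tokens = tokens[: seq_len * num_batches]  # slices down to an empty array

# Reshaping the empty array yields zero rows, so the mapped dataset
# ends up with no 'tokens' entries at all.
batched = tokens.reshape(num_batches, seq_len)
print(batched.shape)  # (0, 1024)
```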

Fixes # (issue)

  • Check whether num_tokens is greater than or equal to seq_len before proceeding with rearrangement.
  • If it is not, set num_batches to one and handle the tokens separately so a single sequence is still returned (see the sketch below).
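
A minimal sketch of that fallback as a self-contained helper (the name batch_tokens is hypothetical, and padding short inputs with pad_token_id is an assumption about how the small case is handled; the merged diff may differ in detail):

```python
import numpy as np

def batch_tokens(tokens: np.ndarray, seq_len: int, pad_token_id: int) -> np.ndarray:
    """Batch a 1D token array into (num_batches, seq_len) rows,
    always returning at least one row for small inputs."""
    num_tokens = len(tokens)
    if num_tokens < seq_len:
        # Too few tokens for one full sequence: keep a single batch,
        # padded out to seq_len, instead of slicing everything away.
        num_batches = 1
        padding = np.full(seq_len - num_tokens, pad_token_id)
        tokens = np.concatenate([tokens, padding])
    else:
        num_batches = num_tokens // seq_len
        tokens = tokens[: seq_len * num_batches]  # drop the trailing remainder
    return tokens.reshape(num_batches, seq_len)

# The 100-token example above now yields one full row instead of none:
print(batch_tokens(np.arange(100), 1024, pad_token_id=0).shape)  # (1, 1024)
```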

Type of change

  • Bug fix (non-breaking change which fixes an issue)


Checklist:

  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I have not rewritten tests relating to key interfaces which would affect backward compatibility

@xiuyuz commented Sep 19, 2024

The failing checks appear to be caused by a connection issue with Hugging Face.

@bryce13950 merged commit 336df99 into TransformerLensOrg:main on Oct 15, 2024
12 checks passed
@bryce13950 mentioned this pull request on Oct 16, 2024