Fix the bug that tokenize_and_concatenate function not working for small dataset #725
Description
As stated in the function's docstring ("Note: There is a bug when inputting very small datasets (eg, <1 batch per process) where it just outputs nothing. I'm not super sure why"), `tokenize_and_concatenate` currently cannot handle small datasets. The cause is that the `tokenize_function` inside `tokenize_and_concatenate` produces no tokens for small datasets, so the 'tokens' column is missing from the output.
Issue with small datasets:
- `full_text` may not be long enough to produce a significant number of tokens.
- `num_batches` can become zero if `num_tokens` is less than `seq_len`.
- When `num_batches` is zero, slicing and rearranging operations result in empty arrays.
Fixes # (issue)
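The failure mode in the issue list above can be reproduced in isolation. The following is a minimal sketch, not the code from this PR: `chunk_naive`, `chunk_guarded`, and `pad_token_id` are hypothetical names, plain NumPy `reshape` stands in for whatever rearrangement the library performs, and padding the short input is an assumption (the description only says the tokens are handled differently so that they are still returned).

```python
import numpy as np


def chunk_naive(tokens, seq_len):
    """Chunk a flat token array into rows of seq_len, with no guard."""
    num_tokens = len(tokens)
    num_batches = num_tokens // seq_len  # 0 when num_tokens < seq_len
    tokens = tokens[: seq_len * num_batches]  # empty slice when num_batches == 0
    return tokens.reshape(num_batches, seq_len)


def chunk_guarded(tokens, seq_len, pad_token_id):
    """Same arithmetic, but a short input still yields one row.

    Padding with pad_token_id is an assumption made for this sketch.
    """
    num_tokens = len(tokens)
    if num_tokens >= seq_len:
        num_batches = num_tokens // seq_len
        tokens = tokens[: seq_len * num_batches]
    else:
        # Too few tokens for a full row: force a single batch and pad it out.
        num_batches = 1
        padding = np.full(seq_len - num_tokens, pad_token_id, dtype=tokens.dtype)
        tokens = np.concatenate([tokens, padding])
    return tokens.reshape(num_batches, seq_len)


small = np.arange(1, 6)  # 5 tokens, fewer than seq_len=8
print(chunk_naive(small, seq_len=8).shape)  # (0, 8): every token is dropped
print(chunk_guarded(small, seq_len=8, pad_token_id=0))  # [[1 2 3 4 5 0 0 0]]
```

With the unguarded version, a downstream consumer receives zero rows, which matches the "outputs nothing" behaviour quoted from the docstring.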
- Check that `num_tokens` is greater than or equal to `seq_len` before proceeding with rearrangement.
- Otherwise, set `num_batches` to one and handle the tokens differently to ensure they're still returned.
Type of change
Please delete options that are not relevant.
Screenshots
Please attach before and after screenshots of the change if applicable.
Checklist: