Actual Behavior
Right now, the sample parameter of StaticTokenizerEncoder must be a list (there is an explicit check). This forces the user to pre-load the whole dataset into memory, which is undesirable for very large datasets.
Expected Behavior
It would be great if StaticTokenizerEncoder (and all its child classes) could accept any iterable for sample, not just a list.
sample could then be, for instance, an iterator: the encoder would go through the whole dataset once to compute token counts, which could then be saved (e.g. pickled) for later use. Token counts are typically much smaller than the dataset itself.
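As a rough sketch of the intended workflow (the corpus path, file format, and tokenizer below are hypothetical, not part of torchnlp):

import pickle
from collections import Counter

def count_tokens(lines, tokenize=str.split):
    # Stream over the corpus once, accumulating token counts.
    counts = Counter()
    for line in lines:
        counts.update(tokenize(line))
    return counts

# Stream a large corpus lazily instead of loading it into a list.
with open('large_corpus.txt') as f:
    token_counts = count_tokens(line.strip() for line in f)

# The counts are much smaller than the corpus; persist them for later use.
with open('token_counts.pkl', 'wb') as f:
    pickle.dump(token_counts, f)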
Steps to Reproduce the Problem
from torchnlp.encoders.text import WhitespaceEncoder

iterable = (x for x in ['hello world', 'PyTorch NLP'])
encoder = WhitespaceEncoder(iterable)

This raises a TypeError: Sample must be a list.
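For now, the only workaround is to materialize the whole iterable as a list before passing it in, which is exactly the memory cost this issue is about:

encoder = WhitespaceEncoder(list(iterable))  # loads the full dataset into memory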
Proposal
The change virtually amounts to just removing the explicit check (if not isinstance(sample, list) at torchnlp.encoders.text.StaticTokenizerEncoder:67). I tried it, and the tests still pass. I can open a PR if you think this is a good idea.
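To illustrate, here is a minimal, self-contained sketch of an encoder that builds its token counts from any iterable in a single pass; it is a hypothetical analogue of StaticTokenizerEncoder, not the actual torchnlp code:

from collections import Counter

class IterableTokenizerEncoder:
    # Hypothetical analogue of StaticTokenizerEncoder: no
    # isinstance(sample, list) check, just one pass over the iterable.
    def __init__(self, sample, tokenize=str.split):
        self.token_counts = Counter()
        for text in sample:
            self.token_counts.update(tokenize(text))

# Works with a generator, not just a list:
encoder = IterableTokenizerEncoder(x for x in ['hello world', 'PyTorch NLP'])
print(encoder.token_counts)
# Counter({'hello': 1, 'world': 1, 'PyTorch': 1, 'NLP': 1})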