This repository has been archived by the owner on Jul 4, 2023. It is now read-only.

Allow StaticTokenizerEncoder to take any iterable #85

Closed
tbelhalfaoui opened this issue Nov 3, 2019 · 2 comments

Comments

@tbelhalfaoui

Actual Behavior

Right now, the `sample` parameter of `StaticTokenizerEncoder` must be a list (there is an explicit check).

This forces the user to pre-load the whole dataset into memory, which is undesirable for very large datasets.

Expected Behavior

It would be great if `StaticTokenizerEncoder` (and all child classes) could take any iterable for `sample`, not just a list.

For instance, `sample` could be an iterator: the encoder would make a single pass over the whole dataset to compute token counts, which could then be saved (e.g. pickled) for later use. Token counts are typically much smaller than the dataset itself.
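To illustrate the idea, here is a minimal sketch (not torchnlp code) of computing token counts in one pass over a lazy stream and pickling the much smaller counts; `stream_sentences` and whitespace tokenization are assumptions for illustration:

```python
import pickle
from collections import Counter

def stream_sentences():
    # Stands in for a large dataset read lazily, one sentence at a time.
    for line in ['hello world', 'PyTorch NLP', 'hello PyTorch']:
        yield line

# Single pass over the iterator: only the counts stay in memory.
counts = Counter()
for sentence in stream_sentences():
    counts.update(sentence.split())  # whitespace tokenization

# The counts can be serialized for later use, e.g. to rebuild a vocabulary.
blob = pickle.dumps(counts)
restored = pickle.loads(blob)
```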

Steps to Reproduce the Problem

The following raises `TypeError: Sample must be a list.`

```python
from torchnlp.encoders.text import WhitespaceEncoder

iterable = (x for x in ['hello world', 'PyTorch NLP'])
encoder = WhitespaceEncoder(iterable)
```

Proposal

This should only require removing the explicit check (`if not isinstance(sample, list)` at `torchnlp.encoders.text.StaticTokenizerEncoder:67`).
I tried it, and the tests pass just fine. I can open a PR for this if you think it is a good idea.
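The proposed behavior can be sketched as a toy encoder (hypothetical names, not the actual torchnlp implementation) whose constructor consumes any iterable exactly once instead of demanding a list:

```python
from collections import Counter

class IterableTokenizerEncoder:
    """Toy stand-in for the proposal: accept any iterable for `sample`."""

    def __init__(self, sample, tokenize=str.split):
        # No isinstance(sample, list) check: `sample` may be a list,
        # generator, or any iterable, consumed once to count tokens.
        self.tokens = Counter()
        for text in sample:
            self.tokens.update(tokenize(text))

# Works with a generator, not just a list.
iterable = (x for x in ['hello world', 'PyTorch NLP'])
encoder = IterableTokenizerEncoder(iterable)
```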

@PetrochukM
Owner

Hi There! I fixed this in #84. Thanks!

@tbelhalfaoui
Author

Wow, that was quick. Thanks!
