[speedup] Use indices mappings instead of deepcopy for all the samples reordering methods #513
Conversation
    writer_batch_size=writer_batch_size,
    verbose=verbose,
)

def export(
Moved this method without modification to keep all the samples re-ordering/selection methods (`select`, `sort`, `shuffle`, `shard`, `train_test_split`) in the same part of the file. Sorry for that.
@@ -1419,8 +1482,8 @@ def train_test_split(
    generator: Optional[np.random.Generator] = None,
    keep_in_memory: bool = False,
    load_from_cache_file: bool = True,
    train_cache_file_name: Optional[str] = None,
    test_cache_file_name: Optional[str] = None,
    train_indices_cache_file_name: Optional[str] = None,
This is a bit long but I think it's important that the user does not mistake this cache for the dataset table cache.
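For illustration, a hedged usage sketch (the `dataset` object and file path are made up; only the argument name comes from this PR):

```python
# The cache file passed here stores only the indices mapping (the new row order),
# not a rewritten copy of the dataset table.
shuffled = dataset.shuffle(seed=42, indices_cache_file_name="shuffle_indices.arrow")
```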
Very cool!
Things will be so much faster :)
A few comments:
@@ -998,7 +1090,7 @@ def apply_function_on_filtered_inputs(inputs, indices, check_same_num_examples=F

def filter(
Shall we use the indices mapping for `filter` too?
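For context, a minimal pyarrow sketch (not the library's code) of what filtering through an indices mapping instead of rewriting the table could look like:

```python
import pyarrow as pa

# Toy table standing in for the dataset's arrow table.
table = pa.table({"text": ["a", "bb", "ccc", "dddd"]})

# Keep the row positions that pass the predicate instead of writing a new table.
keep = [i for i in range(table.num_rows) if len(table.column("text")[i].as_py()) > 1]
indices = pa.array(keep, type=pa.uint64())

# Reading the filtered view then goes through Table.take on the base table.
filtered = table.take(indices)
print(filtered.to_pydict())  # {'text': ['bb', 'ccc', 'dddd']}
```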
    cache_file_name: Optional[str] = None,
    writer_batch_size: Optional[int] = 1000,
    reader_batch_size: Optional[int] = 1000,
    features: Optional[Features] = None,
Not sure why we have `features` here? I agree that it can be done for free, but I didn't expect to see it here.
Free lunch :-)
@@ -1549,28 +1612,34 @@ def train_test_split(
    "seed": seed,
    "keep_in_memory": keep_in_memory,
    "load_from_cache_file": load_from_cache_file,
    "train_cache_file_name": train_cache_file_name,
    "test_cache_file_name": test_cache_file_name,
As in `shuffle`, you probably need to add `"length": len(self)` here.
I'm not sure about that, because I feel like it's handled by the hashes on the indices and data files. I added some tests for this.
Yes, you're right, it's taken into account in `previous_files_string`.
@@ -1589,16 +1658,14 @@ def train_test_split(
    train_split = self.select(
        indices=train_indices,
        keep_in_memory=keep_in_memory,
        load_from_cache_file=load_from_cache_file,
Maybe keep `load_from_cache_file` here as well.
@@ -1611,8 +1678,7 @@ def shard(
    index: int,
    contiguous: bool = False,
    keep_in_memory: bool = False,
    load_from_cache_file: bool = True,
We can keep `load_from_cache_file` here and also pass it to `.select()`.
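Something like this rough sketch (parameter names follow the diff above; the body is illustrative only, not the exact code):

```python
def shard(self, num_shards, index, contiguous=False, keep_in_memory=False,
          load_from_cache_file=True, indices_cache_file_name=None, writer_batch_size=1000):
    # Hypothetical sketch: compute the shard's indices and let select() handle caching.
    if contiguous:
        div, mod = divmod(len(self), num_shards)
        start = div * index + min(index, mod)
        end = start + div + (1 if index < mod else 0)
        indices = range(start, end)
    else:
        indices = range(index, len(self), num_shards)
    return self.select(
        indices=indices,
        keep_in_memory=keep_in_memory,
        load_from_cache_file=load_from_cache_file,  # forwarded as suggested above
        indices_cache_file_name=indices_cache_file_name,
        writer_batch_size=writer_batch_size,
    )
```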
Ok, I fixed it.
Ok, adding some benchmarks for map/filter and then I'll merge.
Warning from PyTorch that we should maybe consider at some point @lhoestq:
@@ -717,7 +717,7 @@ def _map_indices(self, indices: Union[int, slice, pa.Array, Iterable]):

    # We can do a slice
    if array_indices is None:
        return self._indices.column(0).slice(array_indices[0], array_indices[1] - array_indices[0])
        return self._indices.column(0).slice(slice_indices[0], slice_indices[1] - slice_indices[0])
Good catch!
Not sure why we have that; it's probably linked to zero-copy from Arrow to NumPy.
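For reference, the fixed line relies on pyarrow's (offset, length) slice semantics; a tiny standalone sketch with assumed names:

```python
import pyarrow as pa

# Stand-in for self._indices.column(0): the indices mapping column.
indices_column = pa.chunked_array([pa.array([4, 2, 7, 1, 9], type=pa.uint64())])

# slice_indices is assumed to hold (start, stop) positions in the mapping;
# ChunkedArray.slice takes (offset, length), hence the stop - start.
slice_indices = (1, 4)
window = indices_column.slice(slice_indices[0], slice_indices[1] - slice_indices[0])
print(window.to_pylist())  # [2, 7, 1]
```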
Use an indices mapping instead of rewriting the dataset for all the samples re-ordering/selection methods (`select`, `sort`, `shuffle`, `shard`, `train_test_split`).

Added a `flatten_indices` method, with tests, which copies the dataset to a new table in order to remove the indices mapping.

All the samples re-ordering/selection methods should be a lot faster. The downside is that iterating over very large batches of the dataset might be a little slower once the order of the samples has been changed, since in that case we use `pyarrow.Table.take` instead of `pyarrow.Table.slice`. There is no free lunch, but the speed of iterating over the dataset is rarely the bottleneck.

Backward breaking change: the `cache_file_name` argument in all the samples re-ordering/selection methods (`select`, `sort`, `shuffle`, `shard`, `train_test_split`) is now called `indices_cache_file_name`, on purpose, to make it explicit to the user that this cache file is used for caching the indices mapping and not the dataset itself.
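A short end-to-end sketch of the behavior described above (assuming the `datasets` API from this PR):

```python
from datasets import Dataset

ds = Dataset.from_dict({"idx": list(range(10))})

# shuffle/select/sort/shard/train_test_split now only write an indices mapping;
# the underlying arrow table is left untouched.
shuffled = ds.shuffle(seed=42)

# Reads on a re-ordered dataset go through pyarrow.Table.take (random access)
# rather than pyarrow.Table.slice (contiguous reads), which can be slightly
# slower on very large batches.
batch = shuffled[:8]

# flatten_indices copies the rows into a new contiguous table and drops the
# indices mapping, restoring slice-based reads.
flat = shuffled.flatten_indices()
```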