Keys should be unique error on code_search_net #2552

Closed · thomwolf opened this issue Jun 28, 2021 · 8 comments · Fixed by #2555
Labels: bug (Something isn't working)

Comments

@thomwolf (Member)

Describe the bug

Loading code_search_net does not seem possible at the moment.

Steps to reproduce the bug

>>> from datasets import load_dataset
>>> load_dataset('code_search_net')
Downloading: 8.50kB [00:00, 3.09MB/s]                                                                                                                                           
Downloading: 19.1kB [00:00, 10.1MB/s]                                                                                                                                           
No config specified, defaulting to: code_search_net/all
Downloading and preparing dataset code_search_net/all (download: 4.77 GiB, generated: 5.99 GiB, post-processed: Unknown size, total: 10.76 GiB) to /Users/thomwolf/.cache/huggingface/datasets/code_search_net/all/1.0.0/b3e8278faf5d67da1d06981efbeac3b76a2900693bd2239bbca7a4a3b0d6e52a...
Traceback (most recent call last):         
  File "/Users/thomwolf/Documents/GitHub/datasets/src/datasets/builder.py", line 1067, in _prepare_split
    writer.write(example, key)
  File "/Users/thomwolf/Documents/GitHub/datasets/src/datasets/arrow_writer.py", line 343, in write
    self.check_duplicate_keys()
  File "/Users/thomwolf/Documents/GitHub/datasets/src/datasets/arrow_writer.py", line 354, in check_duplicate_keys
    raise DuplicatedKeysError(key)
datasets.keyhash.DuplicatedKeysError: FAILURE TO GENERATE DATASET !
Found duplicate Key: 48
Keys should be unique and deterministic in nature

Environment info

  • datasets version: 1.8.1.dev0
  • Platform: macOS-10.15.7-x86_64-i386-64bit
  • Python version: 3.8.5
  • PyArrow version: 2.0.0
@thomwolf added the bug label on Jun 28, 2021
@thomwolf (Member, Author)

Two questions:

  • with datasets-cli env we don't have any information on the dataset script version used. Should we expose this somehow, either as a note in the error message or by letting datasets-cli env take the dataset name as an argument?
  • I don't really understand why the id is duplicated in the code_search_net script. How can I actually debug this?

@lhoestq (Member)

lhoestq commented Jun 28, 2021

Thanks for reporting. There was indeed an issue with the keys: the key was the sum of the file id and the row id, which resulted in collisions. I just opened a PR to fix this: #2555
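
For context, a minimal sketch of the collision (the names and data below are illustrative, not the actual code_search_net script):

# Yielding the sum of the file index and the row index as the key collides:
# (file_id=0, row_id=48) and (file_id=48, row_id=0) both produce the key 48.
def generate_examples(files):
    for file_id, rows in enumerate(files):
        for row_id, record in enumerate(rows):
            yield file_id + row_id, record  # buggy: keys are not unique across files

files = [[{"func": "a"}] * 100, [{"func": "b"}] * 100]  # two fake data files
keys = [key for key, _ in generate_examples(files)]
print(len(keys), len(set(keys)))  # 200 keys, but only 101 unique values

# A collision-free alternative is a combined string key, e.g. f"{file_id}_{row_id}".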

To help users debug this kind of error, we could try to show a message like this:

DuplicateKeysError: both the 42nd and the 1337th examples have the same key `48`.
Please fix the dataset script at <path/to/the/dataset/script>

This way users know what to look for if they want to debug this issue. I opened an issue to track this: #2556
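
A rough sketch of how such a message could be produced, assuming the writer keeps a mapping from each key to the indices of the examples that used it (illustrative only, not the actual arrow_writer.py internals):

from collections import defaultdict

class KeyTracker:
    def __init__(self):
        self.indices_by_key = defaultdict(list)  # key -> example indices that used it

    def add(self, key, example_index):
        self.indices_by_key[key].append(example_index)
        indices = self.indices_by_key[key]
        if len(indices) > 1:
            raise ValueError(
                f"Examples {indices[0]} and {indices[-1]} have the same key {key!r}. "
                "Please fix the dataset script so that keys are unique."
            )

tracker = KeyTracker()
tracker.add(48, 42)
tracker.add(48, 1337)  # raises: examples 42 and 1337 have the same key 48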

@thomwolf (Member, Author)

And are we sure there aren't a lot of datasets that are now broken by this change?

@lhoestq (Member)

lhoestq commented Jun 28, 2021

Thanks to the dummy data, we know for sure that most of them work as expected.
code_search_net wasn't caught because the dummy data only has one data file, while the dataset script can actually load several of them using os.listdir. Let me take a look at all the other datasets that use os.listdir to check that the keys are all right.
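
A small sketch of why a single dummy data file can hide this kind of collision while several files (as found via os.listdir) expose it (the file and row counts are illustrative):

def buggy_keys(num_files, rows_per_file):
    # with one file, file_id is always 0, so the buggy key file_id + row_id
    # reduces to row_id and happens to be unique
    return [file_id + row_id
            for file_id in range(num_files)
            for row_id in range(rows_per_file)]

one_file = buggy_keys(num_files=1, rows_per_file=100)
many_files = buggy_keys(num_files=5, rows_per_file=100)
print(len(one_file) == len(set(one_file)))      # True: the dummy data passes
print(len(many_files) == len(set(many_files)))  # False: real data triggers DuplicatedKeysError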

@lhoestq (Member)

lhoestq commented Jun 28, 2021

I found one issue on fever (PR here: #2557)
All the other ones seem fine :)

@SolomidHero

Hi! I got the same error when loading another dataset:

load_dataset('wikicorpus', 'raw_en')

Traceback:

---------------------------------------------------------------------------
DuplicatedKeysError                       Traceback (most recent call last)
/opt/conda/lib/python3.8/site-packages/datasets/builder.py in _prepare_split(self, split_generator)
   1109                     example = self.info.features.encode_example(record)
-> 1110                     writer.write(example, key)
   1111             finally:

/opt/conda/lib/python3.8/site-packages/datasets/arrow_writer.py in write(self, example, key, writer_batch_size)
    341             if self._check_duplicates:
--> 342                 self.check_duplicate_keys()
    343                 # Re-intializing to empty list for next batch

/opt/conda/lib/python3.8/site-packages/datasets/arrow_writer.py in check_duplicate_keys(self)
    352             if hash in tmp_record:
--> 353                 raise DuplicatedKeysError(key)
    354             else:

DuplicatedKeysError: FAILURE TO GENERATE DATASET !
Found duplicate Key: 519
Keys should be unique and deterministic in nature

Version: datasets==1.11.0

@albertvillanova (Member)

Fixed by #2555.

@lhoestq (Member)

lhoestq commented Sep 6, 2021

The wikicorpus issue has been fixed by #2844

We'll do a new release of datasets soon :)
