Keys should be unique error on code_search_net #2552
Comments
Two questions:
Thanks for reporting. There was indeed an issue with the keys: the key was the sum of the file id and the row id, which resulted in collisions. I just opened a PR to fix this: #2555.

To help users debug this kind of error, we could try to show a message like this:

DuplicateKeysError: both the 42nd and 1337th examples have the same key `48`.
Please fix the dataset script at <path/to/the/dataset/script>

This way users know what to look for if they want to debug this issue. I opened an issue to track this: #2556.
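For illustration, a minimal sketch of why summing the two ids collides, and how a composite string key avoids it. The names `file_id` and `row_id` follow the description above; the helper is hypothetical:

```python
# Keys built by integer addition collide: distinct (file_id, row_id)
# pairs can sum to the same value.
assert 1 + 47 == 47 + 1 == 48  # two different examples, one key

# A composite string key keeps each pair distinct and deterministic.
def make_key(file_id: int, row_id: int) -> str:
    return f"{file_id}_{row_id}"

assert make_key(1, 47) != make_key(47, 1)  # "1_47" vs "47_1"
```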
> and are we sure there are not a lot of datasets which are now broken with this change?

Thanks to the dummy data, we know for sure that most of them work as expected.

I found one issue on
Hi! I got the same error when loading another dataset: `load_dataset('wikicorpus', 'raw_en')`. Traceback:
---------------------------------------------------------------------------
DuplicatedKeysError                       Traceback (most recent call last)
/opt/conda/lib/python3.8/site-packages/datasets/builder.py in _prepare_split(self, split_generator)
   1109                 example = self.info.features.encode_example(record)
-> 1110                 writer.write(example, key)
   1111             finally:
/opt/conda/lib/python3.8/site-packages/datasets/arrow_writer.py in write(self, example, key, writer_batch_size)
    341         if self._check_duplicates:
--> 342             self.check_duplicate_keys()
    343             # Re-intializing to empty list for next batch
/opt/conda/lib/python3.8/site-packages/datasets/arrow_writer.py in check_duplicate_keys(self)
    352         if hash in tmp_record:
--> 353             raise DuplicatedKeysError(key)
    354         else:
DuplicatedKeysError: FAILURE TO GENERATE DATASET !
Found duplicate Key: 519
Keys should be unique and deterministic in nature

Version: datasets==1.11.0
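For context, dataset scripts feed the writer through `_generate_examples`, which yields `(key, example)` pairs, and the duplicate check above rejects any repeated key. A minimal sketch of a builder whose keys pass the check; the builder name and fields are hypothetical:

```python
import datasets

class ToyDataset(datasets.GeneratorBasedBuilder):
    """Hypothetical builder with unique, deterministic keys."""

    def _info(self):
        return datasets.DatasetInfo(
            features=datasets.Features({"text": datasets.Value("string")})
        )

    def _split_generators(self, dl_manager):
        return [datasets.SplitGenerator(name=datasets.Split.TRAIN)]

    def _generate_examples(self):
        rows = ["a", "b", "c"]
        for idx, text in enumerate(rows):
            # enumerate gives each example a key that is unique and
            # reproducible across runs, so check_duplicate_keys passes.
            yield idx, {"text": text}
```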
Fixed by #2555.
The wikicorpus issue has been fixed by #2844. We'll do a new release of `datasets` soon.
Describe the bug
Loading `code_search_net` seems not possible at the moment.

Steps to reproduce the bug
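A minimal way to trigger the error, assuming the default configuration of `code_search_net`:

```python
from datasets import load_dataset

# With datasets 1.8.1.dev0 this raised DuplicatedKeysError while
# generating the dataset (fixed by PR #2555).
dataset = load_dataset('code_search_net')
```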
Environment info
`datasets` version: 1.8.1.dev0