Questions about XSUM #672
Comments
We should try to regenerate the data using the official script. I'll let you know when the dataset is updated.
Thanks, looking forward to hearing your update on this thread. This is a blocking issue for us; we would appreciate any progress on this front. We can also help with the fix, if you deem it appropriate.
I just started the generation on my side, I'll let you know how it goes :)
Hmm, after a first run I'm still missing 136,668/226,711 URLs.
Update: I'm missing 36/226,711 URLs, but I haven't managed to download them yet.
Thanks! That sounds like a reasonable number!
So I managed to download them all, but when parsing only 226,181/226,711 worked.
Maybe @sshleifer can help; I think he's already played with xsum at one point.
Thanks @lhoestq
I gave up at an even earlier point. The dataset I use has 204,017 train examples.
@lhoestq @sshleifer like @jbragg said earlier, the main issue for us is that the current XSUM dataset (in your package) does not have the IDs provided by the original dataset (here is the file). We would appreciate it if you could update the XSUM dataset to include the instance IDs. The missing instances are also a problem, but likely not worth pursuing given their relatively small scale.
@lhoestq any chance we could update the HF-hosted dataset with the IDs in your new version? Happy to help if there's something I can do.
Well I couldn't parse what I downloaded.
Resolved via #754
Hi there ✋
I'm looking into your xsum dataset and I have several questions about it. So here is how I loaded the data:
The first issue is that the instance counts don't match what I see on the dataset's website (11,333 vs 11,334 for the test set; 204,017 vs 204,045 for the training set).
Any thoughts why? Perhaps @mariamabarham could help here, since she recently had a PR on this dataset, #289 (reviewed by @patrickvonplaten).
Another issue is that the instances don't seem to have IDs. The original dataset provides IDs for the instances: https://github.com/EdinburghNLP/XSum/blob/master/XSum-Dataset/XSum-TRAINING-DEV-TEST-SPLIT-90-5-5.json but to be able to use them, the dataset sizes need to match.
CC @jbragg
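To make the two problems above concrete, here is a rough sketch of how one might check the size gaps and attach IDs once the sizes agree. It is not the library's method. The helper names (`load_split_ids`, `size_gaps`, `attach_ids`) are hypothetical. It assumes the split file is a JSON object mapping split names to lists of IDs, and that the loaded examples come in the same order as the IDs in that file. Neither assumption is confirmed in this thread.

```python
import json
from pathlib import Path


def load_split_ids(path):
    """Read the split file.

    Assumed layout (not verified here): {"train": [...], "validation": [...], "test": [...]}.
    """
    return json.loads(Path(path).read_text(encoding="utf-8"))


def size_gaps(official_ids, loaded_counts):
    """Return {split: official count - loaded count} for each loaded split."""
    return {split: len(official_ids[split]) - n for split, n in loaded_counts.items()}


def attach_ids(examples, ids):
    """Pair examples with IDs by position; refuse to guess if the sizes differ.

    Positional pairing is only meaningful if both sides share the same order,
    which is an assumption of this sketch.
    """
    if len(examples) != len(ids):
        raise ValueError(f"size mismatch: {len(examples)} examples vs {len(ids)} ids")
    return [dict(example, id=instance_id) for example, instance_id in zip(examples, ids)]


# Counts reported in this issue: loaded copy vs. the dataset's website.
reported_loaded = {"train": 204_017, "test": 11_333}
reported_website = {"train": 204_045, "test": 11_334}
reported_gaps = {s: reported_website[s] - reported_loaded[s] for s in reported_loaded}
print(reported_gaps)  # {'train': 28, 'test': 1}
```

With the counts reported here, `attach_ids` would raise for both splits, which is why the sizes have to match before the IDs from the split file can be used.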