
Questions about XSUM #672

Closed
danyaljj opened this issue Sep 26, 2020 · 14 comments

danyaljj commented Sep 26, 2020

Hi there ✋

I'm looking into your xsum dataset and have several questions about it.
So here is how I loaded the data:

```python
>>> import datasets
>>> data = datasets.load_dataset('xsum', version='1.0.1')
>>> data['train']
Dataset(features: {'document': Value(dtype='string', id=None), 'summary': Value(dtype='string', id=None)}, num_rows: 204017)
>>> data['test']
Dataset(features: {'document': Value(dtype='string', id=None), 'summary': Value(dtype='string', id=None)}, num_rows: 11333)
```

The first issue is that the instance counts don't match those listed on the dataset's website (11,333 vs. 11,334 for the test set; 204,017 vs. 204,045 for the training set):

> … training (90%, 204,045), validation (5%, 11,332), and test (5%, 11,334) set.

Any thoughts on why? Perhaps @mariamabarham could help here, since she recently had a PR on this dataset #289 (reviewed by @patrickvonplaten).

Another issue is that the instances don't seem to have IDs. The original dataset provides IDs for the instances: /~https://github.com/EdinburghNLP/XSum/blob/master/XSum-Dataset/XSum-TRAINING-DEV-TEST-SPLIT-90-5-5.json, but to be able to use them, the dataset sizes need to match.
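For reference, that split file (assuming it maps each split name to a list of BBC article IDs) makes it easy to compute the expected counts and diff them against what `load_dataset` returns. A minimal sketch, using a toy stand-in for the JSON file rather than downloading the real one:

```python
import json

# Toy stand-in for XSum-TRAINING-DEV-TEST-SPLIT-90-5-5.json; the real file
# is assumed to map split names to lists of BBC article IDs.
split_json = json.dumps({
    "train": ["10000001", "10000002", "10000003"],
    "validation": ["10000004"],
    "test": ["10000005"],
})

splits = json.loads(split_json)

# Expected instance counts per split come straight from the ID lists.
expected = {name: len(ids) for name, ids in splits.items()}
print(expected)  # {'train': 3, 'validation': 1, 'test': 1}

# Counts actually observed (e.g. via len(data['train'])) can then be diffed;
# these toy numbers mimic a missing test instance.
actual = {"train": 3, "validation": 1, "test": 0}
missing = {name: expected[name] - actual.get(name, 0) for name in expected}
print(missing)  # {'train': 0, 'validation': 0, 'test': 1}
```

With the real split file, any nonzero entry in `missing` pinpoints which split lost instances during generation.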

CC @jbragg


lhoestq commented Sep 28, 2020

We should try to regenerate the data using the official script.
IIRC that's what we used originally, so I'm not sure why the counts don't match.

I'll let you know when the dataset is updated.

danyaljj commented
Thanks, looking forward to hearing your update on this thread.

This is a blocking issue for us; we would appreciate any progress on this front. We can also help with the fix, if you deem it appropriate.


lhoestq commented Sep 29, 2020

I just started the generation on my side, I'll let you know how it goes :)


lhoestq commented Sep 29, 2020

Hmm, after a first run I'm still missing 136,668/226,711 URLs.
I'll relaunch it tomorrow to try to get the remaining ones.


lhoestq commented Oct 1, 2020

Update: I'm missing 36/226,711 URLs, but I haven't managed to download them yet.


danyaljj commented Oct 1, 2020

Thanks! That sounds like a reasonable number!


lhoestq commented Oct 1, 2020

So I managed to download them all, but only 226,181/226,711 parsed successfully.
Not sure it's worth digging into the parsing failures at this point :/


lhoestq commented Oct 1, 2020

Maybe @sshleifer can help; I think he's already played with xsum at some point.


jbragg commented Oct 1, 2020

Thanks @lhoestq
It would be great to improve coverage, but IDs are the really crucial part for us. We'd really appreciate an update to the dataset with IDs either way!


sshleifer commented Oct 1, 2020

I gave up at an even earlier point. The dataset I use has 204,017 train examples.

danyaljj commented
@lhoestq @sshleifer, as @jbragg said earlier, the main issue for us is that the current XSUM dataset (in your package) does not include the IDs provided by the original dataset (here is the file.) We would appreciate it if you updated the XSUM dataset to include the instance IDs.

The missing instances are also a problem, but likely not worth pursuing given their relatively small number.
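Since the split sizes differ, attaching IDs by position is not safe. One possible workaround, sketched here with toy data under the assumption that an id-to-document mapping can be rebuilt from the official XSum data, is to key on the document text itself:

```python
# Hypothetical id -> document mapping, assumed rebuilt from the official
# XSum data; the real documents would be full BBC articles.
id_to_doc = {
    "10000001": "The cat sat on the mat.",
    "10000002": "A new bridge opened today.",
}
doc_to_id = {doc: i for i, doc in id_to_doc.items()}

# Toy records standing in for entries of data['train'].
records = [
    {"document": "A new bridge opened today.", "summary": "Bridge opens."},
    {"document": "Unmatched text.", "summary": "No id available."},
]

# Annotate each record with its ID where a match exists; None otherwise.
for r in records:
    r["id"] = doc_to_id.get(r["document"])

print([r["id"] for r in records])  # ['10000002', None]
```

In practice exact-text matching may miss instances whose text was normalized differently during extraction, so the `None` entries would still need manual inspection.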


jbragg commented Oct 20, 2020

> So I managed to download them all but when parsing only 226,181/226,711 worked.

@lhoestq any chance we could update the HF-hosted dataset with the IDs in your new version? Happy to help if there's something I can do.


lhoestq commented Oct 20, 2020

Well, I couldn't parse what I downloaded.
Unfortunately, I don't think I'll be able to take a look at it this week.
I can send you what I got if you want to give it a shot, @jbragg.
Otherwise, feel free to re-run the xsum download script; maybe you'll be luckier than me.

mariosasko commented
Resolved via #754
