
Questions about XSUM #672

Closed
danyaljj opened this issue Sep 26, 2020 · 14 comments

danyaljj commented Sep 26, 2020

Hi there ✋

I'm looking into your xsum dataset and have several questions about it.
So here is how I loaded the data:

```python
>>> import datasets
>>> data = datasets.load_dataset('xsum', version='1.0.1')
>>> data['train']
Dataset(features: {'document': Value(dtype='string', id=None), 'summary': Value(dtype='string', id=None)}, num_rows: 204017)
>>> data['test']
Dataset(features: {'document': Value(dtype='string', id=None), 'summary': Value(dtype='string', id=None)}, num_rows: 11333)
```

The first issue is that the instance counts don't match those listed on the dataset's website (11,333 vs. 11,334 for the test set; 204,017 vs. 204,045 for the training set):

> … training (90%, 204,045), validation (5%, 11,332), and test (5%, 11,334) set.

Any thoughts on why? Perhaps @mariamabarham could help here, since she recently had a PR on this dataset #289 (reviewed by @patrickvonplaten).

Another issue is that the instances don't seem to have IDs. The original dataset provides IDs for the instances: /~https://github.com/EdinburghNLP/XSum/blob/master/XSum-Dataset/XSum-TRAINING-DEV-TEST-SPLIT-90-5-5.json, but to be able to use them, the dataset sizes need to match.
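For reference, that split file (assuming it maps each split name to a list of BBC article IDs) makes it easy to compute the expected counts and diff them against what `load_dataset` returns. A minimal sketch, using a toy stand-in for the JSON file rather than downloading the real one:

```python
import json

# Toy stand-in for XSum-TRAINING-DEV-TEST-SPLIT-90-5-5.json; the real file
# is assumed to map split names to lists of BBC article IDs.
split_json = json.dumps({
    "train": ["10000001", "10000002", "10000003"],
    "validation": ["10000004"],
    "test": ["10000005"],
})

splits = json.loads(split_json)

# Expected instance counts per split come straight from the ID lists.
expected = {name: len(ids) for name, ids in splits.items()}
print(expected)  # {'train': 3, 'validation': 1, 'test': 1}

# Counts actually observed (e.g. via len(data['train'])) can then be diffed;
# these toy numbers mimic a missing test instance.
actual = {"train": 3, "validation": 1, "test": 0}
missing = {name: expected[name] - actual.get(name, 0) for name in expected}
print(missing)  # {'train': 0, 'validation': 0, 'test': 1}
```

With the real split file, any nonzero entry in `missing` pinpoints which split lost instances during generation.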

CC @jbragg


lhoestq commented Sep 28, 2020

We should try to regenerate the data using the official script.
IIRC that's what we used originally, so I'm not sure why the counts don't match.

I'll let you know when the dataset is updated.

danyaljj commented
Thanks, looking forward to hearing your update on this thread.

This is a blocking issue for us; we would appreciate any progress on this front. We can also help with the fix, if you deem it appropriate.


lhoestq commented Sep 29, 2020

I just started the generation on my side, I'll let you know how it goes :)


lhoestq commented Sep 29, 2020

Hmm, after a first run I'm still missing 136,668/226,711 URLs.
I'll relaunch it tomorrow to try to get the remaining ones.


lhoestq commented Oct 1, 2020

Update: I'm missing 36/226,711 URLs, but I haven't managed to download them yet.


danyaljj commented Oct 1, 2020

Thanks! That sounds like a reasonable number!


lhoestq commented Oct 1, 2020

So I managed to download them all, but only 226,181/226,711 parsed successfully.
Not sure it's worth digging into the parsing failures at this point :/


lhoestq commented Oct 1, 2020

Maybe @sshleifer can help; I think he's already played with xsum at some point.


jbragg commented Oct 1, 2020

Thanks @lhoestq
It would be great to improve coverage, but IDs are the really crucial part for us. We'd really appreciate an update to the dataset with IDs either way!


sshleifer commented Oct 1, 2020

I gave up at an even earlier point. The dataset I use has 204,017 train examples.

danyaljj commented
@lhoestq @sshleifer, as @jbragg said earlier, the main issue for us is that the current XSUM dataset (in your package) does not include the IDs provided by the original dataset (here is the file.) We would appreciate it if you updated the XSUM dataset to include the instance IDs.

The missing instances are also a problem, but likely not worth pursuing given their relatively small number.
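Since the split sizes differ, attaching IDs by position is not safe. One possible workaround, sketched here with toy data under the assumption that an id-to-document mapping can be rebuilt from the official XSum data, is to key on the document text itself:

```python
# Hypothetical id -> document mapping, assumed rebuilt from the official
# XSum data; the real documents would be full BBC articles.
id_to_doc = {
    "10000001": "The cat sat on the mat.",
    "10000002": "A new bridge opened today.",
}
doc_to_id = {doc: i for i, doc in id_to_doc.items()}

# Toy records standing in for entries of data['train'].
records = [
    {"document": "A new bridge opened today.", "summary": "Bridge opens."},
    {"document": "Unmatched text.", "summary": "No id available."},
]

# Annotate each record with its ID where a match exists; None otherwise.
for r in records:
    r["id"] = doc_to_id.get(r["document"])

print([r["id"] for r in records])  # ['10000002', None]
```

In practice exact-text matching may miss instances whose text was normalized differently during extraction, so the `None` entries would still need manual inspection.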


jbragg commented Oct 20, 2020

> So I managed to download them all but when parsing only 226,181/226,711 worked.

@lhoestq any chance we could update the HF-hosted dataset with the IDs in your new version? Happy to help if there's something I can do.


lhoestq commented Oct 20, 2020

Well, I couldn't parse what I downloaded.
Unfortunately, I don't think I'll be able to take a look at it this week.
I can send you what I got if you want to give it a shot, @jbragg.
Otherwise, feel free to re-run the xsum download script; maybe you'll be luckier than me.

mariosasko commented
Resolved via #754
