Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid using skip() in hf_datasets #835

Closed
wants to merge 2 commits into from
Closed

Conversation

fegin
Copy link
Contributor

@fegin fegin commented Feb 12, 2025

Stack from ghstack (oldest at bottom):

We found out that skip() may not behave as expected. This is a workaround solution while we are investiging the root cause.

[ghstack-poisoned]
@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Meta Open Source bot. label Feb 12, 2025
@fegin fegin requested a review from mori360 February 12, 2025 00:13
[ghstack-poisoned]
@fegin fegin marked this pull request as draft February 12, 2025 00:23
if isinstance(self._data, Dataset) and self._sample_idx == len(self._data):
return iter([])

return iter(self._data.skip(self._sample_idx))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

two questions:

  1. The wrong behavior is not captured by unit test /~https://github.com/pytorch/torchtitan/blob/main/tests/unit_tests/test_dataset_checkpointing.py. Is it because the error only happens in distributed environment, on some rank but not others?
  2. Does skip fail for both map-style datasets (c4_test) and IterableDatasets (c4)? Asking because from my PR uniformly use skip for both (map-style) Dataset and IterableDataset #521 I recall skip for IterableDataset is a "new" feature.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am working on testing loss issue(whether still happens) and whether difference between datasets c4/c4_test right now.
unit_test would be the next PR after fix.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. It only happens in distributed environment and on certain ranks only.
  2. I could not reproduce this with c4_test.

I don't know the detail of the root cause though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would not land this PR but I verified all ranks output the expected sample even after resuming from checkpoints, @mori360 would work on loss parity with unittest verifcation.

@fegin fegin closed this Feb 28, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Meta Open Source bot.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants