Adding to_tf_dataset method #2731
Merged
46 commits
92cad15 Rebase onto master (Rocketknight1)
74b5bad Support multiple label_cols, replaced tokenizer with collate_fn, supp… (Rocketknight1)
97917bc Standardize int and float dtypes to keep TF happy (Rocketknight1)
4eb79f5 Add a prefetch buffer for improved performance (Rocketknight1)
bed394a TF dataset is actually kinda performant now! (Rocketknight1)
ea525a2 TF dataset is actually kinda performant now! (Rocketknight1)
d3a8140 Style pass (Rocketknight1)
3ce6dc4 Helpful error message if my code gets caught off-guard by unexpected … (Rocketknight1)
67c0657 Style pass (Rocketknight1)
2963f0a Added drop_remainder argument, removed pad_to (Rocketknight1)
7f11d76 Correct shape signatures when we're not dropping the remainder (Rocketknight1)
bbf6197 Style pass (Rocketknight1)
f902bde Support ClassLabel columns too! (Rocketknight1)
990f150 Re-enable `tf.ragged` by avoiding `tf.ragged.constant` unless absolut… (Rocketknight1)
fa06206 Style pass (Rocketknight1)
29415cd Adding a comment to explain myself in tf_formatter.py (Rocketknight1)
ca93c34 Fixes for shuffling and the case where the collator adds new columns (Rocketknight1)
d78cd50 Style pass (Rocketknight1)
0bf0050 Ensuring we respect TF dtype args (Rocketknight1)
6c91fc7 Style pass (Rocketknight1)
1954862 Updating tests (Rocketknight1)
7f2a8f1 Updating tests (Rocketknight1)
6eef188 Fixing things so they work in TF2.6 (Rocketknight1)
a63dfb9 Style pass (Rocketknight1)
d7048a4 Correctly set output shapes - fixes a whole lot of issues (Rocketknight1)
56ea08f Fix an embarrassing regression bug (Rocketknight1)
2ddf7c6 Style pass (Rocketknight1)
ddfda69 Added `config.TF_AVAILABLE` checks and dict literals (Rocketknight1)
c87d47e Handling for special cases around label/labels and very nested dtypes (Rocketknight1)
e7d1ce8 Fix for accidentally shuffling even when flag was False (Rocketknight1)
48045fb Adding dummy labels by default (Rocketknight1)
ec4f7d4 Adding docstrings and type hints (Rocketknight1)
88e9f1e Style pass (Rocketknight1)
a7b4574 Add tests, bugfix to handling scalar columns (Rocketknight1)
b35267d Style pass (Rocketknight1)
6273d73 Fix to `numpy_pad` (Rocketknight1)
4ff6d2e Replace assertion with more robust syntax (Rocketknight1)
589c575 Add cleanup deletion of tf_dataset in tests (Rocketknight1)
d70fe94 Rebasing onto Master (Rocketknight1)
a189740 Fixes for the new approach (Rocketknight1)
c8f251b Force dtype to ensure Windows compatibility (Rocketknight1)
f1f8888 Fixing things because I am bad at merging (Rocketknight1)
ef9a7bb Fix issues with passing a mutable list to columns argument (Rocketknight1)
b8523e4 Update src/datasets/arrow_dataset.py (lhoestq)
46c2507 Merge branch 'master' into tf_dataset_conversion (Rocketknight1)
397bcb7 Fix unused import (Rocketknight1)
Would this work for string types or nested types?
I've had some success with nested dtypes (in multiple-choice datasets). This does fail on string types, though: the `tf.data.Dataset` is intended to be passed straight to a model, so the assumption was that everything coming out of it would be convertible to a `tf.Tensor`. We could possibly make strings work in this context, but I'd need to think about a more generic approach to building the dataset and doing shape inference.
OK! Maybe we can mention this in the docstring?
I just mentioned in the docstring that only numeric data is expected :)
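For anyone reading this thread later, here is a rough sketch of the "numeric data only" constraint discussed above. It is not code from this PR: `is_numeric` and `non_numeric_columns` are hypothetical helpers written in plain Python, and the sample batch is made up. The idea is just to spot columns (such as raw strings) that would not be convertible to a numeric tensor, so they can be dropped or tokenized before conversion.

```python
from numbers import Number


def is_numeric(value):
    """Return True if value is a number or an arbitrarily nested list/tuple of numbers."""
    if isinstance(value, (list, tuple)):
        return all(is_numeric(item) for item in value)
    # bool is a subclass of int, so booleans count as numeric here.
    return isinstance(value, Number)


def non_numeric_columns(batch):
    """Given a dict mapping column names to lists of values, return the names
    of columns containing anything other than (possibly nested) numbers."""
    return [name for name, values in batch.items() if not is_numeric(values)]


# Made-up example batch; column names are illustrative only.
batch = {
    "input_ids": [[101, 2023, 102], [101, 102]],  # nested ints, possibly ragged
    "label": [0, 1],                              # scalar ints
    "text": ["a sentence", "another sentence"],   # strings: flagged
}
print(non_numeric_columns(batch))  # prints ['text']
```

Nested (even ragged) numeric columns pass this check, which matches the observation above that nested dtypes have worked while string columns fail.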