Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extend text to support yielding lines, paragraphs or documents #3442

Merged
merged 4 commits into from
Dec 20, 2021

Conversation

albertvillanova
Copy link
Member

@albertvillanova albertvillanova commented Dec 16, 2021

Add config.row option to text module to allow yielding lines (default, current case), paragraphs or documents.

Feel free to comment on the name of the config parameter row:

  • Currently, the docs state datasets are made of rows and columns
  • Other names I considered: example, item

@albertvillanova albertvillanova changed the title Extend text to suport yielding lines, paragraphs or documents Extend text to support yielding lines, paragraphs or documents Dec 16, 2021
Copy link
Member

@lhoestq lhoestq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome !

@lhoestq
Copy link
Member

lhoestq commented Dec 17, 2021

The parameter can also be named split_by with values "line", "paragraph" or "document" (no 's' at the end)

@albertvillanova
Copy link
Member Author

albertvillanova commented Dec 17, 2021

The parameter can also be named split_by with values "line", "paragraph" or "document" (no 's' at the end)

@lhoestq @mariosasko I would avoid the term split in this context and keep it only for "train", "validation" and "test" splits.

Please note that in the documentation, one of the terms more frequently used in this context is "row":

Other of the terms more frequently used in the docs (in the code as well) is "example":

Less frequently used: "item":

Other term used in the docs (although less frequently) is "sample". The advantage of this word is that it is also a verb, so we can use the parameter: "sample_by" (if you insist on using a verb instead of a noun).

In summary, these proposals:

  • config.row
  • config.example
  • config.item
  • config.sample
  • config.sample_by

@lhoestq
Copy link
Member

lhoestq commented Dec 20, 2021

I like sample_by. Another idea I had was separate_by.

It could also be sampling, sampling_method, separation_method.

Not a big fan of the proposed nouns alone since they are very generic, that's why I tried to have something more specific.

I also agree that we actually should avoid split to avoid any confusion

@mariosasko
Copy link
Collaborator

Thanks for the analysis of the used terms. I also like sample_by (separate_by is good too).

@albertvillanova albertvillanova merged commit 395e426 into master Dec 20, 2021
@albertvillanova albertvillanova deleted the extend-text branch December 20, 2021 16:39
@lhoestq
Copy link
Member

lhoestq commented Dec 20, 2021

Thank you !! :D

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants