Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] Sort categorical ids by frequency #799

Closed
karlhigley opened this issue May 10, 2021 · 7 comments · Fixed by #963
Closed

[FEA] Sort categorical ids by frequency #799

karlhigley opened this issue May 10, 2021 · 7 comments · Fixed by #963
Assignees
Labels
enhancement New feature or request P1

Comments

@karlhigley
Copy link
Contributor

karlhigley commented May 10, 2021

Is your feature request related to a problem? Please describe.
I'd like to use Tensorflow's log_uniform_candidate_sampler (which is the default for sampled_softmax_loss.) log_uniform_candidate_sampler relies on ordered ids to sample from a Zipfian distribution.

Describe the solution you'd like
Ideally the categorify op would output ids in frequency order by default.

Describe alternatives you've considered
We could forego support for this candidate sampling strategy, or use a flag to enable ordered ids.

Additional context
TF Sampled Softmax Loss

sampled_values a tuple of (sampled_candidates, true_expected_count, sampled_expected_count) returned by a *_candidate_sampler function. (if None, we default to log_uniform_candidate_sampler)

TF Log Uniform Candidate Sampler

This sampler is useful when the target classes approximately follow such a distribution - for example, if the classes represent words in a lexicon sorted in decreasing order of frequency. If your classes are not ordered by decreasing frequency, do not use this op.

@karlhigley karlhigley added the enhancement New feature or request label May 10, 2021
@karlhigley
Copy link
Contributor Author

This would also help us with feature caching, because we could warm up the cache with features for the items with the lowest ids (i.e. the most frequent items), up to the threshold of however many items will fit in the cache.

@benfred
Copy link
Member

benfred commented May 10, 2021

I'd also like to see this.

We're currently sorting the categoricals, which has some downsides. Having the most common categorical id's grouped together with low ids will help with caching (both in feature caching, but also L2 memory cache during training) - but also having sorted categorical's won't work well with incremental preprocessing (#798 and #597)

This will require us to:

  • Update the 'fit' code to sort the categorical id's by frequency. This is a relatively easy change, since we already have code to use calculate the count etc for the joingroupby op - we just need to generate the count and then use that as the sort order for generating id's.
  • Update the 'transform' function to not require sorted categoricals. This is a little trickier because we are using cudf's merge or search_sorted functionality to do the category id lookup. One first step here might be to decouple the id from the row# .

@rnyak
Copy link
Contributor

rnyak commented May 10, 2021

@benfred @karlhigley dont we do that already with categorify with max_size param? this should sort a categorical columns based on their frequency values.. did you try?

@karlhigley
Copy link
Contributor Author

Haven't tried it! I couldn't tell from the docs/comments whether or not the output would be strictly sorted by descending frequency. (Note: I have no idea whether the output when using max_size is sorted by descending frequency. It totally might be, but I didn't realize I should try it to find out.)

@rnyak
Copy link
Contributor

rnyak commented May 11, 2021

@karlhigley good point. we can add an explanation in the docs.

@benfred
Copy link
Member

benfred commented May 13, 2021

see also #811 -

@viswa-nvidia viswa-nvidia added this to the NVTabular v0.6 milestone Jun 2, 2021
@benfred benfred added the P1 label Jun 4, 2021
@benfred
Copy link
Member

benfred commented Jun 21, 2021

We should also benchmark this (tracked in a separate ticket).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request P1
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants