[FEA] Sort categorical ids by frequency #799

karlhigley · 2021-05-10T15:49:20Z

Is your feature request related to a problem? Please describe.
I'd like to use Tensorflow's log_uniform_candidate_sampler (which is the default for sampled_softmax_loss.) log_uniform_candidate_sampler relies on ordered ids to sample from a Zipfian distribution.

Describe the solution you'd like
Ideally the categorify op would output ids in frequency order by default.

Describe alternatives you've considered
We could forego support for this candidate sampling strategy, or use a flag to enable ordered ids.

Additional context
TF Sampled Softmax Loss

sampled_values	a tuple of (sampled_candidates, true_expected_count, sampled_expected_count) returned by a *_candidate_sampler function. (if None, we default to log_uniform_candidate_sampler)

TF Log Uniform Candidate Sampler

This sampler is useful when the target classes approximately follow such a distribution - for example, if the classes represent words in a lexicon sorted in decreasing order of frequency. If your classes are not ordered by decreasing frequency, do not use this op.

The text was updated successfully, but these errors were encountered:

karlhigley · 2021-05-10T15:51:26Z

This would also help us with feature caching, because we could warm up the cache with features for the items with the lowest ids (i.e. the most frequent items), up to the threshold of however many items will fit in the cache.

benfred · 2021-05-10T16:27:42Z

I'd also like to see this.

We're currently sorting the categoricals, which has some downsides. Having the most common categorical id's grouped together with low ids will help with caching (both in feature caching, but also L2 memory cache during training) - but also having sorted categorical's won't work well with incremental preprocessing (#798 and #597)

This will require us to:

Update the 'fit' code to sort the categorical id's by frequency. This is a relatively easy change, since we already have code to use calculate the count etc for the joingroupby op - we just need to generate the count and then use that as the sort order for generating id's.
Update the 'transform' function to not require sorted categoricals. This is a little trickier because we are using cudf's merge or search_sorted functionality to do the category id lookup. One first step here might be to decouple the id from the row# .

rnyak · 2021-05-10T23:52:09Z

@benfred @karlhigley dont we do that already with categorify with max_size param? this should sort a categorical columns based on their frequency values.. did you try?

karlhigley · 2021-05-11T02:11:53Z

Haven't tried it! I couldn't tell from the docs/comments whether or not the output would be strictly sorted by descending frequency. (Note: I have no idea whether the output when using max_size is sorted by descending frequency. It totally might be, but I didn't realize I should try it to find out.)

rnyak · 2021-05-11T13:53:48Z

@karlhigley good point. we can add an explanation in the docs.

benfred · 2021-05-13T03:00:48Z

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[FEA] Sort categorical ids by frequency #799

[FEA] Sort categorical ids by frequency #799

karlhigley commented May 10, 2021 •

edited

Loading

karlhigley commented May 10, 2021

benfred commented May 10, 2021

rnyak commented May 10, 2021

karlhigley commented May 11, 2021

rnyak commented May 11, 2021

benfred commented May 13, 2021

benfred commented Jun 21, 2021

[FEA] Sort categorical ids by frequency #799

[FEA] Sort categorical ids by frequency #799

Comments

karlhigley commented May 10, 2021 • edited Loading

karlhigley commented May 10, 2021

benfred commented May 10, 2021

rnyak commented May 10, 2021

karlhigley commented May 11, 2021

rnyak commented May 11, 2021

benfred commented May 13, 2021

benfred commented Jun 21, 2021

karlhigley commented May 10, 2021 •

edited

Loading