[FEA] Sort categorical ids by frequency #799
Comments
This would also help us with feature caching, because we could warm up the cache with features for the items with the lowest ids (i.e. the most frequent items), up to the threshold of however many items will fit in the cache.
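The warm-up idea in this comment could be sketched roughly as follows. This is a minimal illustration, not an existing API: `warm_up_cache`, `load_features`, `cache_capacity`, and `first_id` are hypothetical names, and the sketch assumes ids have already been assigned in descending frequency order.

```python
def warm_up_cache(load_features, cache_capacity, num_items, first_id=0):
    """Preload features for the lowest ids.

    Under frequency-ordered ids (an assumption here), the lowest ids
    are the most frequent items, so they are the best cache candidates.
    """
    # Stop at whichever comes first: the cache limit or the last item id.
    last_id = first_id + min(cache_capacity, num_items)
    return {item_id: load_features(item_id) for item_id in range(first_id, last_id)}


# Hypothetical feature loader: returns a small dummy vector per item id.
cache = warm_up_cache(lambda i: [float(i)] * 4, cache_capacity=3, num_items=10)
```

With ids sorted by value rather than by frequency, the same loop would preload an arbitrary slice of the catalog instead of the hottest items.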
I'd also like to see this. We're currently sorting the categoricals, which has some downsides. Grouping the most common categorical ids together at low ids will help with caching (both feature caching and the L2 memory cache during training). Sorted categoricals also won't work well with incremental preprocessing (#798 and #597). This will require us to:
@benfred @karlhigley don't we do that already with categorify with
Haven't tried it! I couldn't tell from the docs/comments whether or not the output would be strictly sorted by descending frequency. (Note: I have no idea whether the output when using
@karlhigley good point. We can add an explanation in the docs.
See also #811 -
We should also benchmark this (tracked in a separate ticket).
Is your feature request related to a problem? Please describe.
I'd like to use TensorFlow's log_uniform_candidate_sampler (which is the default for sampled_softmax_loss). log_uniform_candidate_sampler relies on ordered ids to sample from a Zipfian distribution.
Describe the solution you'd like
Ideally the categorify op would output ids in frequency order by default.
Describe alternatives you've considered
We could forgo support for this candidate sampling strategy, or use a flag to enable ordered ids.
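For illustration, frequency-ordered ids could be produced along these lines. This is a plain-Python sketch of the requested behavior, not the categorify implementation; `frequency_ordered_ids` and `start_id` are hypothetical names, and a real library might reserve low ids for nulls or out-of-vocabulary values.

```python
from collections import Counter


def frequency_ordered_ids(values, start_id=0):
    """Map each category to an integer id, most frequent category first.

    start_id is an assumption for this sketch; raise it if low ids
    must stay reserved (e.g. for nulls).
    """
    counts = Counter(values)
    # Counter keeps first-seen order and sorted() is stable, so ties
    # are broken by first appearance, which keeps the mapping deterministic.
    ordered = sorted(counts, key=lambda v: -counts[v])
    return {value: start_id + rank for rank, value in enumerate(ordered)}


items = ["b", "a", "b", "c", "b", "a"]
mapping = frequency_ordered_ids(items)
encoded = [mapping[v] for v in items]
```

Here "b" occurs most often and so receives the lowest id, followed by "a" and then "c".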
Additional context
TF Sampled Softmax Loss
TF Log Uniform Candidate Sampler
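To show why id order matters for this sampler, the sketch below evaluates the sampling probability given in the TensorFlow documentation, P(class) = (log(class + 2) - log(class + 1)) / log(range_max + 1). The helper name `log_uniform_prob` is hypothetical, and the code only computes the distribution; it does not call TensorFlow.

```python
import math


def log_uniform_prob(class_id, range_max):
    """Probability that the log-uniform sampler draws class_id in [0, range_max)."""
    return (math.log(class_id + 2) - math.log(class_id + 1)) / math.log(range_max + 1)


range_max = 1000
probs = [log_uniform_prob(i, range_max) for i in range(range_max)]

# The probabilities fall monotonically as the id grows, so the sampler
# only approximates the data when low ids are the most frequent items.
is_decreasing = all(a > b for a, b in zip(probs, probs[1:]))
```

If ids are assigned in any other order, rare items with low ids would be drawn as negatives far more often than their true frequency suggests.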