-
Notifications
You must be signed in to change notification settings - Fork 370
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BREAKING] Change stack
to create a PooledArray{String}
column by default
#2391
Conversation
This is less likely to confuse users (it's the dplyr and pandas default), and it would allow dropping the dependency on CategoricalArrays if we wanted to. The drawback of returning strings is that the order of old columns is only reflected by the order of rows, and no longer by the sorting order of levels. Also return a `PooledArray{Symbol}` when `variable_eltype=Symbol` is passed. That's more consistent and more efficient, e.g. if you want to group on symbols later. Any element type is now accepted, with `CategoricalValue{String}` being handled by the general path. In practice it would allow using a custom `AbstractString` type, though that's relatively unlikely.
I like this as I found myself almost always having to unwrap the column after a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this as well. I like defaulting to PooledArrays as a lighter dependency w/ the efficient storage, while still allowing CategoricalArray to work automatically. I think the more we can refactor things after this pattern, the better it will be all around; we want to make sure CategoricalArray "just works", but it would also be great to break the dependency.
stack
to create a PooledArray{String}
column by defaultstack
to create a PooledArray{String}
column by default
src/abstractdataframe/reshape.jl
Outdated
# this covers CategoricalArray{String} in particular, | ||
# as copyto! inserts levels in their order of appearance | ||
nms = names(df, ints_measure_vars) | ||
catnms = copyto!(similar(nms, variable_eltype), nms) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This relies on similar
for CategoricalValue{String}
producing a CategoricalVector
. Is this something that will be kept being provided in CategoricalArrays.jl? (I guess you intend so, but just making sure.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good question. Given improvements at JuliaLang/julia#37163, we may be able to keep that function in the end. Anyway if CategoricalArrays stops doing that, it will be in a major release so we'll be able to adapt. But yeah if we drop that we'll have to find another approach if we want to drop the CategoricalArrays dependency...
Concretely, for this PR, I could also simply go back to explicitly creating a CategoricalArray
. That would probably be clearer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe this is what we can do in this PR? I would go for the "smart" path at the moment we decide do drop CategoricalArrays.jl dependency of DataFrames.jl. I will post on #data to ask about it.
This is the goal and I think if we want to go for it we should clean up all places where we have it in 0.22. The major things left are:
|
Given #2321 I would go for removing CategoricalArrays.jl here as you proposed. Then in the docstring we should say that CategoricalArrays.jl should be loaded for what is recommended to work (to make sure we do not forget to add it later). I have also noticed that I think it makes no sense to create |
Right, see #2394.
Yes, the API is already there, I just need a way to implement it. The issue I've discovered is that if I want
Hopefully it should be easy.
OK I'll have a look.
Actually I think the benefit is exactly the same as for |
It is not the same, as |
Well at least julia> x = repeat(["a", "b"], outer=2)
4-element Array{String,1}:
"a"
"b"
"a"
"b"
julia> x[1] === x[3]
true But yeah with symbols the difference is that you're sure that two different pointers refer to two different symbols, which isn't the case for strings (that's a big problem for grouping). |
Well 😄 :
so you see that
|
Actually after looking at it I'd rather merge this and #2394 first, and only them stop reexporting CategoricalArrays: that can't be done before merging either of these, and that's a logically independent step. |
src/abstractdataframe/reshape.jl
Outdated
elseif variable_eltype <: String | ||
catnms = names(df, measure_vars) | ||
if variable_eltype === Symbol | ||
catnms = PooledArray(_names(df)[measure_vars]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there is probably not much benefit from it here but we can have it for consistency.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch. Yeah that's probably counter-productive here so I've reverted the change. What could make sense is to define copy(::RepeatedVector)
to return a PooledArray
, but that's non-breaking so we can do that later.
I have left two minor comments - can you please merge this PR when you decide on them. Thank you! |
Looks good to merge after the CI passes |
This is less likely to confuse users (it's the dplyr and pandas default), and it would allow dropping the dependency on CategoricalArrays if we wanted to.
The drawback of returning strings is that the order of old columns is only reflected by the order of rows, and no longer by the sorting order of levels.
Also return a
PooledArray{Symbol}
whenvariable_eltype=Symbol
is passed. That's more consistent and more efficient, e.g. if you want to group on symbols later.Any element type is now accepted, with
CategoricalValue{String}
being handled by the general path. In practice it would allow using a customAbstractString
type, though that's relatively unlikely.See #1947 for a previous discussion.