-
Notifications
You must be signed in to change notification settings - Fork 370
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix type instability in sort for few columns case and fix issorted bug #2746
Changes from 7 commits
ee56676
b9ac3ec
607b89b
ce18e13
88a99bc
9489491
e1241d1
786db70
a16741d
d53e6ae
38ffe35
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change | ||||||||
---|---|---|---|---|---|---|---|---|---|---|
|
@@ -93,14 +93,15 @@ ordering(col::ColumnIndex, lt::Function, by::Function, rev::Bool, order::Orderin | |||||||||
|
||||||||||
# DFPerm: defines a permutation on a particular DataFrame, using | ||||||||||
# a single ordering (O<:Ordering) or a list of column orderings | ||||||||||
# (O<:AbstractVector{Ordering}), one per DataFrame column | ||||||||||
# (NTuple of Ordering), one per DataFrame column | ||||||||||
# | ||||||||||
# If a user only specifies a few columns, the DataFrame | ||||||||||
# contained in the DFPerm only contains those columns, and | ||||||||||
# the permutation induced by this ordering is used to | ||||||||||
# sort the original (presumably larger) DataFrame | ||||||||||
|
||||||||||
struct DFPerm{O<:Union{Ordering, AbstractVector}, T<:Tuple} <: Ordering | ||||||||||
struct DFPerm{O<:Union{Ordering, Tuple{Vararg{Ordering}}}, | ||||||||||
T<:Tuple{Vararg{AbstractVector}}} <: Ordering | ||||||||||
ord::O | ||||||||||
cols::T | ||||||||||
end | ||||||||||
|
@@ -109,24 +110,48 @@ function DFPerm(ords::AbstractVector{O}, cols::T) where {O<:Ordering, T<:Tuple} | |||||||||
if length(ords) != length(cols) | ||||||||||
error("DFPerm: number of column orderings does not equal the number of columns") | ||||||||||
end | ||||||||||
DFPerm{typeof(ords), T}(ords, cols) | ||||||||||
DFPerm(Tuple(ords), cols) | ||||||||||
end | ||||||||||
|
||||||||||
DFPerm(o::Union{Ordering, AbstractVector}, df::AbstractDataFrame) = | ||||||||||
DFPerm(o, ntuple(i -> df[!, i], ncol(df))) | ||||||||||
|
||||||||||
# get ordering function for the i-th column used for ordering | ||||||||||
col_ordering(o::DFPerm{O}, i::Int) where {O<:Ordering} = o.ord | ||||||||||
col_ordering(o::DFPerm{V}, i::Int) where {V<:AbstractVector} = o.ord[i] | ||||||||||
@inline col_ordering(o::Ordering) = o | ||||||||||
@inline ord_tail(o::Ordering) = o | ||||||||||
@inline col_ordering(o::Tuple{Vararg{Ordering}}) = @inbounds o[1] | ||||||||||
@inline ord_tail(o::Tuple{Vararg{Ordering}}) = Base.tail(o) | ||||||||||
|
||||||||||
Sort.lt(o::DFPerm{<:Any, Tuple{}}, a, b) = false | ||||||||||
|
||||||||||
function Sort.lt(o::DFPerm{<:Any, <:Tuple}, a, b) | ||||||||||
ord = o.ord | ||||||||||
cols = o.cols | ||||||||||
length(cols) > 16 && return unstable_lt(ord, cols, a, b) | ||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Add a comment explaining how the 16 threshold was chosen?
Suggested change
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. added. I have not tuned 16 specifically. I just assume that 16 is a safe threshold. Probably we could pass some higher number here, but I think that normally one does not sort on more than something like 4 columns. |
||||||||||
|
||||||||||
function Sort.lt(o::DFPerm, a, b) | ||||||||||
@inbounds for i in 1:length(o.cols) | ||||||||||
ord = col_ordering(o, i) | ||||||||||
col = o.cols[i] | ||||||||||
@inbounds begin | ||||||||||
bkamins marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||||||
ord1 = col_ordering(ord) | ||||||||||
col = first(cols) | ||||||||||
va = col[a] | ||||||||||
vb = col[b] | ||||||||||
lt(ord, va, vb) && return true | ||||||||||
lt(ord, vb, va) && return false | ||||||||||
lt(ord1, va, vb) && return true | ||||||||||
lt(ord1, vb, va) && return false | ||||||||||
end | ||||||||||
return Sort.lt(DFPerm(ord_tail(ord), Base.tail(cols)), a, b) | ||||||||||
end | ||||||||||
|
||||||||||
# get ordering function for the i-th column used for ordering | ||||||||||
col_ordering(o::Ordering, i::Int) where {O<:Ordering} = o | ||||||||||
col_ordering(o::Tuple{Vararg{Ordering}}, i::Int) = @inbounds o[i] | ||||||||||
|
||||||||||
function unstable_lt(ord::Union{Ordering, Tuple{Vararg{Ordering}}}, | ||||||||||
cols::Tuple{Vararg{AbstractVector}}, a, b) | ||||||||||
for i in 1:length(cols) | ||||||||||
ordi = col_ordering(ord, i) | ||||||||||
@inbounds coli = cols[i] | ||||||||||
@inbounds va = coli[a] | ||||||||||
@inbounds vb = coli[b] | ||||||||||
lt(ordi, va, vb) && return true | ||||||||||
lt(ordi, vb, va) && return false | ||||||||||
end | ||||||||||
false # a and b are equal | ||||||||||
end | ||||||||||
|
@@ -325,7 +350,7 @@ function Base.issorted(df::AbstractDataFrame, cols=[]; | |||||||||
end | ||||||||||
if cols isa ColumnIndex | ||||||||||
return issorted(df[!, cols], lt=lt, by=by, rev=rev, order=order) | ||||||||||
elseif length(cols) == 1 | ||||||||||
elseif cols isa AbstractVector{<:ColumnIndex} && length(cols) == 1 | ||||||||||
return issorted(df[!, cols[1]], lt=lt, by=by, rev=rev, order=order) | ||||||||||
else | ||||||||||
return issorted(1:nrow(df), ordering(df, cols, lt, by, rev, order)) | ||||||||||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it make sense to make
O
always aTuple{Vararg{Ordering}}
? That would avoid the need forord isa Ordering
below.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem is that if you do
sort(df)
you want a singleOrdering
that is reused to avoid excessive compilation. I would assume thatord isa Ordering
check should be optimized out by the compiler so it should have no performance penalty.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried simplifying it but I always ended up with compiler allocating.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't get it. Why would wrapping the
Ordering
in a one-element tuple trigger more compilation or allocations?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah - one element
Tuple
is not a problem, but in this case we would anyway have to branch if the tuple is one element or matches length of the column vector.What allocates is if we wanted to have one tuple where each element would hold a tuple consisting of order and column.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. Anyway always wrapping
Ordering
in a tuple sounds simpler conceptually.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will change it if I can make work it fast.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the conclusion regarding this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I decided to leave the union (as in the original design) as in the end the logic was simpler (through dispatch).