Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix type instability in sort for few columns case and fix issorted bug #2746

Merged
merged 11 commits into from
May 30, 2021
51 changes: 38 additions & 13 deletions src/abstractdataframe/sort.jl
Original file line number Diff line number Diff line change
Expand Up @@ -93,14 +93,15 @@ ordering(col::ColumnIndex, lt::Function, by::Function, rev::Bool, order::Orderin

# DFPerm: defines a permutation on a particular DataFrame, using
# a single ordering (O<:Ordering) or a list of column orderings
# (O<:AbstractVector{Ordering}), one per DataFrame column
# (NTuple of Ordering), one per DataFrame column
#
# If a user only specifies a few columns, the DataFrame
# contained in the DFPerm only contains those columns, and
# the permutation induced by this ordering is used to
# sort the original (presumably larger) DataFrame

struct DFPerm{O<:Union{Ordering, AbstractVector}, T<:Tuple} <: Ordering
struct DFPerm{O<:Union{Ordering, Tuple{Vararg{Ordering}}},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to make O always a Tuple{Vararg{Ordering}}? That would avoid the need for ord isa Ordering below.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is that if you do sort(df) you want a single Ordering that is reused to avoid excessive compilation. I would assume that ord isa Ordering check should be optimized out by the compiler so it should have no performance penalty.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried simplifying it but I always ended up with compiler allocating.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't get it. Why would wrapping the Ordering in a one-element tuple trigger more compilation or allocations?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah - one element Tuple is not a problem, but in this case we would anyway have to branch if the tuple is one element or matches length of the column vector.

What allocates is if we wanted to have one tuple where each element would hold a tuple consisting of order and column.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. Anyway always wrapping Ordering in a tuple sounds simpler conceptually.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will change it if I can make work it fast.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the conclusion regarding this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I decided to leave the union (as in the original design) as in the end the logic was simpler (through dispatch).

T<:Tuple{Vararg{AbstractVector}}} <: Ordering
ord::O
cols::T
end
Expand All @@ -109,24 +110,48 @@ function DFPerm(ords::AbstractVector{O}, cols::T) where {O<:Ordering, T<:Tuple}
if length(ords) != length(cols)
error("DFPerm: number of column orderings does not equal the number of columns")
end
DFPerm{typeof(ords), T}(ords, cols)
DFPerm(Tuple(ords), cols)
end

DFPerm(o::Union{Ordering, AbstractVector}, df::AbstractDataFrame) =
DFPerm(o, ntuple(i -> df[!, i], ncol(df)))

# get ordering function for the i-th column used for ordering
col_ordering(o::DFPerm{O}, i::Int) where {O<:Ordering} = o.ord
col_ordering(o::DFPerm{V}, i::Int) where {V<:AbstractVector} = o.ord[i]
@inline col_ordering(o::Ordering) = o
@inline ord_tail(o::Ordering) = o
@inline col_ordering(o::Tuple{Vararg{Ordering}}) = @inbounds o[1]
@inline ord_tail(o::Tuple{Vararg{Ordering}}) = Base.tail(o)

Sort.lt(o::DFPerm{<:Any, Tuple{}}, a, b) = false

function Sort.lt(o::DFPerm{<:Any, <:Tuple}, a, b)
ord = o.ord
cols = o.cols
length(cols) > 16 && return unstable_lt(ord, cols, a, b)
Copy link
Member

@nalimilan nalimilan May 16, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a comment explaining how the 16 threshold was chosen?

Suggested change
length(cols) > 16 && return unstable_lt(ord, cols, a, b)
# if there are too many columns fall back to type unstable mode to avoid high compilation cost
# it is expected that in practice users sort data frames on only few columns
length(cols) > 16 && return unstable_lt(ord, cols, a, b)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added. I have not tuned 16 specifically. I just assume that 16 is a safe threshold. Probably we could pass some higher number here, but I think that normally one does not sort on more than something like 4 columns.


function Sort.lt(o::DFPerm, a, b)
@inbounds for i in 1:length(o.cols)
ord = col_ordering(o, i)
col = o.cols[i]
@inbounds begin
bkamins marked this conversation as resolved.
Show resolved Hide resolved
ord1 = col_ordering(ord)
col = first(cols)
va = col[a]
vb = col[b]
lt(ord, va, vb) && return true
lt(ord, vb, va) && return false
lt(ord1, va, vb) && return true
lt(ord1, vb, va) && return false
end
return Sort.lt(DFPerm(ord_tail(ord), Base.tail(cols)), a, b)
end

# get ordering function for the i-th column used for ordering
col_ordering(o::Ordering, i::Int) where {O<:Ordering} = o
col_ordering(o::Tuple{Vararg{Ordering}}, i::Int) = @inbounds o[i]

function unstable_lt(ord::Union{Ordering, Tuple{Vararg{Ordering}}},
cols::Tuple{Vararg{AbstractVector}}, a, b)
for i in 1:length(cols)
ordi = col_ordering(ord, i)
@inbounds coli = cols[i]
@inbounds va = coli[a]
@inbounds vb = coli[b]
lt(ordi, va, vb) && return true
lt(ordi, vb, va) && return false
end
false # a and b are equal
end
Expand Down Expand Up @@ -325,7 +350,7 @@ function Base.issorted(df::AbstractDataFrame, cols=[];
end
if cols isa ColumnIndex
return issorted(df[!, cols], lt=lt, by=by, rev=rev, order=order)
elseif length(cols) == 1
elseif cols isa AbstractVector{<:ColumnIndex} && length(cols) == 1
return issorted(df[!, cols[1]], lt=lt, by=by, rev=rev, order=order)
else
return issorted(1:nrow(df), ordering(df, cols, lt, by, rev, order))
Expand Down
Loading