Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

push!/pop! always does a ccall #24909

Closed
chethega opened this issue Dec 4, 2017 · 13 comments · Fixed by #51319
Closed

push!/pop! always does a ccall #24909

chethega opened this issue Dec 4, 2017 · 13 comments · Fixed by #51319
Labels
arrays [a, r, r, a, y, s] performance Must go faster

Comments

@chethega
Copy link
Contributor

chethega commented Dec 4, 2017

When looking over my code_native, I recently saw that every push! costs me a call into the runtime library. This appears...excessive, at least for the standard case where the push can be accommodated without copying.

So, lets compare the push! and the fastpush!

First, lets look at what a vector actually is:

struct array_T
    data :: Ptr{Void}
    length:: UInt64
    flags::UInt16
    elsize::UInt16
    offset::UInt32
    nrows::UInt64
    ncols_maxsz::UInt64
    stuff1::UInt64
    stuff2::UInt64
    stuff3::UInt64
end

This gives us the fastpush:

@inline function fastpush!(A::Vector{T},val::T) where T
    if isbits(T)
        arr = unsafe_load(convert(Ptr{array_T}, pointer_from_objref(A)))
        if arr.length + arr.offset  < arr.ncols_maxsz       
            unsafe_store!(convert(Ptr{UInt64}, pointer_from_objref(A)), arr.length+1, 2)
            unsafe_store!(convert(Ptr{UInt64}, pointer_from_objref(A)), arr.length+1, 4)
            @inbounds A[arr.length+1] = val
        else
            push!(A,val)
        end
    else
        push!(A,val)
    end
    A
end

And lets run a race:

function pushtest(A,n)
    resize!(A, 0);
    for i = 1:n push!(A,i) end
    nothing
end

function fastpushtest(A,n)
    resize!(A, 0)
    for i = 1:n fastpush!(A,i) end
    nothing
end

function settest(A,n)
    resize!(A, n)
    for i = 1:n  A[i]=i end
    nothing
end

yielding:

Nm = 100_000_000; A = Vector{Int}(Nm); resize!(A, 0);
#after warmup:

@time settest(A,Nm); @time settest(A,Nm);
  0.356040 seconds (4 allocations: 160 bytes)
  0.326640 seconds (4 allocations: 160 bytes)

@time pushtest(A,Nm); @time pushtest(A,Nm);
  0.899436 seconds (4 allocations: 160 bytes)
  0.895555 seconds (4 allocations: 160 bytes)

@time fastpushtest(A,Nm); @time fastpushtest(A,Nm);
  0.577715 seconds (4 allocations: 160 bytes)
  0.606671 seconds (4 allocations: 160 bytes)

Hence, a 30% speedup is possible if we manage to get the fastpath inlined.

It would mayhaps be nice if it was somehow possible to make llvm aware of julia's runtime lib. For example, compile to llvm_IR (using clang or dragonegg?), and call into a nice library of llvm code including optimization-relevant metadata, and possibly inline current ccall-functions.

Not sure whether this performance problem is worth trying to fix (and how much more expensive it is if we lose important registers on the call). It is easy to avoid, though, by implementing the checks by hand (for tight loops that need to push a lot). In a perfect world, the optimizer would reduce to only a single check for unrolled pushing loops. In my applications, I do enough other things between push!/pop! that this ends up being benign.

PS. This is related to #24901, which removes the boundcheck from push!/pop!.

I also posted this in discourse.julialang.org before. The fastpush is deliberately not a pull request, because it is dirty and architecture dependent.

@quinnj
Copy link
Member

quinnj commented Dec 4, 2017

I think the broader solution here would be to reimplement Array in pure Julia. It'd be messy, but certainly do-able. The hardest part would be factoring out what needs to still exist in array.c, mostly from a GC perspective and what can be moved to julia. Certainly all the array growing/shrinking code could be moved to julia, but we would still need something like a jl_buffer_t type in src that the Julia Array type would have as a data field.

@chethega
Copy link
Contributor Author

chethega commented Dec 4, 2017

That would definitely be very cool. The most interesting thing would be using the 32 bytes remaining in Vector to store very small vectors inline (so that the data and the array_T are in the same cache line).

The second most interesting thing would be to make non-resizeable arrays reliably be one cache-line before their data, for the net effect that type-inferred array access can skip one level of indirection (this placement is already relatively reliable, but afaik neither used for prefetching nor for removal of indirection).

@oxinabox
Copy link
Contributor

oxinabox commented Dec 6, 2017

@quinnj Would we really need a jl_buffer_t in src ?
Couldn't we have a LowLevelBuffer{T} in julia,
which is just an a thin wrapper around a Ptr{T}, a size::Int, and a ccall to malloc ?

something like:


mutable struct LowLevelBuffer{T}
	nelements::Int
	data :: Ptr{T}
	function LowLevelBuffer{T}(nelements) where {T}
		nbytes = nelements*sizeof(T)
		data = Ptr{T}(Libc.malloc(nbytes))
		this = new(nelements, data)
		finalizer(this) do buff
			Libc.free(buff.data)
		end
	end
end

Base.endof(buff::LowLevelBuffer) = buff.nelements
Base.unsafe_getindex(buff::LowLevelBuffer, ii::Integer) = unsafe_load(buff.data, ii)
Base.unsafe_setindex!(buff::LowLevelBuffer, x, ii::Integer) = unsafe_store!(buff.data, x, ii)

# define getindex and setindex by adding checks to the unsafe versions
# and the rest
# don't define too much as got to leave some work for `Array`

Or would it be too hard to make that performant?

@quinnj
Copy link
Member

quinnj commented Dec 6, 2017

There's quite a bit more involved here; the julia GC has no awareness of your Libc.malloc memory. We also need to make sure the GC is aware of any boxed elements. I'm not well-versed in the GC internals to say exactly what a reasonable split would be between what's defined in src/ vs. Julia itself.

@vtjnash
Copy link
Member

vtjnash commented Dec 6, 2017

Or would it be too hard to make that performant?

Yes, the simple answer would be "because malloc is so very slow".

@chethega
Copy link
Contributor Author

Upon reconsideration, the push! is really bad, also in practice. Consider:

@inline function fpush!(vec, val, pos)
    if pos >= length(vec)
        resize!(vec, 10+ 2*length(vec))
    end
    @inbounds vec[pos+1] = val
    return pos+1
end

function filter_f(pred, A)
    res = Vector{eltype(A)}()
    pos = 0
    @inbounds for a in A 
        if pred(a)
            pos=fpush!(res, a, pos)
        end
    end
    resize!(res, pos)
    res
end

And then

A = rand(10^6); pred(x) = (x<0.5);

@btime res = filter(pred, A);
  12.272 ms (19 allocations: 5.00 MiB)

@btime res = filter_f(pred, A);
  8.468 ms (17 allocations: 6.25 MiB)

@KristofferC
Copy link
Member

KristofferC commented Sep 9, 2018

Introducing, the PushVector

mutable struct PushVector{T, A<:AbstractVector{T}} <: AbstractVector{T}
    v::A
    l::Int
end

PushVector{T}() where {T} = PushVector(Vector{T}(undef, 4), 0)

Base.length(v::PushVector) = v.l
Base.size(v::PushVector) = (v.l,)
@inline function Base.getindex(v::PushVector, i)
    @boundscheck checkbounds(v, i)
    @inbounds v.v[i]
end

function Base.push!(v::PushVector, i)
    v.l += 1
    if v.l > length(v.v)
        resize!(v.v, v.l * 2)
    end
    v.v[v.l] = i
    return v
end

finish!(v::PushVector) = resize!(v.v, v.l)

using BenchmarkTools


function pushit(v)
    for i in 1:10^4
        push!(v, i)
    end
end
@btime begin
    p = PushVector{Int64}()
    pushit(p)
    finish!(p)
end
# 24.601 μs (13 allocations: 192.58 KiB)

@btime begin
    p = Vector{Int64}()
    pushit(p)
end
# 71.176 μs (14 allocations: 256.70 KiB)

Mostly a joke

@Jutho
Copy link
Contributor

Jutho commented Oct 23, 2018

This is a user question, but here I immediately reach the people that can answer this:

In LinearMaps.jl I use resize! on a temporary array when combining several linear maps. However, this fails if the user applies a linear map which resizes reshapes the incoming or outgoing vector. Therefore, we would like a cheap way to detect whether an array can be resized, without triggering the error and using a try ... catch block.

Is this the way to go:

if Sys.WORD_SIZE == 64
 _flags(A::Array) = unsafe_load(convert(Ptr{UInt16}, pointer_from_objref(A)), 9)
elseif Sys.WORD_SIZE == 32
 _flags(A::Array) = unsafe_load(convert(Ptr{UInt16}, pointer_from_objref(A)), 5)
else
 error("Unknown word size")
end
_isshared(A::Array) = !(_flags(A) & 0x4000 == 0x0000)

I don't have a 32 bit machine, so I don't know if the position 5 is correct in that case. Do both the pointer to the data and the length of the array become a 32 bit field? And should I worry about the flags.how == 3 case if I allocated the array myself using the Array{T}(undef) constructor and just passed it along to other functions?

@chethega
Copy link
Contributor Author

Do both the pointer to the data and the length of the array become a 32 bit field?

Mostly yes. But I think the proper way would be
_flags(A::Array) = unsafe_load(convert(Ptr{UInt16}, pointer_from_objref(A)), 1 + sizeof(Csize_t)>>1 + sizeof(Ptr{Cvoid})>>1 )

But then, there is the compile-time option STORE_ARRAY_LEN. I don't know how to check for that at runtime.

And should I worry about the flags.how == 3 case ...?

I don't know, sorry. But is the exception-route so bad in your case?

@yuyichao
Copy link
Contributor

No don't do that. None of the code above are well defined.

@Jutho
Copy link
Contributor

Jutho commented Oct 23, 2018

The fun of hacking around :-). Any other suggestion for accessing the isshared flag from within Julia? I assume there is some overhead to using try ... catch versus checking a simple flag?

@tpapp
Copy link
Contributor

tpapp commented May 1, 2019

FWIW, I packaged the workaround of @KristofferC (with his kind permission) as

/~https://github.com/tpapp/PushVectors.jl

because I found I was copy-pasting it to various pieces of code.

I intend to obsolete/retire it when this issue is fixed, but in the meantime it may be useful.

@laborg
Copy link
Contributor

laborg commented Oct 18, 2023

Probably closable when #51319 lands

vtjnash added a commit that referenced this issue Oct 27, 2023
See <https://hackmd.io/@vtjnash/rkzazi7an> for the Julep describing this
change.

Also makes memory allocation ccalls safer, for catching Serialization
and deepcopy bugs in packages.

Fixes #24909

Co-authored-by: Jameson Nash <vtjnash@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrays [a, r, r, a, y, s] performance Must go faster
Projects
None yet
Development

Successfully merging a pull request may close this issue.

10 participants