
push!/pop! always does a ccall #24909

Closed
chethega opened this issue Dec 4, 2017 · 13 comments · Fixed by #51319
Labels
arrays · performance (Must go faster)

Comments

@chethega
Contributor

chethega commented Dec 4, 2017

When looking over my code_native output, I recently noticed that every push! costs me a call into the runtime library. This appears excessive, at least for the common case where the push can be accommodated without copying.

So, let's compare push! with a hand-rolled fastpush!.

First, let's look at what a Vector actually is:

# Mirrors the in-memory layout of jl_array_t on a 64-bit build
struct array_T
    data::Ptr{Void}       # pointer to the element buffer
    length::UInt64        # number of elements
    flags::UInt16
    elsize::UInt16        # size of one element in bytes
    offset::UInt32
    nrows::UInt64
    ncols_maxsz::UInt64   # for a Vector: the allocated capacity
    stuff1::UInt64
    stuff2::UInt64
    stuff3::UInt64
end
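As a sanity check on the word indices used further down, `fieldoffset` on a mirror of this struct (repeated under a fresh name, with the Julia 1.x `Cvoid` spelling, so the snippet runs standalone) confirms where `length` and `nrows` live: byte offset 8 (UInt64 word 2) and byte offset 24 (word 4).

```julia
# Mirror of the array header above, renamed so this snippet is
# self-contained and does not clash with the 0.6-era definition.
struct ArrayHeader
    data::Ptr{Cvoid}
    length::UInt64
    flags::UInt16
    elsize::UInt16
    offset::UInt32
    nrows::UInt64
    ncols_maxsz::UInt64
    stuff1::UInt64
    stuff2::UInt64
    stuff3::UInt64
end

fieldoffset(ArrayHeader, 2)   # length: byte offset 8  -> UInt64 word index 2
fieldoffset(ArrayHeader, 6)   # nrows:  byte offset 24 -> UInt64 word index 4
```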

This gives us the fastpush:

@inline function fastpush!(A::Vector{T}, val::T) where T
    if isbits(T)
        arr = unsafe_load(convert(Ptr{array_T}, pointer_from_objref(A)))
        if arr.length + arr.offset < arr.ncols_maxsz
            # Fast path: capacity suffices, so bump both length fields
            # (UInt64 words 2 and 4 of the header: length and nrows) in place.
            unsafe_store!(convert(Ptr{UInt64}, pointer_from_objref(A)), arr.length + 1, 2)
            unsafe_store!(convert(Ptr{UInt64}, pointer_from_objref(A)), arr.length + 1, 4)
            @inbounds A[arr.length + 1] = val
        else
            push!(A, val)   # slow path: let the runtime grow the buffer
        end
    else
        push!(A, val)       # boxed elements need the GC write barrier
    end
    A
end

And let's run a race:

function pushtest(A,n)
    resize!(A, 0);
    for i = 1:n push!(A,i) end
    nothing
end

function fastpushtest(A,n)
    resize!(A, 0)
    for i = 1:n fastpush!(A,i) end
    nothing
end

function settest(A,n)
    resize!(A, n)
    for i = 1:n  A[i]=i end
    nothing
end

yielding:

Nm = 100_000_000; A = Vector{Int}(Nm); resize!(A, 0);
#after warmup:

@time settest(A,Nm); @time settest(A,Nm);
  0.356040 seconds (4 allocations: 160 bytes)
  0.326640 seconds (4 allocations: 160 bytes)

@time pushtest(A,Nm); @time pushtest(A,Nm);
  0.899436 seconds (4 allocations: 160 bytes)
  0.895555 seconds (4 allocations: 160 bytes)

@time fastpushtest(A,Nm); @time fastpushtest(A,Nm);
  0.577715 seconds (4 allocations: 160 bytes)
  0.606671 seconds (4 allocations: 160 bytes)

Hence, a roughly 30% speedup is possible if we manage to get this fast path inlined.

It would perhaps be nice if it were somehow possible to make LLVM aware of Julia's runtime library: for example, compile it to LLVM IR (using clang or dragonegg?) and call into a library of LLVM bitcode that carries optimization-relevant metadata, possibly inlining the functions currently reached via ccall.

Not sure whether this performance problem is worth fixing (and how much more expensive the call becomes if it clobbers important registers). It is easy to avoid, though, by implementing the capacity check by hand (for tight loops that push a lot). In a perfect world, the optimizer would reduce unrolled pushing loops to a single check. In my applications, I do enough other work between push!/pop! calls that this overhead ends up benign.
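For completeness, when the final size is roughly known, the reallocation part of the cost can also be sidestepped with `sizehint!`: the ccall still happens on every push!, but the buffer is grown only once up front, so each push! takes the no-copy branch. A minimal sketch (the function name is mine, not an API):

```julia
# Pre-grow the buffer so that no push! inside the loop reallocates.
function pushtest_hinted(n)
    A = Int[]
    sizehint!(A, n)      # one buffer growth up front
    for i in 1:n
        push!(A, i)      # capacity already suffices; no copy
    end
    return A
end
```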

PS: This is related to #24901, which removes the bounds check from push!/pop!.

I also posted this on discourse.julialang.org earlier. The fastpush! is deliberately not a pull request, because it is dirty and architecture-dependent.

@quinnj
Member

quinnj commented Dec 4, 2017

I think the broader solution here would be to reimplement Array in pure Julia. It'd be messy, but certainly doable. The hardest part would be factoring out what still needs to exist in array.c (mostly from a GC perspective) versus what can be moved to Julia. Certainly all the array growing/shrinking code could be moved to Julia, but we would still need something like a jl_buffer_t type in src that the Julia Array type would have as a data field.

@chethega
Contributor Author

chethega commented Dec 4, 2017

That would definitely be very cool. The most interesting thing would be using the 32 remaining bytes in Vector to store very small vectors inline (so that the data and the array_T header share a cache line).

The second most interesting thing would be to reliably place non-resizable arrays one cache line before their data, with the net effect that type-inferred array access can skip one level of indirection (this placement is already fairly reliable, but afaik it is used neither for prefetching nor for removing the indirection).

@oxinabox
Contributor

oxinabox commented Dec 6, 2017

@quinnj Would we really need a jl_buffer_t in src?
Couldn't we have a LowLevelBuffer{T} in Julia,
which is just a thin wrapper around a Ptr{T}, a size::Int, and a ccall to malloc?

something like:


mutable struct LowLevelBuffer{T}
    nelements::Int
    data::Ptr{T}
    function LowLevelBuffer{T}(nelements) where {T}
        nbytes = nelements * sizeof(T)
        data = Ptr{T}(Libc.malloc(nbytes))
        this = new(nelements, data)
        finalizer(this) do buff
            Libc.free(buff.data)
        end
        return this   # the inner constructor must return the new object
    end
end

Base.endof(buff::LowLevelBuffer) = buff.nelements
Base.unsafe_getindex(buff::LowLevelBuffer, ii::Integer) = unsafe_load(buff.data, ii)
Base.unsafe_setindex!(buff::LowLevelBuffer, x, ii::Integer) = unsafe_store!(buff.data, x, ii)

# define getindex and setindex by adding checks to the unsafe versions
# and the rest
# don't define too much as got to leave some work for `Array`

Or would it be too hard to make that performant?

@quinnj
Member

quinnj commented Dec 6, 2017

There's quite a bit more involved here; the Julia GC has no awareness of your Libc.malloc memory. We would also need to make sure the GC is aware of any boxed elements. I'm not well-versed enough in the GC internals to say exactly what a reasonable split would be between what's defined in src/ versus Julia itself.
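The GC point can be made concrete: an object pointer written into malloc'd memory is invisible to the collector, so the store does not root anything. A small illustrative sketch (the function name is mine; this pattern is exactly what a real buffer type must avoid):

```julia
# Demonstrates that Libc.malloc'd memory is outside the GC's view.
function unrooted_store()
    buf = Libc.malloc(sizeof(Ptr{Cvoid}))
    obj = Ref(42)                        # heap-allocated and mutable
    ok = GC.@preserve obj begin
        p = pointer_from_objref(obj)
        unsafe_store!(Ptr{Ptr{Cvoid}}(buf), p)
        # The GC never scans `buf`, so this store does NOT keep `obj`
        # alive: once `obj` is otherwise unreachable, a collection may
        # reclaim it and the stored pointer dangles.
        unsafe_load(Ptr{Ptr{Cvoid}}(buf)) == p
    end
    Libc.free(buf)
    return ok
end
```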

@vtjnash
Member

vtjnash commented Dec 6, 2017

Or would it be too hard to make that performant?

Yes, the simple answer would be "because malloc is so very slow".

@chethega
Contributor Author

Upon reconsideration, push! is really bad in practice too. Consider:

@inline function fpush!(vec, val, pos)
    if pos >= length(vec)
        resize!(vec, 10+ 2*length(vec))
    end
    @inbounds vec[pos+1] = val
    return pos+1
end

function filter_f(pred, A)
    res = Vector{eltype(A)}()
    pos = 0
    @inbounds for a in A 
        if pred(a)
            pos=fpush!(res, a, pos)
        end
    end
    resize!(res, pos)
    res
end

And then

A = rand(10^6); pred(x) = (x<0.5);

@btime res = filter(pred, A);
  12.272 ms (19 allocations: 5.00 MiB)

@btime res = filter_f(pred, A);
  8.468 ms (17 allocations: 6.25 MiB)

@KristofferC
Member

KristofferC commented Sep 9, 2018

Introducing, the PushVector

mutable struct PushVector{T, A<:AbstractVector{T}} <: AbstractVector{T}
    v::A
    l::Int
end

PushVector{T}() where {T} = PushVector(Vector{T}(undef, 4), 0)

Base.length(v::PushVector) = v.l
Base.size(v::PushVector) = (v.l,)
@inline function Base.getindex(v::PushVector, i)
    @boundscheck checkbounds(v, i)
    @inbounds v.v[i]
end

function Base.push!(v::PushVector, i)
    v.l += 1
    if v.l > length(v.v)
        resize!(v.v, v.l * 2)
    end
    v.v[v.l] = i
    return v
end

finish!(v::PushVector) = resize!(v.v, v.l)

using BenchmarkTools


function pushit(v)
    for i in 1:10^4
        push!(v, i)
    end
end
@btime begin
    p = PushVector{Int64}()
    pushit(p)
    finish!(p)
end
# 24.601 μs (13 allocations: 192.58 KiB)

@btime begin
    p = Vector{Int64}()
    pushit(p)
end
# 71.176 μs (14 allocations: 256.70 KiB)

Mostly a joke

@Jutho
Contributor

Jutho commented Oct 23, 2018

This is a user question, but here I immediately reach the people that can answer this:

In LinearMaps.jl I use resize! on a temporary array when combining several linear maps. However, this fails if the user applies a linear map that reshapes the incoming or outgoing vector. Therefore, we would like a cheap way to detect whether an array can be resized, without triggering the error and using a try ... catch block.

Is this the way to go:

if Sys.WORD_SIZE == 64
 _flags(A::Array) = unsafe_load(convert(Ptr{UInt16}, pointer_from_objref(A)), 9)
elseif Sys.WORD_SIZE == 32
 _flags(A::Array) = unsafe_load(convert(Ptr{UInt16}, pointer_from_objref(A)), 5)
else
 error("Unknown word size")
end
_isshared(A::Array) = _flags(A) & 0x4000 != 0x0000

I don't have a 32-bit machine, so I don't know if position 5 is correct in that case. Do both the pointer to the data and the length of the array become 32-bit fields? And should I worry about the flags.how == 3 case if I allocated the array myself using the Array{T}(undef) constructor and just passed it along to other functions?

@chethega
Contributor Author

Do both the pointer to the data and the length of the array become a 32 bit field?

Mostly yes. But I think the proper way would be
_flags(A::Array) = unsafe_load(convert(Ptr{UInt16}, pointer_from_objref(A)), 1 + sizeof(Csize_t)>>1 + sizeof(Ptr{Cvoid})>>1 )

But then, there is the compile-time option STORE_ARRAY_LEN. I don't know how to check for that at runtime.

And should I worry about the flags.how == 3 case ...?

I don't know, sorry. But is the exception-route so bad in your case?

@yuyichao
Contributor

No, don't do that. None of the code above is well defined.

@Jutho
Contributor

Jutho commented Oct 23, 2018

The fun of hacking around :-). Any other suggestions for accessing the isshared flag from within Julia? I assume there is some overhead to using try ... catch versus checking a simple flag?
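Lacking a supported flag query, wrapping the resize itself in try/catch is at least well defined, and the try setup is cheap on the non-throwing path (the cost is paid only when an exception is actually raised). A hypothetical helper sketch (the name `try_resize!` and the catch-everything policy are my own suggestion, not an API):

```julia
# Attempt the resize and report success instead of throwing,
# so callers can fall back to allocating a fresh array.
function try_resize!(A::AbstractVector, n::Integer)
    try
        resize!(A, n)
        return true
    catch
        return false   # e.g. shared data, or no resize! method for this type
    end
end
```

Usage would look like `try_resize!(tmp, n) || (tmp = similar(tmp, n))`.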

@tpapp
Contributor

tpapp commented May 1, 2019

FWIW, I packaged the workaround of @KristofferC (with his kind permission) as

https://github.com/tpapp/PushVectors.jl

because I found I was copy-pasting it to various pieces of code.

I intend to obsolete/retire it when this issue is fixed, but in the meantime it may be useful.

@laborg
Contributor

laborg commented Oct 18, 2023

Probably closable when #51319 lands

vtjnash added a commit that referenced this issue Oct 27, 2023
See <https://hackmd.io/@vtjnash/rkzazi7an> for the Julep describing this
change.

Also makes memory allocation ccalls safer, for catching Serialization
and deepcopy bugs in packages.

Fixes #24909

Co-authored-by: Jameson Nash <vtjnash@gmail.com>