Bug: StructArrays allocate using @batch even if underlying type is equivalent #136

Open
AhmedSalih3d opened this issue Mar 2, 2024 · 2 comments

@AhmedSalih3d

Hello!

I noticed that there is a difference between passing a Vector{SVector{2,Float64}} in "raw" and passing the same data packed in a StructArray. My findings are here:

using Base.Threads
using Bumper
using StrideArrays # Not necessary, but can make operations like broadcasting with Bumper.jl faster.
using Polyester
using BenchmarkTools
using ChunkSplitters
using StaticArrays
using StructArrays

struct DimensionalData{D, T <: AbstractFloat}
    vectors::Tuple{Vararg{Vector{T}, D}}
    V::StructArray{SVector{D, T}, 1, Tuple{Vararg{Vector{T}, D}}}

    # General constructor for vectors
    function DimensionalData(vectors::Vector{T}...) where {T}
        D = length(vectors)
        V = StructArray{SVector{D, T}}(vectors)
        new{D, T}(Tuple(vectors), V)
    end

    # Constructor for initializing with all zeros, adapting to dimension D
    function DimensionalData{D, T}(len::Int) where {D, T}
        vectors = ntuple(d -> zeros(T, len), D) # Create D vectors of zeros
        V = StructArray{SVector{D, T}}(vectors)
        new{D, T}(vectors, V)
    end
end
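
# Illustrative sanity check (added here for clarity; not part of the
# original report): the StructArray field and a raw vector expose the
# same element type, so the kernel below sees equivalent inputs.
let
    data = DimensionalData(zeros(Float64, 4), zeros(Float64, 4))
    raw  = zeros(SVector{2, Float64}, 4)
    @assert eltype(data.V) == eltype(raw) == SVector{2, Float64}
    @assert data.V[1] == raw[1]
end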

function ReductionFunctionChunk!(dρdtI, I, J, drhopLp, drhopLn)
    XT = eltype(dρdtI); XL = length(dρdtI); X0 = zero(XT)
    nchunks = nthreads()  # one chunk per thread

    @inbounds @no_escape begin
        local_X = @alloc(XT, XL, nchunks)

        fill!(local_X, X0)

        # Directly iterate over the chunks
        @batch for ichunk in 1:nchunks
            chunk_inds = getchunk(I, ichunk; n=nchunks)
            for idx in chunk_inds
                i = I[idx]
                j = J[idx]

                # Accumulate the contributions into the correct place
                local_X[i, ichunk] += drhopLp[idx]
                local_X[j, ichunk] += drhopLn[idx]
            end
        end

        # Reduction step
        # @tturbo is slightly faster here (≈28 μs) than @batch (≈30 μs), but @batch works for SVector, so we prefer it.
        @batch for ix in 1:XL
            for chunk in 1:nchunks
                dρdtI[ix] += local_X[ix, chunk]
            end
        end
    end
    
    return nothing
end

begin
    ProblemScaleFactor = 1
    NumberOfPoints     = 6195 * ProblemScaleFactor
    NumberOfIterations = 50000 * ProblemScaleFactor
    I           = rand(1:NumberOfPoints, NumberOfIterations)
    J           = I # rand(1:NumberOfPoints, NumberOfIterations)

    V           = zeros(SVector{2, Float64}, NumberOfPoints)
    VL          = rand(eltype(V), NumberOfIterations)

    VD          = DimensionalData{2, Float64}(NumberOfPoints)
    VDL         = DimensionalData{2, Float64}(NumberOfIterations)
    VDL.V      .= VL


    V .*= 0
    ReductionFunctionChunk!(V,I,J,VL,VL)
    println("Value when doing chunk svector reduction: ", sum(V))

    VD.V .*= 0
    ReductionFunctionChunk!(VD.V,I,J,VDL.V,VDL.V)
    println("Value when doing chunk svector reduction with dimensional data: ", sum(VD.V))

    println("Chunk function:")
    display(@benchmark ReductionFunctionChunk!($V,$I,$J,$VL,$VL))

    println("Chunk function with struct:")
    K  = VD.V
    KL = VDL.V
    display(@benchmark ReductionFunctionChunk!($K,$I,$J,$KL,$KL))

end
Value when doing chunk svector reduction: [50079.66957585603, 50105.60196762815]
Value when doing chunk svector reduction with dimensional data: [50079.66957585603, 50105.60196762815]
Chunk function:
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  37.800 μs … 434.800 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     54.200 μs               ┊ GC (median):    0.00%        
 Time  (mean ± σ):   51.479 μs ±   9.509 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

   ▁                                  █
  ▁█▇▃▁▁▂▃▁▃▃▂▄▄▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▅▆██▃▁▂▆▇▆▄▆▂▃▂▂▂▂▂▂▂▂▁▁▁▁▁ ▂
  37.8 μs         Histogram: frequency by time           65 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
Chunk function with struct:
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  47.900 μs … 355.600 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     77.800 μs               ┊ GC (median):    0.00%        
 Time  (mean ± σ):   68.857 μs ±  15.964 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▅█▁▃▂▁                                ▄▆█▇▆▅▃▂▂▁▁            ▂
  ███████▇███▇█▆▇▆▅▄▆▄▄▅▄▅▃▄▄▇▆▆█▇████▆▇████████████▇▇▆▆▅▆▅▄▄▂ █
  47.9 μs       Histogram: log(frequency) by time      92.7 μs <

 Memory estimate: 176 bytes, allocs estimate: 2.

My findings are as follows:

The 3 * 64 bytes seem to come from indexing operations: commenting out an indexing operation inside ReductionFunctionChunk! lowers the reported memory usage.

More importantly, removing @batch brings everything back to zero allocations, as expected. A sketch of that variant follows.
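
For reference, this is the variant I mean, with both @batch annotations removed (a sketch; only the two loop headers differ from ReductionFunctionChunk! above):

function ReductionFunctionChunkSerial!(dρdtI, I, J, drhopLp, drhopLn)
    XT = eltype(dρdtI); XL = length(dρdtI); X0 = zero(XT)
    nchunks = nthreads()

    @inbounds @no_escape begin
        local_X = @alloc(XT, XL, nchunks)
        fill!(local_X, X0)

        for ichunk in 1:nchunks          # was: @batch for ichunk in 1:nchunks
            chunk_inds = getchunk(I, ichunk; n=nchunks)
            for idx in chunk_inds
                i = I[idx]
                j = J[idx]
                local_X[i, ichunk] += drhopLp[idx]
                local_X[j, ichunk] += drhopLn[idx]
            end
        end

        for ix in 1:XL                   # was: @batch for ix in 1:XL
            for chunk in 1:nchunks
                dρdtI[ix] += local_X[ix, chunk]
            end
        end
    end

    return nothing
end

# @benchmark ReductionFunctionChunkSerial!($K, $I, $J, $KL, $KL)
# then reports 0 bytes / 0 allocations for the StructArray inputs as well.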

Kind regards

@chriselrod
Member

You could add support through a package extension to StrideArraysCore.jl.
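
For anyone picking this up later, the scaffolding for such an extension would look roughly like the sketch below. Only the standard Julia ≥ 1.9 extension wiring is shown; the module name and Project.toml entries are assumptions, and the actual StrideArraysCore method(s) to overload still need to be identified:

# ext/StrideArraysCoreStructArraysExt.jl -- hypothetical sketch only.
#
# Wiring in StrideArraysCore's Project.toml (UUID elided on purpose):
#   [weakdeps]
#   StructArrays = "<StructArrays UUID>"
#   [extensions]
#   StrideArraysCoreStructArraysExt = "StructArrays"
module StrideArraysCoreStructArraysExt

using StrideArraysCore, StructArrays

# The idea: teach StrideArraysCore to decompose a StructArray into its
# plain component vectors so that @batch can capture it without the
# allocating fallback path. The exact hook to overload would need to be
# located in StrideArraysCore's source; it is deliberately left open here.

end # module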

@AhmedSalih3d
Author

Thank you for explaining!

For my future reference, the Discourse thread is here:

https://discourse.julialang.org/t/hidden-allocations-when-parallelizing-with-batch/111051/11

I will see if I am able to do it later.
