Bug: StructArrays allocate using @batch even if underlying type is equivalent #136

Open
AhmedSalih3d opened this issue Mar 2, 2024 · 2 comments

@AhmedSalih3d

Hello!

I noticed that there is a difference between passing a Vector{SVector{2,Float64}} in "raw" and passing the same data packed in a StructArray. My findings are here:

using Base.Threads
using Bumper
using StrideArrays # Not necessary, but can make operations like broadcasting with Bumper.jl faster.
using Polyester
using BenchmarkTools
using ChunkSplitters
using StaticArrays
using StructArrays

struct DimensionalData{D, T <: AbstractFloat}
    vectors::Tuple{Vararg{Vector{T}, D}}
    V::StructArray{SVector{D, T}, 1, Tuple{Vararg{Vector{T}, D}}}

    # General constructor for vectors
    function DimensionalData(vectors::Vector{T}...) where {T}
        D = length(vectors)
        V = StructArray{SVector{D, T}}(vectors)
        new{D, T}(Tuple(vectors), V)
    end

    # Constructor for initializing with all zeros, adapting to dimension D
    function DimensionalData{D, T}(len::Int) where {D, T}
        vectors = ntuple(d -> zeros(T, len), D) # Create D vectors of zeros
        V = StructArray{SVector{D, T}}(vectors)
        new{D, T}(vectors, V)
    end
end
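
# Illustrative sanity check (added here for clarity; not part of the
# original report): the StructArray field and a raw vector expose the
# same element type, so the kernel below sees equivalent inputs.
let
    data = DimensionalData(zeros(Float64, 4), zeros(Float64, 4))
    raw  = zeros(SVector{2, Float64}, 4)
    @assert eltype(data.V) == eltype(raw) == SVector{2, Float64}
    @assert data.V[1] == raw[1]
end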

function ReductionFunctionChunk!(dρdtI, I, J, drhopLp, drhopLn)
    XT = eltype(dρdtI); XL = length(dρdtI); X0 = zero(XT)
    nchunks = nthreads()  # one chunk per thread

    @inbounds @no_escape begin
        local_X = @alloc(XT, XL, nchunks)

        fill!(local_X, X0)

        # Directly iterate over the chunks
        @batch for ichunk in 1:nchunks
            chunk_inds = getchunk(I, ichunk; n=nchunks)
            for idx in chunk_inds
                i = I[idx]
                j = J[idx]

                # Accumulate the contributions into the correct place
                local_X[i, ichunk] += drhopLp[idx]
                local_X[j, ichunk] += drhopLn[idx]
            end
        end

        # Reduction step
        # @tturbo is slightly faster here (≈28 μs) than @batch (≈30 μs), but @batch works for SVector, so we prefer it.
        @batch for ix in 1:XL
            for chunk in 1:nchunks
                dρdtI[ix] += local_X[ix, chunk]
            end
        end
    end
    
    return nothing
end

begin
    ProblemScaleFactor = 1
    NumberOfPoints     = 6195 * ProblemScaleFactor
    NumberOfIterations = 50000 * ProblemScaleFactor
    I           = rand(1:NumberOfPoints, NumberOfIterations)
    J           = I # rand(1:NumberOfPoints, NumberOfIterations)

    V           = zeros(SVector{2, Float64}, NumberOfPoints)
    VL          = rand(eltype(V), NumberOfIterations)

    VD          = DimensionalData{2, Float64}(NumberOfPoints)
    VDL         = DimensionalData{2, Float64}(NumberOfIterations)
    VDL.V      .= VL


    V .*= 0
    ReductionFunctionChunk!(V,I,J,VL,VL)
    println("Value when doing chunk svector reduction: ", sum(V))

    VD.V .*= 0
    ReductionFunctionChunk!(VD.V,I,J,VDL.V,VDL.V)
    println("Value when doing chunk svector reduction with dimensional data: ", sum(VD.V))

    println("Chunk function:")
    display(@benchmark ReductionFunctionChunk!($V,$I,$J,$VL,$VL))

    println("Chunk function with struct:")
    K  = VD.V
    KL = VDL.V
    display(@benchmark ReductionFunctionChunk!($K,$I,$J,$KL,$KL))

end
Value when doing chunk svector reduction: [50079.66957585603, 50105.60196762815]
Value when doing chunk svector reduction with dimensional data: [50079.66957585603, 50105.60196762815]
Chunk function:
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  37.800 μs … 434.800 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     54.200 μs               ┊ GC (median):    0.00%        
 Time  (mean ± σ):   51.479 μs ±   9.509 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

   ▁                                  █
  ▁█▇▃▁▁▂▃▁▃▃▂▄▄▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▅▆██▃▁▂▆▇▆▄▆▂▃▂▂▂▂▂▂▂▂▁▁▁▁▁ ▂
  37.8 μs         Histogram: frequency by time           65 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
Chunk function with struct:
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  47.900 μs … 355.600 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     77.800 μs               ┊ GC (median):    0.00%        
 Time  (mean ± σ):   68.857 μs ±  15.964 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▅█▁▃▂▁                                ▄▆█▇▆▅▃▂▂▁▁            ▂
  ███████▇███▇█▆▇▆▅▄▆▄▄▅▄▅▃▄▄▇▆▆█▇████▆▇████████████▇▇▆▆▅▆▅▄▄▂ █
  47.9 μs       Histogram: log(frequency) by time      92.7 μs <

 Memory estimate: 176 bytes, allocs estimate: 2.

My findings are as follows:

The 3 * 64 bytes seem to come from indexing operations: commenting out an indexing operation inside ReductionFunctionChunk! lowers the reported memory usage.

More importantly, removing @batch brings everything back to zero allocations, as expected. A sketch of that variant follows.
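
For reference, this is the variant I mean, with both @batch annotations removed (a sketch; only the two loop headers differ from ReductionFunctionChunk! above):

function ReductionFunctionChunkSerial!(dρdtI, I, J, drhopLp, drhopLn)
    XT = eltype(dρdtI); XL = length(dρdtI); X0 = zero(XT)
    nchunks = nthreads()

    @inbounds @no_escape begin
        local_X = @alloc(XT, XL, nchunks)
        fill!(local_X, X0)

        for ichunk in 1:nchunks          # was: @batch for ichunk in 1:nchunks
            chunk_inds = getchunk(I, ichunk; n=nchunks)
            for idx in chunk_inds
                i = I[idx]
                j = J[idx]
                local_X[i, ichunk] += drhopLp[idx]
                local_X[j, ichunk] += drhopLn[idx]
            end
        end

        for ix in 1:XL                   # was: @batch for ix in 1:XL
            for chunk in 1:nchunks
                dρdtI[ix] += local_X[ix, chunk]
            end
        end
    end

    return nothing
end

# @benchmark ReductionFunctionChunkSerial!($K, $I, $J, $KL, $KL)
# then reports 0 bytes / 0 allocations for the StructArray inputs as well.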

Kind regards

@chriselrod
Member

You could add support through a package extension to StrideArraysCore.jl.
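
For anyone picking this up later, the scaffolding for such an extension would look roughly like the sketch below. Only the standard Julia ≥ 1.9 extension wiring is shown; the module name and Project.toml entries are assumptions, and the actual StrideArraysCore method(s) to overload still need to be identified:

# ext/StrideArraysCoreStructArraysExt.jl -- hypothetical sketch only.
#
# Wiring in StrideArraysCore's Project.toml (UUID elided on purpose):
#   [weakdeps]
#   StructArrays = "<StructArrays UUID>"
#   [extensions]
#   StrideArraysCoreStructArraysExt = "StructArrays"
module StrideArraysCoreStructArraysExt

using StrideArraysCore, StructArrays

# The idea: teach StrideArraysCore to decompose a StructArray into its
# plain component vectors so that @batch can capture it without the
# allocating fallback path. The exact hook to overload would need to be
# located in StrideArraysCore's source; it is deliberately left open here.

end # module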

@AhmedSalih3d
Author

Thank you for explaining!

For my future reference, the Discourse thread is here:

https://discourse.julialang.org/t/hidden-allocations-when-parallelizing-with-batch/111051/11

I will see if I am able to do it later.
