Skip to content

Incorrect results for argmax with multithreaded parsing #1128

@yurivish

Description

@yurivish

Hello. I was trying to find the index of the maximum value in a data frame that I loaded using CSV.jl.

Julia has a function called argmax which does this.

argmax(itr)

Return the index or key of the maximal element in a collection. If there are multiple maximal elements, then the first one will be returned.

Unfortunately, the results I was getting back were incorrect – I called argmax on an integer column and got back an answer that was a valid index, but not the index of the maximum value.

After some digging, I was able to reduce the issue to a small example, which you can find below.

julia> using SentinelArrays

julia> arrays = [
         [18, 70, 92, 15, 65],
         [25, 14, 95, 54, 57]
       ];

julia> cv = ChainedVector(arrays);

julia> argmax(cv)
95

julia> argmax(collect(cv))
8

Due to multithreading, the CSV column was loaded in as a chained vector, which comes from a package SentinelArrays, which incorrectly implements argmax. It returns the maximum value, rather than its index.

The implementation is here and is tested, though the test only tests a simple special case for which the values and indices are exactly equal.

This seems like a fairly serious bug to me because it allows silently incorrect results to be returned during data processing and analysis (this is what happened to me). If I collect the column before calling argmax, then correct value is returned. If I load the file in with CSV.read(..., ntasks=1) I also see the correct results.

I saw that SentinelArrays has a recent commit to "fix out-of-bounds bug in BufferedVector" and a few other indications of potential hidden issues with that package. In my experience, CSV and DataFrames have tended to have less bugs than other parts of the Julia ecosystem so I wanted to file this bug upstream in case removing the dependency is a reasonable idea in this case.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions