-
Notifications
You must be signed in to change notification settings - Fork 144
Description
Hello. I was trying to find the index of the maximum value in a data frame that I loaded using CSV.jl.
Julia has a function called argmax which does this.
argmax(itr)
Return the index or key of the maximal element in a collection. If there are multiple maximal elements, then the first one will be returned.
Unfortunately, the results I was getting back were incorrect – I called argmax
on an integer column and got back an answer that was a valid index, but not the index of the maximum value.
After some digging, I was able to reduce the issue to a small example, which you can find below.
julia> using SentinelArrays
julia> arrays = [
[18, 70, 92, 15, 65],
[25, 14, 95, 54, 57]
];
julia> cv = ChainedVector(arrays);
julia> argmax(cv)
95
julia> argmax(collect(cv))
8
Due to multithreading, the CSV column was loaded in as a chained vector, which comes from a package SentinelArrays, which incorrectly implements argmax
. It returns the maximum value, rather than its index.
The implementation is here and is tested, though the test only tests a simple special case for which the values and indices are exactly equal.
This seems like a fairly serious bug to me because it allows silently incorrect results to be returned during data processing and analysis (this is what happened to me). If I collect
the column before calling argmax
, then correct value is returned. If I load the file in with CSV.read(..., ntasks=1)
I also see the correct results.
I saw that SentinelArrays
has a recent commit to "fix out-of-bounds bug in BufferedVector" and a few other indications of potential hidden issues with that package. In my experience, CSV and DataFrames have tended to have less bugs than other parts of the Julia ecosystem so I wanted to file this bug upstream in case removing the dependency is a reasonable idea in this case.