correctly parse all integer types. fixes #9289 #9316

tanmaykm · 2014-12-11T19:53:50Z

Used parseint and parsefloat methods instead of float64_isvalid.

StefanKarpinski · 2014-12-11T19:56:10Z

base/datafmt.jl

+                            end
+                        end
+                    end
+                end


Yikes. That can't be efficient :-\

I don't like it either. But couldn't find a better way. Using parse was worse.

And it is only when an abstract Number type is specified to readdlm. Other cases perform as before.

This is our fault for not having a good API for this. Given when you have to work with, this is the only way. But it's still terrible :-)

The last try...end does not have a catch. Is there a reason to use try...end for that case instead of just val = parseint(BigInt, sval)?

P.S. I vaguely recall something like this showing up in a C++ benchmark, which caused compiler implementers to come up with crazy optimizations for this kind of practice, even though the practice was nominally discouraged.

@ArchRobison that is to let it fall through and try for bool or just set a string, when T is Any or Number.

@JeffBezanson Yes, I think we can just limit to Int. Also, won't the specialized methods be able to optimize away conditions that don't apply? Which will then be better than using Union.

The method will be specialized anyway; it's just more efficient not to have 5 copies of basically the same source code. Using @eval forces all 5 copies to exist; relying on specialization only generates new code for cases that are actually used. Also @eval and $ are just ugly.

this is unfortunate, since stack unwinding is so incredibly expensive on windows and we always create the stack trace on an error

It will probably be useful to have versions of parseint and parsefloat that indicate failure instead of throwing an exception.

+1 to failures as return variables rather than exceptions.

As part of the String redesign, I'd like to see functions like those in https://github.com/johnmyleswhite/CSVReaders.jl/blob/master/src/05_parsers.jl get moved into Base.

Used parseint instead of float64_isvalid.

tanmaykm · 2014-12-14T05:24:45Z

Modified the PR:

no try-catch if concrete types are specified and only one (for parseint) otherwise
removed support for specifying abstract numeric types. They would probably be not be of any practical use anyway
also not relying on return value of the try-end construct

Some performance comparisons with a 3000000x6 CSV of Ints:

Master:

@time d = readcsv("test.csv", Any);
elapsed time: 21.261691912 seconds (3379880904 bytes allocated, 78.64% gc time)

@time d = readcsv("test.csv", Int);
elapsed time: 3.682649125 seconds (2473946568 bytes allocated, 10.52% gc time)

@time d = readcsv("test.csv", Float64);
elapsed time: 5.819523024 seconds (2473946568 bytes allocated, 44.50% gc time)

Branch:

julia> @time d = readcsv("test.csv", Any);
elapsed time: 7.013307727 seconds (2792678272 bytes allocated, 46.28% gc time)

julia> @time d = readcsv("test.csv", Int);
elapsed time: 3.993296468 seconds (2473946496 bytes allocated, 9.67% gc time)

julia> @time d = readcsv("test.csv", Float64);
elapsed time: 3.930931881 seconds (2775887328 bytes allocated, 13.01% gc time)

tanmaykm · 2014-12-18T03:09:42Z

Shall merge this tomorrow if there are no objections and is still open.

ViralBShah · 2014-12-18T05:35:46Z

Nice speed bumps. Any idea why the second test is slower? Also, there was a recent try-end return value fix. Not sure if that is relevant here.

vtjnash · 2014-12-18T06:05:34Z

i'm mildly concerned this may have a major speed regression on windows do to the dependence on try/catch. do you happen to have a machine you could test on for comparison?

i would instead suggest:

if sval == "true"
    cells[row,col] = true
elseif sval == "false"
    cells[row,col] = false
elseif (sval[1] == '-' && isdigit(SubString(sval,2))) || isdigit(sval) #xxx: need to handle 0x, 0b, 0o here also
    cells[row,col] = parseint(sval)
elseif float64_isvalid(sval, tmp64)
    cells[row,col] = tmp64[1]
else
    return true
end
return false

or, just rewrite float64_isvalid to return a Nullable{Float64} and get rid of this silly nonsense about needing a output argument for a function because it can only return one value.

tanmaykm · 2014-12-18T06:16:15Z

@ViralBShah I can see the second test result vary between 3.9 to 4.2, so seems to be within the margin. The recent try-end return value fix won't affect this as this does not rely on the return value any more.

@vtjnash Will try the options suggested. I tried to avoid too many string comparisons, but probably that will be faster than the try-catch.

vtjnash · 2014-12-18T06:27:45Z

a six character ascii-string compare is going to be faster than just about anything else you could attempt (plus, it will almost always short circuit after the first character is not matched in t/f). the main issue with my suggestion is defining the is_parsable_as_int functionality. i'm actually more interested in the Nullable rewrite, since that also gets rid of the global variable.

vtjnash · 2014-12-18T06:37:44Z

also, it's worth pointing out that the time to capture a stack trace is going to be proportional to the size of the stack. benchmark code is typically going to have a much shorter call stack than a real function.

(full disclosure: I'm currently testing stack overflow code on windows ed5a8fc via f()=f(); @time try f() end, and it takes a good full minute on my slow atom processor to unwind that stack trace, vs 0.25s on my speedier Xeon linux and i7 mac machines)

tanmaykm · 2014-12-18T06:40:30Z

Ouch. Should definitely avoid the try-catch.

JeffBezanson · 2014-12-18T06:52:26Z

Returning a Nullable from float64_isvalid would be great.

johnmyleswhite · 2014-12-18T17:44:07Z

I like the idea of using Nullable, but have come to think that the parsing code may need to throw numbered errors so that we can distinguish between failure cases. For integers, you at least want to distinguish: (1) is an integer, but caused overflow, (2) byte sequence is a number, but is a floating point number, and (3) byte sequence contains non-numeric characters and is actually a non-numeric string.

tanmaykm · 2014-12-20T18:32:34Z

Here's a draft implementation of float conversion methods returning Nullable instead of requiring an output parameter - https://github.com/tanmaykm/julia/compare/maybefloat (diff).

The Nullable version seems to have a larger gc footprint. I hope I have done it correctly. @vtjnash @JeffBezanson can you please take a look at it?

julia> function time_float64(N::Int64, s::AbstractString)
           println("maybefloat64:")
           gc()
           @time for i in 1:N
               f1 = Base.maybefloat64(s)
           end

           println("float64:")
           gc()
           @time for i in 1:N
               f2 = float64(s)
           end
       end
time_float64 (generic function with 1 method)

julia> time_float64(10^8, "23423432423.234234324324")
maybefloat64:
elapsed time: 19.220756599 seconds (4000000000 bytes allocated, 7.71% gc time)
float64:
elapsed time: 16.762540432 seconds (1600000000 bytes allocated, 3.58% gc time)

vtjnash · 2014-12-20T18:56:29Z

doing the work in C eliminates the ability to perform the crucial optimizations that make this fast. instead, use ccall's ability to interpret output struct parameters and add a C representation of Julia's Nullable{Float64} type:

typedef struct {
    uint8_t isnull;
    double value;
} nullable_float64;
nullable_float64 maybefloat64(char *s) {
   nullable_float64 ret = {1, strtod(s));
   return ret;
}

ccall(:maybefloat64, Nullable{Float64}, (Ptr{Uint8},), s)

tanmaykm · 2014-12-20T20:11:40Z

Thanks! With that change:

julia> time_float32(10^8, "23423432423.234234324324")
maybefloat32:
elapsed time: 8.332165974 seconds (1600000000 bytes allocated, 7.56% gc time)
float32:
elapsed time: 8.755044282 seconds (1600000000 bytes allocated, 6.64% gc time)

julia> time_float64(10^8, "23423432423.234234324324")
maybefloat64:
elapsed time: 16.754755825 seconds (2400000000 bytes allocated, 5.79% gc time)
float64:
elapsed time: 16.434358898 seconds (1600000000 bytes allocated, 3.59% gc time)

vtjnash · 2014-12-20T20:49:47Z

that's good. now my other remaining question would be: why isn't the allocation 0 bytes?

tanmaykm · 2014-12-22T15:38:11Z

This seems to fix the extra allocation tanmaykm@94ee10b, @vtjnash please verify.

Now:

julia> time_float64(10^8, "23423432423.234234324324");
maybefloat64:
elapsed time: 3.64e-6 seconds (0 bytes allocated)
float64:
elapsed time: 4.068e-6 seconds (16 bytes allocated)

vtjnash · 2014-12-22T16:11:15Z

wow, yes, good find! that code largely predated immutable types. that missing condition seems to have been an oversight.

any idea why that affected float64 also?

tanmaykm · 2014-12-22T16:18:49Z

Oops sorry, that was for a single loop (I had commented out the loop for some testing in my repl). Here's the correct result for 10^8 loops:

julia> time_float64(10^8, "23423432423.234234324324");
maybefloat64:
elapsed time: 15.228390363 seconds (0 bytes allocated)
float64:
elapsed time: 16.982878151 seconds (1600000000 bytes allocated, 3.47% gc time)

Introduces following methods that parse a string as the indicated type and return a `Nullable` with the result instead of throwing exception: - `maybeint{T<:Integer}(::Type{T<:Integer},s::AbstractString)` - `maybefloat32(s::AbstractString)` and `maybefloat64(s::AbstractString)` Ref: discussions at JuliaLang#9316, JuliaLang#3631, JuliaLang#5704

Introduces the tryparse method: - tryparse{T<:Integer}(::Type{T<:Integer},s::AbstractString) - tryparse(::Type{Float..},s::AbstractString) - a few variants of the above And: - tryparse(Float.., ...) call the corresponding C functions jl_try_strtof, jl_try_substrtof, jl_try_strtod and jl_try_substrtod. - The parseint, parsefloat, float64_isvalid and float32_isvalid methods wrap the corresponding tryparse methods. - The jl_strtod, jl_strtof, ... functions are wrappers over the jl_try_str... functions. This should fix JuliaLang#10498 as well. Ref: discussions at JuliaLang#9316, JuliaLang#3631, JuliaLang#5704

tanmaykm · 2015-03-18T11:04:48Z

Closing. Will resubmit from a different branch.

tanmaykm · 2015-03-18T15:35:48Z

Continued in #10560

StefanKarpinski reviewed Dec 11, 2014
View reviewed changes

ArchRobison mentioned this pull request Dec 11, 2014

try without catch not documented #9320

Closed

correctly parse all integer types. fixes JuliaLang#9289

5ed0a6b

Used parseint instead of float64_isvalid.

tanmaykm force-pushed the readcsv branch from 6ddedc3 to 5ed0a6b Compare December 14, 2014 04:58

This was referenced Dec 21, 2014

An isconvertible check #3631

Closed

parsing APIs that indicate whether something was possible, rather than raising an error? #5704

Closed

tanmaykm mentioned this pull request Dec 29, 2014

tryparse: parse string to Nullable #9487

Closed

tanmaykm closed this Mar 18, 2015

tanmaykm mentioned this pull request Mar 18, 2015

correctly parse integer types in readdlm. fixes #9289 #10560

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

correctly parse all integer types. fixes #9289 #9316

correctly parse all integer types. fixes #9289 #9316

tanmaykm commented Dec 11, 2014

StefanKarpinski Dec 11, 2014

tanmaykm Dec 11, 2014

tanmaykm Dec 11, 2014

StefanKarpinski Dec 11, 2014

ArchRobison Dec 11, 2014

tanmaykm Dec 11, 2014

JeffBezanson Dec 11, 2014

vtjnash Dec 12, 2014

tanmaykm Dec 12, 2014

johnmyleswhite Dec 12, 2014

tanmaykm commented Dec 14, 2014

tanmaykm commented Dec 18, 2014

ViralBShah commented Dec 18, 2014

vtjnash commented Dec 18, 2014

tanmaykm commented Dec 18, 2014

vtjnash commented Dec 18, 2014

vtjnash commented Dec 18, 2014

tanmaykm commented Dec 18, 2014

JeffBezanson commented Dec 18, 2014

johnmyleswhite commented Dec 18, 2014

tanmaykm commented Dec 20, 2014

vtjnash commented Dec 20, 2014

tanmaykm commented Dec 20, 2014

vtjnash commented Dec 20, 2014

tanmaykm commented Dec 22, 2014

vtjnash commented Dec 22, 2014

tanmaykm commented Dec 22, 2014

tanmaykm commented Mar 18, 2015

tanmaykm commented Mar 18, 2015

correctly parse all integer types. fixes #9289 #9316

correctly parse all integer types. fixes #9289 #9316

Conversation

tanmaykm commented Dec 11, 2014

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tanmaykm commented Dec 14, 2014

tanmaykm commented Dec 18, 2014

ViralBShah commented Dec 18, 2014

vtjnash commented Dec 18, 2014

tanmaykm commented Dec 18, 2014

vtjnash commented Dec 18, 2014

vtjnash commented Dec 18, 2014

tanmaykm commented Dec 18, 2014

JeffBezanson commented Dec 18, 2014

johnmyleswhite commented Dec 18, 2014

tanmaykm commented Dec 20, 2014

vtjnash commented Dec 20, 2014

tanmaykm commented Dec 20, 2014

vtjnash commented Dec 20, 2014

tanmaykm commented Dec 22, 2014

vtjnash commented Dec 22, 2014

tanmaykm commented Dec 22, 2014

tanmaykm commented Mar 18, 2015

tanmaykm commented Mar 18, 2015