Optimize and fix issues with the Parser module #140

Merged · 1 commit · Jun 13, 2016

Conversation

@TotalVerb (Collaborator) commented Apr 30, 2016

Overview of changes:

  • Force parse_string to return Compat.UTF8String
  • Don't use IOBuffer objects in parse_string
  • Sprinkled @inbounds and @inline liberally
  • Make default Dict key typed Compat.UTF8String instead of Any
  • Fix some unicode compatibility along the way
  • Folded the streaming parser into the regular parser, using generic implementations (optimizations for number and string parsing in-memory)

Impact:

Parsing speed for both strings and non-strings has improved substantially on my computer. By my tests we are now easily within a factor of 2 of Python's json, with or without strings, using @samuelcolvin's benchmarks. Note that these benchmarks rely heavily on float parsing, which is still slow, and are ASCII-only, which obscures the biggest performance gain obtained.

Here are the new benchmark results. Note that these benchmarks have been "over-benchmarked" to some degree, since they were used heavily in deciding which optimizations to pursue. Still, they should in theory be a good representation of the kinds of data one would typically parse. Python's builtin json is somewhat unfairly penalized on these benchmarks, because it is pitifully slow on unicode data; ujson, on the other hand, handles it quite well. However, ujson "cheats" by default on the numeric benchmarks by not parsing floats to best precision; I included the "non-cheating" version too for completeness.

Julia 0.5:
 [Bench] canada 0.057967427 seconds
 [Bench] citm_catalog 0.027227154 seconds
 [Bench] citylots 6.882568502 seconds
 [Bench] twitter 0.011117001 seconds
 [Bench] Total (G.M.) 0.10482892077727128 seconds

Python 2 json:
python json canada time: 0.050s
python json citm_catalog time: 0.032s
python json citylots time: 8.600s
python json twitter time: 0.552s
GM 0.295 (2.81 times Julia)

Python 2 ujson:
python ujson canada time: 0.026s
python ujson citm_catalog time: 0.014s
python ujson citylots time: 5.535s
python ujson twitter time: 0.005s
GM 0.056 (0.534 times Julia)

Python 2 ujson (no cheating):
python ujson canada time: 0.045s
python ujson citm_catalog time: 0.014s
python ujson citylots time: 6.687s
python ujson twitter time: 0.005s
GM 0.068 (0.649 times Julia)

Old Julia 0.5
 [Bench] canada 0.100768632 seconds
 [Bench] citm_catalog 0.089031051 seconds
 [Bench] citylots 16.282590906 seconds
 [Bench] twitter 0.043230788 seconds
 [Bench] Total (G.M.) 0.2819005229832309 seconds (2.69 times new Julia)

This change involves a rewrite of object, number, and array parsing code, which also makes things more correct. It removes all of the allowed failures for JSON checker.

Breakage:

The streaming parsers no longer keep track of what has been parsed, and so the error messages are now effectively worthless.

Closes #98, #118, #122, #127.

@TotalVerb (Collaborator, author)

A lot of the remaining time for the no-strings test is due to the type instability of parse_number. I am wondering whether it is worth it to simply return Float64 always from this method. JSON itself does not distinguish between integers and decimals, and JavaScript uses Float64 for everything.

I will try out some speculative typing for arrays, which might still give a small performance boost.
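The trade-off under discussion can be sketched as follows (a minimal illustration in modern Julia syntax; the function names are hypothetical, not from this PR):

```julia
# Type-unstable: the compiler infers Union{Int64, Float64}, so callers
# of this function must branch on the result type at runtime.
unstable_number(s::AbstractString) =
    occursin('.', s) ? parse(Float64, s) : parse(Int64, s)

# Type-stable alternative: always return Float64, as JavaScript does.
# Inference is exact, at the cost of not round-tripping large integers.
stable_number(s::AbstractString) = parse(Float64, s)

unstable_number("42")   # 42   (an Int64)
stable_number("42")     # 42.0 (a Float64)
```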

@kmsquire (Contributor) commented May 1, 2016

@TotalVerb, thanks for doing this! I'll note that this package sometimes doesn't get much attention, so if you find this PR languishing, please bump it once or twice (or more).

@nalimilan (Member)

> A lot of the remaining time for the no-strings test is due to the type instability of parse_number. I am wondering whether it is worth it to simply return Float64 always from this method. JSON itself does not distinguish between integers and decimals, and JavaScript uses Float64 for everything.

Makes sense. Type instability is generally a mistake in Julia anyways.

@TotalVerb (Collaborator, author)

This will need an update to support the latest master's String changes. I plan to work on that later tonight.

@TotalVerb (Collaborator, author) commented May 12, 2016

A major issue now is that nextind is very slow for the unified String type. Somehow the switch from ASCIIString to String led to over 50% performance regression on this branch. I have a few ideas for regaining some of the lost performance by avoiding nextind where unnecessary. (e.g. making String from String can be done by byte-copying instead of character-copying, exploiting the fact that ASCII bytes only occur in UTF8 if they are actually encoded ASCII codepoints.)

Master does not have nearly as hard a time dealing with the change to String (though it too suffers some). But in large part, that is because master uses + where nextind may be required.
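The byte-copying idea relies on a UTF-8 invariant: a byte below 0x80 is always a complete ASCII codepoint, never part of a multi-byte sequence. A rough sketch in modern Julia (the 2016 code would have used the s.data field rather than codeunits; byte_slice is a hypothetical name):

```julia
# Extract a substring by copying raw bytes. This is safe whenever the
# boundary indices i and j were found by scanning for ASCII delimiters,
# because ASCII bytes cannot occur inside a multi-byte UTF-8 character.
function byte_slice(s::String, i::Int, j::Int)
    b = codeunits(s)
    String(b[i:j])   # one byte copy; no per-character nextind walk
end

byte_slice("héllo world", 8, 12)  # "world" (é occupies bytes 2-3)
```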

@stevengj (Member)

cc @StefanKarpinski, who will be interested to hear of the String change impacts.

@TotalVerb (Collaborator, author) commented May 16, 2016

Actually, I think I was mistaken about the source of the performance regression. nextind probably isn't at fault. But the benchmarks are definitely slower; I just don't know why. I'll do some more profiling.

@TotalVerb (Collaborator, author) commented May 16, 2016

Here is a condensed benchmark that suffers from significant regression:

function f(s)
    n = 0
    m = start(s)
    e = endof(s)
    while m ≤ e
        n += Int(s[m])
        m = nextind(s, m)
    end
    n
end

Pre-stringapalooza, on UTF8String this is over 10 times as slow as on ASCIIString. String seems to have inherited UTF8String's speed.

A lot of the JSON code looks like this. (Except that the present version uses m += 1 instead of nextind(s, m) in several places, not all of which are places where m += 1 is valid.) So it is not surprising that this 10-fold regression turns into a 50% regression when working with ASCII data.
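For comparison, an ASCII-oriented rewrite of the benchmark above that iterates code units directly and avoids nextind entirely (a sketch in modern Julia; for pure-ASCII input it computes the same sum):

```julia
function f_bytes(s::AbstractString)
    n = 0
    # codeunits exposes the raw UTF-8 bytes; for ASCII data each byte
    # is a whole codepoint, so no index arithmetic is needed.
    for b in codeunits(s)
        n += Int(b)
    end
    n
end

f_bytes("abc")  # 294, the same as summing Int('a') + Int('b') + Int('c')
```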

@yuyichao (Contributor)

It's not surprising that the new String inherits UTF8String's speed since that's exactly what it is.

I've also seen a similar effect in https://groups.google.com/forum/?fromgroups=#!topic/julia-users/9zUDciJk878 where using nextind shows a 2x slowdown. I remember @StefanKarpinski was talking about more efficient iteration over string indices, and I was expecting it to come in one of the future related changes.

@TotalVerb (Collaborator, author)

I think the major issues (excepting compatibility with v0.3) are sorted out now. I'll see if reusing buffers still makes sense, and then I'll make sure all the code additions are actually having substantial impact.

@TotalVerb (Collaborator, author) commented May 22, 2016

This PR is nearing maturity. Overall, non-test code has grown by about 20 to 30 lines. (Some of the added lines are comments.) However, given that a very good chunk of the tests that failed before now pass, I think the slight increase in LOC is reasonable. The only functions which have seen their complexity increase are parse_string and parse_simple. I don't know if there's a way to keep parse_string as simple as it was before while maintaining correctness and efficiency, at least until nextind performance becomes much better. parse_simple on the other hand was behaving quite oddly earlier (accepting tXXe as true, which for some reason was even tested?!), and I think it is better to stick to the JSON standard at some small complexity cost.

The only downside to this change (as far as I can see) is that parse_number has less range on 64-bit machines, since it now always returns a Float64. (Before, it could parse 64-bit integers on 64-bit machines... but would also crash on those same integers on 32-bit machines, so it is debatable whether that behaviour was good.) It might be useful, in a later PR, to let the user supply a custom parse_number function. Especially now that function arguments can be specialized on, this would work similarly to how a dict type can currently be passed in.

v0.3 compatibility is back, but the timing is slightly worse than on newer versions. I don't think that is a big deal.
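One way the suggested extension point could look (purely illustrative; this is not an API the PR adds, and the names are hypothetical): the caller passes a number-parsing function, and the default uses Int64 explicitly so 32-bit builds don't crash on large integers:

```julia
# Default: try a 64-bit integer first, fall back to Float64.
function default_number(s::AbstractString)
    x = tryparse(Int64, s)
    x === nothing ? parse(Float64, s) : x
end

# Hypothetical entry point that lets callers swap the number parser,
# analogous to how a dict type can already be passed in.
parse_scalar(s::AbstractString; number_parser = default_number) = number_parser(s)

parse_scalar("10")                                          # 10
parse_scalar("10"; number_parser = s -> parse(Float64, s))  # 10.0
```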

@@ -296,8 +296,7 @@ twitter = "{\"results\":[
\"next_page\":\"?page=2&max_id=1480307926&q=%40twitterapi\",
\"completed_in\":0.031704,
\"page\":1,
\"query\":\"%40twitterapi\"}
}"
Contributor:

what was happening here?

@TotalVerb (Collaborator, author), May 22, 2016:

There was an extra close brace in the original test, so it was invalid JSON. This change makes parse more picky about that. I don't think it's a good idea to silently accept extra characters after the end, which was the original behaviour.

Contributor:

As an aside, using triple-quotes (""") would make this file much more readable. (I suspect it was originally created before triple-quotes were part of the language.)

Contributor:

> I suspect it was originally created before triple-quotes were part of the language

Indeed it was.

@TotalVerb (Collaborator, author)

I was wondering why the array and object parsing code seems slower recently. It turns out that a recent refactor (moving the burden of throwing the error to a helper function) has led to the construction of a bunch of temporary strings. This can be fixed by removing the second argument, but unfortunately the tailored error messages (like Expected a '{' here) will be gone too. A macro would work too, and preserve the tailored error messages, but that looks ugly for some reason...

@tkelman (Contributor) commented May 22, 2016

could delay constructing the string until the error gets displayed?
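A sketch of that suggestion (type and field names hypothetical): the exception stores only the raw context, and the message string is built in showerror, on the cold display path rather than the hot parsing path:

```julia
struct ExpectedCharError <: Exception
    expected::Char   # the structural character the parser wanted
    got::UInt8       # the byte actually found
    pos::Int         # byte offset in the input
end

function Base.showerror(io::IO, e::ExpectedCharError)
    # String construction happens only if the error is displayed.
    print(io, "Expected '", e.expected, "' at byte ", e.pos,
          ", got '", Char(e.got), "'")
end

sprint(showerror, ExpectedCharError('{', UInt8('x'), 7))
# "Expected '{' at byte 7, got 'x'"
```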

@codecov-io commented May 22, 2016

Current coverage is 97.01%

Merging #140 into master will increase coverage by 0.64%

@@             master       #140   diff @@
==========================================
  Files             2          3     +1   
  Lines           331        335     +4   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits            319        325     +6   
+ Misses           12         10     -2   
  Partials          0          0          

Powered by Codecov. Last updated by bec34de...dcab8a6


export parse

type ParserState{T<:AbstractString}
str::T
# FIXME: remove this when @static in Compat
Contributor:

better to submit this to Compat first

Collaborator (PR author):

OK, I'll send a PR.

@TotalVerb (Collaborator, author)

Should tryparse maybe also go in Compat.jl? Then we could be rid of almost all compatibility code.

@@ -11,7 +11,7 @@ notifications:
sudo: false
script:
- if [[ -a .git/shallow ]]; then git fetch --unshallow; fi
- julia -e 'Pkg.clone(pwd()); Pkg.build("JSON"); Pkg.test("JSON"; coverage=true)';
- julia -e 'Pkg.checkout("Compat"); Pkg.clone(pwd()); Pkg.build("JSON"); Pkg.test("JSON"; coverage=true)';
Contributor:

it won't be installed yet, you'd have to do this after cloning JSON from pwd

not a bad idea re: submitting tryparse to Compat, not sure why no one has asked for it yet but if it's useful here then it's bound to be useful somewhere else

Collaborator (PR author):

Whoops, thanks for the catch. I'll make a PR for tryparse shortly.

@TotalVerb (Collaborator, author), May 24, 2016:

It turns out tryparse is already in Compat! It's just the 3-argument form (with radix as third argument) that's missing, which is currently used in parsing unicode escapes. That form might be useful elsewhere too, so I'll submit a PR adding it.

@TotalVerb (Collaborator, author) commented May 24, 2016

I think the Int/Float64 change may be more breaking than I had expected. At least Transit.jl, and any package expecting Ints to round-trip, may be relying on that behaviour.

It's probably best to revert the type-stability change for now, until there's actually a way to specify the number parse function. Then the type-unstable parser can be deprecated and packages given time to adjust. (However, a good change would be to swap Int64 for Int, so that 32-bit builds do not crash.) This will offset some of the performance gains, but we would still be faster than before.

@kmsquire (Contributor)

> I think the Int/Float64 change may be more breaking than I had expected. At least Transit.jl and any package expecting Ints to roundtrip, may be relying on that behaviour.

Since it's not in the JSON standard, packages shouldn't really be relying on it. A deprecation warning would be appropriate.

(To be fair, I argued differently when originally introducing OrderedDict's...)

@TotalVerb (Collaborator, author) commented May 25, 2016

There is no real way to deprecate producing Ints as numbers without "deprecating" a whole lot of valid JSON. I think I'll run the tests for some packages that depend on JSON.jl, and see if too much breaks.

Also, Python's json module does have the int special-casing behaviour. It seems the only languages whose JSON modules don't are those that don't have real integer types (to be fair, that includes the JavaScript that JSON derives its name from). This is an argument against making this change.

@kmsquire (Contributor)

Not that we have to follow Python, but I wasn't aware that they converted to int. I'm more ok with us doing so as well then.

@TotalVerb (Collaborator, author) commented May 26, 2016

I can't reproduce the 0.3 Travis failure 😕.

Edit: never mind, I thought I had symlinked the package dir but apparently not.

@TotalVerb (Collaborator, author)

I restored the int-parsing behaviour. Performance has degraded a little bit, as expected. But overall it's not so bad, and for some reason it looks like compilation time is much lower. We also can get a bit of performance gain from using DataStructures.OrderedDict, if users are inclined to squeeze out every last drop of performance.

@TotalVerb (Collaborator, author) commented May 28, 2016

Great news! nextind is quite costly in many parts of the code, and specializing that for UTF8String has improved the geometric mean benchmark results on my machine to 0.17 (down from 0.20). This is even lower than before parse_number type stability was reverted.

This particular change might be applicable to Base too, as it's a 2-fold speed improvement over the current implementation of nextind.

@stevengj (Member) commented May 28, 2016

In this particular application, however, it's not clear to me why you need nextind at all. Since parsing JSON only involves looking for ASCII characters and matching substrings, you should be able to do nearly everything by operating on the raw s.data byte array, rather than working with Unicode chars.

Definitely submit a PR with the nextind improvements, though.
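The byte-level approach works because every structural JSON character (braces, brackets, colon, comma, quotes, whitespace) is ASCII. A minimal sketch, using modern Julia's codeunits in place of the s.data field of 2016 (the helper name is hypothetical):

```julia
# Advance past insignificant JSON whitespace by comparing raw bytes;
# no Char decoding and no nextind.
function skip_whitespace(b::AbstractVector{UInt8}, i::Int)
    while i <= length(b) &&
            (b[i] == UInt8(' ')  || b[i] == UInt8('\t') ||
             b[i] == UInt8('\n') || b[i] == UInt8('\r'))
        i += 1
    end
    i
end

skip_whitespace(codeunits("  \t{\"a\":1}"), 1)  # 4 — the index of the '{' byte
```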

@TotalVerb (Collaborator, author) commented May 28, 2016

@stevengj Yes, you're right. In fact some of the performance gain comes from avoiding nextind in string and number parsing. Unfortunately these improved versions somewhat duplicate the generic versions, which exist to provide a performance and memory advantage when parsing a stream, and also because I thought removing the generic versions, which were available (to a limited extent) before, would be somewhat of a regression.

I did consider avoiding it everywhere, but that currently makes the code somewhat messy (comparisons like == @compat UInt('x') everywhere), and would require a rethink of the streaming parser.

But to get more performance, that seems to be necessary. I'll see if I can figure something out.

@stevengj (Member)

(Can some of the byte-based parsing optimizations be folded back in to Base? e.g. number parsing of UTF-8 strings seems like it could always avoid nextind as long as only ASCII digits are allowed.)

@stevengj (Member)

I guess you could define const _x = @compat UInt('x') to simplify the code somewhat.
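Expanded slightly, that idea becomes a small table of named byte constants; LATIN_T, LATIN_N, and friends appearing in the diff are the form this took in the parser (the predicate below is only an illustration). On Julia versions where Compat is required, each right-hand side would be written @compat UInt8('t'), and so on:

```julia
# One-time conversions; the hot-loop comparisons then read naturally.
const LATIN_T = UInt8('t')   # start of "true"
const LATIN_F = UInt8('f')   # start of "false"
const LATIN_N = UInt8('n')   # start of "null"

# Hypothetical helper: does this byte begin a JSON literal?
is_literal_start(b::UInt8) = b == LATIN_T || b == LATIN_F || b == LATIN_N
```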

@TotalVerb (Collaborator, author) commented Jun 11, 2016

Getting there... did some more cleanup and squashed commits. I reorganized the directory structure and moved error message constants out of the functions themselves. Also updated the commit message.

I don't plan on making any further major changes, but I will of course address any remaining review comments.

@kmsquire (Contributor)

> I don't plan on making any further major changes, but I will of course address any remaining review comments.

If it's ready for review, it would be good to change [WIP] to [RFC] in the title.

@TotalVerb TotalVerb changed the title [WIP] Microoptimize parser [RFC] Microoptimize parser Jun 11, 2016
@TotalVerb (Collaborator, author)

Done.

@kmsquire (Contributor)

One general comment: unless we expect the code base to become much more complicated, I don't think it's necessary to have a src/Parser directory.

@kmsquire (Contributor)

Also, it's always seemed a bit odd to me that JSON printing is kept in JSON.jl, but the parser stuff was (originally) moved to parser.jl. I think it would be good to also move the printing code out of JSON.jl to a separate file that's just included in JSON.jl. (Having the main file just include a bunch of other files is a typical structure in many packages.)

_String(b)
end

function predict_string(ps::MemoryParserState)
Contributor:

Can you add a short comment here about what this function is doing? It's possible to figure out, but it wasn't obvious to me from the name.

Collaborator (PR author):

It's hard to make the comment short, since what the function does is a little strange. I added a docstring for this and the following function.

@kmsquire (Contributor)

Generally, looks quite good--thanks again for doing this! I think it should be mergeable after you address (or at least answer) the comments above. Some of that can wait until future PRs.

@TotalVerb (Collaborator, author) commented Jun 11, 2016

Thanks for the comments! I'll address them soon. As for the directory structure, this seems to have been the pattern that will be adopted by Base (JuliaLang/julia#16850). Of course, Base is much more complicated of a codebase too, so maybe we don't need this additional level of organization. I must be doing too much work with the JVM these days 😉.

The main JSON.jl file is a bit awkward because it is almost entirely printing... except for parsefile. Maybe this function should be moved to the Parser module? There was more stuff before when the streaming parser was here too, but that's gone now.

@TotalVerb (Collaborator, author)

All the issues except for the organizational ones should be addressed (or answered) now. I'm inclined to defer all the organizational issues to later, since they don't seem to be related to this PR.

@tkelman (Contributor) commented Jun 11, 2016

It would be a lot easier to see the actual changes being made in a relative sense if the file moves vs modifications were at least separate commits. Right now github shows the old file as completely deleted and the new files as created from scratch, but I don't think you rewrote every line, did you?

@TotalVerb (Collaborator, author)

Good point. I'll revert the file moves; those can come back if desired in the future.

@TotalVerb TotalVerb changed the title [RFC] Microoptimize parser [RFC] Optimize and fix issues with the Parser module Jun 11, 2016
@TotalVerb (Collaborator, author)

Of course, I think over the course of this PR most lines in Parser.jl were changed, so the diff isn't that much better.

false
elseif c == LATIN_N
skip!(ps, LATIN_U, LATIN_L, LATIN_L)
nothing
Contributor:

I think it would be good to add a comment to each condition specifying what it's testing for, e.g.

    if c == LATIN_T  #  "true"
        ...

It's decodable right now... but a comment can just be read.

Collaborator (PR author):

Agreed, that would be better. (skipped ci for comment only change)

This was referenced Jun 12, 2016
@kmsquire (Contributor)

LGTM. I think this can be merged--I'll do so tomorrow if there are no additional comments.

Thanks for all of your work here, @TotalVerb!

@TotalVerb (Collaborator, author)

This will probably need a squash and rebase first.

@TotalVerb TotalVerb changed the title [RFC] Optimize and fix issues with the Parser module Optimize and fix issues with the Parser module Jun 12, 2016
Overview of changes:

`parse_string`
 • now type stable
 • allocates much less when parsing in-memory data
 • fast path for strings without escape sequences

`parse_number`
 • now non-allocating for ints (still allocates for floats)
 • correctly rejects more kinds of invalid numbers

`parse_array`
 • slightly improved efficiency
 • correctly rejects invalid array formats

`parse_object`
 • substantially improved efficiency
 • default key type changed to a concrete type

`parse_simple`
 • no longer blindly accepts all words of the form t**e, f***e, or n**l
 • naming convention: instead of simples, code now refers to jsconstants

streaming parser
 • replaced the brittle streaming parser with a new kind of ParserState
 • much faster than the old streaming parser and less error-prone

miscellaneous
 • use `@inbounds` in more places for better performance
 • fix some possible unicode compatibility issues
 • add some docstrings in various places
 • factor out error messages
 • add file with constant definitions so `@compat` can be avoided
 • fix up some weird tests
 • delete code to allow certain failures (no more allowed failures)

known issues
 • streaming parser doesn’t keep state and so can’t offer informative errors
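The `parse_string` fast path listed above can be sketched roughly like this (hypothetical helper in modern Julia): scan bytes up to the closing quote, and only fall back to a slower unescaping pass if a backslash was seen:

```julia
# Scan the bytes of a string body starting at index i (just past the
# opening quote). Returns (has_escape, content); the escaped case would
# be handled by a separate unescaping routine, omitted in this sketch.
function scan_string(b::AbstractVector{UInt8}, i::Int)
    has_escape = false
    j = i
    while b[j] != UInt8('"')
        if b[j] == UInt8('\\')
            has_escape = true
            j += 2            # skip the escaped character
        else
            j += 1
        end
    end
    # Fast path: no escapes, so the content is a plain byte copy.
    has_escape, String(b[i:j-1])
end

scan_string(codeunits("hello\" rest"), 1)  # (false, "hello")
```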
@TotalVerb (Collaborator, author)

Squashed and rebased.

8 participants