Optimize and fix issues with the `Parser` module #140

Conversation
A lot of the remaining time for the no-strings test is due to type instability. I will try out some speculative typing for arrays, which might still give a small performance boost.
@TotalVerb, thanks for doing this! I'll note that this package sometimes doesn't get much attention, so if you find this PR languishing, please bump it once or twice (or more).
Makes sense. Type instability is generally a mistake in Julia anyway.
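To illustrate the kind of issue being discussed (a hypothetical sketch, not the package's actual code): a function whose return type depends on a runtime value is type-unstable, so Julia cannot fully specialize its callers.

```julia
# Hypothetical illustration: returning Int for whole numbers and Float64
# otherwise makes the return type depend on the input's runtime value.
parse_num_unstable(s) = '.' in s ? parse(Float64, s) : parse(Int, s)

# A type-stable variant always returns one concrete type:
parse_num_stable(s) = parse(Float64, s)

parse_num_unstable("3")   # Int 3
parse_num_stable("3")     # 3.0
```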
This will need an update to support the latest master's String changes. I plan to work on that later tonight.
A major issue now is the performance regression, though master does not have nearly as hard a time dealing with the change.
cc @StefanKarpinski, who will be interested to hear about this.
Actually, I think I was mistaken about the source of the performance regression.
Here is a condensed benchmark that suffers from significant regression:

```julia
function f(s)
    n = 0
    m = start(s)
    e = endof(s)
    while m ≤ e
        n += Int(s[m])
        m = nextind(s, m)
    end
    n
end
```

Pre-stringapalooza, this was substantially faster. A lot of the JSON code looks like this. (Except that the present version uses a slightly different formulation.)
It's not surprising that the new string type is slower here. I've also seen a similar effect in https://groups.google.com/forum/?fromgroups=#!topic/julia-users/9zUDciJk878.
I think the major issues (excepting compatibility with v0.3) are sorted out now. I'll see if reusing buffers still makes sense, and then I'll make sure all the code additions are actually having substantial impact.
This PR is nearing maturity. Overall, non-test code has grown by about 20 to 30 lines. (Some of the added lines are comments.) However, given that a very good chunk of the tests that failed before now pass, I think the slight increase in LOC is reasonable. Only a few functions have seen their complexity increase. The only downside to this change (as far as I can see) is that, while v0.3 compatibility is back, the timing is slightly worse than on newer versions. I don't think that is a big deal.
```diff
@@ -296,8 +296,7 @@ twitter = "{\"results\":[
 \"next_page\":\"?page=2&max_id=1480307926&q=%40twitterapi\",
 \"completed_in\":0.031704,
 \"page\":1,
 \"query\":\"%40twitterapi\"}
 }"
```
what was happening here?
There was an extra close brace in the original test, so it was invalid JSON. This change makes `parse` more picky about that. I don't think it's a good idea to silently accept extra characters after the end, which was the original behaviour.
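A minimal sketch of the stricter behaviour described here (a hypothetical helper, not the PR's actual code): once the top-level value has been consumed, anything other than trailing whitespace is rejected.

```julia
# Hypothetical sketch: reject input with non-whitespace after the JSON value.
# `i` is the index just past the parsed value.
function check_trailing(s::AbstractString, i::Int)
    while i <= lastindex(s) && isspace(s[i])
        i = nextind(s, i)
    end
    i <= lastindex(s) && error("unexpected trailing character at index $i")
    nothing
end

check_trailing("{} \n", 3)   # ok: only whitespace remains
# check_trailing("{}}", 3)   # would throw: extra close brace
```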
As an aside, using triple-quotes (`"""`) would make this file much more readable. (I suspect it was originally created before triple-quotes were part of the language.)
> I suspect it was originally created before triple-quotes were part of the language

Indeed it was.
I was wondering why the array and object parsing code seemed slower recently. It turns out that a recent refactor (moving the burden of throwing the error to a helper function) has led to the construction of a bunch of temporary strings. This can be fixed by removing the second argument, but unfortunately the tailored error messages would be lost.
Could we delay constructing the string until the error gets displayed?
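One way to do that (a sketch with hypothetical names and current syntax, not what the PR ended up with) is to store the raw context in the exception and build the message only inside `showerror`:

```julia
# Hypothetical sketch: the hot path only stores a couple of bytes; the message
# string is constructed lazily, when the error is actually displayed.
struct ExpectedCharError <: Exception
    expected::Char
    found::UInt8
end

Base.showerror(io::IO, e::ExpectedCharError) =
    print(io, "expected '", e.expected, "' but got '", Char(e.found), "'")

# throw(ExpectedCharError(':', UInt8(','))) would display:
# expected ':' but got ','
```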
Current coverage is 97.01%

```
@@            master    #140   diff @@
==========================================
  Files            2       3     +1
  Lines          331     335     +4
  Methods          0       0
  Messages         0       0
  Branches         0       0
==========================================
+ Hits           319     325     +6
+ Misses          12      10     -2
  Partials         0       0
```
```julia
export parse

type ParserState{T<:AbstractString}
    str::T
    # FIXME: remove this when @static in Compat
```
better to submit this to Compat first
OK, I'll send a PR.
```diff
@@ -11,7 +11,7 @@ notifications:
 sudo: false
 script:
   - if [[ -a .git/shallow ]]; then git fetch --unshallow; fi
-  - julia -e 'Pkg.clone(pwd()); Pkg.build("JSON"); Pkg.test("JSON"; coverage=true)';
+  - julia -e 'Pkg.checkout("Compat"); Pkg.clone(pwd()); Pkg.build("JSON"); Pkg.test("JSON"; coverage=true)';
```
It won't be installed yet; you'd have to do this after cloning JSON from pwd.

Not a bad idea re: submitting `tryparse` to Compat. Not sure why no one has asked for it yet, but if it's useful here then it's bound to be useful somewhere else.
Whoops, thanks for the catch. I'll make a PR for `tryparse` shortly.
It turns out `tryparse` is already in Compat! It's just the 3-argument form (with radix as third argument) that's missing, which is currently used in parsing unicode escapes. That form might be useful elsewhere too, so I'll submit a PR adding it.
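For reference, the 3-argument form in question might be used on a `\uXXXX` escape roughly like this (a sketch; around the time of this PR the radix was a positional argument, while modern Julia takes it as the `base` keyword):

```julia
# Sketch: decode the four hex digits of a unicode escape.
hexdigits = "00e9"
n = tryparse(UInt16, hexdigits, base = 16)  # modern keyword form
n === nothing && error("invalid unicode escape")
Char(n)  # 'é'
```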
I think the int-parsing change may break packages that rely on the old behaviour. It's probably best to revert the type-stability change for now, until there's actually a way to specify the number parse function. Then the type-unstable parser can be deprecated and packages given time to adjust. (However, swapping the defaults later would be a good change.)
Since it's not in the JSON standard, packages shouldn't really be relying on it. A deprecation warning would be appropriate. (To be fair, I argued differently when originally introducing it.)
There is no real way to deprecate producing `Int`s. Also, Python's `json` parses whole numbers to `int`.
Not that we have to follow Python, but I wasn't aware that they converted to int. I'm more ok with us doing so as well then.
I can't reproduce the 0.3 Travis failure 😕. Edit: never mind, I thought I had symlinked the package dir but apparently not.
I restored the int-parsing behaviour. Performance has degraded a little, as expected, but overall it's not so bad, and for some reason compilation time looks much lower. We may also be able to get a bit more of a performance gain elsewhere.
Great news! This particular change might be applicable to Base too, as it's a 2-fold speed improvement over the current implementation.
In this particular application, however, it's not clear to me why you need it.

Definitely submit a PR with the improvement.
@stevengj Yes, you're right. In fact some of the performance gain comes from avoiding `nextind` in string and number parsing. Unfortunately these improved versions somewhat duplicate the generic versions, which exist to provide a performance and memory advantage when parsing a stream, and also because I thought removing the generic versions, which were available (to a limited extent) before, would be somewhat of a regression. I did consider avoiding `nextind` everywhere, but that currently makes the code somewhat messy. To get more performance, though, that seems to be necessary; I'll see if I can figure something out.
(Can some of the byte-based parsing optimizations be folded back into Base? e.g. number parsing of UTF-8 strings seems like it could always avoid `nextind`.)
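As an illustration of the idea (a hypothetical helper, not Base's or this PR's implementation): in UTF-8, ASCII digits are single bytes, so an integer can be scanned bytewise with no `nextind` calls at all.

```julia
# Hypothetical sketch: bytewise integer scan over UTF-8 data.
function scan_int(bytes::Vector{UInt8}, i::Int)
    n = 0
    while i <= length(bytes) && UInt8('0') <= bytes[i] <= UInt8('9')
        n = 10n + (bytes[i] - UInt8('0'))
        i += 1
    end
    return n, i  # value and index of the first non-digit byte
end

scan_int(Vector{UInt8}("123abc"), 1)  # → (123, 4)
```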
I guess you could define a specialized method for that.
Getting there... did some more cleanup and squashed commits. I reorganized the directory structure and moved error message constants out of the functions themselves. Also updated the commit message. I don't plan on making any further major changes, but I will of course address any remaining review comments.
If it's ready for review, it would be good to update the PR title accordingly.
Done.
One general comment: unless we expect the code base to become much more complicated, I don't think it's necessary to have a separate subdirectory.
Also, it's always seemed a bit odd to me that JSON printing is kept in JSON.jl, but the parser stuff was (originally) moved to its own module.
```julia
    _String(b)
end

function predict_string(ps::MemoryParserState)
```
Can you add a short comment here about what this function is doing? It's possible to figure out, but it wasn't obvious to me from the name.
It's hard to make the comment short, since what the function does is a little strange. I added a docstring for this and the following function.
Generally, looks quite good--thanks again for doing this! I think it should be mergeable after you address (or at least answer) the comments above. Some of that can wait until future PRs.
Thanks for the comments! I'll address them soon. As for the directory structure, this seems to be the pattern that will be adopted by Base (JuliaLang/julia#16850). Of course, Base is a much more complicated codebase, so maybe we don't need this additional level of organization. I must be doing too much work with the JVM these days 😉. The main JSON.jl file is a bit awkward because it is almost entirely printing, with a few exceptions.
All the issues except for the organizational ones should be addressed (or answered) now. I'm inclined to defer all the organizational issues to later, since they don't seem to be related to this PR.

It would be a lot easier to see the actual changes being made in a relative sense if the file moves vs. modifications were at least separate commits. Right now GitHub shows the old file as completely deleted and the new files as created from scratch, but I don't think you rewrote every line, did you?

Good point. I'll revert the file moves; those can come back if desired in the future.
Of course, I think over the course of this PR most lines in Parser.jl were changed, so the diff isn't that much better.
```julia
    false
elseif c == LATIN_N
    skip!(ps, LATIN_U, LATIN_L, LATIN_L)
    nothing
```
I think it would be good to add a comment to each condition specifying what it's testing for, e.g.

```julia
if c == LATIN_T  # "true"
    ...
```

It's decodable right now... but a comment can just be read.
Agreed, that would be better. (Skipped CI for a comment-only change.)
LGTM. I think this can be merged--I'll do so tomorrow if there are no additional comments. Thanks for all of your work here, @TotalVerb!

This will probably need a squash and rebase first.
Overview of changes:

`parse_string`
- now type stable
- allocates much less when parsing in-memory data
- fast path for strings without escape sequences

`parse_number`
- now non-allocating for ints (still allocates for floats)
- correctly rejects more kinds of invalid numbers

`parse_array`
- slightly improved efficiency
- correctly rejects invalid array formats

`parse_object`
- substantially improved efficiency
- changed the default key type to a concrete type

`parse_simple`
- don't blindly accept all words of form t**e, f***e, or n**l
- naming convention: instead of "simples", the code now refers to "jsconstants"

streaming parser
- replaced the brittle streaming parser with a new kind of ParserState
- much faster than the old streaming parser and less error-prone

miscellaneous
- use `@inbounds` in more places for better performance
- fix some possible unicode compatibility issues
- add docstrings in various places
- factor out error messages
- add a file with constant definitions so `@compat` can be avoided
- fix up some weird tests
- delete the code allowing certain failures (no more allowed failures)

known issues
- the streaming parser doesn't keep state and so can't offer informative errors
Squashed and rebased.
Overview of changes:
Impact:
Speed for parsing both strings and non-strings has improved substantially on my computer. By my tests we are now easily within a factor of 2 of Python's json, with or without strings, using @samuelcolvin's benchmarks. Note that these benchmarks rely heavily on float parsing, which is still slow, and are ASCII-only, which obscures the biggest performance gain obtained.
Here are the new benchmark results. Note that these benchmarks have been "over-benchmarked" to some degree, since they were used heavily in deciding which optimizations to pursue. However, they should in theory represent the kinds of data one would parse quite well. Python's builtin json is somewhat unfairly penalized on these benchmarks, because it is pitifully slow on unicode data. ujson on the other hand is quite good at it. However, ujson "cheats" by default on the numeric benchmarks by not parsing floats to best precision; I included the "non-cheating" version too for completeness.
This change involves a rewrite of object, number, and array parsing code, which also makes things more correct. It removes all of the allowed failures for JSON checker.
Breakage:
The streaming parsers no longer keep track of what has been parsed, and so the error messages are now effectively worthless.
Closes #98, #118, #122, #127.