Optimize and fix issues with the Parser module #140

Merged · 1 commit · Jun 13, 2016

Conversation

@TotalVerb (Collaborator) commented Apr 30, 2016

Overview of changes:

  • Force parse_string to return Compat.UTF8String
  • Don't use IOBuffer objects in parse_string
  • Sprinkled @inbounds and @inline liberally
  • Make default Dict key typed Compat.UTF8String instead of Any
  • Fix some unicode compatibility along the way
  • Folded the streaming parser into the regular parser, using generic implementations (optimizations for number and string parsing in-memory)

Impact:

Parsing speed for both strings and non-strings has improved substantially on my computer. By my tests we are now easily within a factor of 2 of Python's json, with or without strings, using @samuelcolvin's benchmarks. Note that these benchmarks rely heavily on float parsing, which is still slow, and are ASCII-only, which obscures the biggest performance gain obtained.

Here are the new benchmark results. Note that these benchmarks have been "over-benchmarked" to some degree, since they were used heavily in deciding which optimizations to pursue. Still, they should in theory be a good representation of the kinds of data one would typically parse. Python's builtin json is somewhat unfairly penalized on these benchmarks, because it is pitifully slow on unicode data; ujson, on the other hand, handles it quite well. However, ujson "cheats" by default on the numeric benchmarks by not parsing floats to best precision; I included the "non-cheating" version too for completeness.

Julia 0.5:
 [Bench] canada 0.057967427 seconds
 [Bench] citm_catalog 0.027227154 seconds
 [Bench] citylots 6.882568502 seconds
 [Bench] twitter 0.011117001 seconds
 [Bench] Total (G.M.) 0.10482892077727128 seconds

Python 2 json:
python json canada time: 0.050s
python json citm_catalog time: 0.032s
python json citylots time: 8.600s
python json twitter time: 0.552s
GM 0.295 (2.81 times Julia)

Python 2 ujson:
python ujson canada time: 0.026s
python ujson citm_catalog time: 0.014s
python ujson citylots time: 5.535s
python ujson twitter time: 0.005s
GM 0.056 (0.534 times Julia)

Python 2 ujson (no cheating):
python ujson canada time: 0.045s
python ujson citm_catalog time: 0.014s
python ujson citylots time: 6.687s
python ujson twitter time: 0.005s
GM 0.068 (0.649 times Julia)

Old Julia 0.5
 [Bench] canada 0.100768632 seconds
 [Bench] citm_catalog 0.089031051 seconds
 [Bench] citylots 16.282590906 seconds
 [Bench] twitter 0.043230788 seconds
 [Bench] Total (G.M.) 0.2819005229832309 seconds (2.69 times new Julia)

This change involves a rewrite of object, number, and array parsing code, which also makes things more correct. It removes all of the allowed failures for JSON checker.

Breakage:

The streaming parsers no longer keep track of what has been parsed, and so the error messages are now effectively worthless.

Closes #98, #118, #122, #127.

@TotalVerb (Collaborator, author)

A lot of the remaining time for the no-strings test is due to the type instability of parse_number. I am wondering whether it is worth it to simply return Float64 always from this method. JSON itself does not distinguish between integers and decimals, and JavaScript uses Float64 for everything.

I will try out some speculative typing for arrays, which might still give a small performance boost.
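The trade-off under discussion can be sketched as follows (a minimal illustration in modern Julia syntax; the function names are hypothetical, not from this PR):

```julia
# Type-unstable: the compiler infers Union{Int64, Float64}, so callers
# of this function must branch on the result type at runtime.
unstable_number(s::AbstractString) =
    occursin('.', s) ? parse(Float64, s) : parse(Int64, s)

# Type-stable alternative: always return Float64, as JavaScript does.
# Inference is exact, at the cost of not round-tripping large integers.
stable_number(s::AbstractString) = parse(Float64, s)

unstable_number("42")   # 42   (an Int64)
stable_number("42")     # 42.0 (a Float64)
```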

@kmsquire (Contributor) commented May 1, 2016

@TotalVerb, thanks for doing this! I'll note that this package sometimes doesn't get much attention, so if you find this PR languishing, please bump it once or twice (or more).

@nalimilan (Member)

> A lot of the remaining time for the no-strings test is due to the type instability of parse_number. I am wondering whether it is worth it to simply return Float64 always from this method. JSON itself does not distinguish between integers and decimals, and JavaScript uses Float64 for everything.

Makes sense. Type instability is generally a mistake in Julia anyways.

@TotalVerb (Collaborator, author)

This will need an update to support the latest master's String changes. I plan to work on that later tonight.

@TotalVerb (Collaborator, author) commented May 12, 2016

A major issue now is that nextind is very slow for the unified String type. Somehow the switch from ASCIIString to String led to over 50% performance regression on this branch. I have a few ideas for regaining some of the lost performance by avoiding nextind where unnecessary. (e.g. making String from String can be done by byte-copying instead of character-copying, exploiting the fact that ASCII bytes only occur in UTF8 if they are actually encoded ASCII codepoints.)

Master does not have nearly as hard a time dealing with the change to String (though it too suffers some). But in large part, that is because master uses + where nextind may be required.
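The byte-copying idea relies on a UTF-8 invariant: a byte below 0x80 is always a complete ASCII codepoint, never part of a multi-byte sequence. A rough sketch in modern Julia (the 2016 code would have used the s.data field rather than codeunits; byte_slice is a hypothetical name):

```julia
# Extract a substring by copying raw bytes. This is safe whenever the
# boundary indices i and j were found by scanning for ASCII delimiters,
# because ASCII bytes cannot occur inside a multi-byte UTF-8 character.
function byte_slice(s::String, i::Int, j::Int)
    b = codeunits(s)
    String(b[i:j])   # one byte copy; no per-character nextind walk
end

byte_slice("héllo world", 8, 12)  # "world" (é occupies bytes 2-3)
```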

@stevengj (Member)

cc @StefanKarpinski, who will be interested to hear of the String change impacts.

@TotalVerb (Collaborator, author) commented May 16, 2016

Actually, I think I was mistaken about the source of the performance regression. nextind probably isn't at fault. But the benchmarks are definitely slower; I just don't know why. I'll do some more profiling.

@TotalVerb (Collaborator, author) commented May 16, 2016

Here is a condensed benchmark that suffers from significant regression:

function f(s)
    n = 0
    m = start(s)
    e = endof(s)
    while m ≤ e
        n += Int(s[m])
        m = nextind(s, m)
    end
    n
end

Pre-stringapalooza, on UTF8String this is over 10 times as slow as on ASCIIString. String seems to have inherited UTF8String's speed.

A lot of the JSON code looks like this. (Except that the present version uses m += 1 instead of nextind(s, m) in several places, not all of which are places where m += 1 is valid.) So it is not surprising that this 10-fold regression turns into a 50% regression when working with ASCII data.
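For comparison, an ASCII-oriented rewrite of the benchmark above that iterates code units directly and avoids nextind entirely (a sketch in modern Julia; for pure-ASCII input it computes the same sum):

```julia
function f_bytes(s::AbstractString)
    n = 0
    # codeunits exposes the raw UTF-8 bytes; for ASCII data each byte
    # is a whole codepoint, so no index arithmetic is needed.
    for b in codeunits(s)
        n += Int(b)
    end
    n
end

f_bytes("abc")  # 294, the same as summing Int('a') + Int('b') + Int('c')
```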

@yuyichao (Contributor)

It's not surprising that the new String inherits UTF8String's speed since that's exactly what it is.

I've also seen a similar effect in https://groups.google.com/forum/?fromgroups=#!topic/julia-users/9zUDciJk878 where using nextind shows a 2x slowdown. I remember @StefanKarpinski was talking about more efficient iteration over string indices, and I was expecting it to come in one of the future related changes.

@TotalVerb (Collaborator, author)

I think the major issues (excepting compatibility with v0.3) are sorted out now. I'll see if reusing buffers still makes sense, and then I'll make sure all the code additions are actually having substantial impact.

@TotalVerb (Collaborator, author) commented May 22, 2016

This PR is nearing maturity. Overall, non-test code has grown by about 20 to 30 lines. (Some of the added lines are comments.) However, given that a very good chunk of the tests that failed before now pass, I think the slight increase in LOC is reasonable. The only functions which have seen their complexity increase are parse_string and parse_simple. I don't know if there's a way to keep parse_string as simple as it was before while maintaining correctness and efficiency, at least until nextind performance becomes much better. parse_simple on the other hand was behaving quite oddly earlier (accepting tXXe as true, which for some reason was even tested?!), and I think it is better to stick to the JSON standard at some small complexity cost.

The only downside to this change (as far as I can see) is that parse_number has less range on 64-bit machines, since it now always returns a Float64. (Before, it could parse 64-bit integers on 64-bit machines... but would also crash on those same integers on 32-bit machines, so it is debatable whether that behaviour was good.) It might be useful, in a later PR, to let the user supply a custom parse_number function. Especially now that function arguments can be specialized on, this would work similarly to how a dict type can currently be passed in.

v0.3 compatibility is back, but the timing is slightly worse than on newer versions. I don't think that is a big deal.
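One way the suggested extension point could look (purely illustrative; this is not an API the PR adds, and the names are hypothetical): the caller passes a number-parsing function, and the default uses Int64 explicitly so 32-bit builds don't crash on large integers:

```julia
# Default: try a 64-bit integer first, fall back to Float64.
function default_number(s::AbstractString)
    x = tryparse(Int64, s)
    x === nothing ? parse(Float64, s) : x
end

# Hypothetical entry point that lets callers swap the number parser,
# analogous to how a dict type can already be passed in.
parse_scalar(s::AbstractString; number_parser = default_number) = number_parser(s)

parse_scalar("10")                                          # 10
parse_scalar("10"; number_parser = s -> parse(Float64, s))  # 10.0
```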

@@ -296,8 +296,7 @@ twitter = "{\"results\":[
\"next_page\":\"?page=2&max_id=1480307926&q=%40twitterapi\",
\"completed_in\":0.031704,
\"page\":1,
\"query\":\"%40twitterapi\"}
}"
Contributor:

what was happening here?

@TotalVerb (Collaborator, author), May 22, 2016:

There was an extra close brace in the original test, so it was invalid JSON. This change makes parse more picky about that. I don't think it's a good idea to silently accept extra characters after the end, which was the original behaviour.

Contributor:

As an aside, using triple-quotes (""") would make this file much more readable. (I suspect it was originally created before triple-quotes were part of the language.)

Contributor:

> I suspect it was originally created before triple-quotes were part of the language

Indeed it was.

@TotalVerb (Collaborator, author)

I was wondering why the array and object parsing code seems slower recently. It turns out that a recent refactor (moving the burden of throwing the error to a helper function) has led to the construction of a bunch of temporary strings. This can be fixed by removing the second argument, but unfortunately the tailored error messages (like Expected a '{' here) will be gone too. A macro would work too, and preserve the tailored error messages, but that looks ugly for some reason...

@tkelman (Contributor) commented May 22, 2016

could delay constructing the string until the error gets displayed?
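A sketch of that suggestion (type and field names hypothetical): the exception stores only the raw context, and the message string is built in showerror, on the cold display path rather than the hot parsing path:

```julia
struct ExpectedCharError <: Exception
    expected::Char   # the structural character the parser wanted
    got::UInt8       # the byte actually found
    pos::Int         # byte offset in the input
end

function Base.showerror(io::IO, e::ExpectedCharError)
    # String construction happens only if the error is displayed.
    print(io, "Expected '", e.expected, "' at byte ", e.pos,
          ", got '", Char(e.got), "'")
end

sprint(showerror, ExpectedCharError('{', UInt8('x'), 7))
# "Expected '{' at byte 7, got 'x'"
```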

@codecov-io commented May 22, 2016

Current coverage is 97.01%

Merging #140 into master will increase coverage by 0.64%

@@             master       #140   diff @@
==========================================
  Files             2          3     +1   
  Lines           331        335     +4   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits            319        325     +6   
+ Misses           12         10     -2   
  Partials          0          0          

Powered by Codecov. Last updated by bec34de...dcab8a6


export parse

type ParserState{T<:AbstractString}
str::T
# FIXME: remove this when @static in Compat
Contributor:

better to submit this to Compat first

Collaborator (PR author):

OK, I'll send a PR.

@TotalVerb (Collaborator, author)

Should tryparse maybe also go in Compat.jl? Then we could be rid of almost all compatibility code.

@@ -11,7 +11,7 @@ notifications:
sudo: false
script:
- if [[ -a .git/shallow ]]; then git fetch --unshallow; fi
- julia -e 'Pkg.clone(pwd()); Pkg.build("JSON"); Pkg.test("JSON"; coverage=true)';
- julia -e 'Pkg.checkout("Compat"); Pkg.clone(pwd()); Pkg.build("JSON"); Pkg.test("JSON"; coverage=true)';
Contributor:

it won't be installed yet, you'd have to do this after cloning JSON from pwd

not a bad idea re: submitting tryparse to Compat, not sure why no one has asked for it yet but if it's useful here then it's bound to be useful somewhere else

Collaborator (PR author):

Whoops, thanks for the catch. I'll make a PR for tryparse shortly.

@TotalVerb (Collaborator, author), May 24, 2016:

It turns out tryparse is already in Compat! It's just the 3-argument form (with radix as third argument) that's missing, which is currently used in parsing unicode escapes. That form might be useful elsewhere too, so I'll submit a PR adding it.

@TotalVerb (Collaborator, author) commented May 24, 2016

I think the Int/Float64 change may be more breaking than I had expected. At least Transit.jl, and any package expecting Ints to round-trip, may be relying on that behaviour.

It's probably best to revert the type-stability change for now, until there's actually a way to specify the number parse function. Then the type-unstable parser can be deprecated and packages given time to adjust. (However, a good change would be to swap Int64 for Int, so that 32-bit builds do not crash.) This will offset some of the performance gains, but we would still be faster than before.

@kmsquire (Contributor)

> I think the Int/Float64 change may be more breaking than I had expected. At least Transit.jl and any package expecting Ints to roundtrip, may be relying on that behaviour.

Since it's not in the JSON standard, packages shouldn't really be relying on it. A deprecation warning would be appropriate.

(To be fair, I argued differently when originally introducing OrderedDict's...)

@TotalVerb (Collaborator, author) commented May 25, 2016

There is no real way to deprecate producing Ints as numbers without "deprecating" a whole lot of valid JSON. I think I'll run the tests for some packages that depend on JSON.jl, and see if too much breaks.

Also, Python's json module does have the int special-casing behaviour. It seems the only languages whose JSON modules don't are those that don't have real integer types (to be fair, that includes the JavaScript that JSON derives its name from). This is an argument against making this change.

@kmsquire (Contributor)

Not that we have to follow Python, but I wasn't aware that they converted to int. I'm more ok with us doing so as well then.

@TotalVerb (Collaborator, author) commented May 26, 2016

I can't reproduce the 0.3 Travis failure 😕.

Edit: never mind, I thought I had symlinked the package dir but apparently not.

@TotalVerb (Collaborator, author)

I restored the int-parsing behaviour. Performance has degraded a little bit, as expected. But overall it's not so bad, and for some reason it looks like compilation time is much lower. We also can get a bit of performance gain from using DataStructures.OrderedDict, if users are inclined to squeeze out every last drop of performance.

@TotalVerb (Collaborator, author) commented May 28, 2016

Great news! nextind is quite costly in many parts of the code, and specializing that for UTF8String has improved the geometric mean benchmark results on my machine to 0.17 (down from 0.20). This is even lower than before parse_number type stability was reverted.

This particular change might be applicable to Base too, as it's a 2-fold speed improvement over the current implementation of nextind.

@stevengj (Member) commented May 28, 2016

In this particular application, however, it's not clear to me why you need nextind at all. Since parsing JSON only involves looking for ASCII characters and matching substrings, you should be able to do nearly everything by operating on the raw s.data byte array, rather than working with Unicode chars.

Definitely submit a PR with the nextind improvements, though.
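The byte-level approach works because every structural JSON character (braces, brackets, colon, comma, quotes, whitespace) is ASCII. A minimal sketch, using modern Julia's codeunits in place of the s.data field of 2016 (the helper name is hypothetical):

```julia
# Advance past insignificant JSON whitespace by comparing raw bytes;
# no Char decoding and no nextind.
function skip_whitespace(b::AbstractVector{UInt8}, i::Int)
    while i <= length(b) &&
            (b[i] == UInt8(' ')  || b[i] == UInt8('\t') ||
             b[i] == UInt8('\n') || b[i] == UInt8('\r'))
        i += 1
    end
    i
end

skip_whitespace(codeunits("  \t{\"a\":1}"), 1)  # 4 — the index of the '{' byte
```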

@TotalVerb (Collaborator, author) commented May 28, 2016

@stevengj Yes, you're right. In fact some of the performance gain comes from avoiding nextind in string and number parsing. Unfortunately these improved versions somewhat duplicate the generic versions, which exist to provide a performance and memory advantage when parsing a stream, and also because I thought removing the generic versions, which were available (to a limited extent) before, would be somewhat of a regression.

I did consider avoiding it everywhere, but that currently makes the code somewhat messy (comparisons like == @compat UInt('x') everywhere), and would require a rethink of the streaming parser.

But to get more performance, that seems to be necessary. I'll see if I can figure something out.

@stevengj (Member)

(Can some of the byte-based parsing optimizations be folded back in to Base? e.g. number parsing of UTF-8 strings seems like it could always avoid nextind as long as only ASCII digits are allowed.)

@stevengj (Member)

I guess you could define const _x = @compat UInt('x') to simplify the code somewhat.
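Expanded slightly, that idea becomes a small table of named byte constants; LATIN_T, LATIN_N, and friends appearing in the diff are the form this took in the parser (the predicate below is only an illustration). On Julia versions where Compat is required, each right-hand side would be written @compat UInt8('t'), and so on:

```julia
# One-time conversions; the hot-loop comparisons then read naturally.
const LATIN_T = UInt8('t')   # start of "true"
const LATIN_F = UInt8('f')   # start of "false"
const LATIN_N = UInt8('n')   # start of "null"

# Hypothetical helper: does this byte begin a JSON literal?
is_literal_start(b::UInt8) = b == LATIN_T || b == LATIN_F || b == LATIN_N
```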

@TotalVerb (Collaborator, author) commented Jun 11, 2016

Getting there... did some more cleanup and squashed commits. I reorganized the directory structure and moved error message constants out of the functions themselves. Also updated the commit message.

I don't plan on making any further major changes, but I will of course address any remaining review comments.

@kmsquire (Contributor)

> I don't plan on making any further major changes, but I will of course address any remaining review comments.

If it's ready for review, it would be good to change [WIP] to [RFC] in the title.

@TotalVerb TotalVerb changed the title [WIP] Microoptimize parser [RFC] Microoptimize parser Jun 11, 2016
@TotalVerb (Collaborator, author)

Done.

@kmsquire (Contributor)

One general comment: unless we expect the code base to become much more complicated, I don't think it's necessary to have a src/Parser directory.

@kmsquire (Contributor)

Also, it's always seemed a bit odd to me that JSON printing is kept in JSON.jl, but the parser stuff was (originally) moved to parser.jl. I think it would be good to also move the printing code out of JSON.jl to a separate file that's just included in JSON.jl. (Having the main file just include a bunch of other files is a typical structure in many packages.)

_String(b)
end

function predict_string(ps::MemoryParserState)
Contributor:

Can you add a short comment here about what this function is doing? It's possible to figure out, but it wasn't obvious to me from the name.

Collaborator (PR author):

It's hard to make the comment short, since what the function does is a little strange. I added a docstring for this and the following function.

@kmsquire (Contributor)

Generally, looks quite good--thanks again for doing this! I think it should be mergeable after you address (or at least answer) the comments above. Some of that can wait until future PRs.

@TotalVerb (Collaborator, author) commented Jun 11, 2016

Thanks for the comments! I'll address them soon. As for the directory structure, this seems to have been the pattern that will be adopted by Base (JuliaLang/julia#16850). Of course, Base is much more complicated of a codebase too, so maybe we don't need this additional level of organization. I must be doing too much work with the JVM these days 😉.

The main JSON.jl file is a bit awkward because it is almost entirely printing... except for parsefile. Maybe this function should be moved to the Parser module? There was more stuff before when the streaming parser was here too, but that's gone now.

@TotalVerb (Collaborator, author)

All the issues except for the organizational ones should be addressed (or answered) now. I'm inclined to defer all the organizational issues to later, since they don't seem to be related to this PR.

@tkelman (Contributor) commented Jun 11, 2016

It would be a lot easier to see the actual changes being made in a relative sense if the file moves vs modifications were at least separate commits. Right now github shows the old file as completely deleted and the new files as created from scratch, but I don't think you rewrote every line, did you?

@TotalVerb (Collaborator, author)

Good point. I'll revert the file moves; those can come back if desired in the future.

@TotalVerb TotalVerb changed the title [RFC] Microoptimize parser [RFC] Optimize and fix issues with the Parser module Jun 11, 2016
@TotalVerb (Collaborator, author)

Of course, I think over the course of this PR most lines in Parser.jl were changed, so the diff isn't that much better.

false
elseif c == LATIN_N
skip!(ps, LATIN_U, LATIN_L, LATIN_L)
nothing
Contributor:

I think it would be good to add a comment to each condition specifying what it's testing for, e.g.

    if c == LATIN_T  #  "true"
        ...

It's decodable right now... but a comment can just be read.

Collaborator (PR author):

Agreed, that would be better. (skipped ci for comment only change)

This was referenced Jun 12, 2016
@kmsquire (Contributor)

LGTM. I think this can be merged--I'll do so tomorrow if there are no additional comments.

Thanks for all of your work here, @TotalVerb!

@TotalVerb (Collaborator, author)

This will probably need a squash and rebase first.

@TotalVerb TotalVerb changed the title [RFC] Optimize and fix issues with the Parser module Optimize and fix issues with the Parser module Jun 12, 2016
Overview of changes:

`parse_string`
 • now type stable
 • allocates much less when parsing in-memory data
 • fast path for strings without escape sequences

`parse_number`
 • now non-allocating for ints (still allocates for floats)
 • correctly rejects more kinds of invalid numbers

`parse_array`
 • slightly improved efficiency
 • correctly rejects invalid array formats

`parse_object`
 • substantially improved efficiency
 • default key type changed to a concrete type

`parse_simple`
 • no longer blindly accepts all words of the form t**e, f***e, or n**l
 • naming convention: instead of simples, code now refers to jsconstants

streaming parser
 • replaced the brittle streaming parser with a new kind of ParserState
 • much faster than the old streaming parser and less error-prone

miscellaneous
 • use `@inbounds` in more places for better performance
 • fix some possible unicode compatibility issues
 • add some docstrings in various places
 • factor out error messages
 • add file with constant definitions so `@compat` can be avoided
 • fix up some weird tests
 • delete code to allow certain failures (no more allowed failures)

known issues
 • streaming parser doesn’t keep state and so can’t offer informative errors
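The `parse_string` fast path listed above can be sketched roughly like this (hypothetical helper in modern Julia): scan bytes up to the closing quote, and only fall back to a slower unescaping pass if a backslash was seen:

```julia
# Scan the bytes of a string body starting at index i (just past the
# opening quote). Returns (has_escape, content); the escaped case would
# be handled by a separate unescaping routine, omitted in this sketch.
function scan_string(b::AbstractVector{UInt8}, i::Int)
    has_escape = false
    j = i
    while b[j] != UInt8('"')
        if b[j] == UInt8('\\')
            has_escape = true
            j += 2            # skip the escaped character
        else
            j += 1
        end
    end
    # Fast path: no escapes, so the content is a plain byte copy.
    has_escape, String(b[i:j-1])
end

scan_string(codeunits("hello\" rest"), 1)  # (false, "hello")
```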
@TotalVerb (Collaborator, author)

Squashed and rebased.

8 participants