Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize JsonNodeDeserialization wrt recursion #3397

Closed
cowtowncoder opened this issue Feb 13, 2022 · 10 comments
Closed

Optimize JsonNodeDeserialization wrt recursion #3397

cowtowncoder opened this issue Feb 13, 2022 · 10 comments
Labels
CVE Issues related to public CVEs (security vuln reports) performance Issue related to performance problems or enhancements
Milestone

Comments

@cowtowncoder
Copy link
Member

(note: cleaved off of #2816, used to be bundled)

Current implementation JsonNodeDeserialization is expensive for deeply nested Object and Array values as it uses recursion: so for each small additional nesting level -- for arrays, 2 bytes to encode [ and ] -- a new stack frame gets created.
In practical terms this means that it is possible to exhaust JVM heap usage with document that has nesting in order of ten thousand(s) levels, depending on settings.

It should be possible to replace basic recursion, however, with iteration, to at least significantly reduce amplification: to prevent cheapest potential DoS concerns.

@cowtowncoder cowtowncoder added to-evaluate Issue that has been received but not yet evaluated 2.13 CVE Issues related to public CVEs (security vuln reports) performance Issue related to performance problems or enhancements and removed to-evaluate Issue that has been received but not yet evaluated labels Feb 13, 2022
@cowtowncoder cowtowncoder added this to the 2.13.0 milestone Feb 13, 2022
@cowtowncoder
Copy link
Member Author

Note: implementation was merged from branch exp/iterative-jsonnode-2.13 (hash f93fd41028b6efcc7c41401dd271aa7d81da6cf3 ?), included in release 2.13.0 -- this issue added retroactively for tracking purposes.

@deniz-husaj
Copy link

Same issue occurs when using e.g. ObjectMapper.readTree(InputStream in) where in is a JSON String with plain opening and closing square brackets e.g. "[[[[[]]]]]". But with a depth of 50 million nested Arrays. It will take very long time until finishing deserialization or throwing an Exception (actually with 50 million I never reached the ending).

Looks like the recursion in BaseNodeDeserializer._deserializeContainerNoRecursion is the reason for that. Reproducable with jackson-databind=2.13.2.2.

I assume this is related to this issue or should I open a new bug issue for that? Will the provided fix cover that part as well?

@yawkat
Copy link
Member

yawkat commented Apr 4, 2022

@denizhusaj #3416 (which fixed the CVE #2816) specifically fixed the StackOverflowError that can be caused by deeply nested JSON. This is not an issue for JsonNode anymore because in 2.13.0 the JsonNode impl was changed to be iterative.

However even the iterative implementation can still take a while if you feed it 50m arrays. It is simply a lot of tokens :)

@deniz-husaj
Copy link

@yawkat So there are no plans to forbid deeply nested JSON Arrays?

But somehow I still get a StackOverflowError when having deeply nested JSON Objects with input streams like {"abc0": {"abc1": {"abc2": {"abc3": {"abc4": {}}}}}} but with a depth e.g. of 50.000. Is this expected? Same scenario like in my comment before.

@yawkat
Copy link
Member

yawkat commented Apr 4, 2022

@denizhusaj i cannot reproduce that issue. I tried a nested JsonNode object with 50000 levels like in your example. It parsed just fine.

Maybe your error comes from JsonNode.toString? That will still error, however that is more of a debugging method.

@deniz-husaj
Copy link

@yawkat yes sorry you are right, it comes from the toString()...

Method threw 'java.lang.StackOverflowError' exception. Cannot evaluate com.fasterxml.jackson.databind.node.ObjectNode.toString()

But regarding deep nested JSONArrays there will be no depth limit?

@yawkat
Copy link
Member

yawkat commented Apr 4, 2022

I can't say that, it's tatu's decision.

However I'm not convinced a depth limit would help with "long texts take a long time to parse" completely. In general you can also allocate a lot of objects without very deep json, e.g. [[[... 4000 levels...]],[[... 4000 levels...]],... thousands of repetitions...]. This will not run into the depth limits, but will still be fairly slow to parse (simply because there's lots of tokens).

There is one problem that is unique to deeply nested json in particular (as opposed to other ways of getting many tokens): This line limits the expansion of the ContainerStack to max 4000 elements, which means that you can get quite large allocations for every 4000 tokens. However there are always at most two of these arrays alive at a time, so it should not lead to overly large memory use, so it should not be a security risk. It does however reduce perf of parsing of that particular document.

@cowtowncoder
Copy link
Member Author

@denizhusaj Could you file a separate issue for ObjectNode.toString() please? That sounds like a side issue that sounds worth addressing.

As to plans: yes, there is a plan but it'd be via lower level streaming API:

FasterXML/jackson-core#637

since handling it for all distinct deserializers is more work, configurability, so this aspect (maximum input document size/complexity limits) seems better addresses with more general functionality.
That said, I have not started work here; and while conceptually simple, actual high-performance implementation is not trivial, and there's a bit of API work to consider as well (wrt how to pass such configuration limit settings).
API/performance comes into play when passing limits to parser/generator input/output stack implementations; there is no way to currently pass such info.

@deniz-husaj
Copy link

@cowtowncoder yes sure #3447

@cowtowncoder
Copy link
Member Author

Was included in 2.13.0.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CVE Issues related to public CVEs (security vuln reports) performance Issue related to performance problems or enhancements
Projects
None yet
Development

No branches or pull requests

3 participants