JsonSerializer.Deserialize can't be reliably used with Utf8JsonReaders that have been resumed #67454
Tagging subscribers to this area: @dotnet/area-system-text-json, @gregsdennis

Issue Details

Description

System.Text.Json.Utf8JsonReader is supposed to be resumable, so that you can read data in chunks, resuming the reader after each chunk. However, the JsonSerializer.Deserialize(ref Utf8JsonReader, ...) methods do not work reliably with a reader that has been resumed (at least in the most likely cases of objects and arrays - I haven't tested strings, numbers, etc.).

I believe this is because, internally, JsonSerializer.Read<T>(ref Utf8JsonReader, JsonTypeInfo) tries to copy the entire object or array into a separate buffer - including the initial '{' or '[' character, which has already been consumed by the reader - and then creates a new reader to deserialize from that buffer. In the case of a reader that has been resumed at the StartObject or StartArray token, the '{' or '[' character will not exist, because it has already been consumed from the input stream. Really, I don't think JsonSerializer should be trying to look backwards at data that has already been consumed from the input source. The reader is supposed to be forward-only.

Reproduction Steps

Here is one way to cause the exception:

public class Foo { public int b { get; set; } }
var bytes = System.Text.Encoding.ASCII.GetBytes("{\"a\":7}").AsSpan();
var jr = new Utf8JsonReader(bytes[..1], false, new JsonReaderState());
var read = jr.Read(); // move to StartObject. consume '{'
jr = new Utf8JsonReader(bytes[(int)jr.BytesConsumed..], true, jr.CurrentState); // BytesConsumed is 1
var foo = JsonSerializer.Deserialize<Foo>(ref jr); // this fails

There are many other, similar patterns that all fail.

Expected behavior

JsonSerializer should respect the forward-only nature of the reader and not try to look backwards at already-consumed data, so that it can work with readers that have been resumed. The resumability of Utf8JsonReader is a core part of its design. More specifically, the Deserialize methods that take a Utf8JsonReader should work if the reader is resumed at the start of the value to deserialize.

It would also be nice if it didn't try to copy the entire object or array - which can be quite large - into a separate buffer and then deserialize it using a separate reader. Can't it just use the reader I give it?

Actual behavior

It fails with an exception like the following:
Regression?

I doubt it.

Known Workarounds

Inserting an extra '{' or '[' byte into the reader's input stream (so there are effectively two such bytes) when resuming, and then using reflection to set Utf8JsonReader._consumed = 1, will allow deserialization to succeed. I don't consider this an actual workaround that I could use in real code, because it depends on reflection and internal implementation details, but it does avoid the exception. (Also, a Utf8JsonReader can't be boxed, so normal reflection methods don't work.)

public class Foo { public int b { get; set; } }
var bytes = System.Text.Encoding.ASCII.GetBytes("{\"a\":7}").AsSpan();
var jr = new Utf8JsonReader(bytes[..1], false, new JsonReaderState());
var read = jr.Read(); // move to StartObject. consume '{'
// put an extra '{' or '[' byte into the stream by subtracting one from BytesConsumed
jr = new Utf8JsonReader(bytes[((int)jr.BytesConsumed-1)..], true, jr.CurrentState);
jr._consumed = 1; // set this via reflection or the debugger
var foo = JsonSerializer.Deserialize<Foo>(ref jr); // this succeeds with the above hack

Configuration

.NET 6.0 on Windows 10 Enterprise x64

Other information

No response
This is by design --
I'm not referring to resuming deserialization or to resumable converters. Deserialization hasn't even begun yet. I'm referring to having a Utf8JsonReader positioned on a StartObject token, with all of the data for the object available, and then passing it to Deserialize. It shouldn't make a difference whether the reader was resumed there or navigated there via Read. This is not related to #63795.
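To make the contrast concrete, here is a minimal sketch (using the Foo type from the repro above) showing that the only difference between the two cases is whether the reader was resumed at StartObject; on affected runtimes such as .NET 6, the resumed variant is the one that throws:

```csharp
using System;
using System.Text;
using System.Text.Json;

public class Foo { public int b { get; set; } }

public static class ResumedReaderDemo
{
    // Returns (value from the non-resumed reader, whether the resumed reader threw).
    public static (int, bool) Run()
    {
        byte[] bytes = Encoding.UTF8.GetBytes("{\"b\":7}");

        // Case 1: reader navigated to StartObject via Read() over the full buffer.
        var reader = new Utf8JsonReader(bytes, isFinalBlock: true, state: default);
        reader.Read(); // positioned on StartObject; '{' is still in the reader's buffer
        int value = JsonSerializer.Deserialize<Foo>(ref reader)!.b;

        // Case 2: reader resumed exactly at StartObject; the '{' byte was consumed
        // from the previous chunk, so the serializer's backward-looking copy fails
        // on runtimes where this bug is present.
        var first = new Utf8JsonReader(bytes.AsSpan(0, 1), isFinalBlock: false, new JsonReaderState());
        first.Read(); // consumes '{'
        var resumed = new Utf8JsonReader(bytes.AsSpan((int)first.BytesConsumed),
                                         isFinalBlock: true, first.CurrentState);
        bool threw = false;
        try { JsonSerializer.Deserialize<Foo>(ref resumed); }
        catch (Exception) { threw = true; } // exact exception type elided in the report
        return (value, threw);
    }
}
```

Calling ResumedReaderDemo.Run() deserializes b == 7 in the first case; the second case's outcome depends on the runtime version.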
Your initial analysis on the root cause appears to be correct. It is not entirely clear to me what the historical reasons for copying the contents of the buffer might be here. It doesn't seem necessary for either checkpointing or reading ahead (which already happens separately in the converter layer). @bartonjs or @ahsonkhan might have more context.

Looking at our test coverage for the

Out of curiosity, what use case triggered this issue for you? Have you tried using any of the streaming deserialization APIs, in particular the
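One of the streaming APIs being referred to is presumably JsonSerializer.DeserializeAsyncEnumerable (it is named explicitly later in the thread), which streams elements of a root-level JSON array without buffering the whole array; a minimal sketch (the MyItem type is hypothetical):

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text.Json;
using System.Threading.Tasks;

public record MyItem(int Id);

public static class StreamingDemo
{
    // Streams one element at a time from a root-level JSON array,
    // deserializing each item as soon as enough data has arrived.
    public static async Task<List<int>> ReadIdsAsync(Stream stream)
    {
        var ids = new List<int>();
        await foreach (MyItem? item in JsonSerializer.DeserializeAsyncEnumerable<MyItem>(stream))
            ids.Add(item!.Id);
        return ids;
    }
}
```

For example, feeding it a MemoryStream over the bytes of [{"Id":1},{"Id":2}] yields the ids 1 and 2.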
Thank you for taking another look. I'm not sure what you mean by "re-entrant" readers. Does that just mean one that's been resumed? I think this case only affects readers that are resumed exactly at the StartObject or StartArray token, in which case the deserializer tries to look backwards one byte into the reader's internal buffer to capture the initial '{' or '[' delimiter. This byte has already been consumed, so it fails after resuming because it's no longer in the internal buffer.

One way to fix it without significant changes to the code would be to simply insert the corresponding '{' or '[' byte into the copy buffer and then copy the rest of the data from the reader, rather than trying to look backwards one byte. (Of course, my preference would be to avoid the TrySkip call and the copy into a separate buffer, and just deserialize the value using the reader I provide, but perhaps there's some reason you don't do that.)

I hit this issue when implementing a class to allow reading from a stream with the Utf8JsonReader interface. The Deserialize methods that take a stream seem to want to deserialize the entire stream into some object or value, but what I want is to be able to pick out individual objects from a stream and deserialize them one by one. For a simple example, let's say you have a stream like this:
If the stream is positioned at the {"a":1} object, the JsonSerializer.Deserialize method will throw an exception upon seeing the ',' character after it. Similarly, if it's positioned at the '[' character, the JsonSerializer.DeserializeAsyncEnumerable enumerator will throw an exception upon seeing the ',' character after the array. (And I can't simply swallow the exception, since it may throw before enumerating all the items. I guess it throws as soon as some kind of lookahead code sees the comma...)

Imagine that the {"a":1} object is much bigger and you have hundreds of thousands of them coming over a network stream. I'd prefer not to deserialize the whole stream into a giant object. I'd prefer to use a reader loop like:

while(reader.Read() && reader.TokenType != JsonTokenType.EndArray)
{
processItem(JsonSerializer.Deserialize<MyItem>(ref reader, ...));
}

In my case it looks like:

while(streamReader.Read(ref reader) && reader.TokenType != JsonTokenType.EndArray)
{
streamReader.EnsureComplete(ref reader);
processItem(JsonSerializer.Deserialize<MyItem>(ref reader, ...));
}

The EnsureComplete method makes sure all the data for the item the reader is currently positioned at is available to be read by the reader. This may require saving the reader state, reading more data, and resuming the reader based on the saved state. Since in my case the reader is practically always positioned on something like a StartObject token when I call EnsureComplete, it frequently gets resumed at the StartObject token and fails to deserialize.

The JsonSerializer.DeserializeAsyncEnumerable method is good for what it says: reading from root-level arrays. But in my case I want to read from an array nested inside a page object and still be able to read the "nextLink" property afterwards.

(I think it would be nice if the JsonSerializer methods just deserialized whatever thing the stream is pointing at, but I realize that's infeasible to implement in a way that has good performance in general. To avoid over-reading a non-seekable stream, you'd have to read one byte at a time, and that's no good.)
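When the whole page happens to fit in one buffer, the nested-array-plus-nextLink loop described above works against the existing APIs; a minimal sketch (the page shape with "value" and "nextLink" properties is an assumption based on the description, and MyItem is hypothetical):

```csharp
using System;
using System.Text;
using System.Text.Json;

public record MyItem(int a);

public static class PageDemo
{
    // Walks a fully buffered page object, deserializing each element of the
    // nested "value" array individually, then reads the trailing "nextLink".
    public static (int count, string? nextLink) ReadPage(ReadOnlySpan<byte> utf8Page)
    {
        var reader = new Utf8JsonReader(utf8Page, isFinalBlock: true, state: default);
        int count = 0;
        string? nextLink = null;

        while (reader.Read())
        {
            if (reader.TokenType != JsonTokenType.PropertyName) continue;
            if (reader.ValueTextEquals("value"))
            {
                reader.Read(); // move onto StartArray
                while (reader.Read() && reader.TokenType != JsonTokenType.EndArray)
                {
                    // Deserialize advances the reader past each element.
                    MyItem? item = JsonSerializer.Deserialize<MyItem>(ref reader);
                    count++; // stands in for processItem(item)
                }
            }
            else if (reader.ValueTextEquals("nextLink"))
            {
                reader.Read();
                nextLink = reader.GetString();
            }
        }
        return (count, nextLink);
    }
}
```

The hard part, as the comment explains, is precisely that a real stream does not arrive fully buffered, which is where the resumed-at-StartObject failure bites.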
In this context it means passing on the
Agreed. I'm guessing this logic was added for a reason, but that reason is unknown to me and may well be obsolete. It might be worth launching an investigation but at this point we are severely backlogged and can't prioritize it. Would you be interested in taking a look?
Ultimately this could be addressed by #63795. It hasn't been prototyped yet, but we'd be looking at exposing something like:

while(streamReader.Read(ref reader) && reader.TokenType != JsonTokenType.EndArray)
{
var state = new ReadState();
while (!JsonSerializer.TryDeserialize<MyItem>(ref reader, out MyItem? result, ref state, ...))
{
streamReader.Read(ref reader);
}
processItem(result);
}

Ultimately, this is how the streaming serialization methods are implemented under the covers; exposing such methods could allow more bespoke streaming applications.
Perhaps, although I'd first have to figure out how to get started debugging the standard library. Any pointers?
One thing I hope could be avoided is the way System.Text.Json code (both built-in and perhaps in your example above) may repeatedly scan the same region of the JSON data. For example, imagine a large JSON object that's a megabyte in size. Whether using a loop like the above or the existing streaming deserialization methods, what tends to happen is that JsonSerializer calls TrySkip to make sure it has enough data, and when that returns false it fails and effectively rolls back the reader. Then it, or the caller, reads more data into the buffer, and TrySkip is called again, rescanning the old data plus the new data. This happens repeatedly until it finally buffers the full megabyte, so it effectively scans multiple megabytes to read that 1 MB object. (Maybe your hypothetical ReadState variable could help with that by preserving extra state, depending on what's in it, but from looking at the current System.Text.Json code I think TrySkip is still called.)

Another somewhat more vexing (but slightly further off-topic) example is people storing large binary data in JSON (as base64). The Utf8JsonReader doesn't seem to have any way to read a large string in chunks. With a large object, you can generally make some progress with each read, but with a large string, you can't make any progress until you've buffered the entire string, and as before, the Utf8JsonReader.Read method will repeatedly rescan the entire buffer until it finally sees the whole string. And those large strings are generally stored in some object property, which is a frequent cause of the "large object" problem mentioned above.

Personally, I wish you would not call TrySkip, and just optimistically try to deserialize from the reader, rolling back the reader (as you already do) if it fails. This actually increases the cost of repeated attempts, though.
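The quadratic rescanning can be demonstrated directly with TrySkip; a minimal sketch in which a one-byte-at-a-time refill loop stands in for whatever grows the buffer:

```csharp
using System;
using System.Text;
using System.Text.Json;

public static class RescanDemo
{
    // Simulates buffering a JSON object one byte at a time and calling
    // TrySkip after each refill. Each attempt creates a fresh reader over
    // all buffered bytes and rescans them from the start, so the total
    // bytes scanned grow quadratically with the object size.
    public static long CountScannedBytes(byte[] utf8Object)
    {
        long scanned = 0;
        for (int len = 1; ; len++)
        {
            var reader = new Utf8JsonReader(utf8Object.AsSpan(0, len),
                                            isFinalBlock: false, state: default);
            if (reader.Read() && reader.TrySkip())
            {
                // TrySkip only succeeds once the whole object is buffered.
                scanned += reader.BytesConsumed;
                return scanned;
            }
            scanned += len; // this attempt scanned (roughly) the whole buffer and failed
        }
    }
}
```

For the 13-byte object {"a":[1,2,3]}, the failed attempts plus the final successful skip scan 91 bytes in total - already seven times the payload size, and the ratio grows with the object.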
Either way, given the design of Utf8JsonReader (which can't pull in more data by itself), I think the onus is on the caller (or the stream reader in this case) to have the intelligence to read the entire value in one go, and not just blindly read a chunk of bytes into the buffer and say "let's see if it works this time". This is not too difficult, but it nearly requires implementing your own JSON reader just to figure out how much data to give to the built-in Utf8JsonReader. It also gets more difficult with large strings (and large numbers, in theory), since unlike a large object - where you can at least detect the object from the StartObject token, and thus know that you should read to the end of it - the Utf8JsonReader won't give you a String token until it has the whole string in memory.

It seems like something the standard library could help with, since few people are likely to do the work and get it right, but many could benefit from efficient streaming methods built in. Rather than providing TryDeserialize methods, you could provide methods to help people ensure it'll work the first time. (Then again, your hypothetical ReadState type may avoid the duplicated work.) I recall somebody saying you were thinking about eventually adding a stream reader... Another possibility is allowing the Utf8JsonReader to take a delegate that it can call to read more data.

Either way, I'd hope more of the scanning work to enable effective streaming could be offloaded into the standard library. (I'm not just asking for myself, since I've already implemented all my asks locally, except fixes for bugs like this, which are outside my control.)
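For reference, the canonical pattern for the caller-driven refill being discussed - carrying JsonReaderState across buffer boundaries and keeping unconsumed bytes at the front of the next buffer - can be sketched as follows (the byte array stands in for a stream):

```csharp
using System;
using System.Text.Json;

public static class ChunkedReadDemo
{
    // Feeds JSON to Utf8JsonReader in fixed-size chunks, resuming the reader
    // after each refill via JsonReaderState. Unconsumed bytes (a partial
    // trailing token) are shifted to the front of the buffer before the
    // next chunk is appended.
    public static int CountTokens(byte[] utf8, int chunkSize)
    {
        var state = new JsonReaderState();
        var buffer = new byte[chunkSize + utf8.Length]; // worst-case leftover room
        int leftover = 0, offset = 0, tokens = 0;

        while (true)
        {
            int take = Math.Min(chunkSize, utf8.Length - offset);
            utf8.AsSpan(offset, take).CopyTo(buffer.AsSpan(leftover));
            offset += take;
            bool isFinal = offset == utf8.Length;

            var reader = new Utf8JsonReader(buffer.AsSpan(0, leftover + take), isFinal, state);
            while (reader.Read())
                tokens++; // Read() returns false, without consuming, on a partial token
            state = reader.CurrentState;

            if (isFinal) return tokens;

            // Shift unconsumed bytes to the front for the next refill.
            leftover = leftover + take - (int)reader.BytesConsumed;
            buffer.AsSpan((int)reader.BytesConsumed, leftover).CopyTo(buffer);
        }
    }
}
```

This pattern works because each chunk boundary falls between attempts at a single token; the pain point raised above is that nothing tells the caller how big a chunk the downstream Deserialize call will actually need.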
Sure, this guide should help you get started.