
',' is invalid after a single JSON value. Expected end of data. #30405

Closed
ericwj opened this issue Jul 28, 2019 · 30 comments

ericwj commented Jul 28, 2019

I am reading a single complex object from a FileStream using JsonSerializer.ReadAsync<T>(stream, options, token) and I seriously just want the one value, but I am getting this exception.

',' is invalid after a single JSON value. Expected end of data.

The JSON in the file is preceded by bytes which I have skipped or already processed and it is followed by more bytes which I can skip or process separately. However I get stuck by this exception and don't get the value that is presumably correctly read.

I have no idea how long the value is. Pre-parsing the JSON at a lower level to just get the byte length is a bad solution.

Wouldn't it be better to not throw in this situation and let the consumer of the API choose to check for end of stream?

Similar to #29947.

ericwj (Author) commented Jul 28, 2019

I have tried setting JsonSerializerOptions.AllowTrailingCommas but that didn't make a difference. If it would do what I expect it still wouldn't help if the JSON isn't followed by a comma.

@ahsonkhan ahsonkhan self-assigned this Jul 29, 2019
ahsonkhan (Member) commented Aug 19, 2019

@ericwj can you share your input JSON payload that is causing the issue and also the .NET object model you are deserializing to?

Also, what version of dotnet sdk are you using (share the output of dotnet --info)? Can you try with preview 9 bits (or latest nightly)?

Please try the nightly SDK/runtime with the fix to verify it works (https://github.com/dotnet/core-sdk#installers-and-binaries), or wait until preview 9 ships. Alternatively, you could reference the latest System.Text.Json NuGet package. For example:

  <ItemGroup>
    <PackageReference Include="System.Text.Json" Version="4.6.0-preview9.19413.13" />
    <PackageReference Include="System.Text.Encodings.Web" Version="4.6.0-preview9.19413.13" />
  </ItemGroup>

If your scenario is the following (and you want to read just the first JSON object), then that is not supported today. We only support reading a single JSON object. A file containing more than one payload isn't supported:
{ ... some json...}, { ... another json ... }

This is also how the underlying Utf8JsonReader behaves (which is where the exception you observed is coming from).

string json = "{},";
byte[] input = Encoding.UTF8.GetBytes(json);

var reader = new Utf8JsonReader(input,
    new JsonReaderOptions() { AllowTrailingCommas = true });

while (reader.Read())
    ;

// After EndObject this throws System.Text.Json.JsonReaderException:
// ',' is invalid after a single JSON value. Expected end of data. LineNumber: 0 | BytePositionInLine: 2

ericwj (Author) commented Aug 21, 2019

Yes I am trying to read in a streaming fashion, so there is JSON following other JSON. The files can become fairly big.

I think I might be able to use a Pipe and an Utf8JsonReader to do the job.

It is quite a lot of code overhead, though: working with SequencePosition is awkward, and JsonSerializer.Deserialize<T>(ref reader, ...) will throw exceptions as long as the buffer doesn't yet hold a complete value while I search for the next '}'. Wouldn't it be better if there were a bool TryDeserialize<T> anyway?
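For readers landing on this thread: when the whole value is already in memory, the reader-based overload sidesteps the exception entirely, because JsonSerializer.Deserialize<T>(ref Utf8JsonReader) consumes exactly one value and leaves the reader positioned after it. A minimal sketch (the payload and the Dictionary target are illustrative, not from the original report):

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.Json;

// Two JSON objects back to back, as in the original report.
byte[] buffer = Encoding.UTF8.GetBytes("{\"Id\":42},{\"Id\":43}");

// Deserialize(ref reader) consumes exactly one value; it does not
// demand end-of-data the way the Stream-based overloads do.
var reader = new Utf8JsonReader(buffer);
var first = JsonSerializer.Deserialize<Dictionary<string, int>>(ref reader);

// BytesConsumed marks the end of the first value; skip the ',' and
// read the next value with a fresh reader over the remaining bytes.
var rest = new Utf8JsonReader(buffer.AsSpan((int)reader.BytesConsumed + 1));
var second = JsonSerializer.Deserialize<Dictionary<string, int>>(ref rest);

Console.WriteLine($"{first!["Id"]}, {second!["Id"]}"); // 42, 43
```

This only helps when the buffer already contains a complete value; for data arriving incrementally from a Stream, the JsonReaderState continuation pattern discussed later in the thread is still needed.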

ahsonkhan (Member) commented:

Wouldn't it be better if there was a bool TryDeserialize<T> anyway?

What would you expect TryDeserialize to do/return in your scenario where you have multiple JSON payloads within the file and it fails to deserialize? The deserialize call today throws, so I don't see how returning false from a Try* API would help all that much. cc @steveharter

I think I might be able to use a Pipe and an Utf8JsonReader to do the job.
Although it is quite a lot of code overhead, working with SequencePosition is awkward

Yep, that's true. Let me think about the scenario you presented and get back to you (I want to see how much effort is required if a motivated dev wants to support this case via the low-level reader). If you already have a pipe/utf8jsonreader-based implementation that works, please share.

ericwj (Author) commented Aug 22, 2019

A quick and dirty implementation that works as long as the input is okay is here.

What would you expect TryDeserialize to do/return in your scenario where you have multiple json payloads within the file and it fails to deserialize?

I would like it to return false or some error status and not throw, because exceptions are wicked expensive. Something like this:

class JsonSerializer {
    public static bool TryDeserialize<T>(
        ref Utf8JsonReader reader,
        out T result,
        JsonSerializerOptions options);
}

The pipe reader code will search for [, { and their closing counterparts, where any unbalanced object tags will cause an exception calling JsonSerializer.Deserialize<T>. A naive way is just to try to deserialize for any } found, hoping it will match the first opening brace in the JSON remaining at that point, at the cost of a try...catch if the braces aren't balanced. Less naive might be to balance {'s with }'s, but that comes at the cost of more calls to ReadOnlySequence<byte>.PositionOf(byte) or yet higher code complexity and more awkward SequencePosition math and comparisons.

I don't immediately see why SequencePosition is needed at all in favor of plain ints or longs; couldn't the tracking of the _object field be avoided by just having overloads that take some integral offset? Or it would have to become something proper that is comparable and has suitable operators defined. But even then I don't see why I have to deal with them and drag that _object field along.

I seriously don't think using the Utf8JsonReader directly - that is, without calling into JsonSerializer - is beneficial except if performance is supercritical and you have to deserialize a lot of JSON. I don't think I might ever go that route except in a rare case when the schema is very, very limited and the volume to read/write justifies the effort.

ahsonkhan (Member) commented:

Given this is currently by-design, and requires some thinking/design work to support this feature, moving to 5.0.

@msftgits msftgits transferred this issue from dotnet/corefx Feb 1, 2020
@msftgits msftgits added this to the 5.0 milestone Feb 1, 2020
@layomia layomia modified the milestones: 5.0.0, Future Jun 20, 2020
logiclrd (Contributor) commented Dec 6, 2020

I ran into this issue myself and ended up working around it by making a custom Stream class that has some minimal understanding of JSON syntax and, provided the JSON is valid, can determine when it has reached the end of a JSON value.

One thing I ran into, though, is that in some cases the end of the JSON value is detectable only by reading a character that isn't part of the value. If Deserialize simply returned at that point, that character would be lost, since a generic Stream doesn't have a way to "push back" a character to allow a future call to reread it. My custom Stream takes this into account, storing this potential "pushed back" character. But this is a really messy solution, and I don't see a way to integrate partial reads of the source data into the existing model for Deserialize without running into this problem.

A thought that occurred to me is that perhaps JsonSerializer could present a method specifically for deserializing a JSON array into an IEnumerable or IAsyncEnumerable, yielding each item as it is deserialized. Lacking a full async deserialization infrastructure, each of those items would have to be fully deserialized in order to be returned (e.g. even if it contains a nested array, that nested array cannot itself be returned piecemeal), but that is its own extremely difficult problem to solve in a way that I believe doing this at the top level is not.

A method as proposed would basically need to itself recognize [, ] and ,, and then would need internally to have a Deserialize method that could return a "push back" character and also use one passed back in from a previous call, instead of requiring that the end of the value coincide with the end of the stream.

logiclrd (Contributor) commented Dec 6, 2020

Actually if it were only being used by the enumerate-array-elements method it wouldn't need to support receiving a pushback character, as the character in question would always be either , or ] and would be consumed by the outer method.

logiclrd (Contributor) commented Dec 6, 2020

In case it is not obvious, specifically, if the value being deserialized is a string, an object or an array, then the character that tells you you are at the end of the JSON value is a part of the value itself and no further character needs to be read. But, if it is a number, you only know that you're past the end of the number once you read the first character that isn't part of the number. Technically speaking, when reading true, false or null, the last letter of the word, with well-formed JSON, unambiguously marks the end of value, but if the value is read up until a clear "end of identifier", that end is also past the end of the JSON value.
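This boundary behavior can be observed directly with Utf8JsonReader: when told the buffer is not the final block, it refuses to commit a trailing number token, because a later buffer could extend the number. A small sketch:

```csharp
using System;
using System.Text;
using System.Text.Json;

byte[] digits = Encoding.UTF8.GetBytes("42");

// Partial input: the reader cannot know whether more digits follow,
// so Read() returns false rather than committing the number token.
var partial = new Utf8JsonReader(digits, isFinalBlock: false, new JsonReaderState());
bool committed = partial.Read();

// Final input: the end of data is what terminates the number.
var final = new Utf8JsonReader(digits, isFinalBlock: true, new JsonReaderState());
bool done = final.Read();

Console.WriteLine($"{committed} {done}"); // False True
```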

Clockwork-Muse (Contributor) commented:

@logiclrd - You're making things hard on yourself. Assuming you've got some sort of delegating stream which is buffering the contents, you don't bother to rewind the stream, you just don't move the offending characters into the buffer (or rather, don't mark them as part of the buffer).

  • Assume the current character may be invalid (ie, , followed immediately by ,).
  • Seek until either start-of-object ({) or start-of-array ([) is found.
    • Mark which it was.
    • have the relevant stream methods consider this "start of stream" (ie, CopyTo considers it index 0)
  • As the stream is consumed, increment a count for additional starts of the same type, and decrement for ends of the same type (you don't have to consider starts/ends of the other type).
    • Obviously, skip starts/ends inside of strings. This means you need to keep track of string starts/ends too.
    • When the count is 0, consider it end-of-stream.

This unfortunately iterates over the stream an extra time (but only once), but is simple to implement and doesn't require messing around with states of readers.
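The counting approach above can be sketched as a single scan over a byte buffer. This is an illustrative sketch only (the function name is made up, and it trusts the input to be well-formed JSON); it skips braces inside strings, including escaped quotes:

```csharp
using System;
using System.Text;

// Returns the exclusive end offset of the first top-level {...} or [...]
// value in the buffer, or -1 if the value is incomplete. Braces inside
// JSON strings (and escaped quotes inside those strings) are skipped.
static int FindValueEnd(ReadOnlySpan<byte> data)
{
    int start = data.IndexOfAny((byte)'{', (byte)'[');
    if (start < 0) return -1;
    byte open = data[start];
    byte close = open == (byte)'{' ? (byte)'}' : (byte)']';

    int depth = 0;
    bool inString = false, escaped = false;
    for (int i = start; i < data.Length; i++)
    {
        byte b = data[i];
        if (inString)
        {
            if (escaped) escaped = false;
            else if (b == (byte)'\\') escaped = true;
            else if (b == (byte)'"') inString = false;
        }
        else if (b == (byte)'"') inString = true;
        else if (b == open) depth++;
        else if (b == close && --depth == 0) return i + 1;
    }
    return -1;
}

byte[] input = Encoding.UTF8.GetBytes("{\"a\":\"}\"}{\"b\":2}");
Console.WriteLine(FindValueEnd(input)); // 9: the '}' inside the string is ignored
```

As noted in the bullets, only one bracket type needs counting: in well-formed JSON, any nested value of the other type is itself balanced before the outer value closes.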

logiclrd (Contributor) commented Dec 7, 2020

This unfortunately iterates over the stream an extra time (but only once)

CanSeek is not always true. This is not a general solution. In my specific case, the underlying stream is a TCP connection. The entire purpose of my implementation is to avoid buffering and yield elements of the array as they arrive, and I think that would be a very useful thing to have in general. :-)

Banyc (Contributor) commented Mar 14, 2021

I have had a similar issue. I solved it by deleting the file before writing the serialized JSON into it. If the file is not deleted first, System.Text.Json.JsonSerializer will overwrite the file from the beginning and leave the previous file's extra trailing data unchanged.

I hope my experience helps someone else.

logiclrd (Contributor) commented:

It's good to point this out in case people with corrupt data end up on this issue by searching for the error message, but I want to make it clear that this issue isn't about corrupt data, but about a design limitation in System.Text.Json that makes it impossible to intentionally decode a JSON value from the middle of a stream.

Note that there is another approach you can use: call stream.SetLength(stream.Position) after serializing to it but before closing the file.
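That truncation pattern looks roughly like this. A sketch, not from the thread (it assumes the synchronous JsonSerializer.Serialize(Stream, ...) overload available since .NET 6; the payloads are made up):

```csharp
using System;
using System.IO;
using System.Text.Json;

string path = Path.GetTempFileName();

// First write: a longer payload.
using (var stream = File.Open(path, FileMode.OpenOrCreate))
{
    JsonSerializer.Serialize(stream, new { Name = "a-much-longer-value" });
}

// Second write: a shorter payload over the same file. Without the
// SetLength call, stale bytes from the first write would remain after
// the new JSON, and a later Deserialize would fail with an
// "Expected end of data" style error.
using (var stream = File.Open(path, FileMode.OpenOrCreate))
{
    JsonSerializer.Serialize(stream, new { Name = "x" });
    stream.SetLength(stream.Position);
}

Console.WriteLine(File.ReadAllText(path)); // {"Name":"x"}
```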

TeddyAlbina commented:

I ran into a similar issue with my json:
{"StorageIdentifier":"aafb1460-8a30-49f4-89be-5ccf2651248d","Subscription":"c728de4b-07f3-488f-b9eb-58508281525b"}

I can't figure out what the problem is.

eiriktsarpalis (Member) commented:

Duplicate of #33030, specifically the discussion about SupportMultipleContent-like functionality.

logiclrd (Contributor) commented:

This does not seem like a duplicate of #33030. #33030 is about reading multiple independent JSON documents in succession from a stream, while this one is about reading a part of a single JSON document.

eiriktsarpalis (Member) commented:

Wouldn't a SupportMultipleContent-like feature serve to address both use cases though?

ericwj (Author) commented Oct 23, 2021

At the very least the other one is the duplicate - it is three quarters of a year newer than this issue.

Looks like about the same although I think SupportMultipleContent is misleading. Can you just make it stop parsing when it's done please?

eiriktsarpalis (Member) commented:

At the very least the other one is the duplicate - it is three quarters of a year newer than this issue.

I should point out that me closing this or the other issue is not assigning credit -- we just need to keep a focused backlog. It just happens that the other issue was triaged ahead of time and as such contains more recent information. I would recommend upvoting the other issue or contributing to the conversation there.

Looks like about the same although I think SupportMultipleContent is misleading. Can you just make it stop parsing when it's done please?

Why do you think it's misleading? The default behavior is to fail on trailing data and we're not changing that. We can discuss adding an optional flag for disabling this, just like Json.NET is doing.

ericwj (Author) commented Oct 23, 2021

I should point out that me closing this or the other issue is not assigning credit

That's fine.

Why do you think it's misleading? The default behavior is to fail on trailing data and we're not changing that.

I think the name SupportMultipleContent is. I totally understand most use cases need to fail for trailing bytes. My point is that I started reading at some random point in a stream and it didn't care about that either. It also shouldn't care what follows: JSON (which I read as 'multiple content'), a newline, an emoji, 0xdeadbeef, 0xcc, or an access violation.

eiriktsarpalis (Member) commented Oct 23, 2021

Sure, this is not an API proposal; it's merely pointing out the name of the feature in Json.NET.

logiclrd (Contributor) commented Oct 24, 2021

I still feel like these are fundamentally not the same issue.

The other issue is about allowing parsing in situations where the source data is not syntactically valid JSON, because it is smushing multiple documents together. This issue is about making the parsing of JSON arrays piecewise -- it's still a single, valid document, I just don't want to parse the whole thing in one call. But, I want the parsing state preserved so that I can go back to it and get the next piece.

These are not the same functionality.

EDIT: I thought I was commenting on a related issue that I had encountered. Rereading the top of this thread, @ericwj is not coming at this from quite the same place I am, and my issue was a comment further down the thread.

ericwj (Author) commented Oct 24, 2021

Just looking through how this works, I also think this issue should be reopened, over this request: TryDeserialize overloads.

Although neither the , nor the end-of-lines in #33030 are really a problem when reading from a span and the objects being read are known to be fairly small (and hence can be fully contained in the span, so that Utf8JsonReader will happily succeed calling JsonSerializer.Deserialize<T>(ref Utf8JsonReader)), things get very cumbersome when reading from a stream, as in this issue's original comment. Most of these scenarios get even more complicated when reading arbitrary objects asynchronously. And one reason I struggle to get anything custom implemented is that all I can do is try and catch, which stinks both for being smelly and for performing very badly for no reason.

In the former case, with small objects in spans or single read results, @logiclrd is also helped by the pair Utf8JsonReader.CurrentState and new Utf8JsonReader(Span/Sequence, bool, JsonReaderState). But this is seriously unworkable as a general solution for the lack of TryDeserialize on JsonSerializer, and it is also just a lot of code - which leads back to #33030. And I can imagine that whatever solution you create for that, someone will have special requirements, still want to hand-roll something, and hence stumble on the limited number of catches/second possible.
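The CurrentState round trip mentioned here looks roughly like this (a sketch over a fixed buffer split at an arbitrary point, standing in for two reads from a stream):

```csharp
using System;
using System.Text;
using System.Text.Json;

byte[] json = Encoding.UTF8.GetBytes("{\"A\":1,\"B\":2}");

// First chunk, not final: running out of data makes Read() return
// false instead of throwing, and BytesConsumed stops at the last
// complete token.
var first = new Utf8JsonReader(json.AsSpan(0, 6), isFinalBlock: false, new JsonReaderState());
while (first.Read()) { }
long consumed = first.BytesConsumed;

// Resume: hand CurrentState plus the unconsumed bytes to a new reader.
var second = new Utf8JsonReader(json.AsSpan((int)consumed), isFinalBlock: true, first.CurrentState);
while (second.Read()) { }

Console.WriteLine(consumed + second.BytesConsumed == json.Length); // True
```

With a real stream, "the unconsumed bytes" means copying the tail of the previous buffer in front of the newly read bytes before constructing the next reader, which is exactly the bookkeeping being complained about.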

eiriktsarpalis (Member) commented:

EDIT: I thought I was commenting on a related issue that I had encountered. Rereading the top of this thread, @ericwj is not coming at this from quite the same place I am, and my issue was a comment further down the thread.

@logiclrd is this presumably related to #30405 (comment)? I'm not sure I entirely understand the first part of the post, so I would recommend creating a new issue containing a reproduction of the behavior so we can make a recommendation. Regarding the second part, .NET 6 does ship with a DeserializeAsyncEnumerable method for streaming root-level JSON arrays.
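For reference, the .NET 6 DeserializeAsyncEnumerable usage looks like this (sketched over an in-memory stream rather than a real file or socket):

```csharp
using System;
using System.IO;
using System.Text;
using System.Text.Json;

// DeserializeAsyncEnumerable streams the elements of a root-level
// JSON array without buffering the whole document first.
using var stream = new MemoryStream(Encoding.UTF8.GetBytes("[1,2,3]"));

int sum = 0;
await foreach (int item in JsonSerializer.DeserializeAsyncEnumerable<int>(stream))
{
    sum += item; // each element arrives as soon as it is deserialized
}
Console.WriteLine(sum); // 6
```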

eiriktsarpalis (Member) commented:

@ericwj a related request was discussed in #54557 (comment). Per design guidelines Try- methods are only permitted to return false for one and only one failure mode. So a method that's equivalent to doing catch { return false; } would most likely not be accepted since it's collapsing a large number of failure modes.

logiclrd (Contributor) commented:

Mm, DeserializeAsyncEnumerable I believe does cover my usage. It only applies to arrays at the root, but that's precisely my situation.

ericwj (Author) commented Oct 27, 2021

@ericwj a related request was discussed (...) design guidelines (...)

This too is not an API proposal.

I'd be fine with error codes, preferably the most reasonable ones defined in an enum, while there is a whole series of errors for which it's fine to still throw in this case too - like out of memory, but also, say, invalid characters for the current JavaScript decoder.

Other than that, for the use case at the start of this issue, all that would be needed is a bool TryDeserialize<T> which returns false only in one or two specific cases, primarily when there is not enough data available.

The benefit of the former variant would be that much higher levels of customization can be achieved at very high performance. It would still need to be instructed to succeed without error if all that is wrong is that data follows the JSON, however long the valid part is.

steveharter (Member) commented Oct 27, 2021

Given the scenario where you just want a single value, I'd probably take the approach of passing in a JsonPath (e.g. "$.MyObject.MyPropertyToReturn") to a helper like:

static async Task<byte[]> GetJsonFromPath(Stream utf8Json, string jsonPath, JsonSerializerOptions options);
// Return the raw bytes to be deserialized in any way (serializer, node, element)

or a helper that directly deserializes those bytes using the serializer\node\element which will avoid an extra byte[] allocation:

static async Task<TValue> GetValueFromPath(Stream utf8Json, string jsonPath, JsonSerializerOptions options);
// Return the value from the serializer.

Similar JsonPath functionality was proposed with #31068 and also in #55827 where I said I'll provide a sample to do this with V6 code, however Stream+SingleValue wasn't mentioned (which complicates things).

Some options for Stream:

  • Drain: drain the Stream to the end, like the existing implementation for JsonDocument\JsonElement\Node. Not performant if there is a bunch of data past the requested JsonPath.
  • Single-use: read from the Stream only to the requested value (plus whatever the Stream reads to fill up the current smallish buffer) and then assume the Stream will be disposed. Since there will likely be some read but unprocessed bytes after the requested JSON value, it wouldn't be possible to do another ReadAsync on the Stream again to perform another Serialize\Deserialize since the unprocessed bytes are no longer known by the Stream.
  • Multiple-use: extend Single-use to support multiple Serialize\Deserialize by keeping track of the unprocessed bytes. One way to support that is to add a "JsonCursor" abstraction that can be used with new Serialize\Deserialize methods. JsonCursor would contain a reference to the Stream and maintain the reader and unprocessed bytes for multiple calls to Serialize\Deserialize.

All options would have minimal processing of the unneeded values -- e.g. values that are not the target JsonPath will not be deserialized.

I can work on the sample now if there's interest, but would like to know if "drain" is acceptable or not.

ericwj (Author) commented Oct 27, 2021

No thank you @steveharter, repeatedly parsing is unacceptable for the problems described in this issue and in #33030.

@steveharter steveharter self-assigned this Oct 28, 2021
steveharter (Member) commented:

OK I'll work on a prototype; since it involves async+Stream the prototype may need to change internal STJ code instead of just an add-on sample.

@ghost ghost locked as resolved and limited conversation to collaborators Dec 1, 2021