Enable Utf8JsonReader to read json from stream #30328

Closed
richlander opened this issue Jul 22, 2019 · 13 comments

Comments

@richlander
Member

Basically, implement an analogue of JsonTextReader(TextReader).

My scenario is reading the result of: docker inspect [image] (which produces a json document), either called via Process.Start() or piped in via standard input. Both scenarios result in TextReader objects. I’d like to see either a new constructor to enable this scenario or some straightforward collaborative mechanism between the two readers.

Related: https://github.com/dotnet/corefx/issues/38581

@davidfowl
Member

Maybe but not really since it's a ref struct, you'd need to make sure all of the data was in the stream before you parse, which kinda defeats the purpose of a Stream. We'd need to make the reader a class so that you could store the stream as a field or we'd need some new class that used the reader and a stream together (like the JsonSerializer).

Why not use the JsonSerializer to deserialize directly into a JsonElement?

Or copy this logic https://github.com/dotnet/corefx/blob/347412c9a917c71a744d8e20b090da90aa558a79/src/System.Text.Json/src/System/Text/Json/Serialization/JsonSerializer.Read.Stream.cs#L75-L226 😄
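
A minimal sketch of the JsonElement suggestion, assuming the input is already available as a Stream of UTF-8 bytes (a TextReader would first have to be adapted). The class name, method names, and sample payload are hypothetical. Note that the resulting JsonElement still holds the whole document in memory.

```csharp
using System.IO;
using System.Text;
using System.Text.Json;
using System.Threading.Tasks;

public static class StreamToElement
{
    // Reads the stream asynchronously and returns the document root.
    public static async Task<JsonElement> ReadRootAsync(Stream utf8Json)
    {
        return await JsonSerializer.DeserializeAsync<JsonElement>(utf8Json);
    }

    // Hypothetical demo: a tiny payload shaped like an array of objects.
    public static async Task<string> DemoAsync()
    {
        byte[] bytes = Encoding.UTF8.GetBytes("[{\"Id\":\"sha256:abc\"}]");
        using var stream = new MemoryStream(bytes);
        JsonElement root = await ReadRootAsync(stream);
        return root[0].GetProperty("Id").GetString();
    }
}
```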

@bartonjs
Member

> Maybe but not really since it's a ref struct, you'd need to make sure all of the data was in the stream before you parse, which kinda defeats the purpose of a Stream.

A ref struct can hold a normal reference just fine, so it could theoretically work. IMO it would be bad, though, because the ValueSpan (or ValueSequence) properties would be returning Spans to buffers the user never owned, making their lifetime ambiguous (at best).

Certainly we could make a Stream-based wrapper to do the buffer management, which inverts the flow:

public class Utf8JsonStreamReader
{
    ...
    public JsonTokenType TokenType { get; }
    public int TokenStartIndex { get; }
    public int TokenLength { get; }
    public void CopyTokenValue(Span<byte> destination);
    public void Read();
    ...
}

But that seems awkward.
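
To illustrate the lifetime concern, here is a small sketch (hypothetical class and method names) showing that, for a single contiguous input, ValueSpan points into the caller's buffer rather than into a copy. If the reader owned an internal stream buffer instead, the same span would point into memory the caller never controlled.

```csharp
using System;
using System.Text;
using System.Text.Json;

public static class ValueSpanLifetime
{
    public static bool ValueSpanAliasesInput()
    {
        byte[] buffer = Encoding.UTF8.GetBytes("\"hello\"");
        var reader = new Utf8JsonReader(buffer);
        reader.Read(); // positions the reader on the String token

        ReadOnlySpan<byte> value = reader.ValueSpan; // the bytes of hello, without quotes
        buffer[1] = (byte)'j';                       // overwrite the caller-owned buffer

        // The span observes the change because it is a slice of buffer.
        return value[0] == (byte)'j';
    }
}
```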

@stephentoub
Member

stephentoub commented Jul 22, 2019

> A ref struct can hold a normal reference just fine, so it could theoretically work

But it can't have async operations (Rich didn't mention that in his description, but I expect David was assuming that as a necessity).

@richlander
Member Author

I'd be happy w/o the wrapper (avoiding lifetime and async challenges), and for the ability to provide the json reader with document lines, one at a time. Ideally (for my scenario), I could give the reader an IEnumerable<Span<Byte>> but that isn't possible for other reasons.
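
For what it's worth, feeding the reader one chunk at a time can be sketched with JsonReaderState and the isFinalBlock flag. This is only an illustration under assumptions: the chunks arrive as byte arrays (for example, encoded lines), the helper name is hypothetical, and unconsumed bytes are carried over by plain copying rather than with a pooled buffer.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.Json;

public static class ChunkedReading
{
    // Counts tokens in a UTF-8 JSON document that arrives as separate chunks.
    public static int CountTokens(IEnumerable<byte[]> chunks)
    {
        var state = new JsonReaderState();
        byte[] leftover = Array.Empty<byte>();
        int tokens = 0;

        foreach (byte[] chunk in chunks)
        {
            // Prepend the bytes the previous pass could not consume.
            byte[] data = new byte[leftover.Length + chunk.Length];
            leftover.CopyTo(data, 0);
            chunk.CopyTo(data, leftover.Length);

            var reader = new Utf8JsonReader(data, isFinalBlock: false, state);
            while (reader.Read())
            {
                tokens++;
            }

            // Carry the parser state and any partial token to the next pass.
            state = reader.CurrentState;
            leftover = data.AsSpan((int)reader.BytesConsumed).ToArray();
        }

        // The final pass tells the reader that no more data follows.
        var last = new Utf8JsonReader(leftover, isFinalBlock: true, state);
        while (last.Read())
        {
            tokens++;
        }
        return tokens;
    }
}
```

The key design point is that each Utf8JsonReader instance is short-lived; only the JsonReaderState and the unconsumed bytes survive between chunks.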

@dazinator

dazinator commented Oct 24, 2019

I am looking to read JSON from a file stream, but I'm struggling to understand how to do that.
I found this Stack Overflow question in which someone has written a complicated wrapper. Is this really necessary? https://stackoverflow.com/questions/54983533/parsing-a-json-file-with-net-core-3-0-system-text-json

My use case is that I want to open a JSON file, navigate to a particular section of it, then deserialize just that section. I thought it best to use Utf8JsonReader for this so that I can read through the stream, skipping irrelevant tokens until I get to the relevant section of JSON, then process just that section and close the file stream, without having to load the whole file into memory or read any more information than is strictly necessary.
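
A rough sketch of the "skip to one section" part of this use case, assuming the bytes are already in memory as a single span (so it does not by itself avoid loading the file). The class name, method name, and the top-level-property lookup are hypothetical simplifications.

```csharp
using System;
using System.Text;
using System.Text.Json;

public static class SectionReader
{
    // Returns the value of one top-level property, skipping every other value.
    public static T ReadSection<T>(ReadOnlySpan<byte> utf8Json, string propertyName)
    {
        var reader = new Utf8JsonReader(utf8Json);
        if (!reader.Read() || reader.TokenType != JsonTokenType.StartObject)
        {
            throw new JsonException("Expected a JSON object at the root.");
        }

        while (reader.Read() && reader.TokenType == JsonTokenType.PropertyName)
        {
            bool match = reader.ValueTextEquals(propertyName);
            reader.Read(); // move to the property's value
            if (match)
            {
                // Deserialize only this value; the rest of the input is never parsed.
                return JsonSerializer.Deserialize<T>(ref reader);
            }
            reader.Skip(); // jumps over a whole object or array; does nothing for primitives
        }

        throw new JsonException($"Property '{propertyName}' was not found.");
    }
}
```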

@msftgits msftgits transferred this issue from dotnet/corefx Feb 1, 2020
@msftgits msftgits added this to the 5.0 milestone Feb 1, 2020
@om-ha

om-ha commented Feb 18, 2020

I hope this gets fixed.

@dazinator, in the comments on the same answer you linked, somebody found a few bugs in it and fixed them in this repo.

@alexandrvslv

I came here to figure out DeserializeAsync errors:

']' is an invalid start of a property name. Expected a '"'. LineNumber: XXXXX | BytePositionInLine: XX.

'0x0D' is invalid within a JSON string. The string should be correctly escaped. Path: $[5] | LineNumber: XXXXX | BytePositionInLine: XX.

Or similar, depending on the order of items in the JSON array of the source stream.

I also have a custom JsonConverter which can process the full item type with referencing properties.

Reason: the Utf8JsonReader buffer size is not enough to cover one item in the array.

Solution: increase JsonSerializerOptions.DefaultBufferSize to cover the item size (in my case, to 128 KB).
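
For reference, the workaround described here might look like the following sketch. The class and member names are hypothetical; only DefaultBufferSize and DeserializeAsync come from System.Text.Json.

```csharp
using System.IO;
using System.Text.Json;
using System.Threading.Tasks;

public static class LargeBufferOptions
{
    // Reuse one options instance; 128 KB replaces the 16 KB default buffer size.
    public static readonly JsonSerializerOptions Options = new JsonSerializerOptions
    {
        DefaultBufferSize = 128 * 1024
    };

    // Deserializes from a stream using the enlarged buffer.
    public static ValueTask<T> ReadAsync<T>(Stream utf8Json) =>
        JsonSerializer.DeserializeAsync<T>(utf8Json, Options);
}
```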

@peterduniho

I get the point above that Utf8JsonReader may not be the best place for deserializing from a stream. But given that the docs for DataContractJsonSerializer direct people to this namespace, there should be some entry point in the System.Text.Json namespace that handles this.

Especially since Utf8JsonWriter does support Stream as a data sink. It's weird that the API is not symmetric.
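
The asymmetry is easy to see in a short sketch of the writer side (class name, method name, and property names are made up for illustration): Utf8JsonWriter accepts a Stream directly.

```csharp
using System.IO;
using System.Text;
using System.Text.Json;

public static class WriterToStream
{
    public static string WriteSample()
    {
        using var stream = new MemoryStream();
        using (var writer = new Utf8JsonWriter(stream))
        {
            writer.WriteStartObject();
            writer.WriteString("image", "alpine");
            writer.WriteNumber("layers", 3);
            writer.WriteEndObject();
        } // disposing the writer flushes the buffered bytes to the stream

        return Encoding.UTF8.GetString(stream.ToArray());
    }
}
```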

@sergeyt

sergeyt commented Jul 13, 2020

I came up with the following code. It reads a JSON dataset as an IAsyncEnumerable sequence of Dictionary<string, object> records:

using System;
using System.Collections.Generic;
using System.IO;
using System.Runtime.CompilerServices;
using System.Text.Json;
using System.Threading;
using Opw.HttpExceptions;

namespace YourApp
{
    using Record = IEnumerable<KeyValuePair<string, object>>;
    
    public static class JsonArrayReader
    {
        public static IAsyncEnumerable<Record> ReadJsonRecords(this Stream input, CancellationToken cancellationToken)
        {
            bool isArrayStart = true;
            return Parse(input, cancellationToken, (ref Utf8JsonReader reader) =>
            {
                if (isArrayStart)
                {
                    ReadArrayStart(ref reader);
                    isArrayStart = false;
                }
                return ReadRecords(ref reader);
            });
        }

        private delegate IEnumerable<T> Parser<T>(ref Utf8JsonReader reader);

        // inspired by https://github.com/scalablecory/system-text-json-samples/blob/master/json-test/JsonParser.ParseSimpleAsync.cs
        private static async IAsyncEnumerable<T> Parse<T>(Stream input, [EnumeratorCancellation] CancellationToken cancellationToken, Parser<T> parser)
        {
            var buffer = new byte[4096];
            var fill = 0;
            var consumed = 0;
            var done = false;
            var readerState = new JsonReaderState();

            while (!done)
            {
                if (fill == buffer.Length)
                {
                    if (consumed != 0)
                    {
                        buffer.AsSpan(consumed).CopyTo(buffer);
                        fill -= consumed;
                        consumed = 0;
                    }
                    else
                    {
                        Array.Resize(ref buffer, buffer.Length * 3 / 2);
                    }
                }

                int read = await input.ReadAsync(buffer.AsMemory(fill), cancellationToken).ConfigureAwait(false);

                fill += read;
                done = read == 0;

                foreach (var item in ParseBuffer())
                {
                    yield return item;
                }
            }
            
            IEnumerable<T> ParseBuffer()
            {
                var reader = new Utf8JsonReader(buffer.AsSpan(consumed, fill - consumed), done, readerState);
                var result = parser(ref reader);
                consumed += (int)reader.BytesConsumed;
                readerState = reader.CurrentState;
                return result;
            }
        }

        private static void ReadArrayStart(ref Utf8JsonReader reader)
        {
            if (!reader.Read())
            {
                throw new BadRequestException("Unexpected EOF");
            }

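            // Skip leading comments. Read() advances past a comment token; Skip() does not.
            // Comment tokens only appear when the reader options allow comments.
            while (reader.TokenType == JsonTokenType.Comment)
            {
                if (!reader.Read())
                {
                    throw new BadRequestException("Unexpected EOF");
                }
            }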

            if (reader.TokenType != JsonTokenType.StartArray)
            {
                throw new BadRequestException($"Expect JSON array, but got {reader.TokenType}");
            }
        }

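        private static IEnumerable<Record> ReadRecords(ref Utf8JsonReader reader)
        {
            var records = new List<Record>();
            while (true)
            {
                // Remember the position so a partially buffered record can be retried.
                var checkpoint = reader;

                if (!reader.Read())
                {
                    // Either the array has ended or the buffer needs more data.
                    if (reader.TokenType == JsonTokenType.EndArray || !reader.IsFinalBlock)
                    {
                        break;
                    }
                    throw new BadRequestException("Unexpected EOF");
                }

                if (reader.TokenType == JsonTokenType.EndArray)
                {
                    break;
                }
                
                if (reader.TokenType != JsonTokenType.StartObject)
                {
                    throw new BadRequestException($"Expect {JsonTokenType.StartObject}, but got {reader.TokenType}");
                }
                
                var record = ReadRecord(ref reader);
                if (record == null)
                {
                    // The record is not fully buffered yet: rewind to before its
                    // '{' so that the next pass re-reads it from the start.
                    reader = checkpoint;
                    break;
                }
                records.Add(record);
            }

            return records;
        }

        private static Record ReadRecord(ref Utf8JsonReader reader)
        {
            try
            {
                var savePoint = reader;
                var result = JsonSerializer.Deserialize<Dictionary<string, object>>(ref savePoint);
                reader = savePoint;
                return result;
            }
            catch (JsonException) when (!reader.IsFinalBlock)
            {
                // Most likely the record is cut off at the end of the buffer;
                // the caller retries once more data has arrived. On the final
                // block the exception propagates instead of being swallowed.
                return null;
            }
        }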
    }
}

It reuses the idea from https://github.com/scalablecory/system-text-json-samples/blob/master/json-test/JsonParser.ParseSimpleAsync.cs.

Also, if you target .NET Core 3.0+, you might have to implement a JsonConverter for Dictionary<string, object> because of #1573.

@ahsonkhan
Member

> I came here to figure out DeserializeAsync errors:
>
> ']' is an invalid start of a property name. Expected a '"'. LineNumber: XXXXX | BytePositionInLine: XX.
>
> '0x0D' is invalid within a JSON string. The string should be correctly escaped. Path: $[5] | LineNumber: XXXXX | BytePositionInLine: XX.
>
> Or similar, depending on the order of items in the JSON array of the source stream.
>
> I also have a custom JsonConverter which can process the full item type with referencing properties.
>
> Reason: the Utf8JsonReader buffer size is not enough to cover one item in the array.
>
> Solution: increase JsonSerializerOptions.DefaultBufferSize to cover the item size (in my case, to 128 KB).

@alexandrvslv, can you please file a separate issue with a simplified repro test app of the issue you were seeing? At first glance, this seems like a bug, and you shouldn't need to increase the DefaultBufferSize to fix it. It would be good for us to understand what your JSON payload looks like, your custom JsonConverter implementation, and the root cause of the issue you were seeing (also include your TFM or STJ package version).

@alexandrvslv

> @alexandrvslv, can you please file a separate issue with a simplified repro test app of the issue you were seeing? At first glance, this seems like a bug, and you shouldn't need to increase the DefaultBufferSize to fix it. It would be good for us to understand what your JSON payload looks like, your custom JsonConverter implementation, and the root cause of the issue you were seeing (also include your TFM or STJ package version).

@ahsonkhan, I will try to reproduce the issue. It may take some time to implement a test that serializes to JSON with a custom formatter, processes the data through a pipe, and deserializes it with a custom parser.
I have made a lot of changes since February 2020; even the model has been simplified and the transfer split into sub-requests.

@eiriktsarpalis
Member

To add to @bartonjs and @stephentoub's initial points, it is currently possible to snapshot a Utf8JsonReader by creating a copy, and there is a lot of JsonConverter code out there relying on that behavior. Adding support for stream sources would break that invariant.

I think this would have been a different conversation if Utf8JsonReader was a class or interface, but it is what it is. The recommendation is to use the JsonSerializer async methods for async deserialization.
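
As an illustration of the snapshot behavior mentioned above, here is a small sketch (hypothetical helper names) of the copy-to-peek pattern that converter code commonly relies on. It works because the reader is a struct over an in-memory buffer; a reader that pulled from a stream could not be rewound this way.

```csharp
using System.Text;
using System.Text.Json;

public static class ReaderSnapshot
{
    // Peeks at the next token type without advancing the caller's reader.
    public static JsonTokenType PeekNext(ref Utf8JsonReader reader)
    {
        Utf8JsonReader copy = reader; // struct copy: a snapshot of position and state
        return copy.Read() ? copy.TokenType : JsonTokenType.None;
    }

    public static bool Demo()
    {
        var reader = new Utf8JsonReader(Encoding.UTF8.GetBytes("[42]"));
        reader.Read(); // the reader is now on StartArray
        JsonTokenType next = PeekNext(ref reader);

        // The copy moved on to the number; the original stayed on StartArray.
        return next == JsonTokenType.Number && reader.TokenType == JsonTokenType.StartArray;
    }
}
```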

@steveharter
Member

See also #30405 where there is a similar request to start deserialization from a particular point in a Stream.

@ghost ghost locked as resolved and limited conversation to collaborators Nov 27, 2021