Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add an API to perform streaming transcoding #30260

Closed
pranavkm opened this issue Jul 15, 2019 · 16 comments · Fixed by #35145
Closed

Add an API to perform streaming transcoding #30260

pranavkm opened this issue Jul 15, 2019 · 16 comments · Fixed by #35145
Assignees
Labels
api-approved API was approved in API review, it can be implemented area-System.Text.Encoding blocking Marks issues that we want to fast track in order to unblock other important work
Milestone

Comments

@pranavkm
Copy link
Contributor

(Edit by @GrabYourPitchforks: proposed API is in a comment at https://github.com/dotnet/corefx/issues/39483#issuecomment-557737022.)

System.Text.Json specifically supports UTF8 inputs and outputs. MVC needs to support arbitrary user configured encoding schemes. To support this in a performant way, MVC has transcoding Stream implementations that minimally buffer inputs and outputs to the serializer.

Most of the work in these implementations is to perform state management when calling Utf8.FromUtf16 \ Encoders * Decoders. It would be great if these APIs were instead available as part of CoreFx.

@pranavkm
Copy link
Contributor Author

/cc @GrabYourPitchforks

@GrabYourPitchforks
Copy link
Member

For context, I spoke with Pranav about this a few days ago. The Framework currently has a static method System.Text.Encoding.Convert that can perform byte[] <-> byte[] transcoding, but it's a one-shot method that maintains no internal state. It assumes that the input is complete and self-consistent, which means that it can't be used in scenarios like ASP.NET's where they're reading over a Stream in a loop.

The request here is that we have an API akin to System.Text.Encoding.Convert but which also maintains internal state, similar to how the Encoder and Decoder classes already work today. This would make it appropriate for use in looping / streaming scenarios.

@GrabYourPitchforks
Copy link
Member

@pranavkm Can you share a sample of the code that you want to be able to write? That would help inform the shape of the API that is missing.

@rynowak
Copy link
Member

rynowak commented Nov 16, 2019

Right now this is basically what it looks like....

Again, we're happy if you want to try and improve this with the knowledge that we're optimizing for performance over simplicity. The task we need to perform is to use the serializer with arbitrary encodings.

Ultimately we don't get any value from managing these streams ourselves - if this were an implementation detail of the JSON serializer that would suit our needs even better.

Reading

var httpContext = null; // assume we have the httpContext
var encoding = null; // assume we already decided on an Encoding

Stream stream;
bool disposeStream;
if (encoding.CodePage == Encoding.UTF8.CodePage)
{
    stream = httpContext.Request.Body;
    disposeStream = false;
}
else
{
    stream = new TranscodingReadStream(httpContext.Request.Body, encoding, disposeInner: false);
    disposeStream = true;
}

object result;
try
{
    result = await JsonSerializer.DeserializeAsync(stream, someType, SerializerOptions);
}
catch (JsonException)
{
    // Do error handling xD
}
finally
{
    if (disposeStream)
    {
        await stream.DisposeAsync();
    }
}

Writing

var httpContext = null; // assume we have the httpContext
var encoding = null; // assume we already decided on an Encoding

Stream stream;
bool wrappedStream;
if (encoding.CodePage == Encoding.UTF8.CodePage)
{
    stream = httpContext.Response.Body;
    wrappedStream = false;
}
else
{
    stream = new TranscodingWriteStream(httpContext.Response.Body, selectedEncoding, disposeInner: false);
    wrappedStream = true;
}

try
{
    await JsonSerializer.SerializeAsync(stream, someObject, someType, SerializerOptions);

    // The transcoding streams use Encoders and Decoders that have internal buffers. We need to flush these
    // when there is no more data to be written. Stream.FlushAsync isn't suitable since it's
    // acceptable to Flush a Stream (multiple times) prior to completion.
    if (wrappedStream)
    {
        await ((TranscodingWriteStream)stream).FinalWriteAsync(CancellationToken.None);
    }

    await stream.FlushAsync();
}
finally
{
    if (wrappedStream)
    {
        await stream.DisposeAsync();
    }
}

@GrabYourPitchforks
Copy link
Member

Are there particular Encodings you need this to be optimized for? UTF-8 on one end, obviously, but what's on the other side? For example, certain encodings like ISO 8859-1 map pretty cleanly to UTF-8 with only minimal transformation required, so if we knew what the common scenarios were we could optimize those code paths.

@rynowak
Copy link
Member

rynowak commented Nov 18, 2019

I'm not certain we care deeply about the performance. Out of the box MVC is configured to do UTF8 and UTF16 for historical reasons, but IMO if you're not doing UTF8 on the web, you are doing the weird. No one has ever asked us to add something else to the defaults.

We need this feature just because it would be genuinely surprising if we didn't have it.

@GrabYourPitchforks
Copy link
Member

GrabYourPitchforks commented Nov 22, 2019

So seems like something very straightforward would work then:

namespace System.Text    // perhaps should be System.IO?
{
    // NEW type being proposed - sealed by default (see detailed comments below)
    public sealed class TranscodingStream : Stream
    {
        public TranscodingStream(Stream innerStream, Encoding outerEncoding, Encoding innerEncoding, bool leaveOpen = false);

        /* all virtual methods defined by Stream are overridden */
    }
}

The new proposed class is sealed because I haven't yet thought through what extensibility points we'd need to add, and unsealing in the future is a non-breaking change.

When calling TranscodingStream.Write, we expect the outer byte data to be encoded using outerEncoding. We'll transcode it to innerEncoding, then write it to innerStream.

When calling TranscodingStream.Read, we expect the inner byte data to be encoded using innerEncoding. We'll transcode it to outerEncoding and return the transcoded data from TranscodingStream.Read.

When disposing the TranscodingStream instance, we'll flush to innerStream any buffered data we're holding on to. We'll then dispose innerStream only if leaveOpen = false.

@rynowak
Copy link
Member

rynowak commented Nov 23, 2019

This sounds like it meets our requirements pretty well!

@rynowak
Copy link
Member

rynowak commented Nov 23, 2019

@pranavkm - any thoughts?

@pranavkm
Copy link
Contributor Author

This sounds pretty close to what we have in MVC (with more bells and whistles). Just to have this in writing, we'll have async versions of the Read and Write APIs, won't we?

@GrabYourPitchforks
Copy link
Member

Yes, the plan is to override all virtual methods, including async methods. I'll clarify that in the API description.

@rynowak
Copy link
Member

rynowak commented Jan 15, 2020

@GrabYourPitchforks - what do we need to do to make progress on this? We want to build more functionality in ASP.NET Core that would use this, and I'd really love not to duplicate the code we're using in MVC today.

@GrabYourPitchforks
Copy link
Member

@rynowak I'll mark it partner blocking so that it gets prioritized during triage.

@GrabYourPitchforks GrabYourPitchforks self-assigned this Jan 15, 2020
@msftgits msftgits transferred this issue from dotnet/corefx Feb 1, 2020
@msftgits msftgits added this to the 5.0 milestone Feb 1, 2020
@terrajobst terrajobst added api-approved API was approved in API review, it can be implemented and removed api-ready-for-review labels Feb 4, 2020
@terrajobst
Copy link
Contributor

terrajobst commented Feb 4, 2020

Video

  • We'll likely need this ourselves for the HttpClient extension methods for JSON, so putting it in the BCL seems more reasonable than an ASP.NET/networking specific helper
  • Instead of exposing a concrete type, we'd prefer a factory because we'll likely want to have dedicated implementations for specific encodings in order to speed things up.
  • Putting the factory on Encoding seems most sensible because that's the array based helpers are too
namespace System.Text
{
    public partial class Encoding
    {
        public static Stream CreateTranscodingStream(Stream innerStream, Encoding innerStreamEncoding, Encoding outerStreamEncoding, bool leaveOpen = false);
    }
}

@maryamariyan maryamariyan added the untriaged New issue has not been triaged by the area owner label Feb 23, 2020
@maryamariyan maryamariyan removed the untriaged New issue has not been triaged by the area owner label Mar 3, 2020
@scalablecory
Copy link
Contributor

@GrabYourPitchforks I'd love a version of this API that checks for byte order marks; I know there are a few places in code we check for these. Something like:

public static Stream CreateTranscodingStreamFromBOM(Stream innerStream, Encoding fallbackInnerStreamEncoding, Encoding outerStreamEncoding, bool leaveOpen = false);

@GrabYourPitchforks
Copy link
Member

@scalablecory If you don't mind go ahead and open a new issue tracking the BOM request. Since this issue is now closed I don't want your feedback to get lost!

@ghost ghost locked as resolved and limited conversation to collaborators Dec 12, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
api-approved API was approved in API review, it can be implemented area-System.Text.Encoding blocking Marks issues that we want to fast track in order to unblock other important work
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants