
Add support for binary data (#12) #15

Merged
jonnyreeves merged 8 commits into master from feature/byte-arrays on Aug 26, 2016
Conversation

jonnyreeves
Owner


* Request implementations now supply a `Uint8Array` of binary data through to the `chunkParser`.
* `defaultChunkParser` converts the `chunkBytes` into a string for processing
* Overhauled how parserState is handled; no longer assume that the `chunkParser` will emit a string; instead we use a `state` object whose contract is wholly owned by the parser (we just shuffle it around internally).
* Added test-coverage for UTF-8 characters
* Added dependency on `utf-8` lib for handling (un)marshalling of binary data to text.
* Updated documentation
* Updated npm dependencies.

Inspiration provided by @mwitkow and @ariutta -- many thanks! :)
@ariutta
Contributor

ariutta commented Aug 19, 2016

Looks good! Only question is about xhr.js. It doesn't convert the string to an ArrayBuffer. Is it best to make the parsers responsible for handling two input types -- string or ArrayBuffer?

* Update README to convey which polyfills are required for crappy browsers.
Would appear that IE10 has typedarray support :)
@jonnyreeves
Owner Author

@ariutta thanks so much for your input on #17; I've adopted your approach of using TextEncoder / TextDecoder to do the marshalling and dropped the dependency on utf-8 (can't remember why I went down that route - bits and bytes aren't my forte...)

I've left the polyfills for TextEncoder, TextDecoder and TypedArray out but added a note in the README that users must ensure certain browser features have been polyfilled correctly (this keeps the library nice and slim for newer browsers).

Let me know your thoughts; if it looks good to you I'll land this (and add you to the CONTRIBUTORS).

Thanks again.

ariutta added a commit to ariutta/chunked-request that referenced this pull request Aug 20, 2016
@ariutta
Contributor

ariutta commented Aug 20, 2016

@jonnyreeves: sure, keeping the polyfills out of the default bundle is usually a best practice, so that's fine. It's easy but maybe less clean to exclude them with this syntax in the browser field:

"browser": {
    "typedarray": "./empty-object.js"
}

About the TextDecoder, I notice uint8ArrayToString instantiates a new instance of TextDecoder for every chunk.

TextDecoder has a stream option:

A Boolean flag indicating that additional data will follow in subsequent calls to decode(). Set to true if processing the data in chunks, and false for the final chunk or if the data is not chunked. It defaults to false.

By not using the stream option and by instantiating a new instance for each chunk, I'm concerned we could run into trouble. One example: the WHATWG Encoding standard says decoders (but not encoders) should be able to handle UTF-16. In UTF-16, code points can be encoded with two 16-bit code units. Our chunk could split a code point, and the parser would fail. A more common concern is just a possible performance hit.

Here's how I handled this case:
https://github.com/ariutta/chunked-request/blob/4b07bf6b576e0c17dd0f6e5895d87903968139fc/src/defaultChunkParser.js#L16

Maybe we should also use the stream option for TextEncoder.

One more thing: the defaultParser seemed to have trouble with this sample .ndjson file (maybe a good candidate for another test?):
https://cdn.rawgit.com/maxogden/sqlite-search/master/test.ndjson

I made several changes to defaultParser.js so that it could handle that file, as you can see here.

@jonnyreeves
Owner Author

...instantiates a new instance of TextDecoder for every chunk... I'm concerned we could run into trouble.

Makes me wonder if we want more control over the lifecycle of the chunkParser; at the moment you just supply a function which has no idea if it's the first call or last call for a given server response.

Perhaps instead of supplying a Uint8Array to the chunkParser we should instead supply a stream which the chunk parser can consume until there are no more bytes left, i.e.:

// use `onChunk` to emit parsed data from the stream.
function chunkParser(stream, onChunk) {
    let state = {};
    stream.on('data', bytes => { 
        // ...
        onChunk(jsonObjects); 
    });
    stream.on('end', () => { 
        onChunk(flushedState);
    });
}

Now the (default) chunkParser can instantiate the TextDecoder once per response (and we don't have to mess around managing the parser's state inside index.js anymore, as an added bonus).
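Under that contract the default parser could own its decoder for the whole response. A hypothetical sketch for newline-delimited JSON (names and behaviour are illustrative, not the shipped implementation):

```javascript
// Hypothetical defaultChunkParser under the proposed stream contract:
// one TextDecoder per response, flushed when the stream ends.
function defaultChunkParser(stream, onChunk) {
  const decoder = new TextDecoder('utf-8');
  let buffer = '';
  stream.on('data', bytes => {
    buffer += decoder.decode(bytes, { stream: true });
    const lines = buffer.split('\n');
    buffer = lines.pop();                         // keep any partial trailing line
    const complete = lines.filter(l => l.length > 0);
    if (complete.length) onChunk(complete.map(l => JSON.parse(l)));
  });
  stream.on('end', () => {
    buffer += decoder.decode();                   // flush any buffered bytes
    if (buffer.length > 0) onChunk([JSON.parse(buffer)]);
  });
}
```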

Let me know your thoughts - and thanks again.

@ariutta
Contributor

ariutta commented Aug 21, 2016

I like the idea of passing a minimal stream to the parser. It separates concerns nicely. Only question is whether to base it on Node streams (eventemitter with data and end events) or WHATWG streams (reader with an async read method). I thought we'd previously agreed to use WHATWG streams because that's what fetch will return and is what the Node community is actually telling WHATWG to use instead of Node streams, but I don't have a strong preference for one versus the other. Here's the relevant comparison between the two:

  • Node stream data event maps to WHATWG stream reader.read(function({ value: theChunk, done: false }) {...})
  • Node stream end event maps to WHATWG stream reader.read(function({ value: undefined, done: true }) {...})
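The mapping above can be sketched as a small read loop against the WHATWG reader API (`onData`/`onEnd` are hypothetical callbacks standing in for the parser):

```javascript
// Pump a WHATWG stream reader, mirroring Node's 'data' and 'end' events.
function pump(reader, onData, onEnd) {
  return reader.read().then(({ value, done }) => {
    if (done) {
      onEnd();          // Node 'end'
      return;
    }
    onData(value);      // Node 'data'
    return pump(reader, onData, onEnd);
  });
}
```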

For the parser, some of the ideas you're raising about handling state and transforms remind me of transducers. I'm interested in using transducers for the parsers because it appears they may be the fastest way to transform data when it is sometimes returned synchronously and other times asynchronously. I also like that transducers "are composable algorithmic transformations. They are independent from the context of their input and output sources and specify only the essence of the transformation in terms of an individual element. Because transducers are decoupled from input or output sources, they can be used in many different processes," such as native or Underscore reduce implementations, immutable-js, Node streams, RxJS, etc (source). I've already created transducers for quite a few different parsers here.
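As a rough illustration of why transducers compose independently of their source (a minimal hand-rolled sketch, not the API of any transducer library mentioned here):

```javascript
// A transducer is a reducer transformer: (nextReducer) => reducer.
// It knows nothing about where values come from or where results go.
const map = f => next => (acc, x) => next(acc, f(x));
const filter = pred => next => (acc, x) => (pred(x) ? next(acc, x) : acc);

// Compose: parse each line as JSON, then drop nulls (hypothetical pipeline).
const xform = next => map(line => JSON.parse(line))(filter(x => x !== null)(next));

// The same xform could drive a Node stream, RxJS, etc.; here, plain reduce:
const push = (acc, x) => { acc.push(x); return acc; };
const result = ['1', 'null', '2'].reduce(xform(push), []); // [1, 2]
```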

What do you think about using WHATWG streams and transducers?

@ariutta
Contributor

ariutta commented Aug 22, 2016

In my fork, I created a new branch feature/reader to show the use of WHATWG streams, passing a reader to the parser. I didn't use transducers, but that might be an interesting next step.

This required creating a minimal fetch polyfill for the XHR implementations, as you can see here and here.

ariutta and others added 3 commits August 22, 2016 08:33
Browsers using the fallback `xhr` transport should expect an additional call to their `chunkParser` as we flush out the state of the `TextEncoder` when the XHR connection is closed.
@jonnyreeves
Owner Author

Thanks @ariutta; I've started incorporating the changes from your feature/reader branch into this Pull Request; I'll start pulling across the ReadableStream shims later today / tomorrow - time permitting.

It's obvious now that we should look to de-dupe xhr.js and mozXhr.js; however I feel that's best achieved in a separate PR after this one lands. Additionally, your change to the defaultParser's delimiter is valid; but I think it highlights the need to make it user-configurable (so the user doesn't have to reimplement the default parser just because they have a different delimiter) - again probably best in a subsequent PR.

Thank you so much for your contributions - not only are you helping shape this library, but you are helping me think about the problem in a new way - I think I owe you a 🍻 :)

@ariutta
Contributor

ariutta commented Aug 22, 2016

Beoir! 👍 Glad to collaborate -- it's tricky trying to cover the appropriate use cases without over-engineering.

Agree on segmenting the changes into multiple PRs.

@jonnyreeves jonnyreeves added this to the 0.5 milestone Aug 26, 2016
@jonnyreeves jonnyreeves merged commit 251c178 into master Aug 26, 2016
@jonnyreeves
Owner Author

@ariutta; will follow-up with integrating your streams work in a subsequent PR as this one was getting quite chunky and it would be good to get some of the recent improvements out the door first :)

@jonnyreeves jonnyreeves mentioned this pull request Aug 26, 2016
@ariutta
Contributor

ariutta commented Aug 27, 2016

Good plan.


@jonnyreeves jonnyreeves deleted the feature/byte-arrays branch August 29, 2016 07:32