Make parsing of strings optional #12

jonhoo · 2017-11-19T19:28:16Z

This fixes #5 as proposed in that issue. Specifically, it introduces a trait, FromByteResponse, and provides a generic parse_response function that parses input as raw byte strings, and then maps them using some implementation of that trait. This in turns allows users to choose how they want to parse byte sequences in an ergonomic way.

Under the hood, this is slightly less efficient that it could be. Specifically, it parses everything as &[u8] first, and then maps everything using calls to FromByteResponse. This works fine, but will cause unnecessary re-allocation of vectors inside a bunch of structs. The way to work around this is to have nom directly use a generic return value in all its parsers, but this unfortunately doesn't seem to be supported at the time of writing.

jonhoo · 2017-11-19T19:28:22Z

cc @sanmai-NL

djc · 2017-11-19T19:31:04Z

So can you describe a motivating use case or example where this is needed?

jonhoo · 2017-11-19T19:35:18Z

I think @sanmai-NL is a better person to ask than me, but apparently this can happen frequently with Exchange servers. For some motivation, mattnenterprise/rust-imap#11 has several users wanting this, namely @chshawkn-pub, @vandenoever, and @dario23. @sanmai-NL has also been a strong proponent of not requiring server UTF-8 support (e.g., outlook.com doesn't support it: mattnenterprise/rust-imap#11 (comment)).

jonhoo · 2017-11-19T19:35:47Z

Also, as a side-note, it'd be nice to run rustfmt on the codebase. I had to constantly remember to undo the formatting changes my editor wanted to introduce :)

djc · 2017-11-19T20:08:57Z

@jonhoo so I'm very opinionated on code formatting, and I think rustfmt at least with its default settings makes the code harder to read in many places. It also doesn't seem to deal particularly well with the formatting of the macro invocations in the parser (which is maybe not so surprising). ac960b0 applies some of rustfmts choices that I do agree with. I'd recommend you just disable automatic reformatting in your IDE for this project, I'll also see if I can configure rustfmt to come up with something more tasteful (but not sure how to get around the macro problem).

sanmai-NL · 2017-11-19T21:58:19Z

Perhaps this was a misunderstanding. The old IMAP Response parser in rust-imap failed in some cases, where it tried to decode bytes as UTF-8. First of all, we need to find out whether these issues still occur with the original reporters after mattnenterprise/rust-imap#58 is merged.

If not, then we may want to trace those original issues. Reading through various RFCs, however, I think that UTF-8 may not appear anywhere outside of message bodies when the capability to handle UTF-8 isn't advertised. I conclude that from the applicable RFC 6855 (note that an obsolete version of it was discussed earlier). Moreover, I cannot say UTF-8-invalid byte sequences may appear at all. I've tested creating folders with UTF-8 characters, messages with UTF-8 envelope senders, subjects and bodies on an Exchange as well as Dovecot IMAP server using Mozilla Thunderbird, but in all cases these strings are encoded in an ASCII compatible way. The only potential issue I see is that NULL bytes aren't allowed and such. The strings that appear in IMAP protocol messages are rather a restricted subset of ASCII. We should test the imap-proto parser to ensure it rejects e.g. UTF-8 bytes that are in the complement of the valid bytes for IMAP. I still have only had time to look at all the code and RFCs haphazardly, keep that in mind, but I presume we could put in a bit of extra work to enhance the imap-proto parser to implement the complete grammar down to particular character classes, starting at RFC 4466.

This fixes djc#5 as proposed in that issue. Specifically, it introduces a trait, `FromByteResponse`, and provides a generic `parse_response` function that parses input as raw byte strings, and then maps them using some implementation of that trait. This in turns allows users to choose how they want to parse byte sequences in an ergonomic way. Under the hood, this is *slightly* less efficient that it could be. Specifically, it parses everything as `&[u8]` first, and then maps everything using calls to `FromByteResponse`. This works fine, but will cause unnecessary re-allocation of vectors inside a bunch of structs. The way to work around this is to have `nom` directly use a generic return value in all its parsers, but this unfortunately doesn't seem to be supported at the time of writing.

jonhoo · 2017-11-19T22:31:06Z

Interesting... Well, if we never need to worry about std::str::from_utf8 failing, then I'm all in favor of keeping the code as it is (that is, returning &str everywhere). There are two advantages of the approach taken in this PR though:

The user can choose to parse out to owned String values directly, and all the way down, for some messages, if they so wish.
We can propagate decoding errors directly up to the user. This would require changing the return type to of FromByteResponse and the various map_bytes methods to return Result, but avoids unwrap inside the library itself.

Those are somewhat orthogonal though, and may not be worth the extra code.

djc · 2017-11-21T10:42:42Z

Yeah, so based on all this I don't think this is the right approach at this time. It feels to complex, even more so given that we don't really have a solid use case for this yet. Jon, sorry you spent time on something that won't be merged in the short term! I hope you did get something out of it.

As for the unwrap() calls, I think the right approach there is to use the map_res!() macro from nom to convert those to parser errors, instead of unwrapping them in place. Could probably even do a nice macro that takes care of the parse-error-converting std::str::from_utf8() in a concise way. If either of you wants to take that one, that might be a nicely manageable project.

sanmai-NL · 2017-11-21T11:15:42Z

@djc: Rushing to a solution would be a bit premature. Not to say this was definitely rushing, just that I can understand how you would perceive it @djc. I hope to help @jonhoo in getting to the bottom of the UTF-8 decoding issue that was reported. Closing the PR may be also a bit premature in that sense, perhaps my latest analysis that there's shouldn't be UTF-8 invalid byte sequences anywhere outside of e-mails is wrong again.

@djc: I'd like to tackle the error handling part! See #13.

djc · 2017-11-21T11:44:13Z

Reopening the PR will be cheap, and I am more than happy to do so if it turns out I'm totally wrong about all this!

jonhoo · 2017-11-21T15:44:26Z

Given that I personally have no use-case for non-UTF8 servers, this all seems fine by me. I mostly just did this because I want to close mattnenterprise/rust-imap#54 :p
And don't worry about the lost work. I knew going into it that it might not pan out.

jonhoo force-pushed the non-utf branch from 7bb5dbe to 8c944c1 Compare November 19, 2017 22:28

djc closed this Nov 21, 2017

sanmai-NL mentioned this pull request Nov 21, 2017

Error handling improvement #13

Open

jonhoo mentioned this pull request Nov 21, 2017

No support for non-UTF8 parsing #5

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Make parsing of strings optional #12

Make parsing of strings optional #12

Uh oh!

jonhoo commented Nov 19, 2017

Uh oh!

jonhoo commented Nov 19, 2017

Uh oh!

djc commented Nov 19, 2017

Uh oh!

jonhoo commented Nov 19, 2017

Uh oh!

jonhoo commented Nov 19, 2017

Uh oh!

djc commented Nov 19, 2017

Uh oh!

sanmai-NL commented Nov 19, 2017 •

edited

Loading

Uh oh!

jonhoo commented Nov 19, 2017

Uh oh!

djc commented Nov 21, 2017

Uh oh!

sanmai-NL commented Nov 21, 2017 •

edited

Loading

Uh oh!

djc commented Nov 21, 2017

Uh oh!

jonhoo commented Nov 21, 2017

Uh oh!

Uh oh!

Make parsing of strings optional #12

Make parsing of strings optional #12

Uh oh!

Conversation

jonhoo commented Nov 19, 2017

Uh oh!

jonhoo commented Nov 19, 2017

Uh oh!

djc commented Nov 19, 2017

Uh oh!

jonhoo commented Nov 19, 2017

Uh oh!

jonhoo commented Nov 19, 2017

Uh oh!

djc commented Nov 19, 2017

Uh oh!

sanmai-NL commented Nov 19, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jonhoo commented Nov 19, 2017

Uh oh!

djc commented Nov 21, 2017

Uh oh!

sanmai-NL commented Nov 21, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

djc commented Nov 21, 2017

Uh oh!

jonhoo commented Nov 21, 2017

Uh oh!

Uh oh!

sanmai-NL commented Nov 19, 2017 •

edited

Loading

sanmai-NL commented Nov 21, 2017 •

edited

Loading