Skip to content
This repository was archived by the owner on Jul 22, 2019. It is now read-only.

Make parsing of strings optional #12

Closed
wants to merge 1 commit into from
Closed

Conversation

jonhoo
Copy link

@jonhoo jonhoo commented Nov 19, 2017

This fixes #5 as proposed in that issue. Specifically, it introduces a trait, FromByteResponse, and provides a generic parse_response function that parses input as raw byte strings, and then maps them using some implementation of that trait. This in turns allows users to choose how they want to parse byte sequences in an ergonomic way.

Under the hood, this is slightly less efficient that it could be. Specifically, it parses everything as &[u8] first, and then maps everything using calls to FromByteResponse. This works fine, but will cause unnecessary re-allocation of vectors inside a bunch of structs. The way to work around this is to have nom directly use a generic return value in all its parsers, but this unfortunately doesn't seem to be supported at the time of writing.

@jonhoo
Copy link
Author

jonhoo commented Nov 19, 2017

cc @sanmai-NL

@djc
Copy link
Owner

djc commented Nov 19, 2017

So can you describe a motivating use case or example where this is needed?

@jonhoo
Copy link
Author

jonhoo commented Nov 19, 2017

I think @sanmai-NL is a better person to ask than me, but apparently this can happen frequently with Exchange servers. For some motivation, mattnenterprise/rust-imap#11 has several users wanting this, namely @chshawkn-pub, @vandenoever, and @dario23. @sanmai-NL has also been a strong proponent of not requiring server UTF-8 support (e.g., outlook.com doesn't support it: mattnenterprise/rust-imap#11 (comment)).

@jonhoo
Copy link
Author

jonhoo commented Nov 19, 2017

Also, as a side-note, it'd be nice to run rustfmt on the codebase. I had to constantly remember to undo the formatting changes my editor wanted to introduce :)

@djc
Copy link
Owner

djc commented Nov 19, 2017

@jonhoo so I'm very opinionated on code formatting, and I think rustfmt at least with its default settings makes the code harder to read in many places. It also doesn't seem to deal particularly well with the formatting of the macro invocations in the parser (which is maybe not so surprising). ac960b0 applies some of rustfmts choices that I do agree with. I'd recommend you just disable automatic reformatting in your IDE for this project, I'll also see if I can configure rustfmt to come up with something more tasteful (but not sure how to get around the macro problem).

@sanmai-NL
Copy link
Contributor

sanmai-NL commented Nov 19, 2017

Perhaps this was a misunderstanding. The old IMAP Response parser in rust-imap failed in some cases, where it tried to decode bytes as UTF-8. First of all, we need to find out whether these issues still occur with the original reporters after mattnenterprise/rust-imap#58 is merged.

If not, then we may want to trace those original issues. Reading through various RFCs, however, I think that UTF-8 may not appear anywhere outside of message bodies when the capability to handle UTF-8 isn't advertised. I conclude that from the applicable RFC 6855 (note that an obsolete version of it was discussed earlier). Moreover, I cannot say UTF-8-invalid byte sequences may appear at all. I've tested creating folders with UTF-8 characters, messages with UTF-8 envelope senders, subjects and bodies on an Exchange as well as Dovecot IMAP server using Mozilla Thunderbird, but in all cases these strings are encoded in an ASCII compatible way. The only potential issue I see is that NULL bytes aren't allowed and such. The strings that appear in IMAP protocol messages are rather a restricted subset of ASCII. We should test the imap-proto parser to ensure it rejects e.g. UTF-8 bytes that are in the complement of the valid bytes for IMAP. I still have only had time to look at all the code and RFCs haphazardly, keep that in mind, but I presume we could put in a bit of extra work to enhance the imap-proto parser to implement the complete grammar down to particular character classes, starting at RFC 4466.

This fixes djc#5 as proposed in that issue. Specifically, it introduces a
trait, `FromByteResponse`, and provides a generic `parse_response`
function that parses input as raw byte strings, and then maps them using
some implementation of that trait. This in turns allows users to choose
how they want to parse byte sequences in an ergonomic way.

Under the hood, this is *slightly* less efficient that it could be.
Specifically, it parses everything as `&[u8]` first, and then maps
everything using calls to `FromByteResponse`. This works fine, but will
cause unnecessary re-allocation of vectors inside a bunch of structs.
The way to work around this is to have `nom` directly use a generic
return value in all its parsers, but this unfortunately doesn't seem to
be supported at the time of writing.
@jonhoo
Copy link
Author

jonhoo commented Nov 19, 2017

Interesting... Well, if we never need to worry about std::str::from_utf8 failing, then I'm all in favor of keeping the code as it is (that is, returning &str everywhere). There are two advantages of the approach taken in this PR though:

  1. The user can choose to parse out to owned String values directly, and all the way down, for some messages, if they so wish.
  2. We can propagate decoding errors directly up to the user. This would require changing the return type to of FromByteResponse and the various map_bytes methods to return Result, but avoids unwrap inside the library itself.

Those are somewhat orthogonal though, and may not be worth the extra code.

@djc
Copy link
Owner

djc commented Nov 21, 2017

Yeah, so based on all this I don't think this is the right approach at this time. It feels to complex, even more so given that we don't really have a solid use case for this yet. Jon, sorry you spent time on something that won't be merged in the short term! I hope you did get something out of it.

As for the unwrap() calls, I think the right approach there is to use the map_res!() macro from nom to convert those to parser errors, instead of unwrapping them in place. Could probably even do a nice macro that takes care of the parse-error-converting std::str::from_utf8() in a concise way. If either of you wants to take that one, that might be a nicely manageable project.

@djc djc closed this Nov 21, 2017
@sanmai-NL
Copy link
Contributor

sanmai-NL commented Nov 21, 2017

@djc: Rushing to a solution would be a bit premature. Not to say this was definitely rushing, just that I can understand how you would perceive it @djc. I hope to help @jonhoo in getting to the bottom of the UTF-8 decoding issue that was reported. Closing the PR may be also a bit premature in that sense, perhaps my latest analysis that there's shouldn't be UTF-8 invalid byte sequences anywhere outside of e-mails is wrong again.

@djc: I'd like to tackle the error handling part! See #13.

@djc
Copy link
Owner

djc commented Nov 21, 2017

Reopening the PR will be cheap, and I am more than happy to do so if it turns out I'm totally wrong about all this!

@jonhoo
Copy link
Author

jonhoo commented Nov 21, 2017

Given that I personally have no use-case for non-UTF8 servers, this all seems fine by me. I mostly just did this because I want to close mattnenterprise/rust-imap#54 :p
And don't worry about the lost work. I knew going into it that it might not pan out.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

No support for non-UTF8 parsing
3 participants