
Error handling improvement #13

Open
sanmai-NL opened this issue Nov 21, 2017 · 6 comments

Comments

@sanmai-NL
Contributor

@djc, #12 (comment):

As for the unwrap() calls, I think the right approach there is to use the map_res!() macro from nom to convert those to parser errors, instead of unwrapping them in place. Could probably even do a nice macro that takes care of the parse-error-converting std::str::from_utf8() in a concise way. If either of you wants to take that one, that might be a nicely manageable project.

I will submit a PR for this.
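
To make the proposal concrete, a minimal sketch of that approach, assuming nom's macro syntax of the time (the parse_utf8! helper and token parser here are hypothetical, not names from this crate):

```rust
#[macro_use]
extern crate nom;

use std::str::from_utf8;

// Hypothetical convenience macro: run any sub-parser and convert the
// matched bytes to &str, so invalid UTF-8 becomes a parse error via
// map_res! instead of a panic from unwrap().
macro_rules! parse_utf8 (
    ($i:expr, $($args:tt)*) => (
        map_res!($i, $($args)*, from_utf8)
    );
);

// Example use: an alphanumeric token, returned as &str.
named!(token<&str>, parse_utf8!(take_while1!(nom::is_alphanumeric)));

fn main() {
    assert_eq!(
        token(&b"OK rest"[..]),
        nom::IResult::Done(&b" rest"[..], "OK")
    );
}
```

The point being that a malformed byte sequence then surfaces as a regular parse error rather than a panic.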

@sanmai-NL
Contributor Author

Related: #1.

@sanmai-NL
Contributor Author

sanmai-NL commented Nov 24, 2017

As it's still not clear whether it's a proper assumption that parsed byte sequences in responses are valid UTF-8, I'm not yet convinced about using map_res! as proposed by @djc in the opening post. &String::from_utf8_lossy(data) seems like a less impactful way to reduce the panic risk. I believe we would have to add more than a few lines of code to handle every instance where parsing a byte sequence as UTF-8 could theoretically(?) fail, and I'm not sure how you envision using a macro for that. As you see it, the macro would return the same error value for each of the many places where parsing bytes as UTF-8 could fail, right? Why not call map_err() instead, roughly as in the sketch below?
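
Roughly what I have in mind with map_err() (ParseError here is a hypothetical stand-in for whatever error type we would settle on):

```rust
use std::str::{self, Utf8Error};

// Hypothetical error type; the crate's real error enum may differ.
#[derive(Debug)]
enum ParseError {
    InvalidUtf8(Utf8Error),
}

// Decode bytes as UTF-8, mapping the failure into a domain error
// instead of unwrapping (and thus panicking) in place.
fn decode(data: &[u8]) -> Result<&str, ParseError> {
    str::from_utf8(data).map_err(ParseError::InvalidUtf8)
}
```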

What do you think?

@djc
Owner

djc commented Nov 24, 2017

I'd rather not hide the impact. Using from_utf8_lossy() means more or less silently losing data, which I think is a really bad choice. I think raising a parse error is better, as it makes it clear that something is wrong. Especially at the stage where we don't fully understand how servers have implemented the standard, we should raise errors so we can act on them to improve this further, rather than glossing over error cases.

If you really want to dig in, go and list all the places in the parser that currently use from_utf8() and figure out what kind of data we usually see there (at least from one server, for example) as well as what the formal syntax from RFC 3501 says (the parser names try to match the formal syntax names). Then we can make a more informed decision about it on a case-by-case basis.

@sanmai-NL
Contributor Author

It's true that from_utf8_lossy() theoretically sweeps invalid UTF-8 data under the carpet. We can, however, add a fuzz test case that asserts the UTF-8 replacement character does not occur, in order to discover such cases. (Note, I’m not suggesting that’s a better approach per se.)
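
A minimal sketch of such an assertion (function name hypothetical; note it would also trip on inputs that legitimately contain U+FFFD):

```rust
// Assert that lossy decoding did not substitute anything, so a fuzzer
// can surface inputs that are not in fact valid UTF-8.
fn assert_lossless(data: &[u8]) {
    let decoded = String::from_utf8_lossy(data);
    assert!(
        !decoded.contains('\u{FFFD}'),
        "input contained invalid UTF-8 (or a literal U+FFFD)"
    );
}
```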

Questions
I've been looking at many RFCs now and to me some things aren’t clear entirely:

  1. Is there a single, up-to-date IMAP grammar that also explicitly specifies all character sets/terminal symbols? I've looked at e.g. RFC 3501 and RFC 4466.
  2. Where in the IMAP protocol can byte sequences occur that are not guaranteed to be valid UTF-8?

Assumptions

  1. If the answer to 2 is nowhere, there’s no technical problem in using UTF-8 values (&str, String) in IMAP types, unless you'd want to serialize values of these types back into an IMAP response or command as a byte sequence (not sensible I suppose, and not supported right now).
  2. Handling (i.e., neither panicking nor crashing on) byte sequences that are valid UTF-8 but invalid under the respective IMAP productions is only necessary if correctness is a goal. That is, if correctness is the goal, the objective here is to handle invalid data so that it cannot pass the parser (and e.g. get transmitted to an IMAP server by code that depends on this library).
  3. A functional goal for this library is to parse IMAP commands as well as responses.

Are we on the same page about these assumptions?

Tangential observation
In a way, the approach of sticking with UTF-8 values (&str, String) within library code, followed by adding comprehensive UTF-8 decoding error handling, is an argument in favor of #12. After all, either we trust the data is UTF-8 or we don’t.

@djc
Owner

djc commented Nov 28, 2017

I don't know how you read the RFCs; I think this information is clear and fairly easy to find.

If you just read the formal syntax in RFC 3501, you can see that the CHAR8 terminal should be the only way for non-UTF-8 data to get into the protocol. That effectively means that any syntax productions using the literal rule can return non-UTF-8 data, which means any production that includes string or nstring. So theoretically, if you only read 3501, there are a lot of places where non-UTF-8 data can occur. However, RFC 2822 prescribes that header fields (both name and body) can only contain US-ASCII characters, so that should exclude a lot of places.
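
For reference, the relevant productions from RFC 3501's formal syntax:

```
literal = "{" number "}" CRLF *CHAR8
            ; Number represents the number of CHAR8s
CHAR8   = %x01-ff
            ; any OCTET except NUL, %x00
string  = quoted / literal
nstring = string / nil
```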

As such, my position will be that we only use [u8] for data types that hold data from the body of messages, and use str everywhere else. I really don't want to spend more time discussing this issue unless someone comes up with a counterexample.

@djc
Owner

djc commented Jan 8, 2018

I converted all the straightforward cases in 9a55385. There are a few more cases left where the bytes are wrapped in Option, which I haven't yet figured out how to cleanly deal with.
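
One possible shape for the Option-wrapped case, purely as an illustration (hypothetical helper name; not necessarily what will land):

```rust
use std::str::{self, Utf8Error};

// Decode an optional byte slice, propagating a Utf8Error rather than
// panicking; None stays None.
fn decode_opt(data: Option<&[u8]>) -> Result<Option<&str>, Utf8Error> {
    match data {
        Some(bytes) => str::from_utf8(bytes).map(Some),
        None => Ok(None),
    }
}
```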
