
Error handling improvement #13

Open
sanmai-NL opened this issue Nov 21, 2017 · 6 comments

Comments

@sanmai-NL
Contributor

@djc, #12 (comment):

As for the unwrap() calls, I think the right approach there is to use the map_res!() macro from nom to convert those to parser errors, instead of unwrapping them in place. Could probably even do a nice macro that takes care of the parse-error-converting std::str::from_utf8() in a concise way. If either of you wants to take that one, that might be a nicely manageable project.

I will submit a PR for this.
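
To make the proposal concrete, a minimal sketch of that approach, assuming nom's macro syntax of the time (the parse_utf8! helper and token parser here are hypothetical, not names from this crate):

```rust
#[macro_use]
extern crate nom;

use std::str::from_utf8;

// Hypothetical convenience macro: run any sub-parser and convert the
// matched bytes to &str, so invalid UTF-8 becomes a parse error via
// map_res! instead of a panic from unwrap().
macro_rules! parse_utf8 (
    ($i:expr, $($args:tt)*) => (
        map_res!($i, $($args)*, from_utf8)
    );
);

// Example use: an alphanumeric token, returned as &str.
named!(token<&str>, parse_utf8!(take_while1!(nom::is_alphanumeric)));

fn main() {
    assert_eq!(
        token(&b"OK rest"[..]),
        nom::IResult::Done(&b" rest"[..], "OK")
    );
}
```

The point being that a malformed byte sequence then surfaces as a regular parse error rather than a panic.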

@sanmai-NL
Contributor Author

Related: #1.

@sanmai-NL
Contributor Author

sanmai-NL commented Nov 24, 2017

As it's still not clear whether it's a proper assumption that parsed byte sequences in responses are valid UTF-8, I'm not yet convinced about using map_res! as proposed by @djc in the opening post. &String::from_utf8_lossy(data) seems like a less impactful way to reduce the panic risk. I believe we would have to add more than a few lines of code to handle every instance where parsing a byte sequence as UTF-8 could theoretically(?) fail, and I'm not sure how you envision using a macro for that. As you see it, the macro would return the same error value for each of the many places where parsing bytes as UTF-8 could fail, right? Why not call map_err() instead, roughly as in the sketch below?
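
Roughly what I have in mind with map_err() (ParseError here is a hypothetical stand-in for whatever error type we would settle on):

```rust
use std::str::{self, Utf8Error};

// Hypothetical error type; the crate's real error enum may differ.
#[derive(Debug)]
enum ParseError {
    InvalidUtf8(Utf8Error),
}

// Decode bytes as UTF-8, mapping the failure into a domain error
// instead of unwrapping (and thus panicking) in place.
fn decode(data: &[u8]) -> Result<&str, ParseError> {
    str::from_utf8(data).map_err(ParseError::InvalidUtf8)
}
```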

What do you think?

@djc
Owner

djc commented Nov 24, 2017

I'd rather not hide the impact. Using from_utf8_lossy() means more or less silently losing data, which I think is a really bad choice. I think raising a parse error is better, as it makes it clear that something is wrong. Especially at the stage where we don't fully understand how servers have implemented the standard, we should raise errors so we can act on them to improve this further, rather than glossing over error cases.

If you really want to dig in, go and list all the places in the parser that currently use from_utf8() and figure out what kind of data we usually see there (at least from one server, for example) as well as what the formal syntax from RFC 3501 says (the parser names try to match the formal syntax names). Then we can make a more informed decision about it on a case-by-case basis.

@sanmai-NL
Contributor Author

It's true that from_utf8_lossy() theoretically sweeps invalid UTF-8 data under the carpet. We can, however, add a fuzz test case that asserts the UTF-8 replacement character does not occur, in order to discover such cases. (Note, I’m not suggesting that’s a better approach per se.)
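
A minimal sketch of such an assertion (function name hypothetical; note it would also trip on inputs that legitimately contain U+FFFD):

```rust
// Assert that lossy decoding did not substitute anything, so a fuzzer
// can surface inputs that are not in fact valid UTF-8.
fn assert_lossless(data: &[u8]) {
    let decoded = String::from_utf8_lossy(data);
    assert!(
        !decoded.contains('\u{FFFD}'),
        "input contained invalid UTF-8 (or a literal U+FFFD)"
    );
}
```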

Questions
I've been looking at many RFCs now and to me some things aren’t clear entirely:

  1. Is there a single, up-to-date IMAP grammar that also explicitly specifies all character sets/terminal symbols? I've looked at e.g. RFC 3501 and RFC 4466.
  2. Where in the IMAP protocol can byte sequences occur that are not guaranteed to be valid UTF-8?

Assumptions

  1. If the answer to 2 is nowhere, there’s no technical problem in using UTF-8 values (&str, String) in IMAP types, unless you'd want to serialize values of these types back into an IMAP response or command as a byte sequence (not sensible I suppose, and not supported right now).
  2. Handling (i.e., neither panicking nor crashing on) byte sequences that are valid UTF-8 but invalid under the respective IMAP productions is only necessary if correctness is a goal. That is, if correctness is the goal, the objective here is to handle invalid data so that it cannot pass the parser (and e.g. get transmitted to an IMAP server by code that depends on this library).
  3. A functional goal for this library is to parse IMAP commands as well as responses.

Are we on the same page about these assumptions?

Tangential observation
In a way, the approach of sticking with UTF-8 values (&str, String) within library code, followed by adding comprehensive UTF-8 decoding error handling, is an argument in favor of #12. After all, either we trust the data is UTF-8 or we don’t.

@djc
Owner

djc commented Nov 28, 2017

I don't know how you read the RFCs; I think this information is clear and fairly easy to find.

If you just read the formal syntax in RFC 3501, you can see that the CHAR8 terminal should be the only way for non-UTF-8 data to get into the protocol. That effectively means that any syntax productions using the literal rule can return non-UTF-8 data, which means any production that includes string or nstring. So theoretically, if you only read 3501, there are a lot of places where non-UTF-8 data can occur. However, RFC 2822 prescribes that header fields (both name and body) can only contain US-ASCII characters, so that should exclude a lot of places.
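
For reference, the relevant productions from RFC 3501's formal syntax:

```
literal = "{" number "}" CRLF *CHAR8
            ; Number represents the number of CHAR8s
CHAR8   = %x01-ff
            ; any OCTET except NUL, %x00
string  = quoted / literal
nstring = string / nil
```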

As such, my position will be that we only use [u8] for data types that hold data from the body of messages, and use str everywhere else. I really don't want to spend more time discussing this issue unless someone comes up with a counterexample.

@djc
Owner

djc commented Jan 8, 2018

I converted all the straightforward cases in 9a55385. There are a few more cases left where the bytes are wrapped in Option, which I haven't yet figured out how to cleanly deal with.
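
One possible shape for the Option-wrapped case, purely as an illustration (hypothetical helper name; not necessarily what will land):

```rust
use std::str::{self, Utf8Error};

// Decode an optional byte slice, propagating a Utf8Error rather than
// panicking; None stays None.
fn decode_opt(data: Option<&[u8]>) -> Result<Option<&str>, Utf8Error> {
    match data {
        Some(bytes) => str::from_utf8(bytes).map(Some),
        None => Ok(None),
    }
}
```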
