io::IoError should carry info on the invalid byte sequence on non-utf8 InvalidInput #12113
Comments
(I would also be satisfied with a variant of ((after further review, the above proposal seems similar to the very old #1675)) |
One possibility would be to add a new |
I was making a little program to post-process my IRC logs, which unfortunately for some reason have non-UTF-8 bytes mixed in, so it was important to me to have a reasonable way to recover from these scenarios and resume the parsing. I hacked up something that worked for me, but I doubt it's clean enough to be put into the standard lib. (I was happy that Rust's stdlib does at least expose enough functionality for me to get the job done, e.g. by making helpers like The experience showed me that this is not as trivial a problem as I was making it out to be (e.g. I think a fully general interface needs to allow one to feed in a prefix sequence of characters that were left over from a previous failed call to Unfortunately my main experiences in the past with such problems (e.g. in Flash) were only just further instances where the provided APIs were not flexible enough. Anyway, hopefully I'll iterate more on this and come up with something palatable. |
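A minimal sketch of the "leftover prefix" idea from the comment above, written against current stable Rust rather than the 2014-era API under discussion. `ChunkDecoder` and `feed` are hypothetical names, not anything in the standard library; the only std calls relied on are `std::str::from_utf8`, `Utf8Error::valid_up_to`, and `Utf8Error::error_len`. Bytes that might be the start of a character cut off at a chunk boundary are saved and prepended to the next chunk, while bytes that are definitely invalid are handed back to the caller.

```rust
use std::str;

/// Hypothetical chunk-wise decoder: bytes of a character that is cut off at
/// the end of one chunk are kept and prepended to the next chunk.
#[derive(Default)]
struct ChunkDecoder {
    pending: Vec<u8>,
}

impl ChunkDecoder {
    /// Appends decoded text to `out` and returns the invalid sequences found.
    fn feed(&mut self, chunk: &[u8], out: &mut String) -> Vec<Vec<u8>> {
        // Start from whatever was left over after the previous call.
        let mut buf = std::mem::take(&mut self.pending);
        buf.extend_from_slice(chunk);
        let mut invalid = Vec::new();
        let mut rest: &[u8] = &buf;
        loop {
            match str::from_utf8(rest) {
                Ok(s) => {
                    out.push_str(s);
                    break;
                }
                Err(e) => {
                    let (valid, after) = rest.split_at(e.valid_up_to());
                    out.push_str(str::from_utf8(valid).expect("prefix is valid UTF-8"));
                    match e.error_len() {
                        // Definitely invalid: report the bytes and keep going.
                        Some(n) => {
                            invalid.push(after[..n].to_vec());
                            rest = &after[n..];
                        }
                        // Possibly just truncated: save it for the next call.
                        None => {
                            self.pending = after.to_vec();
                            break;
                        }
                    }
                }
            }
        }
        invalid
    }
}

fn main() {
    let mut dec = ChunkDecoder::default();
    let mut out = String::new();
    // "é" is 0xC3 0xA9 in UTF-8; here it is split across two chunks.
    let bad1 = dec.feed(b"caf\xC3", &mut out);
    let bad2 = dec.feed(b"\xA9 \xFF!", &mut out);
    println!("{:?} {:?} {:?}", out, bad1, bad2);
}
```

A real interface would also need a way to flush `pending` at end of input, since a trailing partial sequence is invalid once no more bytes can arrive.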
With IO reform, |
Triage: too much time has passed and too much has changed, so I don't actually remember what the right thing is here. I believe this boils down to #27802, i.e., we still haven't decided what happens when you get an invalid @rust-lang/libs, opinions? |
We certainly now have the ability to do this via the custom error payload you can pass into |
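As an illustration of the custom-payload approach, here is a hedged sketch that assumes the constructor meant is `std::io::Error::new`, which accepts any boxed error as its payload. `InvalidUtf8` and `check_utf8` are hypothetical names introduced only for this example, and the choice of `ErrorKind::InvalidData` is also an assumption. The payload carries the offending bytes and the number of valid bytes before them, and the client gets them back with `get_ref` plus `downcast_ref`.

```rust
use std::error::Error;
use std::fmt;
use std::io;

/// Hypothetical payload type (not part of std): the bytes that failed to
/// decode and how many valid bytes preceded them.
#[derive(Debug)]
struct InvalidUtf8 {
    bytes: Vec<u8>,
    offset: usize,
}

impl fmt::Display for InvalidUtf8 {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "invalid UTF-8 sequence {:02x?} at byte offset {}", self.bytes, self.offset)
    }
}

impl Error for InvalidUtf8 {}

/// Hypothetical helper: validate a buffer and, on failure, attach the
/// offending bytes to the returned `io::Error`.
fn check_utf8(buf: &[u8]) -> io::Result<&str> {
    std::str::from_utf8(buf).map_err(|e| {
        let start = e.valid_up_to();
        // `error_len()` is `None` when the buffer ends mid-character.
        let len = e.error_len().unwrap_or(buf.len() - start);
        io::Error::new(
            io::ErrorKind::InvalidData,
            InvalidUtf8 { bytes: buf[start..start + len].to_vec(), offset: start },
        )
    })
}

fn main() {
    let err = check_utf8(b"ok\xFF\xFEmore").unwrap_err();
    // Client code recovers the payload by downcasting the inner error.
    if let Some(info) = err.get_ref().and_then(|e| e.downcast_ref::<InvalidUtf8>()) {
        println!("bad bytes {:?} after {} valid bytes", info.bytes, info.offset);
    }
}
```

The downcast step is the weak point of this design: nothing in the type of `io::Error` tells the caller that such a payload might be present, so the convention would have to be documented.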
I agree that this essentially could become a sub-issue of #27802; but the discussion of that issue seems to focus on the semantics of a char iterator's interaction with an underlying byte stream (which is very important), while this issue is more of a "it would be nice if the error object that gets bubbled up actually carried enough info for a user to do recovery." We haven't always done such a great job in this respect, IMO, so I'm trying to be explicit about it here. |
If you feed in a byte stream that is almost utf-8 but has errors, a looped series of calls to `fn read_char` will eventually return an `IoError` with `kind == InvalidResult`.
Unfortunately, the returned `IoError` does not include any information about what the bytes were that were invalid (nor does it include information like how many bytes were read from the input before the error was encountered).
It seems like it would not be that bad to change `IoError` so that its `detail` field could be an `Option<Either<~str, ~[u8]>>`, or something along those lines, so that in this scenario, the `InvalidResult` would imply that one could look at the `detail` field to determine what the byte sequence was that caused the problem (and then the client code would have the option of substituting in a different character sequence specific to the byte sequence that failed).
(Alternatively, we could change `IoErrorKind` so that the `InvalidResult` variant carried an `Option<~[u8]>`, but then the `IoErrorKind` would no longer be a C-like enum.)
I believe that this is strictly more expressive than just mapping every replacement to a single replacement character, as is done by `from_utf8_lossy` (#12062).
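To make the recovery scenario concrete, here is a minimal sketch in current stable Rust (not the `IoError` API discussed in this issue). It uses `std::str::from_utf8`, whose `Utf8Error` reports how many bytes were valid (`valid_up_to`) and how long the invalid sequence is (`error_len`). `decode_with_recovery` is a hypothetical helper name, and treating stray bytes as Latin-1 is just one possible policy a client might pick.

```rust
use std::str;

/// Decode `input`, calling `on_invalid` with each invalid byte sequence so the
/// caller can choose a replacement specific to those bytes.
fn decode_with_recovery<F>(mut input: &[u8], mut on_invalid: F) -> String
where
    F: FnMut(&[u8]) -> String,
{
    let mut out = String::new();
    loop {
        match str::from_utf8(input) {
            Ok(valid) => {
                out.push_str(valid);
                return out;
            }
            Err(err) => {
                let (valid, rest) = input.split_at(err.valid_up_to());
                // The prefix is guaranteed valid, so this cannot fail.
                out.push_str(str::from_utf8(valid).expect("prefix is valid UTF-8"));
                // `error_len()` is `None` when the input ends mid-character;
                // treat the whole tail as the offending sequence in that case.
                let bad_len = err.error_len().unwrap_or(rest.len());
                let (bad, tail) = rest.split_at(bad_len);
                out.push_str(&on_invalid(bad));
                input = tail;
            }
        }
    }
}

fn main() {
    // 0xE9 is 'é' in Latin-1 but is not a valid UTF-8 sequence here.
    let log = b"caf\xE9 au lait";
    let text = decode_with_recovery(log, |bad| {
        // Example policy: interpret each stray byte as Latin-1.
        bad.iter().map(|&b| b as char).collect()
    });
    println!("{}", text);
}
```

Unlike `from_utf8_lossy`, the callback sees the actual failing bytes, so the replacement can differ per sequence, which is the extra expressiveness argued for above.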