-
Notifications
You must be signed in to change notification settings - Fork 533
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Parsing numeric literals can cause TooShort/Invalid #1399
Comments
Wow, nice report here @t-mart ! I recommend providing a PR (assuming @pitdicker @djc are also interested and would like to see a PR first). Pick the option that you feel would be best to implement. I recommend at least two commits to the PR to demonstrate the behavioral change
Chrono maintainers may decide on a different option but tangible code is a great starting point for a longer discussion. Overall, I agree that the errors from the various parse methods are broad. When implementing a parse function call, I often have to "throw spaghetti at the wall" until I get something sticking because the error offers little insight. |
@t-mart Thank you for the report. We had an issue and closed PR related to the problem of some specifies accepting a variable number of digits without look-ahead: #560 and #561. The main difficulty is that we support iterating over the format string while iterating over the input string. So we currently don't know the next formatting specifier while parsing some input. I'll have a better look at this issue later, maybe tomorrow. Would you be willing to work on it? |
The linked old PR shows a compromise. If backtracking can be limited to only one or a few functions that do the integer parsing I don't think it is a bad solution. |
One option is to add some complexity to the strftime parser to read ahead when it encounters a variable-with formatting specifier. If it is followed by a fixed number of digits (either other formatting specifiers or plain digits), it stores this number of fixed digits as extra information in the formatting item. If a variable-with specifier is followed by another variable-with specifier it is an error. Then the parser can use this information to leave some digits remaining when parsing a variable-with specifier. Disadvantage is the double parsing of parts of the format string. And that doing so is unnecessary for the other use: formatting. But that may not turn out to be much of a problem in practice. The double parsing of a few characters in the format string would only happen in cases where we now fail to parse an input string at all. |
Thank you for your consideration and guidance, everyone. I intend to PR on this. Give me a few days to get to some free time. |
@t-mart thanks for leading effort on this issue. Can you please share if you had time to look into it? vector.dev community uses chrono for datetime parsing and interest in this feature is growing with those cases: No pressure, just curios on whether slipped through the cracks or any help is needed. |
Thanks for the update/reminder, @DimDroll. Regrettably, I have not had time to start work here. Please jump in! |
Fox expectations: this is something that is not an easy fix that can be done in a couple of hours of work (at least it is not easy for me 😄). Not even sure if it can even be done in a backwards-compatible way. But feel free to dive in and come up with a plan. |
For a format/input string item that starts with a numeric literal, the various
parse_from_str
methods can return aParseError
error result ofTooShort
orInvalid
if that literal comes directly after a numeric specifier/value with variable parsing width (PW=∞).In other words, this problem can occur when, for a consecutive pair of format items,
a
andb
,a
is a specifier value of variable width andb
is a literal that starts with a digit:%s1
timestamp followed by literal digit%Y1
year followed by literal digitOf particular note is that this same sequence of (specifier, literal) manifests as errors of two different variants:
TooShort
Example:This same behavior exists for the other (PW=∞) specifiers, such as Year (e.g.,
NaiveDate::parse_from_str("01-01-+100001", "%m-%d-%Y1")
.Invalid
ExampleNote that this case contains the same specifier-literal pair as above, but the error is different. Probably because the pair is not at the end of the strings.
Cause?
I believe this has to do with the fact that
scan::number
is greedy against any next digit(s), even if that may consume a next literal. In general, the parse algorithm does not appear to backtrack to find another possible match location. This is an understandable compromise. Allowing for this case would essentially be implementing a regex engine.I do concede that this kind of input is pathological and would rarely occur in practical usage. But the current implementation feels wrong that it can do this just depending on the "numericality" of a literal, at least without it being documented. And, while there are a few mentions that the parser is greedy, it seems to only consider the cases where the input is missing obvious text from its format. (On the other hand, maybe it's just ok to say the parser is greedy as we currently do, and call it a day. Not sure.)
Options
If this finding is valid, (and indeed, if we do not want to implement a regex-engine-style approach) here are a few ways we could handle it:
During input parsing, reject numerically-prefixed literals after variable-length specifiers (
Year
,IsoYear
, andTimestamp
) by returningInvalid
or some new error variant (e.g.GreedyValue
). We could determine this with minimal (?) overhead by something like:Note that
Year
only exhibits this behavior when parsed with a leading+
/-
, so regular 4-digit years would never error this way.The same as above, but during format parsing (i.e.,
StrftimeItems::new(fmt)
). This would short-circuit on formats where the input could exhibit this behavior, but would be a false positive in the 4-digit Year case (no+
/-
), which we can only know once we start parsing the input.Or, just call this out in the documentation: Update
TooShort
andInvalid
's documentation to include this possibility, and maybe in some other high-level documentation places as well. This would be easiest and obviously cause no performance overhead, but maybe not as correct as we could be.Versions
rustc 1.77.0-nightly (5518eaa94 2024-01-29)
0.4.33
The text was updated successfully, but these errors were encountered: