-
Notifications
You must be signed in to change notification settings - Fork 338
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Rewrite of CSV regex parsing to avoid exponential behavior #2513
Rewrite of CSV regex parsing to avoid exponential behavior #2513
Conversation
Problem was discovered after DFU started busy-looping while parsing a CSV with a faulty 1554,"%it, Le Protecteur (missing closing quotes) line in Text/Internal_MagicItems.csv Nested star operators - like here, (a*b*)* - can take exponential time under the right conditions. Minimal fix is to replace this pattern with (a+|b+)*, assuming a and b share no common prefix. I did not stop there and removed the intermediate matches array by directly parsing key and value columns together in the regex. Another potential bug with .Trim("\""), it may remove more than the quotes surrounding quoted strings, ex. """Approach"" said the King" -> Approach" said the King
Thanks for suggested fix Pango, I appreciate it. :) I'll just need to regression test change against master CSV files. These are exported direct from string tables using Unity Editor (here), so that's always the baseline for compatibility and expected format of CSV input. |
Also, return the lines as a List instead of an Array. Rationale: Arrays must allocate exactly the number of elements, so IEnumerable.ToArray() overallocates as it aggregates elements, then as a last step does an extra copy to return the exact number of elements. Since the result is short lived, extra allocated elements is not a concern. Source: https://stackoverflow.com/questions/1105990/is-it-better-to-call-tolist-or-toarray-in-linq-queries Another solution would have been to return an IEnumerable, but it could generate parsing exceptions from unexpected places.
I feel previous style is more consistent with the rest of DFU coding style. Both approaches have a O(n) complexity.
This is excellent! It handles all master CSVs perfectly, plus some edge cases I look out for. It stands up to bad input in value field much better than the previous. Thank you Pango! |
@KABoissonneault @jefetienne |
Problem was discovered after DFU froze (busy-looping) while parsing a CSV with a faulty
(missing closing quotes) line in Text/Internal_MagicItems.csv
Nested star operators - like here,
(a*b*)*
- can take exponential time under the right conditions.Minimal fix is to replace this pattern with
(a+|b+)*
, assuming a and b share no common prefix.I did not stop there and removed the intermediate
matches
array by directly parsing key and value columns together in the regex.Another potential bug fix,
.Trim("\"")
may remove more than the quotes surrounding quoted strings, ex.so I replaced it with a more explicit removal of surrounding quotes.