Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rewrite of CSV regex parsing to avoid exponential behavior #2513

Merged

Conversation

petchema
Copy link
Collaborator

@petchema petchema commented Apr 27, 2023

Problem was discovered after DFU froze (busy-looping) while parsing a CSV with a faulty

1554,"%it, Le Protecteur

(missing closing quotes) line in Text/Internal_MagicItems.csv

Nested star operators - like here, (a*b*)* - can take exponential time under the right conditions.
Minimal fix is to replace this pattern with (a+|b+)*, assuming a and b share no common prefix.

I did not stop there and removed the intermediate matches array by directly parsing key and value columns together in the regex.

Another potential bug fix, .Trim("\"") may remove more than the quotes surrounding quoted strings, ex.

"""Approach"" said the King" -> Approach" said the King

so I replaced it with a more explicit removal of surrounding quotes.

Problem was discovered after DFU started busy-looping while parsing a
CSV with a faulty

    1554,"%it, Le Protecteur

(missing closing quotes) line in Text/Internal_MagicItems.csv

Nested star operators - like here, (a*b*)* - can take exponential time
under the right conditions.
Minimal fix is to replace this pattern with (a+|b+)*, assuming
a and b share no common prefix.

I did not stop there and removed the intermediate matches array by
directly parsing key and value columns together in the regex.

Another potential bug with .Trim("\""), it may remove more than the
quotes surrounding quoted strings, ex.

"""Approach"" said the King" -> Approach" said the King
@petchema petchema added the bug label Apr 27, 2023
@Interkarma
Copy link
Owner

Thanks for suggested fix Pango, I appreciate it. :)

I'll just need to regression test change against master CSV files. These are exported direct from string tables using Unity Editor (here), so that's always the baseline for compatibility and expected format of CSV input.

Pierre Etchemaite added 3 commits April 27, 2023 09:37
Also, return the lines as a List instead of an Array.

Rationale: Arrays must allocate exactly the number of elements, so
IEnumerable.ToArray() overallocates as it aggregates elements, then as a
last step does an extra copy to return the exact number of elements.

Since the result is short lived, extra allocated elements is not a
concern.

Source:
https://stackoverflow.com/questions/1105990/is-it-better-to-call-tolist-or-toarray-in-linq-queries

Another solution would have been to return an IEnumerable, but it could
generate parsing exceptions from unexpected places.
I feel previous style is more consistent with the rest of DFU coding
style. Both approaches have a O(n) complexity.
@Interkarma
Copy link
Owner

This is excellent! It handles all master CSVs perfectly, plus some edge cases I look out for. It stands up to bad input in value field much better than the previous. Thank you Pango!

@Interkarma Interkarma merged commit bf8094d into Interkarma:master Apr 27, 2023
@petchema
Copy link
Collaborator Author

@KABoissonneault @jefetienne
Thank you for the reviews!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants