detection is really slow in some cases #13

kaskawu · 2020-05-08T12:25:36Z

Hey there, first of all, great project!

The following commands take a significant amount of time:

> python3 -m timeit -n 1 -- "from clevercsv import Detector; Detector().detect('fileurl="file://$PROJECT_DIR$/../aaaaaa_aaaaaaa_aaaaa/.aaa/." filepath=$')" 
1 loop, best of 5: 13.2 sec per loop
python3 -m timeit "from clevercsv import Detector; Detector().detect('a'*18)" 
1 loop, best of 5: 8.24 sec per loop

After a little benchmarking, the apparent cause is that the unix_path and url regexes in the detector are susceptible to ReDoS (catastrophic backtracking).

This change, which replaces the regexes with (hopefully) equivalent ones, fixes the most obvious issues:

-    "url": "((https?|ftp):\/\/(?!\-))?(((([\p{L}\p{N}]*\-?[\p{L}\p{N}]+)+\.)+([a-z]{2,}|local)(\.[a-z]{2,3})?)|localhost|(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}(\:\d{1,5})?))(\/[\p{L}\p{N}_\/()~?=&%\-\#\.:]*)?(\.[a-z]+)?",
-    "unix_path": "(\/|~\/|\.\/)(?:[a-zA-Z0-9\.\-\_]+\/?)+",
+    "url": "((https?|ftp):\/\/(?!\-))?(((?:[\p{L}\p{N}-]+\.)+([a-z]{2,}|local)(\.[a-z]{2,3})?)|localhost|(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}(\:\d{1,5})?))(\/[\p{L}\p{N}_\/()~?=&%\-\#\.:]*)?(\.[a-z]+)?",
+    "unix_path": "[~.]?(?:\/[a-zA-Z0-9\.\-\_]+)+\/?",
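
To illustrate why the old unix_path pattern backtracks so badly, here is a minimal sketch that compares the two patterns from the diff. It makes several assumptions: it uses the standard-library `re` module instead of the third-party `regex` package, it applies the patterns with `fullmatch`, and the names `OLD_UNIX_PATH`, `NEW_UNIX_PATH`, `timed_fullmatch`, `SAMPLES` and `ADVERSARIAL` are made up for this example. Absolute timings will differ from the numbers reported in this issue.

```python
import re
import time

# Patterns copied from the diff above (standard-library `re`, not `regex`).
OLD_UNIX_PATH = re.compile(r"(\/|~\/|\.\/)(?:[a-zA-Z0-9\.\-\_]+\/?)+")
NEW_UNIX_PATH = re.compile(r"[~.]?(?:\/[a-zA-Z0-9\.\-\_]+)+\/?")


def timed_fullmatch(pattern, text):
    """Return (matched, seconds) for a single fullmatch attempt."""
    start = time.perf_counter()
    matched = pattern.fullmatch(text) is not None
    return matched, time.perf_counter() - start


# Ordinary inputs used to spot-check that both patterns agree.
SAMPLES = ["/usr/bin", "~/data/file.csv", "./a/b/", "/", "abc", "/a//b"]

# A run of valid characters followed by one invalid character. In the old
# pattern the "/" between repetitions is optional, so the run of "a"s can be
# split across repetitions in exponentially many ways, and each split has to
# be tried and rejected. In the new pattern every repetition must start with
# "/", so there is only one way to split the input.
ADVERSARIAL = "/" + "a" * 18 + "!"

if __name__ == "__main__":
    for text in SAMPLES:
        old = OLD_UNIX_PATH.fullmatch(text) is not None
        new = NEW_UNIX_PATH.fullmatch(text) is not None
        print(f"{text!r}: old={old} new={new}")
    for name, pattern in (("old", OLD_UNIX_PATH), ("new", NEW_UNIX_PATH)):
        matched, seconds = timed_fullmatch(pattern, ADVERSARIAL)
        print(f"{name}: matched={matched} in {seconds:.6f}s")
```

Increasing the length of the run of `a` characters by one should roughly double the time taken by the old pattern, while the new pattern stays close to constant. Agreement on the sample inputs is only a spot check, not a proof that the two patterns are equivalent.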

New results:

> python3 -m timeit -n 1 -- "from clevercsv import Detector; Detector().detect('fileurl="file://$PROJECT_DIR$/../aaaaaa_aaaaaaa_aaaaa/.aaa/." filepath=$')" 
1 loop, best of 5: 4.17 msec per loop
:0: UserWarning: The test results are likely unreliable. The worst time (347 msec) was more than four times slower than the best time (4.17 msec).
> python3 -m timeit "from clevercsv import Detector; Detector().detect('a'*18)" 
1 loop, best of 5: 217 usec per loop

Python version: 3.8

GjjvdBurg · 2020-05-08T14:40:52Z

Hi @kaskawu! Thanks for your interest in the package and for reporting this issue. Strangely, I have a hard time replicating your results:

$ python3 -m timeit -- "from clevercsv import Detector; Detector().detect('fileurl="file://$PROJECT_DIR$/../aaaaaa_aaaaaaa_aaaaa/.aaa/." filepath=$')"
500 loops, best of 5: 721 usec per loop

and with the change you propose:

$ python3 -m timeit -- "from clevercsv import Detector; Detector().detect('fileurl="file://$PROJECT_DIR$/../aaaaaa_aaaaaaa_aaaaa/.aaa/." filepath=$')"
1 loop, best of 5: 638 usec per loop

What version of the regex package are you using?

That said, it does seem to make a massive difference on your system, so I'm certainly open to making this change. I do however want to make sure I fully understand the cause before implementing any changes. Thanks!

kaskawu · 2020-05-08T15:03:17Z

> pip3 freeze | grep regex
regex==2020.5.7
>

That said, I tested across multiple Python versions. I tried Python 3.7 and 3.8, and the slowdown only happens on 3.8:

Python 3.7:

> python3 --version
Python 3.7.7
> python3 -m timeit -- "from clevercsv import Detector; Detector().detect('fileurl="file://$PROJECT_DIR$/../aaaaaa_aaaaaaa_aaaaa/.aaa/." filepath=$')"
1 loop, best of 5: 5.75 msec per loop

Python 3.8:

> python3 --version
Python 3.8.2
> python3 -m timeit -n 1 -r 1 -- "from clevercsv import Detector; Detector().detect('fileurl="file://$PROJECT_DIR$/../aaaaaa_aaaaaaa_aaaaa/.aaa/." filepath=$')"
1 loop, best of 1: 19.7 sec per loop
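
Since the slowdown appears to depend on the Python version (and possibly on the installed `regex` version), it can help to record both alongside each timing. The following is a small sketch for doing so; the helper name `environment_report` is hypothetical and not part of CleverCSV, and it assumes Python 3.8 or later for `importlib.metadata`.

```python
import platform
from importlib import metadata


def environment_report(packages=("clevercsv", "regex")):
    """Collect the interpreter version and installed package versions."""
    report = {"python": platform.python_version()}
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The package is not installed in this environment.
            report[name] = "not installed"
    return report


if __name__ == "__main__":
    for name, version in environment_report().items():
        print(f"{name}=={version}")
```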

GjjvdBurg · 2020-05-08T16:19:35Z

Wow, that's very interesting! Thanks for doing some more digging. I'll take a more detailed look at this soon; hopefully I can reproduce it in some way and figure out a good solution. Thanks again for reporting it!

lmmentel · 2020-05-12T20:06:57Z

Same here: performance drops with Python 3.8.

python --version
Python 3.8.1
python -m timeit -n 1 -r 1 -- "from clevercsv import Detector; Detector().detect('fileurl="file://$PROJECT_DIR$/../aaaaaa_aaaaaaa_aaaaa/.aaa/." filepath=$')" 
1 loop, best of 1: 8.34 sec per loop

GjjvdBurg · 2020-05-12T22:29:08Z

Thanks again @kaskawu for reporting this issue. I've updated CleverCSV using the unix_path regex you suggested above (diving into it, that regex seemed to be the problem). I'm preparing an updated release of the package now. Thanks also @lmmentel for confirming!

@kaskawu

* Update URL regex to avoid catastrophic backtracking and increase performance. See [issue #13](#13) and [issue #15](#15). Thanks to @kaskawu for the fix and @jlumbroso for re-raising the issue.
* Add ``num_chars`` keyword argument to ``read_as_dicts`` and ``csv2df`` wrappers.
* Improve documentation w.r.t. handling large files. Thanks to @jlumbroso for raising this issue.

kaskawu changed the title ~~detection is really slow~~ detection is really slow in some cases May 8, 2020

GjjvdBurg mentioned this issue May 12, 2020

Simplify unix_path regex (fixes #13) #14

Merged

GjjvdBurg closed this as completed in a78de9c May 12, 2020

GjjvdBurg added a commit that referenced this issue May 12, 2020

CleverCSV Release 0.5.6

cc6fef9

* Fix speed of ``unix_path`` regex used in type detection. ([issue #13](#13)). Thanks to @kaskawu.

This was referenced May 19, 2020

clevercsv sniffer slows to a crawl on large-ish files (e.g. FEC data) #15

Open

Fix url regex speed #16

Merged
