# fread spends too much time in is_url/is_secureurl/is_file for long in-memory input #2531
Seems like a problem of the regex engine more than anything: anchoring the pattern to the beginning of the string should mean the match can fail after inspecting only the first few characters. What I have in mind is that R's default regex engine probably builds a histogram of the whole target string before applying the regexp, which makes the cost proportional to the input length.
Indeed it may be more of an R issue, but it is something one can at least work around in `fread`. The additional penalty (even when using `perl = TRUE`) is still measurable, as the benchmarks below show.
For later reference, the utility functions compared in the benchmarks:

```r
is_url      <- function(x) grepl("^(http|ftp)s?://", x)
is_url_perl <- function(x) grepl("^(?:http|ftp)s?://", x, perl = TRUE)
is_url2     <- function(x) {
  (startsWith(x, "ht") && (startsWith(x, "http://") || startsWith(x, "https://"))) ||
    (startsWith(x, "ft") && (startsWith(x, "ftp://") || startsWith(x, "ftps://")))
}

microbenchmark::microbenchmark(is_url(input), is_url_perl(input), is_url2(input), times = 50)
# Unit: microseconds
#                expr         min          lq        mean       median          uq        max neval cld
#       is_url(input) 1001673.642 1038144.688 1076978.567 1070702.5405 1112537.553 1185685.65    50   c
#  is_url_perl(input)   23280.392   24793.807   30226.567   27345.3795   30737.237   51998.24    50  b
#      is_url2(input)       1.807       2.711     139.927      12.8005      18.975     6414.17    50 a
```
```r
URL <- "https://github.com/Rdatatable/data.table/issues/2531"
microbenchmark::microbenchmark(is_url(URL), is_url_perl(URL), is_url2(URL), times = 50)
#              expr    min     lq     mean median     uq    max neval cld
#       is_url(URL)  6.626  7.228  8.40290  7.680  8.132 30.118    50  b
#  is_url_perl(URL) 65.657 66.259 67.40972 66.560 66.862 92.462    50   c
#      is_url2(URL)  1.506  1.808  2.25304  2.108  2.409  8.132    50 a
```
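The `input` object used in the benchmarks above is not shown in the thread; a minimal sketch of a comparable large in-memory CSV string (a hypothetical reconstruction, not the original benchmark data) would be:

```r
# Build a large in-memory CSV string comparable to the benchmark `input`
# (assumption: the issue does not show how the original was constructed).
make_input <- function(n_rows = 1e5) {
  i <- seq_len(n_rows)
  paste(c("a,b,c", paste(i, i * 2, i * 3, sep = ",")), collapse = "\n")
}
input <- make_input()
startsWith(input, "http")  # FALSE: not a URL, determined from the first characters
```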
This was a great report -- many thanks!
Reordered the checks in `fread.R` in order to speed up execution time when `input` is the actual data: `file.exists` is very slow, so checking first for newlines speeds up the process. See also the discussion in Rdatatable#2531.
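A minimal sketch of the newline-first check described in that commit (an assumed shape, not the actual `fread.R` code):

```r
# If the input contains a newline it cannot be a file path or a URL,
# so treat it as inline data without ever calling file.exists().
# fixed = TRUE makes this a plain substring search, cheap even for very long strings.
is_inline_data <- function(input) {
  grepl("\n", input, fixed = TRUE)
}
is_inline_data("a,b\n1,2\n")  # TRUE  -> skip is_url()/file.exists() entirely
is_inline_data("data.csv")    # FALSE -> fall through to the slower checks
```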
Thank you for looking into this issue. Unfortunately the slow `file.exists` check is still reached for long inline input. Fortunately the next check you are doing (looking for a newline) is fast, so the easiest fix seems to be to swap the two conditions. The only case where I can see the swap making a difference is if a file name were to contain a newline, which should not happen in practice. You suggested that I re-open the issue in case it's not working, but as the reporter I cannot re-open it, so I have opened PR #2630 with the proposed fix. The other part of your fix, with `substring` etc., does indeed avoid the expensive regex on the full input.
Thanks for the follow-up. Yes, I had thought `file.exists()` would be fast on huge input, since the operating system would detect that it was invalid within the first few characters. Interesting that it isn't; maybe R passes over the string first before calling the OS for some reason. Anyway, will proceed with your PR, thanks.
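The observation that `file.exists()` is slow on huge strings can be checked directly (a sketch; exact timings vary by platform and R version):

```r
# Compare file.exists() on a short non-existent path vs. a very long
# data-like string; per the report above, the long case is markedly slower.
short_name <- "no_such_file.csv"
long_input <- paste(rep("a,b,c", 2e5), collapse = "\n")
system.time(for (i in 1:200) file.exists(short_name))
system.time(for (i in 1:200) file.exists(long_input))
```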
## Summary

When `fread` is fed a character string as input, the routine spends a considerable amount of time detecting that the supplied input is not a filename or a URL. This is due to `grepl` not scaling well for the large inputs used in `fread`.

## Example
Although the pattern is anchored at the beginning of the string, running `grepl` on large inputs takes a lot of time (more detailed benchmarks further down). This can lead to the full call to `fread` spending a third of its time on these supposedly simple checks (example also below).

Possible solutions could be one or more of the following:

- Switch `grepl` to the PCRE regexp engine via `perl=TRUE`.
- Use another method to determine whether the input starts with a URL (see benchmarks below).
- Add an argument such as `str=` to denote that the input is to be considered the data itself, skipping the tests for URL or file. This would be similar in spirit to the `file=` argument. As a side effect, it would also allow reading input that consists only of URLs and has no header.
## Profiling example
## Benchmarking `grepl` and friends

Comparing different functions that verify whether a string starts with any of http(s)/ftp(s) or file shows that `grepl` scales badly and is by far the slowest of the tested variants. Simply adding `perl=TRUE` already improves performance by around a factor of 100 for large inputs (code further below).

## sessionInfo
Results are similar with R 3.4.3 / data.table 1.10.4 / Windows 10 64bit