# fread spends too much time in is_url/is_secureurl/is_file for long in-memory input #2531
Seems like a problem of the regex engine more than anything: anchoring the pattern to the beginning of the string should mean the match can fail after inspecting only the first few characters. What I have in mind is that R's default regex engine probably builds a histogram of the whole target string before applying the regexp, which makes the cost proportional to the input length.
Indeed it may be more of an R issue, but it is something one can at least work around in `fread`. The additional penalty (even when using `perl = TRUE`) is still measurable, as the benchmarks below show.
For later reference, the utility functions compared in the benchmarks:

```r
is_url      <- function(x) grepl("^(http|ftp)s?://", x)
is_url_perl <- function(x) grepl("^(?:http|ftp)s?://", x, perl = TRUE)
is_url2     <- function(x) {
  (startsWith(x, "ht") && (startsWith(x, "http://") || startsWith(x, "https://"))) ||
    (startsWith(x, "ft") && (startsWith(x, "ftp://") || startsWith(x, "ftps://")))
}

microbenchmark::microbenchmark(is_url(input), is_url_perl(input), is_url2(input), times = 50)
# Unit: microseconds
#                expr         min          lq        mean       median          uq        max neval cld
#       is_url(input) 1001673.642 1038144.688 1076978.567 1070702.5405 1112537.553 1185685.65    50   c
#  is_url_perl(input)   23280.392   24793.807   30226.567   27345.3795   30737.237   51998.24    50  b
#      is_url2(input)       1.807       2.711     139.927      12.8005      18.975     6414.17    50 a
```
```r
URL <- "https://github.com/Rdatatable/data.table/issues/2531"
microbenchmark::microbenchmark(is_url(URL), is_url_perl(URL), is_url2(URL), times = 50)
#              expr    min     lq     mean median     uq    max neval cld
#       is_url(URL)  6.626  7.228  8.40290  7.680  8.132 30.118    50  b
#  is_url_perl(URL) 65.657 66.259 67.40972 66.560 66.862 92.462    50   c
#      is_url2(URL)  1.506  1.808  2.25304  2.108  2.409  8.132    50 a
```
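The `input` object used in the benchmarks above is not shown in the thread; a minimal sketch of a comparable large in-memory CSV string (a hypothetical reconstruction, not the original benchmark data) would be:

```r
# Build a large in-memory CSV string comparable to the benchmark `input`
# (assumption: the issue does not show how the original was constructed).
make_input <- function(n_rows = 1e5) {
  i <- seq_len(n_rows)
  paste(c("a,b,c", paste(i, i * 2, i * 3, sep = ",")), collapse = "\n")
}
input <- make_input()
startsWith(input, "http")  # FALSE: not a URL, determined from the first characters
```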
This was a great report -- many thanks!
Reordered the checks in `fread.R` in order to speed up execution time when `input` is the actual data: `file.exists` is very slow, so checking first for newlines speeds up the process. See also the discussion in Rdatatable#2531.
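A minimal sketch of the newline-first check described in that commit (an assumed shape, not the actual `fread.R` code):

```r
# If the input contains a newline it cannot be a file path or a URL,
# so treat it as inline data without ever calling file.exists().
# fixed = TRUE makes this a plain substring search, cheap even for very long strings.
is_inline_data <- function(input) {
  grepl("\n", input, fixed = TRUE)
}
is_inline_data("a,b\n1,2\n")  # TRUE  -> skip is_url()/file.exists() entirely
is_inline_data("data.csv")    # FALSE -> fall through to the slower checks
```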
Thank you for looking into this issue. Unfortunately the slow `file.exists` check is still reached for long inline input. Fortunately the next check you are doing (looking for a newline) is fast, so the easiest fix seems to be to swap the two conditions. The only case where I can see the swap making a difference is if a file name were to contain a newline, which should not happen in practice. You suggested that I re-open the issue in case it's not working, but as the reporter I cannot re-open it, so I have opened PR #2630 with the proposed fix. The other part of your fix, with `substring` etc., does indeed avoid the expensive regex on the full input.
Thanks for the follow-up. Yes, I had thought `file.exists()` would be fast on huge input, since the operating system would detect that it was invalid within the first few characters. Interesting that it isn't; maybe R passes over the string first before calling the OS for some reason. Anyway, will proceed with your PR, thanks.
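The observation that `file.exists()` is slow on huge strings can be checked directly (a sketch; exact timings vary by platform and R version):

```r
# Compare file.exists() on a short non-existent path vs. a very long
# data-like string; per the report above, the long case is markedly slower.
short_name <- "no_such_file.csv"
long_input <- paste(rep("a,b,c", 2e5), collapse = "\n")
system.time(for (i in 1:200) file.exists(short_name))
system.time(for (i in 1:200) file.exists(long_input))
```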
## Summary

When `fread` is fed a character string as input, the routine spends a considerable amount of time detecting that the supplied input is not a filename or a URL. This is due to `grepl` not scaling well for the large inputs used in `fread`.

## Example
Although the pattern is anchored at the beginning of the string, running `grepl` on large inputs takes a lot of time (more detailed benchmarks further down). This can lead to the full call to `fread` spending a third of its time on these supposedly simple checks (example also below).

Possible solutions could be one or more of the following:

- Switch `grepl` to the PCRE regexp engine via `perl=TRUE`.
- Use another method to determine whether the input starts with a URL (see benchmarks below).
- Add an argument such as `str=` to denote that the input is to be considered the data itself, skipping the tests for URL or file. This would be similar in spirit to the `file=` argument. As a side effect, it would also allow reading input that consists only of URLs and has no header.
## Profiling example
## Benchmarking `grepl` and friends

Comparing different functions that verify whether a string starts with any of http(s)/ftp(s) or file shows that `grepl` scales badly and is by far the slowest of the tested variants. Simply adding `perl=TRUE` already improves performance by around a factor of 100 for large inputs (code further below).

## sessionInfo
Results are similar with R 3.4.3 / data.table 1.10.4 / Windows 10 64bit