Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reconsider handling of scipen in fwrite #6457

Open
MichaelChirico opened this issue Sep 3, 2024 · 0 comments
Open

Reconsider handling of scipen in fwrite #6457

MichaelChirico opened this issue Sep 3, 2024 · 0 comments
Labels

Comments

@MichaelChirico
Copy link
Member

          > @philippechataignon do you want to have a go at adding a atime performance regression test? Totally fine if not -- what would help at least would be to write a simple benchmark of gzipped fwrite that you think would capture the important pieces of what's changed here, does that make sense?

OK for testing regression but notice that the core of fwrite hasn't change : same buffer sizes, same number of jobs, same number of rows per job. Personally I observe similar timings that previous version.

One point of discussion : I notice that #2020 introduces a change that I never realized before this PR. By default scipen fwrite parameter uses the value of R option scipen. Like a lot of persons, I have a options(scipen = 999) in my Rprofile because I don't like sci output on screen but I never realize that fwrite was penalized. Why ? Because of the maxLineLen which can be very high and gives a lot of batches with few lines and little chunk. In fwrite, scipen has an maximum of 350 but it's a very high limit. And maxLineLen is used for determining number and sizes of fwrite chunks.

For testing impact, I have this little program :

n = 10000
ncol = 1000
dt <- data.table(i=1:n)
dt[, paste0("V", 1:ncol) := lapply(1:ncol, function(x) as.numeric(sample(1:n, replace=T)))]
print(sessionInfo())
system.time(fwrite(dt, "/dev/null", compress="gzip", verbose=T, scipen=0))
system.time(fwrite(dt, "/dev/null", compress="gzip", verbose=T, scipen=999))

With scipen = 0

maxLineLen=61026. Found in 0.000s
Writing bom (false), yaml (0 characters) and column names (true)
Writing 10000 rows in 73 batches of 137 rows, each buffer size 8388608 bytes (8 MiB), showProgress=1, nth=4
zlib: uncompressed length=48947760 (46 MiB), compressed length=21719798 (20 MiB), ratio=44.4%, crc=26a23b0a
Written 10000 rows in 1.569 secs using 4 threads. MaxBuffUsed=7%

With scipen = 999

maxLineLen=761026. Found in 0.000s
Writing bom (false), yaml (0 characters) and column names (true)
Writing 10000 rows in 910 batches of 11 rows, each buffer size 8388608 bytes (8 MiB), showProgress=1, nth=4
zlib: uncompressed length=48947760 (46 MiB), compressed length=22745125 (21 MiB), ratio=46.5%, crc=26a23b0a
Written 10000 rows in 1.202 secs using 4 threads. MaxBuffUsed=0%

In last case real mean line length is ~ 5000 but estimated to 761026. Compression ratio is higher because the buffers are very little used. Surprisingly timing is better despite openmp number of threads overhead.

In my opinion, scipen parameter should be 0 in fwrite and not scipen option witch is related to digits option not present in fwrite and digits is higher in fwrite output (20 ?).

I use this little bench for scipen impact and I think it can be used for atime. I've tried to add this :

  # Issue with fwrite length for gzip output, fixed in: https://github.com/Rdatatable/data.table/pull/6393
  # No regression timing test
  "No regression fwrite" = atime::atime_test(
    N = 10^seq(2, 8),
    setup = {
      set.seed(1)
      ncol = 1000
      L <- data.table(i=1:N)
      L[, paste0("V", 1:ncol) := lapply(1:ncol, function(x) rnorm(N))]
    },
    expr = {
      fwrite(dt, "/dev/null", compress="gzip")
    },
    Fast = "117ab45674f1e56304abca83f9f0df50ab0274be", # Close-to-last merge commit in the PR
    Slow = "e73c2c849f921cf4ef51e3809842e0fee9b9f52c"),

but I'm not sure that /dev/null is portable and if we write a real file, that's made the timing.

OK for another one to continue and test that there is not time regression.

Originally posted by @philippechataignon in #6393 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

1 participant