Fix transform slowness #5493

Merged
merged 7 commits into from
Jan 6, 2024

OfekShilon
Contributor

The source of the transform slowness is this call to deparse on large inputs, made from within name_dots. name_dots seems to be a column-naming convenience utility aimed at other scenarios (perhaps a meld of data tables?), and in the call from transform its results are dropped altogether.

It is possible to bypass this costly call entirely by passing a dedicated argument to the data.table constructor and propagating it to name_dots via ..., but a simpler and probably good-enough solution is to cap deparse at a single line by adding the nlines=1 argument.
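To illustrate the effect of capping the output, here is a minimal base-R sketch. It is not the data.table code path itself: `x` is just an arbitrary stand-in for a large column, and the timings will vary by machine.

```r
# Base R only; x is an arbitrary stand-in for a large column.
x <- runif(1e5)

# A full deparse renders every element as text, spread over many lines.
full <- deparse(x)

# nlines = 1L asks deparse() to produce at most one line of output.
capped <- deparse(x, nlines = 1L)

length(full)    # grows with length(x)
length(capped)  # a single line

# Rough timing comparison; absolute numbers depend on the machine.
system.time(deparse(x))
system.time(deparse(x, nlines = 1L))
```

Since the deparsed text is only used to derive a default name, truncating it to one line should not lose anything the caller relies on in the transform case described above.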

Impact:
(1) Before the fix:

> df <- data.frame(x = runif(n = 1e7))
> dt <- as.data.table(df)
> system.time(df <- transform(df, y = round(x)))
   user  system elapsed 
  0.122   0.031   0.152 
> system.time(dt <- transform(dt, y = round(x)))
   user  system elapsed 
 19.658   0.157  19.808 

(2) After the fix:

> df <- data.frame(x = runif(n = 1e7))
> dt <- as.data.table(df)
> system.time(df <- transform(df, y = round(x)))
   user  system elapsed 
  0.147   0.001   0.147 
> system.time(dt <- transform(dt, y = round(x)))
   user  system elapsed 
  0.268   0.039   0.308 

Still about twice as slow as transform.data.frame, but nevertheless a ~60x speedup (19.8s down to 0.31s elapsed). That is probably enough for the (hopefully) few uses of transform.data.table.

@codecov

codecov bot commented Oct 20, 2022

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (685d51d) 97.48% compared to head (0bacebc) 99.51%.

❗ Current head 0bacebc differs from pull request most recent head 1e03fe7. Consider uploading reports for the commit 1e03fe7 to get more accurate results

Additional details and impacted files
@@             Coverage Diff             @@
##           1-15-99    #5493      +/-   ##
===========================================
+ Coverage    97.48%   99.51%   +2.02%     
===========================================
  Files           80       80              
  Lines        14862    14763      -99     
===========================================
+ Hits         14488    14691     +203     
+ Misses         374       72     -302     


@ColeMiller1
Contributor

Good find! A more direct approach that seems to work is to change this:

ans = do.call("data.table", c(list(`_data`), e[!matched]))

to:

ans = as.data.table(c(`_data`, e[!matched]))

On my computer, it is only slightly slower than base (~5% to 10% slower). As for the PR, a news item would be needed. I'm not sure there's a framework for performance tests, so a test might not be needed.
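One way to see why the do.call form is expensive is the base-R sketch below. `default_names` is a hypothetical helper written for this illustration, not data.table's name_dots. It only mimics the general pattern of deriving default names by deparsing unevaluated arguments.

```r
# Hypothetical helper (an assumption, not data.table code): derive a default
# name for each argument by deparsing the unevaluated argument expression and
# keeping only the first line.
default_names <- function(...) {
  exprs <- as.list(substitute(list(...)))[-1L]
  vapply(exprs, function(e) deparse(e)[1L], character(1L), USE.NAMES = FALSE)
}

x <- runif(1e5)

# Direct call: the argument expression is the symbol `x`, which is cheap
# to deparse.
direct <- default_names(x)

# do.call() with an already-evaluated list splices the vector itself into
# the call, so the "expression" being deparsed is all 1e5 numbers.
spliced <- do.call(default_names, list(x))

direct                 # the symbol name
substr(spliced, 1, 2)  # start of the deparsed vector literal
```

Under this reading, building the result with as.data.table on an already-named list avoids the problem because no argument expressions need to be deparsed to find names.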

@jangorecki jangorecki linked an issue Oct 21, 2022 that may be closed by this pull request
@OfekShilon OfekShilon changed the title from "Fix 5492 by limiting the costly deparse to nlines=1" to "Fix 5492" Oct 22, 2022
@jangorecki jangorecki changed the title from "Fix 5492" to "Fix transform slowness" Oct 22, 2022
@mattdowle mattdowle added this to the 1.14.5 milestone Nov 10, 2022
@jangorecki jangorecki modified the milestones: 1.14.9, 1.15.0 Oct 29, 2023
@jangorecki jangorecki modified the milestones: 1.15.0, 1.16.0 Nov 6, 2023
@MichaelChirico MichaelChirico changed the base branch from master to 1-15-99 January 6, 2024 10:44
@MichaelChirico
Member

Thanks @OfekShilon and @ColeMiller1 for the fix! Merging into a 'dev' branch that will become master as soon as 1.15.0 hits CRAN.

@OfekShilon PTAL at the NEWS, I edited it a bit given my understanding, we can fix in follow-up if I mis-paraphrased.

@MichaelChirico MichaelChirico merged this pull request into Rdatatable:1-15-99 Jan 6, 2024
1 check was pending
@OfekShilon OfekShilon deleted the 5492 branch January 6, 2024 12:09
MichaelChirico added a commit that referenced this pull request Jan 6, 2024
* Fix 5492 by limiting the costly deparse to `nlines=1`

* Implementing PR feedbacks

* Added  inside

* Fix typo in name

* Idiomatic use of  inside

* Separating the deparse line limit to a different PR

---------

Co-authored-by: Michael Chirico <chiricom@google.com>
MichaelChirico added a commit that referenced this pull request Jan 12, 2024
…5342)

MichaelChirico added a commit that referenced this pull request Jan 14, 2024
MichaelChirico added a commit that referenced this pull request Feb 17, 2024
MichaelChirico added a commit that referenced this pull request Feb 18, 2024
MichaelChirico added a commit to dvg-p4/data.table that referenced this pull request Feb 19, 2024
MichaelChirico added a commit that referenced this pull request Feb 19, 2024
MichaelChirico added a commit that referenced this pull request Apr 20, 2024
MichaelChirico added a commit that referenced this pull request Apr 23, 2024
MichaelChirico added a commit that referenced this pull request May 3, 2024
Successfully merging this pull request may close these issues.

transform is ~100x slower on data.table than on data.frame
5 participants