[r] Fix perf subsetting large concatenated matrices #179

bnprks · 2024-12-23T04:50:04Z

The logic for subsetting rbind or cbind matrices took O(length(selection) * number of sub-matrices) time, which could be quite slow for large datasets. Rather than performing a linear search through the selection indices to find the indices in range for each sub-matrix, we now do a binary search, which eliminates the performance issue.
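For illustration only, here is a minimal sketch of the difference described above. It is not the code from this PR: `subset_linear`, `subset_binary`, `selection`, and `rows` are hypothetical names, and the sketch assumes `selection` is already sorted and holds 1-based row indices into the concatenated matrix.

```r
# Hypothetical stand-ins (not the PR's code): `selection` is a sorted vector
# of requested row indices, `rows` holds the row count of each sub-matrix.

# Linear version: scans the whole selection once per sub-matrix.
subset_linear <- function(selection, rows) {
  ends <- cumsum(rows)
  starts <- ends - rows
  lapply(seq_along(rows), function(k) {
    in_range <- selection > starts[k] & selection <= ends[k]
    selection[in_range] - starts[k]
  })
}

# Binary-search version: findInterval() locates each sub-matrix boundary
# within the sorted selection, so each sub-matrix only touches its own slice.
subset_binary <- function(selection, rows) {
  bounds <- cumsum(c(0, rows))
  # cut[k] = number of selected indices <= bounds[k]
  cut <- findInterval(bounds, selection)
  lapply(seq_along(rows), function(k) {
    idx <- seq_len(cut[k + 1] - cut[k]) + cut[k]
    selection[idx] - bounds[k]
  })
}

rows <- c(3, 4, 2)
selection <- c(2, 3, 5, 9)
linear_result <- subset_linear(selection, rows)
binary_result <- subset_binary(selection, rows)
```

Both functions return one vector of local indices per sub-matrix; the second avoids re-scanning `selection` for every sub-matrix.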

immanuelazn

The new local_i and local_j search is really clever. It was definitely faster on my end, and I only have some minor copy/paste changes to suggest.

As a small thought exercise, though, I think the main problem was running local_i <- i$subset[i$subset >= row_start & i$subset <= row_end] - last_row repeatedly for each concatenated matrix. Note that split_Selection_index() already sorts the indices, so I think this could have been replaced with a two-pointer approach that walks the cumulative sums against i$subset, which would only require one pass through i$subset.
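A rough sketch of what such a two-pointer pass could look like in plain R, under the assumption that the selection is sorted. The names `split_two_pointer`, `selection`, and `rows` are hypothetical and do not come from r/R/matrix.R, and an interpreted loop like this may well be slower in practice than a vectorized call.

```r
# Hypothetical sketch of a two-pointer split (not the PR's implementation).
# `selection`: sorted 1-based indices into the concatenated matrix.
# `rows`: row count of each sub-matrix.
split_two_pointer <- function(selection, rows) {
  bounds <- cumsum(c(0, rows))
  out <- vector("list", length(rows))
  n <- length(selection)
  p <- 1L  # pointer into `selection`; only ever moves forward
  for (k in seq_along(rows)) {
    first <- p
    # Advance while the current index still falls inside sub-matrix k
    while (p <= n && selection[p] <= bounds[k + 1]) {
      p <- p + 1L
    }
    # Indices first..(p - 1) belong to sub-matrix k; shift to local indices
    out[[k]] <- selection[seq_len(p - first) + first - 1L] - bounds[k]
  }
  out
}

two_pointer_result <- split_two_pointer(c(2, 3, 5, 9), c(3, 4, 2))
```

Because `p` never moves backwards, the loop visits each element of `selection` once in total, regardless of the number of sub-matrices.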

Reading through the findInterval documentation, though, since the first arg is sorted by definition, it can be O(length(cumsum(c(0, rows)))) even though a binary search is used. Overall, the differences in speed shouldn't be substantial!
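As a small illustration of the call shape being discussed, assuming the sub-matrix boundaries are passed as the first argument and the sorted selection as the second (the variable names here are hypothetical):

```r
rows <- c(3, 4, 2)                 # row counts of three sub-matrices
bounds <- cumsum(c(0, rows))       # boundaries: 0 3 7 9 (sorted by construction)
selection <- c(1, 4, 6, 7, 8)      # sorted selected row indices

# counts[k] is the number of selected indices <= bounds[k]
counts <- findInterval(bounds, selection)

# Differences between consecutive counts give the number of selected
# rows that land in each sub-matrix
per_matrix <- diff(counts)
```

Here `counts` is `0 1 4 5` and `per_matrix` is `1 3 1`, so the first argument has only one entry per boundary no matter how long `selection` is.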

r/R/matrix.R

Co-authored-by: Immanuel Abdi <56730419+immanuelazn@users.noreply.github.com>

bnprks · 2025-01-09T01:24:52Z

The findInterval logic is definitely a little trickier to follow than I'd like. If we were in C++ I'd definitely go for the two-pointer approach, but I'm not actually sure whether R has a good way to search arrays like that, since R for loops are pretty suspect from a performance perspective.

Thanks for spotting those copy-paste editing issues!

bnprks changed the title ~~[r] Fix perf subsetting large conatenated matrices~~ [r] Fix perf subsetting large concatenated matrices Dec 23, 2024

immanuelazn approved these changes Jan 4, 2025

Fix copy-paste errors

2fdecde

Co-authored-by: Immanuel Abdi <56730419+immanuelazn@users.noreply.github.com>

bnprks force-pushed the bp/matrix-concat-subset-perf branch from 4ad367b to 2fdecde January 9, 2025 01:20

bnprks added 2 commits January 8, 2025 23:39

Merge branch 'main' into bp/matrix-concat-subset-perf

aedcea5

Update NEWS

b7929a3
