Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[r] Fix perf subsetting large concatenated matrices #179

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

bnprks
Copy link
Owner

@bnprks bnprks commented Dec 23, 2024

The logic for subsetting rbind or cbind matrices was taking time O(length(selection)* # sub-matrices), which could be quite slow for large datasets. Now, rather than performing a linear search through the selection indices to find the indices in range for each sub-matrix, we just do a binary search which eliminates the performance issue.

The logic for subsetting rbind or cbind matrices was taking time
O(length(selection)* # sub-matrices), which could be quite slow for large
datasets. Now, rather than performing a linear search through the
selection indices to find the indices in range for each sub-matrix,
we just do a binary search which eliminates the performance issue.
@bnprks bnprks changed the title [r] Fix perf subsetting large conatenated matrices [r] Fix perf subsetting large concatenated matrices Dec 23, 2024
Copy link
Collaborator

@immanuelazn immanuelazn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new local_i, and local_j search is really clever. Definitely was faster on my end, and only some minor copy/paste changes.

As a small thought exercise though, I think the main problem was running local_i <- i$subset[i$subset >= row_start & i$subset <= row_end] - last_row repetitively for each concatted matrix. We can notice that split_Selection_index() sorts indices already. I think this could have been replaced with a two pointer approach with the cumulative sums against i$subset, which would only require one pass through i$subset.

Reading through the findInterval documentation though, since the first arg is sorted by definition, it can be O(length(cumsum(c(0, rows))) even though a binary search is used. Overall the differences in speed shouldn't be substantial though!

r/R/matrix.R Outdated Show resolved Hide resolved
r/R/matrix.R Outdated Show resolved Hide resolved
r/R/matrix.R Outdated Show resolved Hide resolved
Co-authored-by: Immanuel Abdi <56730419+immanuelazn@users.noreply.github.com>
@bnprks bnprks force-pushed the bp/matrix-concat-subset-perf branch from 4ad367b to 2fdecde Compare January 9, 2025 01:20
@bnprks
Copy link
Owner Author

bnprks commented Jan 9, 2025

The findInterval logic is definitely a little trickier to follow than I'd like. If we were in C++ I'd definitely go for the two pointer approach, but I'm not actually sure if R has a good way to search arrays like that since R for loops are pretty suspect from a performance perspective.

Thanks for spotting those copy-paste editing issues!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants