[r] Fix perf subsetting large concatenated matrices #179
base: main
Conversation
The logic for subsetting rbind or cbind matrices was taking O(length(selection) × number of sub-matrices) time, which could be quite slow for large datasets. Now, rather than performing a linear search through the selection indices to find the indices in range for each sub-matrix, we do a binary search, which eliminates the performance issue.
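For illustration, here is a minimal sketch (hypothetical variable names, not the PR's actual code) of how `findInterval` can map sorted global selection indices to per-sub-matrix local indices in a single vectorized call:

```r
# Hypothetical sketch, not the PR's actual code: map sorted global row
# indices onto the sub-matrices of an rbind-style concatenation.
rows      <- c(3L, 4L, 2L)            # rows in each sub-matrix
selection <- c(1L, 4L, 5L, 8L, 9L)    # sorted global indices to keep
bounds    <- cumsum(c(0L, rows))      # 0 3 7 9

# findInterval does its binary search at C level, one call for all indices:
owner <- findInterval(selection - 1L, bounds)   # sub-matrix owning each index
local <- selection - bounds[owner]              # local index within that owner
split(local, owner)                             # per-sub-matrix local selections
```

Because `findInterval` is vectorized, the cost per selection index is one binary search over the sub-matrix boundaries rather than a linear scan.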
The new `local_i` and `local_j` search is really clever. It was definitely faster on my end, and only needed some minor copy/paste changes.
As a small thought exercise, though, I think the main problem was running `local_i <- i$subset[i$subset >= row_start & i$subset <= row_end] - last_row` repeatedly for each concatenated matrix. We can notice that `split_Selection_index()` already sorts the indices, so this could have been replaced with a two-pointer approach over the cumulative sums against `i$subset`, which would only require one pass through `i$subset`.
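The two-pointer idea above could look something like this (a hypothetical sketch with assumed variable names, exploiting that the indices are already sorted):

```r
# Hypothetical two-pointer sketch: one pass over a sorted subset vector,
# advancing a cursor against cumulative row boundaries.
rows   <- c(3L, 4L, 2L)
subset <- c(1L, 4L, 5L, 8L, 9L)        # sorted global indices
ends   <- cumsum(rows)                 # 3 7 9: last global row of each sub-matrix
starts <- c(1L, head(ends, -1L) + 1L)  # 1 4 8: first global row of each sub-matrix

ptr   <- 1L
local <- vector("list", length(rows))
for (m in seq_along(rows)) {
  first <- ptr
  # advance the cursor past every index belonging to sub-matrix m
  while (ptr <= length(subset) && subset[ptr] <= ends[m]) ptr <- ptr + 1L
  # positions first..(ptr - 1) fall inside sub-matrix m; shift to local indices
  local[[m]] <- subset[seq_len(ptr - first) + (first - 1L)] - starts[m] + 1L
}
```

Since the cursor only ever moves forward, the whole loop touches each element of `subset` once.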
Reading through the `findInterval` documentation, though: since the first argument is sorted by definition, it can run in O(length(cumsum(c(0, rows)))) even though a binary search is used. Overall, the differences in speed shouldn't be substantial!
Co-authored-by: Immanuel Abdi <56730419+immanuelazn@users.noreply.github.com>
Force-pushed from 4ad367b to 2fdecde.
The `findInterval` logic is definitely a little trickier to follow than I'd like. If we were in C++ I'd definitely go for the two-pointer approach, but I'm not sure R has a good way to scan arrays like that, since R for loops are pretty suspect from a performance perspective. Thanks for spotting those copy-paste editing issues!
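On that point about R loops: a quick equivalence check (hypothetical sizes, not from the PR) shows an interpreted two-pointer loop computes the same owners as a single vectorized `findInterval` call, which does its binary search in C and is typically the faster option in R despite the extra log factor:

```r
# Hypothetical comparison: an interpreted two-pointer scan versus one
# vectorized findInterval call; both assign each sorted value to its interval.
loop_owner <- function(x, breaks) {   # x must be sorted ascending
  out <- integer(length(x)); j <- 1L
  for (k in seq_along(x)) {
    # advance the interval cursor until breaks[j] <= x[k] < breaks[j + 1]
    while (j < length(breaks) && breaks[j + 1L] <= x[k]) j <- j + 1L
    out[k] <- j
  }
  out
}

set.seed(1)
x      <- sort(sample.int(1e6L, 2e5L))
breaks <- as.integer(seq(0L, 1e6L, by = 1000L))
stopifnot(identical(loop_owner(x, breaks), findInterval(x, breaks)))
```

The loop does a single pass, but each iteration pays R's interpreter overhead, whereas `findInterval` amortizes everything into one C call.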