Optimize subsetting in genlight objects #48
Merged
I have created for loops that will hopefully preserve the class type of the object that went into the "[" method. The "[" method for genlight and SNPbin objects would create new objects every time they were subset. This prevents these classes from being properly inherited, because everything returned from that method will be an object of class "genlight" or "SNPbin". This breaks inheritance: any new class created from genlight or SNPbin will not be able to use callNextMethod() for the "[" method, because the class of the returned object will no longer be the one that went in.
This combined with the direct subsetting of the SNPbin object speeds up the process of subsetting by an order of magnitude. For benchmarking, microbenchmark was used with 4 data sets (x, y, z, and zz) containing 50 samples with the sizes 1e3, 1e4, 1e5 and 1e6 snps:

Subset and reconstruct (old method)

```
Unit: milliseconds
           expr        min         lq       mean     median         uq        max neval
  x[, the_loci]   12.69427   14.01981   15.37758   14.94392   16.15417   23.03208   100
  y[, the_loci]   25.41475   28.27175   30.67594   29.52679   31.66637   48.77981   100
  z[, the_loci]  187.44007  202.46392  209.83472  207.75600  216.46626  257.66354   100
 zz[, the_loci] 1774.08317 1815.80018 1854.20525 1838.57798 1882.47164 2085.26544   100
```

Subset directly:

```
Unit: milliseconds
           expr        min         lq       mean     median         uq       max neval
  x[, the_loci]   7.173006   7.653467   8.361506   8.133443   8.884091  11.70068   100
  y[, the_loci]   7.970834   8.411625   9.207977   8.938945   9.796977  13.50356   100
  z[, the_loci]  22.982876  25.815514  30.168462  27.029380  28.965536  61.29198   100
 zz[, the_loci] 184.400632 216.074103 221.699854 222.185526 226.223172 380.18579   100
```
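To illustrate the inheritance problem described in the commit message, here is a minimal S4 sketch. The classes `Base` and `Child` and the generics `subset_rebuild` and `subset_slots` are hypothetical stand-ins, not adegenet code; they only contrast a method that builds a fresh parent-class object with one that modifies the slots of the object it was given.

```r
library(methods)

# Hypothetical classes standing in for a parent class and a class derived from it
setClass("Base", representation(values = "numeric"))
setClass("Child", contains = "Base", representation(label = "character"))

# Style 1: construct a new object of the parent class on every subset
setGeneric("subset_rebuild", function(x, i) standardGeneric("subset_rebuild"))
setMethod("subset_rebuild", "Base", function(x, i) {
  new("Base", values = x@values[i])  # subclass and its extra slots are lost
})

# Style 2: subset the slots of the incoming object and return it
setGeneric("subset_slots", function(x, i) standardGeneric("subset_slots"))
setMethod("subset_slots", "Base", function(x, i) {
  x@values <- x@values[i]            # class and other slots are left untouched
  x
})

obj <- new("Child", values = c(1, 2, 3), label = "demo")
rebuilt <- subset_rebuild(obj, 1:2)
kept    <- subset_slots(obj, 1:2)

is(rebuilt, "Child")  # FALSE: the inherited method returned a plain "Base"
is(kept, "Child")     # TRUE: the inherited method preserved the subclass
```

With the second style, a subclass method can call the inherited method and still get back an object of its own class.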
thibautjombart added a commit that referenced this pull request Apr 20, 2015: Optimize subsetting in genlight objects
I think this is properly awesome. Well done and big thanks! =D
Subsetting genlight objects had a bottleneck when subsetting SNPbin objects. For objects with 1 million snps, it would take about a second to subset 10 snps from the object (and ~5 seconds to subset 999,999 snps). For bootstrapping or sliding-window analyses, this would be painfully slow.
I found that this was due to the fact that the SNPbin object was being rebuilt for every subset. I added an internal function that will subset the raw vector by converting it to bits, subsetting them, and then using `packBits()` to pack them all into a raw vector. I also removed the creation of a new genlight object in favor of simply subsetting the slots. This way the object can be inherited properly.
The speedup is an order of magnitude.
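The bit-level subsetting idea can be sketched in base R as follows. This is not the internal function added in this pull request: `subset_packed_bits` and its `n_bits` argument are hypothetical, and the sketch assumes a single raw vector holding one bit per SNP, which is simpler than the actual SNPbin layout.

```r
# Hypothetical helper illustrating the approach (not adegenet's implementation).
# x      : raw vector with one bit per SNP, least significant bit first
# i      : positions (or logical mask) of the bits to keep
# n_bits : number of meaningful bits in x (the last byte may carry padding)
subset_packed_bits <- function(x, i, n_bits = length(x) * 8L) {
  bits <- rawToBits(x)[seq_len(n_bits)]  # expand each byte into eight 00/01 values
  kept <- bits[i]                        # ordinary vector subsetting on the bits
  # packBits(type = "raw") needs a multiple of 8 bits, so pad with zero bits
  pad <- (8L - length(kept) %% 8L) %% 8L
  packBits(c(kept, raw(pad)), type = "raw")
}

# Ten SNPs stored in two bytes (the last six bits are padding)
snp_bits <- as.raw(c(1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0))
x <- packBits(snp_bits, type = "raw")

# Keep SNPs 1, 2, 3 and 9 without rebuilding anything from genotype values
y <- subset_packed_bits(x, c(1, 2, 3, 9), n_bits = 10L)
as.integer(rawToBits(y)[1:4])  # bits of the four retained SNPs: 1 0 1 1
```

The point of the design is that the packed representation is only unpacked to bits and repacked, rather than being decoded to genotypes and re-encoded for every subset.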
For benchmarking, microbenchmark was used with 4 data sets each containing 50 samples with 1% missing data (based off of the example in the genlight documentation):
50 random loci were used for subsetting:
Subset and reconstruct (old method):
Subset directly: