Optimize subsetting in genlight objects #48

zkamvar · 2015-04-20T17:32:09Z

Subsetting genlight objects had a bottleneck when subsetting SNPbin objects. For objects with 1 million SNPs, it would take about a second to subset 10 SNPs from the object (and ~5 seconds to subset 999,999 SNPs). That makes bootstrapping or sliding-window analyses painfully slow.

I found that this happened because the SNPbin object was being rebuilt for every subset. I added an internal function that subsets the raw vector by converting it to bits, subsetting the bits, and then using packBits() to pack them back into a raw vector.
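The bit-level round trip described above can be sketched in a few lines of base R. This is a minimal illustration under stated assumptions, not the internal function added in this pull request: the helper name `subset_packed_bits` and its arguments are hypothetical, and the sketch ignores missing data positions and ploidy, which SNPbin tracks separately.

```r
# Hypothetical helper for illustration; not adegenet's internal function.
# `packed` is a raw vector holding one binary SNP per bit, `n_loc` is the
# number of loci actually stored, and `i` is the positions to keep.
subset_packed_bits <- function(packed, n_loc, i) {
  # rawToBits() expands each byte into 8 raw values (00 or 01),
  # least-significant bit first; drop the padding bits beyond n_loc.
  bits <- rawToBits(packed)[seq_len(n_loc)]
  kept <- bits[i]
  # packBits(type = "raw") needs a multiple of 8 bits, so pad with zeros.
  pad <- (8L - length(kept) %% 8L) %% 8L
  packBits(c(kept, raw(pad)), type = "raw")
}

# Ten loci packed into two bytes (six padding bits at the end).
snps   <- c(1L, 0L, 1L, 1L, 0L, 0L, 1L, 0L, 1L, 1L)
packed <- packBits(c(snps, integer(6)), type = "raw")

# Keep loci 1, 3 and 10 without rebuilding the vector from integers.
sub <- subset_packed_bits(packed, n_loc = length(snps), i = c(1, 3, 10))
as.integer(rawToBits(sub))[1:3]
```

Because the work stays in raw vectors, nothing is converted back to integer genotypes and no new object has to be constructed for each subset.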

I also removed the creation of a new genlight object in favor of simply subsetting the slots. This way, classes that inherit from genlight keep their own class when subset.

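The speedup approaches an order of magnitude on the larger data sets: by median time, subsetting is about 8 times faster at 100,000 and 1,000,000 loci, and about 2 to 3 times faster at 1,000 and 10,000 loci.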

For benchmarking, microbenchmark was used with 4 data sets, each containing 50 samples with 1% missing data (based on the example in the genlight documentation):

data	nLoc
x	1,000
y	10,000
z	100,000
zz	1,000,000

50 random loci were used for subsetting:

> dput(the_loci)
c(68L, 366L, 773L, 609L, 196L, 180L, 420L, 675L, 125L, 138L, 
49L, 606L, 13L, 979L, 544L, 576L, 106L, 108L, 854L, 731L, 439L, 
161L, 137L, 990L, 66L, 516L, 922L, 247L, 25L, 868L, 151L, 447L, 
58L, 313L, 874L, 167L, 523L, 801L, 299L, 132L, 776L, 870L, 585L, 
110L, 240L, 381L, 114L, 403L, 97L, 197L)

Subset and reconstruct (old method):

Unit: milliseconds
           expr        min         lq       mean     median         uq        max neval
  x[, the_loci]   12.69427   14.01981   15.37758   14.94392   16.15417   23.03208   100
  y[, the_loci]   25.41475   28.27175   30.67594   29.52679   31.66637   48.77981   100
  z[, the_loci]  187.44007  202.46392  209.83472  207.75600  216.46626  257.66354   100
 zz[, the_loci] 1774.08317 1815.80018 1854.20525 1838.57798 1882.47164 2085.26544   100

Subset directly:

Unit: milliseconds
           expr        min         lq       mean     median         uq       max neval
  x[, the_loci]   7.173006   7.653467   8.361506   8.133443   8.884091  11.70068   100
  y[, the_loci]   7.970834   8.411625   9.207977   8.938945   9.796977  13.50356   100
  z[, the_loci]  22.982876  25.815514  30.168462  27.029380  28.965536  61.29198   100
 zz[, the_loci] 184.400632 216.074103 221.699854 222.185526 226.223172 380.18579   100

I have created for loops that will hopefully preserve the class of the object that went into the "[" method. The "[" methods for genlight and SNPbin objects used to create new objects every time they were subset. This prevented these objects from being properly inherited because everything returned from the method was an object of class "genlight" or "SNPbin". This breaks inheritance: a new class that extends genlight or SNPbin cannot use callNextMethod() in its own "[" method, because the object returned is no longer of the class that went in.
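The inheritance problem is easier to see with a toy example. The sketch below uses two hypothetical S4 classes, `Base` and `Child`, which stand in for genlight and a class that extends it; they are not adegenet's definitions. The `Base` method subsets the slot of the incoming object instead of calling `new("Base", ...)`, so a `Child` method can rely on callNextMethod() returning a `Child`.

```r
library(methods)

# Hypothetical toy classes for illustration; not adegenet's genlight/SNPbin.
setClass("Base", representation(values = "numeric"))
setClass("Child", contains = "Base", representation(label = "character"))

# Subset by modifying the slot of the object that came in, rather than
# building a fresh object with new("Base", ...). Whatever class `x` has
# on entry, it still has on exit.
setMethod("[", signature(x = "Base"), function(x, i, j, ..., drop = TRUE) {
  x@values <- x@values[i]
  x
})

# The subclass delegates to the parent method. Because the parent does not
# rebuild the object, the result is still a "Child" with its label intact.
setMethod("[", signature(x = "Child"), function(x, i, j, ..., drop = TRUE) {
  x <- callNextMethod()
  x
})

child <- new("Child", values = c(10, 20, 30), label = "demo")
sub   <- child[c(1, 3)]
class(sub)
```

Had the `Base` method returned `new("Base", values = x@values[i])`, the `Child` method would receive a plain `Base` object and lose its `label` slot.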

This combined with the direct subsetting of the SNPbin object speeds up the process of subsetting by an order of magnitude.

For benchmarking, microbenchmark was used with 4 data sets (x, y, z, and zz) containing 50 samples with the sizes 1e3, 1e4, 1e5 and 1e6 snps:

Subset and reconstruct (old method)

```
Unit: milliseconds
           expr        min         lq       mean     median         uq        max neval
  x[, the_loci]   12.69427   14.01981   15.37758   14.94392   16.15417   23.03208   100
  y[, the_loci]   25.41475   28.27175   30.67594   29.52679   31.66637   48.77981   100
  z[, the_loci]  187.44007  202.46392  209.83472  207.75600  216.46626  257.66354   100
 zz[, the_loci] 1774.08317 1815.80018 1854.20525 1838.57798 1882.47164 2085.26544   100
```

Subset directly:

```
Unit: milliseconds
           expr        min         lq       mean     median         uq       max neval
  x[, the_loci]   7.173006   7.653467   8.361506   8.133443   8.884091  11.70068   100
  y[, the_loci]   7.970834   8.411625   9.207977   8.938945   9.796977  13.50356   100
  z[, the_loci]  22.982876  25.815514  30.168462  27.029380  28.965536  61.29198   100
 zz[, the_loci] 184.400632 216.074103 221.699854 222.185526 226.223172 380.18579   100
```

thibautjombart · 2015-04-20T18:06:40Z

I think this is properly awesome. Well done and big thanks! =D

zkamvar added 11 commits April 19, 2015 15:15

roxygen2 noise

03df724

Attempt to create faster subsetting for SNPbin

b3bd93c

rename internal subsetting functions.

0210e5d

better handle missing

39792f8

missing data subsets correctly.

cc9279a

avoid computing vector if no NAs are kept

93d8fd8

set method to modern method.

e0dc3f8

Merge branch 'master' into genlight-standardize

0fbccf5

Merge branch 'master' into genlight-optimize

0d90468

thibautjombart added a commit that referenced this pull request Apr 20, 2015

Merge pull request #48 from thibautjombart/genlight-optimize

e9a7ae0

Optimize subsetting in genlight objects

thibautjombart merged commit e9a7ae0 into master Apr 20, 2015

thibautjombart deleted the genlight-optimize branch April 20, 2015 18:58

zkamvar mentioned this pull request Jul 31, 2015

Small speedup to snpbin #81

Merged

zkamvar mentioned this pull request Sep 22, 2024

bug in genlight filtering #363

Open
