crittersVStubes_OTU.R is extremely slow and does not run in parallel #23

Open
cbird808 opened this issue Aug 12, 2019 · 3 comments
cbird808 commented Aug 12, 2019

While I suspect there are other ways of improving the speed, the time-consuming steps can be parallelized. Note that the README will have to be updated to include the parallel package in R.

Replace apply with parApply:
apply (X = charon, MARGIN = 1, function (x) {assign (x[[1]], x[[2]], as.numeric (x[[4]]), x[[5]], as.numeric (x[[6]]) ) })

# Parallel version of apply. Caveat: if assign() works by side effect, those
# side effects happen in the workers' sessions, not the master's, so the
# result may instead need to be returned from the function and combined.
library(parallel)
cl <- makeCluster(detectCores())
parApply (cl = cl, X = charon, MARGIN = 1, function (x) {assign (x[[1]], x[[2]], as.numeric (x[[4]]), x[[5]], as.numeric (x[[6]]) ) })
stopCluster(cl)

cbird808 commented

Here's another time-consuming step:

# Use the taxonomic rank and the TAXID as coordinates to assign the scientific name
# in the appropriate field
for (i in 1:length (higherTaxa)){
  for (j in 1:length (higherTaxa[[i]])){
    if (is.na (attributes (higherTaxa[[i]][j])$names)){
      break
    }

    if (attributes (higherTaxa[[i]][j])$names != "no rank"){ # Skip "no rank"
      colIdx <- attributes (higherTaxa[[i]][j])$names
      full[i, colIdx] <- as.character (sciname (id = as.numeric (higherTaxa[[i]][j]), taxdir = TAXDIR, names = ncbi_names))
    }
  }
}

# Parallel version. Note: assigning into `full` inside the function would only
# modify a copy on the worker, so fillTax returns the values for row i and the
# master session fills `full` afterwards.
fillTax <- function (i, TAXDIR) {
  row <- list ()
  for (j in 1:length (higherTaxa[[i]])){
    if (is.na (attributes (higherTaxa[[i]][j])$names)){
      break
    }

    if (attributes (higherTaxa[[i]][j])$names != "no rank"){ # Skip "no rank"
      colIdx <- attributes (higherTaxa[[i]][j])$names
      row[[colIdx]] <- as.character (sciname (id = as.numeric (higherTaxa[[i]][j]), taxdir = TAXDIR, names = ncbi_names))
    }
  }
  row
}
cl <- makeCluster(detectCores())
clusterExport(cl, c("TAXDIR", "higherTaxa", "ncbi_names", "sciname")) # export everything the workers reference
rows <- parLapply(cl, 1:length(higherTaxa), function(x) fillTax(x, TAXDIR))
stopCluster(cl)
for (i in seq_along (rows)) {
  for (colIdx in names (rows[[i]])) {
    full[i, colIdx] <- rows[[i]][[colIdx]]
  }
}

cbird808 commented

I've started improving this. I have streamlined the processing of charon and the creation of CVT, and have started using furrr to parallelize the time-consuming tasks.
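For reference, the furrr approach can be sketched on toy data (this is not the repo's code; plan() and future_map_dbl() are furrr/future's standard entry points):

```r
# Minimal furrr sketch on toy data (not the repo's objects):
# future_map_dbl() runs the mapped function on background R sessions.
library(furrr)
plan(multisession, workers = 2)
squares <- future_map_dbl(1:4, ~ .x^2)  # 1 4 9 16
plan(sequential)                        # shut the workers down
```

Unlike parLapply, no explicit clusterExport is needed: future's globals detection ships the referenced objects to the workers automatically.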

ekrell commented Aug 14, 2019

Okay, perfect. This has been on my list for a long time. The script for counting OTUs (bin/CROP_size_fix.sh) is also nasty slow and trivially parallel. As in, the script itself could just be called in parallel on subsets of the data.
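A toy sketch of that "call the script in parallel on subsets" idea: split the input into chunks and fan them out with xargs -P. Here `wc -l` stands in for bin/CROP_size_fix.sh, whose actual interface I have not checked, and the file names are made up -- substitute the real invocation and inputs.

```shell
# Split a toy input into one-line chunks, then process each chunk in a
# separate process (up to 4 at once) via xargs -P.
mkdir -p chunks out
printf 'a\nb\nc\nd\n' > otus.txt     # toy input standing in for the OTU data
split -l 1 otus.txt chunks/part_     # one line per chunk; tune -l for real data
ls chunks/part_* | xargs -P 4 -I {} sh -c 'wc -l < {} > out/$(basename {}).count'
```

Each chunk's result lands in its own file under out/, so the per-chunk outputs can be concatenated or merged afterwards.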
