crittersVStubes_OTU.R is extremely slow and does not run in parallel #23

Open
cbird808 opened this issue Aug 12, 2019 · 3 comments
cbird808 commented Aug 12, 2019

While I suspect there are other ways of improving the speed, the time-consuming steps can be parallelized. Note that the README will have to be updated to include the parallel package in R.

Replace apply with parApply:
apply (X = charon, MARGIN = 1, function (x) {assign (x[[1]], x[[2]], as.numeric (x[[4]]), x[[5]], as.numeric (x[[6]]) ) })

# Parallel version of apply. Caveat: if assign() works by side effect, those
# side effects happen in the workers' sessions, not the master's, so the
# result may instead need to be returned from the function and combined.
library(parallel)
cl <- makeCluster(detectCores())
parApply (cl = cl, X = charon, MARGIN = 1, function (x) {assign (x[[1]], x[[2]], as.numeric (x[[4]]), x[[5]], as.numeric (x[[6]]) ) })
stopCluster(cl)

cbird808 commented

Here's another time-consuming step:

# Use the taxonomic rank and the TAXID as coordinates to assign the scientific name
# in the appropriate field
for (i in 1:length (higherTaxa)){
  for (j in 1:length (higherTaxa[[i]])){
    if (is.na (attributes (higherTaxa[[i]][j])$names)){
      break
    }

    if (attributes (higherTaxa[[i]][j])$names != "no rank"){ # Skip "no rank"
      colIdx <- attributes (higherTaxa[[i]][j])$names
      full[i, colIdx] <- as.character (sciname (id = as.numeric (higherTaxa[[i]][j]), taxdir = TAXDIR, names = ncbi_names))
    }
  }
}

# Parallel version. Note: assigning into `full` inside the function would only
# modify a copy on the worker, so fillTax returns the values for row i and the
# master session fills `full` afterwards.
fillTax <- function (i, TAXDIR) {
  row <- list ()
  for (j in 1:length (higherTaxa[[i]])){
    if (is.na (attributes (higherTaxa[[i]][j])$names)){
      break
    }

    if (attributes (higherTaxa[[i]][j])$names != "no rank"){ # Skip "no rank"
      colIdx <- attributes (higherTaxa[[i]][j])$names
      row[[colIdx]] <- as.character (sciname (id = as.numeric (higherTaxa[[i]][j]), taxdir = TAXDIR, names = ncbi_names))
    }
  }
  row
}
cl <- makeCluster(detectCores())
clusterExport(cl, c("TAXDIR", "higherTaxa", "ncbi_names", "sciname")) # export everything the workers reference
rows <- parLapply(cl, 1:length(higherTaxa), function(x) fillTax(x, TAXDIR))
stopCluster(cl)
for (i in seq_along (rows)) {
  for (colIdx in names (rows[[i]])) {
    full[i, colIdx] <- rows[[i]][[colIdx]]
  }
}

cbird808 commented

I've started improving this. I have streamlined the processing of charon and the creation of CVT, and have started using furrr to parallelize the time-consuming tasks.
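For reference, the furrr approach can be sketched on toy data (this is not the repo's code; plan() and future_map_dbl() are furrr/future's standard entry points):

```r
# Minimal furrr sketch on toy data (not the repo's objects):
# future_map_dbl() runs the mapped function on background R sessions.
library(furrr)
plan(multisession, workers = 2)
squares <- future_map_dbl(1:4, ~ .x^2)  # 1 4 9 16
plan(sequential)                        # shut the workers down
```

Unlike parLapply, no explicit clusterExport is needed: future's globals detection ships the referenced objects to the workers automatically.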

ekrell commented Aug 14, 2019

Okay, perfect. This has been on my list for a long time. The script for counting OTUs (bin/CROP_size_fix.sh) is also nasty slow and trivially parallel. As in, the script itself could just be called in parallel on subsets of the data.
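A toy sketch of that "call the script in parallel on subsets" idea: split the input into chunks and fan them out with xargs -P. Here `wc -l` stands in for bin/CROP_size_fix.sh, whose actual interface I have not checked, and the file names are made up -- substitute the real invocation and inputs.

```shell
# Split a toy input into one-line chunks, then process each chunk in a
# separate process (up to 4 at once) via xargs -P.
mkdir -p chunks out
printf 'a\nb\nc\nd\n' > otus.txt     # toy input standing in for the OTU data
split -l 1 otus.txt chunks/part_     # one line per chunk; tune -l for real data
ls chunks/part_* | xargs -P 4 -I {} sh -c 'wc -l < {} > out/$(basename {}).count'
```

Each chunk's result lands in its own file under out/, so the per-chunk outputs can be concatenated or merged afterwards.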
