[MSData::Spectrum::getMZIntensityPairs()] Sizes do not match. #170
It can't be the files, since I can read them without problems using |
Where does the |
The I'll try to do some more tests tomorrow, also with other files to exclude that it's the files. |
Some updates: |
This is consistent with the |
Actually, the error comes from ProteoWizard: mzR::peaks -> RcppWiz::getPeakList(x), which loads the spectrum and extracts the m/z-intensity pairs with getMZIntensityPairs. The error is thrown by this function if the sizes of the m/z and intensity arrays don't match. That's what I understand from the error. |
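To make the kind of check described here concrete, a minimal base R sketch follows. It is not ProteoWizard's code; `pair_mz_intensity` is a hypothetical helper that only mimics pairing two arrays and failing when their lengths differ.

```r
## Hypothetical helper (not part of mzR or ProteoWizard): pair m/z and
## intensity values, failing when the two arrays differ in length.
pair_mz_intensity <- function(mz, intensity) {
    if (length(mz) != length(intensity))
        stop("Sizes do not match: ", length(mz), " m/z values vs ",
             length(intensity), " intensities")
    cbind(mz = mz, intensity = intensity)
}

## Matching lengths give a two-column matrix.
ok <- pair_mz_intensity(c(100.1, 200.2), c(10, 20))

## Mismatched lengths raise an error, captured here as a message.
bad <- tryCatch(pair_mz_intensity(c(100.1, 200.2), 10),
                error = function(e) conditionMessage(e))
```

If the real error is triggered by a similar length check, then one of the two arrays must have been read incompletely or from the wrong spectrum, which would fit the random nature of the failures reported in this thread.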
Now that's getting strange. The same files but differently converted to mzML (using vendor settings and without zlib compression of the binary data) and the error doesn't happen again. |
Even more strange: when I process the files all in one go I don't get the error, when I save the Need to get to a reproducible example using test files from |
Somehow, on-disk objects shouldn't be saved/loaded, although, in theory, it should work if the raw files haven't been modified/moved. I just tried with a single file, and, indeed, it works. It is really puzzling. Could it be that the saving/loading error is only a red herring, and the problem lies somewhere else, deeper (for example your commit 9898ece9a70764fb7b748bca024877c0cea44623)? |
Yes, I think it was only by chance that it worked and then failed again. So, saving/loading might not be it. The only thing I know so far is that I get a segfault if I use the |
GOT IT! I'll do some more tests and push the changes once fixed. |
Closing issue as it seems to be fixed for good. @lgatto could you eventually bump version and push to svn? |
Done. Version 2.1.2 on hedgehog
I still have to extract your #170 commit and push to release 3.4. |
Done too. |
o Ensure that header information is read too if spectra data is loaded for OnDiskMSnExp objects. From: jotsetung <johannes.rainer@gmail.com> git-svn-id: https://hedgehog.fhcrc.org/bioconductor/trunk/madman/Rpacks/MSnbase@124451 bc3139a8-67e5-0310-9ffc-ced21a209358
* master: update news Fix issue #170 Add spectrapply method and backend option Fix unit test error due to recent changes Add bpi method (issue #168) set filename only when input is a character Update readMSnSet2 to save filename Cite Lazar 2016 in vignette imputation section add imputatation paper to bib update news and description fix typo in impute man page new github devel version From: Laurent <lg390@cam.ac.uk> git-svn-id: https://hedgehog.fhcrc.org/bioconductor/trunk/madman/Rpacks/MSnbase@124452 bc3139a8-67e5-0310-9ffc-ced21a209358
Digging deeper into the |
Updates related to |
After extensive tests and evaluation of multiple approaches the only solution to this issue seems to be the original solution, i.e. to call |
Thanks! |
After some tests (many more to come), the issue reported here seems to occur only on macOS and there also only on one specific set of mzML files. So, if all further tests run smoothly, my suggestion would be to make the fix an option, but to disable it by default (more explanations later). Below are some benchmark tests for just reading data using
library(mzR)
library(msdata)
library(microbenchmark)
## Define the functions to compare.
only_peaks <- function(x) {
fh <- mzR::openMSfile(x)
pks <- mzR::peaks(fh)
mzR::close(fh)
}
peaks_with_all_headers <- function(x) {
fh <- mzR::openMSfile(x)
hdr <- mzR::header(fh)
pks <- mzR::peaks(fh)
mzR::close(fh)
}
peaks_with_last_header <- function(x) {
fh <- mzR::openMSfile(x)
hdr <- mzR::header(fh, length(fh))
pks <- mzR::peaks(fh)
mzR::close(fh)
}
## mzML
fl <- system.file("microtofq/MM14.mzML", package = "msdata")
microbenchmark(only_peaks(fl), peaks_with_all_headers(fl),
peaks_with_last_header(fl), times = 10)
Unit: milliseconds
expr min lq mean median uq
only_peaks(fl) 44.89906 45.89676 47.75040 47.15564 49.25066
peaks_with_all_headers(fl) 71.15074 73.36380 80.23435 74.95574 80.91604
peaks_with_last_header(fl) 66.75709 67.77629 80.98064 69.63443 74.46741
max neval cld
51.4870 10 a
106.6319 10 b
167.8683 10 b
Not unexpectedly, the call without Next on a gzipped mzML file:
## gzipped mzML
fl <- system.file("proteomics/TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML.gz", package = "msdata")
microbenchmark(only_peaks(fl), peaks_with_all_headers(fl),
peaks_with_last_header(fl), times = 10)
Unit: seconds
expr min lq mean median uq
only_peaks(fl) 13.39147 13.52713 13.64382 13.56836 13.80904
peaks_with_all_headers(fl) 27.62570 27.78864 28.13496 27.99689 28.56590
peaks_with_last_header(fl) 15.50221 15.67585 16.01584 15.87251 16.03055
max neval cld
14.14337 10 a
29.14699 10 c
17.84753 10 b
Now that's considerably slower. Reading the header information from all spectra has really poor performance, while reading the last header is better. Next we evaluate an mzXML file:
fl <- system.file("lockmass/LockMass_test.mzXML", package = "msdata")
microbenchmark(only_peaks(fl), peaks_with_all_headers(fl),
peaks_with_last_header(fl), times = 10)
Unit: milliseconds
expr min lq mean median uq
only_peaks(fl) 67.81239 68.1742 70.10934 68.55077 72.94026
peaks_with_all_headers(fl) 122.98311 126.4965 129.60679 127.83370 131.70625
peaks_with_last_header(fl) 100.18529 101.1152 104.02154 102.28500 108.03445
max neval cld
75.26939 10 a
139.63150 10 c
111.65219 10 b
Similar to the mzML file, reading just the data is fastest, data + last header second and data + all header is about twice as slow.
## At last with the same file but gzipped...
fl <- "/Users/jo/data/2017/mzXML/1405_blk1.mzXML.gz"
microbenchmark(only_peaks(fl), peaks_with_all_headers(fl),
peaks_with_last_header(fl), times = 10)
Unit: seconds
expr min lq mean median uq
only_peaks(fl) 15.58146 15.76791 15.97458 15.96889 16.06633
peaks_with_all_headers(fl) 30.09662 30.22185 30.64439 30.48337 30.82167
peaks_with_last_header(fl) 28.57678 28.68009 29.43082 28.90038 29.33533
max neval cld
16.65828 10 a
31.81355 10 c
33.78522 10 b
Also here, reading just the data using
Summarizing:
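The four benchmarks use different units (milliseconds and seconds), so one way to compare them is to divide each median by the peaks-only median of the same file. The sketch below does this in base R with the medians copied from the output above; the row and column labels are shorthand chosen here for illustration.

```r
## Median timings copied from the microbenchmark output above
## (mzML and mzXML in milliseconds, gzipped files in seconds).
medians <- matrix(
    c(47.15564,  74.95574,  69.63443,
      13.56836,  27.99689,  15.87251,
      68.55077, 127.83370, 102.28500,
      15.96889,  30.48337,  28.90038),
    ncol = 3, byrow = TRUE,
    dimnames = list(c("mzML", "mzML_gz", "mzXML", "mzXML_gz"),
                    c("only_peaks", "all_headers", "last_header")))

## Divide each row by its peaks-only median so that the units cancel out.
slowdown <- medians / medians[, "only_peaks"]
print(round(slowdown, 2))
```

Each row then reads as "times slower than reading only the peaks" for that file type.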
|
The first runs for my torture tests are ready:
library(mzR)
SN <- "/Users/jo/data/2016/2016-11/NoSN/"
## SN <- "/Users/jo/data/2017/2017_02/"
## SN <- "/Users/jo/data/2016/2016_06/"
## SN <- "/Users/jo/data/2017/nalden01/"
fl <- dir(SN, full.names = TRUE)
torture_test <- function(files, FUN, iterations = 10) {
    for (i in seq_len(iterations)) {
        cat("\nIteration", i, "of", iterations, "\n\n")
        ## Iterate over the `files` argument, not the global `fl`.
        for (j in seq_along(files)) {
            if (j %% 20 == 0)
                cat(j, "files processed\n")
            FUN(files[j])
        }
    }
}
fail_fun <- function(x) {
fh <- mzR::openMSfile(x)
pks <- mzR::peaks(fh)
mzR::close(fh)
}
torture_test(fl, FUN = fail_fun)
In brief, the test opens each file, extracts the data from each spectrum in the file using
A note on the mzML files from our lab: they are converted from ABI wiff format to mzML using proteowizard on Windows 7. |
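Because the error is reported to occur randomly and more often on certain files, it can help to count failures per file instead of stopping at the first error. The following is a rough base R sketch of such a variant; `tally_failures` and `fake_reader` are hypothetical names introduced here, and the stand-in reader only simulates a failure so that the example runs without any data files.

```r
## Run FUN on every file for several iterations and count, per file, how
## often it failed, instead of stopping at the first error.
tally_failures <- function(files, FUN, iterations = 10) {
    failures <- setNames(integer(length(files)), files)
    for (i in seq_len(iterations)) {
        for (f in files) {
            ok <- tryCatch({ FUN(f); TRUE }, error = function(e) FALSE)
            if (!ok)
                failures[f] <- failures[f] + 1L
        }
    }
    failures
}

## Stand-in reader (an assumption for illustration only): fails whenever
## the file name contains "bad". With real data one would pass fail_fun.
fake_reader <- function(x) {
    if (grepl("bad", x))
        stop("[MSData::Spectrum::getMZIntensityPairs()] Sizes do not match.")
    invisible(NULL)
}

res <- tally_failures(c("a.mzML", "bad.mzML"), fake_reader, iterations = 3)
```

The returned named integer vector shows at a glance whether failures cluster on particular files or are spread evenly.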
Results for macOS:
As described above, the error occurs randomly, although more frequently on certain files - but not always.
sessionInfo:
> sessionInfo()
R version 3.4.1 (2017-06-30)
Platform: x86_64-apple-darwin16.7.0/x86_64 (64-bit)
Running under: macOS Sierra 10.12.6
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libLAPACK.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] mzR_2.11.5 Rcpp_0.12.12
loaded via a namespace (and not attached):
[1] compiler_3.4.1 ProtGenerics_1.9.0 parallel_3.4.1
[4] Biobase_2.37.2 codetools_0.2-15 BiocGenerics_0.23.0 |
- Add an option fastLoad to disable the additional mzR::header call executed before each mzR::peaks call to fetch data on-demand for OnDiskMSnExp objects. This partially reverts the fix for issue #170 as this seems to be macOS and file specific.
- Add related unit tests and documentation.
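The behaviour described in this commit message can be sketched as follows. This is not the MSnbase implementation: `read_peaks` is a hypothetical function, and `backend` is a list of stand-in functions playing the roles of mzR::openMSfile, mzR::header, mzR::peaks and mzR::close, so that the example runs without mzR or any data file.

```r
## Hypothetical sketch, not the MSnbase implementation: `backend` is a
## list of functions standing in for mzR::openMSfile, mzR::header,
## mzR::peaks and mzR::close.
read_peaks <- function(file, backend, fastLoad = TRUE) {
    fh <- backend$open(file)
    on.exit(backend$close(fh))
    if (!fastLoad)
        backend$header(fh)   # extra header read before fetching the peaks
    backend$peaks(fh)
}

## Mock backend that only records the order of the calls.
calls <- character()
mock <- list(
    open   = function(f)  { calls <<- c(calls, "open");   f },
    header = function(fh) { calls <<- c(calls, "header"); NULL },
    peaks  = function(fh) { calls <<- c(calls, "peaks");  matrix(0, 1, 2) },
    close  = function(fh) { calls <<- c(calls, "close");  invisible(NULL) }
)

read_peaks("x.mzML", mock, fastLoad = FALSE)
slow_calls <- calls

calls <- character()
read_peaks("x.mzML", mock, fastLoad = TRUE)
fast_calls <- calls
```

With `fastLoad = FALSE` the header is read before the peaks; with `fastLoad = TRUE` that step is skipped, which is the faster path in the benchmarks earlier in this thread.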
Results for Linux:
Apparently, on Linux there is no problem using just
sessionInfo:
> sessionInfo()
R version 3.4.1 (2017-06-30)
Platform: x86_64-pc-linux-gnu/x86_64 (64-bit)
Running under: Linux Mint 18.1
Matrix products: default
BLAS: /home/jo/R/2017-07/R-3.4.1-BioC3.6-devel/lib/R/lib/x86_64/libRblas.so
LAPACK: /home/jo/R/2017-07/R-3.4.1-BioC3.6-devel/lib/R/lib/x86_64/libRlapack.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=it_IT.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=it_IT.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=it_IT.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] mzR_2.11.5 Rcpp_0.12.12
loaded via a namespace (and not attached):
[1] compiler_3.4.1 ProtGenerics_1.9.0 parallel_3.4.1
[4] Biobase_2.37.2 codetools_0.2-15 BiocGenerics_0.23.0
|
Results for Windows:
Also on Windows using
sessionInfo:
> sessionInfo()
R version 3.4.1 (2017-06-30)
Platform: i386-w64-mingw32/i386 (32-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
Matrix products: default
locale:
[1] LC_COLLATE=German_Austria.1252 LC_CTYPE=German_Austria.1252
[3] LC_MONETARY=German_Austria.1252 LC_NUMERIC=C
[5] LC_TIME=German_Austria.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] mzR_2.11.4 Rcpp_0.12.11
loaded via a namespace (and not attached):
[1] compiler_3.4.1 ProtGenerics_1.9.0 parallel_3.4.1
[4] Biobase_2.37.2 codetools_0.2-15 BiocGenerics_0.23.0 |
Conclusion from these tests:
I will add this (and in addition remove the additional |
Next I'm running torture tests using
library(MSnbase)
torturing <- function(x) {
tmp <- readMSData2(x, msLevel. = 1)
register(SerialParam())
for (i in 1:10) {
cat("--- ", i, " ---", "\n")
cat("first spectrapply\n")
sp <- MSnbase::spectrapply(tmp, FUN = function(z) {max(mz(z))})
rm(sp)
gc()
cat("second spectrapply\n")
sp <- MSnbase::spectrapply(tmp, FUN = function(z) {max(mz(z))})
rm(sp)
gc()
tmp <- filterRt(tmp, rt = c(5, 500))
cat("third spectrapply after filter rt\n")
sp <- MSnbase::spectrapply(tmp, FUN = function(z) {max(mz(z))})
cat("\n\n")
}
}
This function is run on the same sets of test files on macOS, Linux and Windows.
|
- Update torture script to evaluate the fastLoad option (not reading header prior to read data) and the removal of the additional gc() call in spectrapply.
- Tune the functions called by spectrapply,OnDiskMSnExp.
- Automatically disable fastLoad on macOS.
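A platform-dependent default like the one mentioned in the last item could look roughly like this in base R. `default_fast_load` is a hypothetical helper, not a function from MSnbase; the only fact relied on is that `Sys.info()[["sysname"]]` is "Darwin" on macOS.

```r
## Hypothetical helper: choose a default for a fastLoad-like option from
## the operating system name. "Darwin" is what Sys.info() reports on macOS.
default_fast_load <- function(sysname = Sys.info()[["sysname"]]) {
    !identical(sysname, "Darwin")
}

## The argument can be set explicitly to see the result per platform.
on_mac   <- default_fast_load("Darwin")
on_linux <- default_fast_load("Linux")
```

Taking the system name as an argument keeps the decision testable on any platform.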
torture test results for macOS:
Error in object@backend$getPeakList(x) :
[MSData::Spectrum::getMZIntensityPairs()] Sizes do not match.
With
So, for macOS we definitely have to use |
torture test results for Linux:
For Linux there seems to be no need to call |
Finally, torture test results for Windows:
Looks like also on Windows |
I just encountered the following error when using either chromatogram() or mz() functions:
I am analyzing .mzML files generated by msconvert of Thermo .raw files on a Windows 10 device and analyzing with R 3.5.1 running in Rstudio 1.1.456 on a Mac. Of course, I don't get the error if I run readMSData() using mode = "inMemory". I read the above thread in detail and was wondering how to apply the solution? Thanks for your help and apologies for key missing details; this is my first post in such a forum.
|
Thank you for the report @wmoldham - Do you get the error on OSX and Windows, or only OSX? |
I have only attempted this on OSX, I don't have easy access to a Windows machine (!), I can try to find one to reproduce there. |
I had the same error recently on a set of files too (on OSX). To me this happened randomly, i.e. if I called the same function a second time I did not get the error again. That made me think it might be related to garbage collection. Note also that this error is thrown by the proteowizard routines that are used in |
@jotsetung - is the |
On MacOS it should be always |
Apologies for the delayed response. After restarting Rstudio, I spent yesterday working with the data, including repeating the processing steps that yielded the error previously, and I did not encounter this error again. No problems using the readMSData() using mode "onDisk". I can confirm that |
Hi all, I've been experiencing the same issue on and off using function
I am running this on macOS, with
|
@lauzikaite, I have had much better stability utilizing the |
@lauzikaite yes, I am aware of this and it keeps happening to me too (macOS). Problem is that I have no idea how we could fix the error. To me it seems to be related to some garbage collection process (in R?) that kicks in randomly. @wmoldham thanks for your input! I also had the impression that with |
@wmoldham, thank you for the suggestion. I can confirm that use of @jotsetung, thank you. I just wanted to inquire whether I've missed something in my setup to avoid this. So far, the best "fix" for me has been the use of a |
I have also noticed this problem on mac... |
For me (on mac) it seems to be OK now. Regarding linux and Windows, I never got this error on my linux and windows test environments. This seems indeed to happen (randomly) on mac - and absolutely no idea why (the error is thrown by the proteowizard C++ code that is used by |
I encountered the same problem on Mac. |
I'm experiencing some random errors again:
I have a set of 690 mzML files, select 12 of them for further analysis, filter on retention time and get the following error when calling spectra on the OnDiskMSnExp:
At first I thought that it must be my files, but when I randomly selected 12 other files the same error occurred. So it's most likely not these specific files; also, the element index is not always 6.
Also without filtering I get errors.
And what really makes me wonder is that sometimes, especially if called repeatedly, the function works without errors.