readMgfData speed-up #319

Closed
veitveit opened this issue Mar 13, 2018 · 7 comments

Comments

@veitveit

It looks like most of the time is spent on initialization of the Spectrum2 objects (see screenshot below).

As a dirty workaround, I replaced (in extractMgfSpectrum2Info):

sp <- new("Spectrum2",
          rt = unname(desc["RTINSECONDS"]),
          scanIndex = unname(as.integer(desc["SCANS"])),
          precursorMz = unname(desc["PEPMASSMZ"]),
          precursorIntensity = unname(desc["PEPMASSINT"]),
          precursorCharge = unname(as.integer(desc["CHARGE"])),
          mz = ms[, 1L],
          intensity = ms[, 2L],
          fromFile = 1L,
          centroided = centroided)

by

sp <- dummy_sp
sp@rt <- unname(desc["RTINSECONDS"])
sp@scanIndex <- unname(as.integer(desc["SCANS"]))
sp@precursorMz <- unname(desc["PEPMASSMZ"])
sp@precursorIntensity <- unname(desc["PEPMASSINT"])
sp@precursorCharge <- unname(as.integer(desc["CHARGE"]))
sp@mz <- ms[, 1L]
sp@intensity <- ms[, 2L]
sp@fromFile <- 1L
sp@centroided <- centroided
sp@peaksCount <- nrow(ms)

where dummy_sp was defined as an empty global Spectrum2 object.

As a result, reading an mgf file got much faster (about 50% less time spent in total, and much less when iterating through the spectra). Things might be missing, though.
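
For readers who want to try the comparison without MSnbase, here is a minimal sketch of the two construction paths. The class ToyPeaks, the helper names and the loop size are assumptions made up for illustration; this is not the Spectrum2 definition, and the timings will differ from those of the real class.

```r
library(methods)

# Hypothetical toy class; stands in for Spectrum2, not the MSnbase definition.
setClass("ToyPeaks",
         representation(mz = "numeric", intensity = "numeric",
                        peaksCount = "integer"))

# Small peak matrix in the same layout as `ms` above (m/z, intensity).
ms <- cbind(mz = c(100.1, 200.2, 300.3), intensity = c(10, 20, 30))
n <- 2000L

# Path 1: a full new() call for every object.
via_new <- function() {
    new("ToyPeaks", mz = ms[, 1L], intensity = ms[, 2L],
        peaksCount = nrow(ms))
}

# Path 2: copy a pre-built empty object and assign the slots directly.
template <- new("ToyPeaks")
via_template <- function() {
    sp <- template
    sp@mz <- ms[, 1L]
    sp@intensity <- ms[, 2L]
    sp@peaksCount <- nrow(ms)
    sp
}

# Elapsed time for building n objects with each path.
t_new <- system.time(for (i in seq_len(n)) via_new())[["elapsed"]]
t_template <- system.time(for (i in seq_len(n)) via_template())[["elapsed"]]

# Both paths should produce the same object.
same <- identical(via_new(), via_template())
```

Comparing t_new and t_template gives a rough idea of the overhead of new() on a given machine; the toy class has no validity function, so the gap for Spectrum2 may well be larger.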

image

@jorainer
Collaborator

Alternatively, one could use the MSnbase:::Spectrum2_mz_sorted function. This creates the Spectrum2 object in C. For multiple Spectrum objects, Spectra2_mz_sorted could be used instead.

@lgatto
Owner

lgatto commented Mar 13, 2018

Yes, initialising S4 objects is costly. The problem with your approach is that the object doesn't get validated. I am not keen on mgf data; my general advice here would be to read the data from the mzML file with readMSData(..., mode = "onDisk"). This will be even faster as the data won't be loaded unless it is needed.

I'll still look into it, and check that the spectra aren't checked twice, once upon initialisation, as you show above, and again when the MSnExp validity is checked.
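
The validation concern can be illustrated with a small sketch. The class ToySpectrum, its validity rule and the helper fill_slots are hypothetical and only stand in for Spectrum2; the point is that direct slot assignment does not run the validity function, while an explicit validObject() call does.

```r
library(methods)

# Hypothetical toy class standing in for Spectrum2 (not the MSnbase definition).
setClass("ToySpectrum",
         representation(mz = "numeric", intensity = "numeric"),
         validity = function(object) {
             if (length(object@mz) != length(object@intensity))
                 return("mz and intensity must have the same length")
             TRUE
         })

template <- new("ToySpectrum")   # empty object, created once

# Fast path: copy the template and fill slots directly; no validity check runs.
fill_slots <- function(mz, intensity) {
    sp <- template
    sp@mz <- mz
    sp@intensity <- intensity
    sp
}

ok  <- fill_slots(c(100, 200), c(10, 20))
bad <- fill_slots(c(100, 200), 10)   # inconsistent, but accepted silently

# With test = TRUE, validObject() returns TRUE or a character vector of
# problems instead of raising an error.
ok_valid  <- isTRUE(validObject(ok, test = TRUE))
bad_valid <- isTRUE(validObject(bad, test = TRUE))
```

One possible compromise, under these assumptions, is to fill slots directly and validate once at the end (or once per container) rather than once per new() call.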

@veitveit
Author

I would love to use mzML instead of mgf. The problem is that my identification pipeline is using mgf as input and then needs the spectrum titles, which are not available through readMSData, at least not to my knowledge.

And yes, it could well be that the validity is checked multiple times.

@lgatto
Owner

lgatto commented Mar 13, 2018

I will have a fix later today.

@veitveit
Author

Thanks :-)

@lgatto lgatto closed this as completed in e07da71 Mar 13, 2018
@lgatto
Owner

lgatto commented Mar 13, 2018

Let me know if you see the desired improvement.

By the way, it should be possible to provide a way to add a custom spectrum title to the mgf file. This would allow you to use the mzML file as main working data and export parts of it to mgf with a title of your choice.

@veitveit
Author

Using MSnbase:::Spectrum2_mz_sorted had the desired effect.

ProteoWizard seems to allow specifying the title in the mgf files and we will look into basing the analysis on mzML in the future.
