readMgfData speed-up #319

veitveit · 2018-03-13T13:23:39Z

It looks like most of the time is spent on initializing the Spectrum2 objects (see screenshot below).

As a dirty workaround, I replaced the following (in extractMgfSpectrum2Info):

sp <- new("Spectrum2",
          rt = unname(desc["RTINSECONDS"]),
          scanIndex = unname(as.integer(desc["SCANS"])),
          precursorMz = unname(desc["PEPMASSMZ"]),
          precursorIntensity = unname(desc["PEPMASSINT"]),
          precursorCharge = unname(as.integer(desc["CHARGE"])),
          mz = ms[, 1L],
          intensity = ms[, 2L],
          fromFile = 1L,
          centroided = centroided)

by

sp <- dummy_sp

sp@rt <- unname(desc["RTINSECONDS"])
sp@scanIndex <- unname(as.integer(desc["SCANS"]))
sp@precursorMz <- unname(desc["PEPMASSMZ"])
sp@precursorIntensity <- unname(desc["PEPMASSINT"])
sp@precursorCharge <- unname(as.integer(desc["CHARGE"]))
sp@mz <- ms[, 1L]
sp@intensity <- ms[, 2L]
sp@fromFile <- 1L
sp@centroided <- centroided
sp@peaksCount <- nrow(ms)

where dummy_sp is defined as an empty, global Spectrum2 object.

As a result, reading an mgf file got much faster (50% less time spent in total, and much less when iterating through the spectra). Things might be missing, though.
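
To make the comparison concrete without depending on MSnbase, here is a minimal sketch using a toy S4 class. `ToySpectrum`, `via_new` and `via_template` are hypothetical names made up for illustration; the class is only a stand-in and does not reproduce the real Spectrum2 definition or its initialize method, so the timings are indicative at best.

```r
library(methods)

# Hypothetical stand-in for Spectrum2; not the MSnbase class definition.
setClass("ToySpectrum",
         representation(rt = "numeric", mz = "numeric",
                        intensity = "numeric", peaksCount = "integer"),
         validity = function(object) {
             if (length(object@mz) != length(object@intensity))
                 return("mz and intensity must have the same length")
             TRUE
         })

mz <- c(100.1, 200.2, 300.3)
int <- c(10, 20, 30)

# Variant 1: full construction through new(), which runs initialize()
# and the validity function on every call.
via_new <- function() {
    new("ToySpectrum", rt = 12.5, mz = mz, intensity = int,
        peaksCount = length(mz))
}

# Variant 2: copy a pre-built empty template and fill its slots directly.
template <- new("ToySpectrum")
via_template <- function() {
    sp <- template
    sp@rt <- 12.5
    sp@mz <- mz
    sp@intensity <- int
    sp@peaksCount <- length(mz)
    sp
}

# Rough timing of both variants; absolute numbers depend on the machine.
n <- 10000L
t_new <- system.time(for (i in seq_len(n)) via_new())[["elapsed"]]
t_template <- system.time(for (i in seq_len(n)) via_template())[["elapsed"]]
```

Comparing `t_new` and `t_template` gives a feel for the overhead of `new()` relative to slot assignment on a template. The gap for the real Spectrum2 class may differ, since its constructor does more work than this toy class.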


jorainer · 2018-03-13T13:32:05Z

Alternatively one could use the MSnbase:::Spectrum2_mz_sorted function. This creates the Spectrum2 object in C. For multiple Spectrum objects Spectra2_mz_sorted could be used instead.

lgatto · 2018-03-13T13:32:17Z

Yes, initialising S4 objects is costly. The problem with your approach is that the object doesn't get validated. I am not keen on mgf data; my general advice here would be to read the data from the mzML file with readMSData(..., mode = "onDisk"). This will be even faster as the data won't be loaded unless it is needed.

I'll still look into it, and check that the spectra aren't checked twice: once upon initialisation, as you show above, and again when the validity of the MSnExp is checked.
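
The validation concern can be illustrated with a small sketch on a toy S4 class. `ToyPeaks` and `is_valid` are hypothetical names and the validity rule is made up for illustration; it is not the check that MSnbase performs on Spectrum2 objects.

```r
library(methods)

# Hypothetical toy class; the validity rule is illustrative only.
setClass("ToyPeaks",
         representation(mz = "numeric", intensity = "numeric"),
         validity = function(object) {
             if (length(object@mz) != length(object@intensity))
                 return("mz and intensity must have the same length")
             TRUE
         })

# Helper that reports whether an object passes its validity function.
is_valid <- function(x) {
    tryCatch({ validObject(x); TRUE }, error = function(e) FALSE)
}

sp <- new("ToyPeaks")
# Direct slot assignment does not run the validity function,
# so an inconsistent object can be created without any error.
sp@mz <- c(100.1, 200.2, 300.3)
sp@intensity <- c(10, 20)

# An explicit validObject() call is what catches the mismatch.
broken_passes <- is_valid(sp)

sp@intensity <- c(10, 20, 30)
fixed_passes <- is_valid(sp)

# new() with the same inconsistent data fails at construction time.
new_fails <- inherits(
    try(new("ToyPeaks", mz = c(1, 2, 3), intensity = c(1, 2)), silent = TRUE),
    "try-error")
```

One way to keep the speed of slot assignment while still catching such problems would be to validate once, for example on the final collection of spectra, rather than in every constructor call.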

veitveit · 2018-03-13T14:00:25Z

I would love to use mzML instead of mgf. The problem is that my identification pipeline is using mgf as input and then needs the spectrum titles, which are not available through readMSData, at least not to my knowledge.

And yes, it could well be that the validity is checked multiple times.

lgatto · 2018-03-13T14:11:40Z

I will have a fix later today.

veitveit · 2018-03-13T14:31:28Z

Thanks :-)

lgatto · 2018-03-13T20:32:07Z

Let me know if you see the desired improvement.

By the way, it should be possible to provide a way to add a custom spectrum title to the mgf file. This would allow you to use the mzML file as main working data and export parts of it to mgf with a title of your choice.

veitveit · 2018-03-14T10:37:44Z

Using MSnbase:::Spectrum2_mz_sorted had the desired effect.

ProteoWizard seems to allow specifying the title in the mgf files and we will look into basing the analysis on mzML in the future.

lgatto closed this as completed in e07da71 Mar 13, 2018
