Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add function dotproduct and unit tests #17

Closed
wants to merge 19 commits into from

Conversation

tnaake
Copy link
Collaborator

@tnaake tnaake commented Aug 29, 2019

Hi @jorainer, @sgibb, @lgatto

as discussed here (rformassspectrometry/Spectra#49) dotproduct should move to MsCoreUtils.

Copy link
Member

@sgibb sgibb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dear Thomas thanks for this PR.

Before looking into it in a more detail I have a few questions:

  1. We have already a dotproduct function in MSnbase: https://github.com/lgatto/MSnbase/blob/604440ae5dd7c86b685e404f22c874cedea19642/R/matching.R#L102-L123. It ignores the mz values. Are the mz values important? IMHO after matching the spectra (e.g. by joinPeaks in Spectra) they should be fairly similar and it should be more or less a multiplication by a constant factor, or?

  2. Are the arguments m and n really needed? Are there real use cases for changing them?

  3. Is a non-normalized similarity useful? So do we need the normalize argument at all?

Copy link
Member

@jorainer jorainer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR @tnaake !

@sgibb, regarding the normalize parameter, that was my fault. I suggested to add this parameter because I thought that the normalized dot product is different from the dotproduct that is available in MSnbase (although is not exported!).

R/dotproduct.R Outdated Show resolved Hide resolved
@tnaake
Copy link
Collaborator Author

tnaake commented Sep 2, 2019

Hi @sgibb,

Concerning your questions:

  1. Especially for small molecules I like to give also a weight on the m/z values to accommodate that shared fragments with higher m/z are less likely and will mean that molecules might be more similar.
  2. If we set m and n to 1, the two functions will return the same value. Otherwise, the proposed function with arguments m and n will add a bit more flexibility for calculating the normalized dot product.
  3. I would say it's not very useful and it is misleading when we compare non-normalized values across pairwise similarities. I can remove this.

@jorainer Sure, I can make a line break after every @....

@sgibb
Copy link
Member

sgibb commented Sep 2, 2019

  1. Especially for small molecules I like to give also a weight on the m/z values to accommodate that shared fragments with higher m/z are less likely and will mean that molecules might be more similar.

I am never worked with small molecules so I don't know whether weighting by m/z is useful.
But currently I get a larger dot product for lower m/z values.
Is the following expected and intended?

x <- data.frame(mz = c(1, NA, 100), intensity = c(1, 1, 1))
y <- data.frame(mz = c(1, 2, 101), intensity = c(1, 1, 1))
dotproduct(x, y)
# [1] 0.9999998

x <- data.frame(mz = c(101, NA, 200), intensity = c(1, 1, 1))
y <- data.frame(mz = c(101, 102, 201), intensity = c(1, 1, 1))
dotproduct(x, y)
# [1] 0.9413118
  1. If we set m and n to 1, the two functions will return the same value. Otherwise, the proposed function with arguments m and n will add a bit more flexibility for calculating the normalized dot product.

I see but is there a real use case? Otherwise I would vote for a simpler implementation and API (and remove both arguments).

  1. I would say it's not very useful and it is misleading when we compare non-normalized values across pairwise similarities. I can remove this.

👍

@tnaake
Copy link
Collaborator Author

tnaake commented Sep 6, 2019

Dear @sgibb
thanks for pointing this out.

  • Concerning 1. Given the input, this is expected (but not what we intend). As we have the exponent n=2 the differences in m/z will be intensified, especially 200^n/ 201^n and 100^n and 101^n will change the weights ws1 and ws2 tremendously (with the dotproduct function we will calculate the similarity between ws1 and ws2). If the function gets "aligned" m/z values, the similarity is calculated as follows and as expected (this means, the function should only get the same/aligned m/z values when n != 0):

    x1 <- data.frame(mz = c(101, NA, 201), intensity = c(1, 0, 1))
    y1 <- data.frame(mz = c(101, 102, 201), intensity = c(1, 1, 1))
    dotproduct(x1, y1, m=0.3, n=1) ## 0.8294594
    dotproduct(x1, y1, m=0.3, n=2) ## 0.9413171

    x2 <- data.frame(mz = c(1, NA, 201), intensity = c(1, 0, 1))
    y2 <- data.frame(mz = c(1, 102, 201), intensity = c(1, 1, 1))
    dotproduct(x2, y2, m=0.3, n=1) ## 0.795221
    dotproduct(x2, y2, m=0.3, n=2) ## 0.9378086

For two spectra with "lower" m/z value, we get a lower similarity score. If we set a higher n we will get a higher similarity score.

  • Concerning 2. There was a small error in my text before: n has to be 0 (all m/z values will be 1).

    x1 <- data.frame(mz = c(101, NA, 201), intensity = c(1, 0, 1))
    y1 <- data.frame(mz = c(101, 102, 201), intensity = c(1, 1, 1))
    x2 <- data.frame(mz = c(1, NA, 201), intensity = c(1, 0, 1))
    y2 <- data.frame(mz = c(1, 102, 201), intensity = c(1, 1, 1))
    dotproduct(x1, y1, m=0.3, n=0) ## 0.6666667
    dotproduct(x2, y2, m=0.3, n=0) ## 0.6666667

This means 2 of 3 shared peaks. I used the formula implemented in this dotproduct function (containing weights for m/z and intensities) before and it was also used in some publications. What we could implement is, that n=0 and m=1 (default), i.e. m/z values will not be used, but only intensity values as they are.

x1 <- data.frame(mz = c(101, NA, 201), intensity = c(1, 0, 1))
y1 <- data.frame(mz = c(101, 102, 201), intensity = c(1, 1, 1))
x2 <- data.frame(mz = c(101, NA, 201), intensity = c(3, 0, 5)) 
y2 <- data.frame(mz = c(101, 102, 201), intensity = c(3,4, 5))
dotproduct(x1, y1, m=1, n=0) ## 0.6666667
dotproduct(x2, y2, m=1, n=0) ## 0.68

Copy link
Member

@sgibb sgibb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tnaake Thanks for the clarification! Now I reviewed your PR and add some comments. There is a little bit of refactoring needed before we could accept this PR.

Please follow our coding style and it would be nice if you could use conventional commit messages as well.

Please add yourself as contributor to the DESCRIPTION and README.md.

R/dotproduct.R Outdated Show resolved Hide resolved
R/dotproduct.R Outdated Show resolved Hide resolved
R/dotproduct.R Outdated
#' @param n `numeric(1)`, exponent for m/z-based weights
#' @param normalize `logical` whether to calculate the DP (FALSE) or NDP (TRUE)
#' @details
#' `x` and `y` have to be spectrally aligned. Each row in `x` corresponds to the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is "spectrally aligned" a valid phrase?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed this to:

Each row in x corresponds to the respective row in y, i.e. the peaks
(entries "mz") per spectrum have to match.

@sgibb, do you think it is clearer this way?

R/dotproduct.R Outdated Show resolved Hide resolved
R/dotproduct.R Outdated
#' intensity=c(2, 0, 3, 1, 4, 0.4))
#' dotproduct(x, y, m=0.5, n=2, normalize=TRUE)
#' @export
dotproduct <- function(x, y, m=0.5, n=2, normalize=TRUE) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Several points:

  1. I am not sure that data.frame/list input is a good idea. I know the output of join is a list so it would be convenient. But what about mz1, int1, mz2, int2 or x1, y1, x2, y2? @jorainer : what do you think?
  2. You didn't check m and n for (in)valid input.
  3. Please remove the normalize argument as discussed before.
  4. Please follow our coding style: m = 0.5 (see https://rformassspectrometry.github.io/RforMassSpectrometry/articles/RforMassSpectrometry.html#coding-style)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Yes, input of data.frame/list was written with having the output of join in mind. I can change to vector input as outlined above. Maybe we can wait for @jorainer 's ideas on this.
  2. I added checks for m and n now.
  3. I removed the normalize argument.
  4. I updates for coding style.

R/dotproduct.R Outdated Show resolved Hide resolved
R/dotproduct.R Outdated Show resolved Hide resolved
R/dotproduct.R Outdated Show resolved Hide resolved
R/dotproduct.R Outdated Show resolved Hide resolved
R/dotproduct.R Outdated Show resolved Hide resolved
@codecov-io
Copy link

codecov-io commented Sep 17, 2019

Codecov Report

Merging #17 into master will decrease coverage by 0.63%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #17      +/-   ##
==========================================
- Coverage    99.2%   98.56%   -0.64%     
==========================================
  Files          15       17       +2     
  Lines         250      418     +168     
==========================================
+ Hits          248      412     +164     
- Misses          2        6       +4
Impacted Files Coverage Δ
R/dotproduct.R 100% <100%> (ø)
R/graphPeaks.R 97.29% <0%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4aabfe2...b263787. Read the comment docs.

Copy link
Member

@sgibb sgibb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dear @tnaake, thanks for the refactorisation of your PR and following my suggestions. We are nearly there. Just a few minor points left.
I will ping @jorainer to get his opinion about the arguments (two list/data.frames vs four numeric vectors).

And sorry for the slow review, I just started a new job in September and my spare time was very limited.

DESCRIPTION Outdated
@@ -15,7 +15,8 @@ Authors@R: c(person(given = "Laurent", family = "Gatto",
email = "mail@sebastiangibb.de",
role = c("aut", "cre"),
comment = c(ORCID = "0000-0001-7406-4443")),
person(given = "Sigurdur", family = "Smarason", role = "ctb")
person(given = "Sigurdur", family = "Smarason", role = "ctb"),
person(given = "Thomas", family = "Naake", role = "ctb")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
person(given = "Thomas", family = "Naake", role = "ctb")
person(given = "Thomas", family = "Naake", role = "ctb")

R/dotproduct.R Show resolved Hide resolved
R/dotproduct.R Show resolved Hide resolved
R/dotproduct.R Outdated Show resolved Hide resolved
R/dotproduct.R Outdated Show resolved Hide resolved
R/dotproduct.R Outdated Show resolved Hide resolved
R/dotproduct.R Outdated Show resolved Hide resolved
R/dotproduct.R Outdated Show resolved Hide resolved
R/dotproduct.R Outdated Show resolved Hide resolved
tests/testthat/test_dotproduct.R Outdated Show resolved Hide resolved
Copy link
Member

@jorainer jorainer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice contribution @tnaake !

On the question whether the input parameters x and y should be a data.frame or individual numeric: I would opt against the data.frame. What we get from the upstream functions (being it peaks or joinPeaks, which will match the peaks of two spectra) are a two-column matrix. So I would either use 2 matrix or 4 numeric.

Thinking a little ahead on defining a common API for peak similarity calculation functions, I think that x and y being a matrix might be better. So, parameter FUN in compareSpectra would have to be a function taking two parameters x and y of type matrix representing the (matched) peak matrices of the two spectra with m/z values in its first, and intensity values in its second column.
@tnaake @sgibb , what do you think?

Another question, since the peaks have to be mapped - do you actually need the mz1 and mz2 (I suppose they would have to be similar/identical)?

@lgatto
Copy link
Member

lgatto commented Oct 4, 2019

I also vote for 2 matrices.

In the interest of standardisation, should we define/document somewhere what the column names and or the order of the m/z and intensity in these columns? We could then have a helper function that checks if the data (easy enough to tell m/z and intensities apart) and column names?

@jorainer
Copy link
Member

jorainer commented Oct 4, 2019

You mean you want that check function in MsCoreUtils? I'm not so sure if such a check function on a matrix is not a little too much. We have all the checks already in the Spectra package, and they are not performed on the matrix but rather on the data how it is stored there (i.e. mz and intensity as a NumericList, so we don't have to check for is.numeric, we only check that mz are increasingly ordered.

I agree on having it documented (it's documented already in Spectra if I'm not mistaken - it's what as.list,Spectra returns.

@tnaake
Copy link
Collaborator Author

tnaake commented Oct 6, 2019

I changed the function that it accepts two matrices (actually this was changed in the second pull request, but I didn't know to deal with the sitatuation).

I have two questions still.

You mean you want that check function in MsCoreUtils? I'm not so sure if such a check function on a matrix is not a little too much.

@jorainer do you mean that I should remove the two !is.matrix statements in the first lines of dotproduct?

Another question, since the peaks have to be mapped - do you actually need the mz1 and mz2 (I suppose they would have to be similar/identical)?

@jorainer I think we do need this at the moment. E.g. graphPeaks will return a (list of) two matrices with a column mz. The rows of the two matrices correspond to each other and thus the mz values. However it's possible (and will happen in most cases) that the two columns with the mz are not identical. If we want to store that information properly, then in dotproduct that information has to be assigned to mz1 and mz2. What I could think of is adding an argument that specifies how dotproduct will deal with aligned m/z values (e.g. averaging the two corresponding m/z from each spectrum). Indeed, if n != 0 that would become a valid enhancement to calculate similarities properly.

Co-Authored-By: Sebastian Gibb <mail@sebastiangibb.de>
Copy link
Member

@jorainer jorainer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @tnaake ! All fine from my side.

do you mean that I should remove the two !is.matrix statements in the first lines of dotproduct?

No, all fine. This was more related to have a function that evaluates whether a numeric matrix has correct peak data (i.e. a column names "mz", and one "intensity" and that the mz values are ordered increasingly) - I would however not check that within the dotproduct function. This has to be done (and is done) upstream in Spectra.

And thanks for the clarification why we need mz1 and mz2!

Great work!

Copy link
Member

@sgibb sgibb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tnaake thanks again for this nice PR and I am sorry that the review took so long and was a little bit cumbersome.

I will merge the PR and apply the change regarding the DOI myself.

#' @references
#' Li et al. (2015): Navigating natural variation in herbivory-induced
#' secondary metabolism in coyote tobacco populations using MS/MS structural
#' analysis. PNAS, E4147--E4155, DOI: 10.1073/pnas.1503106112.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should work with

Suggested change
#' analysis. PNAS, E4147--E4155, DOI: 10.1073/pnas.1503106112.
#' analysis. PNAS, E4147--E4155, [DOI: 10.1073/pnas.1503106112](https://doi.org/10.1073/pnas.1503106112).

@sgibb
Copy link
Member

sgibb commented Oct 8, 2019

I manually rebased & merged it into the devel branch.

@sgibb sgibb closed this Oct 8, 2019
@lgatto lgatto reopened this Oct 11, 2019
@lgatto
Copy link
Member

lgatto commented Oct 11, 2019

Sorry, merged by accident :-/

@sgibb
Copy link
Member

sgibb commented Oct 11, 2019

@lgatto I already merged this into devel. So I am closing it again. But it looks like you merged #20 by accident. I will undo this as well.

@sgibb sgibb closed this Oct 11, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants