-
Notifications
You must be signed in to change notification settings - Fork 11
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add function dotproduct and unit tests #17
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dear Thomas thanks for this PR.
Before looking into it in a more detail I have a few questions:
-
We have already a
dotproduct
function inMSnbase
: https://github.com/lgatto/MSnbase/blob/604440ae5dd7c86b685e404f22c874cedea19642/R/matching.R#L102-L123. It ignores the mz values. Are the mz values important? IMHO after matching the spectra (e.g. byjoinPeaks
inSpectra
) they should be fairly similar and it should be more or less a multiplication by a constant factor, or? -
Are the arguments
m
andn
really needed? Are there real use cases for changing them? -
Is a non-normalized similarity useful? So do we need the
normalize
argument at all?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @sgibb, Concerning your questions:
@jorainer Sure, I can make a line break after every |
I am never worked with small molecules so I don't know whether weighting by m/z is useful. x <- data.frame(mz = c(1, NA, 100), intensity = c(1, 1, 1))
y <- data.frame(mz = c(1, 2, 101), intensity = c(1, 1, 1))
dotproduct(x, y)
# [1] 0.9999998
x <- data.frame(mz = c(101, NA, 200), intensity = c(1, 1, 1))
y <- data.frame(mz = c(101, 102, 201), intensity = c(1, 1, 1))
dotproduct(x, y)
# [1] 0.9413118
I see but is there a real use case? Otherwise I would vote for a simpler implementation and API (and remove both arguments).
👍 |
Dear @sgibb
For two spectra with "lower" m/z value, we get a lower similarity score. If we set a higher
This means 2 of 3 shared peaks. I used the formula implemented in this
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tnaake Thanks for the clarification! Now I reviewed your PR and add some comments. There is a little bit of refactoring needed before we could accept this PR.
Please follow our coding style and it would be nice if you could use conventional commit messages as well.
Please add yourself as contributor to the DESCRIPTION
and README.md
.
R/dotproduct.R
Outdated
#' @param n `numeric(1)`, exponent for m/z-based weights | ||
#' @param normalize `logical` whether to calculate the DP (FALSE) or NDP (TRUE) | ||
#' @details | ||
#' `x` and `y` have to be spectrally aligned. Each row in `x` corresponds to the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is "spectrally aligned" a valid phrase?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed this to:
Each row in x
corresponds to the respective row in y
, i.e. the peaks
(entries "mz"
) per spectrum have to match.
@sgibb, do you think it is clearer this way?
R/dotproduct.R
Outdated
#' intensity=c(2, 0, 3, 1, 4, 0.4)) | ||
#' dotproduct(x, y, m=0.5, n=2, normalize=TRUE) | ||
#' @export | ||
dotproduct <- function(x, y, m=0.5, n=2, normalize=TRUE) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Several points:
- I am not sure that
data.frame
/list
input is a good idea. I know the output ofjoin
is alist
so it would be convenient. But what aboutmz1, int1, mz2, int2
orx1, y1, x2, y2
? @jorainer : what do you think? - You didn't check
m
andn
for (in)valid input. - Please remove the
normalize
argument as discussed before. - Please follow our coding style:
m = 0.5
(see https://rformassspectrometry.github.io/RforMassSpectrometry/articles/RforMassSpectrometry.html#coding-style)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Yes, input of
data.frame
/list
was written with having the output ofjoin
in mind. I can change to vector input as outlined above. Maybe we can wait for @jorainer 's ideas on this. - I added checks for
m
andn
now. - I removed the
normalize
argument. - I updates for coding style.
…y are not identical and n != 0
Codecov Report
@@ Coverage Diff @@
## master #17 +/- ##
==========================================
- Coverage 99.2% 98.56% -0.64%
==========================================
Files 15 17 +2
Lines 250 418 +168
==========================================
+ Hits 248 412 +164
- Misses 2 6 +4
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dear @tnaake, thanks for the refactorisation of your PR and following my suggestions. We are nearly there. Just a few minor points left.
I will ping @jorainer to get his opinion about the arguments (two list
/data.frame
s vs four numeric
vectors).
And sorry for the slow review, I just started a new job in September and my spare time was very limited.
DESCRIPTION
Outdated
@@ -15,7 +15,8 @@ Authors@R: c(person(given = "Laurent", family = "Gatto", | |||
email = "mail@sebastiangibb.de", | |||
role = c("aut", "cre"), | |||
comment = c(ORCID = "0000-0001-7406-4443")), | |||
person(given = "Sigurdur", family = "Smarason", role = "ctb") | |||
person(given = "Sigurdur", family = "Smarason", role = "ctb"), | |||
person(given = "Thomas", family = "Naake", role = "ctb") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
person(given = "Thomas", family = "Naake", role = "ctb") | |
person(given = "Thomas", family = "Naake", role = "ctb") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice contribution @tnaake !
On the question whether the input parameters x
and y
should be a data.frame
or individual numeric
: I would opt against the data.frame
. What we get from the upstream functions (being it peaks
or joinPeaks
, which will match the peaks of two spectra) are a two-column matrix
. So I would either use 2 matrix
or 4 numeric
.
Thinking a little ahead on defining a common API for peak similarity calculation functions, I think that x
and y
being a matrix
might be better. So, parameter FUN
in compareSpectra
would have to be a function
taking two parameters x
and y
of type matrix
representing the (matched) peak matrices of the two spectra with m/z values in its first, and intensity values in its second column.
@tnaake @sgibb , what do you think?
Another question, since the peaks have to be mapped - do you actually need the mz1
and mz2
(I suppose they would have to be similar/identical)?
I also vote for 2 matrices. In the interest of standardisation, should we define/document somewhere what the column names and or the order of the m/z and intensity in these columns? We could then have a helper function that checks if the data (easy enough to tell m/z and intensities apart) and column names? |
You mean you want that check function in I agree on having it documented (it's documented already in |
…gestions by sgibb
I changed the function that it accepts two matrices (actually this was changed in the second pull request, but I didn't know to deal with the sitatuation). I have two questions still.
@jorainer do you mean that I should remove the two
@jorainer I think we do need this at the moment. E.g. |
Co-Authored-By: Sebastian Gibb <mail@sebastiangibb.de>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @tnaake ! All fine from my side.
do you mean that I should remove the two !is.matrix statements in the first lines of dotproduct?
No, all fine. This was more related to have a function that evaluates whether a numeric matrix
has correct peak data (i.e. a column names "mz"
, and one "intensity"
and that the mz values are ordered increasingly) - I would however not check that within the dotproduct
function. This has to be done (and is done) upstream in Spectra
.
And thanks for the clarification why we need mz1
and mz2
!
Great work!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tnaake thanks again for this nice PR and I am sorry that the review took so long and was a little bit cumbersome.
I will merge the PR and apply the change regarding the DOI myself.
#' @references | ||
#' Li et al. (2015): Navigating natural variation in herbivory-induced | ||
#' secondary metabolism in coyote tobacco populations using MS/MS structural | ||
#' analysis. PNAS, E4147--E4155, DOI: 10.1073/pnas.1503106112. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should work with
#' analysis. PNAS, E4147--E4155, DOI: 10.1073/pnas.1503106112. | |
#' analysis. PNAS, E4147--E4155, [DOI: 10.1073/pnas.1503106112](https://doi.org/10.1073/pnas.1503106112). |
I manually rebased & merged it into the devel branch. |
Sorry, merged by accident :-/ |
Hi @jorainer, @sgibb, @lgatto
as discussed here (rformassspectrometry/Spectra#49)
dotproduct
should move toMsCoreUtils
.