feature/skip preliminary analysis on dia #335

Merged
merged 11 commits into from
Jan 10, 2024

Conversation

jspaezp
Contributor

@jspaezp jspaezp commented Jan 9, 2024

This PR adds the option to skip the preliminary steps of the DIA analysis, so that the pipeline runs only a single individual analysis and a single consensus analysis.

(please squash on merge ...)


github-actions bot commented Jan 9, 2024

nf-core lint overall result: Passed ✅

Posted for pipeline commit 639e507

  • ✅ 160 tests passed
  • ❔ 4 tests were ignored

Run details

  • nf-core/tools version 2.11.1
  • Run at 2024-01-09 19:46:13

@ypriverol
Member

Before reviewing the PR in detail, @jspaezp, can you explain the impact of skipping the preliminary step? I thought the idea of the preliminary step was to generate the library for the final analysis, @daichengxin?

@jspaezp
Contributor Author

jspaezp commented Jan 9, 2024

@ypriverol Absolutely!

  1. The idea behind the feature is to allow two-stage runs of the pipeline: a subset of the files is used to generate the empirical library, and the extraction is then done on all of the files or on the rest.
  2. This is especially relevant because the empirical library generation stage requires staging all the files (all .d/mzML + all .quant + FASTA + predicted library) in the same compute environment/disk, whereas the final quant stage does not (it does not require the .d/mzML files, but does need the .quant files + the library). (Relevant discussion: https://twitter.com/J_my_sci/status/1744152837247095086)
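
To make the staging difference between the two stages concrete, here is a minimal Python sketch. It is not pipeline code: the `Run` class, the `files_to_stage` function, the stage names, and the file names are all hypothetical, and the per-stage file lists simply mirror the two points above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """One acquisition; both paths are supplied by the caller (hypothetical layout)."""
    raw: str    # the .d directory or mzML file
    quant: str  # the matching .quant file


def files_to_stage(stage, runs, fasta, predicted_lib, empirical_lib=None):
    """Return the sorted list of files that must be present for a given stage.

    Stage names are illustrative, not actual pipeline parameters.
    """
    if stage == "empirical_library":
        # Library generation: raw files, .quant files, FASTA and predicted library.
        needed = {r.raw for r in runs} | {r.quant for r in runs}
        needed |= {fasta, predicted_lib}
        return sorted(needed)
    if stage == "final_quant":
        # Final quantification: only the .quant files and the empirical library.
        if empirical_lib is None:
            raise ValueError("final_quant needs the empirical library")
        return sorted({r.quant for r in runs} | {empirical_lib})
    raise ValueError(f"unknown stage: {stage!r}")
```

With two hypothetical runs, `files_to_stage("final_quant", runs, "db.fasta", "predicted.speclib", empirical_lib="empirical.speclib")` returns only the `.quant` files and the library, which is why the final stage can run where the raw files are not available.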

@jspaezp
Contributor Author

jspaezp commented Jan 9, 2024

Btw, I don't believe any of the errors in the CI/CD checks are caused by my changes. I see a couple of files missing upstream, and mamba is not able to generate environments.

@ypriverol
Member

We have to solve that. It is a work in progress because we have to move some files from the current server to PRIDE.

@jspaezp
Contributor Author

jspaezp commented Jan 10, 2024

Thanks @daichengxin for the review!

@ypriverol
Member

ypriverol commented Jan 10, 2024

@jspaezp do you know what impact sub-selecting a group of files, compared to using all the files, can have on the final results?

Another small question: do you think the selection of these files could be based on the technical and biological replicates plus the factor value?

@jspaezp
Contributor Author

jspaezp commented Jan 10, 2024

I have not tested this systematically, so I can't be sure, but I would assume that (1) you could miss peptides that show up specifically in one of the conditions/files not used in the library construction, and (2) you would have a slightly worse estimate of your FDR due to the smaller sample size.

I am not sure what public dataset could be used to test this hypothesis. I am assuming that datasets with more variability will be more prone to change depending on the analysis workflow. (I would be surprised if a 'cell line'+treatment dataset of 500 files looked any different whether the library is built with all 500 files or with 100.)
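
Point (1) can be illustrated with plain set arithmetic. The sketch below is a toy, not a test of the hypothesis: the file names and peptide sequences are made up, the `missed_peptides` helper is hypothetical, and it assumes a peptide can only be quantified in the final stage if it was observed in at least one library file.

```python
def missed_peptides(peptides_by_file, library_files):
    """Return peptides seen in some file but in none of the library files."""
    all_peptides = set().union(*peptides_by_file.values())
    library_peptides = set().union(
        *(peptides_by_file[f] for f in library_files)
    )
    return all_peptides - library_peptides


# Made-up toy data: one peptide appears only in the treated condition.
toy = {
    "control_1": {"PEPTIDEA", "PEPTIDEB"},
    "control_2": {"PEPTIDEA", "PEPTIDEB"},
    "treated_1": {"PEPTIDEA", "PEPTIDEC"},
}

# Library built from controls only: the treatment-specific peptide is lost.
print(missed_peptides(toy, ["control_1", "control_2"]))  # {'PEPTIDEC'}

# Library built from one file per condition: nothing is lost in this toy case.
print(missed_peptides(toy, ["control_1", "treated_1"]))  # set()
```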

@ypriverol
Member

> I have not tested this systematically, so I can't be sure, but I would assume that (1) you could miss peptides that show up specifically in one of the conditions/files not used in the library construction, and (2) you would have a slightly worse estimate of your FDR due to the smaller sample size.
>
> I am not sure what public dataset could be used to test this hypothesis. I am assuming that datasets with more variability will be more prone to change depending on the analysis workflow. (I would be surprised if a 'cell line'+treatment dataset of 500 files looked any different whether the library is built with all 500 files or with 100.)

Do you think the logic could use some of the SDRF information to do this selection?
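
As a rough illustration of what such a selection could look like (this is not implemented in the PR), the Python sketch below picks library files per factor value, preferring distinct biological replicates. The `select_library_files` function and the simplified keys `file`, `factor_value`, `biological_replicate` and `technical_replicate` are hypothetical stand-ins for whatever would be parsed from the SDRF.

```python
from collections import defaultdict


def select_library_files(rows, per_condition=1):
    """Pick up to `per_condition` files for each factor value.

    Within a condition, at most one file is taken per biological replicate,
    so a condition with fewer replicates than requested yields fewer files.
    """
    by_condition = defaultdict(list)
    for row in rows:
        by_condition[row["factor_value"]].append(row)

    selected = []
    for condition in sorted(by_condition):
        seen_bioreps = set()
        # Deterministic order: lowest technical replicate first.
        ordered = sorted(
            by_condition[condition],
            key=lambda r: (r["technical_replicate"],
                           r["biological_replicate"],
                           r["file"]),
        )
        for row in ordered:
            if len(seen_bioreps) == per_condition:
                break
            if row["biological_replicate"] in seen_bioreps:
                continue
            seen_bioreps.add(row["biological_replicate"])
            selected.append(row["file"])
    return selected
```

Selecting at least one file per factor value addresses the concern about condition-specific peptides; it does nothing for the FDR estimate, which would still rest on fewer files.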

@jspaezp
Contributor Author

jspaezp commented Jan 10, 2024

I was thinking about this for a while and there might be a way, but it would certainly require a lot more Nextflow plumbing than I really want to/can afford to devote right now. In addition, I am not sure what cvparam could be used to denote that those files should be used for the lib ... 1002752 ?? maybe ?

We could certainly have it as an open issue to implement the feature in the future (we could also discuss the right way to do it in the issue).

In other words, that is a much more complex feature than this PR attempts to be and I believe this feature by itself is complementary to that one.

@ypriverol ypriverol merged commit eb6985f into bigbio:dev Jan 10, 2024
14 of 17 checks passed
3 participants