enable mzTab for DIA-NN #205

Merged
merged 16 commits into from
Aug 3, 2022

WangHong007
Contributor

No description provided.

@github-actions

github-actions bot commented Jul 4, 2022

nf-core lint overall result: Passed ✅

Posted for pipeline commit 3e3a524

✅ 146 tests passed
❔ 3 tests were ignored

Run details

  • nf-core/tools version 2.4.1
  • Run at 2022-08-03 11:51:56

@WangHong007
Contributor Author

Hi Yasset (@ypriverol)! It seems the DIA container needs to be updated: this script additionally depends on the Biopython package.

@jpfeuffer
Collaborator

What do you need from the Bio package? Can we do without?

Is there an example output for the mzTab that is created?

@WangHong007
Contributor Author

An example is here: quantms/out.mztab
We use the Bio package to calculate protein coverage and Calculate.Precursor.Mz; we can set both to "null" for now.

@ypriverol
Member

@WangHong007 Protein group errors:

I validated the mzTab file out.mzTab and found the following errors:

[Error-1019] line 103: Column "protein_coverage" value "0.024;0.025" is not a valid Double value.
[Error-1019] line 171: Column "protein_coverage" value "0.032;0.038" is not a valid Double value.
[Error-1019] line 285: Column "protein_coverage" value "0.029;0.035;0.036" is not a valid Double value.
[Error-1019] line 515: Column "protein_coverage" value "0.257;0.257" is not a valid Double value.
[Error-1019] line 884: Column "protein_coverage" value "0.069;0.069" is not a valid Double value.
[Error-1019] line 919: Column "protein_coverage" value "0.047;0.047" is not a valid Double value.
[Error-1019] line 1320: Column "protein_coverage" value "0.027;0.032" is not a valid Double value.
[Error-1019] line 1375: Column "protein_coverage" value "0.052;0.052" is not a valid Double value.

This is mainly because your script writes each protein group as a single row, for example:

PRT	P02919;P02919-2	Penicillin-binding protein 1B	1.26E+06	1.12E+06	1.24E+06	1.34E+06	1.15E+06	1.31E+06	ref_ecoli_k12_ups1_combined	null	null	null	null	null	null	null	0.024;0.025	P02919;P02919-2	0.000861326	null	1178004.286	null	null	1296266.875	null	null	single_protein

This is not valid in mzTab: protein_coverage must be a single Double, not a semicolon-separated list.

The OpenMS approach (@timosachsenberg) is to write the following for each protein group:

1- One row for the indistinguishable_protein_group:
PRT MAPHEAD100010850 null null null mgm-proteins-decoy null null 5.524861878453039e-03 MAPHEAD100010850,MAPHEAD100677610 null 0.061311661311661 2.0 2.0 null null null null null indistinguishable_protein_group
In this case, select the first protein of the protein group as the accession; for your example, that is P02919.
2- Write the two proteins as protein_details. @timosachsenberg chose to write each member of the group as a separate entry, with the value protein_details in the optional column opt_global_result_type. For example:
PRT MAPHEAD100010850 null null null mgm-proteins-decoy null null 2.173913043478261e-03 null null 0.072727272727273 null null null null null 2 0 protein_details

With that, you can write the sequence coverage of each group member as a single Double, and every row will be a valid protein entry.

Here is an example:

PXD020692-Sample-12.sdrf_openms_design_openms.mzTab.zip
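
The Error-1019 messages above come from joining per-protein coverages with semicolons in one cell. As a minimal sketch of the split described in this comment, the snippet below turns one semicolon-joined group into one group row plus one protein_details row per member, each with a single float coverage. The function name and dictionary layout are hypothetical and are not taken from bin/diann_convert.py; leaving the coverage of the group row empty is also an assumption.

```python
from typing import Dict, List, Optional


def split_protein_group(accessions: str, coverages: str) -> List[Dict[str, Optional[object]]]:
    """Split a semicolon-joined protein group into mzTab-style row dictionaries.

    Hypothetical helper for illustration only.
    """
    members = accessions.split(";")
    values = [float(v) for v in coverages.split(";")]
    if len(members) != len(values):
        raise ValueError("expected one coverage value per accession")

    # A group with one member is just a single protein.
    if len(members) == 1:
        return [{
            "accession": members[0],
            "ambiguity_members": None,
            "protein_coverage": values[0],
            "opt_global_result_type": "single_protein",
        }]

    # Group row: anchored on the first accession, members comma-separated
    # as in the OpenMS example row quoted above.
    rows: List[Dict[str, Optional[object]]] = [{
        "accession": members[0],
        "ambiguity_members": ",".join(members),
        "protein_coverage": None,  # assumption: written as "null" for the group row
        "opt_global_result_type": "indistinguishable_protein_group",
    }]

    # One protein_details row per member, each with a single Double coverage.
    for accession, coverage in zip(members, values):
        rows.append({
            "accession": accession,
            "ambiguity_members": None,
            "protein_coverage": coverage,
            "opt_global_result_type": "protein_details",
        })
    return rows


rows = split_protein_group("P02919;P02919-2", "0.024;0.025")
```

For the P02919 example this yields three rows: the group row and two protein_details rows, so no cell ever holds more than one coverage value.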

Replace Bio with pyopenms, disable unique genes matrix and some small fixs
@WangHong007
Contributor Author

Hi @vdemichev @ypriverol!
How can we determine the protein result type (opt_global_result_type in mzTab) from the DIA-NN main report and matrices?

For now, I classify by the number of semicolon-separated protein IDs in the Protein.Group and Protein.Ids columns of those result files,
e.g.
Protein.Group="P09152;P19319", Protein.Ids="P09152;P19319" --> protein_details
Protein.Group="P09152", Protein.Ids="P09152;P19319" --> indistinguishable_protein_group
Protein.Group="P09152", Protein.Ids="P09152" --> single_protein
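
The counting rule in this comment can be sketched as follows. This is only an illustration of the heuristic as stated here, with a hypothetical function name, and not the code in the pull request.

```python
def classify_result_type(protein_group: str, protein_ids: str) -> str:
    """Classify a row by how many IDs Protein.Group and Protein.Ids contain."""
    n_group = len(protein_group.split(";"))
    n_ids = len(protein_ids.split(";"))

    if n_group == 1 and n_ids == 1:
        return "single_protein"                   # one-to-one
    if n_group == 1:
        return "indistinguishable_protein_group"  # one-to-many
    return "protein_details"                      # many-to-many


result = classify_result_type("P09152", "P09152;P19319")
```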

@WangHong007
Contributor Author

jmztab online: mztabvalidator
An example is here: diatest_out.mztab

Correct protein coverage and unique
@ypriverol
Member

> Hi @vdemichev @ypriverol! How can we determine the protein result type (opt_global_result_type in mzTab) from the DIA-NN main report and matrices?
>
> For now, I classify by the number of semicolon-separated protein IDs in the Protein.Group and Protein.Ids columns of those result files, e.g.
> Protein.Group="P09152;P19319", Protein.Ids="P09152;P19319" --> protein_details
> Protein.Group="P09152", Protein.Ids="P09152;P19319" --> indistinguishable_protein_group
> Protein.Group="P09152", Protein.Ids="P09152" --> single_protein

@WangHong007 I don't understand the question here.

@WangHong007
Contributor Author

> 1- indistinguishable_protein_group:
> PRT MAPHEAD100010850 null null null mgm-proteins-decoy null null 5.524861878453039e-03 MAPHEAD100010850,MAPHEAD100677610 null 0.061311661311661 2.0 2.0 null null null null null indistinguishable_protein_group
> In this case select the first protein of the protein group for your protein P02919.
> 2- Write the two proteins as protein_details. @timosachsenberg has selected to write each member of the group as a single entry with the optional column protein_details. For example:
> PRT MAPHEAD100010850 null null null mgm-proteins-decoy null null 2.173913043478261e-03 null null 0.072727272727273 null null null null null 2 0 protein_details

The question is how to determine the result type of each protein identification in the protein subtable, i.e. how to get the three result types (single_protein, protein_details, indistinguishable_protein_group) from the DIA-NN main report and matrix files.

For now, I use the relationship (one-to-one, many-to-many, or one-to-many) between the number of values in the Protein.Group and Protein.Ids columns of the DIA-NN result files.

e.g.
Protein.Group="P09152", Protein.Ids="P09152" --> single_protein
Protein.Group="P09152;P19319", Protein.Ids="P09152;P19319" --> protein_details
Protein.Group="P09152", Protein.Ids="P09152;P19319" --> indistinguishable_protein_group

@ypriverol
Member

@WangHong007, here is how those cases should be annotated:

  • Protein.Group="P09152", Protein.Ids="P09152" --> single_protein. This is a single protein, and no protein_details needs to be annotated.
  • Protein.Group="P09152;P19319", Protein.Ids="P09152;P19319" should be annotated as:
    • Select the first protein accession, P09152, as the anchor of the group; the others are ambiguity_members.
    • The opt_global_result_type can be annotated as indistinguishable_protein_group.
    • Each protein of the group (P09152; P19319) should also be annotated as protein_details, where you add the coverage, score, etc. for each protein.
  • Protein.Group="P09152", Protein.Ids="P09152;P19319" should be annotated as:
    • P09152 is the anchor of the group, and the others are ambiguity_members.
    • The opt_global_result_type can be annotated as indistinguishable_protein_group.
    • Each protein of the group (P09152; P19319) should also be annotated as protein_details, where you add the coverage, score, etc. for each protein.
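
The three cases above can be summarized in a short sketch. It assumes the group consists of the anchor plus every other accession found in either column; the names `Annotation` and `annotate` are hypothetical, and how ambiguity_members is serialized in the PRT row is left out.

```python
from typing import List, NamedTuple


class Annotation(NamedTuple):
    accession: str
    result_type: str  # value for opt_global_result_type


def annotate(protein_group: str, protein_ids: str) -> List[Annotation]:
    """List the PRT rows to write for one DIA-NN Protein.Group / Protein.Ids pair."""
    group = protein_group.split(";")
    ids = protein_ids.split(";")

    # The first accession of Protein.Group anchors the group.
    anchor = group[0]
    # Anchor first, then every other accession, de-duplicated in order of appearance.
    others = [acc for acc in dict.fromkeys(group + ids) if acc != anchor]
    members = [anchor] + others

    if len(members) == 1:
        # A single protein needs no protein_details rows.
        return [Annotation(anchor, "single_protein")]

    # One group row, plus one protein_details row per member
    # (each carrying its own coverage, score, etc.).
    rows = [Annotation(anchor, "indistinguishable_protein_group")]
    rows.extend(Annotation(acc, "protein_details") for acc in members)
    return rows


rows = annotate("P09152", "P09152;P19319")
```

Under this reading, both multi-protein cases produce the same three rows, which differs from the counting heuristic proposed earlier in the thread.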

@ypriverol
Member

@WangHong007 The problem there is that the pyopenms dependency is not installed when you run the script using Docker.

@ypriverol ypriverol removed the request for review from jpfeuffer August 3, 2022 15:32
@ypriverol ypriverol merged commit 4141176 into bigbio:dev Aug 3, 2022