It would really help the Pathogen-Host Interactions database (PHI-base) to have a comprehensive ontology of plant disease, so I'm including a table of all plant diseases from PHI-base: specifically all diseases where the host species is part of the Viridiplantae kingdom (NCBITaxon:33090).
Note that the list has not yet been manually reviewed (the PHI-base team is planning to do this over the next month or so), but I'm posting it here first so you can check how suitable it is.
See below for a link to the table as a tab-separated file (I had to zip it because GitHub won't allow TSV files to be uploaded to issues).
phibase_plant_diseases.zip
Table columns
I've tried to follow the structure of the TSV file that contains the scrape of the APS website, but I've included some extra columns to help with manual review of the data:
disease_old: the disease name as it currently appears in PHI-base (from version 4.10 of the database, which has not yet been released). I've included this so you can compare the original name to the cleaned name in the next column.
disease: the revised disease name, after normalizing letter case, removing redundant host information (see the Notes section below), and fixing inconsistent naming. I've tried to simplify the disease name as much as possible, with the aim of making the fewest assumptions about how PSO is planning to name their disease classes.
pathogen: the NCBI Taxonomy ID for the pathogen species. This is sourced directly from PHI-base.
pathogen_label: the scientific name for the pathogen species. This is sourced directly from PHI-base, and hasn't been cross-referenced against the names in the NCBI Taxonomy.
host: the NCBI Taxonomy ID for the host species. This is sourced directly from PHI-base. Note that the host column includes model host organisms (such as Nicotiana benthamiana in the case of publications containing tobacco leaf assays).
host_label: the scientific name for the host species. This is sourced directly from PHI-base, and hasn't been cross-referenced against the names in the NCBI Taxonomy.
plantstructure_old: the name for the affected plant tissue as it appears in PHI-base. I've remapped these names to Plant Ontology terms in the following columns; I've included the original names so the mapping can be manually reviewed.
plantstructure: the Plant Ontology term ID that is the closest match for the tissue type in PHI-base. Some rows contain multiple values, which are delimited with the pipe character. I remapped these terms manually, but I haven't reviewed the source publications to check whether the PHI-base tissue name is correct.
Note that I was unable to find a PO term for 'seedling': an exact match for this term exists in the BRENDA Tissue Ontology (BTO:0001228), but not in the Plant Ontology. I've left a placeholder called '(seedling)' in the column for now.
plantstructure_label: the corresponding label for the Plant Ontology term ID, also delimited with the pipe character (the order of the labels matches the order of the term IDs).
references: a list of publication identifiers sourced directly from PHI-base. The list is delimited with the pipe character. These have not been manually reviewed, so I can't guarantee that they're primary sources describing the disease. I've preferred the PubMed ID (PMID) where available, and used the DOI where the PMID is missing. I think we almost always have a DOI in addition to the PMID, so please let me know if you want that included in a separate column.
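For anyone loading the file programmatically, here is a minimal sketch of how the pipe-delimited columns could be split and the '(seedling)' placeholder picked out. The function names and the sample row are hypothetical (the sample values are placeholders, not data from the actual table), and I'm assuming the first line of the TSV is a header row with the column names listed above.

```python
import csv
import io

# Columns described above as pipe-delimited.
PIPE_COLUMNS = ("plantstructure", "plantstructure_label", "references")


def parse_rows(tsv_text):
    """Parse TSV text into dicts, splitting pipe-delimited columns into lists."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    rows = []
    for record in reader:
        for column in PIPE_COLUMNS:
            value = record.get(column) or ""
            record[column] = [part.strip() for part in value.split("|") if part.strip()]
        rows.append(record)
    return rows


def unmapped_structures(row):
    """Return plantstructure values that are not PO term IDs, e.g. '(seedling)'."""
    return [term for term in row["plantstructure"] if not term.startswith("PO:")]


# Hypothetical sample: every value below is a placeholder, not real data.
HEADER = ["disease", "pathogen", "host",
          "plantstructure", "plantstructure_label", "references"]
SAMPLE_ROW = ["example blight", "0000", "0000",
              "PO:0000000|(seedling)", "example structure|(seedling)",
              "00000000|10.0000/example"]
SAMPLE_TSV = "\t".join(HEADER) + "\n" + "\t".join(SAMPLE_ROW) + "\n"

if __name__ == "__main__":
    for row in parse_rows(SAMPLE_TSV):
        print(row["disease"], row["plantstructure"], unmapped_structures(row))
```

Because the term IDs and labels share the same order, `zip(row["plantstructure"], row["plantstructure_label"])` should pair each ID with its label.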
Notes
As previously mentioned: to keep the disease names as general as possible, I've removed the host common name or genus from the disease name where it matches the host in the 'host' column, since the information seems redundant in these cases. (It's arguably less redundant when the disease name contains the genus of the host, since that suggests the pathogen is not restricted to the single host species.)
I've kept the pathogen name in the disease name when it seems to be part of the accepted name for the disease (e.g. Fusarium ear blight). This might not be ideal if PSO plans to include the pathogen and host species with the disease name: if the term names were naively generated, PSO would end up with disease names like 'Fusarium graminearum Fusarium ear blight on Triticum aestivum', instead of something less redundant like 'Fusarium graminearum ear blight on Triticum aestivum'.
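To make the redundancy concrete, here is a small sketch of the naive naming next to one possible way of avoiding the repeated genus. Both function names are hypothetical, and the 'pathogen disease on host' pattern is only my assumption about how term names might be generated, not something PSO has specified.

```python
def naive_term_name(pathogen_label, disease, host_label):
    """Concatenate pathogen, disease, and host with no de-duplication."""
    return f"{pathogen_label} {disease} on {host_label}"


def compact_term_name(pathogen_label, disease, host_label):
    """Drop a leading pathogen genus from the disease name before combining.

    Assumes the genus is the first word of the pathogen label and only
    strips it when the disease name has further words after it.
    """
    genus = pathogen_label.split()[0]
    words = disease.split()
    if len(words) > 1 and words[0].lower() == genus.lower():
        disease = " ".join(words[1:])
    return f"{pathogen_label} {disease} on {host_label}"


if __name__ == "__main__":
    args = ("Fusarium graminearum", "Fusarium ear blight", "Triticum aestivum")
    print(naive_term_name(*args))
    print(compact_term_name(*args))
```

This only handles the case where the genus is the first word of the disease name; disease names that embed the pathogen name elsewhere would need manual review.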
For cases where the host name in the disease doesn't match the host in the host column, I've retained the host name (usually the common name) in parentheses after the disease name. This is to help identify cases where a model host organism has been used instead of the 'natural' host; the model hosts are usually Nicotiana benthamiana or Arabidopsis thaliana.
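If it helps with review, the retained host name could be pulled back out of the disease column with something like the sketch below. The function name is hypothetical, the disease name in the example is invented for illustration, and I'm assuming the retained host is always the last parenthesized group in the name.

```python
import re

# Matches 'disease name (host name)' with the host in trailing parentheses.
TRAILING_HOST = re.compile(r"^(?P<name>.*\S)\s*\((?P<host>[^()]+)\)\s*$")


def split_retained_host(disease):
    """Split a disease name into (name, retained_host).

    retained_host is None when there is no trailing parenthesized host.
    """
    match = TRAILING_HOST.match(disease)
    if match is None:
        return disease.strip(), None
    return match.group("name"), match.group("host").strip()


if __name__ == "__main__":
    # 'late blight (potato)' is an invented example of the format.
    print(split_retained_host("late blight (potato)"))
    print(split_retained_host("Fusarium ear blight"))
```

Rows where the second element is not `None` are the ones where the host in the disease name differs from the 'host' column, so they are the likely model-host cases.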
The inclusion of model hosts could be problematic. I'm not sure whether PSO wants to make a distinction for model hosts, or whether they should be included at all. (While it's true that many pathogens can cause disease on tobacco leaves when inoculated, it may not be notable that they can do so.)
Many of the diseases in the list may be synonymous with each other; the PHI-base team is hoping to identify as many of these cases as possible when we manually review the list.
CC: @ValWood @CuzickA