Add transcripts level #256

Open
9 tasks done
antonylebechec opened this issue Jul 31, 2024 · 11 comments
Labels: enhancement (New feature or request)

@antonylebechec
Collaborator

antonylebechec commented Jul 31, 2024

To explore the transcript information related to each variant, especially to calculate scores, we need to create a "transcript view". It can be another table or a view (e.g. "transcripts") in which each line corresponds to a transcript (i.e. multiple lines per variant). A transcript ID column is needed as a unique key.

TODO:

  • Add multiple output PZ info (PZFlag, PZComment...)
  • Add additional output PZ info annotation transcript specific (such as Scores, predictions...)
  • Add prioritization through a list of preferred transcripts (used if multiple transcripts have the same PZScore, or to force the list independently)
  • Add option to be flexible with transcript version (refSeq)
  • Add option to order transcript by multiple columns (not only PZFlag and PZScore by default, but all available transcripts annotations)
  • Map refSeq/Ensembl transcript acc
  • Merge transcripts annotation columns (e.g. genename from multiple source/struct)
  • Export transcripts view as a file (TSV)
  • Add multiple prioritize profiles
@antonylebechec
Collaborator Author

To create a transcript view, some parameters are needed.
As an example, the param below identifies the table to generate (transcripts) and a structure describing the columns dedicated to transcripts, such as:

  • a single annotation field with a specific format (from_column_format), like the snpEff annotation,
  • a list of annotation fields corresponding to transcripts listed in another specific field (from_columns_map), like the dbNSFP annotation
{
            "transcripts": {
                "table": "transcripts",
                "struct": {
                    "from_column_format": [
                        {
                            "transcripts_column": "ANN",
                            "transcripts_infos_column": "Feature_ID"
                        }
                    ],
                    "from_columns_map": [
                        {
                            "transcripts_column": "Ensembl_transcriptid",
                            "transcripts_infos_columns": [
                                "genename",
                                "Ensembl_geneid",
                                "LIST_S2_score",
                                "LIST_S2_pred"
                            ]
                        },
                        {
                            "transcripts_column": "Ensembl_transcriptid",
                            "transcripts_infos_columns": [
                                "genename",
                                "VARITY_R_score",
                                "Aloft_pred"
                            ]
                        }
                    ]
                }
            }
        }

This param is used with function Variants.create_transcript_view() to generate a transcripts table:

   #CHROM       POS REF ALT       transcript     transcript_1 AAposAAlength Distance Allele Aloft_pred          HGVSc  ... cDNAposcDNAlength    genename       FeatureID LIST_S2_pred ERRORSWARNINGSINFO VARITY_R_score      GeneID                          Annotation  GeneName_1        HGVSp AnnotationImpact
0    chr1     28736   A   C      NR_024540.1      NR_024540.1          None     None      C       None    n.50+585T>G  ...              None      WASH7P     NR_024540.1         None               None           None      WASH7P                      intron_variant      WASH7P         None         MODIFIER
1    chr1     28736   A   C      NR_036051.1      NR_036051.1          None   1630.0      C       None     n.-1630A>C  ...              None   MIR1302-2     NR_036051.1         None               None           None   MIR1302-2               upstream_gene_variant   MIR1302-2         None         MODIFIER
2    chr1     28736   A   C      NR_036266.1      NR_036266.1          None   1630.0      C       None     n.-1630A>C  ...              None   MIR1302-9     NR_036266.1         None               None           None   MIR1302-9               upstream_gene_variant   MIR1302-9         None         MODIFIER
3    chr1     28736   A   C      NR_036267.1      NR_036267.1          None   1630.0      C       None     n.-1630A>C  ...              None  MIR1302-10     NR_036267.1         None               None           None  MIR1302-10               upstream_gene_variant  MIR1302-10         None         MODIFIER
4    chr1     28736   A   C      NR_036268.1      NR_036268.1          None   1630.0      C       None     n.-1630A>C  ...              None  MIR1302-11     NR_036268.1         None               None           None  MIR1302-11               upstream_gene_variant  MIR1302-11         None         MODIFIER
5    chr1     35144   A   C      NR_026818.1      NR_026818.1          None     None      C       None       n.597T>G  ...              None     FAM138A     NR_026818.1         None               None           None     FAM138A  non_coding_transcript_exon_variant     FAM138A         None         MODIFIER
6    chr1     35144   A   C      NR_026820.1      NR_026820.1          None     None      C       None       n.597T>G  ...              None     FAM138F     NR_026820.1         None               None           None     FAM138F  non_coding_transcript_exon_variant     FAM138F         None         MODIFIER
7    chr1     35144   A   C      NR_026822.1      NR_026822.1          None     None      C       None       n.597T>G  ...              None     FAM138C     NR_026822.1         None               None           None     FAM138C  non_coding_transcript_exon_variant     FAM138C         None         MODIFIER
8    chr1     35144   A   C      NR_036051.1      NR_036051.1          None   4641.0      C       None     n.*4641A>C  ...              None   MIR1302-2     NR_036051.1         None               None           None   MIR1302-2             downstream_gene_variant   MIR1302-2         None         MODIFIER
9    chr1     35144   A   C      NR_036266.1      NR_036266.1          None   4641.0      C       None     n.*4641A>C  ...              None   MIR1302-9     NR_036266.1         None               None           None   MIR1302-9             downstream_gene_variant   MIR1302-9         None         MODIFIER
10   chr1     35144   A   C      NR_036267.1      NR_036267.1          None   4641.0      C       None     n.*4641A>C  ...              None  MIR1302-10     NR_036267.1         None               None           None  MIR1302-10             downstream_gene_variant  MIR1302-10         None         MODIFIER
11   chr1     35144   A   C      NR_036268.1      NR_036268.1          None   4641.0      C       None     n.*4641A>C  ...              None  MIR1302-11     NR_036268.1         None               None           None  MIR1302-11             downstream_gene_variant  MIR1302-11         None         MODIFIER
12   chr1     69101   A   G  ENST00000335137  ENST00000335137          None     None   None          .           None  ...              None       OR4F5            None            T               None     0.27627227        None                                None       OR4F5         None             None
13   chr1     69101   A   G  ENST00000641515  ENST00000641515          None     None   None          .           None  ...              None       OR4F5            None            T               None              .        None                                None       OR4F5         None             None
14   chr1     69101   A   G   NM_001005484.1   NM_001005484.1         4/305     None      G       None        c.11A>G  ...            11/918       OR4F5  NM_001005484.1         None               None           None       OR4F5                    missense_variant       OR4F5    p.Glu4Gly         MODERATE
15   chr1    768251   A   G      NR_047519.1      NR_047519.1          None     None      G       None  n.287+3767A>G  ...              None   LINC01128     NR_047519.1         None               None           None   LINC01128                      intron_variant   LINC01128         None         MODIFIER
16   chr1    768251   A   G      NR_047521.1      NR_047521.1          None     None      G       None  n.287+3767A>G  ...              None   LINC01128     NR_047521.1         None               None           None   LINC01128                      intron_variant   LINC01128         None         MODIFIER
17   chr1    768251   A   G      NR_047523.1      NR_047523.1          None     None      G       None  n.287+3767A>G  ...              None   LINC01128     NR_047523.1         None               None           None   LINC01128                      intron_variant   LINC01128         None         MODIFIER
18   chr1    768251   A   G      NR_047524.1      NR_047524.1          None     None      G       None  n.287+3767A>G  ...              None   LINC01128     NR_047524.1         None               None           None   LINC01128                      intron_variant   LINC01128         None         MODIFIER
19   chr1    768251   A   G      NR_047525.1      NR_047525.1          None     None      G       None  n.154+3767A>G  ...              None   LINC01128     NR_047525.1         None               None           None   LINC01128                      intron_variant   LINC01128         None         MODIFIER
20   chr1    768251   A   G      NR_047526.1      NR_047526.1          None     None      G       None  n.287+3767A>G  ...              None   LINC01128     NR_047526.1         None               None           None   LINC01128                      intron_variant   LINC01128         None         MODIFIER
21   chr1    768252   A   G      NR_047519.1      NR_047519.1          None     None      G       None  n.287+3768A>G  ...              None   LINC01128     NR_047519.1         None               None           None   LINC01128                      intron_variant   LINC01128         None         MODIFIER
22   chr1    768252   A   G      NR_047521.1      NR_047521.1          None     None      G       None  n.287+3768A>G  ...              None   LINC01128     NR_047521.1         None               None           None   LINC01128                      intron_variant   LINC01128         None         MODIFIER
23   chr1    768252   A   G      NR_047523.1      NR_047523.1          None     None      G       None  n.287+3768A>G  ...              None   LINC01128     NR_047523.1         None               None           None   LINC01128                      intron_variant   LINC01128         None         MODIFIER
24   chr1    768252   A   G      NR_047524.1      NR_047524.1          None     None      G       None  n.287+3768A>G  ...              None   LINC01128     NR_047524.1         None               None           None   LINC01128                      intron_variant   LINC01128         None         MODIFIER
25   chr1    768252   A   G      NR_047525.1      NR_047525.1          None     None      G       None  n.154+3768A>G  ...              None   LINC01128     NR_047525.1         None               None           None   LINC01128                      intron_variant   LINC01128         None         MODIFIER
26   chr1    768252   A   G      NR_047526.1      NR_047526.1          None     None      G       None  n.287+3768A>G  ...              None   LINC01128     NR_047526.1         None               None           None   LINC01128                      intron_variant   LINC01128         None         MODIFIER
27   chr1    768253   A   G      NR_047519.1      NR_047519.1          None     None      G       None  n.287+3769A>G  ...              None   LINC01128     NR_047519.1         None               None           None   LINC01128                      intron_variant   LINC01128         None         MODIFIER
28   chr1    768253   A   G      NR_047521.1      NR_047521.1          None     None      G       None  n.287+3769A>G  ...              None   LINC01128     NR_047521.1         None               None           None   LINC01128                      intron_variant   LINC01128         None         MODIFIER
29   chr1    768253   A   G      NR_047523.1      NR_047523.1          None     None      G       None  n.287+3769A>G  ...              None   LINC01128     NR_047523.1         None               None           None   LINC01128                      intron_variant   LINC01128         None         MODIFIER
30   chr1    768253   A   G      NR_047524.1      NR_047524.1          None     None      G       None  n.287+3769A>G  ...              None   LINC01128     NR_047524.1         None               None           None   LINC01128                      intron_variant   LINC01128         None         MODIFIER
31   chr1    768253   A   G      NR_047525.1      NR_047525.1          None     None      G       None  n.154+3769A>G  ...              None   LINC01128     NR_047525.1         None               None           None   LINC01128                      intron_variant   LINC01128         None         MODIFIER
32   chr1    768253   A   G      NR_047526.1      NR_047526.1          None     None      G       None  n.287+3769A>G  ...              None   LINC01128     NR_047526.1         None               None           None   LINC01128                      intron_variant   LINC01128         None         MODIFIER
33   chr7  55249063   G   A   NM_001346897.2   NM_001346897.2      742/1091     None      A       None      c.2226G>A  ...         2487/3848        EGFR  NM_001346897.2         None               None           None        EGFR                  synonymous_variant        EGFR  p.Gln742Gln              LOW
34   chr7  55249063   G   A   NM_001346898.2   NM_001346898.2      787/1136     None      A       None      c.2361G>A  ...         2622/3983        EGFR  NM_001346898.2         None               None           None        EGFR                  synonymous_variant        EGFR  p.Gln787Gln              LOW
35   chr7  55249063   G   A   NM_001346899.1   NM_001346899.1      742/1165     None      A       None      c.2226G>A  ...         2483/6218        EGFR  NM_001346899.1         None               None           None        EGFR                  synonymous_variant        EGFR  p.Gln742Gln              LOW
36   chr7  55249063   G   A   NM_001346900.2   NM_001346900.2      734/1157     None      A       None      c.2202G>A  ...         2393/9676        EGFR  NM_001346900.2         None               None           None        EGFR                  synonymous_variant        EGFR  p.Gln734Gln              LOW
37   chr7  55249063   G   A   NM_001346941.2   NM_001346941.2       520/943     None      A       None      c.1560G>A  ...         1821/9104        EGFR  NM_001346941.2         None               None           None        EGFR                  synonymous_variant        EGFR  p.Gln520Gln              LOW
38   chr7  55249063   G   A      NM_005228.5      NM_005228.5      787/1210     None      A       None      c.2361G>A  ...         2622/9905        EGFR     NM_005228.5         None               None           None        EGFR                  synonymous_variant        EGFR  p.Gln787Gln              LOW
39   chr7  55249063   G   A      NR_047551.1      NR_047551.1          None     None      A       None      n.1201C>T  ...              None    EGFR-AS1     NR_047551.1         None               None           None    EGFR-AS1  non_coding_transcript_exon_variant    EGFR-AS1         None         MODIFIER
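
As a rough illustration of the two struct modes, here is a minimal Python sketch that turns one variant's INFO fields into one row per transcript. It is not HOWARD's implementation: rows_from_column_format and rows_from_columns_map are hypothetical helpers, the ANN field list is truncated for the example, and the rule that a single value (e.g. genename) is shared by all transcripts of the variant is an assumption.

```python
from typing import Dict, List, Optional

# First fields of a snpEff-style ANN entry (truncated for this example)
ANN_FIELDS = ["Allele", "Annotation", "Annotation_Impact", "Gene_Name",
              "Gene_ID", "Feature_Type", "Feature_ID"]


def rows_from_column_format(info: Dict[str, str], column: str,
                            id_field: str) -> List[dict]:
    """One row per entry of a structured field: entries are separated
    by ',' and the values inside an entry by '|'."""
    rows = []
    for entry in info.get(column, "").split(","):
        if not entry:
            continue
        values = dict(zip(ANN_FIELDS, entry.split("|")))
        rows.append({"transcript": values.get(id_field), **values})
    return rows


def rows_from_columns_map(info: Dict[str, str], id_column: str,
                          info_columns: List[str]) -> List[dict]:
    """One row per transcript ID: the i-th value of each listed field
    is attached to the i-th transcript."""
    ids = [t for t in info.get(id_column, "").split(",") if t]
    rows = []
    for i, transcript in enumerate(ids):
        row: Dict[str, Optional[str]] = {"transcript": transcript}
        for col in info_columns:
            values = info[col].split(",") if col in info else []
            if len(values) == len(ids):
                row[col] = values[i]
            elif len(values) == 1:
                row[col] = values[0]  # assumption: a single value is shared
            else:
                row[col] = None
        rows.append(row)
    return rows


info = {
    "ANN": "G|missense_variant|MODERATE|OR4F5|OR4F5|transcript|NM_001005484.1",
    "genename": "OR4F5",
    "Ensembl_transcriptid": "ENST00000641515,ENST00000335137",
    "LIST_S2_score": "0.79822,0.716128",
}
rows = (rows_from_column_format(info, "ANN", "Feature_ID")
        + rows_from_columns_map(info, "Ensembl_transcriptid",
                                ["genename", "LIST_S2_score"]))
for row in rows:
    print(row["transcript"], row.get("LIST_S2_score"))
```

The variant chr1:69101 then yields three rows, one per transcript, as in lines 12 to 14 of the table above.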

@antonylebechec
Collaborator Author

New calculation to add transcripts annotations to the INFO column as a field in JSON format.
Example (first create config/param.transcripts.json with the param from the help):

howard calculation --input="tests/data/example.ann.transcripts.vcf.gz" --output="/tmp/output.transcript.vcf" --calculations="TRANSCRIPTS_JSON" --param="config/param.transcripts.json"

@antonylebechec
Collaborator Author

Prioritization of transcripts in 'HOWARD' mode with 'transcripts' profiles available in a configuration JSON file, with 'PZT' as prefix:

"transcripts": {
  ...
  "prioritization": {
     "profiles": ["transcripts"],
     "prioritization_config": "config/prioritization_transcripts_profiles.json",
     "pzprefix": "PZT",
     "prioritization_score_mode": "HOWARD"
  }
}

With prioritization parameters based on 'LIST_S2_score' (file 'config/prioritization_transcripts_profiles.json'):

{
  "transcripts": {
    "LIST_S2_score": [
      {
        "type": "gt",
        "value": "0.75",
        "score": 10,
        "flag": "PASS",
        "comment": ["Very Good LIST Score"]
      },
      {
        "type": "gt",
        "value": "0.50",
        "score": 10,
        "flag": "PASS",
        "comment": ["Good LIST Score"]
      }
    ]
  }
}

Command:

howard calculation --input='tests/data/example.dbnsfp.transcripts.vcf.gz' --output='/tmp/example.calculation.transcripts.vcf' --param='config/param.transcripts.json' --calculations='TRANSCRIPTS_PRIORITIZATION'

Output VCF with PZTTranscript, PZTScore and PZTFlag (partial output):

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO
chr1    28736   .       A       C       100     PASS    CLNSIG=pathogenic
chr1    35144   .       A       C       100     PASS    CLNSIG=non-pathogenic
chr1    69101   .       A       G       100     PASS    genename=OR4F5;Ensembl_transcriptid=ENST00000641515,ENST00000335137;LIST_S2_score=0.79822,0.716128;PZTTranscript=ENST00000641515;PZTScore=20;PZTFlag=PASS
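
A minimal Python sketch of how the profile above could yield PZTScore=20 for ENST00000641515. It assumes that in 'HOWARD' mode the scores of all matching criteria are summed, which is consistent with the output shown but is not confirmed in this thread; matches and score_transcript are hypothetical helpers, not HOWARD functions.

```python
# Criteria copied from the 'transcripts' profile above (comments omitted)
RULES = {
    "LIST_S2_score": [
        {"type": "gt", "value": "0.75", "score": 10, "flag": "PASS"},
        {"type": "gt", "value": "0.50", "score": 10, "flag": "PASS"},
    ]
}


def matches(raw: str, criterion: dict) -> bool:
    """Evaluate one criterion; only the 'gt' and 'eq' types seen in
    this thread are handled."""
    if criterion["type"] == "gt":
        return float(raw) > float(criterion["value"])
    if criterion["type"] == "eq":
        return raw == criterion["value"]
    raise ValueError(f"unsupported criterion type: {criterion['type']}")


def score_transcript(annotations: dict, rules: dict) -> int:
    """Sum the scores of every matching criterion (assumed behaviour)."""
    total = 0
    for field, criteria in rules.items():
        raw = annotations.get(field)
        if raw in (None, "", "."):  # missing value: nothing to evaluate
            continue
        total += sum(c["score"] for c in criteria if matches(raw, c))
    return total


transcripts = {
    "ENST00000641515": {"LIST_S2_score": "0.79822"},
    "ENST00000335137": {"LIST_S2_score": "0.716128"},
}
scores = {tid: score_transcript(ann, RULES) for tid, ann in transcripts.items()}
best = max(scores, key=scores.get)
print(scores, best)
```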

@antonylebechec
Collaborator Author

New calculation to include transcripts annotations in the VCF, either in JSON format or in a structured format (like 'snpEff'), with the calculation tool.

Parameters in JSON file (e.g. 'config/param.transcripts.json'):

{
  "transcripts": {
    "transcripts_info_field_json": "transcripts_json",
    "transcripts_info_field_format": "transcripts_ann",
    "table": "transcripts",
    "struct": {...},
    ...
  }
}

Command:

howard calculation --input='tests/data/example.ann.transcripts.vcf.gz' --output='/tmp/example.calculation.transcripts.vcf' --param='config/param.transcripts.json' --calculations='TRANSCRIPTS_ANNOTATIONS'

Output VCF with 'transcripts_json' and 'transcripts_ann' INFO fields (partial output):

##INFO=<ID=ANN,Number=.,Type=String,Description="Functional annotations: 'Allele | Annotation | Annotation_Impact | Gene_Name | Gene_ID | Feature_Type | Feature_ID | Transcript_BioType | Rank | HGVS.c | HGVS.p | cDNA.pos / cDNA.length | CDS.pos / CDS.length | AA.pos / AA.length | Distance | ERRORS / WARNINGS / INFO'">
##INFO=<ID=transcripts_json,Number=.,Type=String,Description="Transcripts in JSON format">
##INFO=<ID=transcripts_ann,Number=.,Type=String,Description="Transcripts annotations: 'transcript | VARITY_R_score | transcript_1 | Annotation | FeatureID | Allele | HGVSc | Aloft_pred | HGVSp | TranscriptBioType | Distance | genename | LIST_S2_score | AAposAAlength | GeneID | Ensembl_geneid | Rank | GeneName_1 | ERRORSWARNINGSINFO | FeatureType | LIST_S2_pred | CDSposCDSlength | cDNAposcDNAlength | AnnotationImpact'">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO
chr1    69101   .       A       G       100     PASS    ANN=G|missense_variant|...;genename=OR4F5;Ensembl_transcriptid=ENST00000641515,ENST00000335137;LIST_S2_score=0.79822,0.716128;transcripts_json={"ENST00000335137":{"VARITY_R_score":"0.27627227","transcript_1":"ENST00000335137","Annotation":null,"FeatureID":null,"Allele":null,"HGVSc":null,"Aloft_pred":".","HGVSp":null,"TranscriptBioType":null,"Distance":null,"genename":"OR4F5","LIST_S2_score":"0.716128","AAposAAlength":null,"GeneID":null,"Ensembl_geneid":"ENSG00000186092","Rank":null,"GeneName_1":"OR4F5","ERRORSWARNINGSINFO":null,"FeatureType":null,"LIST_S2_pred":"T","CDSposCDSlength":null,"cDNAposcDNAlength":null,"AnnotationImpact":null},"ENST00000641515":{"VARITY_R_score":".","transcript_1":"ENST00000641515","Annotation":null,"FeatureID":null,"Allele":null,"HGVSc":null,"Aloft_pred":".","HGVSp":null,"TranscriptBioType":null,"Distance":null,"genename":"OR4F5","LIST_S2_score":"0.79822","AAposAAlength":null,"GeneID":null,"Ensembl_geneid":"ENSG00000186092","Rank":null,"GeneName_1":"OR4F5","ERRORSWARNINGSINFO":null,"FeatureType":null,"LIST_S2_pred":"T","CDSposCDSlength":null,"cDNAposcDNAlength":null,"AnnotationImpact":null},"NM_001005484.1":{"VARITY_R_score":null,"transcript_1":"NM_001005484.1","Annotation":"missense_variant","FeatureID":"NM_001005484.1","Allele":"G","HGVSc":"c.11A>G","Aloft_pred":null,"HGVSp":"p.Glu4Gly","TranscriptBioType":"protein_coding","Distance":null,"genename":"OR4F5","LIST_S2_score":null,"AAposAAlength":"4/305","GeneID":"OR4F5","Ensembl_geneid":null,"Rank":"1/1","GeneName_1":"OR4F5","ERRORSWARNINGSINFO":null,"FeatureType":"transcript","LIST_S2_pred":null,"CDSposCDSlength":"11/918","cDNAposcDNAlength":"11/918","AnnotationImpact":"MODERATE"}};transcripts_ann=ENST00000335137|0.27627227|ENST00000335137|||||.||||OR4F5|0.716128|||ENSG00000186092||OR4F5|||T|||,ENST00000641515|.|ENST00000641515|||||.||||OR4F5|0.79822|||ENSG00000186092||OR4F5|||T|||,NM_001005484.1||NM_001005484.1|missense_variant|NM_001005484.1|G|c.11A>G||p.Glu4Gly|protein_coding||OR4F5||4/305|OR4F5||1/1|OR4F5||transcript||11/918|11/918|MODERATE
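
For readers who want to consume the structured 'transcripts_ann' field downstream, here is a minimal Python sketch that reads the field names from the header description and splits each entry accordingly. field_names and parse_transcripts_ann are hypothetical helpers, and the sketch assumes that annotation values contain neither ',' nor '|'.

```python
import re

# Header line copied from the output above
HEADER = (
    '##INFO=<ID=transcripts_ann,Number=.,Type=String,Description="Transcripts annotations: '
    "'transcript | VARITY_R_score | transcript_1 | Annotation | FeatureID | Allele | HGVSc | "
    "Aloft_pred | HGVSp | TranscriptBioType | Distance | genename | LIST_S2_score | "
    "AAposAAlength | GeneID | Ensembl_geneid | Rank | GeneName_1 | ERRORSWARNINGSINFO | "
    "FeatureType | LIST_S2_pred | CDSposCDSlength | cDNAposcDNAlength | AnnotationImpact'"
    '">'
)

# First two entries of the transcripts_ann value shown above
VALUE = (
    "ENST00000335137|0.27627227|ENST00000335137|||||.||||OR4F5|0.716128|||"
    "ENSG00000186092||OR4F5|||T|||,"
    "ENST00000641515|.|ENST00000641515|||||.||||OR4F5|0.79822|||"
    "ENSG00000186092||OR4F5|||T|||"
)


def field_names(header: str) -> list:
    """Extract the '|'-separated field names quoted in the description."""
    inner = re.search(r"'([^']*)'", header).group(1)
    return [name.strip() for name in inner.split("|")]


def parse_transcripts_ann(value: str, names: list) -> list:
    """Return one dict per transcript entry (entries are ','-separated)."""
    return [dict(zip(names, entry.split("|"))) for entry in value.split(",")]


names = field_names(HEADER)
parsed = parse_transcripts_ann(VALUE, names)
for entry in parsed:
    print(entry["transcript"], entry["LIST_S2_score"], entry["LIST_S2_pred"])
```

Empty strings in the parsed dictionaries correspond to the null values of the 'transcripts_json' field.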

@antonylebechec
Collaborator Author

To also take variants' annotations into account in transcripts prioritization, the INFO column of the VCF is now included in the transcripts view/bubble. Prioritization profiles for transcripts can therefore be parameterized with annotations from variants.

Here is an example of a parametrization with an annotation from transcripts ('LIST_S2_score') and an annotation from variants ('CLNSIG'):

{
  "transcripts": {
    "LIST_S2_score": [
      {
        "type": "gt",
        "value": "0.75",
        "score": 10,
        "flag": "PASS",
        "comment": ["Very Good LIST Score"]
      },
      {
        "type": "gt",
        "value": "0.50",
        "score": 10,
        "flag": "PASS",
        "comment": ["Good LIST Score"]
      }
    ],
    "CLNSIG": [
      {
        "type": "eq",
        "value": "pathogenic",
        "score": 100,
        "flag": "PASS",
        "comment": ["Pathogenic"]
      }
    ]
  }
}

@antonylebechec
Collaborator Author

New options for transcripts prioritization:

{
    "profiles": ["transcripts"],
    "prioritization_config": "prioritization_transcripts_profiles.json",
    "prioritization_score_mode": "HOWARD",
    "pzprefix": "PZT",
    "pzfields": ["Score", "Flag", "spliceAI_score", "spliceAI_pred"],
    "prioritization_transcripts_order": {
        "PZTFlag": "ASC",
        "PZTScore": "DESC",
        "spliceAI_score": "DESC"
    },
    "prioritization_transcripts": "transcripts.tsv",
    "prioritization_transcripts_force": false,
    "prioritization_transcripts_version_force": false
}
  • Sections profiles, prioritization_config and prioritization_score_mode define the prioritization criteria.
  • Sections pzprefix and pzfields define the INFO tags to add to the VCF for the prioritized (chosen) transcript (e.g. PZTScore, PZTFlag, PZTspliceAI_score, PZTspliceAI_pred).
  • Section prioritization_transcripts_order defines how transcripts are ordered to determine which one is chosen (by default only PZTFlag and PZTScore). All available annotations can be used (e.g. scores, transcript length, predictions...).
  • Sections prioritization_transcripts, prioritization_transcripts_force and prioritization_transcripts_version_force define a list of preferred transcripts, used either to break ties in the order (usually on PZTScore) or, when forced, to impose the order; the version option forces the transcript version to be considered (useful for refSeq versions).
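
To make the interplay of the preference list and the two "force" options concrete, here is a minimal Python sketch restricted to PZTScore and a preference list (PZTFlag and other order columns are left out). choose_transcript is a hypothetical helper, and the semantics assumed here (version ignored unless version_force is true, list order taking precedence over the score only when force is true) are an interpretation of the description above, not confirmed behaviour.

```python
def strip_version(transcript_id: str) -> str:
    """NM_005228.5 -> NM_005228"""
    return transcript_id.split(".", 1)[0]


def choose_transcript(candidates, preferred, force=False, version_force=False):
    """Pick one transcript among candidates (dicts with 'transcript'
    and 'PZTScore'), using a list of preferred transcripts."""
    normalize = (lambda t: t) if version_force else strip_version
    rank = {}
    for position, transcript_id in enumerate(preferred):
        rank.setdefault(normalize(transcript_id), position)

    def preference(candidate):
        # Transcripts absent from the list rank after all listed ones
        return rank.get(normalize(candidate["transcript"]), len(preferred))

    if force:
        # Assumed: the list decides first, the score breaks ties
        key = lambda c: (preference(c), -c["PZTScore"])
    else:
        # Assumed: the score decides first, the list breaks ties
        key = lambda c: (-c["PZTScore"], preference(c))
    return min(candidates, key=key)["transcript"]


candidates = [
    {"transcript": "NM_001346897.2", "PZTScore": 10},
    {"transcript": "NM_005228.5", "PZTScore": 10},
    {"transcript": "NR_047551.1", "PZTScore": 20},
]
preferred = ["NM_005228.4"]  # illustrative list with an older version

print(choose_transcript(candidates, preferred))
print(choose_transcript(candidates, preferred, force=True))
print(choose_transcript(candidates, preferred, force=True, version_force=True))
print(choose_transcript(candidates[:2], preferred))
```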

@antonylebechec
Collaborator Author

New options to control transcript struct mapping:

{
    "from_column_format": [
        {
            "transcripts_column": "ANN",
            "transcripts_infos_column": "Feature_ID",
            "column_rename": {
                "Gene_Name": "genename",
                "Feature_ID": "THETRANSCRIPTOFSNPEFF"
            },
            "column_clean": true,
            "column_case": null
        }
    ],
    "from_columns_map": [
        {
            "transcripts_column": "Ensembl_transcriptid",
            "transcripts_infos_columns": [
                "genename",
                "Ensembl_geneid",
                "LIST_S2_score",
                "LIST_S2_pred"
            ],
            "column_rename": {
                "LIST_S2_score": "LISTScore",
                "LIST_S2_pred": "LISTPred"
            },
            "column_clean": false,
            "column_case": null
        },
        {
            "transcripts_column": "Ensembl_transcriptid",
            "transcripts_infos_columns": [
                "genename",
                "VARITY_R_score",
                "Aloft_pred"
            ],
            "column_rename": null,
            "column_clean": false,
            "column_case": "lower"
        }
    ]
}
  • Section column_rename renames columns/fields
  • Section column_clean cleans column/field names by removing all special characters (especially . in snpEff annotations)
  • Section column_case changes the case of column/field names to lower or upper

These options are useful to control field names and to merge fields from multiple sources (e.g. genename). All these options are processed together (i.e. a combination of rename, clean and case). Beware that prioritization parameters must take these name changes into account.
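
A minimal Python sketch of how the three options could combine on a single field name. normalize_column is a hypothetical helper; the order of application (rename, then clean, then case) is an assumption, while the cleaning rule (drop every non-alphanumeric character) is inferred from names such as 'HGVSc' and 'cDNAposcDNAlength' in the outputs above.

```python
import re


def normalize_column(name, column_rename=None, column_clean=False,
                     column_case=None):
    """Apply rename, then clean, then case to one column/field name."""
    if column_rename and name in column_rename:
        name = column_rename[name]
    if column_clean:
        # Drop every character that is not a letter or a digit
        name = re.sub(r"[^A-Za-z0-9]", "", name)
    if column_case == "lower":
        name = name.lower()
    elif column_case == "upper":
        name = name.upper()
    return name


print(normalize_column("HGVS.c", column_clean=True))
print(normalize_column("Gene_Name", {"Gene_Name": "genename"}, column_clean=True))
print(normalize_column("LIST_S2_score", {"LIST_S2_score": "LISTScore"}))
print(normalize_column("VARITY_R_score", column_case="lower"))
```

With these settings, 'Gene_Name' from snpEff and 'genename' from dbNSFP end up under the same name, which is what allows the two sources to be merged.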

antonylebechec added a commit that referenced this issue Sep 20, 2024
…e_view

Add options to control transcripts view struct #256
@antonylebechec
Collaborator Author

antonylebechec commented Sep 23, 2024

New options to merge and map transcript IDs (e.g. from Ensembl to refSeq), in order to merge multiple-sourced annotations in transcript view.

{
    "table": "transcripts",
    "column_id": "transcript",
    "transcripts_info_json": "transcripts_json",
    "transcripts_info_field": "transcripts_json",
    "transcript_id_remove_version": true,
    "transcript_id_mapping_file": "My_transcripts_mapping_file.tsv.gz",
    "transcript_id_mapping_force": false,
    "struct": {...}
}
  • Section transcript_id_remove_version removes the version of a transcript ID, if any (NM_123456.2 becomes NM_123456)
  • Section transcript_id_mapping_file indicates a transcript mapping file that provides the mapping between transcript IDs
  • Section transcript_id_mapping_force keeps only the transcript IDs that are present in the transcript mapping file

Beware of transcript versions in the transcript mapping file, to prevent a mix of transcripts with and without version (or use the remove version option to be consistent).

Example of transcripts mapping file:

NM_001005484	ENST00000641515.1
NR_024540
NR_036266
NM_001346900
NM_001346897
NR_047551
NM_001346941.2
NM_005228
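
A minimal Python sketch of how such a mapping file could be applied. It assumes the first column holds the ID to keep and any further column an alternative ID mapped to it, which is only a guess from the example above; load_mapping and map_ids are hypothetical helpers.

```python
def strip_version(transcript_id: str) -> str:
    """NM_123456.2 -> NM_123456"""
    return transcript_id.split(".", 1)[0]


def load_mapping(lines, remove_version=True):
    """Build {any listed ID: ID of the first column} from TSV lines."""
    mapping = {}
    for line in lines:
        columns = [c for c in line.rstrip("\n").split("\t") if c]
        if not columns:
            continue
        if remove_version:
            columns = [strip_version(c) for c in columns]
        for alias in columns:
            mapping[alias] = columns[0]
    return mapping


def map_ids(ids, mapping, remove_version=True, force=False):
    """Map transcript IDs; with force, drop IDs absent from the file."""
    mapped = []
    for transcript_id in ids:
        key = strip_version(transcript_id) if remove_version else transcript_id
        if key in mapping:
            mapped.append(mapping[key])
        elif not force:
            mapped.append(key)
    return mapped


# Three lines taken from the example mapping file above
mapping = load_mapping(["NM_001005484\tENST00000641515.1", "NR_024540", "NM_005228"])
ids = ["ENST00000641515", "ENST00000335137", "NM_005228.5"]
print(map_ids(ids, mapping))
print(map_ids(ids, mapping, force=True))
```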

Example of transcripts view with these options:
image

@antonylebechec
Collaborator Author

antonylebechec commented Sep 24, 2024

New option to export the transcripts view as a file.

"export": {
   "output": "/tmp/output.tsv.gz"
}
  • Section output defines the export file path; multiple formats are supported (TSV, VCF, Parquet...)

Example of output file in TSV:
image

Example of output file in VCF:
image

Example of command line:

howard calculation --input="tests/data/example.ann.transcripts.vcf.gz" --output="/tmp/example.calculation.transcripts.tsv" --param="tests/data/param.transcripts.json" --calculations='TRANSCRIPTS_ANNOTATIONS,TRANSCRIPTS_PRIORITIZATION,TRANSCRIPTS_EXPORT'

@antonylebechec
Collaborator Author

New option to extract NOMEN from a field (e.g. 'hgvs', 'snpeff_hgvs') using a dynamic transcript list taken from a table.column (e.g. from annotation or prioritization) rather than from a transcripts list file.

"transcripts": {
    "table": "transcripts",
    "column_id": "transcript",
    "transcripts_info_json": "transcripts_json",
    "transcripts_info_field": "transcripts_json",
    "transcript_id_remove_version": true,
    "transcript_id_mapping_file": "transcripts.for_mapping.tsv",
    "transcript_id_mapping_force": false,
    "struct": {
        "from_column_format": [
            {
                "transcripts_column": "ANN",
                "transcripts_infos_column": "Feature_ID",
                "column_clean": true
            }
        ],
        "from_columns_map": [
            {
                "transcripts_column": "Ensembl_transcriptid",
                "transcripts_infos_columns": [
                    "genename",
                    "Ensembl_geneid",
                    "LIST_S2_score",
                    "LIST_S2_pred"
                ]
            },
            {
                "transcripts_column": "Ensembl_transcriptid",
                "transcripts_infos_columns": [
                    "genename",
                    "VARITY_R_score",
                    "Aloft_pred"
                ]
            }
        ]
    },
    "prioritization": {
        "profiles": ["transcripts"],
        "prioritization_config": "prioritization_transcripts_profiles_fields_renamed.json",
        "pzprefix": "PZT",
        "pzfields": ["Score", "Flag", "LIST_S2_score", "LIST_S2_pred"],
        "prioritization_score_mode": "HOWARD"
    }
},
"calculation": {
    "calculations": {
        "NOMEN": {
            "options": {
                "hgvs_field": "snpeff_hgvs",
                "transcripts": "transcripts.tsv",
                "transcripts_table": "variants",
                "transcripts_column": "PZTTranscript",
                "transcripts_order": ["column", "file"]
            }
        }
    }
}

Within section calculation::calculations::NOMEN::options:

  • Section hgvs_field is the column with all HGVS annotations (multiple NOMEN)
  • Section transcripts is a file with a list of preferred transcripts (in order)
  • Sections transcripts_table and transcripts_column define where the transcripts for each variant are found (usually after a transcript prioritization)
  • Section transcripts_order is the order in which the lists are considered (by default the dynamic column first, then the file)

This option is useful to provide the NOMEN corresponding to the "best" transcript of the variant, after transcript prioritization.
Beware of transcript mapping, especially between refSeq and Ensembl: the prioritized transcript may not be available in the HGVS column (typically when transcripts are annotated from a database such as 'dbNSFP', which uses Ensembl transcripts, while the HGVS annotation comes from snpEff with refSeq transcripts).
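
The lookup order and the refSeq/Ensembl caveat can be sketched in Python as follows. pick_hgvs is a hypothetical helper, and the HGVS entry format used here (':'-separated parts, one of them being the transcript ID) is an assumption made for illustration; transcript versions are ignored when comparing.

```python
def strip_version(transcript_id: str) -> str:
    """NM_005228.5 -> NM_005228"""
    return transcript_id.split(".", 1)[0]


def pick_hgvs(hgvs_entries, column_transcript, file_transcripts,
              order=("column", "file")):
    """Return the first HGVS entry matching a transcript, looking at the
    sources in the given order; None if no transcript matches."""
    sources = {
        "column": [column_transcript] if column_transcript else [],
        "file": list(file_transcripts),
    }
    for source in order:
        for wanted in sources[source]:
            for entry in hgvs_entries:
                parts = {strip_version(p) for p in entry.split(":")}
                if strip_version(wanted) in parts:
                    return entry
    return None


# HGVS values taken from the EGFR rows of the transcripts table above
hgvs = [
    "NM_001346897.2:c.2226G>A:p.Gln742Gln",
    "NM_005228.5:c.2361G>A:p.Gln787Gln",
]

# Prioritized transcript present in the HGVS column: it wins
print(pick_hgvs(hgvs, "NM_001346897", ["NM_005228"]))
# Ensembl transcript absent from the refSeq HGVS column: fall back to the file
print(pick_hgvs(hgvs, "ENST00000641515", ["NM_005228"]))
```

The second call shows the caveat above: an Ensembl transcript chosen by prioritization finds no match in a refSeq-based HGVS column unless the IDs have been mapped beforehand.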

@antonylebechec
Collaborator Author

antonylebechec commented Sep 24, 2024

To extract a specific prioritization column from another prioritization profile (e.g. transcripts2), add the field with its prefix to the pzfields section (e.g. PZTScore_transcripts2).

"prioritization": {
    "profiles": ["transcripts", "transcripts2"],
    "prioritization_config": "prioritization_transcripts_profiles_fields_renamed2.json",
    "pzprefix": "PZT",
    "pzfields": [
        "Score",
        "Flag",
        "LIST_S2_score",
        "LIST_S2_pred",
        "PZTFlag_transcripts",
        "PZTScore_transcripts",
        "PZTFlag_transcripts2",
        "PZTScore_transcripts2"
    ],
    "prioritization_score_mode": "HOWARD"
}

Example of output:

   #CHROM       POS REF ALT       transcript PZTFlag_transcripts  PZTScore_transcripts PZTFlag_transcripts2  PZTScore_transcripts2
0    chr1     69101   A   G     NM_001005484                PASS                     0                 PASS                      0
1    chr1     69101   A   G  ENST00000335137                PASS                     0                 PASS                      0
2    chr1     28736   A   C        NR_036051                PASS                   200                 PASS                    400
3    chr1     28736   A   C        NR_036266                PASS                   200                 PASS                    400
4    chr1     28736   A   C        NR_036267                PASS                   200                 PASS                    400
5    chr1     28736   A   C        NR_036268                PASS                   200                 PASS                    400
6    chr1     28736   A   C        NR_024540                PASS                   200                 PASS                    400
7    chr1     35144   A   C        NR_036051                PASS                   100                 PASS                    200
8    chr1     35144   A   C        NR_036266                PASS                   100                 PASS                    200
9    chr1     35144   A   C        NR_036267                PASS                   100                 PASS                    200
10   chr1     35144   A   C        NR_036268                PASS                   100                 PASS                    200
11   chr1     35144   A   C        NR_026818                PASS                   100                 PASS                    200
12   chr1     35144   A   C        NR_026820                PASS                   100                 PASS                    200
13   chr1     35144   A   C        NR_026822                PASS                   100                 PASS                    200
14   chr1    768251   A   G        NR_047519                PASS                   100                 PASS                    200
15   chr1    768251   A   G        NR_047526                PASS                   100                 PASS                    200
16   chr1    768251   A   G        NR_047521                PASS                   100                 PASS                    200
17   chr1    768251   A   G        NR_047523                PASS                   100                 PASS                    200
18   chr1    768251   A   G        NR_047524                PASS                   100                 PASS                    200
19   chr1    768251   A   G        NR_047525                PASS                   100                 PASS                    200
20   chr1    768252   A   G        NR_047519                PASS                   100                 PASS                    200
21   chr1    768252   A   G        NR_047526                PASS                   100                 PASS                    200
22   chr1    768252   A   G        NR_047521                PASS                   100                 PASS                    200
23   chr1    768252   A   G        NR_047523                PASS                   100                 PASS                    200
24   chr1    768252   A   G        NR_047524                PASS                   100                 PASS                    200
25   chr1    768252   A   G        NR_047525                PASS                   100                 PASS                    200
26   chr1    768253   A   G        NR_047519                PASS                   100                 PASS                    200
27   chr1    768253   A   G        NR_047526                PASS                   100                 PASS                    200
28   chr1    768253   A   G        NR_047521                PASS                   100                 PASS                    200
29   chr1    768253   A   G        NR_047523                PASS                   100                 PASS                    200
30   chr1    768253   A   G        NR_047524                PASS                   100                 PASS                    200
31   chr1    768253   A   G        NR_047525                PASS                   100                 PASS                    200
32   chr7  55249063   G   A        NM_005228                PASS                     0                 PASS                      0
33   chr7  55249063   G   A     NM_001346897                PASS                     0                 PASS                      0
34   chr7  55249063   G   A     NM_001346898                PASS                     0                 PASS                      0
35   chr7  55249063   G   A     NM_001346941                PASS                     0                 PASS                      0
36   chr7  55249063   G   A     NM_001346899                PASS                     0                 PASS                      0
37   chr7  55249063   G   A     NM_001346900                PASS                     0                 PASS                      0
38   chr7  55249063   G   A        NR_047551                PASS                   100                 PASS                    200
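
The profile-specific column names in the output above appear to follow the pattern prefix + field + "_" + profile. The short Python sketch below builds such names; the helper names `profile_field` and `profile_fields` are hypothetical and only illustrate the naming convention, not how HOWARD generates the columns.

```python
from typing import List, Sequence


def profile_field(pzprefix: str, field: str, profile: str) -> str:
    """Build a profile-specific field name, e.g. 'PZTScore_transcripts2'."""
    return f"{pzprefix}{field}_{profile}"


def profile_fields(
    pzprefix: str, fields: Sequence[str], profiles: Sequence[str]
) -> List[str]:
    """List the profile-specific names for every profile/field pair."""
    return [
        profile_field(pzprefix, field, profile)
        for profile in profiles
        for field in fields
    ]


# Names as used in the "pzfields" list of the configuration above
print(profile_fields("PZT", ["Flag", "Score"], ["transcripts", "transcripts2"]))
# ['PZTFlag_transcripts', 'PZTScore_transcripts',
#  'PZTFlag_transcripts2', 'PZTScore_transcripts2']
```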

antonylebechec added a commit that referenced this issue Oct 7, 2024
Add transcripts options #256 into parameters help docs #4