Discrepant results between similar Salamae genomes #57

Open
flashton2003 opened this issue Aug 11, 2023 · 2 comments

Comments

@flashton2003

Hello,

We've isolated some subsp. salamae strains for one of our projects. I have a few questions about the SISTR output for these isolates:

sample: CQJ13L
cgmlst_found_loci: 330
cgmlst_matching_alleles: 129
cgmlst_subspecies: salamae
o_antigen: -
serogroup: B
serovar: II 1,4,12,[27]:a:z39|II 4:a:z39
serovar_antigen: II 1,4,12,[27]:a:z39|II 4:a:z39
serovar_cgmlst: II 1,4,[5],12,[27]:b:[e,n,x]
O antigen prediction: 4
H1 antigen prediction(fliC): a
H2 antigen prediction(fljB): z39
Predicted identification: Salmonella enterica subspecies salamae (subspecies II)
Predicted antigenic profile: 4:a:z39
Predicted serotype: II [1],4,12,[27]:a:z39
average_depth: 70.3072
snp_count: 40928
indel_count: 0
N_count: 0
reads_cov: 89.86
Reference: GCF_019339485.1
Organism from Esmie:
Salmonella genus: Salmonella species

sample: CQJ127
cgmlst_found_loci: 330
cgmlst_matching_alleles: 134
cgmlst_subspecies: salamae
o_antigen: -
serogroup: B
serovar: II B:-:e,n,x
serovar_antigen: II B:-:e,n,x
serovar_cgmlst: II 1,4,[5],12,[27]:b:[e,n,x]
O antigen prediction: 4
H1 antigen prediction(fliC): z
H2 antigen prediction(fljB): e,n,x
Predicted identification: Salmonella enterica subspecies salamae (subspecies II)
Predicted antigenic profile: 4:z:e,n,x
Predicted serotype: II [1],4,12,27:z:e,n,x
average_depth: 62.4964
snp_count: 40202
indel_count: 0
N_count: 0
reads_cov: 89.21
Reference: GCF_019339485.1
Organism from Esmie: Salmonella Typhimurium
Salmonella genus: Salmonella typhimurium

  1. Neither of these samples has a prediction in the o_antigen column, but when I BLAST them against your database, they have quite a good match (97% similarity, >99% coverage) to "304|584|1,4,12,27|B" from the wzx database. Is this match not good enough to call the O antigen? Or is there uncertainty about record 304?
  2. The two samples have quite similar results across most fields (and in my BLAST results against the wzx and wzy databases), yet their results in the output differ. How come?

Here are the fasta files, in case you want to dig in.

https://www.dropbox.com/scl/fi/ppkiflqyvtwrn28nah9s6/CQJ127_S25_L001.fna?rlkey=t19wktth1uguutnmqkuej3jq7&dl=0
https://www.dropbox.com/scl/fi/k4yg4iwo1k0mjdinyahx0/CQJ13L_S23_L001.fna?rlkey=woawgupve273z3b7y7tx2xipi&dl=0

Thanks,

Phil

@kbessonov1984
Collaborator

kbessonov1984 commented Sep 3, 2024

Hello,
These isolates are complex to type because their serovars are not summarized by a single name but by an antigenic profile. SISTR uses antigens, cgMLST and MASH (if selected) to provide a final serovar call, with antigen results taking precedence over all other evidence.
The O antigen value in the o_antigen field is deduced from the serovar by a reverse lookup in the table of known WHO serovars, sistr/data/Salmonella-serotype_serogroup_antigen_table-WHO_2007.csv. The most informative output format is JSON, selected with the -f json option, which provides all intermediate and reliability values. For both samples I would use the cgMLST serovar as the final serovar.
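As a rough illustration of comparing the evidence per sample once the JSON output is available, here is a minimal sketch. It assumes the JSON is a list of per-genome objects whose keys match the column names in the table posted in this issue (serovar, serovar_antigen, serovar_cgmlst, cgmlst_found_loci, cgmlst_matching_alleles); the "sample" key and the helper name summarize are hypothetical, and the inline records are abbreviated from the values reported here, so a real file will contain more fields.

```python
import json

# Abbreviated records built from values reported in this issue.
# Assumption: key names mirror the CSV columns; "sample" is a hypothetical key.
SISTR_JSON = """
[
  {"sample": "CQJ127",
   "cgmlst_found_loci": 330, "cgmlst_matching_alleles": 134,
   "serovar": "II B:-:e,n,x",
   "serovar_antigen": "II B:-:e,n,x",
   "serovar_cgmlst": "II 1,4,[5],12,[27]:b:[e,n,x]"},
  {"sample": "CQJ13L",
   "cgmlst_found_loci": 330, "cgmlst_matching_alleles": 129,
   "serovar": "II 1,4,12,[27]:a:z39|II 4:a:z39",
   "serovar_antigen": "II 1,4,12,[27]:a:z39|II 4:a:z39",
   "serovar_cgmlst": "II 1,4,[5],12,[27]:b:[e,n,x]"}
]
"""


def summarize(record):
    """Collect the per-source serovar calls and the cgMLST match fraction."""
    found = record["cgmlst_found_loci"]
    matching = record["cgmlst_matching_alleles"]
    return {
        "sample": record["sample"],
        "final_call": record["serovar"],
        "antigen_call": record["serovar_antigen"],
        "cgmlst_call": record["serovar_cgmlst"],
        "cgmlst_match_fraction": matching / found,
    }


summaries = [summarize(r) for r in json.loads(SISTR_JSON)]
for s in summaries:
    print(f"{s['sample']}: final={s['final_call']} "
          f"antigen={s['antigen_call']} cgMLST={s['cgmlst_call']} "
          f"matching={s['cgmlst_match_fraction']:.1%}")
```

Printing the calls side by side makes it easier to see where the antigen and cgMLST evidence disagree and how much of the cgMLST scheme actually matched.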

Serovar prediction logic

  • CQJ127 - the final serovar was assigned from the O and H antigen allele database, as there was no good agreement between the cgMLST, MASH, and O and H antigen BLAST results. Looking at the H2 antigen hits for the fljB gene, there is an almost perfect match to the e,n,x antigen. For the H1 antigen and the fliC gene there was also an almost perfect e,n,x hit, but after filtering against the antigen-to-serovar table Salmonella-serotype_serogroup_antigen_table-WHO_2007.csv, no serovar has both H1=e,n,x and H2=e,n,x, so the H1 antigen was not assigned. Similarly, there is an almost perfect hit for O antigen 1,4,12,27, but it is not reported because it has no match in the same antigen-to-serovar table. There seems to be an issue with the correct assignment of the H1 or H2 antigen. Based on the antigen-to-serovar metadata, the expected H1 value could be b, a or l,v. Please note that only 134/330 cgMLST alleles match at 100%, hinting that extra work might be needed to polish the assembly or that these isolates belong to a new serovar. The most probable WKLM serovar is II 1,4,[5],12,[27]:b:[e,n,x]; the only caveat is that the H1 antigen was not detected as b. Here are the predictions from the 3 sources:

    • antigens: II B:-:e,n,x
    • cgMLST: II 1,4,[5],12,[27]:b:[e,n,x]
    • MASH: II 4,12:e,n,x:1,2,7 (based on the closest reference genome)
  • CQJ13L - the final serovar was assigned from the O and H antigen allele database, and it is a mixed call, as indicated by the | symbol. This sample is of higher quality than CQJ127. The H1 antigen is clearly a and the H2 antigen is z39, which together predict the O antigen as 1,4,12,[27]. The O antigen is reported as none because of a mismatch with the antigen-to-serovar table, but the top hit is the expected 1,4,12,27. The cgMLST call in this case is based on 129/330 matching alleles, which makes it a weak predictor. Thus the antigen prediction is the most reliable. The final serovar is most probably II 1,4,12,[27]:a:z39. I checked Salmonella-serotype_serogroup_antigen_table-WHO_2007.csv and there is a II 4:a:z39,"1,4,12,[27]",a,z39,,B,FALSE,salamae entry that is redundant in my opinion and gives rise to this mixed antigenic call.
    Here are the predictions from the 3 sources:

    • antigens: II 1,4,12,[27]:a:z39|II 4:a:z39 (the | means OR, just pick one)
    • cgMLST: II 1,4,[5],12,[27]:b:[e,n,x] (a complete miss in my opinion, because fewer than 50% of the cgMLST alleles match)
    • MASH: II 4,12:e,n,x:1,2,7 at a Mash distance of 0.00594242
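The table filtering described for both samples can be sketched roughly as follows. This is not SISTR's actual code, only a minimal illustration of filtering an antigen-to-serovar table by H1 and H2. The column names and the helpers load_table and serovars_for are assumptions made for this example; the single table row is the entry quoted above, whereas the real Salmonella-serotype_serogroup_antigen_table-WHO_2007.csv contains many more rows.

```python
import csv
import io

# Assumed column names (not confirmed against the real CSV header).
COLUMNS = ["serovar", "o_antigen", "h1", "h2", "other_h",
           "serogroup", "can_h2_be_missing", "subspecies"]

# The one row quoted in this thread; the real table is much larger.
ROWS = 'II 4:a:z39,"1,4,12,[27]",a,z39,,B,FALSE,salamae\n'


def load_table(text):
    """Parse CSV rows into dicts keyed by the assumed column names."""
    return list(csv.DictReader(io.StringIO(text), fieldnames=COLUMNS))


def serovars_for(table, h1, h2):
    """Return the table rows whose H1 and H2 antigens both match exactly."""
    return [row for row in table if row["h1"] == h1 and row["h2"] == h2]


table = load_table(ROWS)

# CQJ13L: H1=a and H2=z39 match a table entry, so a serovar can be named.
hits_13l = serovars_for(table, "a", "z39")
print([row["serovar"] for row in hits_13l])

# CQJ127: top BLAST hits gave H1=e,n,x and H2=e,n,x; no row has both,
# so nothing can be assigned from the table for that combination.
hits_127 = serovars_for(table, "e,n,x", "e,n,x")
print(hits_127)
```

An empty result for a given H1/H2 pair is the situation described for CQJ127, where a strong BLAST hit alone is not enough for the antigen to be reported.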

Both samples belong to subspecies salamae, but their serovars are different. We provide the information from all lines of evidence so that end users can finalize the serovar prediction. We are currently working on the version 1.1.3 release, which will be out soon and will provide more transparent serovar prediction logic messages in the log.

SISTR v1.1.2 results

SeqSero2 results for comparison

Input files: CQJ127_S25_L001.fna
O antigen prediction: 4
H1 antigen prediction(fliC): 1,2,7
H2 antigen prediction(fljB): e,n,x
Predicted identification: Salmonella enterica subspecies salamae (subspecies II)
Predicted antigenic profile: 4:1,2,7:e,n,x
Predicted serotype: II 4:1,2,7:e,n,x
Note: This predicted serotype is not in the Kauffman-White scheme.

Input files: CQJ13L_S23_L001.fna
O antigen prediction: 4
H1 antigen prediction(fliC): a
H2 antigen prediction(fljB): z39
Predicted identification: Salmonella enterica subspecies salamae (subspecies II)
Predicted antigenic profile: 4:a:z39
Predicted serotype: II [1],4,12,[27]:a:z39
Note:

WKLM scheme

@flashton2003
Author

Thanks very much, apologies for the slow response!
