As a mixcr result, too long CDR3 sequences were generated #332

Closed
miaoyu01 opened this issue Jan 31, 2018 · 5 comments

@miaoyu01

I am using paired-end 150 bp sequencing, and I find that some CDR3 sequences in the MiXCR result are too long. The spliced sequence is "CGCTCAGGCTGGAGTCGGCTGCTCCCTCCCAGACATCTGTGTACTTCTGTGCCAGCAGTTACGACGGACAAAGAACAGATACGCAGTATTTTGGCCCAGGCACCCGGCTGACAGTGCTCGGTAAGCGGGGGCTCCCGCTGAAGCCCCGGAACTGGGGAGGGGGCGCCCCGGGACGCCGGGGGCGTCGCAGGGCCAGTTTCTGTGCCGCGTCTCGGGGCTGTGAGCCAAAAACATTCAGTACTTCGGCGCCGGGACCCGGCTCTCAGTGC", but the corresponding CDR3 sequence in the result is "TGTGCCAGCAGTTACGACGGACAAAGAACAGATACGCAGTATTTTGGCCCAGGCACCCGGCTGACAGTGCTCGGTAAGCGGGGGCTCCCGCTGAAGCCCCGGAACTGGGGAGGGGGCGCCCCGGGACGCCGGGGGCGTCGCAGGGCCAGTTTCTGTGCCGCGTCTCGGGGCTGTGAGCCAAAAACATTCAGTACTTC". However, IMGT alignment gives a different CDR3 sequence. A BLAST search shows that this sequence matches a real gene. Should this CDR3 sequence be retained? Thank you!

@dbolotin
Member

dbolotin commented Feb 2, 2018

Please post raw alignments from this clone, like this:

mixcr exportAlignmentsPretty -e 'TGTGCCAGCAGTTACGACGGACAAAGAACAGATACGCAGTATTTTGGCCCAGGCACCCGGCTGACAGTGCTCGGTAAGCGGGGGCTCCCGCTGAAGCCCCGGAACTGGGGAGGGGGCGCCCCGGGACGCCGGGGGCGTCGCAGGGCCAGTTTCTGTGCCGCGTCTCGGGGCTGTGAGCCAAAAACATTCAGTACTTC' alignments.vdjca

@dbolotin
Member

dbolotin commented Feb 2, 2018

What MiXCR version do you use?

@miaoyu01
Author

miaoyu01 commented Feb 4, 2018

Sorry, I carefully checked the mixcr-2.1.5 result: most of the too-long CDR3 sequences do not contain terminating codons or are in non-coding boxes. After I removed these clones, some long sequences still remained. The following files are for one clone with a too-long CDR3 sequence, where the MiXCR result differs from the IMGT alignment result:
(1) Spliced sequence: [Spliced sequence.zip](https://github.com/milaboratory/mixcr/files/1692497/Spliced.sequence.zip)
(2) MiXCR result for this clone: [This.clone.mixcr-result.zip](https://github.com/milaboratory/mixcr/files/1692499/This.clone.mixcr-result.zip)
(3) Alignment result for this clone: [alignment.result.zip](https://github.com/milaboratory/mixcr/files/1692504/alignment.result.zip)
(4) IMGT alignment result: [IMGT_V-QUEST.zip](https://github.com/milaboratory/mixcr/files/1692508/IMGT_V-QUEST.zip)

@miaoyu01
Author

miaoyu01 commented Feb 4, 2018

Thank you for your help!

@dbolotin
Member

Hi! Sorry for the late response.

Both sequences you provided seem to be artefacts.

In the first one, which you mentioned in your first message, the splicing machinery chose the wrong J gene. We have seen such sequences a lot in RNA-Seq data. As you can see in the picture below, after a successful rearrangement many "acceptor" splice sites near the J genes still remain in the gene sequence (marked by stars). Splicing, as one would expect, can't perfectly distinguish between these sites and from time to time selects the wrong one:
image
This leads to the presence of several J genes in the sequence. Here is the alignment for your sequence (I removed the weird alignments for D genes to keep the picture clean):

>>> Read id: 0

                                                          FR3><CDR3        V>
   Quality     66666666666666666666666666666666666666666666666666666666666666666666666666666666
   Target0   0 CGCTCAGGCTGGAGTCGGCTGCTCCCTCCCAGACATCTGTGTACTTCTGTGCCAGCAGTTACGACGGACAAAGAACAGAT 79   Score
TRBV6-6*00 223 cgctcaggctggagtTggctgctccctcccagacatctgtgtacttctgtgccagcagttac                   284  296
TRBV6-1*00 223 cgctcaggctggagtcggctgctccctcccagacatctgtgtacttctgtgccagcagt                      281  295
TRBV6-5*00 223 cgctcaggctgCTgtcggctgctccctcccagacatctgtgtacttctgtgccagcagttac                   284  282
TRBV6-9*00 223 cgctcaggctggagtcAgctgctccctcccagacatctgtAtacttctgtgccagcagtta                    283  277
TRBJ2-3*00  20                                                                        agCacagat 28   231


   Quality    666666666666666666666666666666666666666666666666666666666666666666666666666666
   Target0 80 ACGCAGTATTTTGGCCCAGGCACCCGGCTGACAGTGCTCGGTAAGCGGGGGCTCCCGCTGAAGCCCCGGAACTGGGGA 157  Score
TRBJ2-3*00 29 acgcagtattttggcccaggcacccggctgacagtgctcg                                       68   231

                                                                               <J
   Quality     66666666666666666666666666666666666666666666666666666666666666666666666666666666
   Target0 158 GGGGGCGCCCCGGGACGCCGGGGGCGTCGCAGGGCCAGTTTCTGTGCCGCGTCTCGGGGCTGTGAGCCAAAAACATTCAG 237  Score
TRBJ2-4*00  20                                                                 agccaaaaacattcag 35   235

                CDR3><FR4
   Quality     6666666666666666666666666666666
   Target0 238 TACTTCGGCGCCGGGACCCGGCTCTCAGTGC 268  Score
TRBJ2-4*00  36 tacttcggcgccgggacccggctctcagtgc 66   235

MiXCR uses the top J gene (by score) as the basis for locating the CDR3 boundary. In this case TRBJ2-4*00 was chosen because it has a higher score (235) than TRBJ2-3*00 (231), so the CDR3 is extended all the way to the second J segment. That is why you see this long CDR3.
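To make the effect of the selection rule concrete, here is a minimal Python sketch that compares "pick the J hit with the highest score" with "pick the leftmost J hit" on the two J hits from the alignment above. It is only an illustration, not MiXCR's actual code: the `JHit` class and both helper functions are hypothetical names, and the read positions are read off the pretty-printed alignment by hand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JHit:
    """One J-gene hit on the read (hypothetical structure, for illustration only)."""
    gene: str
    target_from: int  # first aligned read position, as printed in the alignment
    target_to: int    # last aligned read position (inclusive)
    score: int


# Values taken by hand from the pretty alignment of read 0.
HITS = [
    JHit("TRBJ2-3*00", 71, 119, 231),
    JHit("TRBJ2-4*00", 222, 268, 235),
]


def top_by_score(hits):
    """Rule described in this comment: the highest-scoring J hit wins."""
    return max(hits, key=lambda h: h.score)


def leftmost(hits):
    """Alternative rule: the J hit that starts first on the read wins."""
    return min(hits, key=lambda h: h.target_from)


print(top_by_score(HITS).gene)  # TRBJ2-4*00 -> CDR3 runs to the second J segment
print(leftmost(HITS).gene)      # TRBJ2-3*00 -> CDR3 would end at the first J segment
```

With the score rule the CDR3 boundary is anchored about 150 nt further downstream than with the leftmost rule, which accounts for the extra length.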

Here is also the BLAST result for this sequence:
image

The second case seems to be just a faulty rearrangement. I searched for the sequence in BLAST, and here is the result:
image
As you can see, it was aligned against a relatively long genomic sequence downstream of TRBV6-2, a sequence that under normal circumstances must be removed by VDJ recombination. An alternative recombination site, or something similar, may explain this.

So, summing up:

  1. Both sequences seem to be artefacts of the cell's molecular machinery (at the same time, this might be a starting point for a publication in Cell 🥇 😄).
  2. In the first case the correct CDR3 may still be extracted. I created a corresponding issue for this case: "Select leftmost J gene in case of several J hits in different sequence locations" (#353). We will implement this eventually.
  3. The second sequence is inherently faulty, and its CDR3 is technically legit. At the same time, this seems to be a non-productive rearrangement, and I would filter it out for normal repertoire analysis.
  4. So, as a workaround for all of this, filtering by CDR3 length might be a solution.
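The length filter from the last point could look roughly like the Python sketch below. It assumes the clones have been exported to a tab-separated table with a column holding the CDR3 nucleotide sequence. The column name `nSeqCDR3`, the 90-nt cutoff and the two-row table are assumptions made for the example, so check the header of your own export and pick a cutoff that fits your data.

```python
import csv
import io


def filter_by_cdr3_length(rows, column, max_nt):
    """Return the rows whose CDR3 nucleotide sequence has at most max_nt bases."""
    return [row for row in rows if len(row[column]) <= max_nt]


CDR3_COLUMN = "nSeqCDR3"  # assumed column name; check your export header
MAX_CDR3_NT = 90          # assumed cutoff; not a MiXCR default

# Made-up two-clone table standing in for a tab-separated clone export.
short_cdr3 = "TGTGCCAGCAGTTACGACGGACAAAGAACAGATACGCAGTATTTT"  # first 45 nt of the CDR3 reported above
long_cdr3 = short_cdr3 + "ACGT" * 40                          # artificial 205-nt sequence
table = f"cloneId\t{CDR3_COLUMN}\n0\t{short_cdr3}\n1\t{long_cdr3}\n"

rows = list(csv.DictReader(io.StringIO(table), delimiter="\t"))
kept = filter_by_cdr3_length(rows, CDR3_COLUMN, MAX_CDR3_NT)
print([row["cloneId"] for row in kept])  # ['0']
```

For a real file, replace the `io.StringIO(table)` stand-in with an opened export file. A length cutoff removes both kinds of artefact described above, but it does not recover the correct CDR3 in the double-J case.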
