As a mixcr result, too long CDR3 sequences were generated #332

miaoyu01 · 2018-01-31T04:24:17Z

I used paired-end 150 sequencing, and I now find that overly long CDR3 sequences were generated in the result. For example, the spliced sequence is "CGCTCAGGCTGGAGTCGGCTGCTCCCTCCCAGACATCTGTGTACTTCTGTGCCAGCAGTTACGACGGACAAAGAACAGATACGCAGTATTTTGGCCCAGGCACCCGGCTGACAGTGCTCGGTAAGCGGGGGCTCCCGCTGAAGCCCCGGAACTGGGGAGGGGGCGCCCCGGGACGCCGGGGGCGTCGCAGGGCCAGTTTCTGTGCCGCGTCTCGGGGCTGTGAGCCAAAAACATTCAGTACTTCGGCGCCGGGACCCGGCTCTCAGTGC", but the corresponding CDR3 sequence in the result is "TGTGCCAGCAGTTACGACGGACAAAGAACAGATACGCAGTATTTTGGCCCAGGCACCCGGCTGACAGTGCTCGGTAAGCGGGGGCTCCCGCTGAAGCCCCGGAACTGGGGAGGGGGCGCCCCGGGACGCCGGGGGCGTCGCAGGGCCAGTTTCTGTGCCGCGTCTCGGGGCTGTGAGCCAAAAACATTCAGTACTTC". However, a different CDR3 sequence was obtained after IMGT alignment. After BLAST, this sequence was found to be a true gene. Should this CDR3 sequence be retained? Thank you!

dbolotin · 2018-02-02T16:47:34Z

Please post raw alignments from this clone, like this:

mixcr exportAlignmentsPretty -e 'TGTGCCAGCAGTTACGACGGACAAAGAACAGATACGCAGTATTTTGGCCCAGGCACCCGGCTGACAGTGCTCGGTAAGCGGGGGCTCCCGCTGAAGCCCCGGAACTGGGGAGGGGGCGCCCCGGGACGCCGGGGGCGTCGCAGGGCCAGTTTCTGTGCCGCGTCTCGGGGCTGTGAGCCAAAAACATTCAGTACTTC' alignments.vdjca

dbolotin · 2018-02-02T16:51:19Z

What MiXCR version do you use?

miaoyu01 · 2018-02-04T06:25:26Z

Sorry, I checked the mixcr-2.1.5 result carefully: most of the overly long CDR3 sequences do not contain stop codons or fall in non-coding frames. After I removed these clones, some long sequences were still retained. The following files are for one clone with an overly long CDR3 sequence, where the MiXCR result differs from the IMGT alignment result:
(1) Spliced sequence: [Spliced sequence.zip](https://github.com/milaboratory/mixcr/files/1692497/Spliced.sequence.zip)
(2) MiXCR result for this clone: [This.clone.mixcr-result.zip](https://github.com/milaboratory/mixcr/files/1692499/This.clone.mixcr-result.zip)
(3) Alignment result for this clone: [alignment.result.zip](https://github.com/milaboratory/mixcr/files/1692504/alignment.result.zip)
(4) IMGT alignment result: [IMGT_V-QUEST.zip](https://github.com/milaboratory/mixcr/files/1692508/IMGT_V-QUEST.zip)

miaoyu01 · 2018-02-04T06:27:52Z

Thank you for your help!

dbolotin · 2018-02-15T19:28:15Z

Hi! Sorry for the late response.

Both sequences you provided seem to be artefacts.

In the first one, which you mentioned in your first message, the splicing machinery chose the wrong J gene. We've seen such sequences a lot in RNA-Seq data. As you can see in the picture below, after a successful rearrangement many "acceptor" splice sites near the J genes still remain in the gene sequence (marked by stars); splicing, as one would expect, can't perfectly distinguish between these sites and selects the wrong one from time to time:

This leads to the presence of several J genes in the sequence. Here is the alignment for your sequence (I removed weird alignments for D genes to keep the picture clean):

>>> Read id: 0

                                                          FR3><CDR3        V>
   Quality     66666666666666666666666666666666666666666666666666666666666666666666666666666666
   Target0   0 CGCTCAGGCTGGAGTCGGCTGCTCCCTCCCAGACATCTGTGTACTTCTGTGCCAGCAGTTACGACGGACAAAGAACAGAT 79   Score
TRBV6-6*00 223 cgctcaggctggagtTggctgctccctcccagacatctgtgtacttctgtgccagcagttac                   284  296
TRBV6-1*00 223 cgctcaggctggagtcggctgctccctcccagacatctgtgtacttctgtgccagcagt                      281  295
TRBV6-5*00 223 cgctcaggctgCTgtcggctgctccctcccagacatctgtgtacttctgtgccagcagttac                   284  282
TRBV6-9*00 223 cgctcaggctggagtcAgctgctccctcccagacatctgtAtacttctgtgccagcagtta                    283  277
TRBJ2-3*00  20                                                                        agCacagat 28   231


   Quality    666666666666666666666666666666666666666666666666666666666666666666666666666666
   Target0 80 ACGCAGTATTTTGGCCCAGGCACCCGGCTGACAGTGCTCGGTAAGCGGGGGCTCCCGCTGAAGCCCCGGAACTGGGGA 157  Score
TRBJ2-3*00 29 acgcagtattttggcccaggcacccggctgacagtgctcg                                       68   231

                                                                               <J
   Quality     66666666666666666666666666666666666666666666666666666666666666666666666666666666
   Target0 158 GGGGGCGCCCCGGGACGCCGGGGGCGTCGCAGGGCCAGTTTCTGTGCCGCGTCTCGGGGCTGTGAGCCAAAAACATTCAG 237  Score
TRBJ2-4*00  20                                                                 agccaaaaacattcag 35   235

                CDR3><FR4
   Quality     6666666666666666666666666666666
   Target0 238 TACTTCGGCGCCGGGACCCGGCTCTCAGTGC 268  Score
TRBJ2-4*00  36 tacttcggcgccgggacccggctctcagtgc 66   235

MiXCR uses the top-scoring J gene as the basis for locating the CDR3 boundary; in this case TRBJ2-4*00 (score 235) was chosen because it scores higher than TRBJ2-3*00 (score 231). That is why you see this long CDR3.
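To make the difference between the two selection rules concrete, here is a minimal Python sketch. It is not MiXCR code: the `JHit` class and both helper functions are hypothetical names invented for this illustration. The gene names, read coordinates, and scores are read off the alignment printed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JHit:
    """One J-gene hit on the read (hypothetical structure, not a MiXCR type)."""
    gene: str
    target_from: int  # first aligned read position, 0-based inclusive
    target_to: int    # last aligned read position, inclusive
    score: int


def top_scoring(hits):
    """Pick the hit with the highest alignment score."""
    return max(hits, key=lambda h: h.score)


def leftmost(hits):
    """Pick the hit that starts closest to the 5' end of the read."""
    return min(hits, key=lambda h: h.target_from)


# Values taken from the pretty alignment in this thread.
hits = [
    JHit("TRBJ2-3*00", 71, 119, 231),
    JHit("TRBJ2-4*00", 222, 268, 235),
]

print(top_scoring(hits).gene)  # TRBJ2-4*00, which yields the long CDR3
print(leftmost(hits).gene)     # TRBJ2-3*00, the J hit directly after the V gene
```

With only a 4-point score difference between the two hits, the choice of rule alone decides which J gene anchors the CDR3 end.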

Here is also the BLAST result for this sequence:

The second case seems to be just a faulty rearrangement. I searched the sequence in BLAST, and here is the result:

As you can see, it aligned against a relatively long genomic sequence after TRBV6-2, a sequence that under normal circumstances must be removed by VDJ recombination. This may be the result of an alternative recombination site or the like.

So, summing up:

Both sequences seem to be artefacts of the molecular machinery of the cell. (At the same time, it might be a starting point for a publication in Cell 🥇 😄 )
In the first case the correct CDR3 may still be extracted. I created a corresponding issue for this case: Select leftmost J gene in case of several J hits in different sequence locations #353. We will implement this eventually.
The second sequence is inherently faulty, and its CDR3 is technically legit. At the same time, this seems to be a non-productive rearrangement, and I would filter it out for normal repertoire analysis.
So, as a workaround for all of this, filtering by CDR3 length might be a solution.
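A length filter of this kind could be sketched in Python roughly as follows. Several things here are assumptions rather than facts from this thread: that the clones were exported as a tab-delimited table, that the nucleotide CDR3 column is named `nSeqCDR3` (check the header of your own export), and the 75 nt cutoff, which is an arbitrary placeholder to be tuned per chain and dataset. The two demo rows are toy data.

```python
import csv
import io


def filter_by_cdr3_length(rows, column="nSeqCDR3", max_nt=75):
    """Keep rows whose nucleotide CDR3 is at most max_nt long.

    `column` and `max_nt` are assumptions; adjust them to your export.
    """
    return [row for row in rows if len(row[column]) <= max_nt]


def read_clones(handle):
    """Read a tab-delimited clones table into a list of dicts."""
    return list(csv.DictReader(handle, delimiter="\t"))


# Toy table: one 45 nt CDR3 and one artificially long one (195 nt).
short_cdr3 = "TGTGCCAGCAGTTACGACGGACAAAGAACAGATACGCAGTATTTT"
long_cdr3 = short_cdr3 + "G" * 150
demo = (
    "cloneCount\tnSeqCDR3\n"
    f"10\t{short_cdr3}\n"
    f"3\t{long_cdr3}\n"
)

clones = read_clones(io.StringIO(demo))
kept = filter_by_cdr3_length(clones)
print(len(clones), len(kept))  # 2 1
```

For a real file, replace the `io.StringIO(demo)` handle with `open(path, newline="")`. A hard cutoff also discards any genuine long CDR3s, so it is worth looking at the length distribution before choosing the threshold.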

PoslavskySV added aligner discussion_required labels Feb 15, 2018

dbolotin mentioned this issue May 15, 2018

Losing large % of reads due to lack of clone sequence #383

Closed

PoslavskySV closed this as completed Oct 24, 2018
