I would like to know how TRUST4 can directly analyze paired-end .fastq format data from the 10X Genomics platform for single-cell analysis? #271

Open
fight2021 opened this issue May 17, 2024 · 21 comments

Comments

@fight2021

I would like to ask how Trust4 can directly analyze paired-end .fastq format data from the 10X Genomics platform for single-cell analysis, instead of analyzing BAM format data. Can you provide support for this analysis? The current analysis speed is too slow.

run-trust4 -t 25 -b /home/zxsys/data6/bam/SRR22007527_genome_bam.bam -f /home/zxsys/data6/hg38_bcrtcr.fa --ref /home/zxsys/data6/human_IMGT+C.fa --barcode CB

Is it possible to directly use FASTQ format for paired-end single-cell data analysis without using BAM files, while still ensuring that Trust4 operates normally?

@mourisl
Collaborator

mourisl commented May 17, 2024

Here is the reply from the discussion just in case you missed it: Yes, you can. It would be something like running fastq files from this section: https://github.com/liulab-dfci/TRUST4?tab=readme-ov-file#10x-genomics-data-and-barcode-based-single-cell-data . For the running speed, which version of TRUST4 are you using? Which step do you find is too slow?

@fight2021
Author

I am currently running Cell Ranger on the upstream FASTQ data to obtain a BAM file for 10X single-cell transcriptome analysis of the immune repertoire. Then I analyze the BAM file with this command to obtain single-cell immune repertoire data:

run-trust4 -t 25 -b /home/zxsys/data6/bam/SRR22007527_genome_bam.bam -f /home/zxsys/data6/hg38_bcrtcr.fa --ref /home/zxsys/data6/human_IMGT+C.fa --barcode CB

This workflow is too slow for rapid data analysis. I would like to know how to use TRUST4 to analyze single-cell transcriptome FASTQ data directly, without first running Cell Ranger to obtain a BAM file. Currently, when I use the following command on single-cell transcriptome data, it results in errors and the analysis cannot be completed:

run-trust4 -f hg38_bcrtcr.fa --ref human_IMGT+C.fa -u path_to_10X_fastqs/R2.fastq.gz --barcode path_to_10X_fastqs/R1.fastq.gz --readFormat bc:0:15 --barcodeWhitelist cellranger_folder/cellranger-cs/VERSION/lib/python/cellranger/barcodes/737K-august-2016.txt [other options]

@fight2021
Author

First of all, thank you for your reply.

@mourisl
Collaborator

mourisl commented May 17, 2024

What error message did you get? Is your data 10X gene expression data or 10X vdj-kit data? Which version of TRUST4 are you using? Your command looks right to me. (Let's use this issue instead of the Discussion).

@fight2021
Author

Hello expert, I am currently using the following command, which only supports single-end data. Could you provide a command for analyzing paired-end data? I am a beginner, so there are many things I still need to learn.

run-trust4 -f hg38_bcrtcr.fa --ref human_IMGT+C.fa -u path_to_10X_fastqs/R2.fastq.gz --barcode path_to_10X_fastqs/R1.fastq.gz --readFormat bc:0:15 --barcodeWhitelist cellranger_folder/cellranger-cs/VERSION/lib/python/cellranger/barcodes/737K-august-2016.txt [other options]

@mourisl
Collaborator

mourisl commented May 17, 2024

This depends on your read structure. For example, if the read sequence is in both R1 and R2, and the barcode and UMI are in the first 26 bp of R1 (16 bp barcode + 10 bp UMI), you can use "-1 R1 -2 R2 --barcode R1 --readFormat bc:0:15,r1:26:-1" for this.
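
To make the coordinates concrete, here is a minimal Python sketch of how a spec such as bc:0:15,um:16:25,r1:26:-1 would slice an R1 read. It assumes 0-based, inclusive start:end positions with -1 meaning "to the end of the read", which is consistent with a 16 bp barcode being written as bc:0:15. The helper `slice_read_format` and the toy read are hypothetical and only for illustration; they are not part of TRUST4.

```python
def slice_read_format(spec: str, seq: str) -> dict:
    """Split `seq` into named fields according to a --readFormat-style spec.

    Assumption: coordinates are 0-based and inclusive, and an end of -1
    means "through the end of the read".
    """
    fields = {}
    for part in spec.split(","):
        name, start, end = part.split(":")
        start, end = int(start), int(end)
        stop = len(seq) if end == -1 else end + 1
        fields[name] = seq[start:stop]
    return fields


# Toy 36 bp read: 16 bp barcode + 10 bp UMI + 10 bp insert (made-up sequence).
read = "ACGTACGTACGTACGT" + "TTTTTTTTTT" + "GGGGCCCCAA"
parts = slice_read_format("bc:0:15,um:16:25,r1:26:-1", read)
print(len(parts["bc"]), len(parts["um"]), parts["r1"])  # 16 10 GGGGCCCCAA
```

Under this reading, r1:26:-1 tells the extractor to skip the first 26 bp of R1 and treat the remainder as read sequence.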

@ruffyp

ruffyp commented Oct 31, 2024

Hello, I tested TRUST4 on raw paired-end FASTQ files from 10x (16 bp barcode + 10 bp UMI in the first 26 bp of R1, ~40M reads), using this command:

run-trust4 --readFormat bc:0:15,um:16:25 --barcodeWhitelist 737K-august-2016.txt -t 16 -f human_imgt.fa --ref human_imgt.fa -1 R1.fastq.gz -2 R2.fastq.gz --barcode R1.fastq.gz --UMI R1.fastq.gz -o tmp

It ran well but slowly, taking about 3.5 hours:

[Wed Oct 30 13:49:50 2024] TRUST4 v1.1.4-r534 begins.
[Wed Oct 30 13:49:50 2024] SYSTEM CALL: fastq-extractor -t 16 -f human_imgt.fa -o tmp_toassemble --readFormat bc:0:15,um:16:25 --barcodeWhitelist 737K-august-2016.txt -1 R1.fastq.gz -2 R2.fastq.gz --barcode R1.fastq.gz --UMI R1.fastq.gz
[Wed Oct 30 13:49:51 2024] Start to extract candidate reads from read files.
[Wed Oct 30 14:15:29 2024] Finish extracting reads.
[Wed Oct 30 14:15:29 2024] SYSTEM CALL: trust4 -t 16 -f human_imgt.fa -o tmp -1 tmp_toassemble_1.fq -2 tmp_toassemble_2.fq --barcode tmp_toassemble_bc.fa --UMI tmp_toassemble_umi.fa
[Wed Oct 30 14:15:32 2024] Read in and count kmers for 100000 reads.
[Wed Oct 30 14:15:35 2024] Read in and count kmers for 200000 reads.
[Wed Oct 30 14:15:38 2024] Read in and count kmers for 300000 reads.
[Wed Oct 30 14:15:42 2024] Read in and count kmers for 400000 reads.
[Wed Oct 30 14:15:45 2024] Read in and count kmers for 500000 reads.
...
[Wed Oct 30 14:57:01 2024] Read in and count kmers for 38600000 reads.
[Wed Oct 30 14:57:10 2024] Read in and count kmers for 38700000 reads.
[Wed Oct 30 14:57:19 2024] Read in and count kmers for 38800000 reads.
[Wed Oct 30 15:44:37 2024] Found 38695893 reads.
[Wed Oct 30 15:44:50 2024] Get barcode-wise kmer count.
[Wed Oct 30 15:53:28 2024] Finish barcode-wise kmer count.
[Wed Oct 30 15:55:24 2024] Finish sorting the reads.
[Wed Oct 30 16:00:48 2024] Finish rough annotations.
[Wed Oct 30 16:01:13 2024] Processed 100000 reads (100000 are used for assembly).
[Wed Oct 30 16:01:13 2024] Processed 200000 reads (196942 are used for assembly).
[Wed Oct 30 16:01:13 2024] Processed 300000 reads (296941 are used for assembly).
[Wed Oct 30 16:01:13 2024] Processed 400000 reads (394364 are used for assembly).
...
[Wed Oct 30 16:46:37 2024] Processed 35900000 reads (25222072 are used for assembly).
[Wed Oct 30 16:47:09 2024] Processed 36000000 reads (25293193 are used for assembly).
[Wed Oct 30 16:47:39 2024] Processed 36100000 reads (25363234 are used for assembly).
[Wed Oct 30 16:48:04 2024] Assembled 25394573 reads.
[Wed Oct 30 16:48:04 2024] Try to rescue 2505841 reads for assembly.
[Wed Oct 30 17:02:55 2024] Rescued 217580 reads.
[Wed Oct 30 17:07:03 2024] SYSTEM CALL: annotator -f human_imgt.fa -a tmp_final.out -t 16 -o tmp --barcode --UMI -r tmp_assembled_reads.fa --airrAlignment > tmp_annot.fa
[Wed Oct 30 17:07:03 2024] Start to annotate assemblies.
[Wed Oct 30 17:09:20 2024] Start to realign reads for CDR3 analysis.
[Wed Oct 30 17:20:31 2024] Compute CDR3 abundance.
[Wed Oct 30 17:20:44 2024] Finish annotation.
[Wed Oct 30 17:21:42 2024] SYSTEM CALL: perl trust-barcoderep.pl tmp_cdr3.out -a tmp_annot.fa --chainsInBarcode 2 > tmp_barcode_report.tsv
[Wed Oct 30 17:22:07 2024] SYSTEM CALL: perl trust-simplerep.pl tmp_cdr3.out --barcodeCnt --filterBarcoderep tmp_barcode_report.tsv > tmp_report.tsv
[Wed Oct 30 17:22:12 2024] SYSTEM CALL: perl trust-airr.pl tmp_report.tsv tmp_annot.fa --airr-align tmp_airr_align.tsv > tmp_airr.tsv
[Wed Oct 30 17:22:18 2024] SYSTEM CALL: perl trust-airr.pl tmp_barcode_report.tsv tmp_annot.fa --format barcoderep --airr-align tmp_airr_align.tsv > tmp_barcode_airr.tsv
[Wed Oct 30 17:22:28 2024] TRUST4 finishes.

In comparison, cellranger vdj took half an hour. Is this normal, or am I doing something wrong?
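
One way to see which stage dominates is to diff the timestamps in the log. The following is a minimal Python sketch using a few lines copied from the log above. The helper name `stage_durations` is an illustrative choice, and the timestamp layout is assumed to be exactly the one shown in this log.

```python
import re
from datetime import datetime

LOG = """\
[Wed Oct 30 13:49:50 2024] TRUST4 v1.1.4-r534 begins.
[Wed Oct 30 13:49:51 2024] Start to extract candidate reads from read files.
[Wed Oct 30 14:15:29 2024] Finish extracting reads.
[Wed Oct 30 15:44:37 2024] Found 38695893 reads.
[Wed Oct 30 16:00:48 2024] Finish rough annotations.
[Wed Oct 30 16:48:04 2024] Assembled 25394573 reads.
[Wed Oct 30 17:20:44 2024] Finish annotation.
[Wed Oct 30 17:22:28 2024] TRUST4 finishes.
"""

LINE = re.compile(r"^\[(.+?)\] (.*)$")


def stage_durations(log_text):
    """Return (message, seconds since the previous logged line) pairs."""
    out, prev = [], None
    for line in log_text.splitlines():
        m = LINE.match(line)
        if not m:
            continue  # skip "..." and any other non-timestamped lines
        t = datetime.strptime(m.group(1), "%a %b %d %H:%M:%S %Y")
        if prev is not None:
            out.append((m.group(2), int((t - prev).total_seconds())))
        prev = t
    return out


durations = stage_durations(LOG)
total = sum(sec for _, sec in durations)
for msg, sec in durations:
    print(f"{sec / 60:6.1f} min  {msg}")
print(f"total: {total / 3600:.2f} h")
```

On these lines, read extraction takes about 26 minutes, while reading and k-mer counting (about 89 minutes) and assembly (about 47 minutes) account for most of the roughly 3.5 hour run.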

@mourisl
Collaborator

mourisl commented Oct 31, 2024

It's normal. I think your data is from the 10x VDJ kit. TRUST4 is designed for regular RNA-seq data, such as regular 10x 5' scRNA-seq without VDJ amplification. Read coverage is sparse in data without targeted amplification, so the assembly procedure, including data preprocessing, is more complex in order to achieve good sensitivity. I'm working on improving the running efficiency for target-amplified data, such as TCR-seq/BCR-seq, in recent versions, but there is still a long way to go with the current code structure.

@ruffyp

ruffyp commented Oct 31, 2024

Yes, the data was from the 10x TCR/BCR Amplification Kit! Thank you for your reply, and I'm looking forward to the new version!

@ruffyp

ruffyp commented Nov 19, 2024

Hi, I'm trying to understand the output files generated by the command above, and I have some questions:

1. The *_toassemble_1.fq, *_toassemble_2.fq, and *_toassemble_bc.fa files have the same number of reads, 20,173,744 in my case. Can I regard this as the number of reads after barcode correction? And why are more reads reported in the log file:

[Mon Nov 18 12:03:24 2024] Read in and count kmers for 38800000 reads.
[Mon Nov 18 12:38:04 2024] Found 38561994 reads.
[Mon Nov 18 12:38:13 2024] Get barcode-wise kmer count.
[Mon Nov 18 12:44:42 2024] Finish barcode-wise kmer count.
[Mon Nov 18 12:46:31 2024] Finish sorting the reads.
[Mon Nov 18 12:51:04 2024] Finish rough annotations.

2. How can I obtain a number equivalent to 'Valid Barcodes' in the 10x QC report, which is defined as 'Fraction of reads with barcodes that match the whitelist after barcode correction.'?

3. Is there detailed documentation of the output files?

@mourisl
Collaborator

mourisl commented Nov 19, 2024

  1. The number of reads shouldn't go above the number of reads in the toassemble files. Could you please check the "trust4" command in the running log to make sure the parameter passed to the main assembly program is correct?
  2. The main assembly algorithm ignores reads without a "valid" barcode by default. Nevertheless, uncorrectable barcodes are labeled as "missing_barcode" in the toassemble_bc file, and this information can be used to infer the fraction of reads with valid barcodes.
  3. The output files are mainly described in the README file: in the "Input/Output" and "Practical notes: 10X Genomics data and barcode-based single-cell data" sections.
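
Point 2 can be sketched in Python as follows. This assumes the *_toassemble_bc.fa file is FASTA-like, with one header line per read followed by a single line holding either the barcode or the literal string missing_barcode. That layout is an assumption to check against your own file, and `valid_barcode_fraction` is a hypothetical helper, not a TRUST4 utility.

```python
import os
import tempfile


def valid_barcode_fraction(bc_fasta_path):
    """Return (total, missing, valid_fraction) for a barcode FASTA file.

    Assumption: each record is a '>' header line followed by one line that
    holds either a barcode or the literal string 'missing_barcode'.
    """
    total = missing = 0
    with open(bc_fasta_path) as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith(">"):
                continue
            total += 1
            if line == "missing_barcode":
                missing += 1
    frac = (total - missing) / total if total else 0.0
    return total, missing, frac


# Tiny made-up file to exercise the function (not real TRUST4 output).
demo = (
    ">read1\nACGTACGTACGTACGT\n"
    ">read2\nmissing_barcode\n"
    ">read3\nTTTTACGTACGTACGT\n"
    ">read4\nACGTACGTACGTAAAA\n"
)
with tempfile.NamedTemporaryFile("w", suffix=".fa", delete=False) as tmp:
    tmp.write(demo)
total, missing, frac = valid_barcode_fraction(tmp.name)
os.remove(tmp.name)
print(total, missing, frac)  # 4 1 0.75
```

Note that this fraction is computed over the candidate reads kept by fastq-extractor, not over all raw reads.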

@ruffyp

ruffyp commented Nov 19, 2024

Thanks for your reply!

I used the following command:

run-trust4 --readFormat bc:0:15,r1:26:-1 --barcodeWhitelist 737K-august-2016.txt -t 16 -f human_imgt_ref_241025.fa --ref human_imgt_ref.fa -1 R1_001.fastq.gz -2 R2_001.fastq.gz --barcode R1_001.fastq.gz -o result_tmp --outputReadAssignment
and got the following files:
[image: listing of the output files]

The number of reads in result_tmp_toassemble_bc.fa:

[image: read count]

The nohup.out file:

[Mon Nov 18 11:11:12 2024] TRUST4 v1.1.4-r534 begins.
[Mon Nov 18 11:11:12 2024] SYSTEM CALL: fastq-extractor -t 16 -f human_imgt_ref_241025.fa -o result_tmp_toassemble --readFormat bc:0:15,r1:26:-1 --barcodeWhitelist 737K-august-2016.txt -1 R1_001.fastq.gz -2 R2_001.fastq.gz --barcode R1_001.fastq.gz
[Mon Nov 18 11:11:13 2024] Start to extract candidate reads from read files.
[Mon Nov 18 11:32:59 2024] Finish extracting reads.
[Mon Nov 18 11:33:00 2024] SYSTEM CALL: trust4 -t 16 -f human_imgt_ref_241025.fa -o result_tmp -1 result_tmp_toassemble_1.fq -2 result_tmp_toassemble_2.fq --barcode result_tmp_toassemble_bc.fa
[Mon Nov 18 11:33:02 2024] Read in and count kmers for 100000 reads.
[Mon Nov 18 11:33:04 2024] Read in and count kmers for 200000 reads.
...
[Mon Nov 18 12:03:24 2024] Read in and count kmers for 38800000 reads.
[Mon Nov 18 12:38:04 2024] Found 38561994 reads.
[Mon Nov 18 12:38:13 2024] Get barcode-wise kmer count.
[Mon Nov 18 12:44:42 2024] Finish barcode-wise kmer count.
[Mon Nov 18 12:46:31 2024] Finish sorting the reads.
[Mon Nov 18 12:51:04 2024] Finish rough annotations.
[Mon Nov 18 12:51:28 2024] Processed 100000 reads (79135 are used for assembly).
[Mon Nov 18 13:33:47 2024] Processed 35900000 reads (25176464 are used for assembly).
[Mon Nov 18 13:34:17 2024] Assembled 25239130 reads.
[Mon Nov 18 13:34:17 2024] Try to rescue 3600898 reads for assembly.
[Mon Nov 18 13:51:03 2024] Rescued 373810 reads.
...
[Mon Nov 18 14:08:06 2024] Finish annotation.
...
[Mon Nov 18 14:09:48 2024] TRUST4 finishes.

@ruffyp

ruffyp commented Nov 20, 2024

Hi, do you have any advice on this situation? Please let me know if you need more information.

@mourisl
Collaborator

mourisl commented Nov 20, 2024

Since "result_tmp" is a pretty generic name, I'm wondering whether you are running multiple instances of TRUST4 at the same time, so that the intermediate files are overwritten by another sample?

@ruffyp

ruffyp commented Nov 21, 2024

Thanks for your advice. I created a new directory, changed the '-o' parameter, ran the command only once, and got the same result as before.

the command:
run-trust4 --readFormat bc:0:15,um:16:25,r1:26:-1 --barcodeWhitelist 737K-august-2016.txt -t 16 -f human_imgt_ref.fa --ref human_imgt_ref.fa -1 R1_001.fastq.gz -2 R2_001.fastq.gz --barcode R1_001.fastq.gz -o Test_TRUST410x --outputReadAssignment

All files, and the number of reads in toassemble_bc.fa:
[image: file listing and read count]

The nohup.out file:
[Wed Nov 20 09:31:01 2024] TRUST4 v1.1.4-r534 begins.
[Wed Nov 20 09:31:01 2024] SYSTEM CALL: fastq-extractor -t 16 -f human_imgt_ref.fa -o Test_TRUST410x_toassemble --readFormat bc:0:15,um:16:25,r1:26:-1 --barcodeWhitelist 737K-august-2016.txt -1 R1_001.fastq.gz -2 R2_001.fastq.gz --barcode R1_001.fastq.gz
[Wed Nov 20 09:31:01 2024] Start to extract candidate reads from read files.
[Wed Nov 20 09:52:57 2024] Finish extracting reads.
[Wed Nov 20 09:52:57 2024] SYSTEM CALL: trust4 -t 16 -f human_imgt_ref.fa -o Test_TRUST410x -1 Test_TRUST410x_toassemble_1.fq -2 Test_TRUST410x_toassemble_2.fq --barcode Test_TRUST410x_toassemble_bc.fa
[Wed Nov 20 09:53:00 2024] Read in and count kmers for 100000 reads.
[Wed Nov 20 09:53:02 2024] Read in and count kmers for 200000 reads.
...
[Wed Nov 20 10:26:21 2024] Read in and count kmers for 38700000 reads.
[Wed Nov 20 10:26:28 2024] Read in and count kmers for 38800000 reads.
[Wed Nov 20 11:07:11 2024] Found 38561994 reads.
[Wed Nov 20 11:07:30 2024] Get barcode-wise kmer count.
[Wed Nov 20 11:16:09 2024] Finish barcode-wise kmer count.
[Wed Nov 20 11:17:45 2024] Finish sorting the reads.
[Wed Nov 20 11:22:19 2024] Finish rough annotations.
[Wed Nov 20 11:22:39 2024] Processed 100000 reads (79135 are used for assembly).
[Wed Nov 20 11:22:39 2024] Processed 200000 reads (161463 are used for assembly).
...
[Wed Nov 20 12:20:21 2024] Processed 35900000 reads (25176464 are used for assembly).
[Wed Nov 20 12:21:12 2024] Assembled 25239130 reads.
[Wed Nov 20 12:21:12 2024] Try to rescue 3600898 reads for assembly.
[Wed Nov 20 12:42:47 2024] Rescued 373810 reads.
[Wed Nov 20 12:49:14 2024] SYSTEM CALL: annotator -f human_imgt_ref.fa -a Test_TRUST410x_final.out -t 16 -o Test_TRUST410x --barcode --readAssignment Test_TRUST410x_assign.out -r Test_TRUST410x_assembled_reads.fa --airrAlignment > Test_TRUST410x_annot.fa
[Wed Nov 20 12:49:14 2024] Start to annotate assemblies.
[Wed Nov 20 12:51:23 2024] Start to realign reads for CDR3 analysis.
[Wed Nov 20 13:03:00 2024] Compute CDR3 abundance.
[Wed Nov 20 13:03:13 2024] Finish annotation.
[Wed Nov 20 13:04:10 2024] SYSTEM CALL: perl trust-barcoderep.pl Test_TRUST410x_cdr3.out -a Test_TRUST410x_annot.fa --chainsInBarcode 2 > Test_TRUST410x_barcode_report.tsv
[Wed Nov 20 13:04:36 2024] SYSTEM CALL: perl trust-simplerep.pl Test_TRUST410x_cdr3.out --barcodeCnt --filterBarcoderep Test_TRUST410x_barcode_report.tsv > Test_TRUST410x_report.tsv
[Wed Nov 20 13:04:41 2024] SYSTEM CALL: perl trust-airr.pl Test_TRUST410x_report.tsv Test_TRUST410x_annot.fa --airr-align Test_TRUST410x_airr_align.tsv > Test_TRUST410x_airr.tsv
[Wed Nov 20 13:04:46 2024] SYSTEM CALL: perl trust-airr.pl Test_TRUST410x_barcode_report.tsv Test_TRUST410x_annot.fa --format barcoderep --airr-align Test_TRUST410x_airr_align.tsv > Test_TRUST410x_barcode_airr.tsv
[Wed Nov 20 13:04:56 2024] TRUST4 finishes.

Is there any further testing I can do?

@mourisl
Collaborator

mourisl commented Nov 21, 2024

Can you show me the first few lines of the two input fastq files?

@ruffyp

ruffyp commented Nov 21, 2024

R1_001.fastq.gz

@LH00169:488:22HLWYLT4:7:1101:2141:1042 1:N:0:AAGATTGGAT+AAATCCCGCT
ANGGGTCGTCAGGACATGGAGCTCTCTTTCTTATATGGGGAAGCCCTGAATCAGATGCAGTGCTTCCTGTCCCTCTGTGCCATGGGCCCCGGGCTCCTCTGCTGGGCACTGCTTTGTCTCCTGGGAGCAGGCTTAGTGGACGCTGGAGTC
+
I#I-IIIIIIIIIII9I99II9IIII9II-9I-IIIIIIIII9IIIII9I9IIIIIIIIIII9IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IIIIIIIIIIIIIIIIIIIIII9III-9IIIIIIIIIIIIIIIIIIIIIIII
@LH00169:488:22HLWYLT4:7:1101:2157:1042 1:N:0:AAGATTGGAT+AAATCCCGCT
TNGTACCTCGGATGTTGCCCCCGCCGTTTCTTATATGGCCTCAGTTCCGAAAACCAACAAAATAGAACCGCGGTCCTATTCCATTATTCCTAGCTGCGGCATCCAGGCGGCTCGGGCCTGCTTTGAACACTCTAATTTTTTCAAAGTAAA
+
I#IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IIIIIIIIIIIIIII9II-IIIIIIIIIIIIIII-IIIIIIIIIIIIIIIIIIIIIII9II9
@LH00169:488:22HLWYLT4:7:1101:5151:1042 1:N:0:AAGATTGGAT+AAATCCCGCT
ANCTGCCCAAGTTCTGGCCGGCGAAATTTCTTATATGGGGATTAAGAGGGACGGCCGGGGGCATTCGTATTGCGCCGCTAGAGGTGAAATTCTTGGACCGGCGCAAGACGGACCAGAGCGAAAGCATTTGCCAAGAATGTTTTCATTAAT

R2_001.fastq.gz

@LH00169:488:22HLWYLT4:7:1101:2141:1042 2:N:0:AAGATTGGAT+AAATCCCGCT
TNTGATGGCTCAAACACAGCGACCTCGGGTGGGAACACGTTTTTCAGGTCCTCTAGCACGGTGAGCCGTGTCCCTGGCCCGAAGAACTGCTCATTGACAGTCCCCCAACTGCTGGCACAGAGATAGAGGGCCGAGTCCCCCAGCAACAAG
+
I#9II9IIII9III9IIIIIIIIIIIIIIIIIIIIII9IIIIIII-IIII9IIIIII9IIIIIII9IIIIIIIIIIIIIII9IIIIII9IIIIIIIIIIIIIIIIIIIIII99IIIIIIIIIIIIIIIIIIIIIIIIIII-99IIIIIII
@LH00169:488:22HLWYLT4:7:1101:2157:1042 2:N:0:AAGATTGGAT+AAATCCCGCT
GNTAGTGACGAAAAATAACAATACAGGACTCTTTCGAGGCCCTGTAATTGGAATGAGTCCACTTTAAATCCTTTAACGAGGATCCATTGGAGGGCAAGTCTGGTGCCAGCAGCCGCGGTAATTCCAGCTCCAATAGCGTATATTAAAGTT
+
I#9IIIIIIIII9IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IIIIIIIIIIIIIIIIIIIIIIIIIIIIII-III9IIIIIIIIIIIIIIIII-IIIIIIIIIIIIIIIIIIIIIIIIII
@LH00169:488:22HLWYLT4:7:1101:5151:1042 2:N:0:AAGATTGGAT+AAATCCCGCT
ANTCCTGTCCGTGTCCGGGCCGGGTGAGGTTTCCCGTGTTGGGTCAAATTAAGCCGCAGGCTCCACTCCTGGTGGTGCCCTTCCGTCAATTCCTTTAAGTTTCAGCTTTGCAACCATACTCCCCCCGGAACCCAAAGACTTTGGTTTCCC

@mourisl
Collaborator

mourisl commented Nov 21, 2024

Oh, the read count in the log counts each paired-end read twice: once for read 1 and once for read 2. So 20M read pairs correspond to around 40M reads in the k-mer counting step, and the numbers are consistent. I should have given this explanation much earlier...
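
As a quick arithmetic check of this explanation, the Python snippet below uses only numbers already reported in this thread: the record count in the toassemble files and the "Found ... reads" total from the log.

```python
pairs_in_toassemble = 20_173_744   # records in each *_toassemble file (from the thread)
reads_found_in_log = 38_561_994    # "Found 38561994 reads." in the trust4 log

# If each mate is counted separately, the log total cannot exceed 2 x pairs.
upper_bound = 2 * pairs_in_toassemble
ratio = reads_found_in_log / pairs_in_toassemble

print(upper_bound)       # 40347488
print(round(ratio, 2))   # 1.91
```

The logged total works out to about 1.91 reads per pair, which fits under the 2x bound and so agrees with per-mate counting.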

@ruffyp

ruffyp commented Nov 21, 2024

That makes sense! Now we know:

  1. The number of raw reads is 45,304,657 in each fastq.gz file. After fastq-extractor, we have 20,173,744 reads in toassemble_1.fq, toassemble_2.fq, and toassemble_bc.fa.

  2. As mentioned earlier, uncorrectable barcodes are labeled as 'missing_barcode' in the toassemble_bc file, and this information can be used to infer the fraction of reads with valid barcodes. There are 736,202 reads labeled as 'missing_barcode' in the toassemble_bc.fa file.

So could we assume that:

  1. Valid Barcodes = (20173744-736202)/20173744 = 0.9635069 ?
    This can be understood as a statistic on a sample of the reads, but it appears overestimated relative to the result computed on raw reads: the 10x report gives Valid Barcodes as 83.0%.

  2. Reads Mapped to Any V(D)J Gene = 20173744/45304657 = 0.4452907 ?
    The 10x report gives 52.1% for this metric, which is the fraction of reads that partially or wholly map to any germline V(D)J gene segment. What do you think are the possible sources of the difference: the aligners, or what the two numbers represent?
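
For reference, the two ratios above can be reproduced with a few lines of Python. The variable names are illustrative, and all inputs are the counts quoted in this comment.

```python
raw_reads = 45_304_657        # reads in each input fastq.gz
extracted = 20_173_744        # reads kept by fastq-extractor
missing_barcode = 736_202     # extracted reads labeled 'missing_barcode'

# Fraction of extracted reads whose barcode was valid or correctable.
valid_barcode_frac = (extracted - missing_barcode) / extracted
# Fraction of raw reads kept as TCR/BCR candidates.
extracted_frac = extracted / raw_reads

print(f"{valid_barcode_frac:.7f}")  # 0.9635069
print(f"{extracted_frac:.7f}")      # 0.4452907
```

The first ratio has the extracted candidate reads as its denominator, whereas the 10x definition quoted earlier in this thread is a fraction of reads, so the two are not computed over the same population.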

@mourisl
Collaborator

mourisl commented Nov 21, 2024

  1. The overestimation could be real. Since the candidate reads have to match the TCR/BCR reference, many low-quality or low-complexity reads are also filtered out. Those reads usually have uncorrectable barcodes.
  2. I'm not sure about this difference. This is an interesting finding. The fastq-extractor's alignment is very loose, so there are usually many false positives. TRUST4 and cellranger use different reference sequences, so that could be the source.

@ruffyp

ruffyp commented Nov 22, 2024

Thanks very much for your reply! I will run more tests.
