Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

very low number of mapped read when using --read-format #136

Open
nouroddinc opened this issue Aug 26, 2023 · 15 comments
Open

very low number of mapped read when using --read-format #136

nouroddinc opened this issue Aug 26, 2023 · 15 comments
Labels
bug Something isn't working

Comments

@nouroddinc
Copy link

Hi chromap team,
Thank you so much for preparing this helpful tools and I'm very impressed by its performances. I am testing it on my scATAC-seq data and it works pretty good when I don't use --read-format.

I have a scATAC-seq with R1 and R2 to be mapped and cell barcodes in R2 ( at positions bc1:22:29 and bc2:60-67)

Looking at my R2 file:

# zcat read_R2.fastq.gz | head

@lh00134:68:225MHKLT3:7:1101:1157:1080 2:N:0:TAAGGC
CACGCGTTGACTTCTCGCATCTCAACCACAATCCACGTGCTTGAGAGGCCAGAGCATTCGCGACTGGAGTGGCCGATGTTTCGCATCGGCGTACGACTAGATGTGTATAAGAGACAGGCCTAGTCGGCCATCATAGGAATGAGAGGCCCT
+
---55-FFF-F-FFFFFFFF5FFFF5555555-555555-5-555F-FFFFFFFF-FFFFFFFFFFFFFFFFFFFFFFF-FFFFFFFFFFFFFFFFF-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF5FFFFFF5FF
@lh00134:68:225MHKLT3:7:1101:1583:1080 2:N:0:TAAGGC
CAAGCGTTGACTTCTCGCATCTCCTCCTGAATCCACGTGCTTGAGAGGCCAGAGCATTCGGACTAGTAGTGGCCGATGTTTCGCATCGGCGTACGACTAGATGTGTATAAGAGACAGGCACTATAGGGGAGCGTCCGAGTTTCAAACAAA
+
--F5F55-F--F5-F-FFFFFFFFFFFFFFFFFFFFF-FFF-FFFF-FFFFFFFFFF-FFFFFFFFFFFFFF-FFFFFFFFFF5FFFFFFFFFFFFFFFFFFFFF5FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@lh00134:68:225MHKLT3:7:1101:1619:1080 2:N:0:TAAGGC
CAAGCGCTGGCTTCTCGCATCTCAGCGTTAATCCACGTGCTTGAGAGGCCAGAGCATTCGACCACTGTGTGGCCGATGTTTCGCATCGGCGTACGACTAGATGTGTATAAGAGACAGGCCCTGACCCTAGCGGGGAGGGTGAGGAAGCTG

using the followoing command:

chromap  \
          --preset atac \
          -x GRCm38_genome.index  \
          -r GRCm38_genome.fa \
          -1 read_R1.fastq.gz \
          -2 read_R2.fastq.gz \
          -o aln.bed \
          -b read_R2.fastq.gz \
          --read-format bc:60:67,bc:22:29,r1:0:-1,r2:117:-1 \
          -t 10 \
          --barcode-whitelist bc50.txt \
          2> chromap_with_read_format.log

gives me very low number of the mapped reads as the chromap_with_read_format.log
shows.

but if I extract barcodes from read_R2.fq.gz into read_R3.fq and leave the genome in read_R2.fq as a new file like the below:

# cat read_R2.fastq | head
@lh00134:68:225MHKLT3:7:1101:1157:1080 2:N:0:TAAGGC
GCCTAGTCGGCCATCATAGGAATGAGAGGCCCT
+
FFFFFFFFFFFFFFFFFFFFFFF5FFFFFF5FF
@lh00134:68:225MHKLT3:7:1101:1583:1080 2:N:0:TAAGGC
GCACTATAGGGGAGCGTCCGAGTTTCAAACAAA
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@lh00134:68:225MHKLT3:7:1101:1619:1080 2:N:0:TAAGGC
GCCCTGACCCTAGCGGGGAGGGTGAGGAAGCTG

and

# cat read_R3.fastq | head
CAACCACACGACTGGA
+
FFF55555FFFFFFFF
@lh00134:68:225MHKLT3:7:1101:1583:1080 2:N:0:TAAGGC
CCTCCTGAGACTAGTA
+
FFFFFFFFFFFFFFFF
@lh00134:68:225MHKLT3:7:1101:1619:1080 2:N:0:TAAGGC
CAGCGTTAACCACTGT

and then run like this:

chromap  \
          --preset atac \
          -x GRCm38_genome.index  \
          -r GRCm38_genome.fa \
          -1 read_R1.fastq.gz \
          -2 read_R2.fq \
          -o aln.bed \
          -b read_R3.fq \
          -t 10 \
          --barcode-whitelist bc50.txt \
          2> chromap_without_read_format.log

it works very good as log file chromap_without_read_format.log shows. my chromap version is

# ./chromap/chromap -v
0.2.5-r473

and this is my barcode whitelist file:

# cat bc50.txt | head
AACGTGATAACGTGAT
AACGTGATGCGAGTAA
AACGTGATATGCCTAA
AACGTGATGCTAACGA
AACGTGATACCACTGT
AACGTGATACATTGGC
AACGTGATCAGATCTG
AACGTGATCATCAAGT
AACGTGATCGCTGATC
AACGTGATACAAGCTA

I would appreciate if you could help me to find what i am doing wrong because using --read-format argument can make it much faster and easier to run my pipeline.

many Thanks,
Noori

@nouroddinc nouroddinc added the bug Something isn't working label Aug 26, 2023
@haowenz
Copy link
Owner

haowenz commented Aug 26, 2023

@mourisl Is this way of using read-format supported?

@mourisl
Copy link
Collaborator

mourisl commented Aug 26, 2023

@nouroddinc Is your barcode starting from 60-67 followed by 22-29? Chromap will internally sort it to 22-29 followed by 60-67 to avoid typos in ranges. Maybe this is an undesirable feature now.

@nouroddinc
Copy link
Author

@haowenz @mourisl thanks for helping me on this. yes, my barcode starts from 60-67 and ends with 22-29. I have tested with --read-format bc:22:29,bc:60:67,r1:0:-1,r2:117:-1 and the output log file and bed file is exactly the same as when I tested --read-format bc:60:67,bc:22:29,r1:0:-1,r2:117:-1 . I should also mention that my barcode file bc50.txt includes both barcodes bc:22:29+60:67 and bc:60:67+22:29 in some cases. For example

# cat bc50.txt | head
AACGTGATAACGTGAT
AACGTGATGCGAGTAA
AACGTGATATGCCTAA
AACGTGATGCTAACGA
AACGTGATACCACTGT
AACGTGATACATTGGC
AACGTGATCAGATCTG
AACGTGATCATCAAGT
AACGTGATCGCTGATC
AACGTGATACAAGCTA

I have both combinations of barcodes AACGTGAT and AACGTGAT which is AACGTGATATGCCTAA and ATGCCTAAAACGTGAT but I only have AACGTGATGCGAGTAA and Not GCGAGTAAAACGTGAT

@mourisl
Copy link
Collaborator

mourisl commented Aug 27, 2023

@nouroddinc I have removed the sort in the branch "li_dev4". Could you please checkout this branch, "make clean; make" to recompile chromap, and give the new version a try?

@nouroddinc
Copy link
Author

nouroddinc commented Aug 27, 2023

@mourisl I have tested the new branch and the output still shows the low number of mapped reads!
chromap_with_read_format_new_branch.log

@mourisl
Copy link
Collaborator

mourisl commented Aug 28, 2023

I rechecked your running log. For the "_without_read_format" run, you used the files

1th read 1 file: cellranger_inputs/6bp_D01291_NG02620_S1_L001_R1_001.fastq.gz
1th read 2 file: cellranger_inputs/6bp_D01291_NG02620_S1_L001_R3_001.fastq
1th cell barcode file: cellranger_inputs/6bp_D01291_NG02620_S1_L001_R2_001.fastq

For the files with "_with_read_format", the files are:

1th read 1 file: cellranger_inputs/6bp_D01291_NG02620_S1_L001_R1_001.fastq.gz
1th read 2 file: 6bp_D01291_NG02620_linker2_R2.fastq.gz
1th cell barcode file: 6bp_D01291_NG02620_linker2_R2.fastq.gz

Since some of the files are in cellranger_inputs, and some are not, I just want to make sure you used the right input files.

@nouroddinc
Copy link
Author

@mourisl I think you might be right. I tested with a new data set with smaller size to get faster output and sounds like they are very similar but still not identical. Is that normal?

chromap_with_read_format.log

chromap_without_read_format.log

@mourisl
Copy link
Collaborator

mourisl commented Aug 28, 2023

The "Number of barcodes in whitelist" is still different "580224" vs "539675". So I guess the manually extracted barcode and the automatic extracted barcode are still different. Could you please try:

  1. Reverse the order of "bc:" in --read-format, just to make sure the new implementation is correct.
  2. Could you please run Chromap with "--SAM" to generate the SAM format output. Then you can compare the barcode from the same read in two different input and check which one is correct.

Thank you for helping us debug the issue.

@nouroddinc
Copy link
Author

@mourisl is this what you asked for?

1- I have tried reverse barcode like bc:60:67:-,bc:22:29:-,r1:0:-1,r2:117:-1 and it gives Less than 5% barcodes can be found or corrected based on the barcode whitelist error which makes sense to me as the log file shows chromap_with_read_format_rev.log

2- I have extracted CB:Z tag barcodes from both sam files and they match the whitelist. I mean I can't see any wrong barcode in neither of them.

barcodes_w_rf.txt

barcodes_wo_rf.txt

@mourisl
Copy link
Collaborator

mourisl commented Aug 28, 2023

  1. I meant to change the order in read format as bc:22:29,bc:60:67.
  2. Since SAM file contains the read id, you can check whether the same read will get the same barcode in this format.

Thank you for the testing.

@nouroddinc
Copy link
Author

alright, I have two outputs.

  1. with the normal barcode (bc:60:67,bc:22:29)
    chromap_with_read_format.log

sam files output in normal barcode is like this:

# samtools view -h aln.sam | sed -n '67,74 p'

lh00134:68:225MHKLT3:7:1281:23172:18977	99	chr1	3009659	60	115M	=	3009741	115	TAGAGTGACTGTTAAGAAGTTATGTGAAATTCTAGGAAGTCTGCCTTCATATGTTACTTTGCCTTTTCCCATTTCAGCTTTTAATATTCTTTTTTGATGTTGTTGTTCTATAAGT	FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF5FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF	NM:i:0	MD:Z:115	CB:Z:ACTATGCACACTTCGA
lh00134:68:225MHKLT3:7:1281:23172:18977	147	chr1	3009741	60	33M	=	3009659	-115	AATATTCTTTTTTGATGTTGTTGTTCTATAAGT	FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF	NM:i:0	MD:Z:33	CB:Z:ACTATGCACACTTCGA
lh00134:68:225MHKLT3:7:2242:32307:22630	163	chr1	3045894	60	33M	=	3046071	327	TAGTATAACTGTGTAGTGAAAGCTATTTGCTCA	FFFFFFFFFFFFFFFFFF5FFFFFFFFFFFFFF	NM:i:0	MD:Z:33	CB:Z:GAATCTGAAAGGTACA
lh00134:68:225MHKLT3:7:2242:32307:22630	83	chr1	3046071	60	150M	=	3045894	-327	AAGTGCGCACATAGGATTTGTAGTATCTGCAATTTTGAGCACCTCATGGTATTCATGGAACATGTTCCAATGAACCATATGGATTTGTCATGTCTTAACTGATCAAAAGATTGGCCTTCCATTGTAGGGCCAAGTAACCTTATGCTAAGG	FFFFF5FFFFFFFFF5FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF	NM:i:3	MD:Z:5G20A86A36	CB:Z:GAATCTGAAAGGTACA
lh00134:68:225MHKLT3:7:1162:45492:5711	163	chr1	3062726	46	33M	=	3062726	106	GTATGGAAGGAAACGGCGAGAGAAATAATGGAG	FFFFFFFFFFFFFFFFFFFF5FFFFFFFFFFFF	NM:i:0	MD:Z:33	CB:Z:ACACAGAAAGTACAAG
lh00134:68:225MHKLT3:7:1162:45492:5711	83	chr1	3062726	46	106M	=	3062726	-106	GTATGGAAGGAAACGGCGAGAGAAATAATGGAGGCCAAGACATGTTTCTGTTCAAGGCCCCCAAGTTTACTAAGAGTCTGTGCTTATAAAGTGGGGAGGCCCATCC	FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF	NM:i:1	MD:Z:103C2	CB:Z:ACACAGAAAGTACAAG
lh00134:68:225MHKLT3:7:2149:19696:3243	99	chr1	3094994	60	73M	=	3095034	73	GCTCAATGAACTTGGCTTTGGGAACTGCTCCTGTGGTAACTAGCAGTCACATCTGTGTGTACCAATCTGCATA	FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF	NM:i:0	MD:Z:73	CB:Z:AAGGTACAGAGCTGAA
lh00134:68:225MHKLT3:7:2149:19696:3243	147	chr1	3095034	60	33M	=	3094994	-73	TAGCAGTCACATCTGTGTGTACCAATCTGCATA	FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF	NM:i:0	MD:Z:33	CB:Z:AAGGTACAGAGCTGAA


  1. the reverse barcode (bc:22:29,bc:60:67) log file:

chromap_with_read_format_rev.log

and reverse barcode sam file is like :

# samtools view -h aln_rev.sam | sed -n '67,74 p'

lh00134:68:225MHKLT3:7:1281:23172:18977	99	chr1	3009659	60	115M	=	3009741	115	TAGAGTGACTGTTAAGAAGTTATGTGAAATTCTAGGAAGTCTGCCTTCATATGTTACTTTGCCTTTTCCCATTTCAGCTTTTAATATTCTTTTTTGATGTTGTTGTTCTATAAGT	FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF5FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF	NM:i:0	MD:Z:115	CB:Z:CACTTCGAACTATGCA
lh00134:68:225MHKLT3:7:1281:23172:18977	147	chr1	3009741	60	33M	=	3009659	-115	AATATTCTTTTTTGATGTTGTTGTTCTATAAGT	FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF	NM:i:0	MD:Z:33	CB:Z:CACTTCGAACTATGCA
lh00134:68:225MHKLT3:7:2242:32307:22630	163	chr1	3045894	60	33M	=	3046071	327	TAGTATAACTGTGTAGTGAAAGCTATTTGCTCA	FFFFFFFFFFFFFFFFFF5FFFFFFFFFFFFFF	NM:i:0	MD:Z:33	CB:Z:AAGGTACAGAATCTGA
lh00134:68:225MHKLT3:7:2242:32307:22630	83	chr1	3046071	60	150M	=	3045894	-327	AAGTGCGCACATAGGATTTGTAGTATCTGCAATTTTGAGCACCTCATGGTATTCATGGAACATGTTCCAATGAACCATATGGATTTGTCATGTCTTAACTGATCAAAAGATTGGCCTTCCATTGTAGGGCCAAGTAACCTTATGCTAAGG	FFFFF5FFFFFFFFF5FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF	NM:i:3	MD:Z:5G20A86A36	CB:Z:AAGGTACAGAATCTGA
lh00134:68:225MHKLT3:7:1162:45492:5711	163	chr1	3062726	46	33M	=	3062726	106	GTATGGAAGGAAACGGCGAGAGAAATAATGGAG	FFFFFFFFFFFFFFFFFFFF5FFFFFFFFFFFF	NM:i:0	MD:Z:33	CB:Z:AGTACAAGACACAGAA
lh00134:68:225MHKLT3:7:1162:45492:5711	83	chr1	3062726	46	106M	=	3062726	-106	GTATGGAAGGAAACGGCGAGAGAAATAATGGAGGCCAAGACATGTTTCTGTTCAAGGCCCCCAAGTTTACTAAGAGTCTGTGCTTATAAAGTGGGGAGGCCCATCC	FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF	NM:i:1	MD:Z:103C2	CB:Z:AGTACAAGACACAGAA
lh00134:68:225MHKLT3:7:2149:19696:3243	99	chr1	3094994	60	73M	=	3095034	73	GCTCAATGAACTTGGCTTTGGGAACTGCTCCTGTGGTAACTAGCAGTCACATCTGTGTGTACCAATCTGCATA	FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF	NM:i:0	MD:Z:73	CB:Z:GAGCTGAAAAGGTACA
lh00134:68:225MHKLT3:7:2149:19696:3243	147	chr1	3095034	60	33M	=	3094994	-73	TAGCAGTCACATCTGTGTGTACCAATCTGCATA	FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF	NM:i:0	MD:Z:33	CB:Z:GAGCTGAAAAGGTACA

and the read2 file is this:

# zcat read_R2.fastq.gz | grep -A 3 -F -f list_of_read_ids.txt | awk 'BEGIN{RS="--"; ORS=""} {print $0}'

@lh00134:68:225MHKLT3:7:2149:19696:3243 2:N:0:TNAGGC
CACGCATTTGCTTCTCGCATCTGAGCTGAAATCCACGTGCTTGAGAGGCCAGAGCATTCGAAGGTACAGTGGCCGATGTTTCGCATCGGCGTACGACTAGATGTGTATAAGAGACAGTATGCAGATTGGTACACACAGATGTGACTGCTA
+
F5F-FF-FF-FFFF5FFFFFFFFFFFFFFFFFFFFFFFFF-FFFFFFFFF-FFFFFFFFFFFFFFFFFFFFFFFFF5FFFFFFFFFFFFFFFFF5FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF

@lh00134:68:225MHKLT3:7:1281:23172:18977 2:N:0:TNAGGC
CAAGCTTTGGCTTCTCGCATCTCACTTCGAATCCACGTGCTCGATAGGCCAGAGCATTCGACTATGCAGTGGCCGATGTTTCGCATCGGCGTACGACTAGATGTGTATAAGAGACAGACTTATAGAACAACAACATCAAAAAAGAATATT
+
F55FFF5FFFFFFFF55555555555-555555555555555FFFFFFFFFFFFFFF-F-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF

@lh00134:68:225MHKLT3:7:1162:45492:5711 2:N:0:TNAGGC
CACTCGTTGCCCTCTCGCATCTAGTACAAGATCCACGTGCTTGATAGGCCAGAGCATTCGACACAGAAGTGGCCGATGTTTCGCATCGGCGTACGACTAGATGTGTATAAGAGACAGGTATGGAAGGAAACGGCGAGAGAAATAATGGAG
+
F5-FF-5FFFF-F-FF-5FFF5555555555-555-55-555555555FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF55FFFFFFFFFFFFFFFFFFFFF5FFFFFFFFFFFF

@lh00134:68:225MHKLT3:7:2242:32307:22630 2:N:0:TNAGGC
CAAGCGTTGGCTTCTCGCATCTAAGGTACAATCCACGTGCTTGAGAGGCCAGAGCATTCGGAATCTGAGTGGCCGATGTTTCGCATCGGCGTACGACTAGATGTGTATAAGAGACAGTAGTATAACTGTGTAGTGAAAGCTATTTGCTCA
+
-FFF-FFFF555-FFFFF5FFFF5555555FFF-FFFFFFFFFFFFFF-FFFFFFFFFFFFFFFFFFFFFFFFFF5FFFFFFFFFFF5FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF5FFFFFFFFFFFFFF

hopefully, this is what you were looking for. Thanks

@mourisl
Copy link
Collaborator

mourisl commented Aug 28, 2023

Seems the barcode order issue is indeed fixed. Are the barcodes from readformat run the same as your manually-created barcode file (without readformat) based on the SAM file?

@nouroddinc
Copy link
Author

nouroddinc commented Aug 28, 2023

it sounds like NOT; they are in reverse of each other!

chromap --preset atac -x GRCm38_genome.index -r GRCm38_genome.fa -1 read_R1.fastq.gz -2 read_R2.fastq.gz --SAM -o aln.sam -b read_R2.fastq.gz --read-format bc:60:67,bc:22:29,r1:0:-1,r2:117:-1 -t 10 --barcode-whitelist bc50.txt 2> chromap_with_read_format.log

the out put is

# samtools view -h aln.sam | sed -n '67,74 p'
lh00134:68:225MHKLT3:7:1281:23172:18977	99	chr1	3009659	60	115M	=	3009741	115	TAGAGTGACTGTTAAGAAGTTATGTGAAATTCTAGGAAGTCTGCCTTCATATGTTACTTTGCCTTTTCCCATTTCAGCTTTTAATATTCTTTTTTGATGTTGTTGTTCTATAAGT	FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF5FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF	NM:i:0	MD:Z:115	CB:Z:ACTATGCACACTTCGA
lh00134:68:225MHKLT3:7:1281:23172:18977	147	chr1	3009741	60	33M	=	3009659	-115	AATATTCTTTTTTGATGTTGTTGTTCTATAAGT	FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF	NM:i:0	MD:Z:33	CB:Z:ACTATGCACACTTCGA
lh00134:68:225MHKLT3:7:2242:32307:22630	163	chr1	3045894	60	33M	=	3046071	327	TAGTATAACTGTGTAGTGAAAGCTATTTGCTCA	FFFFFFFFFFFFFFFFFF5FFFFFFFFFFFFFF	NM:i:0	MD:Z:33	CB:Z:GAATCTGAAAGGTACA
lh00134:68:225MHKLT3:7:2242:32307:22630	83	chr1	3046071	60	150M	=	3045894	-327	AAGTGCGCACATAGGATTTGTAGTATCTGCAATTTTGAGCACCTCATGGTATTCATGGAACATGTTCCAATGAACCATATGGATTTGTCATGTCTTAACTGATCAAAAGATTGGCCTTCCATTGTAGGGCCAAGTAACCTTATGCTAAGG	FFFFF5FFFFFFFFF5FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF	NM:i:3	MD:Z:5G20A86A36	CB:Z:GAATCTGAAAGGTACA
lh00134:68:225MHKLT3:7:1162:45492:5711	163	chr1	3062726	46	33M	=	3062726	106	GTATGGAAGGAAACGGCGAGAGAAATAATGGAG	FFFFFFFFFFFFFFFFFFFF5FFFFFFFFFFFF	NM:i:0	MD:Z:33	CB:Z:ACACAGAAAGTACAAG
lh00134:68:225MHKLT3:7:1162:45492:5711	83	chr1	3062726	46	106M	=	3062726	-106	GTATGGAAGGAAACGGCGAGAGAAATAATGGAGGCCAAGACATGTTTCTGTTCAAGGCCCCCAAGTTTACTAAGAGTCTGTGCTTATAAAGTGGGGAGGCCCATCC	FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF	NM:i:1	MD:Z:103C2	CB:Z:ACACAGAAAGTACAAG
lh00134:68:225MHKLT3:7:2149:19696:3243	99	chr1	3094994	60	73M	=	3095034	73	GCTCAATGAACTTGGCTTTGGGAACTGCTCCTGTGGTAACTAGCAGTCACATCTGTGTGTACCAATCTGCATA	FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF	NM:i:0	MD:Z:73	CB:Z:AAGGTACAGAGCTGAA
lh00134:68:225MHKLT3:7:2149:19696:3243	147	chr1	3095034	60	33M	=	3094994	-73	TAGCAGTCACATCTGTGTGTACCAATCTGCATA	FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF	NM:i:0	MD:Z:33	CB:Z:AAGGTACAGAGCTGAA

and without --read-format

chromap --preset atac -x GRCm38_genome.index -r GRCm38_genome.fa -1 read_R1.fastq.gz -2 read_R2.fq --SAM -o aln_wo.sam -b read_R3.fq -t 10 --barcode-whitelist bc50.txt 2> chromap_without_read_format.log

# samtools view -h aln_wo.sam | sed -n '67,74 p'
lh00134:68:225MHKLT3:7:1281:23172:18977	99	chr1	3009659	60	115M	=	3009741	115	TAGAGTGACTGTTAAGAAGTTATGTGAAATTCTAGGAAGTCTGCCTTCATATGTTACTTTGCCTTTTCCCATTTCAGCTTTTAATATTCTTTTTTGATGTTGTTGTTCTATAAGT	FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF5FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF	NM:i:0	MD:Z:115	CB:Z:CACTTCGAACTATGCA
lh00134:68:225MHKLT3:7:1281:23172:18977	147	chr1	3009741	60	33M	=	3009659	-115	AATATTCTTTTTTGATGTTGTTGTTCTATAAGT	FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF	NM:i:0	MD:Z:33	CB:Z:CACTTCGAACTATGCA
lh00134:68:225MHKLT3:7:2242:32307:22630	163	chr1	3045894	60	33M	=	3046071	327	TAGTATAACTGTGTAGTGAAAGCTATTTGCTCA	FFFFFFFFFFFFFFFFFF5FFFFFFFFFFFFFF	NM:i:0	MD:Z:33	CB:Z:AAGGTACAGAATCTGA
lh00134:68:225MHKLT3:7:2242:32307:22630	83	chr1	3046071	60	150M	=	3045894	-327	AAGTGCGCACATAGGATTTGTAGTATCTGCAATTTTGAGCACCTCATGGTATTCATGGAACATGTTCCAATGAACCATATGGATTTGTCATGTCTTAACTGATCAAAAGATTGGCCTTCCATTGTAGGGCCAAGTAACCTTATGCTAAGG	FFFFF5FFFFFFFFF5FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF	NM:i:3	MD:Z:5G20A86A36	CB:Z:AAGGTACAGAATCTGA
lh00134:68:225MHKLT3:7:1162:45492:5711	163	chr1	3062726	46	33M	=	3062726	106	GTATGGAAGGAAACGGCGAGAGAAATAATGGAG	FFFFFFFFFFFFFFFFFFFF5FFFFFFFFFFFF	NM:i:0	MD:Z:33	CB:Z:AGTACAAGACACAGAA
lh00134:68:225MHKLT3:7:1162:45492:5711	83	chr1	3062726	46	106M	=	3062726	-106	GTATGGAAGGAAACGGCGAGAGAAATAATGGAGGCCAAGACATGTTTCTGTTCAAGGCCCCCAAGTTTACTAAGAGTCTGTGCTTATAAAGTGGGGAGGCCCATCC	FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF	NM:i:1	MD:Z:103C2	CB:Z:AGTACAAGACACAGAA
lh00134:68:225MHKLT3:7:2149:19696:3243	99	chr1	3094994	60	73M	=	3095034	73	GCTCAATGAACTTGGCTTTGGGAACTGCTCCTGTGGTAACTAGCAGTCACATCTGTGTGTACCAATCTGCATA	FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF	NM:i:0	MD:Z:73	CB:Z:GAGCTGAAAAGGTACA
lh00134:68:225MHKLT3:7:2149:19696:3243	147	chr1	3095034	60	33M	=	3094994	-73	TAGCAGTCACATCTGTGTGTACCAATCTGCATA	FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF	NM:i:0	MD:Z:33	CB:Z:GAGCTGAAAAGGTACA

this means that reverse barcode that i already tested is aligned with the one with out --read-format

@mourisl
Copy link
Collaborator

mourisl commented Aug 28, 2023

Could you please be more specific about whether the "without --read-format" run had mistakes due to the input order being wrong, or the "with --read-format" run generated wrong barcodes? Thank you.

@nouroddinc
Copy link
Author

@mourisl I think it works now pretty well with my manual barcode extraction pipeline. If I use --read-format bc:22:29,bc:60:67,r1:0:-1,r2:117:-1 which we called reverse barcode in the above that matches exactly with the manual barcode extraction one. You are good to close this. Thank you so much for your help!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants