Error preprocessing COVID-19 sample from SRA #418

yeredh · 2020-07-03T02:37:42Z

Hello,

I downloaded the FASTQ files for sample GSM4339771 (SRR11181956) from SRA in their original format
(https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR11181956),
so I ended up with these two files:

C143_R1.fastq.gz.1
C143_R2.fastq.gz.1

I was able to identify the cell barcodes with umi_tools:

umi_tools whitelist --stdin C143_R1_test.fastq.gz  \
                    --bc-pattern=CCCCCCCCCCCCCCCCNNNNNNNNNN \
                    --set-cell-number=100 \
                    --log2stderr > whitelist.txt;

However, the next step, extracting the barcodes and UMIs and adding them to the read names, fails:

umi_tools extract --bc-pattern=CCCCCCCCCCCCCCCCNNNNNNNNNN \
                  --stdin C143_R1.fastq.gz \
                  --stdout C143_R1_extracted.fastq.gz \
                  --read2-in C143_R2.fastq.gz  \
                  --read2-out=C143_R2_extracted.fastq.gz \
                  --filter-cell-barcode \
                  --whitelist=whitelist.txt;

I get the following error message:

ValueError: 
Read pairs do not match
CL200152206L1C001R001_0/1 != CL200152206L1C001R001_0/2

What am I doing wrong?

Thanks in advance for your time and attention,

Yered


TomSmithCGAT · 2020-07-03T13:31:10Z

Hi @yeredh - Could you please confirm which version of umi_tools you are using? Thanks.

yeredh · 2020-07-03T14:30:33Z

Hi @TomSmithCGAT ,

I am using

UMI-tools version: 1.0.1

Also, I noticed that the files were generated on a BGISEQ sequencer, not an Illumina one, so I guess the read headers have a different format.

yeredh · 2020-07-03T21:53:07Z

Hi,

I was directed to the solution here: #325

The following works:

umi_tools extract --bc-pattern=CCCCCCCCCCCCCCCCNNNNNNNNNN \
                  --stdin C143_R1.fastq.gz \
                  --stdout C143_R1_extracted.fastq.gz \
                  --read2-in C143_R2.fastq.gz  \
                  --read2-out=C143_R2_extracted.fastq.gz \
                  --filter-cell-barcode \
                  --read-name-suffix-strip \
                  --whitelist=whitelist.txt;
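
For anyone hitting the same error, here is a minimal Python sketch of why the pairing check fails on these headers. It is an illustration only, not the umi_tools implementation: the helper name `strip_pair_suffix` is hypothetical, and the two read names are copied from the error message above. The mates differ only in their trailing `/1` and `/2`, so a literal comparison of the names fails until that suffix is dropped.

```python
def strip_pair_suffix(read_name: str) -> str:
    """Drop a trailing /1 or /2 mate marker, if present (hypothetical helper)."""
    if read_name.endswith(("/1", "/2")):
        return read_name[:-2]
    return read_name


# Read names taken from the ValueError reported above.
r1_name = "CL200152206L1C001R001_0/1"
r2_name = "CL200152206L1C001R001_0/2"

# A literal comparison fails because the mate markers differ.
print(r1_name == r2_name)  # False

# With the mate markers removed, both names refer to the same fragment.
print(strip_pair_suffix(r1_name) == strip_pair_suffix(r2_name))  # True
```

Names without a `/1` or `/2` suffix pass through unchanged, so the same comparison would still work on headers that carry no mate marker.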

yeredh closed this as completed Jul 3, 2020

TomSmithCGAT mentioned this issue Jul 3, 2020

{ts} ignore read pair suffixes #421

Merged
