Error preprocessing COVID-19 sample from SRA #418

Closed
yeredh opened this issue Jul 3, 2020 · 3 comments

yeredh commented Jul 3, 2020

Hello,

I downloaded the FASTQ files for sample GSM4339771 (SRR11181956) from SRA in the original format
(https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR11181956),
so I ended up with these two files:

  • C143_R1.fastq.gz.1
  • C143_R2.fastq.gz.1

I was able to identify the cell barcodes with umi_tools:

umi_tools whitelist --stdin C143_R1_test.fastq.gz  \
                    --bc-pattern=CCCCCCCCCCCCCCCCNNNNNNNNNN \
                    --set-cell-number=100 \
                    --log2stderr > whitelist.txt;

However, when I tried the next step, extracting the barcodes and UMIs and adding them to the read names:

umi_tools extract --bc-pattern=CCCCCCCCCCCCCCCCNNNNNNNNNN \
                  --stdin C143_R1.fastq.gz \
                  --stdout C143_R1_extracted.fastq.gz \
                  --read2-in C143_R2.fastq.gz  \
                  --read2-out=C143_R2_extracted.fastq.gz \
                  --filter-cell-barcode \
                  --whitelist=whitelist.txt; 

I get the following error message:

ValueError: 
Read pairs do not match
CL200152206L1C001R001_0/1 != CL200152206L1C001R001_0/2

What am I doing wrong?

Thanks in advance for your time and attention,

Yered

@TomSmithCGAT
Member

Hi @yeredh - Could you please confirm which version of umi_tools you are using? Thanks.

@yeredh
Author

yeredh commented Jul 3, 2020

Hi @TomSmithCGAT,

I am using:

UMI-tools version: 1.0.1

Also, I noticed that the files were generated on a BGISEQ sequencer, not an Illumina one, so I guess the read headers have a different format.
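
One way to check this guess is to print the first read name from each file and compare them by eye. The helper below is only a sketch: `first_read_name` is a made-up name, and it assumes the files are gzip-compressed FASTQ with the read name on the first line. Going by the error message, the two names would be expected to differ only in a trailing /1 and /2.

```shell
# Hypothetical helper (not part of UMI-tools): print the first header line
# of a gzip-compressed FASTQ file so the read-name format can be inspected.
first_read_name() {
    gzip -dc "$1" | head -n 1
}

# Example usage, assuming the file names from this thread:
#   first_read_name C143_R1.fastq.gz
#   first_read_name C143_R2.fastq.gz
```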

@yeredh
Author

yeredh commented Jul 3, 2020

Hi,

I was directed to the solution here: #325

The following does work:

umi_tools extract --bc-pattern=CCCCCCCCCCCCCCCCNNNNNNNNNN \
                  --stdin C143_R1.fastq.gz \
                  --stdout C143_R1_extracted.fastq.gz \
                  --read2-in C143_R2.fastq.gz  \
                  --read2-out=C143_R2_extracted.fastq.gz \
                  --filter-cell-barcode \
                  --ignore-read-pair-suffixes \
                  --whitelist=whitelist.txt; 
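
For anyone wondering why the extra option helps, here is a rough illustration of the idea, not the actual UMI-tools code: the two names in the error message only differ in the trailing /1 and /2, so they compare equal once that suffix is ignored. The function names `strip_pair_suffix` and `names_match` are made up for this sketch.

```shell
# Hypothetical helper: drop a trailing /1 or /2 from a read name.
strip_pair_suffix() {
    printf '%s\n' "$1" | sed -E 's#/[12]$##'
}

# Succeed when two read names agree once the pair suffix is removed.
names_match() {
    [ "$(strip_pair_suffix "$1")" = "$(strip_pair_suffix "$2")" ]
}

# The pair from the error message above differs only in its suffix.
if names_match 'CL200152206L1C001R001_0/1' 'CL200152206L1C001R001_0/2'; then
    echo "read pair matches once the suffix is ignored"
fi
```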
