Problems preprocessing COVID-19 Sample from Paper #9

Closed
yeredh opened this issue Jul 3, 2020 · 3 comments

@yeredh

yeredh commented Jul 3, 2020

Hello,

I downloaded the FASTQ files for sample GSM4339771 (SRR11181956) from SRA in the original format from https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR11181956

So I ended up with two files:

  • C143_R1.fastq.gz.1
  • C143_R2.fastq.gz.1

I was able to identify the cell barcodes with umi_tools

umi_tools whitelist --stdin C143_R1_test.fastq.gz  \
                    --bc-pattern=CCCCCCCCCCCCCCCCNNNNNNNNNN \
                    --set-cell-number=100 \
                    --log2stderr > whitelist.txt;

However, when I tried the next step, extracting the barcodes and UMIs and adding them to the read names:

umi_tools extract --bc-pattern=CCCCCCCCCCCCCCCCNNNNNNNNNN \
                  --stdin C143_R1.fastq.gz \
                  --stdout C143_R1_extracted.fastq.gz \
                  --read2-in C143_R2.fastq.gz  \
                  --read2-out=C143_R2_extracted.fastq.gz \
                  --filter-cell-barcode \
                  --whitelist=whitelist.txt; 

I get the following error message:

ValueError: 
Read pairs do not match
CL200152206L1C001R001_0/1 != CL200152206L1C001R001_0/2

What am I doing wrong?

Best,

Yered

@Dragonlongzhilin

I guess the read IDs do not match one-to-one between read 1 and read 2. You should check the FASTQ files.
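
One way to run that check, as a minimal sketch: the Python snippet below walks two gzipped FASTQ files in parallel and returns the first pair of read IDs that differ. It assumes plain four-line FASTQ records. The names `read_ids` and `first_id_mismatch` and the `limit` parameter are made up for illustration and are not part of UMI-tools.

```python
import gzip
from itertools import islice, zip_longest


def read_ids(path):
    """Yield the ID of each record: the header line without '@', up to the first whitespace."""
    with gzip.open(path, "rt") as handle:
        # Assumes four lines per record, so every fourth line is a header.
        for header in islice(handle, 0, None, 4):
            fields = header.strip().split()
            name = fields[0] if fields else ""
            yield name[1:] if name.startswith("@") else name


def first_id_mismatch(read1_path, read2_path, limit=1000):
    """Return (index, id1, id2) for the first differing pair, or None if the first `limit` pairs agree."""
    pairs = zip_longest(read_ids(read1_path), read_ids(read2_path))
    for index, (id1, id2) in enumerate(islice(pairs, limit)):
        if id1 != id2:
            return index, id1, id2
    return None


# Example call with the file names from the original post:
# print(first_id_mismatch("C143_R1.fastq.gz", "C143_R2.fastq.gz"))
```

If one file is shorter than the other, the missing side is reported as `None`, which also counts as a mismatch.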

@PierreBSC
Owner

Hi Yered,

So basically you are doing it completely right, and the problem comes from the files.
UMI-tools was designed to process FASTQ files produced by Illumina devices. The files you are mentioning were generated by a BGI machine, so the headers are a bit different.
This is problematic but can be solved. First you need to install a specific version of UMI-tools: https://github.com/CGATOxford/UMI-tools/tree/%7BTS%7D-IgnoreReadPairSuffix. You then need to modify the extract line as described here: CGATOxford/UMI-tools#325, and it should do the job!
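
To illustrate the header difference: in the error message above, the two read names are identical except for the trailing `/1` and `/2` mate suffixes, so a strict string comparison treats them as different reads. The sketch below shows a suffix-tolerant comparison in Python. It is only an illustration of the idea, not UMI-tools' actual code, and `strip_pair_suffix` and `same_pair` are hypothetical names.

```python
import re

# Matches a trailing "/1" or "/2" mate suffix at the end of a read name.
_PAIR_SUFFIX = re.compile(r"/[12]$")


def strip_pair_suffix(name):
    """Remove a trailing /1 or /2 from a read name, if present."""
    return _PAIR_SUFFIX.sub("", name)


def same_pair(name1, name2):
    """Compare two read names while ignoring their mate suffixes."""
    return strip_pair_suffix(name1) == strip_pair_suffix(name2)


# The two names from the error message in this issue.
r1 = "CL200152206L1C001R001_0/1"
r2 = "CL200152206L1C001R001_0/2"

print(r1 == r2)           # False: the strict comparison fails
print(same_pair(r1, r2))  # True: the names agree once the suffix is ignored
```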

Hope this will help,

Best

Pierre

@yeredh
Author

yeredh commented Jul 3, 2020

Thank you Pierre for your prompt reply!
