Does this support demultiplexing with forward and reverse barcode sequence pairs? #1

Open
dnk8n opened this issue Nov 27, 2018 · 5 comments

Comments

@dnk8n

dnk8n commented Nov 27, 2018

I have a fastq file with sequences that may or may not be displayed in reverse complement form.

I expect some fastq records to match one of many samples, each with a forward barcode AND a reverse barcode. If there is no match, I would need to reverse-complement the record's sequence and try the barcode pair again.

If this is not yet supported, I would like to implement it in a format that suits you if this feature is something you feel might be worthwhile.
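
The matching described above could be sketched roughly as follows. This is only an illustration, not pydemult code: the function names are hypothetical, and it assumes exact matches, with the forward barcode at the start of the read and the reverse complement of the reverse barcode at the end.

```python
# Translation table for complementing bases (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def assign_sample(seq, samples):
    """Assign a read to a sample by its forward/reverse barcode pair.

    samples maps a sample name to (forward_barcode, reverse_barcode).
    The read is tried as given first, then reverse-complemented.
    Returns (sample_name, oriented_sequence), or (None, seq) if nothing matches.
    Assumption: exact matches only, barcodes at the very ends of the read.
    """
    for candidate in (seq, reverse_complement(seq)):
        for name, (fwd, rev) in samples.items():
            if candidate.startswith(fwd) and candidate.endswith(reverse_complement(rev)):
                return name, candidate
    return None, seq
```

For example, with `samples = {"s1": ("ACGT", "GGAA")}`, a read and its reverse complement would both be assigned to `s1` and returned in the same orientation.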

@jenzopr
Owner

jenzopr commented Nov 27, 2018

Hi Dean,
thanks for your suggestion. Very welcome!
I guess it's possible! Barcode sequences (whether they're forward or reverse complement doesn't matter in the first place) are matched via a regex and then looked up in a mutationhash. One could easily implement the optional inclusion of the reverse complement of each barcode in the mutationhash to enable the two-way search - even without implementing an explicit "second search".
I'd be happy to receive a PR from you - or you give me a couple of days to implement it.
Best,
Jens
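
A minimal sketch of the idea in this comment, assuming the "mutationhash" is a dictionary that maps every sequence within one substitution of a barcode back to its sample. The names `build_lookup`, `one_mismatch_neighbours`, `sample_for_header` and `HEADER_PATTERN` are hypothetical and are not taken from pydemult; the header pattern (a 4-base barcode at the end of the header) is likewise an assumption.

```python
import re

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def one_mismatch_neighbours(barcode, alphabet="ACGTN"):
    """Yield the barcode itself and every single-substitution variant."""
    yield barcode
    for i, original in enumerate(barcode):
        for base in alphabet:
            if base != original:
                yield barcode[:i] + base + barcode[i + 1:]


def build_lookup(barcodes, include_revcomp=False):
    """Map each tolerated barcode variant to its sample name.

    barcodes maps sample name -> barcode. With include_revcomp=True the
    reverse complement of each barcode is added too, so one dictionary
    lookup covers both orientations. Collisions keep the first sample seen,
    which is a simplification.
    """
    lookup = {}
    for sample, barcode in barcodes.items():
        variants = [barcode]
        if include_revcomp:
            variants.append(reverse_complement(barcode))
        for variant in variants:
            for mutated in one_mismatch_neighbours(variant):
                lookup.setdefault(mutated, sample)
    return lookup


# Hypothetical header layout: a 4-base barcode at the end of the header line.
HEADER_PATTERN = re.compile(r"(?P<barcode>[ACGTN]{4})$")


def sample_for_header(header, lookup, pattern=HEADER_PATTERN):
    """Extract the barcode from a header with a regex and look it up."""
    match = pattern.search(header)
    if match is None:
        return None
    return lookup.get(match.group("barcode"))
```

Because the reverse complements are precomputed into the dictionary, each read still needs only a single lookup, which is the point of avoiding an explicit "second search".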

@dnk8n
Author

dnk8n commented Nov 27, 2018

I might have misunderstood how your tool works. Does it look for the barcode in the record header rather than the record sequence itself? Our header information is lacking the barcodes, but they are present within the sequence itself (with potential for error, so things like edit distance, etc. should be evaluated).

Perhaps your tool is solving a slightly different set of demultiplexing problems than what I had in mind...

I would be happy to contribute a PR but my primary focus is finalizing a processing pipeline, so first prize is to use a tool which already has the feature I am after. Second prize would be to submit this feature upstream to the tool with the lowest barrier to entry.

I want to avoid re-implementing something new if I can. But if it is the fastest way, then I may have to do that for now. Will let you know if I choose to commit to submitting a PR.
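
To illustrate what evaluating edit distance against the sequence itself might look like, here is a rough sketch. It is not part of pydemult; `best_barcode_distance` is a hypothetical name. It uses a standard dynamic-programming approach for approximate substring matching, where the barcode may start anywhere in the read at no cost.

```python
def best_barcode_distance(barcode, read):
    """Smallest edit distance between the barcode and any substring of the read.

    Substitutions, insertions and deletions each cost 1. The first row of
    the table is all zeros, so a match may begin at any read position.
    """
    prev = [0] * (len(read) + 1)
    for i, b in enumerate(barcode, start=1):
        curr = [i]  # matching i barcode bases against nothing costs i
        for j, r in enumerate(read, start=1):
            cost = 0 if b == r else 1
            curr.append(min(prev[j - 1] + cost,  # match or substitution
                            prev[j] + 1,         # barcode base missing from read
                            curr[j - 1] + 1))    # extra base in read
        prev = curr
    return min(prev)
```

A read could then be accepted for a sample when the distance for its barcode falls at or below a chosen threshold; the threshold itself would depend on barcode length and the expected error rate.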

@jenzopr
Owner

jenzopr commented Nov 27, 2018

I get your point. Typically (e.g. in scRNA-seq), you'll have paired-end sequencing. One read in the pair (e.g. R1) will contain the barcode sequences (as records, not headers), the other read in the pair (e.g. R2) will contain the actual RNA sequence of interest. To avoid reading two files simultaneously, we re-write the header of R2 to contain the sequence record from R1 (this happens directly in bcl2fastq from Illumina). pydemult then takes R2 as input.
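
The header re-writing step can be pictured with a small sketch. This only illustrates the idea and does not reflect how bcl2fastq does it; the function names and the choice to append the R1 sequence after a space are assumptions.

```python
from itertools import islice


def fastq_records(lines):
    """Yield (header, sequence, quality) from an iterable of FASTQ lines."""
    it = iter(lines)
    while True:
        chunk = [line.rstrip("\n") for line in islice(it, 4)]
        if len(chunk) < 4:
            return  # end of input (or a truncated final record)
        header, seq, _plus, qual = chunk
        yield header, seq, qual


def tag_r2_with_r1(r1_lines, r2_lines):
    """Append each R1 sequence (the barcode read) to the matching R2 header.

    Assumes both inputs list the read pairs in the same order.
    """
    pairs = zip(fastq_records(r1_lines), fastq_records(r2_lines))
    for (_, r1_seq, _), (r2_header, r2_seq, r2_qual) in pairs:
        yield f"{r2_header} {r1_seq}", r2_seq, r2_qual
```

After this step, a single file carries both the barcode (in the header) and the sequence of interest (in the record), so a demultiplexer only has to read one input.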

@dnk8n
Author

dnk8n commented Nov 27, 2018

I am new to the world of bioinformatics, so forgive my misinterpretation. Didn't realise how many different ways there were of doing things!

I am working with PacBio data.

I will review this convo once the fog has lifted and my other tasks are done.

@jenzopr
Owner

jenzopr commented Nov 27, 2018

Oh, welcome then 😃
No worries - just come back whenever you've time to.
