Use python3 syntax: eg. for print #153
Hi @ghuls , We plan to address this soon: we will support print both as a statement and as a function (the lack of a print statement is something that I personally really hate in Python 3, so it will be supported as well). argparse is certainly better, and is definitely planned for later releases. Currently we don't have enough manpower to implement it sooner (porting getopt was trivial, but argparse does a lot of runtime typing tricks that make porting it harder). |
@inumanag Seq seems like an interesting language. Is there a better way (than Kmers in a list) to check whether a certain read matches the big list of Kmers with a Hamming distance of 1 or less? The current version seems quite slow (27 seconds for the first 10000 reads). Is there also a way to indicate failure from a function (e.g. return None), or do you need to create a specific type?
|
Hi @ghuls
|
Also how can parallelism be added? I assume gzipping (when reading a FASTQ file) happens in a separate thread. Does seq allow running the read_barcode_whitelist_from_file function in parallel, and collect the results later? Numba for example allows something like this with arrays: https://numba.readthedocs.io/en/stable/user/parallel.html
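For comparison, the "run one task per file and collect the results later" pattern asked about here can be sketched in plain Python (purely illustrative, not Seq API; `count_lines` is a stand-in for the real per-file work such as scanning a FASTQ for whitelisted barcodes):

```python
from concurrent.futures import ThreadPoolExecutor

def count_lines(filename: str) -> int:
    # Stand-in for per-file work; here it just counts lines.
    with open(filename) as fh:
        return sum(1 for _ in fh)

def process_files(filenames: list) -> dict:
    # Submit one task per file and collect all results at the end.
    with ThreadPoolExecutor() as pool:
        results = pool.map(count_lines, filenames)
        return dict(zip(filenames, results))
```

Each file is processed independently, so the results can be gathered in any order once all workers finish.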
Another thing I noticed is that |
Hi, Maybe a better approach is to flip this procedure: right now you iterate over (reads × barcodes), which is 737280 × 10000 > 7 billion Hamming-distance calculations (maybe fewer by a factor of ~2, since you stop early). Instead you could construct a set from the whitelist and, for each read barcode, look up the barcode itself and its Hamming neighbors in that set. Here is an example of iterating over Hamming neighbors: https://docs.seq-lang.org/cookbook.html#k-mer-hamming-neighbors

In your case the code might look something like:

whitelist = { ... }  # set of k-mers from whitelist

def bc_matches_whitelist(bc: Kmer[16]):
    if bc in whitelist:
        return 0  # exact match
    for neighbor in neighbors(bc):
        if neighbor in whitelist:
            return 1  # Hamming neighbor in whitelist; distance = 1
    return -1  # no match

Let me know if this makes sense! P.S. Just |
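The whitelist-plus-neighbors lookup can also be sketched in plain Python with strings instead of Seq `Kmer` types (illustrative only): a 16-mer has just 16 × 3 = 48 Hamming-1 neighbors, so each read costs at most 49 set lookups rather than ~737K distance computations.

```python
def neighbors(kmer: str):
    # Yield every string at Hamming distance exactly 1 (3 * len(kmer) of them).
    for i, base in enumerate(kmer):
        for b in "ACGT":
            if b != base:
                yield kmer[:i] + b + kmer[i + 1:]

def bc_matches_whitelist(bc: str, whitelist: set) -> int:
    if bc in whitelist:
        return 0            # exact match
    for n in neighbors(bc):
        if n in whitelist:
            return 1        # one mismatch
    return -1               # no match within distance 1
```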
Thanks a lot. Now it only takes 45 seconds for the whole FASTQ file (15 million reads). My original code was still running from this morning on the whole file. |
I got the code parallelized, but I assumed it would run each FASTQ file in a separate thread (plus a thread for gunzipping the FASTQ file); it seems to work slightly differently, though. As you can see at the end, setting … Can the number of threads be controlled from …

from sys import argv
bc_10x_sc_atac_whitelist_filename = '737K-cratac-v1.txt'
def read_barcode_whitelist_from_file[K](bc_whitelist_filename: str):
    """
    Read whitelisted barcodes from file and convert to a set of Kmers.
    """
    bc_whitelist = set[K]()
    with open(bc_whitelist_filename, 'r') as fh_bc:
        for bc_line in fh_bc:
            bc_str = bc_line.strip()
            if len(bc_str) == 0 or bc_str[0] == '#':
                # Skip empty lines and lines that start with a comment.
                continue
            bc = K(seq(bc_str))
            bc_whitelist.add(bc)
    return bc_whitelist
bc_whitelist = read_barcode_whitelist_from_file[Kmer[16]](bc_10x_sc_atac_whitelist_filename)
def neighbors(kmer):
    """
    Yield all kmers at a Hamming distance of exactly 1.
    """
    for i in range(len(kmer)):
        for b in (k'A', k'C', k'G', k'T'):
            if kmer[i] != b:
                yield kmer |> base(i, b)
def bc_matches_whitelist(bc: Kmer[16]) -> int:
    if bc in bc_whitelist:
        # Exact match.
        return 0
    for neighbor in neighbors(bc):
        if neighbor in bc_whitelist:
            # Hamming neighbor in whitelist (distance = 1).
            return 1
    # No match.
    return -1
def find_bc_in_fastq(fastq_filename):
    nbr_reads = 0
    nbr_bc_0_mismatches = 0
    nbr_bc_1_mismatches = 0
    for record in FASTQ(fastq_filename, gzip=True, validate=False, copy=True):
        nbr_reads += 1
        bc_match = bc_matches_whitelist(Kmer[16](record.seq))
        if bc_match == 0:
            nbr_bc_0_mismatches += 1
        elif bc_match == 1:
            nbr_bc_1_mismatches += 1
        #if nbr_reads == 100000:
        #    break
    print f'{fastq_filename}:\nnbr_reads:\t{nbr_reads}\ntotal_bc_found\t{nbr_bc_0_mismatches + nbr_bc_1_mismatches}\nnbr_bc_0_mismatches\t{nbr_bc_0_mismatches}\nnbr_bc_1_mismatches\t{nbr_bc_1_mismatches}\n'
if len(argv) == 1:
    print f'Usage: {argv[0]} fastq_with_bc_file [fastq_with_bc_file ...]'
else:
    iter(argv[1:]) ||> find_bc_in_fastq

Running it like this:
|
There's no simple way right now to set the number of threads programmatically, although it can technically be done by calling the OpenMP function directly:

cimport omp_set_num_threads(i32)
omp_set_num_threads(i32(num_threads))  # num_threads is an int

The code you posted definitely looks reasonable to me, and I'd expect it to be amenable to parallelization. How many cores/threads does the machine you're running on have? |
@ghuls : Thanks for spotting this! Two things:
|
The machine has 36 cores, so if I don't specify … How would I parallelize … The I/O should be pretty fast: the node I am using uses data from GPFS over InfiniBand. |
How to use define … I also assume that just defining a variable global is not thread safe (e.g. when modifying it inside …

nbr_reads = 0
nbr_bc_0_mismatches = 0
nbr_bc_1_mismatches = 0
def find_bc_in_fastq(fastq_filename: str):
    # nbr_reads = 0
    # nbr_bc_0_mismatches = 0
    # nbr_bc_1_mismatches = 0
    global nbr_reads
    global nbr_bc_0_mismatches
    global nbr_bc_1_mismatches

    #for record in FASTQ(fastq_filename, gzip=True, validate=False, copy=True):
    def process(record: FASTQRecord):
        @atomic
        def update_counters(bc_match: int):
            global nbr_reads
            global nbr_bc_0_mismatches
            global nbr_bc_1_mismatches
            nbr_reads += 1
            if bc_match == 0:
                nbr_bc_0_mismatches += 1
            elif bc_match == 1:
                nbr_bc_1_mismatches += 1
        bc_match = bc_matches_whitelist(Kmer[16](record.seq))
        update_counters(bc_match)
        #if nbr_reads == 100000:
        #    break

    FASTQ(fastq_filename, gzip=True, validate=False, copy=True) |> blocks(size=1000) ||> iter |> process
    print f'{fastq_filename}:\nnbr_reads:\t{nbr_reads}\ntotal_bc_found\t{nbr_bc_0_mismatches + nbr_bc_1_mismatches}\nnbr_bc_0_mismatches\t{nbr_bc_0_mismatches}\nnbr_bc_1_mismatches\t{nbr_bc_1_mismatches}\n'
if len(argv) == 1:
    stderr.write(f'Usage: {argv[0]} fastq_with_bc_file [fastq_with_bc_file ...]\n')
    exit(1)
else:
    iter(argv[1:]) ||> find_bc_in_fastq |
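For comparison, the `@atomic` counter updates above can be mimicked in plain Python with a lock around the whole update (an illustrative sketch, not Seq API):

```python
import threading

class Counters:
    def __init__(self):
        self._lock = threading.Lock()
        self.reads = 0
        self.exact = 0
        self.one_mm = 0

    def update(self, bc_match: int):
        # One lock-protected critical section per record, so concurrent
        # workers cannot interleave partial counter updates.
        with self._lock:
            self.reads += 1
            if bc_match == 0:
                self.exact += 1
            elif bc_match == 1:
                self.one_mm += 1
```

An alternative design that avoids contention entirely is to keep per-worker counters and sum them once all workers have finished.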
|
Thanks. I just played around a bit with parsing a BAM file and noticed that the example is lacking some info: https://docs.seq-lang.org/cookbook.html#reading-sam-bam-cram
Also, is there a way to get the type from a tag, so the right function can be called on the tag? Can …
|
Most of these are quirks of htslib, which we use for parsing BAM/SAM. But,
|
Storing the contig name in the SAMRecord is not necessary for me. I figured out a way to get type info from tags in htslib (see here for how to convert the one-byte type char to the correct type). Basically, get a pointer to the result returned by:

from bio.c_htslib import bam_endpos, bam_aux_get
bam_reader = BAM(bam_filename)
for r in bam_reader:
    # Only use reads which are not marked as duplicate.
    if not r.unmapped and not r.duplicate:
        # Get cell barcode tag.
        cb = r.aux('CB')
        # Only use reads which have a cell barcode tag.
        if cb:
            print "\n" + r.name
            print(f'{bam_reader.contig(r.tid)}\t{r.pos}\t{int(bam_endpos(r.__raw__()))}\t{cb.Z}')
            print 'CB:Z', ptr[byte](cb.s)[0]
            print 'AS:i', ptr[byte](r.aux('AS').s)[0]
            print 'HI:i', ptr[byte](r.aux('HI').s)[0]
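For reference, the one-byte type char read via `ptr[byte](...)[0]` above follows the SAM specification's aux-field type codes; a small Python lookup table (illustrative, not htslib API) makes the mapping explicit:

```python
# BAM auxiliary-field type characters, as defined by the SAM spec;
# htslib stores this byte immediately before the tag's value.
TAG_TYPES = {
    b"A": "printable character",
    b"c": "int8", b"C": "uint8",
    b"s": "int16", b"S": "uint16",
    b"i": "int32", b"I": "uint32",
    b"f": "float32",
    b"Z": "NUL-terminated string",
    b"H": "hex byte array",
    b"B": "typed numeric array",
}

def describe_tag_type(type_byte: bytes) -> str:
    # Map a raw type byte to a human-readable description.
    return TAG_TYPES.get(type_byte, "unknown")
```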
Output:
Could you also add
Or allow it to be calculated from a SAM record: hts_pos_t bam_endpos(const bam1_t *b)
{
hts_pos_t rlen = (b->core.flag & BAM_FUNMAP)? 0 : bam_cigar2rlen(b->core.n_cigar, bam_get_cigar(b));
if (rlen == 0) rlen = 1;
return b->core.pos + rlen;
} |
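The C function above just adds the reference length implied by the CIGAR to the start position; the same computation can be sketched in plain Python (illustrative, not htslib):

```python
import re

# CIGAR operations that consume the reference (M, D, N, =, X),
# matching what htslib's bam_cigar2rlen counts.
_REF_OPS = set("MDN=X")

def end_pos(pos: int, cigar: str) -> int:
    # 0-based start position plus reference length consumed by the
    # alignment; mirrors bam_endpos above (rlen of 0 becomes 1, and
    # unmapped reads would be handled separately as in the C code).
    rlen = sum(int(n) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)
               if op in _REF_OPS)
    return pos + max(rlen, 1)
```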
Ah, good to know! This should be very easy to add then.
|
Just pushed some additions related to your post above:
These will all be included in the next release! |
This comment should have Line 167 in ee3d3d4
|
Fixed now in #132. Thanks. |
Use python3 syntax: eg. for print. Now

    print('a')

also prints (a) instead of just a.

Also https://github.com/seq-lang/seq/blob/develop/stdlib/getopt.seq is from Python 2.7 instead of Python 3. argparse is a bit nicer to use: https://docs.python.org/3/library/argparse.html