Partition sweep #164
Conversation
Pull request merged into ged-lab/bleeding-edge
@cswelcher Please merge in the latest from the master branch so that @ged-jenkins can test it out :-)
(Hmm... Jenkins should have built this by now. Let's test the magic phrase.) please test this
Grr... the actual phrase is: test this please
test this please
Test PASSed.
Instructions for Sweep

The script to use is scripts/sweep-reads-by-partition-buffered.py, which consumes a partitioned file and runs through a list of read files, outputting reads to files by color. This version has a read buffering system built in, which greatly speeds up the output process.

Usage: `python sweep-reads-by-partition-buffered.py -k -x -i -r -b -e <aprx. partitions> -o <output_prefix> -m ...`

- Hashsize can be quite small; I often go with 1e9 (which comes out to 5e8 bytes). It just needs to be big enough to store the assembly graph properly and, preferably, to avoid unreasonable spurious edges when traversing from reads.
- Traversal range should be set high; -1 sets it to the max (81 in the current khmer codebase).
- Buffer size is the total number of reads to keep across all buffers, which depends on available memory; 1000000 works quite well.
- Aprx partitions is a rough guess at the number of partitions (the exact number works too; it doesn't matter much).
- Max buffers is the number of colors to buffer at any given time; 25000-50000 works well.

Notes on the buffering parameters: the program works by maintaining up to the maximum number of per-color read lists in memory, whose combined size is capped before files are written. When an individual buffer grows past its per-buffer limit, that buffer is written out to its file. If the total number of buffered reads exceeds the cap, or the total number of buffers exceeds the maximum, all buffers are flushed to files. In practice, with the recommended values, all buffers get flushed around every 200000 reads, increasing as the buffer limits increase. There is a tradeoff here between a larger number of buffers increasing query times and flushing all buffers too often, though testing suggests that the cost of querying the data structure is minimal, and one is better off choosing -m liberally.

-i, the assembly file, should of course be partitioned, assembled contigs, and file1...fileN should be the reads to be swept. K length is up to the user; one might want to choose it based on the assembly K (Trinity uses 25).
So, most parameters are for the buffered output; this could be improved (and will be), but for now it works, and works quite well. With the max traversal range and the full, rather complex lamprey graph, the script chews through about 20M reads/hour, all in one step. Yay!
Things currently being worked on:
Additional things:
Closing due to moving branch; see #24
Open pull request for sparse graph coloring.
Has some old commits from bleeding-edge, but should be fine.