Partition sweep #164
Conversation
Pull request merged into ged-lab/bleeding-edge
@cswelcher Please merge in the latest from the master branch so that @ged-jenkins can test it out :-)
(Hmm... Jenkins should have built this by now. Let's test the magic phrase.) please test this
Grr... the actual phrase is: test this please
test this please
Test PASSed.
Instructions for Sweep

The script to use is scripts/sweep-reads-by-partition-buffered.py, which consumes a partitioned file and runs through a list of read files, outputting reads to files by color. This version has a read buffering system built in, which greatly speeds up the output process.

Usage: `python sweep-reads-by-partition-buffered.py -k -x -i -r -b -e <aprx. partitions> -o <output_prefix> -m ...`

- Hashsize can be quite small; I often go with 1e9 (which comes out to 5e8 bytes). It just needs to be big enough to store the assembly graph properly and, preferably, to avoid unreasonable spurious edges when traversing from reads.
- Traversal range should be set high; -1 sets it to the max (81 in the current khmer codebase).
- Buffer size is the total number of reads to keep across all buffers, which depends on available memory; 1000000 works quite well.
- Aprx partitions is a rough guess at the number of partitions (the exact number works too; it doesn't matter much).
- Max buffers is the number of colors to buffer at any given time; 25000-50000 works well.

Notes on the buffering parameters: the program works by maintaining up to the maximum number of per-color read lists in memory, whose combined size is capped before files are written. When an individual buffer grows past its per-buffer limit, that buffer is written out to its file. If the total number of buffered reads exceeds the cap, or the total number of buffers exceeds the maximum, all buffers are flushed to files. In practice, with the recommended values, all buffers get flushed around every 200000 reads, increasing as the buffer limits increase. There is a tradeoff here between a larger number of buffers increasing query times and flushing all buffers too often, though testing suggests that the cost of querying the data structure is minimal, and one is better off choosing -m liberally.

-i, the assembly file, should of course be partitioned, assembled contigs, and file1...fileN should be the reads to be swept. K length is up to the user; one might want to choose it based on the assembly K (Trinity uses 25).
So, most parameters are for the buffered output; this could be improved (and will be), but for now it works, and works quite well. With the max traversal range and the full, rather complex lamprey graph, the script chews through about 20M reads/hour, all in one step. Yay!
Things currently being worked on:
Additional things:
Closing due to moving branch; see #24
Open pull request for sparse graph coloring.
Has some old commits from bleeding-edge, but should be fine.