Partition sweep #164

Closed
wants to merge 105 commits into from


camillescott
Member

Open pull request for sparse graph coloring.

Has some old commits from bleeding-edge, but should be fine.

camillescott and others added 30 commits July 25, 2013 17:16
Pull request merged into ged-lab/bleeding-edge
…ning_on_abundance

Conflicts:
	lib/hashbits.hh
@mr-c
Contributor

mr-c commented Oct 15, 2013

@cswelcher Please merge in the latest from the master branch so that @ged-jenkins can test it out :-)

@mr-c
Contributor

mr-c commented Oct 17, 2013

(hmm.. Jenkins should have built this by now. Let's test the magic phrase)

please test this

@mr-c
Contributor

mr-c commented Oct 17, 2013

Grr.. the actual phrase is:

test this please

@mr-c
Contributor

mr-c commented Oct 17, 2013

test this please

@ged-jenkins

Test PASSed.
Refer to this link for build results: http://ci.ged.msu.edu/job/khmer-multi-pullrequest/7/

@camillescott
Member Author

Instructions for Sweep

The script to use is scripts/sweep-reads-by-partition-buffered.py, which consumes a partitioned file and runs through a list of read files, writing the reads out to files by color. This version has a built-in read buffering system, which greatly speeds up the output process. Usage:

python sweep-reads-by-partition-buffered.py -k <k> -x <hashsize> -i <assembly file> -r <traversal range> -b <buffer size> -e <aprx. partitions> -o <output_prefix> -m <max buffers> <file1> ... <fileN>

Hashsize can be quite small; I often go with 1e9 (comes out to 5e8 bytes), as it just needs to be big enough to properly store the assembly graph, and preferably avoid unreasonable spurious edges when traversing from reads. Traversal range should be set high; -1 sets it to max (81 in the current khmer codebase). Buffer size is the total number of reads to keep in all buffers, which is dependent on memory -- 1000000 works quite well. Aprx partitions is a rough guess of the number of partitions (or exact, doesn't matter that much). Max buffers is the number of colors to buffer at any given time; 25000-50000 works well.

Notes on the buffering parameters: the program works by maintaining up to <max buffers> lists of reads in memory, which together can hold up to <buffer size> reads before files are written. When an individual buffer exceeds <buffer size> / <max buffers> reads, that buffer is written out to its file. If the total number of buffered reads exceeds <buffer size>, or if the total number of buffers exceeds <max buffers>, all buffers are flushed to files. In practice, with the recommended values, all buffers get flushed at around 200000 reads, increasing as <max buffers> increases. There is a tradeoff here between a larger number of buffers increasing query times and flushing all buffers often, though testing suggests that the cost of querying the data structure is minimal, and one is better off choosing -m liberally.
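For anyone reading along, here is a minimal sketch of the flushing rules described above. It is not the code in the script: the class name, the method names, and the `writer` callback are all hypothetical, and the per-buffer limit is assumed to be buffer size divided by max buffers.

```python
from collections import defaultdict


class ReadBufferManager:
    """Hypothetical illustration of the buffering rules; not the script's code."""

    def __init__(self, max_buffers, max_reads, writer):
        self.max_buffers = max_buffers  # max number of per-color buffers
        self.max_reads = max_reads      # max reads held across all buffers
        # Assumed per-buffer limit: buffer size / max buffers.
        self.per_buffer_limit = max_reads / max_buffers
        self.writer = writer            # callable(label, reads), assumed
        self.buffers = defaultdict(list)
        self.total_reads = 0

    def flush_buffer(self, label):
        # Write one color's reads out and drop its buffer.
        reads = self.buffers.pop(label, [])
        if reads:
            self.writer(label, reads)
            self.total_reads -= len(reads)

    def flush_all(self):
        # Copy the keys first, since flushing mutates the dict.
        for label in list(self.buffers):
            self.flush_buffer(label)

    def add(self, label, read):
        self.buffers[label].append(read)
        self.total_reads += 1
        # Rule 1: a single buffer grew past its share, so write it out.
        if len(self.buffers[label]) > self.per_buffer_limit:
            self.flush_buffer(label)
        # Rule 2: too many reads or too many buffers, so flush everything.
        if (self.total_reads > self.max_reads
                or len(self.buffers) > self.max_buffers):
            self.flush_all()
```

In a real run `writer` would append the reads to the output file for that color, and `flush_all()` would be called once more after the last read so nothing is left in memory.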

-i, the assembly file, should of course contain partitioned, assembled contigs, and file1...fileN should be the reads to be swept. K length is up to the user; one might want to choose it based on the assembly K (Trinity uses 25).

So, most parameters are for the buffered output; this could be improved (and will be), but for now it works and works quite well. With max traversal range and the full, rather complex lamprey graph, the script chews through about 20m reads/hour, and all in one step. Yay!

@camillescott
Member Author

Things currently being worked on:

  • Tests for sweep-reads-by-partition-buffered.py
  • Traversal optimization
  • Read pair awareness for sweep
  • Better IO error handling in sweep script
  • Subclass all sparse labeling code
  • Refactor names using 'coloring' to use 'labeling'

@camillescott
Member Author

Additional things:

  • Handle sweeping reads shorter than k size
  • General error handling in Python glue
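One possible way to deal with the short-read item, as a rough sketch only: skip any read that cannot contain a single k-mer before it reaches the sweep. The function name and the `(name, sequence)` tuple format are assumptions made for illustration, not what the script currently does.

```python
def sweepable_reads(reads, k):
    """Yield only reads long enough to contain at least one k-mer.

    `reads` is assumed to be an iterable of (name, sequence) tuples.
    """
    for name, sequence in reads:
        if len(sequence) < k:
            # Too short to produce a k-mer; skip instead of raising.
            continue
        yield name, sequence
```

Whether short reads should be dropped silently, counted, or written to a separate file is a design choice left open here.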

@camillescott
Member Author

Closing due to moving branch; see #24


7 participants