User experience upgrades for khmer #732
Comments
Some notes from my experience... It would be great to have more, or specifically more centralized, documentation for diagnostic performance figures. This info can be found in various sources on the web, but a single location where users can see the types of figures that can be used to assess the performance of diginorm or partitioning would be useful. It would be nice to see a collection of example figures from report files generated by More documentation/examples/tutorials for the Python API as opposed to the scripts. An unrelated but minor annoyance is the Better documentation for Just two cents, -Chuck
1 annoyance: setting the size of the hash table with Would have to provide some guidelines in the docs about how much RAM one needs to do common types of jobs, but this info is already given to the user in http://khmer.readthedocs.org/en/v1.3/user/choosing-table-sizes.html#rules-of-thumb
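As a rough illustration of the RAM-based guideline idea, here is a minimal sketch of turning a memory budget into a per-table size. It assumes a counting table that costs one byte per entry and that total memory is roughly the number of tables times the table size; the function names and defaults are hypothetical and are not khmer's API.

```python
def parse_memory(text):
    """Parse strings such as '4e9', '16G', or '512M' into a byte count."""
    suffixes = {"K": 10**3, "M": 10**6, "G": 10**9, "T": 10**12}
    text = text.strip().upper()
    if text and text[-1] in suffixes:
        return int(float(text[:-1]) * suffixes[text[-1]])
    return int(float(text))


def table_size_from_memory(max_memory_bytes, n_tables=4, bytes_per_entry=1):
    """Return the number of entries per table that fits in the budget.

    Assumption: total memory ~= n_tables * table_size * bytes_per_entry,
    ignoring any fixed overhead. All names here are illustrative only.
    """
    if max_memory_bytes <= 0 or n_tables <= 0 or bytes_per_entry <= 0:
        raise ValueError("all arguments must be positive")
    return max_memory_bytes // (n_tables * bytes_per_entry)
```

With something like this, a user would only need to say how much RAM a job may use, and the table size would follow from it.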
Hey @chuckpr, thanks for the feedback!
Thanks, all. @chuckpr, we don't actually tweak the diginorm parameters at all (or at least, I don't) - k=20/C=20 works well for single pass, and the procedures/parameters in khmer-protocols seem to work fine for mRNAseq and metagenome data sets. @macmanes, yep, that's a simpler approach to things than anything I was planning - thanks!
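For readers unfamiliar with what k and C control, here is a minimal single-pass sketch of the digital normalization idea: a read is kept only while the median count of its k-mers is still below the cutoff C. It uses an exact `Counter` as a stand-in for khmer's probabilistic counting table and skips canonical (reverse-complement) k-mers, so it is an illustration under those assumptions and not khmer's implementation.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Yield every k-mer in seq (no canonicalization, for brevity)."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def normalize_by_median(reads, k=20, cutoff=20):
    """Yield reads whose median k-mer count is still below the cutoff.

    Assumption: an exact Counter replaces the probabilistic table.
    """
    counts = Counter()
    for read in reads:
        read_kmers = list(kmers(read, k))
        if not read_kmers:
            continue  # read shorter than k: skipped in this sketch
        if median(counts[km] for km in read_kmers) < cutoff:
            counts.update(read_kmers)  # only kept reads add to the counts
            yield read
```

Because only kept reads update the counts, each k-mer's count in this sketch is bounded by what the retained reads contribute, however redundant the input is.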
Half of this list can be implemented before 2.0; all are good ideas.
More streaming requests.. If you allowed users to name output files in
Right now
On Wed, Jan 28, 2015 at 11:55:20AM -0800, Matt MacManes wrote:
Agreed. Thx!
+1 for @macmanes suggestion for a
Updates for @macmanes:
Streaming is still in progress, but read name handling, broken-paired read handling, and explicit naming of output files have been fixed or updated across many of our scripts (see #763 and #759). I'm sure there are loose ends, but we will try to identify them and clean those up by 2.0.
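To make the read name and broken-paired handling concrete, here is a small sketch of one way to pair adjacent reads in a stream that mixes pairs and orphans, accepting both the older `/1` and `/2` suffixes and the Casava 1.8 style `name 1:N:0:INDEX`. The function names and the `(name, sequence)` record shape are assumptions for illustration, not khmer's actual functions.

```python
def split_read_name(name):
    """Return (base_name, mate) where mate is 1, 2, or None.

    Handles the older 'name/1' style and the Casava 1.8 style
    'name 1:N:0:INDEX'. Anything else is treated as unpaired.
    """
    base, _, comment = name.partition(" ")
    if comment[:2] in ("1:", "2:"):
        return base, int(comment[0])
    if base.endswith(("/1", "/2")):
        return base[:-2], int(base[-1])
    return name, None


def broken_paired_reader(records):
    """Yield (read1, read2) for adjacent mates and (read, None) for orphans.

    `records` is any iterable of (name, sequence) tuples in file order.
    """
    previous = None
    for record in records:
        if previous is not None:
            prev_base, prev_mate = split_read_name(previous[0])
            base, mate = split_read_name(record[0])
            if prev_mate == 1 and mate == 2 and prev_base == base:
                yield previous, record
                previous = None
                continue
            yield previous, None  # previous read had no adjacent mate
        previous = record
    if previous is not None:
        yield previous, None
```

Since it only ever holds one record at a time, this approach works on a stream and does not need the whole file in memory.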
We will revisit this and post new issues after the 2.0 release!
UX issues ported to #1216.
This issue is to track desired user experience improvements. Most of these break I/O formats or file naming, so they probably belong in khmer 2.0. The question is, do we try to get 'em all into khmer 2.0?
better filename naming, happening automatically - see Inconsistent and incomplete output filename handling #81.
output to gz/bz2 automatically, based on input format but overridable - see Add options for outputting gzipped/bzip2ed sequence #505, with defaults determined by input format. See also state of streaming I/O in khmer, mark 2 #700.
exact k-mer counting/automatic memory usage determination - a constant source of confusion for users and a constant documentation/tutorial challenge :). Two complementary fixes:
first, Choosing hash table size and numbers according to number of unique k-mers #347: count k-mers w/hyperloglog (hyperloglog counter #233; HyperLogLog Counter #257) to choose the sizes of our high-efficiency data structures appropriately. This trades computation/time for memory. Note that it will not work properly for streaming approaches like diginorm, where we can't know the final number of k-mers until after we do all the hard work!
second, implement dynamically sized k-mer counting data structures, either exact or inexact or both. For example, use google sparsehash (implement google's sparsehash #689) to count k-mers exactly. Exact counting will be much less memory efficient than probabilistic data structures, but it's not clear that matters for many applications like diginorm, where streaming/sublinear memory is already a massive win. Worth noting that exact counting would also be a useful addition to our Python API, and might also help overcome the concerns of people who dislike inexact counting (despite publications showing that you shouldn't worry about it :).
handling “broken paired” format in input - this affects streaming (state of streaming I/O in khmer, mark 2 #700; State of streaming sequence IO in khmer #654; Investigate/implement streaming approaches, more generally #393) and may also need to be addressed in khmer multiprocessing + seqan #655.
handle both new and old formats for FASTQ sequences (Casava 1.8 pair checking in scripts/normalize-by-median, scripts/abund-filter and khmer/thread-utils #23; [docs] /1 & /2 read naming not latest Casava file format #305; Fastq ID stripping #606).
Allow k > 32 - tracked in Allow k > 32 in khmer #60. Once HyperLogLog (HyperLogLog Counter #257) is integrated, we will have murmurhash in the codebase; perhaps this can be used in addition to/instead of Feature/cyclic hash #624?
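For the automatic memory determination item above, here is a sketch of how an estimated unique k-mer count (for example, from a HyperLogLog pass) could be turned into a table size. It assumes the standard Bloom-filter-style approximation in which each of the tables behaves like an independent single-hash filter; the function names, the default number of tables, and the default target rate are hypothetical.

```python
import math


def table_size_for(n_unique_kmers, n_tables=4, target_fp_rate=0.1):
    """Estimate entries per table for a target false positive rate.

    Assumption: fp_rate ~= (1 - exp(-N / m)) ** n_tables, i.e. each table
    is an independent single-hash filter with m entries holding N k-mers.
    """
    if not 0 < target_fp_rate < 1:
        raise ValueError("target_fp_rate must be between 0 and 1")
    per_table_occupancy = target_fp_rate ** (1.0 / n_tables)
    return math.ceil(-n_unique_kmers / math.log(1.0 - per_table_occupancy))


def fp_rate(n_unique_kmers, table_size, n_tables=4):
    """Inverse of the above: expected false positive rate for a given size."""
    return (1.0 - math.exp(-n_unique_kmers / table_size)) ** n_tables
```

As the list notes, this only helps when the k-mer count can be estimated up front, which is not the case for streaming approaches like diginorm.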
Anything I'm forgetting? The above list covers most of the complaints I've heard :)
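For the gz/bz2 output item in the list above, one possible shape for "default from the input format, but overridable" is sketched below using only the Python standard library. Detection by magic bytes and the `override` parameter are assumptions made for illustration; nothing here reflects how khmer's scripts actually implement it.

```python
import bz2
import gzip

GZIP_MAGIC = b"\x1f\x8b"
BZ2_MAGIC = b"BZh"


def detect_compression(path):
    """Return 'gz', 'bz2', or None based on the file's leading magic bytes."""
    with open(path, "rb") as handle:
        head = handle.read(3)
    if head.startswith(GZIP_MAGIC):
        return "gz"
    if head.startswith(BZ2_MAGIC):
        return "bz2"
    return None


def open_output(path, input_path, override=None):
    """Open `path` for text writing, matching the input's compression.

    `override` may be 'gz', 'bz2', or 'none' to ignore the input format.
    """
    kind = override if override is not None else detect_compression(input_path)
    if kind == "gz":
        return gzip.open(path, "wt")
    if kind == "bz2":
        return bz2.open(path, "wt")
    return open(path, "w")
```

Checking magic bytes instead of the file extension means the default still works when an input file is misnamed, although it cannot be used on a non-seekable stream without buffering the first bytes.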