User experience upgrades for khmer #732

ctb · 2015-01-18T12:42:41Z

This issue is to track desired user experience improvements. Most of these break I/O formats or file naming, so probably belong in khmer 2.0. The question is, do we try to get 'em all in to khmer 2.0?

better filename naming, happening automatically - see Inconsistent and incomplete output filename handling #81.
output to gz/bz2 automatically, with defaults determined by input format but overridable - see Add options for outputting gzipped/bzip2ed sequence #505. See also state of streaming I/O in khmer, mark 2 #700.
exact k-mer counting/automatic memory usage determination -
a constant source of confusion for users and a constant documentation/tutorial challenge :).

two complementary fixes: first (Choosing hash table size and numbers according to number of unique k-mers #347), count k-mers with HyperLogLog (hyperloglog counter #233, HyperLogLog Counter #257) to choose the sizes of
our high-efficiency data structures appropriately. This trades computation/time for memory.
Note that it will not work properly for streaming approaches like diginorm, where we can't know the
final number of k-mers until after we do all the hard work!

second, implement dynamically sized k-mer counting data structures, either exact or inexact or both. For example,
use Google sparsehash (implement google's sparsehash #689) to count k-mers exactly. Exact counting will be much less memory
efficient than probabilistic data structures, but it's not clear that matters for many applications like
diginorm, where streaming/sublinear memory is already a massive win. Worth noting that exact
counting would also be a useful addition to our Python API, and might also help overcome the
concerns of people who dislike inexact counting (despite publications showing that
you shouldn't worry about it :).
handling “broken paired” format in input - this affects streaming (state of streaming I/O in khmer, mark 2 #700; State of streaming sequence IO in khmer #654; Investigate/implement streaming approaches, more generally #393) and may also need to be addressed in khmer multiprocessing + seqan #655.
handle both new and old formats for FASTQ sequences (Casava 1.8 pair checking in scripts/normalize-by-median, scripts/abund-filter and khmer/thread-utils #23; [docs] /1 & /2 read naming not latest Casava file format #305; Fastq ID stripping #606).
Allow k > 32 - tracked in Allow k > 32 in khmer #60. Once HyperLogLog (HyperLogLog Counter #257) is integrated, we will have murmurhash in the
codebase; perhaps this can be used in addition to/instead of Feature/cyclic hash #624?
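
As a rough illustration of the "broken paired" and read-naming items above, here is a minimal sketch of a reader that accepts both the old `/1` and `/2` suffixes and the Casava 1.8+ style (`name 1:N:0:INDEX`), and that yields orphaned reads as singletons instead of failing. The function names `split_name` and `broken_paired` are hypothetical and are not taken from the khmer codebase; records are assumed to be simple `(header, sequence)` tuples.

```python
import re

# Old style: "name/1" and "name/2" (optionally followed by a description).
_OLD_STYLE = re.compile(r"^(\S+)/([12])(?:\s.*)?$")
# Casava 1.8+ style: "name 1:N:0:INDEX" and "name 2:N:0:INDEX".
_NEW_STYLE = re.compile(r"^(\S+)\s+([12]):[YN]:\d+:\S*$")


def split_name(header):
    """Return (base_name, mate), where mate is 1, 2, or None if unrecognized."""
    header = header.lstrip("@>").strip()
    for pattern in (_OLD_STYLE, _NEW_STYLE):
        match = pattern.match(header)
        if match:
            return match.group(1), int(match.group(2))
    return header, None


def broken_paired(records):
    """Yield (first, second) from an iterable of (header, sequence) records.

    second is None when a read has no mate next to it in the stream.
    """
    pending = None
    for record in records:
        if pending is None:
            pending = record
            continue
        base1, mate1 = split_name(pending[0])
        base2, mate2 = split_name(record[0])
        if base1 == base2 and mate1 == 1 and mate2 == 2:
            yield pending, record
            pending = None
        else:
            # The held read is an orphan; the current read may start a pair.
            yield pending, None
            pending = record
    if pending is not None:
        yield pending, None
```

Because the reader only ever holds one record, it works on a stream without knowing the input length in advance.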

Anything I'm forgetting? The above list covers most of the complaints I've heard :)

chuckpr · 2015-01-18T20:32:04Z

Some notes from my experience...

It would be great to have more, or specifically more centralized, documentation for diagnostic performance figures. This info can be found in various sources on the web, but a single location where users can see the types of figures that can be used to assess the performance of diginorm or partitioning would be useful. It would be nice to see a collection of example figures from report files generated by normalize-by-median.py, for instance. What does successful normalization look like on real-world data? What does it look like when parameters need to be tweaked?

More documentation/examples/tutorials for the Python API as opposed to the scripts.

An unrelated but minor annoyance is the check_space_for_hashtable function. As far as I can tell, check_space_for_hashtable checks for space in the current working directory. Most often we are running khmer on an r3.8xlarge EC2 instance, working from a persistent volume while saving big files (like the presence tables) to one of the SSD drives. It's not helpful in this case to check the working directory for space when saving to a different partition. This might just be a flaw in the way we're choosing to work, but maybe check_space_for_hashtable should explicitly check space on the partition where the hashtable is being saved rather than only the working partition? Great that this can be overridden with the `-f` flag, however.
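
To make the suggestion concrete, here is a minimal sketch of a free-space check that looks at the filesystem holding the output file rather than the working directory. The helper names `free_bytes_at` and `enough_space` are hypothetical; this is not how check_space_for_hashtable is actually written, just one way the check could be done with the Python 3 standard library.

```python
import os
import shutil


def free_bytes_at(output_path):
    """Return the free bytes on the filesystem that will hold output_path.

    The directory part of output_path must already exist; otherwise
    shutil.disk_usage raises an OSError.
    """
    target_dir = os.path.dirname(os.path.abspath(output_path))
    return shutil.disk_usage(target_dir).free


def enough_space(output_path, needed_bytes):
    """True if the target filesystem has at least needed_bytes free."""
    return free_bytes_at(output_path) >= needed_bytes
```

With this approach, a table saved to a path on a separate SSD mount would be checked against that mount, whatever directory the script was launched from.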

Better documentation for find-knots.py parameters.

Just two cents,

-Chuck

macmanes · 2015-01-19T20:22:33Z

1 annoyance: setting the size of the hash table with -x. I am constantly fielding questions about this, and it's a pain for many people, especially novice users. Why can't we ask users to specify the max amount of RAM they want khmer to use with --max_mem, and then calculate the table size from that based on a default of -N 4? Remove these options from the user interface altogether.

Would have to provide some guidelines in the docs about how much RAM one needs to do common types of jobs, but this info is already given to the user in http://khmer.readthedocs.org/en/v1.3/user/choosing-table-sizes.html#rules-of-thumb
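
The arithmetic behind this proposal is small enough to sketch. The function name `table_size_from_max_mem` is hypothetical, and the sketch assumes that total memory is roughly the number of tables times the table size times the bits per entry (8 bits per entry for counting tables, 1 bit for presence tables); treat that as an approximation rather than a statement of khmer's exact accounting.

```python
def table_size_from_max_mem(max_mem_bytes, n_tables=4, bits_per_entry=8):
    """Return the per-table size (number of entries) that fits in max_mem_bytes.

    Assumes memory use is about n_tables * table_size * bits_per_entry / 8 bytes.
    """
    if max_mem_bytes <= 0 or n_tables <= 0 or bits_per_entry <= 0:
        raise ValueError("all arguments must be positive")
    total_bits = int(max_mem_bytes) * 8
    return total_bits // (n_tables * bits_per_entry)
```

Under these assumptions, a 16 GB budget with the default of four counting tables works out to a table size of 4e9 entries, which is the value a user would otherwise have to pass by hand with -x.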

mr-c · 2015-01-19T20:33:58Z

Hey @chuckpr, thanks for the feedback!

1. I think @ctb can answer this one
2. The Python API documentation is purposely absent as only the command-line API is supported. Like you we all want a nice & well documented API. You can read about our plans for future releases at http://khmer.readthedocs.org/en/v1.3/roadmap.html
3. Ah, another case of free space check incorrectly fails on HPCC #668
4. Agreed

ctb · 2015-01-21T22:56:56Z

Thanks, all.

@chuckpr, we don't actually tweak the diginorm parameters at all (or at least, I don't) - k=20/C=20 works well for single pass, and the procedures/parameters in khmer-protocols seem to work fine for mRNAseq and metagenome data sets.

@macmanes, yep, that's a simpler approach to things than anything I was planning - thanks!

mr-c · 2015-01-27T19:49:47Z

Half of this list can be implemented before 2.0; all are good ideas.

macmanes · 2015-01-28T19:55:19Z

More streaming requests.. If you allowed users to name output files in split-paired-reads.py, I could stream all the way through my khmer pipeline.. Yes, I know interleaved is the One True Format but not everybody understands this, yet..

interleave-reads.py | normalize-by-median.py | split-paired-reads.py

Right now split-paired-reads.py takes the input file names and modifies them, which will not work when the input file name is /dev/stdin
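
A minimal sketch of the behavior being requested: read interleaved FASTQ from any stream (including standard input) and write the two mates to output handles that the caller names explicitly, so that nothing is derived from the input file name. The function `split_interleaved` is hypothetical and assumes strictly interleaved, four-line FASTQ records; it is not the split-paired-reads.py implementation.

```python
import itertools


def split_interleaved(stream, out1, out2):
    """Copy alternating 4-line FASTQ records from stream to out1 and out2.

    Returns (records_written_to_out1, records_written_to_out2).
    """
    lines = iter(stream)
    outputs = (out1, out2)
    counts = [0, 0]
    for index in itertools.count():
        record = list(itertools.islice(lines, 4))
        if not record:
            break
        if len(record) < 4:
            raise ValueError("truncated FASTQ record at end of input")
        # Even-numbered records are first mates, odd-numbered are second mates.
        outputs[index % 2].writelines(record)
        counts[index % 2] += 1
    return tuple(counts)
```

In a pipeline, `stream` would be `sys.stdin` and the two outputs would be files opened from user-supplied names.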

ctb · 2015-01-28T21:40:46Z

On Wed, Jan 28, 2015 at 11:55:20AM -0800, Matt MacManes wrote:

> More streaming requests.. If you allowed users to name output files in split-paired-reads.py, I could stream all the way through my khmer pipeline.. Yes, I know interleaved is the One True Format but not everybody understands this, yet..
>
> interleave-reads.py | normalize-by-median.py | split-paired-reads.py
>
> Right now split-paired-reads.py takes the input file names and modifies them, which will not work when the input file name is /dev/stdin

Agreed. Thx!

standage · 2015-03-11T16:35:08Z

+1 for @macmanes's suggestion for a --max_mem option.

ctb · 2015-06-29T15:26:53Z

Updates for @macmanes:

-M/--max-memory-usage was merged in #1050. Automatic memory determination based on k-mer numbers is being worked on in #1117.

Streaming is still in progress, but read name handling, broken-paired read handling, and explicit naming of output files have been fixed or updated in many of our scripts (see #763 and #759). I'm sure there are loose ends, but we will try to identify them & clean those up by 2.0.

ctb · 2015-07-29T20:53:21Z

We will revisit this and post new issues after the 2.0 release!

ctb · 2015-07-31T14:08:38Z

UX issues ported to #1216.

ctb mentioned this issue Jan 18, 2015

Deal with "broken-paired" input/output, for better streaming. #733

Closed

ctb mentioned this issue Jan 27, 2015

max memory parameter specification #744

Closed

This was referenced Feb 4, 2015

Move trim-low-abund into scripts/ #754

Closed

Support for longer kmer #757

Closed

Enable specification of output directory #752

Merged

update split-paired-reads to support -1 and -2 options #762

Merged

Minor fixes to split-paired-reads.py #763

Closed

ctb mentioned this issue Mar 2, 2015

Test all scripts/ for appropriate behavior when len(read) < K #859

Open

This was referenced May 8, 2015

Properly handle singleton reads in normalize-by-median. #988

Closed

Compare behavior of filter-abund -V with normalize-by-median; maybe adjust defaults #989

Open

SensibleSalmon self-assigned this May 15, 2015

ctb mentioned this issue May 18, 2015

Turn on better reporting by default for khmer 2.0 #1011

Closed

ctb mentioned this issue May 31, 2015

Add --max-memory-usage argument, and associated khmer_args refactoring. #1050

Merged

12 tasks

SensibleSalmon mentioned this issue Jun 23, 2015

Implementing --max-mem and auto'd hashtable agrs #1117

Closed

SensibleSalmon mentioned this issue Jun 29, 2015

Implement auto'd optimal hashtable args and memory limitation checking #1126

Merged

mr-c added this to the 2.0+ milestone Jul 29, 2015

SensibleSalmon mentioned this issue Jul 31, 2015

Cleaning up automatic argument setting #1214

Merged

ctb mentioned this issue Jul 31, 2015

Meta issue for user experience upgrades in 2.0+ #1216

Open

ctb closed this as completed Jul 31, 2015
