User experience upgrades for khmer #732

Closed
ctb opened this issue Jan 18, 2015 · 11 comments

@ctb
Member

ctb commented Jan 18, 2015

This issue is to track desired user experience improvements. Most of these break I/O formats or file naming, so they probably belong in khmer 2.0. The question is, do we try to get 'em all into khmer 2.0?

  1. better output file naming, happening automatically - see #81 (Inconsistent and incomplete output filename handling).

  2. output to gz/bz2 automatically, with defaults determined by input format but overridable - see #505 (Add options for outputting gzipped/bzip2ed sequence); see also #700 (state of streaming I/O in khmer, mark 2).

  3. exact kmer counting/automatic memory usage determination -
    a constant source of confusion for users and a constant documentation/tutorial challenge :).

    two complementary fixes: first, #347 (Choosing hash table size and numbers according to number of unique k-mers): count k-mers with HyperLogLog (hyperloglog counter #233, HyperLogLog Counter #257) to choose the sizes of
    our high-efficiency data structures appropriately. This trades computation/time for memory.
    Note that it will not work properly for streaming approaches like diginorm, where we can't know the
    final number of k-mers until after we do all the hard work!

    second, implement dynamically sized k-mer counting data structures, either exact or inexact or both. For example,
    use Google's sparsehash (implement google's sparsehash #689) to count k-mers exactly. Exact counting will be much less memory
    efficient than probabilistic data structures, but it's not clear that matters for many applications like
    diginorm, where streaming/sublinear memory is already a massive win. Worth noting that exact
    counting would also be a useful addition to our Python API, and might also help overcome the
    concerns of people who dislike inexact counting (despite publications showing that
    you shouldn't worry about it :).

  4. handling “broken paired” format in input - this affects streaming (#700 state of streaming I/O in khmer, mark 2; #654 State of streaming sequence IO in khmer; #393 Investigate/implement streaming approaches, more generally) and may also need to be addressed in #655 (khmer multiprocessing + seqan).

  5. handle both new and old formats for FASTQ sequences (#23 Casava 1.8 pair checking in scripts/normalize-by-median, scripts/abund-filter and khmer/thread-utils; #305 [docs] /1 & /2 read naming not latest Casava file format; #606 Fastq ID stripping).

  6. Allow k > 32 - tracked in #60 (Allow k > 32 in khmer). Once HyperLogLog (HyperLogLog Counter #257) is integrated, we will have murmurhash in the
    codebase; perhaps this can be used in addition to/instead of the cyclic hash (Feature/cyclic hash #624)?
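To make the first fix in item 3 concrete, here is a minimal sketch of how a unique k-mer estimate could drive table sizing. It uses the standard Bloom filter approximation (occupancy of one table is roughly 1 - exp(-n/x), and a false positive needs a collision in all N tables). The function names, the default of 4 tables, and the 0.1 target rate are illustrative assumptions, not khmer's actual API or defaults.

```python
import math


def estimated_fp_rate(n_unique_kmers, table_size, n_tables=4):
    """Approximate false positive rate for n_tables tables of table_size entries.

    Assumes each table uses one independent hash function (an assumption
    made for this sketch).
    """
    occupancy = 1.0 - math.exp(-n_unique_kmers / table_size)
    return occupancy ** n_tables


def table_size_for(n_unique_kmers, n_tables=4, max_fp_rate=0.1):
    """Smallest per-table size that keeps the estimated rate at or below max_fp_rate."""
    # Invert occupancy ** n_tables <= max_fp_rate for the table size.
    target_occupancy = max_fp_rate ** (1.0 / n_tables)
    return math.ceil(-n_unique_kmers / math.log(1.0 - target_occupancy))
```

The unique k-mer count itself would come from a cardinality estimator such as HyperLogLog, which is why this does not help for one-pass streaming: the count is only known after reading the data.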

Anything I'm forgetting? The above list covers most of the complaints I've heard :)
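For item 5, a rough sketch of what accepting both read naming styles might look like: the older `name/1` and `name/2` suffixes, and the Casava 1.8 style where the pair number follows a space (for example `name 1:N:0:ATCACG`). The helper names are hypothetical and the regular expression is a simplified assumption about the Casava 1.8 header, not khmer's pair-checking code.

```python
import re

# Casava 1.8 style: "<id> <pair>:<filtered Y/N>:<control number>:<index>"
_CASAVA18 = re.compile(r"^(\S+)\s+([12]):[YN]:\d+:\S*$")


def split_read_name(name):
    """Return (base_name, pair_number); pair_number is None if not recognised."""
    match = _CASAVA18.match(name)
    if match:
        return match.group(1), int(match.group(2))
    if name.endswith("/1") or name.endswith("/2"):
        return name[:-2], int(name[-1])
    return name, None


def is_pair(name1, name2):
    """True if name1 is read 1 and name2 is read 2 of the same fragment."""
    base1, num1 = split_read_name(name1)
    base2, num2 = split_read_name(name2)
    return base1 == base2 and num1 == 1 and num2 == 2
```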

@chuckpr
Contributor

chuckpr commented Jan 18, 2015

Some notes from my experience...

It would be great to have more, or specifically more centralized, documentation for diagnostic performance figures. This info can be found in various sources on the web, but a single location where users can see the types of figures that can be used to assess the performance of diginorm or partitioning would be useful. It would be nice to see a collection of example figures from report files generated by normalize-by-median.py, for instance. What does successful normalization look like on real-world data? What does it look like when parameters need to be tweaked?

More documentation/examples/tutorials for the Python API as opposed to the scripts.

An unrelated but minor annoyance is the check_space_for_hashtable function. As far as I can tell, check_space_for_hashtable checks for space in the current working directory. Most often we are running khmer from an r3.8xlarge EC2 instance, working from a persistent volume while saving big files (like the presence tables) to one of the SSD drives. It's not helpful in this case to check the working directory for space when saving to a different partition. This might just be a flaw in the way we're choosing to work, but maybe check_space_for_hashtable should explicitly check space on the partition where the hashtable is being saved rather than only the working partition? Great that this can be overridden with the `-f` flag, however.
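A minimal sketch of the suggested behaviour, checking free space on the partition that will hold the output file rather than on the working directory. This is not the existing check_space_for_hashtable; the function names are hypothetical, and it assumes the output directory already exists.

```python
import os
import shutil


def free_space_for(output_path):
    """Free bytes on the filesystem that will hold output_path."""
    # Resolve the directory the file will be written to, not os.getcwd().
    target_dir = os.path.dirname(os.path.abspath(output_path))
    return shutil.disk_usage(target_dir).free


def has_space_for(output_path, bytes_needed):
    """True if the target partition has at least bytes_needed bytes free."""
    return free_space_for(output_path) >= bytes_needed
```

A relative path with no directory part resolves to the current working directory, so the existing behaviour falls out as the default case.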

Better documentation for find-knots.py parameters.

Just two cents,

-Chuck

@macmanes

One annoyance: setting the size of the hash table with -x. I am constantly fielding questions about this, and it's a pain for many people, especially novice users. Why can't we ask users to specify the max amount of RAM they want khmer to use with --max_mem, and then calculate the table size from that based on a default -N 4? Remove these options from the user interface altogether.

Would have to provide some guidelines in the docs about how much RAM one needs to do common types of jobs, but this info is already given to the user in http://khmer.readthedocs.org/en/v1.3/user/choosing-table-sizes.html#rules-of-thumb
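The arithmetic behind this suggestion is small enough to sketch. Assuming N tables of equal size and a fixed storage cost per entry (one byte per entry for a counting table and one bit per entry for a presence table are assumptions of this sketch), the per-table size follows directly from the memory budget. The function name is hypothetical.

```python
def table_size_from_memory(max_mem_bytes, n_tables=4, bytes_per_entry=1.0):
    """Per-table entry count that fits n_tables tables into max_mem_bytes."""
    if n_tables < 1:
        raise ValueError("n_tables must be at least 1")
    if bytes_per_entry <= 0:
        raise ValueError("bytes_per_entry must be positive")
    return int(max_mem_bytes / n_tables / bytes_per_entry)
```

For example, a 4e9 byte budget with 4 one-byte-per-entry tables gives 1e9 entries per table; with one bit per entry (`bytes_per_entry=1/8`) the same budget gives 8e9.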

@mr-c
Contributor

mr-c commented Jan 19, 2015

Hey @chuckpr, thanks for the feedback!

  1. I think @ctb can answer this one

  2. The Python API documentation is purposely absent, as only the command-line API is supported. Like you, we all want a nice & well-documented API. You can read about our plans for future releases at http://khmer.readthedocs.org/en/v1.3/roadmap.html

  3. Ah, another case of #668 (free space check incorrectly fails on HPCC)

  4. Agreed

@ctb
Member Author

ctb commented Jan 21, 2015

Thanks, all.

@chuckpr, we don't actually tweak the diginorm parameters at all (or at least, I don't) - k=20/C=20 works well for single pass, and the procedures/parameters in khmer-protocols seem to work fine for mRNAseq and metagenome data sets.

@macmanes, yep, that's a simpler approach to things than anything I was planning - thanks!

@mr-c
Contributor

mr-c commented Jan 27, 2015

Half of this list can be implemented before 2.0; all are good ideas.

@macmanes

More streaming requests. If you allowed users to name output files in split-paired-reads.py, I could stream all the way through my khmer pipeline. Yes, I know interleaved is the One True Format, but not everybody understands this yet.

interleave-reads.py | normalize-by-median.py | split-paired-reads.py

Right now split-paired-reads.py takes the input file names and modifies them to build the output names, which will not work when the input file name is /dev/stdin.
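A minimal sketch of the requested behaviour: read interleaved FASTQ from any stream and write to two explicitly supplied output handles, so no output name has to be derived from the input name. This is not the split-paired-reads.py implementation. It assumes strictly interleaved, four-line FASTQ records (no "broken paired" input), and the function names are hypothetical.

```python
from itertools import islice


def iter_fastq_records(lines):
    """Yield each FASTQ record as a list of its four lines."""
    it = iter(lines)
    while True:
        record = list(islice(it, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        yield record


def split_interleaved(in_handle, out1_handle, out2_handle):
    """Send records 1, 3, 5, ... to out1 and 2, 4, 6, ... to out2.

    Returns the number of pairs written.
    """
    n = 0
    for n, record in enumerate(iter_fastq_records(in_handle), 1):
        target = out1_handle if n % 2 == 1 else out2_handle
        target.writelines(record)
    if n % 2:
        raise ValueError("odd number of records; input is not strictly interleaved")
    return n // 2
```

At the end of a pipeline this could be called with `sys.stdin` as the input handle and two files opened under whatever names the user passed on the command line.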

@ctb
Member Author

ctb commented Jan 28, 2015

On Wed, Jan 28, 2015 at 11:55:20AM -0800, Matt MacManes wrote:

> More streaming requests.. If you allowed users to name output files in split-paired-reads.py, I could stream all the way through my khmer pipeline.. Yes, I know interleaved is the One True Format but not everybody understands this, yet..
>
> interleave-reads.py | normalize-by-median.py | split-paired-reads.py
>
> Right now split-paired-reads.py takes the input file names and modifies them, which will not work when the input file name is /dev/stdin

Agreed. Thx!

@standage
Member

+1 for @macmanes's suggestion of a --max_mem option.

@ctb
Member Author

ctb commented Jun 29, 2015

Updates for @macmanes:

-M/--max-memory-usage was merged in #1050. Automatic memory determination based on k-mer numbers is being worked on in #1117.

Streaming is still in progress, but fixed read name handling, broken-paired read handling, and explicit naming of output files have been rolled out across many of our scripts (see #763 and #759). I'm sure there are loose ends, but we will try to identify them & clean those up by 2.0.

@ctb
Member Author

ctb commented Jul 29, 2015

We will revisit this and post new issues after the 2.0 release!

@ctb
Member Author

ctb commented Jul 31, 2015

UX issues ported to #1216.

@ctb closed this as completed Jul 31, 2015

6 participants