Add CLI flag to load-into-counting.py allowing TSV output of basic stats #649
Can one of the admins verify this patch?
add to testlist
n_kmers = htable.n_occupied()
Breaking news: see #562 (comment). A new n_kmers() function is pending.
Very nice, was wondering if this could be implemented 😄
Will update this PR once that one is merged, unless you'd prefer me to do so now.
Question/thought for you and @mr-c -- what about providing this in a JSON or YAML format, instead of TSV? More structured formats could allow for later expansion into additional information, and I think we could put semantic versioning to use by specifying the minimum information in the file by version, allowing for flexible expansion. (This was an idle thought I had yesterday while wandering around.)
And, if going the JSON route, can it be newline-delimited JSON? This would make it compatible with dat, and is still reasonably easy to parse.
Yes, after hacking with this a bit, I like the idea of something more extensible than a TSV (particularly something both versioned and dat-friendly), although that has the disadvantage of being less easy to just glob into R for a quick plot. Maybe the CLI flag could take a format parameter that defaults to JSON if not given, but can also do something tabular like TSV for easy interfacing with R and friends.
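For reference, the two output styles discussed above could look roughly like this. This is only a minimal sketch in Python 3, not the code in this PR: the `write_summary` helper, the `FORMAT_VERSION` value, the key names, and the `;` separator for file lists are all assumptions made for illustration.

```python
import csv
import io
import json

# Hypothetical version string; the thread only proposes versioning the format.
FORMAT_VERSION = "0.1.0"


def write_summary(fp, fmt, stats):
    """Write one run summary to fp as newline-delimited JSON or as TSV."""
    if fmt == "json":
        # One JSON object per line keeps the file newline-delimited.
        record = {"version": FORMAT_VERSION}
        record.update(stats)
        fp.write(json.dumps(record, sort_keys=True) + "\n")
    elif fmt == "tsv":
        # Header row followed by a single data row, columns sorted by name.
        writer = csv.writer(fp, delimiter="\t", lineterminator="\n")
        keys = sorted(stats)
        writer.writerow(keys)
        writer.writerow([
            ";".join(stats[k]) if isinstance(stats[k], (list, tuple)) else stats[k]
            for k in keys
        ])
    else:
        raise ValueError("unknown format: %r" % fmt)


# Example values are made up.
example_stats = {"fpr": 0.01, "n_kmers": 1234, "files": ["a.fa", "b.fa"]}
json_out = io.StringIO()
write_summary(json_out, "json", example_stats)
tsv_out = io.StringIO()
write_summary(tsv_out, "tsv", example_stats)
```

The JSON branch carries the version key so that later fields can be added without breaking readers, while the TSV branch stays flat enough to load straight into R.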
Do you mind rebasing off of our master branch?
Force-pushed: ceea16d to af47c33
I've also been more consistent in using the |
Looking good so far @kdmurray91. I think you're ready for a ChangeLog entry.
Will do this tonight and push it up 😄
Also, do I need to rebase again? GH is complaining about merge conflicts.
Force-pushed: 593c10b to 455c856
Rebased against master, and changed the script to use |
@@ -54,6 +56,10 @@ def get_parser():
    parser.add_argument('-b', '--no-bigcount', dest='bigcount', default=True,
                        action='store_false',
                        help='Do not count k-mers past 255')
    parser.add_argument('--machine-readable-info', '-m', default=None,
There's got to be a better parameter name than this :). @mr-c, any thoughts?
Maybe... --summary-info?
Apart from that one comment, LGTM.
Open to any suggestions for a better flag name. I agree the current flag is too long, but I can't think of a better one.
+1 to the shorter name. (Sent by email in reply to Kevin Murray's comment of Wed Nov 19 2014, 7:29:39 PM.)
Cool, I'll also make the short flag be |
This changes the overly long flag we added in PR dib-lab#649 to be a little shorter.
Force-pushed: c35f13e to d8e8aeb
Oops, I seem to have buggered up a rebase. Sorry!
Force-pushed: d8e8aeb to 5d47b86
That's better! All good from my end 😄 EDIT: Scratch that, found a bug.
Fixed issue with test, I think. At least it works on my machine.
print >>sys.stderr, 'Saving k-mer counting table to %s' % base
print >>sys.stderr, 'Loading kmers from sequences in %s' % repr(filenames)
print >>sys.stderr, 'Saving k-mer counting table', base
print >>sys.stderr, 'Loading kmers from sequences in', repr(filenames)
Let's not change these two lines.
Sorry, think that was the result of merge conflict resolution. Will revert.
with open(base + '.info', 'a') as info_fp:
print >> sys.stderr, "Writing run information to", base + '.info'
Repeat of line 136/172
Oops! Another merge conflict mistake. Will remove it
retest this please
I can confirm that the test is fixed.
We now use `with open(X) as y:` which automatically handles the closing of files.
This adds a CLI flag to load-into-counting.py allowing it to output the FPR, total number of distinct k-mers (cm-sketch occupancy) and input files in a machine/R/Excel-readable format, to save us from resorting to regex hacks to pull this info out of out.kh.info etc. I've also made the info files list all files used as input, and added a test of the TSV creation behaviour.
Changes:
modified: scripts/load-into-counting.py: new CLI flag, write all inputs to .info file
modified: tests/test_scripts.py: add test of the new CLI flag
Writing it to the info file had the wrong sentence.
This changes the CLI flag that enables machine-readable info so that, to enable this behaviour, you give it an argument of either 'json' or 'tsv'. This lets the user choose between a versioned, dat-friendly JSON and something easy to whack into R and plot.
Add a file format version key-val pair
We now use the choices kwarg to parser.add_argument to validate the --machine-readable-info argument, instead of manually checking it. Also updates the tests. modified: scripts/load-into-counting.py modified: tests/test_scripts.py
modified: ChangeLog
New API feature in ChangeLog on 2014-11-06. Also solves merge conflict (hopefully) modified: scripts/load-into-counting.py
This changes the overly long flag we added in PR dib-lab#649 to be a little shorter.
Forgot to update the expected stderr after a rebase changed it. modified: tests/test_scripts.py
Accidentally changed a few unrelated lines while fixing merge conflicts. Reverted as asked by @mr-c. Also removed a duplicated status line introduced in a similar manner. modified: scripts/load-into-counting.py
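The commit above that switches validation to the `choices` kwarg can be pictured with the following minimal sketch (Python 3). The flag names are taken from the diff hunk earlier in the thread; the `choices` list, the `metavar`, the help text, and the `build_parser` wrapper are assumptions for illustration, since the rest of the real `add_argument` call is not shown here.

```python
import argparse


def build_parser():
    """Build a parser whose format flag is validated by argparse itself."""
    parser = argparse.ArgumentParser(
        description="Sketch of validating a format flag with choices")
    parser.add_argument(
        "--machine-readable-info", "-m", default=None,
        choices=["json", "tsv"],  # argparse rejects anything else
        metavar="FORMAT",         # assumed, only affects the help output
        help="Write summary info in the given format (assumed help text)")
    return parser


parser = build_parser()
no_flag = parser.parse_args([])
with_tsv = parser.parse_args(["--machine-readable-info", "tsv"])
```

With `choices` set, an unsupported value makes argparse print a usage error and exit, so the script needs no manual check of the argument.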
Force-pushed: d564155 to 8fb778c
@mr-c's bugs fixed, rebased to remove merge conflict in ChangeLog.
This changes the overly long flag we added in PR #649 to be a little shorter.
@kdmurray91 Congratulations on your first commit to the khmer project! Your name will be included in the release notes for the next version and you'll be listed amongst our other contributors in the next software release paper.
Merged as 820710c
Hi all,
We're developing a few things based off khmer, and one of the metrics we're using is the total number of k-mers. I know this can be obtained programmatically via the khmer API (by loading in the counting hash and grabbing its occupancy), but this takes a while with many large hashes. So, currently, we're grepping through log files and .info files for the FPR and total number of kmers.
This PR adds a CLI flag to load-into-counting.py allowing it to output the FPR, total number of distinct k-mers (cm-sketch occupancy) and input files in a machine/R/spreadsheet-readable format (a tab-separated file). This saves us from resorting to regex hacks to pull this info out of out.kh.info and logs. I've added a test of the TSV creation behaviour too, and pep8 is clean.
An example of the TSV format is below (markdown table-ified for prettiness):
I've also made the info files contain all files used as input. (This can be reverted, I forgot to separate the commits)
Hope you find it helpful 😄
Cheers,
Kevin
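To illustrate why a tabular summary is easier to consume than grepping .info files, here is a minimal Python 3 sketch of reading such a file back. It does not reproduce the PR's actual layout: the column names (`fpr`, `n_kmers`, `files`), the sample values, and the `read_summary_tsv` helper are all invented for illustration.

```python
import csv
import io


def read_summary_tsv(fp):
    """Return one dict per data row, keyed by the header row."""
    return list(csv.DictReader(fp, delimiter="\t"))


# Made-up contents standing in for a summary file written by the script.
example = io.StringIO("fpr\tn_kmers\tfiles\n0.01\t1234\ta.fa;b.fa\n")
rows = read_summary_tsv(example)
fpr = float(rows[0]["fpr"])
n_kmers = int(rows[0]["n_kmers"])
```

All values come back as strings from the csv module, so numeric fields are converted explicitly.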