Add CLI flag to load-into-counting.py allowing TSV output of basic stats #649

kdm9 · 2014-11-06T04:26:20Z

Hi all,

We're developing a few things based off khmer, and one of the metrics we're using is the total number of k-mers. I know this can be obtained programmatically via the khmer API (by loading in the counting hash and grabbing its occupancy), but this takes a while with many large hashes. So, currently, we're grepping through log files and .info files for the FPR and total number of kmers.

This PR adds a CLI flag to load-into-counting.py allowing it to output the FPR, total number of distinct k-mers (cm-sketch occupancy) and input files in a machine/R/spreadsheet readable format (tab-seperated file). This saves us from resorting to regex hacks to pull this info out of out.kh.info and logs. I've added a test of the tsv creation behaviour too, and pep8 is clean.

An example of the tsv format is below (markdown table-ified for prettyness):

python ./scripts/load-into-counting.py  --machine-readable-info -x 1e7 -N 4 -k 20 tst data/100k-filtered.fa data/100k-surrendered.fa

name	fpr	kmers	files
tst	0.010	3167409	data/100k-filtered.fa;data/100k-surrendered.fa

I've also make the info files contain all files used as input. (This can be reverted, I forgot to separate the commits)

Hope you find it helpful 😄

Cheers,
Kevin

ged-jenkins · 2014-11-06T04:26:21Z

Can one of the admins verify this patch?

ged-jenkins · 2014-11-06T04:26:21Z

Can one of the admins verify this patch?

kdm9 · 2014-11-06T04:27:48Z

Is it mergable?
Did it pass the tests?
If it introduces new functionality in scripts/ is it tested?
Check for code coverage.
Is it well formatted? Look at make pep8, make diff_pylint_report,
make cppcheck, and make doc output. Use make format and manual fixing as needed.
Is it documented in the ChangeLog?
Was a spellchecker run on the source code and documentation after
changes were made?

mr-c · 2014-11-06T20:11:26Z

add to testlist

mr-c · 2014-11-06T20:47:43Z

scripts/load-into-counting.py


+    n_kmers = htable.n_occupied()


Breaking news: see #562 (comment). A new n_kmers() function is pending.

Very nice, was wondering if this could be implemented 😄

Will update this PR once that one is merged, unless you'd prefer me to do so now.

ctb · 2014-11-07T15:06:43Z

Question/thought for you and @mr-c -- what about providing this in a JSON or YAML format, instead of tsv? More structured formats could allow for later expansion into additional information, and I think we could put semantic versioning to use by specifying the minimum information in the file by version, allowing for flexible expansion. (This was an idle thought I had yesterday while wandering around.)

luizirber · 2014-11-07T15:22:47Z

And, if going the JSON route, can it be a newline delimited JSON? This would make it compatible with dat, and is still reasonably easy to parse.

kdm9 · 2014-11-08T04:22:07Z

Yes, after hacking with this a bit, I like the idea of something more extensible than a TSV (particularly something both versioned and dat-friendly). Although that has the disadvantage of being less easy to just glob into R for a quick plot.

Maybe the CLI flag could take a format parameter that defaults to JSON if not given, but can also do something tabular like TSV for easy interfacing with R and friends.

mr-c · 2014-11-15T03:14:27Z

Do you mind rebasing off of our master branch?

kdm9 · 2014-11-15T03:45:43Z

I've also been more consistent in using the print >>fp, 'x' syntax rather than fp.write()

mr-c · 2014-11-17T22:50:00Z

Looking good so far @kdmurray91. I think you're ready for a ChangeLog entry.

kdm9 · 2014-11-18T08:16:05Z

Will do this tonight and push it up 😄

kdm9 · 2014-11-18T08:16:37Z

Also, do I need to rebase again? GH is complaining about merge conflicts

kdm9 · 2014-11-18T12:21:39Z

Rebased against master, and changed the script to use ht.n_unique_kmers() as suggested by @mr-c on an outdated diff somewhere I think 😄

ctb · 2014-11-19T21:48:14Z

scripts/load-into-counting.py

@@ -54,6 +56,10 @@ def get_parser():
    parser.add_argument('-b', '--no-bigcount', dest='bigcount', default=True,
                        action='store_false',
                        help='Do not count k-mers past 255')
+    parser.add_argument('--machine-readable-info', '-m', default=None,


There's got to be a better parameter name than this :). @mr-c, any thoughts?

Maybe... --summary-info?

ctb · 2014-11-19T21:49:24Z

Apart from that one comment, LGTM.

kdm9 · 2014-11-20T00:29:36Z

Open to any suggestions for a better flag name. I agree the current flag is overly long and too verbose, but can't think of a better one. --summary-info isn't bad. I'll see what @mr-c says before implementing it though...

mr-c · 2014-11-20T00:39:06Z

+1 to the shorter name

On Wed Nov 19 2014 at 7:29:39 PM Kevin Murray notifications@github.com
wrote:

Open to any suggestions for a better flag name. I agree the current flag
is overly long and too verbose, but can't think of a better one.
--summary-info isn't bad. I'll see what @mr-c https://github.com/mr-c
says before implementing it though...

—
Reply to this email directly or view it on GitHub
#649 (comment).

kdm9 · 2014-11-20T00:43:47Z

Cool, I'll also make the short flag be -s.

This changes the overly long flag we added in PR dib-lab#649 to be a little shorter.

kdm9 · 2014-11-20T04:46:12Z

Ooops, I seem to have buggered up a rebase. Sorry!

This changes the overly long flag we added in PR dib-lab#649 to be a little shorter.

kdm9 · 2014-11-20T04:48:56Z

That's better! All good from my end 😄

EDIT: Scratch that, found a bug.

kdm9 · 2014-11-20T05:13:00Z

Fixed issue with test, I think. At least it works on my machine.

mr-c · 2014-11-20T17:29:59Z

scripts/load-into-counting.py

-    print >>sys.stderr, 'Saving k-mer counting table to %s' % base
-    print >>sys.stderr, 'Loading kmers from sequences in %s' % repr(filenames)
+    print >>sys.stderr, 'Saving k-mer counting table',  base
+    print >>sys.stderr, 'Loading kmers from sequences in', repr(filenames)


Lets not change these two lines

Sorry, think that was the result of merge conflict resolution. Will revert.

mr-c · 2014-11-20T17:36:12Z

scripts/load-into-counting.py


+    with open(base + '.info', 'a') as info_fp:
+        print >> sys.stderr, "Writing run information to", base + '.info'


Repeat of line 136/172

Oops! Another merge conflict mistake. Will remove it

mr-c · 2014-11-20T17:39:38Z

retest this please

mr-c · 2014-11-20T17:52:40Z

I can confirm that the test is fixed

We now use `with open(X) as y:` which automatically handles the closing of files.

This adds a CLI flag to load-into-counting.py allowing it to output the FPR, total number of distinct k-mers (cm-sketch occupancy) and input files in a machine/R/excel readable format to save us from resorting to regex hacks to pull this info out of out.kh.info etc. I've also make the info files contain all files used as input. I've also also added a test of the tsv creation behaviour. Changes: modified: scripts/load-into-counting.py: new cli flag, write all inputs to .info file modified: tests/test_scripts.py: add test of the new cli flag

Writing it to the info file had the wrong sentence.

This changes the CLI flag that enables machine readable info so that to enalbe this behaviour, you give it an argument of either 'json' or 'tsv'. This allows for a user choice of a versioned, dat-friendly json or something easy to whack into R and plot.

Add a file format version key-val pair

We now use the choices kwarg to parser.add_argument to validate the ---machine-readable-info argument, instead of manually checking it. Also, updates the tests. modified: scripts/load-into-counting.py modified: tests/test_scripts.py

modified: ChangeLog

New API feature in ChangeLog on 2014-11-06. Also solves merge conflict (hopefully) modified: scripts/load-into-counting.py

This changes the overly long flag we added in PR dib-lab#649 to be a little shorter.

Forgot to update the expected stderr after a rebase changed it. modified: tests/test_scripts.py

@mr-c

Accidentally changed a few unrelated issues while fixing merge conflcits. Reverted as asked to by @mr-c. Also remove duplicated status line introduced in a similar manner. modified: scripts/load-into-counting.py

kdm9 · 2014-11-21T00:33:38Z

@mr-c's bugs fixed, rebased to remove merge conflict in ChangeLog.

This changes the overly long flag we added in PR #649 to be a little shorter.

mr-c · 2014-12-01T18:53:01Z

@kdmurray91 Congratulations on your first commit to the khmer project! Your name will be included in the release notes for the next version and you'll be listed amongst our other contributors in the next software release paper.

mr-c · 2014-12-01T18:53:29Z

Merged as 820710c

mr-c reviewed Nov 6, 2014
View reviewed changes

kdm9 force-pushed the feature/load-into-counting_tsv-output branch from ceea16d to af47c33 Compare November 15, 2014 03:36

kdm9 force-pushed the feature/load-into-counting_tsv-output branch from 593c10b to 455c856 Compare November 18, 2014 12:20

ctb reviewed Nov 19, 2014
View reviewed changes

kdm9 added a commit to kdm9/khmer that referenced this pull request Nov 20, 2014

Change --machine-readable-info to --summary-info

c35f13e

This changes the overly long flag we added in PR dib-lab#649 to be a little shorter.

kdm9 force-pushed the feature/load-into-counting_tsv-output branch from c35f13e to d8e8aeb Compare November 20, 2014 04:35

kdm9 added a commit to kdm9/khmer that referenced this pull request Nov 20, 2014

Change --machine-readable-info to --summary-info

d8e8aeb

This changes the overly long flag we added in PR dib-lab#649 to be a little shorter.

kdm9 force-pushed the feature/load-into-counting_tsv-output branch from d8e8aeb to 5d47b86 Compare November 20, 2014 04:48

kdm9 added a commit to kdm9/khmer that referenced this pull request Nov 20, 2014

Change --machine-readable-info to --summary-info

5d47b86

This changes the overly long flag we added in PR dib-lab#649 to be a little shorter.

mr-c reviewed Nov 20, 2014
View reviewed changes

kdm9 added 12 commits November 21, 2014 11:32

Close open file handles in load-into-counting.py

d2dc3c7

We now use `with open(X) as y:` which automatically handles the closing of files.

load-into-counting: Fix typo in "total kmers" line

74bebd7

Writing it to the info file had the wrong sentence.

load-into-counting: Less confusing info print line

90103a4

Add format version to load-in2-counting json info

48338b5

Add a file format version key-val pair

Use argparse's choices param to validate arg

f3c5eca

We now use the choices kwarg to parser.add_argument to validate the ---machine-readable-info argument, instead of manually checking it. Also, updates the tests. modified: scripts/load-into-counting.py modified: tests/test_scripts.py

Update changelog with load-into-counting changes

e6c796e

modified: ChangeLog

load-in22-count: use new ht.n_unique_kmers method

4a3dc46

New API feature in ChangeLog on 2014-11-06. Also solves merge conflict (hopefully) modified: scripts/load-into-counting.py

Change --machine-readable-info to --summary-info

064c1d0

This changes the overly long flag we added in PR dib-lab#649 to be a little shorter.

Fix broken test in test_scripts.py

d73153a

Forgot to update the expected stderr after a rebase changed it. modified: tests/test_scripts.py

Revert erroneous merge conflict resolution

8fb778c

Accidentally changed a few unrelated issues while fixing merge conflcits. Reverted as asked to by @mr-c. Also remove duplicated status line introduced in a similar manner. modified: scripts/load-into-counting.py

kdm9 force-pushed the feature/load-into-counting_tsv-output branch from d564155 to 8fb778c Compare November 21, 2014 00:33

mr-c pushed a commit that referenced this pull request Dec 1, 2014

Change --machine-readable-info to --summary-info

72036df

This changes the overly long flag we added in PR #649 to be a little shorter.

mr-c closed this Dec 1, 2014

kdm9 deleted the feature/load-into-counting_tsv-output branch March 3, 2015 03:25

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add CLI flag to load-into-counting.py allowing TSV output of basic stats #649

Add CLI flag to load-into-counting.py allowing TSV output of basic stats #649

kdm9 commented Nov 6, 2014

ged-jenkins commented Nov 6, 2014

ged-jenkins commented Nov 6, 2014

kdm9 commented Nov 6, 2014

mr-c commented Nov 6, 2014

mr-c Nov 6, 2014

kdm9 Nov 7, 2014

ctb commented Nov 7, 2014

luizirber commented Nov 7, 2014

kdm9 commented Nov 8, 2014

mr-c commented Nov 15, 2014

kdm9 commented Nov 15, 2014

mr-c commented Nov 17, 2014

kdm9 commented Nov 18, 2014

kdm9 commented Nov 18, 2014

kdm9 commented Nov 18, 2014

ctb Nov 19, 2014

ctb commented Nov 19, 2014

kdm9 commented Nov 20, 2014

mr-c commented Nov 20, 2014

kdm9 commented Nov 20, 2014

kdm9 commented Nov 20, 2014

kdm9 commented Nov 20, 2014

kdm9 commented Nov 20, 2014

mr-c Nov 20, 2014

kdm9 Nov 21, 2014

mr-c Nov 20, 2014

kdm9 Nov 21, 2014

mr-c commented Nov 20, 2014

mr-c commented Nov 20, 2014

kdm9 commented Nov 21, 2014

mr-c commented Dec 1, 2014

mr-c commented Dec 1, 2014


		with open(base + '.info', 'a') as info_fp:
		print >> sys.stderr, "Writing run information to", base + '.info'

Add CLI flag to load-into-counting.py allowing TSV output of basic stats #649

Add CLI flag to load-into-counting.py allowing TSV output of basic stats #649

Conversation

kdm9 commented Nov 6, 2014

ged-jenkins commented Nov 6, 2014

ged-jenkins commented Nov 6, 2014

kdm9 commented Nov 6, 2014

mr-c commented Nov 6, 2014

mr-c Nov 6, 2014

Choose a reason for hiding this comment

kdm9 Nov 7, 2014

Choose a reason for hiding this comment

ctb commented Nov 7, 2014

luizirber commented Nov 7, 2014

kdm9 commented Nov 8, 2014

mr-c commented Nov 15, 2014

kdm9 commented Nov 15, 2014

mr-c commented Nov 17, 2014

kdm9 commented Nov 18, 2014

kdm9 commented Nov 18, 2014

kdm9 commented Nov 18, 2014

ctb Nov 19, 2014

Choose a reason for hiding this comment

ctb commented Nov 19, 2014

kdm9 commented Nov 20, 2014

mr-c commented Nov 20, 2014

kdm9 commented Nov 20, 2014

kdm9 commented Nov 20, 2014

kdm9 commented Nov 20, 2014

kdm9 commented Nov 20, 2014

mr-c Nov 20, 2014

Choose a reason for hiding this comment

kdm9 Nov 21, 2014

Choose a reason for hiding this comment

mr-c Nov 20, 2014

Choose a reason for hiding this comment

kdm9 Nov 21, 2014

Choose a reason for hiding this comment

mr-c commented Nov 20, 2014

mr-c commented Nov 20, 2014

kdm9 commented Nov 21, 2014

mr-c commented Dec 1, 2014

mr-c commented Dec 1, 2014