Add get_kmers() and get_kmer_counts() functions #1049

ctb · 2015-05-30T21:32:48Z

Closes #1047.

ctb · 2015-05-30T22:07:57Z

- Is it mergeable?
- Did it pass the tests?
- If it introduces new functionality in scripts/ is it tested? Check for code coverage with make clean diff-cover
- Is it well formatted? Look at make pep8, make diff_pylint_report, make cppcheck, and make doc output. Use make format and manual fixing as needed.
- Did it change the command-line interface? Only additions are allowed without a major version increment. Changing file formats also requires a major version number increment.
- Is it documented in the ChangeLog? http://en.wikipedia.org/wiki/Changelog#Format
- Was a spellchecker run on the source code and documentation after changes were made?
- Is the Copyright year up to date?

ctb · 2015-05-30T22:08:22Z

ready for review & merge @luizirber @camillescott

camillescott · 2015-06-05T02:00:42Z

lib/hashtable.cc

+                          std::vector<std::string> &kmers_vec) const
+{
+    if (s.length() < _ksize) {
+        return;


In other cases where the sequence is shorter than K, we raise an exception; is this a case of letting an error pass silently?

camillescott · 2015-06-05T02:03:27Z

I like get_kmer_counts; I'm not sure how I feel about get_kmers. This seems like something that should be done with an iterator and some pointer arithmetic, rather than generating L-K+1 new string objects. Thoughts?

ctb · 2015-06-05T02:14:09Z

Agree in theory. In practice I've been writing it in Python a lot, so I took the opportunity to implement it here :). Can we make it more independent of the underlying representation (i.e. not a vector of strings)? Either way, Python will need the whole copy thing... Ergh. We could have it do the hashing to numbers, maybe?
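
To make the "hashing to numbers" idea concrete, here is a minimal pure-Python sketch, not khmer's implementation. The function name `kmer_hashes` and the base-to-bits mapping in `ENCODING` are assumptions made for illustration, and the sketch ignores reverse complements entirely.

```python
# Hypothetical sketch: turn each k-mer of a DNA string into an integer using a
# rolling 2-bit encoding. The mapping below is an assumption, not khmer's.
ENCODING = {"A": 0, "C": 1, "G": 2, "T": 3}


def kmer_hashes(seq, ksize):
    """Return one integer per k-mer in seq; empty list if seq is shorter than ksize."""
    if len(seq) < ksize:
        return []
    mask = (1 << (2 * ksize)) - 1  # keep only the low 2*ksize bits
    hashes = []
    value = 0
    for i, base in enumerate(seq):
        # Shift in the next base and drop the base that fell out of the window.
        value = ((value << 2) | ENCODING[base]) & mask
        if i >= ksize - 1:
            hashes.append(value)
    return hashes
```

Returning integers avoids building L-K+1 string objects, at the cost of tying callers to one particular encoding.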

camillescott · 2015-06-05T02:17:12Z

To clarify: the value of get_kmers is obvious for internal use, but I think this is done inefficiently; for Python-land stuff, we should really add a generator function to the Read objects returned by ReadParser, which can then return views of the underlying arrays using the buffer/memoryview interface (i.e. without a bunch of copying).
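
A rough sketch of what such a generator could look like in plain Python, assuming the sequence is available as a bytes-like object. The name `iter_kmer_views` is hypothetical and is not part of khmer, ReadParser, or screed.

```python
def iter_kmer_views(seq, ksize):
    """Yield each k-mer of a bytes-like sequence as a memoryview slice.

    Slicing a memoryview does not copy the underlying buffer, so no new
    string objects are created for the k-mers themselves.
    """
    view = memoryview(seq)
    # range() is empty when the sequence is shorter than ksize.
    for start in range(len(view) - ksize + 1):
        yield view[start:start + ksize]


# Usage: copy only the k-mers that are actually needed.
data = b"ACGTA"
kmers = [bytes(v) for v in iter_kmer_views(data, 3)]
```

Note that `memoryview` only accepts objects that support the buffer protocol, so a Python 3 `str` would have to be encoded to `bytes` first.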

camillescott · 2015-06-05T02:20:27Z

(note: that doesn't actually fix the problem for screed; we should add a similar iterator to screed records as well. after all, our speciality is k-mer analysis ;))

camillescott · 2015-06-05T02:28:59Z

Moar: see https://docs.python.org/3.4/c-api/buffer.html#complex-arrays; I believe we can use the strides parameter to avoid copying the underlying data while returning views on the string. Alternatively, I can put away the clippers and just merge it...

ctb · 2015-06-05T12:17:34Z

Added get_kmer_hashes(); I think for now we should leave in get_kmers(), and if it becomes a performance issue we can revisit. (I don't like the idea of adding a lot of complexity around something that's new and unused!).

Remaining issue is whether or not get_kmers(), get_kmer_hashes(), and get_kmer_counts() should error out on strings of length < ksize. Since we're returning lists, I think it's OK to just return an empty list, as opposed to raising an error, which is what we should do when the return value is nonsensical. By this logic, things like 'consume' should not error out, but we can leave that for a different PR.
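
The empty-list behaviour argued for above can be sketched in pure Python as follows. This is a stand-in, not the khmer code: the free functions only borrow the method names from this PR, and a `collections.Counter` is assumed in place of the real counting table.

```python
from collections import Counter


def get_kmers(seq, ksize):
    """Return all k-mers of seq in order; an empty list if len(seq) < ksize."""
    # range() is empty for a negative stop, so short input needs no special case.
    return [seq[i:i + ksize] for i in range(len(seq) - ksize + 1)]


def get_kmer_counts(seq, ksize, table):
    """Return the stored count of each k-mer of seq, in order."""
    return [table[kmer] for kmer in get_kmers(seq, ksize)]


# Usage: a Counter stands in for the counting table (an assumption).
table = Counter(get_kmers("AAAAAA", 3))
counts = get_kmer_counts("AAAAA", 3, table)
short = get_kmer_counts("A", 3, table)  # too short: empty list, no exception
```

A sequence shorter than the k-mer size simply contains zero k-mers, so an empty list is a meaningful answer rather than a silently swallowed error.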

@camillescott review code & logic? :)

@luizirber any comments on the Python 3 implications raised in 6de148b?

luizirber · 2015-06-05T14:21:15Z

I had it almost fixed on refactor/py3 branch:
https://github.com/dib-lab/khmer/pull/978/files#diff-944b784821ddf1048fc73488ee2ee675R943

'Almost' because:

I didn't check for PyLong
The 'else' error message is wrong (should say int/long as well)

I noticed because one of the tests was calling hashtable.get() with
Unicode arguments, but the result was 0 (instead of 1).

I don't remember seeing other missing else statements, but it might be worth
checking more carefully.
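
The class of bug described here (an argument of an unhandled type falling through and quietly yielding 0) can be illustrated with a small Python sketch. This is not the code in _khmermodule.cc; `hash_kmer`, `get_count`, and the dict-based table are hypothetical names chosen for the example.

```python
# Hypothetical 2-bit mapping, for illustration only.
ENCODING = {"A": 0, "C": 1, "G": 2, "T": 3}


def hash_kmer(kmer):
    """Encode a DNA string as an integer, two bits per base."""
    value = 0
    for base in kmer:
        value = (value << 2) | ENCODING[base]
    return value


def get_count(counts, key):
    """Look up a count by integer hash value or by k-mer string."""
    if isinstance(key, int):
        hashval = key
    elif isinstance(key, bytes):
        hashval = hash_kmer(key.decode("ascii"))
    elif isinstance(key, str):
        hashval = hash_kmer(key)
    else:
        # Without this final branch, an unexpected type would fall through
        # and the caller would get a misleading result instead of an error.
        raise TypeError("key must be an int hash value or a k-mer (str or bytes)")
    return counts.get(hashval, 0)
```

A regression test would then call `get_count` with each accepted type and check that all of them return the same count.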

ctb · 2015-06-05T14:39:32Z

On Fri, Jun 05, 2015 at 07:21:15AM -0700, Luiz Irber wrote:

> I had it almost fixed on refactor/py3 branch:
> https://github.com/dib-lab/khmer/pull/978/files#diff-944b784821ddf1048fc73488ee2ee675R943
>
> 'Almost' because:
>
> I didn't check for PyLong
>
> The 'else' error message is wrong (should say int/long as well)

well, and no test to make sure it didn't happen again!

> I noticed because one of the tests was calling hashtable.get() with
> Unicode arguments, but the result was 0 (instead of 1).

yep :)

> I don't remember seeing other missing else statements, but it might be worth
> checking more carefully.

I checked in _khmermodule.cc, nothing else there.

ctb · 2015-06-08T21:04:21Z

ping @camillescott ready for review and merge

camillescott · 2015-06-09T03:12:26Z

I still find the implementation very distasteful, but I'll give it an LGTM

ctb · 2015-06-09T10:26:32Z

...and merge. AND MERGE... pretty please?

luizirber · 2015-06-09T14:59:38Z

I will fix the conflict and merge.

ctb · 2015-06-09T15:04:25Z

tnx

mr-c · 2015-06-14T09:27:09Z

tests/test_counting_hash.py

+
+    hi.consume("AAAAAA")
+    counts = hi.get_kmer_counts("A")
+    assert len(counts) == 0


@kdmurray91 points out that this appears to be a repeated test.

ctb added 3 commits May 30, 2015 16:23

cleaned up the hashtable_methods table a bit

d42177c

add get_kmers and get_kmer_counts to Hashtable objects

ae175c0

minor refactoring

4a66496

ctb added this to the 2.0 milestone May 31, 2015

ctb added 2 commits June 4, 2015 14:58

Merge branch 'master' of github.com:ged-lab/khmer into feature/kmer_c…

d1cf0b2

…ounts Conflicts: ChangeLog

Merge branch 'master' of github.com:ged-lab/khmer into feature/kmer_c…

527dac5

…ounts Conflicts: ChangeLog

camillescott reviewed Jun 5, 2015

ctb added 4 commits June 5, 2015 07:30

added test for Hashtable.get on numerical hash

56e1990

fixed nasty bug in Hashtable.get() at CPython layer

6de148b

added get_kmer_hashes()

1de9d5a

update ChangeLog

045e5f9

added some tests for short sequences

1f41c51

Merge branch 'master' into feature/kmer_counts

7be805b

luizirber added a commit that referenced this pull request Jun 9, 2015

Merge pull request #1049 from dib-lab/feature/kmer_counts

596b180

Add get_kmers() and get_kmer_counts() functions

luizirber merged commit 596b180 into master Jun 9, 2015

luizirber deleted the feature/kmer_counts branch June 9, 2015 15:28

mr-c reviewed Jun 14, 2015