Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add get_kmers() and get_kmer_counts() functions #1049

Merged
merged 11 commits into from
Jun 9, 2015
Merged

Conversation

ctb
Copy link
Member

@ctb ctb commented May 30, 2015

Closes #1047.

@ctb
Copy link
Member Author

ctb commented May 30, 2015

  • Is it mergeable?
  • Did it pass the tests?
  • If it introduces new functionality in scripts/ is it tested?
    Check for code coverage with make clean diff-cover
  • Is it well formatted? Look at make pep8, make diff_pylint_report,
    make cppcheck, and make doc output. Use make format and manual
    fixing as needed.
  • Did it change the command-line interface? Only additions are allowed
    without a major version increment. Changing file formats also requires a
    major version number increment.
  • Is it documented in the ChangeLog?
    http://en.wikipedia.org/wiki/Changelog#Format
  • Was a spellchecker run on the source code and documentation after
    changes were made?
  • Is the Copyright year up to date?

@ctb
Copy link
Member Author

ctb commented May 30, 2015

ready for review & merge @luizirber @camillescott

@ctb ctb added this to the 2.0 milestone May 31, 2015
std::vector<std::string> &kmers_vec) const
{
if (s.length() < _ksize) {
return;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In other cases where the sequence is shorter than K, we raise an exception; is this a case of letting an error pass silently?

@camillescott
Copy link
Member

I like get_kmer_counts; I'm not sure how I feel about get_kmers. This seems like something that should be done with an iterator and some pointer arithmetic, rather than generating L-K+1 new string objects. Thoughts?

@ctb
Copy link
Member Author

ctb commented Jun 5, 2015

Agree in theory. In practice I've been writing it in Python a lot so took the opportunity to implement it here :). Can we make it more independent of underlying representation (ie not vector of strings)? But either way Python will need the whole copy thing... Ergh. We could have it do the hashing to numbers, maybe?

Titus Brown, ctbrown@ucdavis.edu

On Jun 4, 2015, at 10:03 PM, Camille Scott notifications@github.com wrote:

I like get_kmer_counts; I'm not sure how I feel about get_kmers. This seems like something that should be done with an iterator and some pointer arithmetic, rather than generating L-K+1 new string objects. Thoughts?


Reply to this email directly or view it on GitHub.

@camillescott
Copy link
Member

To clarify: the value of get_kmers is obvious for internal use, but I think this is done inefficiently; for python-land stuff, we should really add a generator function to the Read objects returned by ReadParser, which can then return views of the underlying arrays using the buffer/memoryview interface (ie without a bunch of copying).

@camillescott
Copy link
Member

(note: that doesn't actually fix the problem for screed; we should add a similar iterator to screed records as well. after all, our speciality is k-mer analysis ;))

@camillescott
Copy link
Member

Moar: see https://docs.python.org/3.4/c-api/buffer.html#complex-arrays; I believe we can use the strides parameter to avoid copying the underlying data while returning views on the string. Alternatively, I can put away the clippers and just merge it...

@ctb
Copy link
Member Author

ctb commented Jun 5, 2015

Added get_kmer_hashes(); I think for now we should leave in get_kmers(), and if it becomes a performance issue we can revisit. (I don't like the idea of adding a lot of complexity around something that's new and unused!).

Remaining issue is whether or not get_kmers(), get_kmer_hashes(), and get_kmer_counts() should error out on strings of length < len(ksize). Since we're returning lists, I think it's OK to just return an empty list, as opposed to raising an error, which is what we should do when the return value is nonsensical. By this logic, things like 'consume' should not error out, but we can leave that for a different PR.

@camillescott review code & logic? :)

@luizirber any comments on the Python 3 implications raised in 6de148b?

@luizirber
Copy link
Member

I had it almost fixed on refactor/py3 branch:
https://github.com/dib-lab/khmer/pull/978/files#diff-944b784821ddf1048fc73488ee2ee675R943

'Almost' because:

  • I didn't check for PyLong
  • The 'else' error message is wrong (should say int/long as well)

I noticed because one of the tests was calling hashtable.get() with
Unicode arguments, but the result was 0 (instead of 1).

I don't remember seeing missing else statements, but might be worth to
check it better.

@ctb
Copy link
Member Author

ctb commented Jun 5, 2015

On Fri, Jun 05, 2015 at 07:21:15AM -0700, Luiz Irber wrote:

I had it almost fixed on refactor/py3 branch:
https://github.com/dib-lab/khmer/pull/978/files#diff-944b784821ddf1048fc73488ee2ee675R943

'Almost' because:

  • I didn't check for PyLong
  • The 'else' error message is wrong (should say int/long as well)

well, and no test to make sure it didn't happen again!

I noticed because one of the tests was calling hashtable.get() with
Unicode arguments, but the result was 0 (instead of 1).

yep :)

I don't remember seeing missing else statements, but might be worth to
check it better.

I checked in _khmermodule.cc, nothing else there.

@ctb
Copy link
Member Author

ctb commented Jun 8, 2015

ping @camillescott ready for review and merge

@camillescott
Copy link
Member

I still find the implementation very distasteful, but I'll give it an LGTM

@ctb
Copy link
Member Author

ctb commented Jun 9, 2015

...and merge. AND MERGE... pretty please?

@luizirber
Copy link
Member

I will fix the conflict and merge.

@ctb
Copy link
Member Author

ctb commented Jun 9, 2015

tnx

On Tue, Jun 09, 2015 at 07:59:38AM -0700, Luiz Irber wrote:

I will fix the conflict and merge.


Reply to this email directly or view it on GitHub:

#1049 (comment)

C. Titus Brown, ctbrown@ucdavis.edu

luizirber added a commit that referenced this pull request Jun 9, 2015
Add get_kmers() and get_kmer_counts() functions
@luizirber luizirber merged commit 596b180 into master Jun 9, 2015
@luizirber luizirber deleted the feature/kmer_counts branch June 9, 2015 15:28

hi.consume("AAAAAA")
counts = hi.get_kmer_counts("A")
assert len(counts) == 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kdmurray91 points out that this appears to be a repeated test.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add graph function that returns list of k-mer counts in a sequence
4 participants