Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG] Creating an inflate function to copy abundances from one MinHash to another #1620

Merged
merged 28 commits into from
Aug 8, 2021

Conversation

keyabarve
Copy link
Contributor

@keyabarve keyabarve commented Jun 21, 2021

Fixes #1463

Done:

  • Replaces the code block in commands.gather().
  • Creates a new method called inflate in the MinHash class of src/sourmash/minhash.py.
  • This method takes one argument in addition to self, a MinHash object called from_mh with track_abundance=True.
  • It also creates a new MinHash object that borrows the abundances from from_mh using only the hashes from self.
  • Writes one test in tests/test_minhash.py for testing the function.
  • Write the same method in the Frozen MinHash class of src/sourmash/minhash.py.
  • Add some code for error checking.
  • Write tests for the new code.
  • Check if the code can apply to multigather as well.

@codecov-commenter
Copy link

codecov-commenter commented Jun 21, 2021

Codecov Report

Merging #1620 (b361cbd) into latest (79cb35e) will increase coverage by 7.36%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           latest    #1620      +/-   ##
==========================================
+ Coverage   82.64%   90.00%   +7.36%     
==========================================
  Files         114       87      -27     
  Lines       12189     8391    -3798     
  Branches     1554     1555       +1     
==========================================
- Hits        10073     7552    -2521     
+ Misses       1855      578    -1277     
  Partials      261      261              
Flag Coverage Δ
python 90.00% <100.00%> (+<0.01%) ⬆️
rust ?

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
src/sourmash/commands.py 87.29% <100.00%> (-0.09%) ⬇️
src/sourmash/minhash.py 92.53% <100.00%> (+0.15%) ⬆️
src/core/src/index/sbt/mod.rs
src/core/src/index/storage.rs
src/core/src/from.rs
src/core/src/sketch/nodegraph.rs
src/core/src/sketch/minhash.rs
src/core/src/index/search.rs
src/core/src/sketch/hyperloglog/mod.rs
src/core/src/sketch/hyperloglog/estimators.rs
... and 19 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 79cb35e...b361cbd. Read the comment docs.

@keyabarve
Copy link
Contributor Author

@ctb Please review. Does the function seem fine in both the MinHash and FrozenMinHash classes? I'm not sure what this means: replace the code block in commands.gather() with using this new function (and see if it can apply to multigather); Could you please clarify?

@ctb
Copy link
Contributor

ctb commented Jun 21, 2021

@ctb Please review. Does the function seem fine in both the MinHash and FrozenMinHash classes?

well, see below:

I'm not sure what this means: replace the code block in commands.gather() with using this new function (and see if it can apply to multigather); Could you please clarify?

Do you see the code block from #1463 in src/sourmash/commands.py?

@keyabarve
Copy link
Contributor Author

@ctb Please review. Does the function seem fine in both the MinHash and FrozenMinHash classes?

well, see below:

I'm not sure what this means: replace the code block in commands.gather() with using this new function (and see if it can apply to multigather); Could you please clarify?

Do you see the code block from #1463 in src/sourmash/commands.py?

No, I don't see it.

@ctb
Copy link
Contributor

ctb commented Jun 21, 2021

@ctb Please review. Does the function seem fine in both the MinHash and FrozenMinHash classes?

well, see below:

I'm not sure what this means: replace the code block in commands.gather() with using this new function (and see if it can apply to multigather); Could you please clarify?

Do you see the code block from #1463 in src/sourmash/commands.py?

No, I don't see it.

Please see block starting here: https://github.com/sourmash-bio/sourmash/blob/latest/src/sourmash/commands.py#L804

@keyabarve
Copy link
Contributor Author

keyabarve commented Jun 21, 2021

@ctb Please review. Does the function seem fine in both the MinHash and FrozenMinHash classes?

well, see below:

I'm not sure what this means: replace the code block in commands.gather() with using this new function (and see if it can apply to multigather); Could you please clarify?

Do you see the code block from #1463 in src/sourmash/commands.py?

No, I don't see it.

Please see block starting here: https://github.com/sourmash-bio/sourmash/blob/latest/src/sourmash/commands.py#L804

This chunk of code is not present in my current code. Is there a way that I can get this code by pulling from a certain branch perhaps? Should I maybe update from latest?

@ctb
Copy link
Contributor

ctb commented Jun 21, 2021

This chunk of code is not present in my current code. Is there a way that I can get this code by pulling from a certain branch perhaps? Should I maybe update from latest?

yes, that link is a pointer to the latest branch.

@keyabarve
Copy link
Contributor Author

This chunk of code is not present in my current code. Is there a way that I can get this code by pulling from a certain branch perhaps? Should I maybe update from latest?

yes, that link is a pointer to the latest branch.

I still can't see that particular chunk of code, even after merging the branches.

@ctb
Copy link
Contributor

ctb commented Jun 21, 2021

This chunk of code is not present in my current code. Is there a way that I can get this code by pulling from a certain branch perhaps? Should I maybe update from latest?

yes, that link is a pointer to the latest branch.

I still can't see that particular chunk of code, even after merging the branches.

I'm not sure what to tell you - it's in the pull request now,

https://github.com/sourmash-bio/sourmash/blob/KB_1463/src/sourmash/commands.py#L804

Did you need to reload the file in your editor, perhaps?

@keyabarve
Copy link
Contributor Author

I reloaded it as well, but I still can't see it.

@ctb
Copy link
Contributor

ctb commented Jun 21, 2021

can you copy and paste the output of:

git fetch
git status

here, please? thanks!

@keyabarve
Copy link
Contributor Author

keyabarve commented Jun 21, 2021

This is all for git fetch:

remote: Enumerating objects: 10, done.
remote: Counting objects: 100% (10/10), done.
remote: Compressing objects: 100% (3/3), done.
remote: Total 10 (delta 7), reused 10 (delta 7), pack-reused 0
Unpacking objects: 100% (10/10), 1.36 KiB | 99.00 KiB/s, done.
From https://github.com/dib-lab/sourmash
   e677908c..a35bfa90  add-taxonomy -> origin/add-taxonomy

This is all for git status:

On branch KB_1463
Your branch is up to date with 'origin/KB_1463'.

nothing to commit, working tree clean

@ctb
Copy link
Contributor

ctb commented Jun 21, 2021

ok - on line 804 of src/sourmash/commands.py in branch KB_1463 on your computer, do you see this code?

           if is_abundance:
                # remaining_query is flattened; reinflate abundances            
                hashes = set(remaining_query.minhash.hashes)
                orig_abunds = orig_query_mh.hashes
                abunds = { h: orig_abunds[h] for h in hashes }

                abund_query_mh = orig_query_mh.copy_and_clear()
                # orig_query might have been downsampled...                     
                abund_query_mh.downsample(scaled=gather_iter.scaled)
                abund_query_mh.set_abundances(abunds)
                remaining_query.minhash = abund_query_mh

@keyabarve
Copy link
Contributor Author

ok - on line 804 of src/sourmash/commands.py in branch KB_1463 on your computer, do you see this code?

           if is_abundance:
                # remaining_query is flattened; reinflate abundances            
                hashes = set(remaining_query.minhash.hashes)
                orig_abunds = orig_query_mh.hashes
                abunds = { h: orig_abunds[h] for h in hashes }

                abund_query_mh = orig_query_mh.copy_and_clear()
                # orig_query might have been downsampled...                     
                abund_query_mh.downsample(scaled=gather_iter.scaled)
                abund_query_mh.set_abundances(abunds)
                remaining_query.minhash = abund_query_mh

okay, yes I see it now.

@keyabarve
Copy link
Contributor Author

keyabarve commented Jun 22, 2021

           if is_abundance:
                # remaining_query is flattened; reinflate abundances            
                hashes = set(remaining_query.minhash.hashes)
                orig_abunds = orig_query_mh.hashes
                abunds = { h: orig_abunds[h] for h in hashes }

                abund_query_mh = orig_query_mh.copy_and_clear()
                # orig_query might have been downsampled...                     
                abund_query_mh.downsample(scaled=gather_iter.scaled)
                abund_query_mh.set_abundances(abunds)
                remaining_query.minhash = abund_query_mh

So, this chunk of code is to be removed from here and written in the function inflate that I wrote in src/sourmash/minhash.py. Is this correct?

@ctb
Copy link
Contributor

ctb commented Jun 22, 2021 via email

@keyabarve
Copy link
Contributor Author

do I still need to keep the code that I've written for the inflate function already?

@ctb
Copy link
Contributor

ctb commented Jun 23, 2021 via email

@keyabarve
Copy link
Contributor Author

I wrote a separate function in src/sourmash/commands.py with the code from the if block and it passed all the tests. Should I now remove that function from there and put it in the MinHash class in src/sourmash/minhash.py?

@ctb
Copy link
Contributor

ctb commented Jun 25, 2021

I wrote a separate function in src/sourmash/commands.py with the code from the if block and it passed all the tests. Should I now remove that function from there and put it in the MinHash class in src/sourmash/minhash.py?

yes, I think so - some suggestions,

  • note that gather_iter.scaled is an integer and should be the same value as remaining_query.minhash.scaled
  • the call from within commands.py should be writeable as something like abund_query_mh = remaining_query.minhash.inflate(orig_query_mh). Here, remaining_query.minhash, orig_query_mh, and abund_query_mh are all MinHash objects.

@keyabarve
Copy link
Contributor Author

When I remove the function from src/sourmash/commands.py, should I even remove the if block where the function was being called?

@ctb
Copy link
Contributor

ctb commented Jun 25, 2021

When I remove the function from src/sourmash/commands.py, should I even remove the if block where the function was being called?

get something working before fine tuning it :)

@keyabarve
Copy link
Contributor Author

I had a couple of questions:

  • I'm confused about how to do this: "this method should take one argument in addition to self, a MinHash object with track_abundance=True;". So should the function look like:
def inflate (self, obj):
...

where the track_abundance of obj is true?

  • Should I check if self.track_abundance or if obj.track_abundance before I do anything in the function?
  • In my current function in src/sourmash/commands.py, the arguments are: remaining_query, orig_query_mh and gather_iter. How do I replace these?
  • Where would these variables above be defined then?

@ctb
Copy link
Contributor

ctb commented Jul 21, 2021

So, it looks like tests/test_sourmash.py::test_gather_output_unassigned_with_abundance is failing because track_abundance is False on the MinHash object loaded from unassigned.sig. Could you tell me where in the codebase this MinHash object is originally created before being saved to unassigned.sig?

By 'this MinHashobject', do you mean nomatch?

The MinHash object on the signature loaded from unassigned.sig, yes.

@keyabarve
Copy link
Contributor Author

keyabarve commented Jul 23, 2021

@ctb I believe it is this:

orig_query_mh = query.minhash

in def gather in src/sourmash/commands.py.

@ctb
Copy link
Contributor

ctb commented Jul 23, 2021

alas, no :).

let's trace that back - the signature is loaded from unassigned.sig, which is produced by --output-unassigned' unassigned.sig in the test code. Where is the code in the gather function in commands.py that saves something to the filename specified by the --output-unassigned argument to gather?

@keyabarve
Copy link
Contributor Author

I believe it is this code:

if args.output_unassigned:
        remaining_query = gather_iter.query
        if not (remaining_query.minhash or noident_mh):
            notify('no unassigned hashes to save with --output-unassigned!')
        else:
            notify(f"saving unassigned hashes to '{args.output_unassigned}'")

            if noident_mh:
                remaining_mh = remaining_query.minhash.to_mutable()
                remaining_mh += noident_mh
                remaining_query.minhash = remaining_mh

            if is_abundance:
                abund_query_mh = remaining_query.minhash.inflate(orig_query_mh)

            with FileOutput(args.output_unassigned, 'wt') as fp:
                sig.save_signatures([ remaining_query ], fp)

@ctb
Copy link
Contributor

ctb commented Jul 23, 2021

Yep. When you look at the diff from the code that was there before (and was passing the test), do you see anything missing from the current code in comparison to the old code that might be triggering the test failure?

@keyabarve
Copy link
Contributor Author

keyabarve commented Jul 23, 2021

This was the original code:

if is_abundance:
                # remaining_query is flattened; reinflate abundances
                hashes = set(remaining_query.minhash.hashes)
                orig_abunds = orig_query_mh.hashes
                abunds = { h: orig_abunds[h] for h in hashes }

                abund_query_mh = orig_query_mh.copy_and_clear()
                # orig_query might have been downsampled...
                abund_query_mh.downsample(scaled=gather_iter.scaled)
                abund_query_mh.set_abundances(abunds)
                remaining_query.minhash = abund_query_mh

And this is the new code in the inflate function:

if (self.track_abundance == False) and (from_mh.track_abundance == True):
            hashes = set(self.hashes)
            orig_abunds = from_mh.hashes
            abunds = { h: orig_abunds[h] for h in hashes }

            abund_query_mh = from_mh.copy_and_clear()

            abund_query_mh.downsample(scaled=self.scaled)
            abund_query_mh.set_abundances(abunds)
            self.minhash = abund_query_mh

            return abund_query_mh
else:
            raise ValueError('value of track_abundance for self should be false and from_mh should be true') 

The only differences are that remaining_query is now self and orig_query_mh is from_mh. Apart from that, the if condition is also different: if is_abundance: is now if (self.track_abundance == False) and (from_mh.track_abundance == True):

@ctb
Copy link
Contributor

ctb commented Jul 24, 2021

Here, remaining_query is a SourmashSignature, not a MinHash object. So you'll need to set remaining_query.minhash to the minhash that was created by the inflate method, as in the original code.

@keyabarve
Copy link
Contributor Author

keyabarve commented Jul 24, 2021

Here, remaining_query is a SourmashSignature, not a MinHash object. So you'll need to set remaining_query.minhash to the minhash that was created by the inflate method, as in the original code.

Oh I see, so then it would be self.minhash.hashes instead of self.hashes. But when I change that, it still fails this test, and in addition, it also fails one of the tests that I wrote.
This is the error for tests/test_sourmash.py::test_gather_output_unassigned_with_abundance:

        if self.last_result.status:
>           raise ValueError(self)
E           ValueError: Last command run:
E           'sourmash gather /Users/keya/sourmash/tests/test-data/gather-abund/reads-s10x10-s11.sig /Users/keya/sourmash/tests/test-data/gather-abund/genome-s10.fa.gz.sig --output-unassigned /var/folders/6l/gx238bkn0qdbvjyl3zxxnm_40000gn/T/sourmashtest_qo65mdfk/unassigned.sig --no-linear --no-prefetch'
E           
E           LAST RESULT:
E           - exit code: -1
E           
E           - stdout:
E           ---
E           
E           
E           overlap     p_query p_match avg_abund
E           ---------   ------- ------- ---------
E           0.5 Mbp       91.0%  100.0%      14.5    tests/test-data/genome-s10.fa.gz
E           
E           found 1 matches total;
E           the recovered matches hit 91.0% of the query
E           
E           ---
E           - stderr:
E           ---
select query k=21 automatically.
loaded query: 1-1... (k=21, DNA)
loaded 1 signatures.
found less than 50.0 kbp in common. => exiting
saving unassigned hashes to '/var/folders/6l/gx238bkn0qdbvjyl3zxxnm_40000gn/T/sourmashtest_qo65mdfk/unassigned.sig'
E           Traceback (most recent call last):
E             File "/Users/keya/sourmash/tests/sourmash_tst_utils.py", line 105, in runscript
E               status = _runscript(scriptname)
E             File "/Users/keya/sourmash/tests/sourmash_tst_utils.py", line 47, in _runscript
E               pkg_resources.load_entry_point("sourmash", 'console_scripts', scriptname)()
E             File "/Users/keya/sourmash/src/sourmash/__main__.py", line 13, in main
E               return mainmethod(args)
E             File "/Users/keya/sourmash/src/sourmash/cli/gather.py", line 86, in main
E               return sourmash.commands.gather(args)
E             File "/Users/keya/sourmash/src/sourmash/commands.py", line 803, in gather
E               abund_query_mh = remaining_query.minhash.inflate(orig_query_mh)
E             File "/Users/keya/sourmash/src/sourmash/minhash.py", line 797, in inflate
E               hashes = set(self.minhash.hashes)
E           AttributeError: 'FrozenMinHash' object has no attribute 'minhash'
E           ---

tests/sourmash_tst_utils.py:189: ValueError

@ctb
Copy link
Contributor

ctb commented Jul 28, 2021

please see #1691

@keyabarve
Copy link
Contributor Author

I have merged the branch created by @ctb to resolve some of the issues and was wondering if there is anything more to be done. I have written 2 tests already to test the inflate function.

@ctb
Copy link
Contributor

ctb commented Aug 6, 2021

I have merged the branch created by @ctb to resolve some of the issues and was wondering if there is anything more to be done. I have written 2 tests already to test the inflate function.

please read and fix and/or resolve all of the issues mentioned in previous reviews; you can see them in the file tab.

when a PR is ready for review, please change its name to [MRG] and ask for a review.

@keyabarve keyabarve changed the title [WIP] Creating an inflate function to copy abundances from one MinHash to another [MRG] Creating an inflate function to copy abundances from one MinHash to another Aug 6, 2021
@keyabarve
Copy link
Contributor Author

@ctb This is ready for review. I had one question though: You had mentioned this comment in the original issue: "Check if the code can apply to multigather as well." What should I be doing exactly in the multigather function? Would it be the same as what is currently there in the gather function?

@ctb
Copy link
Contributor

ctb commented Aug 7, 2021

@ctb This is ready for review. I had one question though: You had mentioned this comment in the original issue: "Check if the code can apply to multigather as well." What should I be doing exactly in the multigather function? Would it be the same as what is currently there in the gather function?

multigather currently flattens unmatched hatches - see the comment CTB: note, multigather does not save abundances in commands.py, near line 980.

At this point I suggest opening a new issue to add saving of abundances for unassigned hashes to multigather. Then, if you want to tackle the issue in the future, start by writing a test based on the gather tests for inflate, and then make sure that fails on multigather; then fix multigather's implementation to pass the test.

orig_abunds = from_mh.hashes
abunds = { h: orig_abunds[h] for h in hashes }

abund_query_mh = from_mh.copy_and_clear()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please do change this to abund_mh unless you have a reason otherwise.

@keyabarve
Copy link
Contributor Author

@ctb Please review.

Copy link
Contributor

@ctb ctb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

create a MinHash function to copy abundances from one MinHash to another
3 participants