Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Locate nearest clusters for given data #214

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

pulquero
Copy link

Fixes #208.

@codecov
Copy link

codecov bot commented Feb 14, 2021

Codecov Report

Merging #214 (bf9d55e) into master (56dbd66) will increase coverage by 0.19%.
The diff coverage is 86.04%.

❗ Current head bf9d55e differs from pull request most recent head 0bd7b00. Consider uploading reports for the commit 0bd7b00 to get more accurate results
Impacted file tree graph

@@            Coverage Diff             @@
##           master     #214      +/-   ##
==========================================
+ Coverage   79.62%   79.82%   +0.19%     
==========================================
  Files          11       11              
  Lines         854      892      +38     
  Branches      186      199      +13     
==========================================
+ Hits          680      712      +32     
- Misses        141      144       +3     
- Partials       33       36       +3     
Impacted Files Coverage Δ
kmapper/kmapper.py 88.67% <82.35%> (-0.79%) ⬇️
kmapper/cover.py 88.88% <100.00%> (+0.46%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 56dbd66...0bd7b00. Read the comment docs.

@sauln
Copy link
Member

sauln commented Mar 15, 2021

@pulquero , you have 2 PRs (this and #213). Should we review both, or does this one subsume the other?

@pulquero
Copy link
Author

This subsumes both, so you could just review this one. Or if you want to do it in parts, start with the other.

Copy link
Collaborator

@deargle deargle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a mini example to /examples like a mini vignette?

Edit: example here: https://github.com/sphinx-gallery/sphinx-gallery/blob/master/examples/no_output/just_code.py

----------
newdata : Numpy array
New dataset. Accepts both 1-D and 2-D array.
nodes : dict
Copy link
Collaborator

@deargle deargle Mar 18, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In theory, this could be all of the nodes from a graph, right? It would be horribly inefficient because there's no way that a point would be close to a cluster that wasn't even in its open set, but still.

If I'm thinking correctly, if it's just for efficiency, then nearest_clusters could itself receive the full graph, and itself call clusters_from_cover before looping over the nodes to get cluster_members.

Am I thinking correctly?

Also, I lean towards standardizing on replacing clusters with nodes throughout.

And eventually, but not now, replacing cube with openset throughout;

if so, that would look something like:

def nearest_nodes(self, newdata, graph, cover, data, nn):
   cube_ids = cover.find(newdata)
   nodes = self.find_nodes(graph, cube_ids)
   
   # then the rest unchanged...
   for cluster_id, cluster_members in nodes.items():
        cluster_data = data[cluster_members]
        nn_data.append(cluster_data)
        nn_cluster_ids.append([cluster_id]*len(cluster_data))
   nn_data = np.vstack(nn_data)
   nn_cluster_ids = np.concatenate(nn_cluster_ids)
   nn.fit(nn_data)
   nn_ids = nn.kneighbors(newdata, return_distance=False)
   return np.unique(nn_cluster_ids[nn_ids])

@@ -827,6 +827,63 @@ def data_from_cluster_id(self, cluster_id, graph, data):
else:
return np.array([])

def clusters_from_cover(self, cube_ids, graph):
"""Returns the clusters and their members from the subset of the cover spanned by the given cube_ids
Copy link
Collaborator

@deargle deargle Mar 18, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking out loud. I'm trying to think of another name. Kmapper has a separate Cover class, so calling this clusters_from_cover suggests to me that a cover should be passed, but it isn't.

But a Cover doesn't have clusters, so I don't think this should go in the Cover class.

If graph were a class, this would go in there as graph.find_clusters_by_cube_ids(cube_ids) or something.

Sort-of following the pattern from the last PR, maybe we rename this to find_clusters find_nodes

@deargle
Copy link
Collaborator

deargle commented Mar 18, 2021

In general in the above, I argue for converting nearest_nodes (formerly nearest_clusters) into a one-stop function that calls the other two created by this and the previous pr.

@deargle
Copy link
Collaborator

deargle commented Mar 18, 2021

Side note, @sauln I didn't realize that sphinx-gallery would even render files that don't plot_ anything, see here. I'll go back and link to the rendered version of make-circles in our docs, it currently links out to github.

@pulquero pulquero force-pushed the closest_clusters branch 2 times, most recently from 458e094 to 4ae98ea Compare March 19, 2021 19:50
@pulquero
Copy link
Author

Hit a bit of a snag, but we should be good now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Make it possible to find the closest cluster/node for new data
3 participants