Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Thanks for your work. But it's a bit strange here in the coreset sampling part #25

Open
mengxianghan123 opened this issue Sep 28, 2021 · 3 comments

Comments

@mengxianghan123
Copy link

for _ in range(N):
if self.already_selected is None:
# Initialize centers with a randomly selected datapoint
ind = np.random.choice(np.arange(self.n_obs))
else:
ind = np.argmax(self.min_distances)
# New examples should not be in already selected since those points
# should have min_distance of zero to a cluster center.
assert ind not in already_selected
self.update_distances([ind], only_new=True, reset_dist=False)
new_batch.append(ind)
print('Maximum distance from cluster centers is %0.2f'
% max(self.min_distances))
self.already_selected = already_selected

I think the wanted logic here should be initialing a centre point by random choice (the first if-branch) and all the entire centre points are chosen by argmax the distance (the second if-branch).
But for the "self.already_selected" is initialized as an empty list in line 49(not None), so we'll never get into the first if-branch. As for the centre point initializing process, the expression "np.argmax(self.min_distances)" will return 0 for "np.argmax(None)" will return 0.
So, as a result, the program selects index-0-feature as algorithm initialization everytime instead of random selecting one as a common practice.

@JefferyChiang
Copy link

I agree with your point. I think the first if-branch should be modify into "if not self.already_selected: " for more accurate to the kcenter_greedy algorithm. Have you modify the code and test the difference ?

@mengxianghan123
Copy link
Author

mengxianghan123 commented Oct 9, 2021

Yeah I modifyed the code like the following:

for _ in range(N):
            if not self.already_selected:
                # Initialize centers with a randomly selected datapoint
                ind = np.random.choice(np.arange(self.n_obs))
            else:
                ind = np.argmax(self.min_distances)
           
            assert ind not in self.already_selected

            self.update_distances([ind], only_new=True, reset_dist=False)
            new_batch.append(ind)
            print("Maximum distance from cluster centers is %0.2f" % max(self.min_distances))

            self.already_selected.append(ind)

I didn't test the difference with thorough experiments.
I conjectured that there will not be significant difference between the two implementation but our modification is more up to standard and more consistent with the intention of the algorithm

@HoseinHashemi
Copy link

This point makes total sense. I also modified the code but haven't thoroughly tested.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants