Thanks for your work. But it's a bit strange here in the coreset sampling part #25

mengxianghan123 · 2021-09-28T13:28:43Z

PatchCore_anomaly_detection/sampling_methods/kcenter_greedy.py

Lines 104 to 120 in 59b67d9

    
    for _ in range(N):
      if self.already_selected is None:
        # Initialize centers with a randomly selected datapoint
        ind = np.random.choice(np.arange(self.n_obs))
      else:
        ind = np.argmax(self.min_distances)
      # New examples should not be in already selected since those points
      # should have min_distance of zero to a cluster center.
      assert ind not in already_selected

      self.update_distances([ind], only_new=True, reset_dist=False)
      new_batch.append(ind)
    print('Maximum distance from cluster centers is %0.2f'
          % max(self.min_distances))

    self.already_selected = already_selected

I think the intended logic is to initialize the first centre by random choice (the first if-branch) and to choose all remaining centres by taking the argmax of the minimum distances (the else-branch).
However, "self.already_selected" is initialized as an empty list on line 49 (not None), so the first if-branch is never taken. For the initial centre, the expression "np.argmax(self.min_distances)" returns 0, because "np.argmax(None)" returns 0.
As a result, the program always selects the feature at index 0 as the initial centre instead of selecting one at random, which is the common practice.
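
A minimal sketch of why the first branch is unreachable, assuming "already_selected" starts out as an empty list as described above. The variable names are illustrative only and are not taken from the repository.

```python
# Assumed initial state: an empty list rather than None.
already_selected = []

# The original check: an empty list is not None, so this is False
# and the random-initialization branch is skipped.
takes_random_branch_original = already_selected is None

# A truthiness check: an empty list is falsy, so this is True
# on the first pass and False once an index has been appended.
takes_random_branch_fixed = not already_selected

print(takes_random_branch_original)
print(takes_random_branch_fixed)
```

A truthiness check also covers the case where the attribute really is None, since None is falsy as well.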

JefferyChiang · 2021-10-04T01:31:12Z

I agree with your point. I think the first if-branch should be changed to "if not self.already_selected:" to follow the kcenter_greedy algorithm more accurately. Have you modified the code and tested the difference?

mengxianghan123 · 2021-10-09T01:45:13Z

Yeah, I modified the code as follows:

    for _ in range(N):
        if not self.already_selected:
            # Initialize centers with a randomly selected datapoint
            ind = np.random.choice(np.arange(self.n_obs))
        else:
            ind = np.argmax(self.min_distances)

        assert ind not in self.already_selected

        self.update_distances([ind], only_new=True, reset_dist=False)
        new_batch.append(ind)
        print("Maximum distance from cluster centers is %0.2f" % max(self.min_distances))

        self.already_selected.append(ind)

I didn't test the difference with thorough experiments.
I conjecture that there will be no significant difference between the two implementations, but our modification is more standard and more consistent with the intention of the algorithm.
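
For anyone who wants to experiment outside the repository, here is a rough standalone sketch of greedy k-center selection with a random first centre. It is not the repository's class: the function name "kcenter_greedy", its parameters, and the use of plain Euclidean distance are all assumptions made for illustration.

```python
import numpy as np


def kcenter_greedy(features, n_select, seed=None):
    """Hypothetical standalone sketch; not the repository's implementation."""
    rng = np.random.default_rng(seed)
    n_obs = features.shape[0]
    selected = []
    min_distances = None

    for _ in range(n_select):
        if not selected:
            # First centre: pick a random datapoint.
            ind = int(rng.integers(n_obs))
        else:
            # Later centres: the point farthest from its nearest centre.
            ind = int(np.argmax(min_distances))

        # Holds as long as n_select does not exceed the number of distinct points.
        assert ind not in selected

        # Euclidean distance from every point to the new centre (assumed metric).
        dist = np.linalg.norm(features - features[ind], axis=1)
        if min_distances is None:
            min_distances = dist
        else:
            min_distances = np.minimum(min_distances, dist)
        selected.append(ind)

    # Return the chosen indices and the largest remaining min-distance.
    return selected, float(min_distances.max())


# Example usage on synthetic features (random data, for illustration only).
features = np.random.default_rng(0).normal(size=(50, 8))
selected, radius = kcenter_greedy(features, n_select=10, seed=0)
```

Running it with several different seeds and comparing the returned radius would be one quick way to check whether the choice of first centre matters much in practice.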

HoseinHashemi · 2022-01-05T05:45:37Z

This point makes total sense. I also modified the code but haven't tested it thoroughly.
