Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DisSim implementation correct? #109

Open
davnn opened this issue Mar 9, 2023 · 5 comments
Open

DisSim implementation correct? #109

davnn opened this issue Mar 9, 2023 · 5 comments

Comments

@davnn
Copy link

davnn commented Mar 9, 2023

Hi! Thanks for you work with the package. I'm currently trying to understand the implementation of DisSim, which seems to differ from the paper. If idx and dist contain the indices and distances to the n-1 neighbors, equation 8 from the paper would be something like:

def dissim(x, idx, dist):
    centroids = x[idx].mean(axis=1)
    dist_x, dist_y = x - centroids, x[idx] - centroids[idx]
    norm_dist_x = np.einsum("ij,ij -> i", dist_x, dist_x)[:,np.newaxis]
    norm_dist_y = np.einsum("ijk,ijk -> ij", dist_y, dist_y)
    return dist - np.sqrt(norm_dist_x) - np.sqrt(norm_dist_y)

The implemented version returns something like np.sqrt(dist - norm_dist_x - norm_dist_y), where the norms are generated slightly differently. Is this intended and only a documentation problem?

@davnn davnn changed the title DisSim Implementation correct? DisSim implementation correct? Mar 9, 2023
@VarIr
Copy link
Owner

VarIr commented Mar 9, 2023

Hi David, thanks for your interest in the package. I am not very familiar with the einsum notation, so I cannot comment on these parts. What differences do you see in how the norms are calculated? Note, btw, that the three terms in (8) are squared Euclidean distance, so I assume there should be no np.sqrts. (Also the final sqrt in our implementation is a non-default option of let's say questionable usefulness).

@davnn
Copy link
Author

davnn commented Mar 10, 2023

Sorry, I posted the Euclidean distance version as that was the version I am interested in, but you are right, the default should be squared (and it is!).

I am mainly concerned about the squared=False option, because I would expect that Euclidean distances are used if squared=False, but instead np.sqrt is applied over the whole expression leading to a questionable result.
Regarding the norm calculation, it's quite tricky to identify where the calculation begins to differ, but it does differ from my proposal of which I am also not certain if it is correct. A small nitpick is that the reference implementation (I only checked that one) mutates the input dist on transform, but I can send a PR if you would like to change that.

@VarIr
Copy link
Owner

VarIr commented Mar 10, 2023

I wonder if there's really something like a "Euclidean distance version". If there is, I assume it should be something like

np.sqrt(squared_dist - sq_norm_dist_x - sq_norm_dist_y)

instead of

dist - np.sqrt(norm_dist_x) - np.sqrt(norm_dist_y)

In the end, in genereal sqrt(a - b) != sqrt(a) - sqrt(b). Now I really do not remember the derivation of DisSim in detail, but I think there was a reason, why Hara et al. used squared Eucl distances specifically. Switching to Euclidean distances before subtracting the local centroids could have an effect on how effectively hubness is reduced. (For details on that, the authors of that paper might be of more help).

Could you point me to where to find this squared=False, ideally a link to the exact line? I seem to only find this, where it's True.

PRs always welcome :)

@davnn
Copy link
Author

davnn commented Mar 10, 2023

The authors likely use squared Euclidean distance because it's minimally cheaper / easier to compute, otherwise I don't see an argument. In my case, the input distance matrix is Euclidean, thus it makes more sense to treat the centroid distances as Euclidean as well. Your assumption is also what is implemented in your library, but I believe the assumption might be wrong.

dist - np.sqrt(norm_dist_x) - np.sqrt(norm_dist_y)

should be right, if dist is Euclidean, because then you subtract Euclidean distances from a Euclidean distance. In your reference algorithm, you subtract squared distances and np.sqrt the final result, yielding something not really interpretable as a Euclidean, nor squared Euclidean.

I meant the squared=False option in https://github.com/VarIr/scikit-hubness/blob/main/skhubness/reduction/tests/reference_algorithms.py#L379, which performs sqrt(a - b), but probably should perform sqrt(a) - sqrt(b), because

In the end, in genereal sqrt(a - b) != sqrt(a) - sqrt(b).

@VarIr
Copy link
Owner

VarIr commented Mar 13, 2023

Interesting. It's been a while since I last read the paper and, unfortunately, I don't really have time to look into that in much detail. If the squared Euclidean distances are indeed just a matter of a tiny performance boost, I totally follow your argument. Also, we're on the same page in terms of if the implemented formula can be interpreted as "Euclidean" distance. As in: Not really.

In this link the default also is squared=True, isn't it?

In any case, have you tried to evaluate your implementation? Say if it reduces hubness better or leads to higher accuracy in some task?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants