NGT performance #29

Open · 2 of 3 tasks
VarIr opened this issue Sep 9, 2019 · 7 comments
Assignees: VarIr
Labels: enhancement (New feature or request)
VarIr (Owner) commented Sep 9, 2019

Approx. neighbor search with ngtpy can be accelerated (see the baseline sketch after this list):

  • Enable AVX on MacOS (temporarily disabled due to an upstream bug in NGT; it is already enabled on Linux).
  • Use NGT's optimization step (until then, the method is actually (P)ANNG, not ONNG, I assume). Currently, this seems to be possible only via the command-line tools, not via the Python API.
  • Set good default parameters for ONNG.
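For context, a minimal sketch of the current baseline path in ngtpy, assuming the usual `ngtpy.create`/`ngtpy.Index` API; the index path, dimension, and data here are placeholders:

```python
import numpy as np
import ngtpy

# Placeholder data: 1000 vectors of dimension 128
dim = 128
X = np.random.rand(1000, dim).astype(np.float32)

# Build a plain ANNG index -- this is the unoptimized graph the bullets
# above talk about turning into ONNG.
ngtpy.create(path='anng_index', dimension=dim, distance_type='L2')
index = ngtpy.Index('anng_index')
index.batch_insert(X)
index.save()

# Search returns (object_id, distance) pairs for the k nearest neighbors.
results = index.search(X[0], size=10)
print(results[:3])
```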
VarIr added the enhancement label and self-assigned this issue on Sep 9, 2019
VarIr (Owner, Author) commented Sep 23, 2019

It seems ONNG can be enabled in ngtpy, but it is currently not documented. However, there is an example here: yahoojapan/NGT#30
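For reference, a sketch along the lines of that example (hedged: `ngtpy.Optimizer` and its parameter names follow the linked issue and may differ across NGT versions):

```python
import ngtpy

# Convert an existing ANNG index into an ONNG index (sketch based on
# yahoojapan/NGT#30; the parameter values are the ones shown there, not
# tuned defaults for this project).
optimizer = ngtpy.Optimizer()
optimizer.set(num_of_outgoings=10, num_of_incomings=120)
optimizer.execute('anng_index', 'onng_index')  # input path, output path

# Search the optimized index as usual
onng = ngtpy.Index('onng_index')
```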

VarIr (Owner, Author) commented Sep 26, 2019

New NGT release 1.7.10 should fix this: https://github.com/yahoojapan/NGT/releases/tag/v1.7.10

VarIr (Owner, Author) commented Nov 11, 2019

NGT 1.8.0 brought documentation for ONNG. It is already activated here, but index building is extremely slow due to the difficult parameterization. Need to check.
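To make the parameterization issue concrete, a sketch of the creation-time knobs that dominate build time (assumption: these keyword names come from ngtpy's `create()`, and the large edge size mirrors common ONNG setups rather than a recommended default):

```python
import ngtpy

# ANNG construction cost grows with the creation edge size; ONNG-style
# setups often raise it from the default (around 10) to ~100, which is a
# plausible reason index building becomes so slow.
ngtpy.create(
    path='anng_index',
    dimension=768,
    edge_size_for_creation=100,  # larger -> denser graph, slower build
    edge_size_for_search=120,
)
```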

jaytimbadia commented
> Approx. neighbor search with ngtpy can be accelerated:
>
> • Enable AVX on MacOS (temporarily disabled due to an upstream bug in NGT; it is already enabled on Linux).
> • Use NGT's optimization step (until then, the method is actually (P)ANNG, not ONNG, I assume). Currently, this seems to be possible only via the command-line tools, not via the Python API.
> • Set good default parameters for ONNG.

Hi,
Seems like really good work.

I am using BERT to find semantic similarity with cosine distance, but the high dimensionality may be a problem.
So can I use hubness reduction here? I mean, will it make the BERT embeddings any better?

Thank you!

VarIr (Owner, Author) commented Jan 24, 2021

Thanks for your interest. That's something I've been thinking about, but never found time to actually check.

BERT embeddings are typically high-dimensional, so hubness might play a role.
You could first estimate the intrinsic dimension of these embeddings (because this actually drives hubness), e.g. with this method. If it is much lower than the embedding dimension, it's unlikely that hubness reduction leads to improvements.
Alternatively, you could directly compare performance in your tasks with and without hubness reduction.
If there's a performance improvement, I'd be curious to know.
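In case it helps, a rough sketch of that comparison with scikit-hubness (hedged: the `Hubness`/`KNeighborsClassifier` usage here follows the 0.2-era API, and the data below is a stand-in for your BERT embeddings and labels):

```python
import numpy as np
from skhubness import Hubness
from skhubness.neighbors import KNeighborsClassifier

# Stand-ins for BERT embeddings and task labels
X = np.random.rand(500, 768)
y = np.random.randint(0, 2, size=500)

# 1) Measure hubness of the embedding space (skewness of the k-occurrence)
hub = Hubness(k=10, metric='cosine')
hub.fit(X)
print(f'k-skewness: {hub.score():.2f}')  # larger => more hubness

# 2) Compare a downstream task with and without hubness reduction
for hubness in (None, 'mutual_proximity'):
    knn = KNeighborsClassifier(n_neighbors=5, metric='cosine', hubness=hubness)
    knn.fit(X, y)
    print(hubness, knn.score(X, y))
```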

jaytimbadia commented
> Thanks for your interest. That's something I've been thinking about, but never found time to actually check.
>
> BERT embeddings are typically high-dimensional, so hubness might play a role.
> You could first estimate the intrinsic dimension of these embeddings (because this actually drives hubness), e.g. with this method. If it is much lower than the embedding dimension, it's unlikely that hubness reduction leads to improvements.
> Alternatively, you could directly compare performance in your tasks with and without hubness reduction.
> If there's a performance improvement, I'd be curious to know.

Thank you so much for the reply.
I calculated the intrinsic dimension for BERT and it comes out to 18, much lower than I expected.
Anyway, one question: can we use intrinsic dimensionality to check the quality of the embeddings we generate?
For example: BERT embeddings of shape (100, 768) have a pretty low intrinsic dimension, while a random matrix of shape (100, 768) I tried had around 155. So does this mean BERT is quite well trained?

If yes, we could use this: whenever we generate embeddings, we could check their intrinsic dimension; the lower it is, the fewer constraints the embeddings have and the easier they are to fine-tune further, right?

I would love to know your thoughts!
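For anyone reproducing this kind of estimate, one possible way to get such a number (assumption: the skdim package and its MLE estimator, which is not necessarily the method linked above):

```python
import numpy as np
import skdim

# Stand-ins: substitute real BERT embeddings for `bert_like`
rng = np.random.default_rng(0)
bert_like = rng.random((100, 768))
random_mat = rng.random((100, 768))

# Levina-Bickel maximum-likelihood intrinsic dimension estimate
for name, X in [('bert-like', bert_like), ('random', random_mat)]:
    est = skdim.id.MLE().fit(X)
    print(name, round(float(est.dimension_), 1))
```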

VarIr (Owner, Author) commented Jan 24, 2021

18 isn't particularly high, but we've seen datasets where this came with high hubness (see e.g. pp. 2885-2886 of this previous paper).
I am not aware of research directly linking intrinsic dimension to the quality of embeddings (however that would be defined, anyway). Interesting research questions you pose there :)
