-
Notifications
You must be signed in to change notification settings - Fork 197
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
IVF-PQ tutorial notebook #1544
IVF-PQ tutorial notebook #1544
Conversation
Let's dust this off for the coming release. I think the only thing missing is the last section about the parameters for index building. Since we will have an IVF-Flat notebook that discusses clustering parameters (n-clusters, trainset ratio), we can focus here on additional parameters for the product quantization. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Artem for the notebook! This looks great, a really detailed discussion of IVF-PQ search and training. While this content is great for the deep dive blog that we are planning, I am wondering whether this is the right level of detail for the example notebook. I feel that there would be a value in making it more concise. Let's wait what the team says.
} | ||
], | ||
"source": [ | ||
"bench_k = np.exp2(np.arange(10)).astype(np.int32)\n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure if we need a specific benchmark section here on k, It is good to point out in the text, that the number of neighbors are controlled by k, but actually benchmarking that could be postponed to the blog.
Check out this pull request on See visual diffs & provide feedback on Jupyter Notebooks. Powered by ReviewNB |
@@ -0,0 +1,1226 @@ | |||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Most notebooks that we publish for RAPIDS have longer descriptions before the first cell. I think this is a great lead-in sentence but can you please provide more details about what the notebook is going to do? Can you move the links and descriptions for the datasets up here as well so it's more immediately obvious where they came from and that the notebook is going to use them?
Reply via ReviewNB
@@ -0,0 +1,1226 @@ | |||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Line #2. def show_properties(obj):
This is really nice. I almost wonder if this should eventually be added to pylibraft itself (and for the corresponding __str__()
methods be overloaded to print it.
Reply via ReviewNB
@@ -0,0 +1,1226 @@ | |||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"This is useful when in conjugation with add_data_on_build = True"
I think "This can also be useful when used in conjunction with...". Or I would suggest just removing it all together. The add_data_on_build is surely a convenience, but just because it's set to false doesn't mean you can assume anything more about how many vectors are ultimately going to be added to the index.
Reply via ReviewNB
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This one is a funny typo :)
Regarding the usefulness: I meant to say that if add_data_on_build = False
, then kmeans_trainset_fraction
doesn't make much sense: a user can just pass a smaller dataset to the same effect.
I've added a small note to clarify this.
@achirkin ReviewNB didn't allow me to submit a message w/ the review. It's really a delight reading through the tutorial. It's very in-depth and I think it's going to be a great resource for users. I was able to make my way through most of it, but not all of it. I'm going to try and read through the rest next week. |
Co-authored-by: Tamas Bela Feher <tfeher@nvidia.com>
@achirkin minor feedback from me at this point. End-to-end, this really is a great tutorial and I'm actually thinking we might want to consider keeping this as it is even after the blog is published (and maybe providing an additional notebook with more concise and simple usage examples for the blog?) I think developers and users alike can benefit from this notebook as it gives a pretty thorough but readable overview of IVF-PQ. Thanks again for creating this! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Artem! I think this notebook is good to go for 23.08 and we can revisit some portions for 23.10. Overall I think it's a great deep dive into ivf pq.
Add a new folder
notebooks
and a notebook with an example of tweaking ivf-pq using a dataset from ann-benchmarks.com