Duplicate vectors in LAION 100M dataset #357
Comments
@zhuwenxing could you take a look at this issue? |
@greenhal
reproduce code
|
@greenhal
|
Thank you for confirming that this is a known characteristic of this dataset. These duplicates affect the ability to measure recall correctly when one of the test vectors or its neighbors has more duplicates than the requested k value. For example, the vector with id 795579 (query id 5 in test.parquet) has 114 exact matches. The k100 results from this query cannot be guaranteed to match the k100 from the ground truth file, even though they are correct. In this dataset, there are 9 queries that have more than 100 exact matches and 21 that have more than 21. I would expect this to be higher with the larger LAION datasets. This issue results in an inaccurate recall measurement when using this dataset. We propose that the distance be included in the ground truth file and that, if there is a tie at the end of the ground truth set, the set be extended to include all ties; the results are then compared to the extended ground truth. (This is how big-ann-benchmarks calculates recall.) Using the example above, for query 5 & k100, the ground truth passed to |
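To put a number on the problem described above, here is a minimal sketch of the worst case for a query whose nearest candidates are all tied. It assumes every tied candidate is at the same distance (as with exact matches) and that the ground truth and the search result each pick an arbitrary k of them; the function name `worst_case_recall` is hypothetical and not part of the benchmark code.

```python
def worst_case_recall(num_ties: int, k: int) -> float:
    """Lowest recall two equally valid top-k sets can show against each other
    when both are drawn from `num_ties` candidates at the same distance."""
    if num_ties <= k:
        # Both sets must contain every tied candidate, so they cannot disagree.
        return 1.0
    # Two k-subsets of a num_ties-element pool share at least 2k - num_ties items.
    min_overlap = max(0, 2 * k - num_ties)
    return min_overlap / k


# Query 5 from the comment above: 114 exact matches, k = 100.
print(worst_case_recall(114, 100))  # 0.86
```

Under these assumptions a fully correct result for that query could be scored as low as 0.86 against a ground truth that stores only 100 ids.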
@greenhal Excellent suggestions! Including all We apologize that we did not notice the "duplicate vectors" feature when we selected the dataset previously, and we did not account for this in the design. As a result, we did not store the distance information when preparing the groundtruth file. It will require some time to re-prepare the groundtruth. |
We were able to add distances to the ground truth file and then added the code to calculate the recall based on distance ties: `if gt_has_distance: recalls.append(calc_recall(self.k, gt[:gt_length], results))` |
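The tie-aware recall discussed in this thread can be sketched roughly as follows. This is an illustration under stated assumptions, not the code that was actually merged: the helper names `extended_gt_length` and `recall_with_ties`, the `tol` parameter, and the assumption that ground-truth ids and distances are stored as parallel lists sorted by ascending distance are all hypothetical.

```python
from typing import Sequence


def extended_gt_length(gt_distances: Sequence[float], k: int, tol: float = 0.0) -> int:
    """Return k, extended to cover every entry tied with the k-th distance.

    Assumes gt_distances is sorted in ascending order.
    """
    if len(gt_distances) <= k:
        return len(gt_distances)
    kth_distance = gt_distances[k - 1]
    length = k
    # Walk past the k-th entry while the distance still equals the k-th distance.
    while length < len(gt_distances) and abs(gt_distances[length] - kth_distance) <= tol:
        length += 1
    return length


def recall_with_ties(result_ids: Sequence[int], gt_ids: Sequence[int],
                     gt_distances: Sequence[float], k: int, tol: float = 0.0) -> float:
    """Recall@k where any id tied with the k-th ground-truth distance counts as a hit."""
    gt_length = extended_gt_length(gt_distances, k, tol)
    allowed = set(gt_ids[:gt_length])
    hits = len(set(result_ids[:k]) & allowed)
    return hits / k


# Four candidates tied at distance 0.0, one further away; k = 2.
gt_ids = [1, 2, 3, 4, 5]
gt_distances = [0.0, 0.0, 0.0, 0.0, 1.0]
print(recall_with_ties([3, 4], gt_ids, gt_distances, k=2))  # 1.0
print(recall_with_ties([3, 5], gt_ids, gt_distances, k=2))  # 0.5
```

Note that this only works if the ground truth file stores more than k neighbors per query, so that ties beyond position k are visible.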
The LAION 100M dataset used for benchmarking has a large number of duplicate vectors, which affects the recall results of several test queries and makes it impossible to achieve 0.99 recall. Whenever a query has multiple results with the same distance, the results are not returned in a defined order. The algorithm has no way to order them because all the ids have the same distance, but the accuracy calculation expects the ids to be in the same order.
Is this expected in this dataset?
For example, in the first data file the vector below appears 947 times.
Vector id 2783
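As a rough illustration of how duplicate counts like the one above can be obtained, the sketch below counts exact duplicates in an in-memory collection of vectors. It is not the original reproduce code; the function name `count_duplicates` is hypothetical, and it assumes the vectors have already been loaded as sequences of floats and that "duplicate" means element-wise equality.

```python
from collections import Counter
from typing import Dict, Iterable, Sequence, Tuple


def count_duplicates(vectors: Iterable[Sequence[float]]) -> Dict[Tuple[float, ...], int]:
    """Map each vector that occurs more than once to its number of occurrences."""
    # Tuples are hashable, so identical vectors collapse onto the same key.
    counts = Counter(tuple(v) for v in vectors)
    return {vec: n for vec, n in counts.items() if n > 1}


sample = [[1.0, 2.0], [1.0, 2.0], [3.0, 4.0]]
print(count_duplicates(sample))  # {(1.0, 2.0): 2}
```

For a dataset of this size, a pass per data file or a hash of each vector's raw bytes would likely be needed instead of holding every tuple in memory.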