Duplicate vectors in LAION 100M dataset #357

Open
greenhal opened this issue Aug 6, 2024 · 6 comments


@greenhal
Contributor

greenhal commented Aug 6, 2024

The LAION 100M dataset used for benchmarking has a large number of duplicate vectors, which is impacting the recall results of several test queries and making it impossible to achieve 0.99 recall. Whenever a query has multiple results at the same distance, those results are returned in no particular order: the algorithm has no way to order ids that all share the same distance, while the accuracy calculation expects the ids to be in the same order.

Is this expected in this dataset?

For example, in the first data file the vector below appears 947 times.

Vector id 2783
[0.012306213,0.057373047,-0.019821167,-0.026184082,0.0066490173,0.04647827,0.042755127,-0.023773193,-0.0077285767,-0.047546387,-0.050048828,-0.023651123,-0.0038471222,-0.004505157
5,0.010421753,0.007423401,0.0045814514,0.031707764,0.013145447,0.011917114,-0.0022068024,0.01878357,-0.02078247,0.04095459,-0.049743652,-0.0317688,0.03488159,-0.061065674,-0.004287
7197,-0.018661499,0.016220093,0.008796692,-0.014205933,-0.012191772,0.030303955,0.027328491,-0.0068511963,-0.027511597,0.0037670135,-0.045440674,0.020370483,0.033050537,-0.03341674
8,0.0018873215,-0.0009403229,-0.017028809,-0.006641388,0.00044631958,-0.01838684,0.0158844,-0.008522034,-0.026016235,0.029800415,0.0013132095,-0.005104065,0.009666443,0.021850586,0
.003774643,0.02368164,0.0077285767,-0.026123047,0.049987793,-0.052703857,0.003917694,0.0013389587,0.034698486,0.007293701,-0.02368164,-0.015586853,0.0009965897,0.016204834,-0.00793
457,-0.016479492,-0.007396698,0.021011353,0.00579834,-0.014205933,-0.0049552917,0.0132751465,-0.006816864,0.0014657974,-0.005645752,-0.002840042,-0.01576233,0.028747559,-0.01071167
,-0.020584106,0.016525269,0.037322998,0.010879517,0.051452637,-0.0039634705,0.027038574,0.035980225,-0.019592285,0.0049591064,0.038482666,-0.014770508,0.013191223,-0.0012178421,0.0
16052246,0.012680054,-0.0064888,-0.03427124,0.011932373,-0.009994507,0.021087646,0.009132385,-0.010070801,0.030593872,0.040618896,-0.0041618347,0.02003479,-0.044952393,0.012886047,
0.015731812,-0.0519104,0.023239136,-0.034820557,-0.01838684,-0.00044584274,-0.018249512,-0.016998291,0.005016327,0.001036644,-0.04425049,-0.03793335,0.013191223,-0.011169434,-0.022
842407,0.0011997223,-0.018005371,-0.008682251,-0.01940918,-0.018051147,-0.022109985,-0.014701843,-0.01084137,0.019638062,-0.018096924,-0.01902771,0.004096985,-0.006832123,0.0361938
48,0.0073242188,-0.01953125,-0.026473999,0.0077323914,-0.0057525635,-0.038024902,-0.036010742,0.0005121231,0.023391724,0.04425049,-0.017196655,-0.02758789,-0.018249512,-0.02130127,
-0.009155273,8.940697e-07,0.0015888214,-0.008743286,0.025421143,0.011817932,0.03555298,-0.02029419,-0.03579712,0.016708374,0.017837524,0.003698349,0.01576233,0.014282227,-0.0218353
27,0.0020523071,0.0006170273,0.0018396378,8.6426735e-06,-0.027038574,-0.0050239563,-8.606911e-05,0.009429932,0.0017881393,-0.031311035,0.005607605,0.0014467239,-0.010643005,-0.0098
95325,0.002248764,-0.0029525757,0.0077323914,-0.005836487,0.0071258545,0.002538681,0.03186035,0.01121521,0.08294678,0.011924744,0.0003569126,0.007949829,-0.011756897,0.025878906,0.
035705566,0.007785797,0.051971436,-0.017990112,0.0036888123,0.00011360645,-0.032470703,0.0015163422,-0.0060768127,-0.011482239,-0.04638672,0.01134491,0.030914307,0.014732361,0.0097
42737,-0.015403748,0.050964355,-0.004535675,0.03479004,-0.019943237,0.011856079,-0.021728516,-0.00674057,-0.011444092,0.009803772,-0.01889038,0.007827759,0.019836426,-0.03253174,0.
008934021,-0.0087890625,-0.0109939575,0.006587982,0.017196655,0.003332138,-0.035003662,0.0077209473,-0.0037212372,0.012779236,-0.02104187,-0.019134521,-0.033447266,-0.0062294006,-0
.0037117004,0.04168701,0.006126404,0.017593384,0.009254456,-0.04284668,0.035461426,0.006046295,-0.012191772,-0.0362854,0.0068855286,-0.012832642,0.02279663,0.0035209656,0.000308752
06,-0.01927185,0.0025520325,-0.01576233,0.0118255615,0.02772522,-0.026275635,-0.019561768,-0.05130005,0.011161804,0.0053710938,0.01209259,-0.01537323,0.027954102,0.018936157,0.0002
1755695,-0.0022563934,0.019058228,-0.027191162,0.017425537,-0.016418457,-0.021560669,0.0047187805,-0.07556152,0.0024051666,-0.036865234,-0.004924774,-0.012382507,-0.0040512085,0.03
0929565,0.09100342,-0.015640259,-0.010482788,-0.026565552,0.008346558,-0.047424316,-0.019836426,-0.004760742,-0.0061950684,0.0018606186,0.020523071,-0.016586304,-0.020874023,-0.012
939453,-0.023910522,-0.005607605,-0.035949707,-0.030670166,-0.017578125,0.013870239,0.005302429,-0.008956909,0.057556152,-0.0034542084,-0.006969452,-0.018005371,-0.0047721863,4.202
1275e-05,-0.07434082,-0.030670166,-0.011924744,-0.002483368,0.0234375,-0.02557373,0.03250122,-0.0047721863,0.0071411133,0.010894775,-0.029754639,-0.01512146,0.03930664,-0.027954102
,-0.026519775,-0.014083862,0.008216858,-0.0063285828,0.026977539,0.049682617,0.034057617,-0.020339966,0.028137207,0.023254395,0.013954163,-0.03982544,-0.0035705566,0.007850647,-0.0
07259369,0.014450073,-0.03857422,-0.032318115,0.0011663437,-0.03152466,0.0073394775,0.02116394,0.018692017,-0.0362854,0.015335083,0.03375244,0.005809784,-0.0066070557,0.025650024,0
.025878906,-0.010520935,-0.012710571,0.06011963,-0.08239746,0.0073394775,-0.045440674,0.039916992,-0.06750488,0.024246216,-0.022903442,-0.03302002,-0.01361084,-0.03253174,0.0102462
77,-0.003566742,-0.005241394,-0.011703491,-0.005897522,-0.013023376,0.021072388,-0.04067993,0.008407593,0.039367676,0.013198853,0.00045251846,-0.028961182,0.0060653687,-0.0178833,0
.011566162,0.0045318604,0.025650024,-0.034698486,0.064208984,-0.01499176,-0.008773804,-0.0022735596,-0.049865723,0.00096178055,-0.005344391,-0.019729614,-0.019332886,0.025360107,0.
011024475,-0.021820068,-0.1763916,0.011001587,0.012046814,-0.012481689,-0.024795532,-0.007411957,-0.014190674,0.016235352,-0.026901245,-0.4633789,0.0075149536,0.010391235,0.0056991
577,0.009162903,-0.010063171,-0.0126953125,-0.007575989,0.012054443,-0.009864807,-0.0070343018,-0.016464233,0.02268982,0.022277832,-0.030700684,0.006214142,0.021118164,0.0048942566
,0.003452301,-0.0052833557,-0.01977539,-0.03817749,0.07336426,0.014083862,-0.024276733,-0.032043457,-0.0066070557,0.07055664,-0.013427734,0.041870117,0.015686035,0.0018825531,0.001
666069,0.052581787,0.0014324188,0.00856781,0.0058898926,-0.0014448166,-0.027557373,-0.00166893,0.019592285,0.010444641,-0.015426636,0.00036096573,-0.0020751953,-0.05368042,-0.02940
3687,0.020339966,-0.0020713806,-0.046661377,-0.018157959,-0.06341553,0.010925293,0.01663208,0.040893555,0.011711121,-0.03463745,0.0059127808,0.014533997,0.01625061,0.019714355,-0.0
1033783,-0.008453369,0.028060913,-0.07714844,0.0368042,-0.009651184,0.03201294,0.018661499,-0.012329102,-0.00819397,0.00012022257,-0.024429321,-0.017181396,-0.0096206665,-0.0052604
675,-0.014251709,0.019989014,0.0085372925,-0.051757812,0.0021686554,-0.010070801,0.0002863407,-0.0017375946,0.029403687,0.004863739,-0.0067863464,0.008758545,0.035186768,-0.0806274
4,0.004497528,-0.005596161,0.00047302246,-0.000726223,0.00881958,-0.013473511,0.0030403137,-0.040252686,0.018203735,-0.02281189,-0.018600464,-0.02406311,-0.013916016,0.007648468,0.
00037908554,0.018844604,0.009429932,0.038604736,0.023788452,0.038360596,0.03213501,0.011116028,-0.015609741,-0.016174316,-0.001159668,-0.030380249,0.040618896,0.048980713,0.0034942
627,0.013458252,0.050079346,0.011390686,-0.021072388,0.022720337,0.006252289,-0.023376465,0.021835327,-0.0041275024,0.007411957,0.03994751,0.031402588,0.014862061,0.016433716,-0.00
4009247,0.0053749084,0.014587402,0.024612427,-0.0015525818,-0.008758545,-0.004470825,0.0004892349,0.0129776,-0.008026123,0.018463135,0.008125305,0.0065460205,0.025344849,0.02943420
4,0.041015625,-0.0019893646,0.05508423,-0.010513306,0.01524353,-0.034820557,-0.015464783,-0.00091409683,0.029464722,0.001502037,-0.038208008,0.036376953,-0.028503418,-0.014717102,-
0.015823364,0.02809143,0.018463135,0.022521973,-0.00027227402,0.008720398,-0.0009608269,-0.014015198,-0.018615723,0.04815674,0.021102905,0.012924194,-0.016525269,0.027618408,0.0162
5061,-0.039764404,-0.0078048706,0.015167236,-0.011001587,0.023834229,0.008392334,0.005558014,-0.0077056885,-0.024536133,-0.0029773712,-0.0014982224,0.0040626526,0.02444458,-0.01625
061,-0.009254456,-0.12109375,0.0076675415,0.0059661865,0.02835083,0.009048462,-0.02067566,0.008331299,-0.0020618439,0.0016307831,0.00029301643,-0.05319214,0.019500732,0.028640747,-
0.016677856,0.005077362,-0.008934021,-0.0058403015,-0.011734009,-0.01739502,0.028793335,0.004688263,0.020248413,0.008369446,-0.018493652,0.018859863,-0.0037250519,-0.001996994,0.01
612854,0.0011148453,0.0031719208,0.015625,-0.0038433075,0.03488159,-0.0068626404,0.00687027,0.005332947,0.005393982,0.039093018,0.013282776,0.50146484,0.003873825,0.008224487,-0.00
33073425,-0.0007162094,-0.025741577,-0.021484375,0.026000977,-0.010948181,-0.010101318,-0.0010461807,0.03945923,-0.0060653687,-0.006290436,-0.028213501,0.016799927,0.023254395,-0.0
3439331,-0.013519287,0.025939941,0.01134491,0.014915466,0.0023441315,0.016204834,0.009902954,0.08673096,-0.011199951,0.00868988,-0.03640747,-0.012962341,-0.032684326,-0.0040130615,
0.029769897,-0.013427734,0.032806396,0.024795532,-0.01940918,-0.030883789,-0.010444641,-0.0068588257,0.030960083,0.0003476143,-0.00724411,0.009155273,0.013420105,-0.011238098,0.004
180908,0.00894928,0.21569824,-0.012619019,-0.074645996,-0.012817383,-0.021438599,-0.012664795,-0.0011129379,0.001750946,-0.015083313,-0.04547119,0.022232056,0.003780365,-0.00875854
5,-0.0021038055,0.05987549,-0.00010198355,0.033294678,-0.009429932,0.036621094,-0.035461426,-0.016906738,0.033447266,0.008796692,-0.020584106,0.0026664734,-0.019332886,-0.018234253
,-0.0068511963,-0.040100098,0.017227173,-0.03567505,-0.026153564,0.030014038,-0.0501709,0.027038574,0.0154953,0.003200531,0.0046424866,0.0053863525,-0.0062065125,-0.017791748,-0.00
7080078,0.00674057,-0.062408447,0.015930176,0.019821167,-0.012832642,-0.006629944,0.022155762,-0.00415802,-0.017288208,0.005580902,-0.01210022,0.018692017,0.020324707,0.007850647,0
.02456665,-0.008148193,0.014427185,-0.0029525757,-0.010925293,-0.011009216,0.0087509155,-0.103759766,-0.0022792816,0.015205383,-0.009689331,-0.04156494,-0.0059394836,0.010932922,0.
019088745,0.022659302,0.0011548996,0.0211792,0.007774353,-0.014884949]
@alwayslove2013
Collaborator

@zhuwenxing could you take a look at this issue?

@zhuwenxing
Collaborator

@greenhal
Thank you very much for identifying and reporting this issue. The original data of the LAION 100M dataset comes from https://the-eye.eu/public/AI/cah/laion5b/embeddings/laion2B-multi/img_emb/. After analyzing the original data, we found that it already contains duplicates.

Statistics of duplicate rows:
Number of duplicate groups: 3014
The most repeated row appears 902 times
The least repeated row appears 2 times
Average number of repetitions: 5.35
Total number of affected rows: 16134
Proportion of total rows: 1.70%

Code to reproduce:

wget https://the-eye.eu/public/AI/cah/laion5b/embeddings/laion2B-multi/img_emb/img_emb_0000.npy

import numpy as np

# Assume your array is named data

def find_and_display_duplicate_ids(arr):
    # Create an array representing the ID of each row
    row_ids = np.arange(arr.shape[0])
    
    # Use np.unique to group identical rows; inverse_indices maps each row to its group.
    # The array is passed directly: stacking the ids onto it would upcast it and add nothing.
    _, inverse_indices, counts = np.unique(arr, axis=0, return_inverse=True, return_counts=True)
    
    # Find duplicate rows (rows with count greater than 1)
    duplicate_mask = counts > 1
    duplicate_counts = counts[duplicate_mask]
    
    # Get the IDs of duplicate rows
    duplicate_groups = [row_ids[inverse_indices == i] for i in range(len(counts)) if counts[i] > 1]
    
    if len(duplicate_groups) > 0:
        print(f"Found {len(duplicate_groups)} groups of duplicate rows:")
        for i, (group, count) in enumerate(zip(duplicate_groups, duplicate_counts), 1):
            print(f"\nDuplicate group {i}:")
            print(f"ID: {group}")
            print(f"Number of duplicates: {count}")
    else:
        print("No duplicate rows found")
    
    return duplicate_groups, duplicate_counts

data = np.load('img_emb_0000.npy')
# Use the function
duplicate_groups, duplicate_counts = find_and_display_duplicate_ids(data)

# Additional statistics
if len(duplicate_groups) > 0:
    print(f"\nStatistics of duplicate rows:")
    print(f"Number of duplicate groups: {len(duplicate_groups)}")
    print(f"The most repeated row appears {duplicate_counts.max()} times")
    print(f"The least repeated row appears {duplicate_counts.min()} times")
    print(f"Average number of repetitions: {duplicate_counts.mean():.2f}")
    
    # Calculate the total number of affected rows
    total_duplicate_rows = sum(len(group) for group in duplicate_groups)
    print(f"Total number of affected rows: {total_duplicate_rows}")
    print(f"Proportion of total rows: {total_duplicate_rows / len(data):.2%}")
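
A lighter-weight variant of the same check, offered only as a sketch: instead of sorting the whole array with np.unique, group rows by their raw bytes in a dictionary. The helper name group_duplicate_rows and the toy array are illustrative and are not part of the benchmark code. Note that byte equality is stricter than numeric equality (for example, 0.0 and -0.0 have different bytes), so counts could differ slightly from the np.unique approach.

```python
from collections import defaultdict

import numpy as np


def group_duplicate_rows(arr):
    """Return a list of row-id lists, one per group of byte-identical rows."""
    groups = defaultdict(list)
    for row_id, row in enumerate(np.ascontiguousarray(arr)):
        # Rows with identical bytes land in the same bucket.
        groups[row.tobytes()].append(row_id)
    # Keep only the buckets that hold more than one row.
    return [ids for ids in groups.values() if len(ids) > 1]


# Toy data standing in for img_emb_0000.npy (hypothetical values).
toy = np.array(
    [[0.1, 0.2], [0.3, 0.4], [0.1, 0.2], [0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
    dtype=np.float16,
)
duplicate_groups = group_duplicate_rows(toy)
total_duplicate_rows = sum(len(ids) for ids in duplicate_groups)
print(duplicate_groups, total_duplicate_rows)
```

On the real file the same call would take the array returned by np.load in place of the toy array.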

@alwayslove2013
Collaborator

@greenhal duplicate data is mentioned in their blog as well.
https://laion.ai/blog/laion-400-open-dataset/

There is a certain degree of duplication because we used URL+text as deduplication criteria. The same image with the same caption may sit at different URLs, causing duplicates.

greenhal changed the title from "Duplicate vectosr in LAION 100M dataset" to "Duplicate vectors in LAION 100M dataset" on Aug 7, 2024
@greenhal
Contributor Author

greenhal commented Aug 7, 2024

Thank you for confirming that this is a known characteristic of this dataset.

These duplicates make it impossible to measure recall correctly when one of the test vectors or its neighbors has more duplicates than the requested k value.

For example, the vector represented by id 795579 (query id 5 in test.parquet) has 114 exact matches. The k100 results from this query cannot be guaranteed to match the k100 from the ground truth file, even though they are correct. In this dataset, there are 9 queries that have more than 100 exact matches and 21 that have more than 21. I would expect this to be higher with the larger LAION datasets.
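
A rough sketch of how queries like this could be flagged, assuming the train and query vectors are available as NumPy arrays with the same dtype. The function name and the toy arrays are hypothetical and do not come from the benchmark code. It only counts rows identical to the query itself, so it does not catch ties among a query's other neighbors.

```python
from collections import Counter

import numpy as np


def queries_with_many_exact_matches(train, queries, k):
    """Return (query index, match count) for queries with more than k identical train rows."""
    # Count each distinct train row by its raw bytes (both arrays must share a dtype).
    row_counts = Counter(row.tobytes() for row in np.ascontiguousarray(train))
    flagged = []
    for query_idx, query in enumerate(np.ascontiguousarray(queries)):
        matches = row_counts[query.tobytes()]
        if matches > k:
            flagged.append((query_idx, matches))
    return flagged


# Toy data: the first query has 3 exact matches, the second 2, the third 1.
train = np.array([[1.0, 0.0]] * 3 + [[0.0, 1.0]] * 2 + [[0.5, 0.5]], dtype=np.float32)
queries = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]], dtype=np.float32)
flagged = queries_with_many_exact_matches(train, queries, k=2)
print(flagged)
```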

This issue results in an inaccurate recall measurement when using this dataset.

We propose that distances be included in the ground truth file and that, when there is a tie at the end of the ground truth set, the set be extended to include all tied ids. The results are then compared against the extended ground truth. (This is how big-ann-benchmarks calculates recall.)

Using the example above, for query 5 and k100, the ground truth passed to calc_recall would be the first 114 ids, not just the first 100.
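
The proposal can be sketched as follows. This is a minimal illustration under stated assumptions, not the benchmark's calc_recall and not the big-ann-benchmarks implementation: the function names are hypothetical, the ground truth is assumed to be sorted by ascending distance, and ties are detected with exact equality.

```python
def extend_ground_truth_with_ties(gt_ids, gt_distances, k):
    """Return the first k ground-truth ids plus any ids tied with the k-th distance."""
    gt_length = min(k, len(gt_ids))
    # Keep extending while the next neighbor is exactly as far away as the last one kept.
    while gt_length < len(gt_ids) and gt_distances[gt_length] == gt_distances[gt_length - 1]:
        gt_length += 1
    return gt_ids[:gt_length]


def recall_with_ties(gt_ids, gt_distances, result_ids, k):
    """Fraction of the top-k results found in the tie-extended ground truth (k >= 1)."""
    extended = set(extend_ground_truth_with_ties(gt_ids, gt_distances, k))
    hits = sum(1 for result_id in result_ids[:k] if result_id in extended)
    return hits / k


# Toy example: ids 11, 12 and 13 are all tied at distance 0.5.
gt_ids = [10, 11, 12, 13, 14]
gt_distances = [0.0, 0.5, 0.5, 0.5, 0.9]
results = [10, 13]  # a correct answer that picked a different tied id

plain_recall = len(set(gt_ids[:2]) & set(results)) / 2
tie_aware_recall = recall_with_ties(gt_ids, gt_distances, results, k=2)
print(plain_recall, tie_aware_recall)
```

With the plain top-k comparison the correct result scores 0.5, while the tie-extended comparison scores 1.0. Whether exact equality or a small tolerance is the better tie test depends on how the distances were computed and stored.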

@alwayslove2013
Collaborator

@greenhal Excellent suggestions! Including all groundtruth IDs that satisfy the distance criteria in the recall calculation is a sensible approach.

We apologize that we did not notice the "duplicate vectors" feature when we selected the dataset previously, and we did not account for this in the design. As a result, we did not store the distance information when preparing the groundtruth file. It will require some time to re-prepare the groundtruth.

@Xavierantony1982

Xavierantony1982 commented Oct 31, 2024

We were able to add distances to the ground truth file and then added code to calculate recall based on distance ties.
Should I open a pull request with just the changes for calculating recall based on distance?
It's just three lines of code.
Results containing ties may not be in the same order as the ground truth.
If the ground truth has distances, we check for ties and return gt[:self.k] + ties.
If there is a distance tie at the end, it is included in the ground truth.

`
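gt_has_distance = "distance" in ground_truth.columns

# Start from the top-k ground truth ids.
gt_length = self.k
if gt_has_distance:
    distance = ground_truth["distance"][idx]
    # Extend the ground truth while the next neighbor ties with the last one kept.
    while gt_length < len(distance) and distance[gt_length - 1] == distance[gt_length]:
        gt_length += 1

recalls.append(calc_recall(self.k, gt[:gt_length], results))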
`
