
Duplicates and multiple versions of samples #10

Closed
pfischer-nvidia opened this issue May 9, 2023 · 5 comments
Labels
bug Something isn't working

@pfischer-nvidia

Dear authors,
while processing the MMC4 dataset, we found some anomalies and we hope you can comment on or explain these.

Our Expectations

  • There is one full large dataset (mmc4) that includes samples with face detections and there are several subsets of that large dataset that have been filtered:
    • One subset that contains only the samples without face detections (mmc4-ff) (public)
    • One subset that contains only the "core" i.e. samples with strict filtering (mmc4-core)
    • One subset that contains only the intersection of all these (mmc4-core-ff) (public)
  • We assume that those are true subsets, e.g. every sample in mmc4-core-ff would also be contained in mmc4-ff.
  • We assume that within each of the subsets, every sample is unique:
    • Meaning each web page on the internet resulted in at most one sample
    • Of course, different web pages under the same domain could result in multiple samples

Our Findings

We found that

  • each of the subsets seems to contain many exact duplicate samples, at a rate of up to 1-2% of all samples
  • some samples occur multiple times in different subsets, but slightly changed, for example with more images or with different similarity measures
  • some subsets don't seem to be true subsets: they contain samples that are not part of the corresponding larger set, or the larger set contains only a variant of those samples

Exact Duplicates

At first, we matched samples by the MD5 hash of the JSON string to find exact duplicates.

For example for mmc4-core-ff, we found 5598117 total samples (i.e. json lines) among all shards, but only 5506430 unique samples.
This means that 1.6% within that subset are exact duplicates.
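A check of this kind can be sketched as follows (a minimal illustration, not the exact script we used; shard unzipping and iteration are omitted):

```python
import hashlib

def count_exact_duplicates(jsonl_lines):
    """Count total vs. unique samples by the MD5 of each raw JSON line.

    `jsonl_lines` is an iterable of raw JSON strings, one sample each.
    Returns (total, unique); the gap between the two is the number of
    exact duplicates.
    """
    seen = set()
    total = 0
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        total += 1
        seen.add(hashlib.md5(line.encode("utf-8")).hexdigest())
    return total, len(seen)

# Toy example: three lines, two of them byte-identical.
lines = ['{"url": "a", "x": 1}', '{"url": "a", "x": 1}', '{"url": "b", "x": 2}']
total, unique = count_exact_duplicates(lines)  # total == 3, unique == 2
```

Note that this only catches byte-identical lines; samples that differ in key order or whitespace would need JSON normalization first.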

Other Duplicates

If we match just by the document URL string, the duplicate rate is higher, in the case of mmc4-core-ff we then obtain only 5492699 unique samples, so 1.9% are duplicates.
Interestingly, the duplicates appear not just twice but up to 88 times each.

Here are the top ten duplicate URLs with the number of appearances:

('https://site.clubrunner.ca/page/clubrunner-mobile-app-now-available', 88),
('https://www.amazon.com.au/All-New-Kindle-With-Front-Light-Black/dp/B07FQ4DJ83', 59),
('https://www.plentygram.com/blog/how-to-make-your-instagram-account-famous/', 46),
('http://www.fuelly.com/', 41),
('https://www.bhhsnv.com/', 39),
('https://www.kikocosmetics.com/en-us/', 34),
('http://www.manchesteruniversitypress.co.uk/articles/freedom-and-the-fifth-commandment-qa-with-brian-heffernan/', 31),
('http://www.manchesteruniversitypress.co.uk/articles/mup-advent-calendar-starts-thursday/', 31),
('https://emeraldcoastbyowner.com/', 29),
('https://www.ait.com/web-development/?typhon', 29)
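A URL-based tally of this kind can be sketched as follows (assuming each JSON line carries its page address in a `url` field, as in the released shards):

```python
import json
from collections import Counter

def url_duplicate_stats(jsonl_lines):
    """Tally how often each document URL appears across samples.

    Returns (total samples, unique URLs, top-10 most frequent URLs
    with their counts).
    """
    counts = Counter(json.loads(line)["url"]
                     for line in jsonl_lines if line.strip())
    return sum(counts.values()), len(counts), counts.most_common(10)

# Toy example: "a" appears twice, "b" once.
lines = ['{"url": "a"}', '{"url": "a"}', '{"url": "b"}']
total, unique, top = url_duplicate_stats(lines)  # 3, 2, [("a", 2), ("b", 1)]
```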

We took a closer look at the URL with 88 occurrences and found that 87 of those samples are exact duplicates, but one is slightly different: its image similarities and similarity matrix differ, although its text and images match those of the other 87 samples.

Faces vs. No Faces

We assumed that the fewer-faces datasets are simply filtered versions of the sets with faces.
We filtered the set with faces ourselves, keeping only the samples that have face_detections: None.
However, this does not result in the same set as the published fewer faces set.
This effect is related to the similar but slightly different samples mentioned above.
One example is this:
Compare mmc4_core_faces/docs_shard_4943_v3.jsonl.zip sample 113 with mmc4_full_faces/docs_shard_4943_v2.jsonl.zip sample 1523.
Both have the same URL and the core set should be a subset of the full set. However, the second sample contains an additional image with face detections, while all other images contain no face detections.
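Our filtering can be sketched roughly like this (an illustration, not the exact script; the field names `image_info` and `face_detections` follow the released shards, with `face_detections` being None when no face was found):

```python
import json

def keep_faceless_docs(jsonl_lines):
    """Keep only documents in which no image has any face detection."""
    kept = []
    for line in jsonl_lines:
        if not line.strip():
            continue
        doc = json.loads(line)
        # A document survives only if every image's face_detections is None.
        if all(img.get("face_detections") is None
               for img in doc.get("image_info", [])):
            kept.append(doc)
    return kept

# Toy example: the second document has a face box, so only one survives.
docs = ['{"image_info": [{"face_detections": null}]}',
        '{"image_info": [{"face_detections": [[0, 0, 1, 1]]}]}']
faceless = keep_faceless_docs(docs)  # one document kept
```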

[image: comparison of the two samples]

Questions

  • How were the 4 sets constructed by the authors?
  • Are our assumptions/expectations correct?
  • If there are multiple different versions of a sample (e.g. one with more images) which one is the correct one?
jmhessel (Contributor) commented May 9, 2023

Hi @pfischer-nvidia ! Thanks for your interest in the dataset, and for going through the corpus in detail! Our goal was to release the corpus as a v1 exactly so we can get community input about quality issues, and so this input is super helpful. I will go through this in more detail soon, but wanted to get back to you with some quick answers ASAP:

  • Are the "fewer faces" subsets true subsets?

For a strict definition of "subset", they aren't true subsets, and they aren't intended to be --- my apologies for the confusing naming. Imagine a document with images, some of which contain faces and some of which don't. If you simply remove the images with detected faces, the resulting image-text alignment might not have as high a similarity as if you re-ran the assignment procedure. So we remove images with detected faces and then re-run the assignment algorithm, which might result in different assignments globally.
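This re-assignment effect can be illustrated with a toy max-weight matching. This is not the actual pipeline (which uses CLIP similarities and a proper assignment algorithm); it is just a brute-force sketch showing that dropping one image and re-solving can move the remaining images to different sentences:

```python
from itertools import permutations

def best_assignment(sim):
    """Brute-force max-weight assignment of images (rows) to sentences (cols).

    `sim` is a small images-by-sentences similarity matrix with
    len(sim) <= len(sim[0]). Returns (best total similarity, tuple
    mapping image i -> sentence index). Only feasible for tiny inputs;
    real pipelines use e.g. the Hungarian algorithm.
    """
    n_img, n_sent = len(sim), len(sim[0])
    best_score, best_cols = float("-inf"), None
    for cols in permutations(range(n_sent), n_img):
        score = sum(sim[i][c] for i, c in enumerate(cols))
        if score > best_score:
            best_score, best_cols = score, cols
    return best_score, best_cols

# Toy integer "similarities": dropping image 0 (its row) and re-solving
# moves image 1 from sentence 1 to sentence 0.
sim = [[9, 5],
       [8, 6]]
full = best_assignment(sim)         # (15, (0, 1)): image 1 -> sentence 1
reduced = best_assignment(sim[1:])  # (8, (0,)):  image 1 -> sentence 0
```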

  • Are the "core" subsets true subsets?

These are also not true subsets for a strict definition of subset. As described in the paper, there are additional filters we apply that can affect which images are available within each document: these include document-level thresholds (like min/max number of images/sentences), but also within-document steps like stricter deduplication that (as mentioned in the paper) can create some false positives, which are discarded.

  • Regarding ~1-2% of duplicates:

This is something we are aware of, and it is a concern with many pretraining datasets out there. Our assumption was that the deduplication efforts of c4 were sufficient for us not to run our own deduplication, but we have also recently noticed a small number of duplicate URLs. We removed a /ton/ of duplicate images from our original 1.4B set, but it looks like we missed these in v1 of the release. We'll check it out with your findings.

@jmhessel jmhessel added the bug Something isn't working label May 11, 2023
jmhessel (Contributor) commented May 11, 2023

Hi @pfischer-nvidia --- thanks for this report! Along with fixing some of the alignments mentioned in #11 , we are working on a v1.1 of the corpus now which aims to address the ~1% duplicate url issue.

pfischer-nvidia (Author) commented May 15, 2023

Thanks. Are you going to make the samples unique w.r.t. the URL?
And how should we interpret the current _v2 and _v3 suffixes of the files?

pfischer-nvidia (Author) commented
Oh, and one more question: a large part of the images referenced in the dataset is no longer available on the internet. Would it be possible to get these images from you?

jmhessel (Contributor) commented May 16, 2023

I am closing this issue as resolved by #13 --- the update we made was to do probabilistic deduplication such that, in expectation, each URL appears once. But if you want a more strictly URL-deduplicated set, you can discard any docs marked by the new could_have_url_duplicate field (see the main readme). Thanks for your help on this!
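The stricter filtering described here can be sketched as follows (assuming, per the readme, that the v1.1 per-document `could_have_url_duplicate` field is truthy for docs that may share a URL with another doc):

```python
import json

def strict_url_dedup(jsonl_lines):
    """Keep only docs NOT flagged as possible URL duplicates."""
    docs = (json.loads(line) for line in jsonl_lines if line.strip())
    # Docs without the field (or with a falsy value) are kept.
    return [doc for doc in docs if not doc.get("could_have_url_duplicate")]

# Toy example: the flagged doc is dropped, the unflagged doc is kept.
lines = ['{"url": "a", "could_have_url_duplicate": 1}', '{"url": "b"}']
kept = strict_url_dedup(lines)  # only the "b" document remains
```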

to answer your questions:

  • v2 and v3 suffixes are internal and can be ignored. I guess the file naming is a bit messy because we have v1.1 as the public versions of things. Apologies for this. If you're curious, v2 and v3 roughly correspond to "full" and "core" and align with the preprocessing rounds we did internally.
  • For raw images, the main readme has a raw image interest list. For copyright/legal reasons, I can't directly distribute images. Can you provide some statistics about what percentage of missing images you're finding? If a very high number are missing, I can do more thinking about potential solutions.
