Compare 2 pdf files pixel by pixel to find all the differences #584

jiezhoujenny · 2020-08-04T15:28:09Z

Hi, is it possible to use PyMuPDF to compare 2 PDF files pixel by pixel and highlight any differences?

JorjMcKie · 2020-08-04T16:05:14Z

Sure, there are many possibilities to do something like that. It depends on what you want to know.

PDFs are special ASCII files with interspersed binary sections (representing images, fonts and more). So you can use some diff-like comparison which outputs differing "lines".
You can check whether they just look alike. Make images from the pages and compare those - details further down. This may lead to false equality conclusions if one of them is a "flattened" version of the other one (i.e. only consisting of page images).

If you do want a visual comparison, make pixmaps of pages and compare the raw pixels contained in the two Pixmap.samples objects. These can be regarded as tables of width columns and height rows, where each item (the pixel) is a tuple of Pixmap.n integers in range(256).
You could define a memoryview reflecting this and compare pixel by pixel.
If you have chosen standard resolution for the pixmaps, there is a trivial correlation between each pixel's coordinate (x, y) and the position on the page rectangle.
To speed things up, you should first compare the samples as a whole, and only fall back to a by-pixel comparison if these objects are unequal.
Talking about performance vs. precision: if a rough indication of differences is sufficient, you may want to choose a lower image resolution. There is a quadratic relation here: scaling down by 50% will reduce runtime to 25%, etc.
The next question would be how to visibly show any differences. You can set single pixels to some "alarm" color (or invert the current color), or you can add an annotation which covers a larger region of differences, etc.
What do you want to do if the page rectangles are different? What about different page counts? Different page rotations?
This is probably the point where your creativity will be challenged ...
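As a rough illustration of the strategy above (compare the samples as a whole first, then fall back to pixels), here is a minimal sketch. It works on plain byte buffers laid out like Pixmap.samples, i.e. height rows of width pixels with n bytes each. The function name diff_pixels and its parameters are hypothetical, and the sketch assumes both pixmaps have the same width, height and n, with no padding bytes at the end of a row.

```python
def diff_pixels(samples1, samples2, width, height, n):
    """Return the (x, y) coordinates of all pixels that differ.

    samples1 / samples2: bytes-like buffers in Pixmap.samples layout
    (assumption: row length is exactly width * n bytes).
    """
    if len(samples1) != len(samples2):
        raise ValueError("buffers have different sizes")

    # Fast path: identical buffers mean identical images.
    if samples1 == samples2:
        return []

    view1 = memoryview(samples1)
    view2 = memoryview(samples2)
    row_len = width * n
    diffs = []
    for y in range(height):
        start = y * row_len
        # Skip rows that are completely equal.
        if view1[start:start + row_len] == view2[start:start + row_len]:
            continue
        for x in range(width):
            off = start + x * n
            if view1[off:off + n] != view2[off:off + n]:
                diffs.append((x, y))
    return diffs
```

With PyMuPDF this would presumably be called as diff_pixels(pix1.samples, pix2.samples, pix1.width, pix1.height, pix1.n). Different page sizes, page counts or rotations still need their own handling before this point.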

jiezhoujenny · 2020-08-05T18:38:47Z

Thank you so much for your quick reply, @JorjMcKie . I am trying to complete my task by using pixmap you suggested. Hope I could get the result I am looking for.

jiezhoujenny · 2020-08-13T19:56:39Z

Hi JorjMcKie,
I tried the following code. It doesn't give me any error, but result.pdf doesn't show any changes. I am wondering whether I am using setPixel correctly. Could you please provide a sample for this? Many thanks.

def pixmap():
    doc1 = fitz.open("test1.pdf")
    doc2 = fitz.open("test2.pdf")
    page1 = doc1[113]
    page2 = doc2[113]
    red = (1, 0, 0)
    pix = page1.getPixmap(alpha=False)
    pix2 = page2.getPixmap(alpha=False)
    pix_ht = pix.height
    pix_wdt = pix2.width
    bild = np.ndarray((pix_wdt, pix_ht, 3), dtype=np.uint8)
    bild2 = np.ndarray((pix_wdt, pix_ht, 3), dtype=np.uint8)
    for i in range(pix_ht):
        for j in range(pix_wdt):
            bild[j, i] = pix.pixel(j, i)
            bild2[j, i] = pix2.pixel(j, i)
    comparison = bild == bild2
    equal_arrays = comparison.all()
    if not equal_arrays:
        for i in range(len(comparison)):
            for j in range(len(comparison[i])):
                if not comparison[i, j].all():
                    pix2.setPixel(i, j, red)
    doc2.save("result.pdf")

JorjMcKie · 2020-08-13T22:55:58Z

You do not need numpy at all. You can directly compare single pixels: pix.pixel(i, j) == pix2.pixel(i, j), and you can compare the set of all pixels: pix1.samples == pix2.samples.
When you change the pixels via pix.setPixel(), you should make sure you use a tuple that is compatible with the pixmap. In your example "red" is a 3-tuple, so I am assuming it is an RGB pixmap with no alpha ... but otherwise the method would react with an exception anyway.
Remember that a pixmap is not part of the PDF, so saving the PDF makes no sense at all. You can only write the changed pixmap to an image file: pix2.writeImage("pix2.png").

Otherwise your example should work.

JorjMcKie · 2020-08-14T09:28:43Z

Otherwise your example should work.

Oh, I just realized you use the tuple (1, 0, 0) for the color red. This is incorrect: in pixel notation this color must be coded as (255, 0, 0).
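To make the difference between the two notations concrete, here is a small sketch that converts a color given as floats in the range 0..1 (the form used in the question) into the integer form a pixel needs. The helper name to_pixel_color is hypothetical and not part of PyMuPDF.

```python
def to_pixel_color(color):
    """Convert float components in 0..1 to integer components in 0..255."""
    return tuple(max(0, min(255, round(c * 255))) for c in color)


red = to_pixel_color((1, 0, 0))
print(red)  # (255, 0, 0)
```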
I just compared two pages with this script. Page2 has some additional text ("More text").

...
t0 = time.perf_counter()
pix_count = 0
for i in range(pix1.width):
    for j in range(pix1.height):
        if pix1.pixel(i, j) != pix2.pixel(i, j):
            pix2.setPixel(i, j, red)
            pix_count += 1
t1 = time.perf_counter()
pix2.writeImage("x.png")
print("Modified %i pixels" % pix_count)
print("Duration %g seconds" % round(t1 - t0, 2))

Here is the modified page picture (x.png). Please note the "ragged" look of the text which is now colored red. The reason is that taking pixmaps depends on the chosen resolution: the lower the resolution, the coarser the image of the text.
And because every differing pixel is colored red, we see a lot of red pixels, which blur the shape of the text.
If you want more precision, you must increase the resolution (using a fitz.Matrix with a higher zoom factor). But of course, the number of pixels then grows quadratically: doubling the zoom factor means an image four times as large.
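The arithmetic behind this can be sketched in a few lines. The sketch assumes a page of 612 x 792 points (US Letter), a uniform zoom as in fitz.Matrix(zoom, zoom), no page rotation, and a page rectangle starting at (0, 0). The helper names pixmap_size and pixel_to_page are hypothetical, and the exact rounding PyMuPDF applies to pixmap dimensions may differ.

```python
def pixmap_size(page_width, page_height, zoom):
    """Approximate pixmap dimensions for a page rendered at the given zoom."""
    return round(page_width * zoom), round(page_height * zoom)


def pixel_to_page(x, y, zoom):
    """Map a pixel coordinate back to page coordinates (points)."""
    return x / zoom, y / zoom


for zoom in (1, 2, 4):
    w, h = pixmap_size(612, 792, zoom)
    # The pixel count grows with the square of the zoom factor.
    print("zoom %g: %i x %i = %i pixels" % (zoom, w, h, w * h))
```

At zoom 1 a pixel coordinate equals the page coordinate, which is the trivial correlation mentioned earlier; at any other zoom, dividing by the zoom factor recovers the page position.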

JorjMcKie · 2020-08-18T06:43:37Z

Closing this for the time being. Please do not hesitate to reopen or submit another issue.

jiezhoujenny · 2020-08-18T19:47:07Z

Thank you so much for your help.
I tried marking the differences by changing the pixel color and saving as PNG, but the result looks blurry. Besides, since the document has thousands of pages, saving PNG files is not a good way for users to check the result.
I changed it so that an annotation is added wherever pixels differ. It works pretty well and is very fast.
Thank you very much again for your great work.
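The annotation code itself is not shown in this thread. As a guess at how such an approach could start, the sketch below collapses a list of differing pixel coordinates into one bounding rectangle in page coordinates. The function name diff_bbox and the zoom parameter are hypothetical.

```python
def diff_bbox(diff_points, zoom=1.0):
    """Return (x0, y0, x1, y1) in page coordinates covering all given pixels.

    diff_points: iterable of (x, y) pixel coordinates that differ.
    zoom: zoom factor used when rendering the pixmap (assumed uniform).
    Returns None if there are no differences.
    """
    points = list(diff_points)
    if not points:
        return None
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    # Add 1 to the maxima so the last differing pixel is fully covered.
    return (
        min(xs) / zoom,
        min(ys) / zoom,
        (max(xs) + 1) / zoom,
        (max(ys) + 1) / zoom,
    )
```

The resulting tuple could then be wrapped in a fitz.Rect and passed to an annotation method of the page, for example a rectangle annotation, after which saving the PDF does make sense because annotations, unlike pixmaps, are part of the document. One rectangle per page is coarse; grouping nearby pixels into several rectangles would be a natural refinement.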

jiezhoujenny added the question label Aug 4, 2020

jiezhoujenny assigned JorjMcKie Aug 4, 2020

jiezhoujenny closed this as completed Aug 5, 2020

jiezhoujenny reopened this Aug 13, 2020

JorjMcKie closed this as completed Aug 18, 2020

danwos mentioned this issue Jul 28, 2023

Test framework for sphinx-simplepdf useblocks/sphinx-simplepdf#83

Open
