Compare 2 pdf files pixel by pixel to find all the differences #584

Closed
jiezhoujenny opened this issue Aug 4, 2020 · 7 comments

@jiezhoujenny

Hi, is it possible using pymupdf to compare 2 pdf files by pixel and highlight if there is difference?

@JorjMcKie
Collaborator

Sure, there are many possibilities to do something like that. It depends on what you want to know.

  • PDFs are special ASCII files with interspersed binary sections (representing images, fonts and more). So you can use some diff-like comparison which outputs differing "lines".
  • You can check whether they just look alike. Make images from the pages and compare those (details further down). This may lead to false conclusions of equality if one of them is a "flattened" version of the other one (i.e. only consisting of page images).

If you do want a visual comparison, make pixmaps of pages and compare the raw pixels contained in the two Pixmap.samples objects. These can be regarded as tables of width columns and height rows, where each item (the pixel) is a tuple of Pixmap.n integers in range(256).
You could define a memoryview reflecting this and compare pixel by pixel.
If you have chosen standard resolution for the pixmaps (one pixel per point), there is a trivial correlation between each pixel's coordinate (x, y) and the position on the page rectangle.
To speed things up, you should first compare the samples as a whole, and only fall back to a by-pixel comparison if these objects are unequal.
Talking about performance vs. precision: if a rough indication of differences is sufficient, you may want to choose a lower image resolution. There is a quadratic relation here: scaling down by 50% will reduce runtime to 25%, etc.
The next question would be how to visibly show any differences. You can set single pixels to some "alarm" color (or invert the current color), or you can add an annotation which covers a larger region of differences, etc.
What do you want to do if the page rectangles are different? What about different page counts? Different page rotations?
This is probably the point where your creativity will be challenged ...
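
The "compare as a whole first, then pixel by pixel" idea can be sketched without any PDF library, working only on the raw bytes. This is a minimal sketch, not the library's method: the function name diff_pixels is hypothetical, and it assumes the buffer has no row padding (each row is exactly width * n bytes). With PyMuPDF, the arguments would presumably come from pix.samples, pix.width, pix.height and pix.n.

```python
def diff_pixels(samples_a, samples_b, width, height, n):
    """Return the (x, y) coordinates of all pixels that differ.

    samples_a / samples_b: raw pixel bytes, row by row, n bytes per pixel.
    Assumption: no row padding, so one row occupies width * n bytes.
    """
    expected = width * height * n
    if len(samples_a) != expected or len(samples_b) != expected:
        raise ValueError("buffer size does not match width * height * n")

    # Fast path: identical buffers mean identical images.
    if samples_a == samples_b:
        return []

    view_a = memoryview(samples_a)
    view_b = memoryview(samples_b)
    stride = width * n
    differing = []
    for y in range(height):
        row = y * stride
        # Skip rows that are equal as a whole.
        if view_a[row:row + stride] == view_b[row:row + stride]:
            continue
        for x in range(width):
            i = row + x * n
            if view_a[i:i + n] != view_b[i:i + n]:
                differing.append((x, y))
    return differing


# Tiny 3 x 2 RGB example: change the pixel at x=2, y=1.
a = bytes(3 * 2 * 3)
b = bytearray(a)
b[15:18] = bytes([255, 0, 0])
changed = diff_pixels(a, bytes(b), 3, 2, 3)
```

Skipping equal rows keeps the inner loop away from the (usually large) unchanged parts of a page.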

@jiezhoujenny
Author

Thank you so much for your quick reply, @JorjMcKie . I am trying to complete my task by using pixmap you suggested. Hope I could get the result I am looking for.

@jiezhoujenny
Author

Hi JorjMcKie,
I tried the following code. It doesn't raise any error, but result.pdf doesn't show any changes. I am wondering whether I am using setPixel correctly. Could you please provide a sample? Many thanks.

def pixmap():
    doc1 = fitz.open("test1.pdf")
    doc2 = fitz.open("test2.pdf")
    page1=doc1[113]
    page2=doc2[113]
    red = (1, 0, 0)
    pix=page1.getPixmap(alpha=False)
    pix2=page2.getPixmap(alpha=False)
    pix_ht=pix.height
    pix_wdt=pix2.width
    bild = np.ndarray((pix_wdt,pix_ht, 3), dtype=np.uint8)
    bild2 = np.ndarray((pix_wdt,pix_ht, 3), dtype=np.uint8)
    for i in range(pix_ht):
        for j in range(pix_wdt):
            bild[j,i] = pix.pixel(j,i)
            bild2[j,i]=pix2.pixel(j,i)
    comparison = bild == bild2
    equal_arrays = comparison.all()
    if not equal_arrays:
        for i in range(len(comparison)):
            for j in range(len(comparison[i])):
                if not comparison[i,j].all():
                    pix2.setPixel(i, j, red)
    doc2.save("result.pdf")

@JorjMcKie
Collaborator

JorjMcKie commented Aug 13, 2020

  1. You do not need numpy at all. You can directly compare single pixels: pix.pixel(i, j) == pix2.pixel(i, j), and you can compare the set of all pixels: pix1.samples == pix2.samples.
  2. When you change the pixels via pix.setPixel(), you should make sure you use a tuple that is compatible with the pixmap. In your example "red" is a 3-tuple, so I am assuming it is an RGB pixmap with no alpha ... but otherwise the method would react with an exception anyway.
  3. Remember that a pixmap is not part of the PDF, so saving the PDF makes no sense at all. You can only write the changed pixmap to some image file: pix2.writeImage("pix2.png").

Otherwise your example should work.
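
Regarding point 2: a tuple can have the right length and still carry the wrong values. PyMuPDF expresses PDF-style colors as floats between 0 and 1, whereas pixel components are integers in range(256). The helper below is a hypothetical sketch of the conversion (the name to_pixel_color is not part of PyMuPDF); it only checks the length against the assumed number of components n and scales each value.

```python
def to_pixel_color(color, n=3):
    """Convert a color given as floats in 0..1 to n integers in 0..255.

    Hypothetical helper: n stands for the number of components per pixel
    (3 for an RGB pixmap without alpha).
    """
    if len(color) != n:
        raise ValueError(f"expected {n} components, got {len(color)}")
    if not all(0 <= c <= 1 for c in color):
        raise ValueError("components must lie between 0 and 1")
    return tuple(round(c * 255) for c in color)


red_pixel = to_pixel_color((1, 0, 0))
```

The resulting tuple is what a pixel-level method would expect, while the float form stays usable wherever PDF-style colors are required.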

@JorjMcKie
Collaborator

Otherwise your example should work.

Oh, I just realized you use the tuple (1, 0, 0) for the color red. This is incorrect: in pixel notation, this color must be coded as (255, 0, 0).
I just compared two pages with this script. Page2 has some additional text ("More text").

...
t0 = time.perf_counter()
pix_count = 0
for i in range(pix1.width):
    for j in range(pix1.height):
        if pix1.pixel(i, j) != pix2.pixel(i, j):
            pix2.setPixel(i, j, red)
            pix_count += 1
t1 = time.perf_counter()
pix2.writeImage("x.png")
print("Modified %i pixels" % pix_count)
print("Duration %g seconds" % round(t1 - t0, 2))

Here is the modified page picture (x.png). Please note the "ragged" look of the text which is now colored red. The reason is that the rendering depends on the chosen resolution: the lower the resolution, the coarser the image of the text.
And because every differing pixel is colored red, we see a lot of red pixels, which blur the shape of the text.
If you want more precision, you must increase the resolution (using a fitz.Matrix with a higher zoom factor). But of course, the number of pixels then grows quadratically: doubling the zoom factor yields an image with four times as many pixels.
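
To put numbers on the quadratic growth, here is a small arithmetic sketch. The function pixmap_size is hypothetical and only approximates the rendered dimensions by rounding; the page size of 612 x 792 points (US Letter) is an assumed example. In PyMuPDF the zoom itself would be applied with something like page.getPixmap(matrix=fitz.Matrix(zoom, zoom)).

```python
def pixmap_size(page_width_pt, page_height_pt, zoom=1.0):
    """Approximate pixel dimensions of a page rendered at a given zoom.

    At zoom 1.0 one pixel corresponds to one point (72 dpi).
    """
    return round(page_width_pt * zoom), round(page_height_pt * zoom)


def pixel_count(page_width_pt, page_height_pt, zoom=1.0):
    """Number of pixels to compare for one page at the given zoom."""
    width, height = pixmap_size(page_width_pt, page_height_pt, zoom)
    return width * height


# Assumed example page: US Letter, 612 x 792 points.
for zoom in (0.5, 1.0, 2.0):
    w, h = pixmap_size(612, 792, zoom)
    print(f"zoom {zoom}: {w} x {h} = {pixel_count(612, 792, zoom)} pixels")
```

A per-pixel loop scales with this count, so halving the zoom is a cheap way to get a first, rough answer before rerunning only the differing pages at higher resolution.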

@JorjMcKie
Collaborator

Closing this for the time being. Please do not hesitate to reopen or submit another issue.

@jiezhoujenny
Author

Thank you so much for your help.
I tried marking the differences by changing the pixel color and saving the result as a PNG, but the result looks blurry. Besides, since the document has thousands of pages, saving PNGs is not a good way for users to check the results.
I changed it to add an annotation wherever pixels differ, which works pretty well and is very fast.
Thank you very much again for your great work.
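
The annotation code itself is not shown in this thread. As a rough sketch of one way to get from differing pixels to an annotation rectangle, the hypothetical helper below computes the smallest rectangle covering a list of (x, y) pixel coordinates and scales it back to page coordinates. It assumes an unrotated page whose rectangle starts at (0, 0) and a pixmap rendered with the same zoom in both directions. The resulting tuple could then be wrapped in a fitz.Rect and handed to an annotation method such as page.addRectAnnot.

```python
def bounding_rect(points, zoom=1.0):
    """Smallest rectangle (x0, y0, x1, y1) in page coordinates covering
    all given pixel coordinates, or None if there are no points.

    Assumptions: unrotated page, page rectangle starting at (0, 0),
    pixmap rendered with the same zoom factor horizontally and vertically.
    """
    if not points:
        return None
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    # Add 1 so the rectangle fully covers the last differing pixel.
    return (
        min(xs) / zoom,
        min(ys) / zoom,
        (max(xs) + 1) / zoom,
        (max(ys) + 1) / zoom,
    )


# Two differing pixels found on a pixmap rendered at zoom 2.
rect = bounding_rect([(2, 1), (4, 3)], zoom=2.0)
```

One rectangle per page is the coarsest choice; grouping the points by rows or by proximity before calling the helper would give tighter marks.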
