Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

save_image_file should set DPI for derived images #343

Open
bertsky opened this issue Nov 11, 2019 · 9 comments
Open

save_image_file should set DPI for derived images #343

bertsky opened this issue Nov 11, 2019 · 9 comments
Assignees
Milestone

Comments

@bertsky
Copy link
Collaborator

bertsky commented Nov 11, 2019

Currently, any information on image resolution provided in the original image (and made available via OcrdExif in Workspace.image_from_page) is ignored when saving derived images in the workspace (via Workspace.save_image_file). Due to PIL.Image format internals, the PNG then contains a setting of 72 DPI however. This might create problems for processors that look at the derived image files alone.

But this is hard to fix in core: the image passed to save_image_file could come from anywhere (and usually does not have an info['dpi']; even simple PIL.Image operations omit that in the result).

Realistically though, it will have been created some way from the source image file under the same pageId, and since rescaling is currently not permitted in the spec, one could assume the same DPI for all derived images.

@wrznr
Copy link
Contributor

wrznr commented Nov 11, 2019

and usually does not have an info['dpi']

Why can't we check whether such information is available? Copy it, if that's the case and delete the default value if not? We should definitely not write default values which we know to be wrong (I added the label bug).

@bertsky
Copy link
Collaborator Author

bertsky commented Nov 11, 2019

and usually does not have an info['dpi']

Why can't we check whether such information is available?

We could, but as I said: usually it is not. Or should I say: almost always. Any non-mutating PIL.Image operation (even those that wouldn't affect the meta-data) omits info from the result. We cannot change that. And we cannot force processors to give us much more than what PIL.Image does.

We should definitely not write default values which we know to be wrong (I added the label bug).

Okay, but that's probably a bug in Pillow's save for the PNG format.

@bertsky
Copy link
Collaborator Author

bertsky commented Nov 11, 2019

We should definitely not write default values which we know to be wrong (I added the label bug).

Okay, but that's probably a bug in Pillow's save for the PNG format.

Oh, just found that this appears to have been fixed in Pillow 6.2.1. Since we cannot likely get any better, I will close for now.

@bertsky bertsky closed this as completed Nov 11, 2019
@kba
Copy link
Member

kba commented Nov 11, 2019

https://github.com/python-pillow/Pillow/blob/master/CHANGES.rst#621-2019-10-21

Should we update to 6.2.1 then? Do you have a sample for me to test? Thanks!

@bertsky
Copy link
Collaborator Author

bertsky commented Nov 11, 2019

Sorry, I was imprecise: It could have been fixed in an earlier version already.

I saw the following with 5.4.1 (leading up to this issue):

python -c "import PIL.Image; PIL.Image.open('repo/data/assets/scribo-test/data/OCR-D-IMG/OCR-D-IMG-orig_tiff.tif').save('test.png')"
identify -verbose test.png | grep Resolution:
 Resolution: 72x72

However, this does not happen any more with 6.2.1 (where correctly no DPI is saved).

@bertsky
Copy link
Collaborator Author

bertsky commented Jan 7, 2020

@wrznr just convinced me that we should indeed take action to ensure core and modules comply with the spec (which requires PPI information to be kept for derived images, cf. OCR-D/spec#137).

@bertsky bertsky reopened this Jan 7, 2020
@bertsky
Copy link
Collaborator Author

bertsky commented Jan 7, 2020

Not so sure about the time frame for this though. Since it involves patching all modules, I guess the final workshop is out of the question. Setting 3.0 to take out the pressure.

@bertsky bertsky added this to the 3.0.0 milestone Jan 7, 2020
@bertsky
Copy link
Collaborator Author

bertsky commented Sep 21, 2020

Since it involves patching all modules

If we just assume the derived image passed to save_image_file must come from the (cropped/deskewed/dewarped/binarized/denoised/...) original without rescaling somehow, then we'll only need to patch core (finding the original image with the same pageId and looking at its OcrdExif). Of course we could still allow processors to specify a DPI themselves (in case they did rescale, or just to speed up).

@bertsky
Copy link
Collaborator Author

bertsky commented Jun 26, 2024

If we just assume the derived image passed to save_image_file must come from the (cropped/deskewed/dewarped/binarized/denoised/...) original without rescaling somehow, then we'll only need to patch core (finding the original image with the same pageId and looking at its OcrdExif). Of course we could still allow processors to specify a DPI themselves (in case they did rescale, or just to speed up).

Alternatively, we could inject DPI info into the coords dict (along with affine transform and image features) at the top image_from_page and then make sure it gets passed down with each image_from_segment. Since images are only ever useful along with their coordinates, and image operations usually have to update that dict consistently (for example when rescaling), any caller of Workspace.save_image_file should have the DPI info nearby.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants