Substitute Non-Breaking-Space with Normal-Space for PDF font character lookup #21

Closed
rototor opened this issue Apr 18, 2016 · 9 comments

Comments

@rototor
Contributor

rototor commented Apr 18, 2016

I've investigated the "#" problem I described in #19 a bit further. The problem is that &nbsp; (the non-breaking space) is rendered as '#'. The # comes from the default xhtmlrenderer.conf:

# When rendering text, not all fonts support all character glyphs. When set to true, this
# will replace any missing characters with the specified character to aid in the debugging
# of your PDF.  Currently only supported for PDF rendering.
xr.renderer.replace-missing-characters=false
xr.renderer.missing-character-replacement=#

The character is used as the replacement even if xr.renderer.replace-missing-characters=false. It seems no font has a glyph for &nbsp;. That makes some sense, as it looks the same as a normal space.

Just replacing &nbsp; (character 160) with ' ' would fix the problem, but it does not feel like a correct fix to me:

--- a/openhtmltopdf-pdfbox/src/main/java/com/openhtmltopdf/pdfboxout/PdfBoxOutputDevice.java
+++ b/openhtmltopdf-pdfbox/src/main/java/com/openhtmltopdf/pdfboxout/PdfBoxOutputDevice.java
@@ -381,6 +332,8 @@ public class PdfBoxOutputDevice extends AbstractOutputDevice implements OutputDe
         for (int i = 0; i < str.length(); ) {
             int unicode = str.codePointAt(i);
             i += Character.charCount(unicode);
+            if( unicode == 160 )
+                unicode = ' ';
             String ch = String.valueOf(Character.toChars(unicode));
             boolean gotChar = false;

Especially because there are more spaces than just space and non-breaking space. For examples, see https://www.cs.tut.fi/~jkorpela/chars/spaces.html

danfickle added a commit that referenced this issue Apr 19, 2016
…h normal space if not present in font.

Thanks @rototor
@danfickle
Owner

I think (hope) using Character::isSpaceChar is the correct fix. We also need to make it easier to change the replacement character. Thanks @rototor for the patch, Daniel.
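
For readers following along, here is a minimal sketch of the idea discussed above: when the font has no glyph for a code point and `Character.isSpaceChar` reports it as a space, fall back to a normal space instead of the replacement character. This is not the code from the referenced commit. The class name, the method name, and the `hasGlyph` predicate are hypothetical stand-ins for the renderer's real font lookup.

```java
import java.util.function.IntPredicate;

final class SpaceFallback {
    /**
     * Replaces every space-like code point that the font cannot display with U+0020.
     * hasGlyph is an assumed stand-in for the real font lookup, not a library API.
     */
    public static String substituteMissingSpaces(String str, IntPredicate hasGlyph) {
        StringBuilder sb = new StringBuilder(str.length());
        for (int i = 0; i < str.length(); ) {
            int unicode = str.codePointAt(i);
            i += Character.charCount(unicode);
            // Only touch characters the font lacks; glyphs that exist are kept as-is.
            if (!hasGlyph.test(unicode) && Character.isSpaceChar(unicode)) {
                unicode = ' ';
            }
            sb.appendCodePoint(unicode);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Pretend font that only covers printable ASCII.
        IntPredicate asciiOnly = cp -> cp >= 0x20 && cp < 0x7F;
        // U+00A0 (non-breaking space) and U+2003 (em space) both become ' '.
        System.out.println(substituteMissingSpaces("a\u00A0b\u2003c", asciiOnly));
    }
}
```

Because the check is on the Unicode space-separator categories rather than on code point 160 alone, the other space variants from the linked list are covered too, at the cost of losing their distinct widths.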

@rototor
Contributor Author

rototor commented Apr 19, 2016

@danfickle I think using Character::isSpaceChar is really enough for now. If someone wants different "space widths", they can just use a <span> with the needed styles (e.g. display: inline-block and width: 0.5em).

@scoldwell

scoldwell commented Jun 17, 2016

@danfickle is there a timeframe for having this fixed (in a non-snapshot version)? We have an application using your library that is supposed to go into production, but the customer ran into this problem in user acceptance testing and is not likely to approve this moving to production the way it is. Thanks!

@danfickle
Owner

Can you give me the weekend to clean up some svg code before deploying a release or do you need it immediately? It's nice to hear that people are using this.

@scoldwell

Yeah that's no problem. Thanks for the quick response!

@scoldwell

Just FYI, I came across another character that causes a "#" to show up. ​ which is classified as a zero-width space: https://en.wikipedia.org/wiki/Zero-width_space

I've put in some character replacement in our code to deal with this for the time being, but thought you'd like to know. Thanks again for the fast turnaround.

@danfickle
Owner

At least we're not the only ones having trouble with this.
http://bugs.java.com/bugdatabase/view_bug.do?bug_id=4190860
https://josm.openstreetmap.de/ticket/8918

@scoldwell

Sorry that was supposed to be &#8203;

danfickle added a commit that referenced this issue Jun 23, 2016
The dangers of copy and pasting code! Our string width and string
replacement routines had drifted out of sync.
@danfickle
Owner

danfickle commented Jun 23, 2016

@scoldwell - If you are pre-filtering as a temporary fix, you may wish to use this function:

    /**
     * Checks if a code point is printable. If false, it can be safely discarded at the 
     * rendering stage, else it should be replaced with the replacement character,
     * if a suitable glyph can not be found.
     * @param codePoint
     * @return whether codePoint is printable
     */
    public static boolean isCodePointPrintable(int codePoint) {
        if (Character.isISOControl(codePoint))
            return false;

        int category = Character.getType(codePoint);

        return !(category == Character.CONTROL ||
                 category == Character.FORMAT ||
                 category == Character.UNASSIGNED ||
                 category == Character.PRIVATE_USE ||
                 category == Character.SURROGATE);
    }

As an implementation note, behavior will differ between Java 6 and later versions, because the Unicode version changed and Character::isWhitespace no longer returns true for zero-width spaces.
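
As a rough illustration of the pre-filtering idea, the function above could be applied to a whole string before it reaches the renderer. This is only a sketch: the class name and the `stripNonPrintable` helper are hypothetical, and it assumes Java 8 or later, where U+200B is in the FORMAT category.

```java
final class PrintableFilter {
    // Same check as the function quoted in this comment.
    public static boolean isCodePointPrintable(int codePoint) {
        if (Character.isISOControl(codePoint))
            return false;

        int category = Character.getType(codePoint);

        return !(category == Character.CONTROL ||
                 category == Character.FORMAT ||
                 category == Character.UNASSIGNED ||
                 category == Character.PRIVATE_USE ||
                 category == Character.SURROGATE);
    }

    /** Hypothetical helper: drops every code point that is not printable. */
    public static String stripNonPrintable(String input) {
        StringBuilder sb = new StringBuilder(input.length());
        input.codePoints()
             .filter(PrintableFilter::isCodePointPrintable)
             .forEach(sb::appendCodePoint);
        return sb.toString();
    }

    public static void main(String[] args) {
        // The zero-width space (U+200B) between the two words is removed.
        System.out.println(stripNonPrintable("zero\u200Bwidth"));
    }
}
```

Note that this also removes tabs and line breaks, since they are ISO control characters. That is usually harmless for short text fragments but may not be what you want if you run it over a whole HTML document.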

I'll close this issue now, as I think it is finally solved. Feel free to re-open if you find any other issues.
