Substitute Non-Breaking-Space with Normal-Space for PDF font character lookup #21

rototor · 2016-04-18T11:24:56Z

I've investigated the "#" problem I described in #19 a bit further. The problem is that a non-breaking space (&nbsp;) is rendered as '#'. The # comes from the default xhtmlrenderer.conf:

# When rendering text, not all fonts support all character glyphs. When set to true, this
# will replace any missing characters with the specified character to aid in the debugging
# of your PDF.  Currently only supported for PDF rendering.
xr.renderer.replace-missing-characters=false
xr.renderer.missing-character-replacement=#

The character is used as a replacement even if xr.renderer.replace-missing-characters=false. It seems no font has a non-breaking space character. This makes some sense, as it is visually the same as a normal space.

Just replacing the non-breaking space (character 160) with ' ' would fix the problem, but it does not feel like a correct fix to me:

--- a/openhtmltopdf-pdfbox/src/main/java/com/openhtmltopdf/pdfboxout/PdfBoxOutputDevice.java
+++ b/openhtmltopdf-pdfbox/src/main/java/com/openhtmltopdf/pdfboxout/PdfBoxOutputDevice.java
@@ -381,6 +332,8 @@ public class PdfBoxOutputDevice extends AbstractOutputDevice implements OutputDe
         for (int i = 0; i < str.length(); ) {
             int unicode = str.codePointAt(i);
             i += Character.charCount(unicode);
+            if( unicode == 160 )
+                unicode = ' ';
             String ch = String.valueOf(Character.toChars(unicode));
             boolean gotChar = false;

Especially because there are more spaces than just space and non-breaking space. For examples, see https://www.cs.tut.fi/~jkorpela/chars/spaces.html
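To get a feel for how many such characters exist, the following sketch lists every code point in the Basic Multilingual Plane that Java itself treats as a space character. It is only an illustration: the class and method names are made up for this example, and the exact list depends on the Unicode version shipped with the running JDK (Java 7 or later is assumed).

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative helper, not part of openhtmltopdf.
class SpaceCharacters {

    /** Collects all BMP code points for which Character.isSpaceChar is true. */
    static List<Integer> spaceCharacters() {
        List<Integer> result = new ArrayList<>();
        for (int cp = 0; cp <= 0xFFFF; cp++) {
            // isSpaceChar covers the Unicode categories Zs, Zl and Zp.
            if (Character.isSpaceChar(cp)) {
                result.add(cp);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        for (int cp : spaceCharacters()) {
            System.out.printf("U+%04X %s%n", cp, Character.getName(cp));
        }
    }
}
```

Both the normal space (U+0020) and the non-breaking space (U+00A0) appear in the output, alongside wider and narrower spaces such as EM SPACE (U+2003).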

danfickle · 2016-04-19T08:34:30Z

I think (hope) using Character::isSpaceChar is the correct fix. We also need to make it easier to change the replacement character. Thanks @rototor for the patch, Daniel.
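A minimal sketch of that idea, under stated assumptions: this is not the code from the actual commit, and `fontHasGlyph` is a hypothetical stand-in for whatever glyph lookup the output device performs. The loop mirrors the code point iteration from the diff above and only falls back to a normal space when the font lacks the glyph and the code point is a space character.

```java
import java.util.function.IntPredicate;

// Illustrative only; class, method and parameter names are assumptions.
class SpaceFallback {

    /**
     * Replaces every space character the font cannot display with U+0020.
     * Code points the font supports, and missing non-space code points,
     * are passed through unchanged.
     */
    static String substituteSpaces(String str, IntPredicate fontHasGlyph) {
        StringBuilder sb = new StringBuilder(str.length());
        for (int i = 0; i < str.length(); ) {
            int unicode = str.codePointAt(i);
            i += Character.charCount(unicode);
            if (!fontHasGlyph.test(unicode) && Character.isSpaceChar(unicode)) {
                unicode = ' ';
            }
            sb.appendCodePoint(unicode);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Pretend the font only contains ASCII glyphs.
        IntPredicate asciiOnly = cp -> cp < 128;
        System.out.println(substituteSpaces("a\u00A0b\u2003c", asciiOnly));
    }
}
```

Checking the font first keeps a real non-breaking-space glyph when one exists, so the substitution only applies where the replacement character would otherwise be drawn.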

rototor · 2016-04-19T09:24:16Z

@danfickle I think using Character::isSpaceChar is really enough for now. If someone wants different "space widths", they should just use a <span> with the needed styles (e.g. display: inline-block and width: 0.5em).

scoldwell · 2016-06-17T16:37:53Z

@danfickle is there a timeframe for having this fixed (in a non-snapshot version)? We have an application using your library that is supposed to go into production, but the customer ran into this problem in user acceptance testing and is not likely to approve this moving to production the way it is. Thanks!

danfickle · 2016-06-17T17:27:59Z

Can you give me the weekend to clean up some svg code before deploying a release or do you need it immediately? It's nice to hear that people are using this.

scoldwell · 2016-06-17T17:36:03Z

Yeah that's no problem. Thanks for the quick response!

scoldwell · 2016-06-22T19:25:05Z

Just FYI, I came across another character that causes a "#" to show up, which is classified as a zero-width space: https://en.wikipedia.org/wiki/Zero-width_space

I've put in some character replacement in our code to deal with this for the time being, but thought you'd like to know. Thanks again for the fast turnaround.

danfickle · 2016-06-23T02:40:38Z

At least we're not the only ones having trouble with this.
http://bugs.java.com/bugdatabase/view_bug.do?bug_id=4190860
https://josm.openstreetmap.de/ticket/8918

scoldwell · 2016-06-23T03:10:29Z

Sorry that was supposed to be 

danfickle · 2016-06-23T05:25:14Z

@scoldwell - If you are pre-filtering as a temporary fix, you may wish to use this function:

    /**
     * Checks if a code point is printable. If false, it can be safely discarded at the
     * rendering stage; otherwise it should be replaced with the replacement character
     * if a suitable glyph cannot be found.
     * @param codePoint the Unicode code point to test
     * @return whether codePoint is printable
     */
    public static boolean isCodePointPrintable(int codePoint) {
        if (Character.isISOControl(codePoint))
            return false;

        int category = Character.getType(codePoint);

        return !(category == Character.CONTROL ||
                 category == Character.FORMAT ||
                 category == Character.UNASSIGNED ||
                 category == Character.PRIVATE_USE ||
                 category == Character.SURROGATE);
    }

As an implementation note, behavior will differ between Java 6 and later versions, because the supported Unicode version changed and Character::isWhitespace no longer returns true for zero-width spaces.
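As a rough sketch of such a pre-filter, the function above can be wrapped in a loop that drops every non-printable code point before the text reaches the renderer. The class name and the `stripNonPrintable` helper are hypothetical, and the zero-width space result assumes Java 7 or later, where U+200B is reported as a FORMAT character.

```java
// Illustrative pre-filter; only isCodePointPrintable comes from the comment above.
class PrintableFilter {

    static boolean isCodePointPrintable(int codePoint) {
        if (Character.isISOControl(codePoint))
            return false;

        int category = Character.getType(codePoint);

        return !(category == Character.CONTROL ||
                 category == Character.FORMAT ||
                 category == Character.UNASSIGNED ||
                 category == Character.PRIVATE_USE ||
                 category == Character.SURROGATE);
    }

    /** Returns the input without any code point that is not printable. */
    static String stripNonPrintable(String input) {
        StringBuilder sb = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            i += Character.charCount(cp);
            if (isCodePointPrintable(cp)) {
                sb.appendCodePoint(cp);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The zero-width space between "a" and "b" is removed.
        System.out.println(stripNonPrintable("a\u200Bb"));
    }
}
```

Note that this also removes tabs and line breaks, since they are ISO control characters, so it is only suitable for text where whitespace has already been collapsed.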

I'll close this issue now, as I think it is finally solved. Feel free to re-open if you find any other issues.

* Change groupid to reflect the transition into organization
* Doing builds and especially releases both on push and PR leads to duplicate builds. We should choose one of them, and I think PR should suffice
* Release process (danfickle#21)
* The first commit in the repo is from 2004, so I find it correct to state that as the inception year
* Updated the Maven compiler plugin as well
* Updated the Maven source and javadocs plugins
* Minor tweaks
* First take on a release pipeline
* Getting there
* Switching to using semver instead
* Updated groupid to adhere with what Maven central expects and accepts

danfickle added a commit that referenced this issue Apr 19, 2016

For #21 - Replace non-breaking space (and other space characters) with normal space if not present in font. Thanks @rototor

1e5d831

danfickle added a commit that referenced this issue Jun 23, 2016

For #26 and #21 - Issues relating to character substitution and width

b04e998

The dangers of copy and pasting code! Our string width and string replacement routines had drifted out of sync.

danfickle closed this as completed Jun 23, 2016

danfickle mentioned this issue Jun 23, 2016

Spacing ignored when surround by <b> tag #26

Closed
