#549 Fix soft hyphen characters are visible #550

StephanSchrader · 2020-09-08T13:09:07Z

Fix for issue #549, all invisible characters are filtered and not printed in any way.

syjer · 2020-09-09T11:55:25Z

looks like the PR has a failing test: https://travis-ci.org/github/danfickle/openhtmltopdf/builds/725230397 .

Running com.openhtmltopdf.visualregressiontests.TextVisualRegressionTest
Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.021 sec <<< FAILURE!
com.openhtmltopdf.visualregressiontests.TextVisualRegressionTest  Time elapsed: 0.02 sec  <<< ERROR!
java.lang.NullPointerException
	at java.util.Objects.requireNonNull(Objects.java:203)
	at java.nio.file.Files.copy(Files.java:2984)
	at com.openhtmltopdf.visualtest.TestSupport.makeFontFile(TestSupport.java:52)
	at com.openhtmltopdf.visualtest.TestSupport.makeFontFiles(TestSupport.java:64)
	at com.openhtmltopdf.visualregressiontests.TextVisualRegressionTest.makeFontFiles(TextVisualRegressionTest.java:35)

(Btw, it seems the travis-ci <-> github integration is not configured completely ? Normally I would have expected the feedback directly in the comments )

fix build

StephanSchrader · 2020-09-10T19:28:34Z

It's fixed now. Forgot to remove a font used during testing, which wasn't needed at the end.

Thanks for looking into!

danfickle · 2020-09-14T10:53:41Z

Hi @StephanSchrader,

Firstly, a big thanks for contributing to this project!

I think you want allMatch rather than noneMatch (I could be wrong). Also, we need to strip out non-printables in PdfBoxTextRenderer.getWidth() as well so that text justification calculations will not be thrown out.

The bigger question is the performance impact of running every character through the unicode character database (probably multiple times). I'll have to do some profiling before I accept this PR. If it does prove to be a bottleneck, maybe we could just test for a few common problematic characters? Can you think of any others?

Thanks again,
Daniel.

StephanSchrader · 2020-09-14T18:07:22Z

Hey Daniel aka @danfickle

I think you want allMatch rather than noneMatch (I could be wrong).

The expression: s.chars().noneMatch(OpenUtil::isCodePointPrintable) should match when the string contains no non-printable characters at all. In that case the fast text rendering should be used. Do I have a thought error?

The bigger question is the performance impact of running every character through the unicode character database (probably multiple times).

Good point. I can provide a JMH benchmark for the text rendering.

I'll have to do some profiling before I accept this PR. If it does prove to be a bottleneck, maybe we could just test for a few common problematic characters? Can you think of any others?

That's a nice idea. Actually the soft-hyphen is the only case I have. Other chars which I could think of, are: Unit Separator, No Break Here. Source: https://en.wikipedia.org/wiki/C0_and_C1_control_codes

- fix PdfBoxTextRenderer string width calculation

StephanSchrader · 2020-09-17T21:08:48Z

Hi @danfickle

I've pushed the changes for com.openhtmltopdf.pdfboxout.PdfBoxTextRenderer#getWidth and the mistake with noneMatch you mentioned.

I also added a benchmark rendering a simple text and another one with hyphens. Please review if I've missed something. To run the benchmarks, build the project and run:

java -jar openhtmltopdf-benchmark/target/benchmarks.jar

Use -f 2 for fewer forks to reduce the amount time, -h for a list of all parameters.

The branch: benchmark in my fork also contains the benchmarks for comparison.

My results are:

Branch 'benchmark':
Benchmark                                   Mode  Cnt  Score   Error  Units
RenderTextBenchmark.renderText_Plain        avgt   25  1,069 ± 0,021  ms/op
RenderTextBenchmark.renderText_SoftHyphens  avgt   25  1,897 ± 0,043  ms/op

Branch 'soft-hypens-visible':
Benchmark                                   Mode  Cnt  Score   Error  Units
RenderTextBenchmark.renderText_Plain        avgt   10  1,090 ± 0,060  ms/op
RenderTextBenchmark.renderText_SoftHyphens  avgt   10  1,112 ± 0,017  ms/op

Thanks for reviewing and all the work!

Cheers, Stephan

danfickle · 2020-09-18T14:16:28Z

@StephanSchrader

Thanks Stephan,

When manually iterating unicode 32 bit code points (as opposed to Java's 16bit chars), one has to be careful to increment the loop counter correctly to handle surrogate pairs (two 16 bit chars that make one code point).

for (int i = 0; i < str.length(); /* No-op */) {
   int codePoint = str.codePointAt(i);
   i += Character.charCount(codePoint); /* Returns 1 or 2 */
}

However, I'd recommend the use of str.codePoints() method instead which returns an IntStream of code points. Then you can append to a StringBuilder with appendCodePoint method.

Other than that:

I'm not sure about adding yet another module, perhaps we could add jmh dependency to openhtmltopdf-examples instead?
We're going to remove the slow output device in a couple of versions anyway, but we don't want to break it until then!
What is the rationale of replacing font.getStringWidth with font.encode? You are most likely right, but it would be nice to have rationale on record.
In drawString, could we replace the call to areAllCharactersPrintable with a call directly to getEffectivePrintableString, like you've done in getWidth?

Thanks again,
Daniel.

StephanSchrader · 2020-10-12T14:28:10Z

Hi @danfickle

When manually iterating unicode 32 bit code points (as opposed to Java's 16bit chars), one has to be careful to increment the loop counter correctly to handle surrogate pairs (two 16 bit chars that make one code point).

You're absolutely right and I wasn't aware of the method: String.codePoints() .

I'm not sure about adding yet another module, perhaps we could add jmh dependency to openhtmltopdf-examples instead?

My intention was to have this aspect simply visible and because the examples module already contains aspects with different responsibilities: more or less integration tests, documentation generation, performance things ;) I moved the JMH and dependency to openhtmltopdf-examples. To run the benchmarks, execute the command:

java -jar openhtmltopdf-examples/target/benchmarks.jar

The package com.openhtmltopdf.performance contains also some classes which could be refactored to use JMH. I could do this in a separate PR.

We're going to remove the slow output device in a couple of versions anyway, but we don't want to break it until then!

Changes to PdfBoxSlowOutputDevice are reverted.

What is the rationale of replacing font.getStringWidth with font.encode? You are most likely right, but it would be nice to have rationale on record.

My first thought was font.encode is suffice to check whether a string can completely printed with some font. So width calculation is an overhead and could be skipped. But sadly this not true for all font implementations. I reverted that change.

In drawString, could we replace the call to areAllCharactersPrintable with a call directly to getEffectivePrintableString, like you've done in getWidth?

Done.

Cheers Stephan

Mostly complete except for index page and testing.

…skip]

optimize formatting of diagnostic messages

…in [ci skip]

…ccount for negative margins.

StephanSchrader · 2020-10-12T14:38:44Z

will create a new PR

…known problematic characters Can not be too aggressive as some fonts contain private area code points, etc and expect them to be output.

danfickle#549 Fix soft hyphen characters are visible

b4a61a8

danfickle#549 Fix soft hyphen characters are visible

3385d8b

fix build

schrader added 2 commits September 16, 2020 10:46

danfickle#549 Fix soft hyphen characters are visible

8a903dd

- fix PdfBoxTextRenderer string width calculation

danfickle#549 fix compilation JMH classes

83ae4e9

danfickle#549 Fix soft hyphen characters are visible

e8db381

schrader and others added 11 commits October 12, 2020 16:32

danfickle#549 removed openhtmltopdf-benchmark

f502aaa

PROJECT: Templates for openhtmltopdf. Initial commit.

10960ce

Mostly complete except for index page and testing.

index.html for templates site and minor changes to details page. [ci …

50f03bd

…skip]

Test and proof for arboshiki-invoice template.

aa4e774

Github action for publishing templates website

806795a

danfickle#544 Update README with link to templates site. [ci skip]

5470152

optimize the log formatter

1be5c22

optimize formatting of diagnostic messages

log message id format: use static final + add test

2b6a2a5

danfickle#8 Update README with new sandbox [ci skip]

60d8160

danfickle#553 Failing test for paginated table with negative top marg…

2dc0dc4

…in [ci skip]

Fixes danfickle#553 - Allow content limit to adjust initial page to a…

5747bcb

…ccount for negative margins.

StephanSchrader closed this Oct 12, 2020

StephanSchrader mentioned this pull request Oct 12, 2020

#549 Fix soft hyphen characters are visible #568

Merged

danfickle added a commit that referenced this pull request Oct 30, 2020

Fixes #588, fixes #564, changes #568, relates #550 - Only filter out …

9ffd0e4

…known problematic characters Can not be too aggressive as some fonts contain private area code points, etc and expect them to be output.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

#549 Fix soft hyphen characters are visible #550

#549 Fix soft hyphen characters are visible #550

StephanSchrader commented Sep 8, 2020

syjer commented Sep 9, 2020

StephanSchrader commented Sep 10, 2020

danfickle commented Sep 14, 2020

StephanSchrader commented Sep 14, 2020

StephanSchrader commented Sep 17, 2020

danfickle commented Sep 18, 2020

StephanSchrader commented Oct 12, 2020

StephanSchrader commented Oct 12, 2020

#549 Fix soft hyphen characters are visible #550

#549 Fix soft hyphen characters are visible #550

Conversation

StephanSchrader commented Sep 8, 2020

syjer commented Sep 9, 2020

StephanSchrader commented Sep 10, 2020

danfickle commented Sep 14, 2020

StephanSchrader commented Sep 14, 2020

StephanSchrader commented Sep 17, 2020

danfickle commented Sep 18, 2020

StephanSchrader commented Oct 12, 2020

StephanSchrader commented Oct 12, 2020