Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

#549 Fix soft hyphen characters are visible #550

Conversation

StephanSchrader
Copy link

Fix for issue #549, all invisible characters are filtered and not printed in any way.

@syjer
Copy link
Contributor

syjer commented Sep 9, 2020

looks like the PR has a failing test: https://travis-ci.org/github/danfickle/openhtmltopdf/builds/725230397 .

Running com.openhtmltopdf.visualregressiontests.TextVisualRegressionTest
Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.021 sec <<< FAILURE!
com.openhtmltopdf.visualregressiontests.TextVisualRegressionTest  Time elapsed: 0.02 sec  <<< ERROR!
java.lang.NullPointerException
	at java.util.Objects.requireNonNull(Objects.java:203)
	at java.nio.file.Files.copy(Files.java:2984)
	at com.openhtmltopdf.visualtest.TestSupport.makeFontFile(TestSupport.java:52)
	at com.openhtmltopdf.visualtest.TestSupport.makeFontFiles(TestSupport.java:64)
	at com.openhtmltopdf.visualregressiontests.TextVisualRegressionTest.makeFontFiles(TextVisualRegressionTest.java:35)

(Btw, it seems the travis-ci <-> github integration is not configured completely ? Normally I would have expected the feedback directly in the comments )

@StephanSchrader
Copy link
Author

It's fixed now. Forgot to remove a font used during testing, which wasn't needed at the end.

Thanks for looking into!

@danfickle
Copy link
Owner

Hi @StephanSchrader,

Firstly, a big thanks for contributing to this project!

I think you want allMatch rather than noneMatch (I could be wrong). Also, we need to strip out non-printables in PdfBoxTextRenderer.getWidth() as well so that text justification calculations will not be thrown out.

The bigger question is the performance impact of running every character through the unicode character database (probably multiple times). I'll have to do some profiling before I accept this PR. If it does prove to be a bottleneck, maybe we could just test for a few common problematic characters? Can you think of any others?

Thanks again,
Daniel.

@StephanSchrader
Copy link
Author

Hey Daniel aka @danfickle

I think you want allMatch rather than noneMatch (I could be wrong).

The expression: s.chars().noneMatch(OpenUtil::isCodePointPrintable) should match when the string contains no non-printable characters at all. In that case the fast text rendering should be used. Do I have a thought error?

The bigger question is the performance impact of running every character through the unicode character database (probably multiple times).

Good point. I can provide a JMH benchmark for the text rendering.

I'll have to do some profiling before I accept this PR. If it does prove to be a bottleneck, maybe we could just test for a few common problematic characters? Can you think of any others?

That's a nice idea. Actually the soft-hyphen is the only case I have. Other chars which I could think of, are: Unit Separator, No Break Here. Source: https://en.wikipedia.org/wiki/C0_and_C1_control_codes

@StephanSchrader
Copy link
Author

Hi @danfickle

I've pushed the changes for com.openhtmltopdf.pdfboxout.PdfBoxTextRenderer#getWidth and the mistake with noneMatch you mentioned.

I also added a benchmark rendering a simple text and another one with hyphens. Please review if I've missed something. To run the benchmarks, build the project and run:

java -jar openhtmltopdf-benchmark/target/benchmarks.jar

Use -f 2 for fewer forks to reduce the amount time, -h for a list of all parameters.

The branch: benchmark in my fork also contains the benchmarks for comparison.

My results are:

Branch 'benchmark':
Benchmark                                   Mode  Cnt  Score   Error  Units
RenderTextBenchmark.renderText_Plain        avgt   25  1,069 ± 0,021  ms/op
RenderTextBenchmark.renderText_SoftHyphens  avgt   25  1,897 ± 0,043  ms/op
Branch 'soft-hypens-visible':
Benchmark                                   Mode  Cnt  Score   Error  Units
RenderTextBenchmark.renderText_Plain        avgt   10  1,090 ± 0,060  ms/op
RenderTextBenchmark.renderText_SoftHyphens  avgt   10  1,112 ± 0,017  ms/op

Thanks for reviewing and all the work!

Cheers, Stephan

@danfickle
Copy link
Owner

@StephanSchrader

Thanks Stephan,

When manually iterating unicode 32 bit code points (as opposed to Java's 16bit chars), one has to be careful to increment the loop counter correctly to handle surrogate pairs (two 16 bit chars that make one code point).

for (int i = 0; i < str.length(); /* No-op */) {
   int codePoint = str.codePointAt(i);
   i += Character.charCount(codePoint); /* Returns 1 or 2 */
}

However, I'd recommend the use of str.codePoints() method instead which returns an IntStream of code points. Then you can append to a StringBuilder with appendCodePoint method.

Other than that:

  • I'm not sure about adding yet another module, perhaps we could add jmh dependency to openhtmltopdf-examples instead?
  • We're going to remove the slow output device in a couple of versions anyway, but we don't want to break it until then!
  • What is the rationale of replacing font.getStringWidth with font.encode? You are most likely right, but it would be nice to have rationale on record.
  • In drawString, could we replace the call to areAllCharactersPrintable with a call directly to getEffectivePrintableString, like you've done in getWidth?

Thanks again,
Daniel.

@StephanSchrader
Copy link
Author

Hi @danfickle

When manually iterating unicode 32 bit code points (as opposed to Java's 16bit chars), one has to be careful to increment the loop counter correctly to handle surrogate pairs (two 16 bit chars that make one code point).

You're absolutely right and I wasn't aware of the method: String.codePoints() .

I'm not sure about adding yet another module, perhaps we could add jmh dependency to openhtmltopdf-examples instead?

My intention was to have this aspect simply visible and because the examples module already contains aspects with different responsibilities: more or less integration tests, documentation generation, performance things ;) I moved the JMH and dependency to openhtmltopdf-examples. To run the benchmarks, execute the command:

java -jar openhtmltopdf-examples/target/benchmarks.jar

The package com.openhtmltopdf.performance contains also some classes which could be refactored to use JMH. I could do this in a separate PR.

  • We're going to remove the slow output device in a couple of versions anyway, but we don't want to break it until then!

Changes to PdfBoxSlowOutputDevice are reverted.

  • What is the rationale of replacing font.getStringWidth with font.encode? You are most likely right, but it would be nice to have rationale on record.

My first thought was font.encode is suffice to check whether a string can completely printed with some font. So width calculation is an overhead and could be skipped. But sadly this not true for all font implementations. I reverted that change.

  • In drawString, could we replace the call to areAllCharactersPrintable with a call directly to getEffectivePrintableString, like you've done in getWidth?

Done.

Cheers Stephan

@StephanSchrader
Copy link
Author

will create a new PR

danfickle added a commit that referenced this pull request Oct 30, 2020
…known problematic characters

Can not be too aggressive as some fonts contain private area code points, etc and expect them to be output.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants