Support right-to-left (RTL) text and mixed direction text. #9

Open
8 tasks done
danfickle opened this issue Mar 13, 2016 · 10 comments

Comments

@danfickle
Owner

Support RTL text such as Arabic. I've divided this issue into the following tasks. Note that the Bidi and ArabicShaping classes come from ICU4J. The aim is to produce output identical to the Chrome browser with dir="auto" set on the html element. The implementation uses interfaces with a do-nothing implementation by default, to avoid a compulsory dependency on ICU4J.

  • Split document up into paragraphs. The owner of each text node is said to be the nearest parent block element.
  • Split paragraphs up into directional runs with Bidi::setPara, Bidi::countRuns and Bidi::getVisualRun.
  • Determine if a line is predominantly RTL and right align it if it is.
  • Also if a line is predominantly RTL, lay its children out from right to left instead of left to right.
  • Shape text with ArabicShaping::shape. This will turn isolated characters into initial, medial or final forms depending on their position in the word.
  • Reorder text from RTL to LTR so we can output it with the standard showText instead of character by character backwards. This uses Bidi::writeReverse.
  • Deshape characters if the shaped versions are missing in the font.
  • Use a fallback font in case a character still doesn't exist in the font, for example Latin characters when using an Arabic font.
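
As a rough illustration of the run-splitting and reordering tasks above, here is a minimal sketch. It assumes the JDK's built-in java.text.Bidi rather than the ICU4J Bidi class named in the list, so the method names differ from Bidi::setPara, Bidi::countRuns and Bidi::getVisualRun. The DirectionalRuns and Run names are hypothetical, and the plain string reversal is a simplification of what Bidi::writeReverse does (it ignores mirroring and combining marks).

```java
import java.text.Bidi;
import java.util.ArrayList;
import java.util.List;

/** Sketch only: uses the JDK's java.text.Bidi, not ICU4J's Bidi. */
final class DirectionalRuns {

    /** One run of same-direction text, kept in logical order (hypothetical helper type). */
    static final class Run {
        final String text;
        final boolean rtl;

        Run(String text, boolean rtl) {
            this.text = text;
            this.rtl = rtl;
        }
    }

    /** Splits a paragraph into logical-order runs of uniform direction. */
    static List<Run> split(String paragraph) {
        List<Run> runs = new ArrayList<>();
        if (paragraph.isEmpty()) {
            return runs;
        }
        // Base direction comes from the first strong character, falling back to LTR.
        Bidi bidi = new Bidi(paragraph, Bidi.DIRECTION_DEFAULT_LEFT_TO_RIGHT);
        for (int i = 0; i < bidi.getRunCount(); i++) {
            String text = paragraph.substring(bidi.getRunStart(i), bidi.getRunLimit(i));
            boolean rtl = (bidi.getRunLevel(i) & 1) == 1; // odd embedding level means RTL
            runs.add(new Run(text, rtl));
        }
        return runs;
    }

    /** Reverses an RTL run so it can be emitted left to right; LTR runs are unchanged. */
    static String toVisual(Run run) {
        return run.rtl ? new StringBuilder(run.text).reverse().toString() : run.text;
    }
}
```

Each Run keeps its logical-order text, so shaping can still be applied before toVisual is called for output.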
@danfickle
Owner Author

Implementation details:

  1. RTL support is only implemented for the PDF renderer at this stage. Java2D will take some more work, as drawString tries to do its own bidirectional output. We can get around this by splitting the text into glyphs before output.
  2. Text is split into directional runs using IBM's ICU4J. This is put in an optional module so that those not needing bidirectional text layout can avoid pulling in the large ICU4J library.
  3. The value start is added to the valid values for text-align. This means that text will be aligned to the right for predominantly RTL lines (more than half the characters on the line are RTL) and to the left for LTR lines. start is now the default value for text-align.
  4. Layout is right-to-left for predominantly RTL lines and left-to-right for predominantly LTR lines. This approximates the behavior of dir="auto" in modern browsers.
  5. It is not currently possible to explicitly define a layout direction. This would be reasonably easy to implement, however.
  6. To use the bidirectional layout algorithm, include the openhtmltopdf-rtl-support module in your Maven project and add these lines:
               PdfBoxRenderer renderer = new PdfBoxRenderer(false);

               renderer.setBidiSplitter(new ICUBidiSplitter.ICUBidiSplitterFactory());
               renderer.setDefaultTextDirection(false); // false for LTR, true for RTL.
               renderer.setBidiReorderer(new ICUBidiReorderer());
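
To make the "predominantly RTL" rule from point 3 concrete, here is a minimal sketch of how such a check could look. It is not the project's code: the LineDirection class and its method names are hypothetical, and it assumes that every code point on the line (including spaces and digits) counts toward the total, which the thread does not specify.

```java
/** Sketch only: a hypothetical helper, not part of openhtmltopdf. */
final class LineDirection {

    /** True if more than half of the code points on the line are strongly RTL. */
    static boolean isPredominantlyRtl(String line) {
        int total = 0;
        int rtl = 0;
        for (int i = 0; i < line.length(); ) {
            int cp = line.codePointAt(i);
            byte dir = Character.getDirectionality(cp);
            if (dir == Character.DIRECTIONALITY_RIGHT_TO_LEFT
                    || dir == Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC) {
                rtl++;
            }
            total++; // assumption: neutrals and digits count toward the total
            i += Character.charCount(cp);
        }
        return rtl * 2 > total;
    }

    /** Resolves text-align: start to a physical side for one line. */
    static String resolveStartAlignment(String line) {
        return isPredominantlyRtl(line) ? "right" : "left";
    }
}
```

The same boolean could also drive the child layout order described in point 4.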

@danfickle
Owner Author

You will also need an appropriate font for each right-to-left script you use. You can use one or more of the Noto family of fonts from Google. For Arabic, your stylesheet might look like this:

@font-face {
    font-family: noto;
    src: url(NotoNaskhArabic-Regular.ttf);
}
body, body * {
  font-family: 'noto', sans-serif;
}

@danfickle
Owner Author

OK, with the implementation of multiple font fallback for the PDF renderer - #10 - the RTL support can now be used for PDF output.

@omidp

omidp commented Oct 24, 2016

👍

danfickle added a commit that referenced this issue Oct 25, 2016
danfickle added a commit that referenced this issue Oct 29, 2016
Implements the dir attribute and also the bdi element. Fixes several
bugs in the original RTL implementation.
@danfickle
Owner Author

I have now implemented the dir attribute and the bdi element. To approximate the previous behavior, just add dir="auto" to the html element. However, if you know the text direction of some or all of your document, you should mark it up appropriately, as the BIDI algorithm is not perfect.
bidi-screenshot
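
For readers unsure what dir="auto" does, here is a minimal sketch of the first-strong-character rule that browsers apply for that value. It is an illustration of the HTML rule as I understand it, not the renderer's own code; the AutoDirection class and the fallback parameter are hypothetical.

```java
/** Sketch only: first-strong-character direction detection, a hypothetical helper. */
final class AutoDirection {

    /** Returns "rtl" or "ltr" based on the first strongly directional character. */
    static String resolve(String text, String fallback) {
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            byte dir = Character.getDirectionality(cp);
            if (dir == Character.DIRECTIONALITY_LEFT_TO_RIGHT) {
                return "ltr";
            }
            if (dir == Character.DIRECTIONALITY_RIGHT_TO_LEFT
                    || dir == Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC) {
                return "rtl";
            }
            i += Character.charCount(cp); // digits, spaces and punctuation are skipped
        }
        return fallback; // no strong character found
    }
}
```

Because only the first strong character decides, text that opens with a Latin product name but is otherwise Arabic resolves to ltr, which is one reason explicit dir markup is safer.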

@omidp

omidp commented Oct 30, 2016

Hi,

I get words with separated (unjoined) letters for Unicode text even with the correct font. Please check this PDF file.

@danfickle the sample source code is available here

@danfickle
Owner Author

Hi @omidp
I think it is a font issue. If you paste the following into the sandbox you get much better results.

<div class="arabic" style="font-size:30px;">
ديگر وب سايت‌هاي شركت كسب و كار نوين ايرانيان
</div>

I'm no font expert, but I think that not all fonts contain the presentation forms that PDF output requires, since on-screen renderers can now do the shaping themselves. In the sandbox, I'm using NotoSansCJKtc-Regular.ttf, if that is any help.
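
For context on the presentation forms mentioned above: they are separate Unicode code points in the Arabic Presentation Forms-A and Forms-B blocks. The following is a minimal sketch of detecting them and mapping them back to base letters (the "deshape" step). It assumes the JDK's java.text.Normalizer with NFKC as a stand-in; it is not the ICU4J ArabicShaping call the implementation uses, and the PresentationForms class is hypothetical.

```java
import java.text.Normalizer;

/** Sketch only: detects and undoes Arabic presentation forms using JDK classes. */
final class PresentationForms {

    /** True for code points in the Arabic Presentation Forms-A or Forms-B blocks. */
    static boolean isPresentationForm(int cp) {
        if (cp == 0xFEFF) {
            return false; // byte order mark shares the Forms-B block but is not a letter
        }
        Character.UnicodeBlock block = Character.UnicodeBlock.of(cp);
        return block == Character.UnicodeBlock.ARABIC_PRESENTATION_FORMS_A
                || block == Character.UnicodeBlock.ARABIC_PRESENTATION_FORMS_B;
    }

    /** True if the text already contains shaped (presentation form) characters. */
    static boolean containsPresentationForms(String text) {
        return text.codePoints().anyMatch(PresentationForms::isPresentationForm);
    }

    /** Maps presentation forms back to base letters; other characters are untouched. */
    static String deshape(String text) {
        StringBuilder out = new StringBuilder(text.length());
        text.codePoints().forEach(cp -> {
            String s = new String(Character.toChars(cp));
            // NFKC applies the compatibility decomposition of a presentation form
            out.append(isPresentationForm(cp)
                    ? Normalizer.normalize(s, Normalizer.Form.NFKC)
                    : s);
        });
        return out.toString();
    }
}
```

A renderer could call deshape only for characters whose shaped glyph is missing from the font, then try the base letter or a fallback font.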

@omidp

omidp commented Nov 1, 2016

@danfickle after some struggling, I realized that it's a bug in Apache PDFBox.
According to this issue, full RTL support is impossible at the moment. I wish you had used iText or another PDF generator.

@danfickle
Owner Author

Hi @omidp
Could you confirm that the example in the showcase is incorrect? To my untrained eye, it looks the same as that produced by the Chrome web browser. Is Chrome wrong also, or am I missing something?

I agree totally that better font support is essential. I'm just not sure whether it works with fonts that have the presentation forms built in.

Thanks,
Daniel.

@danfickle danfickle reopened this Nov 2, 2016
@omidp

omidp commented Nov 5, 2016

I confirm that the showcase is correct, even though I'm unable to reproduce it. It sounds like Arabic text works with the font you mentioned earlier but Persian text does not: some Persian characters still fail to join. I'll get back with more comprehensive feedback ASAP.

burka pushed a commit to burka/openhtmltopdf that referenced this issue Apr 26, 2024
Push to anything but main will trigger a dev release, every push to main
a proper GitHub release.

Not sure if you've discussed the release process yet?

Might be worth looking into [JReleaser](https://jreleaser.org/)