Is it possible to prettyprint document without adding newlines after a certain tag? #2141

FishHawk · 2024-06-10T16:13:29Z

I use jsoup to create EPUB files. Below is the formatted XHTML file from the EPUB. Some readers require that there be no extra spaces or line breaks after <dc:language>. Is there a way to achieve this?

<?xml version="1.0" encoding="utf-8"?>
<package xmlns="http://www.idpf.org/2007/opf" version="3.0" unique-identifier="pub-id">
 <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:identifier id="pub-id">
   id
  </dc:identifier>
  <dc:title>
   title
  </dc:title>
  <dc:language>
   ja
  </dc:language>
  <dc:description>
   balabala
  </dc:description>
 </metadata>
 <manifest></manifest>
 <spine toc="toc.ncx"></spine>
</package>

jhy · 2024-07-01T06:22:55Z

Currently there's no configurable way to change this. The default for unknown tags is to parse them as blocks.

I think it might make sense to parse unknown tags as "block + format-as-inline". That would prevent the line wrap when pretty-printing.

Or, perhaps, try and infer it by checking if the children are defined block tags or not.

FishHawk · 2024-07-02T05:49:44Z

Got it. I'll just use string replacement for now.

jhy · 2024-07-09T00:24:44Z

I took a pass at this (particularly, attempting to introduce a rule that only indents if an element contains other elements, but not if (all the other existing rules)). But it got too gnarly, as the rule needs to be implemented differently in each of Element outerHead, Element outerTail, and TextNode outerHead.
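
To make the rule concrete, here is a minimal sketch of "only indent an element if it contains other elements". It is not jsoup code and does not touch jsoup's outerHead/outerTail logic; it assumes the JDK's built-in DOM parser instead, and the class name ElementOnlyIndenter is hypothetical. Text-only elements such as <dc:language> are written on a single line. It needs Java 11+ for String.repeat.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of jsoup: indents only elements that have element children.
class ElementOnlyIndenter {
    static String format(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            StringBuilder out = new StringBuilder();
            write(doc.getDocumentElement(), 0, out);
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    private static boolean hasElementChild(Element el) {
        for (Node n = el.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.ELEMENT_NODE) return true;
        }
        return false;
    }

    private static void write(Element el, int depth, StringBuilder out) {
        String pad = " ".repeat(depth);
        out.append(pad).append('<').append(el.getNodeName());
        // Attribute order is whatever the DOM implementation reports.
        NamedNodeMap attrs = el.getAttributes();
        for (int i = 0; i < attrs.getLength(); i++) {
            Node a = attrs.item(i);
            out.append(' ').append(a.getNodeName())
               .append("=\"").append(escape(a.getNodeValue())).append('"');
        }
        out.append('>');
        if (hasElementChild(el)) {
            // Contains other elements: break the line and indent the children.
            out.append('\n');
            for (Node n = el.getFirstChild(); n != null; n = n.getNextSibling()) {
                if (n.getNodeType() == Node.ELEMENT_NODE) {
                    write((Element) n, depth + 1, out);
                } else if (n.getNodeType() == Node.TEXT_NODE
                        && !n.getNodeValue().trim().isEmpty()) {
                    // Mixed content: put non-blank text on its own indented line.
                    out.append(pad).append(' ')
                       .append(escape(n.getNodeValue().trim())).append('\n');
                }
            }
            out.append(pad);
        } else {
            // Text-only (or empty) element: keep it on one line, no added whitespace.
            out.append(escape(el.getTextContent().trim()));
        }
        out.append("</").append(el.getNodeName()).append(">\n");
    }

    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }
}
```

Calling ElementOnlyIndenter.format(xml) on the package document from the first post would keep <metadata> indented while writing each dc:* element without inner line breaks. The sketch drops comments, processing instructions, and the XML declaration, so it is an illustration of the rule rather than a replacement serializer.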

So, I'm declaring bankruptcy on the current pretty-printer implementation, and won't make other changes to it. I plan on refactoring it at some point to a strategy type object (which will provide more configurability) that will also maintain more state during the pass (a stack of what was indented, what requires white-space preservation, etc). With the goal of substantially simplifying the implementation and allowing better control, both by default and by users.

But for now, yeah, string replacement is your best bet...
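
For reference, a minimal sketch of that string-replacement workaround, applied to the already pretty-printed output. It uses only java.util.regex; the class and method names are hypothetical. It assumes attribute values contain no '<' or '>' and that the leading and trailing whitespace inside text-only elements is insignificant.

```java
import java.util.regex.Pattern;

// Hypothetical post-processing helper, not part of jsoup.
class InlineTextFix {
    // Matches an element whose only content is text: <name attrs> text </name>.
    // [^<]* cannot cross a '<', so elements that contain child elements never match.
    private static final Pattern TEXT_ONLY =
            Pattern.compile("<([A-Za-z_][\\w:.-]*)([^<>]*)>\\s*([^<]*?)\\s*</\\1>");

    // Rewrites each text-only element without whitespace around its text.
    static String collapseTextOnlyElements(String xml) {
        return TEXT_ONLY.matcher(xml).replaceAll("<$1$2>$3</$1>");
    }

    public static void main(String[] args) {
        String pretty = "<metadata>\n  <dc:language>\n   ja\n  </dc:language>\n </metadata>";
        System.out.println(collapseTextOnlyElements(pretty));
    }
}
```

Elements that contain other elements, such as <metadata>, are left as they are, so the overall indentation stays intact. To restrict the fix to a single tag such as dc:language, the name group in the pattern could be replaced with that literal name.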

anonyein · 2024-07-15T03:59:29Z

I support turning pretty-printing off by default. This feature causes a lot of trouble. For example,
org.jsoup.Jsoup.parse("<a>a  a</a>").select("a").text()
collapses the double whitespace in "a  a" into a single space,
and this can cause serious problems in practical applications!

jhy · 2024-07-15T06:14:16Z

@anonyein I have covered this before in other issues: I prefer the pretty printer being on by default for HTML, and don't intend on changing that. It is off by default when using the XML parser, as we don't know the whitespace significance in XML. But we do in HTML, and it is generally appropriate and safe when serializing for downstream HTML parsers.

Let's keep this issue for the specific output change noted.

anonyein · 2024-07-15T06:23:28Z

I see. THX

anonyein · 2024-07-15T07:15:48Z

@jhy Even with the XML parser,
Jsoup.parse("<a>a  a</a>", "", Parser.xmlParser()).select("a").text()
still does not return "a  a".

jhy · 2024-07-15T09:14:47Z

@anonyein like I said, please keep this issue focussed on the specific issue, not general pretty-print issues.

At any rate, .text() is completely unrelated to the pretty-printer, so any settings or discussions here are irrelevant. The pretty-printer is used only in the .html() methods.

If you read the Element.text() documentation you'll see it acts as described, and how to get non-normalized text.

anonyein · 2024-07-15T10:07:44Z

I get it. I'll use wholeText() instead. Thanks for your help!

FishHawk mentioned this issue Jun 10, 2024

EPUB reading problem in the Apple iBooks app FishHawk/auto-novel#85

Open
