PrettyPrinter strips newlines from text in nodes, even pcdata #4303

Closed
scabug opened this issue Feb 28, 2011 · 10 comments

scabug commented Feb 28, 2011

=== What steps will reproduce the problem ===

scala> <foo>{"hi\nthere"}</foo>
res6: scala.xml.Elem =
<foo>hi
there</foo>

scala> new PrettyPrinter(9999,2).format(<foo>{"hi\nthere"}</foo>)
res7: String = <foo>hi there</foo>

scala> new PrettyPrinter(9999,2).format(<foo>{PCData("hi\nthere")}</foo>)
res8: String = <foo><![CDATA[hi there]]></foo>

scabug commented Feb 28, 2011

Imported From: https://issues.scala-lang.org/browse/SI-4303?orig=1
Reporter: Ittay Dror (ittayd)

scabug commented Mar 1, 2011

@axel22 said:
Someone needs to check the correct behaviour against the XML specification. Contributions are, of course, always welcome.

scabug commented Feb 18, 2014

Francois Armand (fanf) said:
For people with this problem: simply changing the "doPreserve" method of PrettyPrinter so that it always returns true seems to do what we want. I have no knowledge of what the XML spec or a DTD expects here.

Too bad that the doPreserve method is private...
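
A possible workaround that does not require patching PrettyPrinter, sketched under an assumption: that doPreserve decides by looking for an xml:space="preserve" attribute on the element, as it appears to in the scala-xml sources. The thread itself does not confirm this, so treat the snippet as something to try rather than documented behaviour.

```scala
import scala.xml.PrettyPrinter

// Element from the original report: text with an embedded newline.
val plain = <foo>{"hi\nthere"}</foo>

// Same element, but carrying xml:space="preserve".
// Assumption: PrettyPrinter.doPreserve checks for this attribute.
val preserved = <foo xml:space="preserve">{"hi\nthere"}</foo>

// Width large enough that the element fits on one line.
val pp = new PrettyPrinter(9999, 2)

val plainOut     = pp.format(plain)
val preservedOut = pp.format(preserved)

println(plainOut)
println(preservedOut)
```

If the assumption holds, only the second result keeps the newline. The cost is that the attribute becomes part of the serialized document.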

scabug commented Dec 22, 2014

Michael Beckerle (mbeckerle.dfdl) said:
I would like to comment on this issue of the XML specification, and what the right behavior is.

The XML 1.1 spec is very clear that if you insert a CR into text via an "entity value literal", then that character must be preserved. This suggests to me that the only reasonable implementation would not do any whitespace normalization on output, as all the various Unicode line-ending characters can be inserted by this same mechanism.

This is from the XML 1.1 spec (this clarification is not in the original XML 1.0 spec, but I suggest it is the "right thing" to do for XML 1.0 implementations anyway):

2.3 Common Syntactic Constructs

This section defines some symbols used widely in the grammar.

S (white space) consists of one or more space (#x20) characters, carriage returns, line feeds, or tabs.
White Space
[3] S ::= (#x20 | #x9 | #xD | #xA)+

Note: The presence of #xD in the above production is maintained purely for backward compatibility with the First Edition. As explained in 2.11 End-of-Line Handling, all #xD characters literally present in an XML document are either removed or replaced by #xA characters before any other processing is done. The only way to get a #xD character to match this production is to use a character reference in an entity value literal.

scabug commented Dec 22, 2014

@som-snytt said:

scala> import xml._
import xml._

scala> val n = new PCData("hi there.")
n: scala.xml.PCData = <![CDATA[hi there.]]>

scala> val p = new PrettyPrinter(80,5)
p: scala.xml.PrettyPrinter = scala.xml.PrettyPrinter@c86b9e3

scala> p format n
res0: String = <![CDATA[hi there.]]>

scala> val n = new PCData("""hi there,
     |   is there any way to fix this?""")
n: scala.xml.PCData =
<![CDATA[hi there,
  is there any way to fix this?]]>

scala> p format n
res1: String =
<![CDATA[hi there,
  is there any way to fix this?]]>

scala> p format <a>{n}</a>
res2: String = <a><![CDATA[hi there, is there any way to fix this?]]></a>

Footnote: you don't get incomplete parses from embedded Scala blocks, so a multi-line string cannot be started inside an XML literal at the REPL:

scala> <a>{ PCData("""
<console>:1: error: in XML literal:  expected end of Scala block
       <a>{ PCData("""
                      ^

scabug commented Dec 23, 2014

@som-snytt said (edited on Dec 23, 2014 9:46:38 PM UTC):
Took a quick look. First, Utility.serialize is the non-formatting option. Second, the PrettyPrinter is pretty ugly. It's not obvious whether it's trying to minimize verticality. When is GSOC again?

scala> val xx = <a>{ PCData("Here is some very long text\nto split.") }</a>
xx: scala.xml.Elem =
<a><![CDATA[Here is some very long text
to split.]]></a>

scala> val pp = new PrettyPrinter(1000,2)
pp: scala.xml.PrettyPrinter = scala.xml.PrettyPrinter@13275d8

scala> pp format xx
res7: String = <a><![CDATA[Here is some very long text to split.]]></a>

scala> val pp = new PrettyPrinter(10,2)
pp: scala.xml.PrettyPrinter = scala.xml.PrettyPrinter@673919a7

scala> pp format xx
res8: String =
<a>
  <![CDATA[Here is some very long text
to split.]]>
</a>

scala> val pp = new PrettyPrinter(2,2)
pp: scala.xml.PrettyPrinter = scala.xml.PrettyPrinter@41853299

scala> pp format xx
res9: String =
"<a><![CDATA[Here is some very long text to split.]]></a>
"
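
For comparison, a minimal sketch of the non-formatting option mentioned at the top of this comment. It reuses the same element and calls Utility.serialize with its default arguments. The value names raw and pretty are illustrative only.

```scala
import scala.xml.{PCData, PrettyPrinter, Utility}

// Same element as in the session above: CDATA with an embedded newline.
val xx = <a>{ PCData("Here is some very long text\nto split.") }</a>

// Utility.serialize writes the tree without re-flowing its text.
val raw: String = Utility.serialize(xx).toString

// PrettyPrinter with a width wide enough to hold everything on one line.
val pretty: String = new PrettyPrinter(1000, 2).format(xx)

println(raw)
println(pretty)
```

Utility.serialize does no indentation at all, so it is only a substitute when exact text content matters more than layout.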

scabug commented Dec 23, 2014

Michael Beckerle (mbeckerle.dfdl) said:
Sorry, what does GSOC mean?

scabug commented Dec 23, 2014

@som-snytt said:
I was hoping a Google Summer of Code intern wanted to do a project with XML.

Maybe a student co-majoring in History. The "digital humanities" are huge these days.

scabug commented Jul 17, 2015

@SethTisue said:
The scala-xml library is now community-maintained. Issues with it are now tracked at https://github.com/scala/scala-xml/issues instead of here in the Scala JIRA.

Interested community members: if you consider this issue significant, feel free to open a new issue for it on GitHub, with links in both directions.

@scabug scabug closed this as completed Jul 17, 2015

scabug commented Jul 29, 2015

Michael Beckerle (mbeckerle.dfdl) said:
Issue migrated to scala/scala-xml#76
