[documentation] newlines and whitespace in FoLiA text content (<t>) #34

proycon · 2017-10-10T15:04:47Z

This issue documents a fundamental aspect of FoLiA's text content (<t>) that may lead to misunderstanding and requires more extensive documentation. It is especially relevant now that FoLiA v1.5 introduces mandatory text validation, which may expose problems caused by it.

A FoLiA text content block (<t>) is an XML mixed content node. Such a node may consist of both text and elements, the latter being FoLiA text markup elements in this case (t-style, t-gap, br, etc.). In practice it is often just text. When associated with a structural element that is not a word or morpheme, the text content expresses untokenised text. This means that spaces and newlines are significant.

Consider the following snippets:

A:

<s><t>This is a sentence</t></s>

B:

<s><t>This is
a sentence</t></s>

C:

<s><t>This is<br/>a sentence</t></s>

The text of sentence A is not equivalent to that of B or C; the texts of B and C are equivalent.
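The comparison of A, B and C can be checked with a minimal sketch using Python's standard xml.etree.ElementTree. The helper name text_of is hypothetical, the FoLiA namespace is omitted for brevity, and rendering <br/> as a newline is an assumption based on the equivalence stated above; it is not the actual implementation of any FoLiA library.

```python
import xml.etree.ElementTree as ET

def text_of(t):
    """Collect the text of a mixed content node (hypothetical helper).

    Assumption: <br/> is rendered as a single newline; any other
    markup element simply contributes its own text.
    """
    parts = [t.text or ""]
    for child in t:
        if child.tag == "br":
            parts.append("\n")
        else:
            parts.append(text_of(child))
        # The tail is the text that follows the child inside <t>.
        parts.append(child.tail or "")
    return "".join(parts)

a = ET.fromstring("<s><t>This is a sentence</t></s>").find("t")
b = ET.fromstring("<s><t>This is\na sentence</t></s>").find("t")
c = ET.fromstring("<s><t>This is<br/>a sentence</t></s>").find("t")

print(text_of(a) == text_of(b))  # A versus B
print(text_of(b) == text_of(c))  # B versus C
```

Under these assumptions the first comparison is false and the second is true.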

Special caution is in order when spreading text content over multiple lines, as this usually does not mean what you might assume:

D:

<s>
    <t>This is
         a sentence</t>
</s>

Sentence D is not equivalent to B or C; its text is This is\n\s\s\s\s\s\s\s\s\sa sentence (where \s denotes a single space).
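To see where the extra whitespace comes from, a generic XML parser can be asked for the raw text of <t> in snippet D. This is a sketch with Python's standard xml.etree.ElementTree, without the FoLiA namespace; the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

# Snippet D, with the indentation written out explicitly:
# four spaces before <t>, nine spaces before "a sentence".
snippet_d = (
    "<s>\n"
    + " " * 4 + "<t>This is\n"
    + " " * 9 + "a sentence</t>\n"
    + "</s>"
)

t = ET.fromstring(snippet_d).find("t")
raw = t.text

# The indentation inside <t> is part of the text content.
print(repr(raw))
```

The parser hands over the newline and the nine indentation spaces unchanged, so the indentation used for pretty-printing ends up in the sentence text.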

This is in line with XML behaviour (quoting http://usingxml.com/Basics/XmlSpace):

.., if the element is declared as having mixed content, both text and element child nodes, then the XML parser must pass on all the white space found within the element.

It does differ from what people are accustomed to in HTML (hence perhaps some of the confusion), which treats whitespace as insignificant far more often.

FoLiA v1.5 introduced mandatory text validation (#24), which checks whether any text redundancy is consistent. This may bring to light issues such as those described here. This text validation, however, still proceeds in a more flexible manner, as it is insensitive to multiple spaces/newlines and operates on a normalised form. Explicit text offsets (if used), on the other hand, do not operate on a normalised form and are thus very strict. They are also validated as part of text validation.
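The difference between the lenient comparison and strict offsets can be illustrated with a small sketch. The normalise function below is an assumption (it collapses every run of whitespace into one space); the normalisation actually applied by FoLiA libraries may differ in detail.

```python
def normalise(text):
    """Collapse runs of spaces/newlines into one space (assumed rule)."""
    return " ".join(text.split())

text_b = "This is\na sentence"                   # text of snippet B
text_d = "This is\n" + " " * 9 + "a sentence"    # text of snippet D

# Lenient comparison: both texts have the same normalised form.
same_when_normalised = normalise(text_b) == normalise(text_d)

# Strict offsets: character positions in the raw, unnormalised text.
offset_b = text_b.index("a sentence")
offset_d = text_d.index("a sentence")

print(same_when_normalised, offset_b, offset_d)  # True 8 17
```

In this sketch the two texts compare as equal after normalisation, yet an explicit offset that is correct for B (8) would point into the indentation of D, where the same substring starts at 17.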

Note for completeness that this discussion is limited to text content (<t>) and the text markup elements therein. Whitespace and newlines in most other contexts, such as within structural elements, are not significant.

proycon · 2017-10-10T15:05:29Z

Issue arose from LanguageMachines/ucto#35

proycon · 2017-10-10T16:05:43Z

This issue is also somewhat related to #12 (CDATA), marking for future reference.

kosloot · 2017-10-11T07:26:11Z

I agree with this analysis and its conclusions.
Still, I think it is unwise and not recommendable to use this kind of implied formatting in XML and/or FoLiA. But we cannot force people to behave well in this respect, so we need to do what we can to help them out :)

proycon self-assigned this Oct 10, 2017

proycon mentioned this issue Oct 10, 2017

TEXT VALIDATION ERROR (consistency) LanguageMachines/ucto#35

Closed

proycon added a commit to proycon/pynlpl that referenced this issue Oct 10, 2017

Added whitespace/newline in textcontent behaviour check and bumped version (proycon/folia#34)

24a1814

proycon added the ready Implemented but not released yet label Oct 10, 2017

proycon closed this as completed Dec 3, 2017

proycon added a commit to proycon/foliapy that referenced this issue Sep 6, 2018

Added whitespace/newline in textcontent behaviour check and bumped version (proycon/folia#34)

0e6af35

proycon mentioned this issue Dec 8, 2020

Problems with leading/trailing whitespace in text content #88

Closed
