[documentation] newlines and whitespace in FoLiA text content (<t>) #34

Closed
proycon opened this issue Oct 10, 2017 · 3 comments

Comments

@proycon
Owner

proycon commented Oct 10, 2017

This issue documents a fundamental issue with FoLiA's text content (<t>) that may lead to misunderstanding and requires more extensive documentation. It is especially relevant now that FoLiA v1.5 introduces mandatory text validation, which may expose problems caused by it.

A FoLiA text content block (<t>) is an XML mixed content node: such a node may consist of both text and elements, the latter being FoLiA text markup elements in this case (t-style, t-gap, br, etc.). In practice it is often just text. When associated with a structural element that is not a word or morpheme, the text content expresses untokenised text. This means that spaces and newlines are significant.

Consider the following snippets:

A:

<s><t>This is a sentence</t></s>

B:

<s><t>This is
a sentence</t></s>

C:

<s><t>This is<br/>a sentence</t></s>

The text of sentence A is not equivalent to that of B or C; the texts of B and C are equivalent to each other.

Special caution is in order when spreading text content over multiple lines; this usually does not mean what you might assume:
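To make the comparison concrete, here is a minimal sketch using only Python's standard `xml.etree.ElementTree`, not any FoLiA library. The helper `t_text` is a hypothetical name introduced for illustration; it assumes the snippets carry no XML namespace (real FoLiA documents do) and that `<br/>` should be rendered as a newline, as stated above.

```python
import xml.etree.ElementTree as ET


def t_text(sentence_xml: str) -> str:
    """Return the text of the first <t>, rendering <br/> as a newline.

    Hypothetical helper for illustration only; other markup elements
    simply contribute their own text.
    """
    t = ET.fromstring(sentence_xml).find("t")
    parts = [t.text or ""]
    for child in t:
        if child.tag == "br":
            parts.append("\n")
        else:
            parts.append("".join(child.itertext()))
        # Text following a child element is stored in its tail.
        parts.append(child.tail or "")
    return "".join(parts)


A = "<s><t>This is a sentence</t></s>"
B = "<s><t>This is\na sentence</t></s>"
C = "<s><t>This is<br/>a sentence</t></s>"

text_a, text_b, text_c = t_text(A), t_text(B), t_text(C)
print(repr(text_a), repr(text_b), repr(text_c))
print("A == B:", text_a == text_b)
print("B == C:", text_b == text_c)
```

Under these assumptions, A differs from B and C only in the character between "is" and "a": a space in A, a newline in B and C.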

D:

<s>
    <t>This is
         a sentence</t>
</s>

Sentence D is not equivalent to B or C; its text is `This is\n\s\s\s\s\s\s\s\s\sa sentence`, where each `\s` denotes a single space.
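The effect of indentation can be checked with a generic XML parser. The following is a small sketch with Python's standard `xml.etree.ElementTree` (again without namespaces, and not using a FoLiA library); it shows that the parser passes the newline and the indentation inside `<t>` through unchanged.

```python
import xml.etree.ElementTree as ET

# Snippet D, indented the way one might "pretty-print" it by hand.
D = """<s>
    <t>This is
         a sentence</t>
</s>"""

text_d = ET.fromstring(D).find("t").text

# The second line of the text still starts with the indentation spaces.
second_line = text_d.split("\n")[1]
indent = len(second_line) - len(second_line.lstrip(" "))

print(repr(text_d))
print("leading spaces on second line:", indent)
print("equals text of B:", text_d == "This is\na sentence")
```

The whitespace directly inside `<s>` (around `<t>`) plays no role here; only the whitespace inside `<t>` ends up in the text.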

This is in line with XML behaviour (quoting http://usingxml.com/Basics/XmlSpace):

.., if the element is declared as having mixed content, both text and element child nodes, then the XML parser must pass on all the white space found within the element.

It does differ from what people are accustomed to in HTML (hence some of the confusion perhaps), which considers whitespace insignificant far more frequently.

FoLiA v1.5 introduced mandatory text validation (#24), which checks whether any text redundancy (the same text specified at more than one structural level) is consistent. This may bring to light issues such as those described here. This text validation, however, still proceeds in a fairly flexible manner, as it is insensitive to multiple spaces/newlines and operates on a normalised form. Explicit text offsets (if used), on the other hand, do not operate on a normalised form and are thus very strict; they are also validated as part of text validation.

Note for completeness that this discussion is limited to text content (<t>) and the text markup elements therein; whitespace and newlines in most other contexts, such as within structural elements, are not significant.
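The difference between the two kinds of check can be sketched as follows. This is not the validation code of any FoLiA library; `normalise` and `offset_matches` are hypothetical helpers, and the normalisation shown (collapsing every run of whitespace into one space) is an assumption about what "normalised form" amounts to.

```python
def normalise(text: str) -> str:
    """Collapse every run of spaces, tabs and newlines into one space."""
    return " ".join(text.split())


def offset_matches(parent_text: str, child_text: str, offset: int) -> bool:
    """Strict check: child_text must start at exactly this character offset."""
    return parent_text[offset:offset + len(child_text)] == child_text


sentence_b = "This is\na sentence"
sentence_d = "This is\n" + " " * 9 + "a sentence"  # indented variant (D)

# Lenient comparison on the normalised form: B and D agree.
same_normalised = normalise(sentence_b) == normalise(sentence_d)

# Strict comparison on raw offsets: "a" sits at offset 8 in B only.
offset_ok_b = offset_matches(sentence_b, "a", 8)
offset_ok_d = offset_matches(sentence_d, "a", 8)

print(same_normalised, offset_ok_b, offset_ok_d)
```

In this sketch an offset computed against B's text no longer points at the right character once the text is indented as in D, even though the normalised comparison still succeeds.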

@proycon proycon self-assigned this Oct 10, 2017
@proycon
Owner Author

proycon commented Oct 10, 2017

Issue arose from LanguageMachines/ucto#35

proycon added a commit to proycon/pynlpl that referenced this issue Oct 10, 2017
@proycon proycon added the ready Implemented but not released yet label Oct 10, 2017
@proycon
Owner Author

proycon commented Oct 10, 2017

This issue is also somewhat related to #12 (CDATA), marking for future reference.

@kosloot
Collaborator

kosloot commented Oct 11, 2017

I agree with this analysis and its conclusions.
Still, I think it is unwise and not advisable to use this kind of implied formatting in XML and/or FoLiA. But we cannot force people to behave well in this respect, so we need to do what we can to help them out :)
