New problems with leading/trailing whitespace around linebreaks in text content #101

Closed
proycon opened this issue Aug 18, 2021 · 1 comment
Labels: bug, ready (Implemented but not released yet)

proycon commented Aug 18, 2021

I'm afraid we may have to add another chapter to our whitespace problems: this is the sequel to issue #88.

I have a paragraph with the following text:

    <p xml:id="FP-NOTD00223000001.text.r2">
      <t>
        <t-str id="FP-NOTD00223000001.text.r2.r2l1">s</t-str>
        <br/>
        <t-str id="FP-NOTD00223000001.text.r2.r2l2">Jceddeiinte NP</t-str>
        <br/>
        <t-str id="FP-NOTD00223000001.text.r2.r2l3">J:d WnnnN.. WVierden Novembe</t-str>
        <br/>
        <t-str id="FP-NOTD00223000001.text.r2.r2l4">XviC. teeetnegentigh en eijndiger g</t-str>
        <br/>
        <t-str id="FP-NOTD00223000001.text.r2.r2l5">Antantiee</t-str>
        <br/>
        <t-str id="FP-NOTD00223000001.text.r2.r2l6">etirgh</t-str>
        <br/>
        <t-str id="FP-NOTD00223000001.text.r2.r2l7">Jen Mlers</t-str>
        <br/>
        <t-str id="FP-NOTD00223000001.text.r2.r2l8">J: deWinter N.P.</t-str>
      </t>

This is produced by my latest additions to FoLiA-page (PageXML to FoLiA conversion, pagexml-br branch of foliautils).
In addition, the conversion generates string annotations, which in turn relate back to the original PageXML:

      <str xml:id="FP-NOTD00223000001.text.r2.r2l1">
        <t offset="0">s</t>
        <relation format="text/page+xml" xlink:href="NOTD00223000001.xml" xlink:type="simple">
          <xref id="r2l1" type="str"/>
        </relation>
      </str>
      <str xml:id="FP-NOTD00223000001.text.r2.r2l2">
        <t offset="2">Jceddeiinte NP</t>
        <relation format="text/page+xml" xlink:href="NOTD00223000001.xml" xlink:type="simple">
          <xref id="r2l2" type="str"/>
        </relation>
      </str>
      <str xml:id="FP-NOTD00223000001.text.r2.r2l3">
        <t offset="17">J:d WnnnN.. WVierden Novembe</t>
        <relation format="text/page+xml" xlink:href="NOTD00223000001.xml" xlink:type="simple">
          <xref id="r2l3" type="str"/>
        </relation>
      </str>
      <str xml:id="FP-NOTD00223000001.text.r2.r2l4">
        <t offset="46">XviC. teeetnegentigh en eijndiger g</t>
        <relation format="text/page+xml" xlink:href="NOTD00223000001.xml" xlink:type="simple">
          <xref id="r2l4" type="str"/>
        </relation>
      </str>
      <str xml:id="FP-NOTD00223000001.text.r2.r2l5">
        <t offset="82">Antantiee</t>
        <relation format="text/page+xml" xlink:href="NOTD00223000001.xml" xlink:type="simple">
          <xref id="r2l5" type="str"/>
        </relation>
      </str>
      <str xml:id="FP-NOTD00223000001.text.r2.r2l6">
        <t offset="92">etirgh</t>
        <relation format="text/page+xml" xlink:href="NOTD00223000001.xml" xlink:type="simple">
          <xref id="r2l6" type="str"/>
        </relation>
      </str>
      <str xml:id="FP-NOTD00223000001.text.r2.r2l7">
        <t offset="99">Jen Mlers</t>
        <relation format="text/page+xml" xlink:href="NOTD00223000001.xml" xlink:type="simple">
          <xref id="r2l7" type="str"/>
        </relation>
      </str>
      <str xml:id="FP-NOTD00223000001.text.r2.r2l8">
        <t offset="109">J: deWinter N.P.</t>
        <relation format="text/page+xml" xlink:href="NOTD00223000001.xml" xlink:type="simple">
          <xref id="r2l8" type="str"/>
        </relation>
      </str>

The problem is that the offsets don't match up because of leading/trailing spaces. Both foliavalidator and folialint report the same error:

TEXT VALIDATION ERROR: Text for String, ID FP-NOTD00223000001.text.r2.r2l2, textclass current, has incorrect offset 2 or invalid reference: Reference (ID FP-NOTD00223000001.text.r2, class=current) found but no text match at specified offset (2)! Expected 'Jceddeiinte NP', got '
 Jceddeiinte '

This is the full text the library sees, which is also what both folia2txt and FoLiA-2text produce. I marked leading/trailing whitespace with an underscore for visibility:

s_
_Jceddeiinte NP_
_J:d WnnnN.. WVierden Novembe_
_XviC. teeetnegentigh en eijndiger g_
_Antantiee_
_etirgh_
_Jen Mlers_
_J: deWinter N.P._

Note the leading whitespace on all but the first line, as well as the trailing whitespace. So where I'd expect s\nJ we get s\s\n\sJ instead. I think this is unexpected behaviour and qualifies as a bug we'd want to fix. The offsets as reported in the FoLiA-page output seem correct to me.
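
To make the mismatch concrete, here is a small standalone check (plain Python, not using any FoLiA library). It takes the line texts and the declared offsets from the XML above and compares them against the offsets you get when joining with a bare newline versus the space-newline-space the libraries currently emit. The helper name `offsets` is made up for this illustration.

```python
# Line texts as given in the <t-str> elements above.
lines = [
    "s",
    "Jceddeiinte NP",
    "J:d WnnnN.. WVierden Novembe",
    "XviC. teeetnegentigh en eijndiger g",
    "Antantiee",
    "etirgh",
    "Jen Mlers",
    "J: deWinter N.P.",
]

# Offsets as declared in the <str> annotations above.
declared = [0, 2, 17, 46, 82, 92, 99, 109]


def offsets(lines, sep):
    """Return the start offset of each line when the lines are joined with sep."""
    result, pos = [], 0
    for line in lines:
        result.append(pos)
        pos += len(line) + len(sep)
    return result


# Joining with a bare newline reproduces the declared offsets.
expected = offsets(lines, "\n")

# Joining with space + newline + space (what the libraries currently
# produce) shifts every offset after the first line.
observed = offsets(lines, " \n ")

# The slice the validator sees at declared offset 2 in the current output.
seen_at_2 = " \n ".join(lines)[2:2 + len(lines[1])]
```

With these inputs, `expected` equals `declared`, while `seen_at_2` is the newline-space-`Jceddeiinte`-space string quoted in the validation error.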

proycon commented Aug 18, 2021

If everything in the text content (<t>) is put on a single line (without spaces or newlines), then everything validates fine.

This also shows that the cause of this issue is the spaces introduced when joining lines, which is behaviour we usually want to have:

  <t-str>foo</t-str>
  <t-str>bar</t-str>

The above should serialize as foo bar, with a space. The libraries do this correctly.

But... if we have an explicit linebreak:

  <t-str>foo</t-str>
  <br/>
  <t-str>bar</t-str>

then this no longer makes sense and we want foo\nbar and not foo\s\n\sbar. I think this is the core of this issue.
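
A rough sketch of the behaviour I have in mind, in plain Python with `xml.etree.ElementTree`. This is not the actual foliapy or libfolia code; the function name `serialize` and its `trim_around_breaks` flag are made up for illustration, and it assumes an un-namespaced `<t>` that contains only child elements separated by whitespace.

```python
import re
import xml.etree.ElementTree as ET


def serialize(xml, trim_around_breaks=True):
    """Serialize a <t> element to plain text (illustrative sketch only)."""
    t = ET.fromstring(xml)
    parts = []
    for child in t:
        # <br/> becomes a newline; any other child contributes its text.
        parts.append("\n" if child.tag == "br" else "".join(child.itertext()))
        # Whitespace between siblings (indentation, newlines) becomes one space.
        parts.append(re.sub(r"\s+", " ", child.tail or ""))
    text = "".join(parts).strip()
    if trim_around_breaks:
        # Drop the joining spaces directly before and after an explicit linebreak.
        text = re.sub(r" *\n *", "\n", text)
    return text


joined = "<t>\n  <t-str>foo</t-str>\n  <t-str>bar</t-str>\n</t>"
broken = "<t>\n  <t-str>foo</t-str>\n  <br/>\n  <t-str>bar</t-str>\n</t>"

no_break = serialize(joined)                           # joined with a space
current = serialize(broken, trim_around_breaks=False)  # spaces around the newline
wanted = serialize(broken)                             # bare newline
```

The only difference between the current and the wanted output is the last substitution: joining spaces are kept everywhere except directly next to a newline that came from an explicit `<br/>`.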

proycon added a commit that referenced this issue Aug 19, 2021
proycon added a commit that referenced this issue Aug 19, 2021
proycon added a commit to proycon/foliapy that referenced this issue Aug 19, 2021
proycon added a commit to proycon/foliapy that referenced this issue Aug 19, 2021
proycon added a commit to LanguageMachines/libfolia that referenced this issue Aug 19, 2021
proycon added a commit to LanguageMachines/libfolia that referenced this issue Aug 19, 2021
proycon added a commit to LanguageMachines/foliatest that referenced this issue Aug 19, 2021
proycon added the ready (Implemented but not released yet) label and removed the in progress label Aug 19, 2021
proycon closed this as completed Aug 19, 2021