text inside <t-hbr> nodes is allowed, but problematic #25

kosloot · 2023-01-16T12:49:39Z

Given this example:

<?xml version="1.0" encoding="UTF-8"?>
<FoLiA xmlns:xlink="http://www.w3.org/1999/xlink" xmlns="http://ilk.uvt.nl/folia" xml:id="hyp" generator="libfolia-v2.13" version="2.5.1">
  <metadata type="native">
    <annotations>
      <hyphenation-annotation set="tst"/>
      <style-annotation set="tst"/>
      <paragraph-annotation set="tst"/>
      <text-annotation set="https://raw.githubusercontent.com/proycon/folia/master/setdefinitions/text.foliaset.ttl"/>
    </annotations>
  </metadata>
  <text xml:id="hyp.text">
    <p xml:id="hyp.p.1">
      <t class="FoLiA-txt">Appel<t-hbr>peren</t-hbr>taart</t>
    </p>
  </text>
</FoLiA>

folia2txt produces:
Appeltaart
Which is maybe consistent with the docs , although it is not explicitly forbidden
(should isn't may not)
FoLiA-2text gives:
Appelperentaart, so does interpret the embedded text.

My problem with this is, dat NOT including the text violates the principle of least surprise. By just looking in the text of the <p> you might expect the peren to show up.

The Best solution imho is to explicitly forbid a text content inside a <t-hbr> an give an error when it is attempted.

The text was updated successfully, but these errors were encountered:

proycon · 2023-01-16T16:04:57Z

I think the idea was that could contain the actual hyphenation symbol that was used in the text, reflecting that it is an actual symbol appearing in the text. You'd need it again if you'd want to serialize the text exactly as it was. If we'd forbid it entirely as you suggest (which makes things simpler indeed), we'd need to assume there is one generic hyphenation symbol (e.g. a hyphen) that we can use in case the user wants the 'original' text exactly.

By default, foliapy indeed resolves the <t-hbr>, I don't think there's currently an implementation that reconstructs the data including hyphens, but at least it's a theoretical possibility.

I'm also a bit hesistant to simplify this now after the fact as it'd break backward compatibility (I don't whether there a are a lot of documents out there using this, probably not, but can't be sure).

kosloot · 2023-01-16T20:01:48Z

I think the idea was that could contain the actual hyphenation symbol that was used in the text, reflecting that it is an actual symbol appearing in the text.

Which was exactly what I tried for FoLiA-abby, leading to this discussion.
As we already worked out, the hyphenation symbol can better be handled in the class feature.
Like: <t-hbr class="¬"/>

By default, foliapy indeed resolves the <t-hbr>,

Where `resolves' implies, IGNORING completely.
FoliaPY doesn't implement a feature to include it. And libfolia lacks a way of expressing to ignore
I suggest amending libfolia to ignore the text value (and maybe give a warning?) and leave a mechanism to include the content as a wish for both foliaPY and libfolia. For whenever the need comes up.

proycon · 2023-01-17T22:07:57Z

Looks good yes!

kosloot · 2023-01-30T19:57:15Z

In hindsight, I think it is much better to NOT ABUSE the class feature for storing the actual hyphen.
I strongly suggest to put the hyphen as a value in the <t-hbr/>, as I previously did. e.g. <t-hbr>¬</t-hbr>
We then have to make an extra text-retrieving option to include that, otherwise ignored, text.
We need such an option anyway if we want to extract the value from the class.
Such an option is very easy to implement. (at least in libfolia).

That an evil user could use that value to store more than just a hyphen is not really a problem imho

As a bonus, we could allow text inside a <t-hspace>, for instance to store all kind of Unicode space characters, or sequences of more than one space. Those will be filtered out of the "normalized" text anyway.

proycon · 2023-02-13T12:14:44Z

I agree with your assessment. What is in the text should be text, so putting them in the xml text was the good decision, as often these hyphens appear verbatim in the text. Even though by default we may want to ignore this text.

That an evil user could use that value to store more than just a hyphen is not really a problem imho

Agreed

As a bonus, we could allow text inside a , for instance to store all kind of Unicode space characters, or sequences of more than one space. Those will be filtered out of the "normalized" text anyway.

That makes sense yes. It does raise some difficulties due to the way XML deals with whitespace if the element were to be split over multiple lines.

kosloot · 2023-02-13T16:33:30Z

Ok, I will stick to that solution then.
``

It does raise some difficulties due to the way XML deals with whitespace if the element were to be split over multiple lines.
``

Well, I was referring to a <t-hspace> , which is documented te be for horizontal spacing. Not <whitespace> which should be used for vertical spacing.
I agree that a malicious user could smuggle linefeed or pagefeed characters into the <t> inside <t-hspace>.
So be it. In "normal use" (including standard text() extraction, we will discard them.
For the specialized text() extraction I made, which tries to restore as much as possible, they will be retrieved "as is",
which should be fine.

I will implement this and create some simple examples for testing

(see proycon/foliapy#25)

kosloot · 2023-02-15T13:47:58Z

I created a simple example to demonstrate what libfolia is capable of now:

<?xml version="1.0" encoding="UTF-8"?>
<FoLiA xmlns:xlink="http://www.w3.org/1999/xlink" xmlns="http://ilk.uvt.nl/folia" xml:id="markup" generator="libfolia-v2.14" version="2.5.1">
  <metadata type="native">
    <annotations>
      <sentence-annotation set="FoLiA-txt-set"/>
      <paragraph-annotation set="FoLiA-txt-set"/>
      <linebreak-annotation set="FoLiA-txt-set"/>
      <text-annotation set="https://raw.githubusercontent.com/proycon/folia/master/setdefinitions/text.foliaset.ttl"/>
      <hyphenation-annotation set="FoLiA-txt-set"/>
      <hspace-annotation set="FoLiA-txt-set"/>
    </annotations>
  </metadata>
  <text xml:id="markup.text">
    <p xml:id="markup.p.1">
      <s xml:id="markup.p.1.s.1">
        <t>Dit is een test. Met enkele afgebroken zin<t-hbr>-</t-hbr><br space="no"/>nen. Dit is de eerste. Met ook een soft af<t-hbr>¬</t-hbr><br space="no"/>breking.<br/></t>
      </s>
      <s xml:id="markup.p.1.s.2">
        <t>En dit heeft een lange'<t-hspace>       </t-hspace>'space val.<br/></t>
      </s>
      <s xml:id="markup.p.1.s.3">
        <t>En dit heeft een enkele'<t-hspace/>'space val.<br/></t>
      </s>
      <s xml:id="markup.p.1.s.4">
        <t>En dit heeft een speciale'<t-hspace>⍽</t-hspace>'space val.<br/></t>
      </s>
    </p>
  </text>
</FoLiA>

Both folialint and foliavalidator accept this file. OK
Both FoLiA-2text and folia2txt parse this file and produce:

Dit is een test. Met enkele afgebroken zin
nen. Dit is de eerste. Met ook een soft af
breking.
En dit heeft een lange' 'space val.
En dit heeft een enkele' 'space val.
En dit heeft een speciale' 'space val.

NOTE @proycon : folia2text slides in some leading spaces, after the first sentence. A minor bug?
Besides that: OK

BUT FoLiA-2text can now also produce, (using --restore-formatting):

Dit is een test. Met enkele afgebroken zin-
nen. Dit is de eerste. Met ook een soft af¬
breking.
En dit heeft een lange'       'space val.
En dit heeft een enkele' 'space val.
En dit heeft een speciale'⍽'space val.

Which is very cool, I think.

Second NOTE: This example is handcrafted, at the moment there is no DIRECT way to create a FoLiA file with "special" <t-hspace> nodes. The only tool that I have that does this, is FoLiA-abby.

kosloot · 2023-02-15T15:43:19Z

@pirolen Please note: We reverted back to the solution before January 17th.
So Abby files produced with the GIT version are not usable (although still valid)
The implementation should now be OK (but needs more testing)
Also FoLiA-2text seems to work OK with the new --restore-formatting option (as shown above)

sorry for the inconvenience.
Hope you can test the Git version.

kosloot added the enhancement New feature or request label Jan 16, 2023

kosloot assigned proycon and kosloot Jan 16, 2023

kosloot mentioned this issue Jan 16, 2023

Question: Concatenating word parts at soft hyphens LanguageMachines/ucto#90

Closed

kosloot added a commit to LanguageMachines/libfolia that referenced this issue Jan 17, 2023

quick fix for ignoring text inside <t-hbr> proycon/foliapy#25

239ddcf

kosloot added a commit to LanguageMachines/libfolia that referenced this issue Feb 14, 2023

implemented ADD_FORMATTING text policy for HSpace and Hypbreak

e76d4eb

(see proycon/foliapy#25)

kosloot added a commit to LanguageMachines/foliautils that referenced this issue Feb 14, 2023

modified to accomodate proycon/foliapy#25

90c4bb6

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

text inside <t-hbr> nodes is allowed, but problematic #25

text inside <t-hbr> nodes is allowed, but problematic #25

kosloot commented Jan 16, 2023

proycon commented Jan 16, 2023 •

edited by kosloot

Loading

kosloot commented Jan 16, 2023

proycon commented Jan 17, 2023

kosloot commented Jan 30, 2023

proycon commented Feb 13, 2023

kosloot commented Feb 13, 2023

kosloot commented Feb 15, 2023

kosloot commented Feb 15, 2023

text inside <t-hbr> nodes is allowed, but problematic #25

text inside <t-hbr> nodes is allowed, but problematic #25

Comments

kosloot commented Jan 16, 2023

proycon commented Jan 16, 2023 • edited by kosloot Loading

kosloot commented Jan 16, 2023

proycon commented Jan 17, 2023

kosloot commented Jan 30, 2023

proycon commented Feb 13, 2023

kosloot commented Feb 13, 2023

kosloot commented Feb 15, 2023

kosloot commented Feb 15, 2023

proycon commented Jan 16, 2023 •

edited by kosloot

Loading