Problems with leading/trailing whitespace in text content #88

proycon · 2020-12-08T13:15:40Z

We had an extensive earlier discussion on this in #34, but an issue popped up.

foliatextcontent produces FoLiA like the following:

<p xml:id="FH-OllevierGeets-1.tif.text.par_1_21">
       <t>
         <t-str id="FH-OllevierGeets-1.tif.text.par_1_21.word_1_233">INTRODUCTION</t-str>
       </t>
       <t class="OCR">
         <t-str id="FH-OllevierGeets-1.tif.text.par_1_21.word_1_233">INTRODUCTION</t-str>
       </t>
       <str xml:id="FH-OllevierGeets-1.tif.text.par_1_21.word_1_233">
         <t offset="0">INTRODUCTION</t>
         <t offset="0" class="OCR">INTRODUCTION</t>
         <relation format="text/hocr+xml" xlink:href="OllevierGeets-1.hocr" xlink:type="simple">
           <xref id="word_1_233" type="str"/>
         </relation>
       </str>
     </p>

folialint stumbles on this with a text consistency problem:

ticcl_output/OllevierGeets.ticcl.folia.xml failed: Unresolvable text: Text for str(ID=FH-OllevierGeets-1.tif.text.par_1_21.word_1_233, textclass='current'), has incorrect offset 0
        original msg=Unresolvable text: Reference (ID FH-OllevierGeets-1.tif.text.par_1_21,class='current') found, but no text match at offset=0 Expected 'INTRODUCTION' but got '
        INT'

Because of the newline and indentation, the offset is considered wrong, as the text is assumed to be "\n\s\s\s\s\s\s\s\sINTRODUCTION".

foliavalidator stumbles over something identical but later on (different order of evaluation perhaps?):

TEXT VALIDATION ERROR: Text for String, ID FH-OllevierGeets-4.tif.text.par_1_36.word_1_708, textclass OCR, has incorrect offset 0 or invalid reference
VALIDATION ERROR on full parse by library (stage 2/3), in OllevierGeets.ticcl.folia.xml
UnresolvableTextContent: Reference (ID FH-OllevierGeets-4.tif.text.par_1_36, class=OCR) found but no text match at specified offset (0)! Expected 'DISCUSSION', got '
        D'

Offsets are computed without any kind of space normalization by default, as addressed in #34. Take a text like:

<s>
    <t>This is
         a sentence</t>
</s>

This really means "This is\n\s\s\s\s\s\s\s\s\sa sentence" and not "This is a sentence".

But I think we should be able to strip leading and trailing spaces from the text as a whole, so the fragment below should be semantically identical to the first fragment. That it turned into the fragment above is probably due to standard XML prettification algorithms.

<p xml:id="FH-OllevierGeets-1.tif.text.par_1_21">
       <t><t-str id="FH-OllevierGeets-1.tif.text.par_1_21.word_1_233">INTRODUCTION</t-str></t>
       <t class="OCR"><t-str id="FH-OllevierGeets-1.tif.text.par_1_21.word_1_233">INTRODUCTION</t-str></t>
       <str xml:id="FH-OllevierGeets-1.tif.text.par_1_21.word_1_233">
         <t offset="0">INTRODUCTION</t>
         <t offset="0" class="OCR">INTRODUCTION</t>
         <relation format="text/hocr+xml" xlink:href="OllevierGeets-1.hocr" xlink:type="simple">
           <xref id="word_1_233" type="str"/>
         </relation>
       </str>
     </p>

Just like we don't allow empty texts, I think we can probably strip leading and trailing spaces (=emptiness) when doing text validation and offset computation (this does not affect any intermediate spaces, also not in multiline content!).
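The proposal can be illustrated with a minimal sketch. It is not the foliapy or libfolia implementation; the function name `offset_matches` and its `strip` parameter are hypothetical and only show the intended semantics: leading and trailing whitespace of the text as a whole is ignored, while intermediate whitespace (also in multiline content) still counts towards the offset.

```python
def offset_matches(parent_text: str, sub_text: str, offset: int, strip: bool = True) -> bool:
    """Check whether sub_text occurs in parent_text at the given offset.

    With strip=True, leading/trailing whitespace of both texts is ignored
    before comparing; whitespace inside the text is left untouched.
    """
    if strip:
        parent_text = parent_text.strip()
        sub_text = sub_text.strip()
    return parent_text[offset:offset + len(sub_text)] == sub_text


# Text of the pretty-printed <t> from the first fragment (newline + indentation).
pretty = "\n        INTRODUCTION\n      "

print(offset_matches(pretty, "INTRODUCTION", 0, strip=False))  # strict comparison
print(offset_matches(pretty, "INTRODUCTION", 0))               # with stripping

# Intermediate whitespace is not normalized: "a sentence" stays at offset 17.
multiline = "This is\n         a sentence"
print(offset_matches(multiline, "a sentence", 17))
```

Without stripping, the first check fails in the same way folialint does; with stripping it succeeds, and the multiline case keeps its original offset.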


martinreynaert · 2020-12-08T13:21:51Z

I agree. I think it is far better to normalize these. Thanks!

pirolen · 2020-12-08T13:46:12Z

Dear Maarten, I was just about to point out a whitespace issue when using ucto — not sure, if fully related. There are whitespace insertions and deletions. Where shall I report this? Thanks & cheers, Piroska


proycon · 2020-12-08T13:49:15Z

@pirolen If you think it's a tokenisation issue then it's best to put it in https://github.com/LanguageMachines/ucto/issues . If you're referring to insertion/deletion corrections in FLAT then best to put it in https://github.com/proycon/flat/issues


kosloot · 2020-12-10T16:18:28Z

I tried to reproduce this problem, but folialint failed to fail

Are you sure this isn't already fixed on Nov 17:

commit 64218577550c6f3763dbbc75f668252fd4f3f03d
Author: Ko van der Sloot K.vanderSloot@let.ru.nl
Date: Tue Nov 17 15:38:41 2020 +0100

Fixed problem with text-conststency errors for within

Or maybe it is very related?

UPDATE:
Sorry. :{
I was able to get an error using your example: issue88.2.4.1.folia.xml


proycon · 2020-12-10T18:58:07Z

I think I tackled this now in libfolia as well, I'll continue by testing it in the PICCL context where the issue emerged.

proycon · 2021-03-09T13:58:15Z

I'm afraid our problems with whitespace are not over yet. I take the example @kosloot gave in LanguageMachines/foliautils#56.

This output has been formatted this way by libxml2 itself, but this formatting is not compatible with the FoLiA assumptions we held until now:

       <t>
        <t-str xml:id="text.p.1.t-str.1">
          <t-style>deel<t-hbr/></t-style>
        </t-str>
        <t-str xml:id="text.p.1.t-str.2">
          <t-style>woord</t-style>
        </t-str>
        <t-str>extra</t-str>
      </t>

With the current rules we applied, the text representation that both foliapy and libfolia give is:

deel
        woord
        extra

Also if we simplify the example to:

       <t>
          <t-style>deel<t-hbr/></t-style>
          <t-style>woord</t-style>
          <t-str>extra</t-str>
      </t>

We get that same result.

An extra bonus: as soon as we add a space before the word extra, libxml2 serializes the whole <t> block on a single line! That is far more in line with what we intend for FoLiA (except for the fact that the leading space would be stripped).

I don't think the text representations are good as they are, with all the indentation, and I think what we're getting now is at odds with how XML sees things. I think what we want in this case is one of two options:

1. We want the text "deelwoordextra" (without any intermediate spaces), so we strip ALL the initial and trailing spaces outside the markup elements.
2. The alternative interpretation is to go for the text "deel woord extra", with a single space between all the parts. This would be in line with what HTML does:

<span>
    <span>deel</span>
    <span>woord</span>
    <span>extra</span>
</span>

(see https://download.anaproy.nl/deelwoordextra.html)

If we go for option 1, this does beg the question how we would represent a space if we do want it, say for example between woord and extra. I think the solution to that would be:

       <t>
          <t-style>deel<t-hbr/></t-style>
          <t-style>woord</t-style> <t-str>extra</t-str>
      </t>

If we go for option 2, then it begs the question how we would represent the non-spaced scenario, the solution would be:

       <t>
          <t-style>deel<t-hbr/></t-style>
          <t-style>woord</t-style><t-str>extra</t-str>
      </t>
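To make the two interpretations concrete, here is a rough sketch using Python's standard `xml.etree.ElementTree`. The function names `option1` and `option2` are hypothetical, the snippets omit the FoLiA namespace, and `<t-hbr/>` is simply ignored; this is not how foliapy or libfolia serialise text, only an illustration of the two rules under discussion.

```python
import re
import xml.etree.ElementTree as ET

PRETTY = """<t>
   <t-style>deel<t-hbr/></t-style>
   <t-style>woord</t-style>
   <t-str>extra</t-str>
</t>"""

SPACED = """<t>
   <t-style>deel<t-hbr/></t-style>
   <t-style>woord</t-style> <t-str>extra</t-str>
</t>"""

JOINED = """<t>
   <t-style>deel<t-hbr/></t-style>
   <t-style>woord</t-style><t-str>extra</t-str>
</t>"""


def option1(elem: ET.Element) -> str:
    """Drop whitespace-only chunks that contain a newline (indentation)."""
    parts = [c for c in elem.itertext() if not (c.strip() == "" and "\n" in c)]
    return "".join(parts).strip()


def option2(elem: ET.Element) -> str:
    """HTML-like: collapse every whitespace run to one space, then strip."""
    return re.sub(r"\s+", " ", "".join(elem.itertext())).strip()


for name, xml in (("pretty", PRETTY), ("spaced", SPACED), ("joined", JOINED)):
    root = ET.fromstring(xml)
    print(name, repr(option1(root)), repr(option2(root)))
```

Under option 1 the same-line space in the second snippet is the only way to get "deelwoord extra"; under option 2 the elements have to be glued together, as in the third snippet, to suppress the space.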

I think we're currently closer to option 1 in our interpretations, but I need to investigate whether option 2 isn't the more natural XML interpretation (after all, it's what HTML does too). Whatever we choose, we have to take into account that we didn't impose this strictness before, and therefore be lenient so as not to break older files, as addressed in issue #92.

Of course, the one-line solution avoids all these problems in all cases and is the simplest, but it's apparently not what libxml2 prefers to output (pretty formatting), nor something we can expect users to adhere to:

       <t><t-style>deel<t-hbr/></t-style><t-style>woord</t-style> <t-str>extra</t-str></t>

It would be good if we had a way to normalize our FoLiA's to force this one-line representation (as an extra tool), because it would be a valuable preprocessing step that can solve issues like proycon/foliatools#29 and make things easier for parsers that can't deal with all these complexities.

kosloot · 2021-03-09T14:20:02Z

Hmm, it truly is complex. I ponder about the <t-hbr/> in your example. Shouldn't that yield

deelwoord extra

deelwoord
extra

or

deel-
woord
extra

or such? Anyway not just a space after 'deel' I assume, but some representation of the <t-hbr/>.

proycon · 2021-03-09T14:42:32Z

Ah yes, possibly, I didn't consider any representation of t-hbr . I don't think we currently represent it even, do we? Let's save that for another issue :)

kosloot · 2021-03-09T14:50:09Z

Well, it was the source for LanguageMachines/foliautils#56
One of the heads of this dragon

pirolen · 2021-03-09T15:00:34Z

After tokenization with ucto, the t-hbr is gone/turned into a token boundary. In my ideal workflow, the soft break would stay recoverable (and propagatable to FLAT and folia2html), if possible at all.


proycon · 2021-03-24T13:52:35Z

A remaining issue, raised by @kosloot, is whether we should actively normalize the more exotic unicode spaces (see https://en.wikipedia.org/wiki/Whitespace_character#Unicode) to a normal space. This is probably a good idea, but we may need to introduce an explicit <t-hspace> element in case people want to explicitly specify things like space width.
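As a rough illustration of such a normalization (not the actual foliapy or libfolia code; the function name `normalize_unicode_spaces` is hypothetical), the sketch below maps every character in Unicode category Zs (space separators) to a plain U+0020. The replacement is one-to-one, so string lengths and offsets are unchanged; newlines and tabs are left alone, and zero-width characters such as U+200B fall outside Zs and are not touched.

```python
import unicodedata


def normalize_unicode_spaces(text: str) -> str:
    """Replace each Unicode space separator (category Zs) with a plain space."""
    return "".join(" " if unicodedata.category(ch) == "Zs" else ch for ch in text)


# NO-BREAK SPACE (U+00A0) and EM SPACE (U+2003) both become U+0020.
sample = "deel\u00a0woord\u2003extra"
print(repr(normalize_unicode_spaces(sample)))
```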

pirolen · 2021-03-24T15:43:04Z

Thanks!
Some more test examples from me would include superscript styling, where the superscripted characters would ideally be adjacent without whitespace to their context on the left and sometimes right, in examples 2 and 3:

<t-str xml:id="FA-b1_3_1_mwtext_vorbemerk_pp61_67_005.text.div1.p2.t-str.5">
            <t-style><feat subset="font_family" class="Times New Roman"/><feat subset="font_size" class="15."/><feat subset="font_style" class="{3C19F4A8-2234-4EE8-9373-EBFA03C5A2A4}"/>im wirtschaftlichen Interessenkampf gegen die Agrarpartei verwert<t-hbr/></t-style>
          </t-str>
          <t-str xml:id="FA-b1_3_1_mwtext_vorbemerk_pp61_67_005.text.div1.p2.t-str.6">
            <t-style><feat subset="font_family" class="Times New Roman"/><feat subset="font_size" class="15."/><feat subset="font_style" class="{3C19F4A8-2234-4EE8-9373-EBFA03C5A2A4}"/>baren Schauergemälde bieten</t-style>
            <t-style><feat subset="font_typeface" class="superscript"/><feat subset="font_family" class="Times New Roman"/><feat subset="font_size" class="15."/><feat subset="font_style" class="{3C19F4A8-2234-4EE8-9373-EBFA03C5A2A4}"/>6</t-style>
            <t-style><feat subset="font_family" class="Times New Roman"/><feat subset="font_size" class="15."/><feat subset="font_style" class="{3C19F4A8-2234-4EE8-9373-EBFA03C5A2A4}"/>, oder welche die Agrarverhältnisse</t-style>
          </t-str>

        <t class="OCR">
          <t-str xml:id="FA-b1_3_1_mwtext_vorbemerk_pp61_67_005.text.div1.p3.t-str.1">
            <t-style><feat subset="font_family" class="Times New Roman"/><feat subset="font_size" class="15."/><feat subset="font_style" class="{3C19F4A8-2234-4EE8-9373-EBFA03C5A2A4}"/>Um nicht gewisse Bemerkungen über die Arbeitsverfassung im</t-style>
          </t-str>
          <t-str xml:id="FA-b1_3_1_mwtext_vorbemerk_pp61_67_005.text.div1.p3.t-str.2">
            <t-style><feat subset="font_family" class="Times New Roman"/><feat subset="font_size" class="15."/><feat subset="font_style" class="{3C19F4A8-2234-4EE8-9373-EBFA03C5A2A4}"/>ganzen</t-style>
            <t-style><feat subset="font_typeface" class="superscript"/><feat subset="font_family" class="Times New Roman"/><feat subset="font_size" class="15."/><feat subset="font_style" class="{3C19F4A8-2234-4EE8-9373-EBFA03C5A2A4}"/>1</t-style>
            <t-style><feat subset="font_family" class="Times New Roman"/><feat subset="font_size" class="15."/><feat subset="font_style" class="{3C19F4A8-2234-4EE8-9373-EBFA03C5A2A4}"/>) bei jedem einzelnen Bezirk wiederholen zu müssen, habe</t-style>
          </t-str>
        </t>

        <t class="OCR">
          <t-str xml:id="FA-b1_3_1_mwtext_vorbemerk_pp61_67_005.text.div1.p4.t-str.1">
            <t-style><feat class="superscript" subset="font_typeface"/><feat class="Times New Roman" subset="font_family"/><feat class="12." subset="font_size"/><feat class="{6B4F7D42-EA4B-4F65-B62C-458C902232DA}" subset="font_style"/>1</t-style>
            <t-style><feat class="Times New Roman" subset="font_family"/><feat class="12." subset="font_size"/><feat class="{6B4F7D42-EA4B-4F65-B62C-458C902232DA}" subset="font_style"/>) Grundlage bleibt nach wie vor in dieser Beziehung die Schrift von v.d. Goltz,</t-style>
          </t-str>
        </t>

proycon · 2021-03-24T16:54:32Z

@pirolen To accomplish that in the new situation, there can not be a newline between the two elements (so they must be on the same line). I think this is generated by FoLiA-abby right? We'll have to make sure it produces proper FoLiA in such cases.

pirolen · 2021-03-24T16:59:03Z

Yes, the examples come from FoLiA-abby.

kosloot · 2021-03-24T17:04:18Z

We have to look into this as soon as all text issues have been resolved. At the moment it is a moving target.


kosloot · 2021-03-29T08:30:31Z

Still I think we are getting into trouble anyway.
To illustrate the dilemma a simplified example:

Original text:
item1²
Possible FoLiA text: (as @pirolen would like to see it, I suppose)

<t>
  <t-str>item1<t-style><feat class="superscript" subset="font_typeface"/>2</t-style></t-str>
</t>

When using ucto, a string will be extracted like this: item12
imho this is quite useless.
For further processing, we need a way to "know" that the 2 isn't part of the word item1
Any ideas HOW to accomplish this?
Inserting a space (or newline or such) in the FoLiA is a bit harsh, but I would still prefer item1 2 over item12.

In an ideal world, extracting text from the FoLiA would re-introduce the superscript, item1², but that would depend on the set used, and in general these sets can be user-defined and are open, so any translation is possible.

I'm stuck here

proycon · 2021-03-29T09:04:45Z

I see the problem yes. Technically, following all the rules, the text serialisation item12 is correct. Inserting a space would be too harsh indeed. But I agree that from a tokenisation perspective you would indeed prefer to have item1 and 2 as different tokens. This would then indeed be a problem for the tokeniser (ucto) to tackle, but it is hard to get right and would make all kinds of assumptions we can't really make, so whatever we do would have to be an opt-in parameter I think.

> In an ideal world, extracting text from the FoLiA would re-introduce the superscript, item1², but that would depend on the set used, and in general these sets can be user-defined and are open, so any translation is possible.

Indeed, and in general styles don't transfer to plain text. You'd need a markup language for that (like Markdown). Properly interpreting styles in custom sets can only be done by the user. We certainly don't want text serialisation in FoLiA libraries to even attempt that.

pirolen · 2021-03-29T10:17:46Z

Superscript and subscript are the t-style classes that would imply a token boundary, the others don't (e.g. italic, bold).
Maybe these two could be treated somewhat differently from the rest, so that they always encode a non-breaking boundary (which is not a whitespace boundary)?
I guess t-hbr does not apply here, but perhaps something like https://en.wikipedia.org/wiki/Zero-width_joiner ?

kosloot · 2021-03-29T10:35:57Z

@pirolen:
Maarten and I were thinking in the same direction. Another candidate would be the
Zero-Width-space
It's up to ucto and such to interpret that as a token separator.

@proycon To make this more generic: Could we extend the t-style with an attribute like separator="true"
Which would make text extraction insert that joiner or zero-width?

BUT: There is also another issue, text like: ²footnote text
here the joiner/separator has to come AFTER the ². So maybe the only feasible way to do it is surrounding the text with a special symbol.
It is really tricky.

pirolen · 2021-03-29T17:50:08Z

Would be nice if adding the special symbol around the t-style text element would solve it.

The whole phenomenon reminds me a bit of the choice of tags in sequence labeling, where one can use the prefixes I-O, B-I-O, etc. in combination with the applicable tag (like for a named entity), or simply use the name of the tag as the label. Each of the choices implicitly encodes a specific logic for the tools that ingest the labeled data (and for the humans who interpret them).

kosloot · 2021-03-30T08:38:12Z

More pondering on this:
One problem with 'hidden' characters is their size. Do they count for offsets and string length?
For instance, assuming the separated attribute is implemented:

<t>
  <t-str>item1<t-style><feat class="superscript" subset="font_typeface" separated="yes"/>2</t-style></t-str><t-str>something</t-str>
</t>

(The original text was: item1² something)

What should we do here? I assume there is no need to insert 'hidden' characters in the FoLiA itself, but rather to implement the str() extraction function so that it does 'the right thing'.
But for the fragment above, should str() render:
item1 2 something
OR
item1*2* something, where * is a ZERO-WIDTH character, as we were suggesting.

This might raise a lot of problems later on.
What is the offset of '2' in this string? 5 or 6?
And 9 or 7 for 'something'?
Same problems with the string length.
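The ambiguity can be checked with a few lines of Python. This only compares two candidate renderings of the fragment above, once without any hidden characters and once with U+200B (ZERO WIDTH SPACE) around the superscript; neither is claimed to be what any FoLiA library produces.

```python
ZWSP = "\u200b"  # ZERO WIDTH SPACE

# Rendering without hidden characters versus rendering with a zero-width
# character before and after the superscripted "2".
plain = "item12 something"
hidden = f"item1{ZWSP}2{ZWSP} something"

for label, text in (("plain", plain), ("hidden", hidden)):
    print(label, text.index("2"), text.index("something"), len(text))
```

The two renderings look the same on screen, yet the offset of '2' is 5 in one and 6 in the other, the offset of 'something' is 7 versus 9, and the lengths are 16 versus 18.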

Maybe the clearest solution is to implement the 'separated' attribute, with the semantics of:
when extracting text, insert a space before AND after the styled token.
(and avoiding multiple spaces)

In this way we do not break any old behavior, and don't introduce fuzzy and surprising characters.

pirolen · 2021-03-30T09:47:31Z

Gut feeling:
to render the separator as space would be confusing for humans (e.g. evaluators of OCR extraction), because there is visually no space before/after the sub-/superscripted text (so rather also no hidden character to add to the offset and string length count).

Would it be feasible to regard/treat sub-/superscripted text as a specific type of punctuation? :-o Semantically it seems related to it (=it aids and directs the reading of the text). But just like the soft break, its behavior could be configurable?

proycon · 2021-03-30T10:18:12Z

> Maarten and I were thinking in the same direction. Another candidate would be the Zero-Width-space. It's up to ucto and such to interpret that as a token separator.

Just to prevent confusion: I definitely don't think there should be zero-width spaces in the FoLiA itself. At most the text extraction function could output one where a token boundary must occur and no space happens, but that would have to be an opt-in feature. And as you said, I foresee issues with the offsets then. So I see where you are going with the separated attribute.

Fundamentally, the issue we're discussing now is a tokeniser issue rather than a FoLiA representation problem (so I see it as distinct from the original issue in this thread). The question is how the tokeniser decides what to tokenize and what not:

What you're essentially suggesting with the separated attribute is to encode extra information in the FoLiA that gives the tokeniser extra information.
An alternative would be to provide the information directly to the tokeniser as a parameter, something like: treat all t-style's with class superscript as separate tokens. (an FQL query might work here but libfolia doesn't implement that and that'd be too much work)

Text content on higher levels is by definition untokenised (so I'm a bit skeptical about adding tokenisation details in there); text content on the word/token level is by definition tokenised. The issue is of course getting from A to B here (which is the task of the tokeniser).

I'm following the line of the extra attribute Ko suggested. But I'm trying to think in a generic way if we expand FoLiA for this: we're essentially encoding some extra 'cue' in the FoLiA to help another tool do its job, and such a cue is needed because the information is not present in the FoLiA yet, or is too complexly encoded. This might be useful for other use cases than the one we are considering now.

What if we introduce a generic tag attribute that allows people to tag any FoLiA element, the value being a space-delimited list of some user-defined vocabulary (because it is tool-specific)? We could then use a value like token or separate for the tokenisation cues:

<t>
  <t-str>item1<t-style tag="token"><feat class="superscript" subset="font_typeface"/>2</t-style></t-str><t-str>something</t-str>
</t>

It's essentially what Ko suggested but stretched to be more generic, it gives some processor-specific flexibility. You can envision tool A setting particular tags, and tool B acting on them.
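A hypothetical "tool B" acting on such a tag could look like the sketch below. It is only an illustration under assumptions: the function name `text_with_cues` is made up, the snippet omits the FoLiA namespace, and for simplicity it collapses all other whitespace as well, which a real tokeniser or FoLiA library would not necessarily do.

```python
import xml.etree.ElementTree as ET

EXAMPLE = """<t>
  <t-str>item1<t-style tag="token"><feat class="superscript" subset="font_typeface"/>2</t-style></t-str><t-str>something</t-str>
</t>"""


def text_with_cues(elem: ET.Element) -> str:
    """Extract text, putting a space around every element tagged 'token'."""

    def walk(e: ET.Element) -> list:
        # The tag attribute is a space-delimited list of user-defined values.
        cue = "token" in e.get("tag", "").split()
        out = [" "] if cue else []
        if e.text:
            out.append(e.text)
        for child in e:
            out.extend(walk(child))
            if child.tail:
                out.append(child.tail)
        if cue:
            out.append(" ")
        return out

    # Collapse whitespace runs so the cue never yields multiple spaces.
    return " ".join("".join(walk(elem)).split())


print(text_with_cues(ET.fromstring(EXAMPLE)))
```

The cue is applied on both sides of the tagged element, so the footnote case raised earlier (a superscript followed directly by text) is separated as well.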

Note: I opened a new issue for this proposal, see below

proycon added the bug label Dec 8, 2020

proycon self-assigned this Dec 8, 2020

proycon mentioned this issue Dec 8, 2020

Add text markup information after FoLiA-correct LanguageMachines/PICCL#62

Open

proycon added a commit that referenced this issue Dec 8, 2020

added a test example #88

187fa11

proycon added this to the v2.4.1 milestone Dec 8, 2020

proycon added a commit that referenced this issue Dec 8, 2020

bumping revision version for #88 , even though the specification as w…

3d4a5c4

…e had it is not affected

proycon added a commit that referenced this issue Dec 8, 2020

added an extra test to check a scenario where a fix for #88 might fail

e90defe

proycon added a commit to proycon/foliapy that referenced this issue Dec 8, 2020

strip whitespace left and right if there is only a sole string (proyc…

d453ab7

…on/folia#88)

proycon added a commit to LanguageMachines/libfolia that referenced this issue Dec 10, 2020

working on a fix for problems with leading/trailing whitespace (proyc…

f538b60

…on/folia#88)

proycon added a commit to LanguageMachines/foliatest that referenced this issue Dec 10, 2020

adapted two tests to comply with changes introduced by proycon/folia#88

9e83b59

proycon added the ready Implemented but not released yet label Dec 10, 2020

proycon closed this as completed Dec 11, 2020

proycon added a commit that referenced this issue Dec 17, 2020

added extra documentation for handling leading/trailing whitespace #88

30c041c

proycon mentioned this issue Feb 22, 2021

folia2html: XSL conversion results in extra spaces proycon/foliatools#29

Closed

proycon reopened this Mar 9, 2021

proycon modified the milestones: v2.4.1, v2.5.0 Mar 9, 2021

proycon added a commit to LanguageMachines/libfolia that referenced this issue Mar 19, 2021

Implemented fallback for backward compatibility in offset validation (p…

56afe68

…roycon/folia#88)

proycon added a commit to proycon/foliapy that referenced this issue Mar 24, 2021

Fix in maintaining leading spaces (+added test) (proycon/folia#88)

3b093f6

proycon added a commit to LanguageMachines/libfolia that referenced this issue Mar 24, 2021

Fix in maintaining leading spaces (proycon/folia#88)

e6bf82f

proycon added a commit to LanguageMachines/foliatest that referenced this issue Mar 24, 2021

added test for trailing space (proycon/folia#88)

4d1fb3d

proycon added the in progress label Mar 24, 2021

proycon added a commit that referenced this issue Mar 24, 2021

Added hspace annotation #88

0713293

proycon added a commit to proycon/foliapy that referenced this issue Mar 24, 2021

Implemented t-hspace element (proycon/folia#88)

9e1f5a7

proycon added a commit to LanguageMachines/libfolia that referenced this issue Mar 24, 2021

Implemented t-hspace (proycon/folia#88)

dbf2c0c

proycon mentioned this issue Mar 25, 2021

Implement FoLiA v2.5 support with new whitespace behaviour proycon/folia-rust#6

Open

proycon added a commit to proycon/foliapy that referenced this issue Mar 25, 2021

Use regexp in norm_spaces to multiple kinds of spaces are normalized (p…

ac00454

…roycon/folia#88)

proycon added a commit to proycon/foliapy that referenced this issue Mar 25, 2021

fix for normalize space function (proycon/folia#88)

3e6e1bc

proycon added a commit to proycon/foliapy that referenced this issue Mar 25, 2021

added is_space() function (proycon/folia#88)

371374e

proycon mentioned this issue Mar 30, 2021

Tagging mechanism to aid processors #93

Closed

proycon added ready Implemented but not released yet and removed in progress labels Apr 6, 2021

proycon closed this as completed Apr 7, 2021

proycon mentioned this issue Aug 18, 2021

New problems with leading/trailing whitespace around linebreaks in text content #101

Closed
