Handling of different types of hyphens in text. #67

kosloot · 2023-01-30T10:22:11Z

As found in LanguageMachines/ucto#90, there are some difficulties in handling hyphens in text.
The question is how to represent them in FoLiA in such a way that reconstructing the original text (with FoLiA-2text) or further processing (e.g. with ucto) is possible and does the right thing.

I will start with a long introduction:

Dit is een tekst-
je met een hyphen
en met een soft¬
hyphen

A soft hyphen ¬ indicates that a single word is split over two lines.
So

soft¬
hyphen

should read softhyphen.

In FoLiA this can be represented as <t-hbr class="¬"/>

In a lot of (modern) text (in Dutch for sure) a 'normal' hyphen is used.
So

tekst-
je

reads tekstje
In FoLiA this can be represented as <t-hbr class="-"/>

So the whole text would be:
<t>Dit is een tekst<t-hbr class="-"/>je met een hyphen en met een soft<t-hbr class="¬"/>hyphen</t>
And the extracted text would be:
Dit is een tekstje met een hyphen en met een softhyphen
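
The extraction described above can be sketched in a few lines of Python. This is only an illustration using the standard library XML parser, not the way libfolia or FoLiA-2text are implemented; the function name plain_text is made up for this example.

```python
import xml.etree.ElementTree as ET

# The <t> element from the example above, without any FoLiA namespace.
SNIPPET = (
    '<t>Dit is een tekst<t-hbr class="-"/>je met een hyphen '
    'en met een soft<t-hbr class="¬"/>hyphen</t>'
)

def plain_text(t_xml: str) -> str:
    """Concatenate the text of a <t> element, skipping every <t-hbr>."""
    root = ET.fromstring(t_xml)
    parts = [root.text or ""]
    for child in root:
        if child.tag != "t-hbr":
            # Any other inline markup contributes its own text content.
            parts.append("".join(child.itertext()))
        # The text following a child element always belongs to the parent.
        parts.append(child.tail or "")
    return "".join(parts)

print(plain_text(SNIPPET))
# Dit is een tekstje met een hyphen en met een softhyphen
```

Neither the class value nor any content of the <t-hbr> ends up in the result, which is exactly why the hyphens disappear from the extracted text.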

For a lot of applications this is just right.
So far so good. But as we see: the value inside a <t-hbr/> (also the class value) is NO LONGER part of the text.
The formatting (line breaks) is gone as well.

We have a solution for that too: the <br/> nodes. So let's add them.
The FoLiA text then becomes:
<t>Dit is een tekst<t-hbr class="-"/><br space="no"/>je met een hyphen en met een soft<t-hbr class="¬"/><br space="no"/>hyphen</t>

On output this would reproduce the original text.
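
To make that idea concrete, here is a hedged Python sketch of a layout-preserving rendering: every <t-hbr> is written out as its class value and every <br/> as a newline. This is a hypothetical rendering for illustration only (the function name with_layout is invented) and it is not what FoLiA-2text currently does. Note that the snippet only encodes the two hyphenated breaks; the plain line break after "hyphen" in the original input is not present in it, so that one is not restored.

```python
import xml.etree.ElementTree as ET

# The <t> element from the example above, including the <br/> nodes.
SNIPPET = (
    '<t>Dit is een tekst<t-hbr class="-"/><br space="no"/>je met een hyphen '
    'en met een soft<t-hbr class="¬"/><br space="no"/>hyphen</t>'
)

def with_layout(t_xml: str) -> str:
    """Render a <t> element, writing hyphens and line breaks back out."""
    root = ET.fromstring(t_xml)
    parts = [root.text or ""]
    for child in root:
        if child.tag == "t-hbr":
            # Assumption: the class attribute holds the hyphen character.
            parts.append(child.get("class", ""))
        elif child.tag == "br":
            parts.append("\n")
        parts.append(child.tail or "")
    return "".join(parts)

print(with_layout(SNIPPET))
```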

OK, problem solved, you would think. Close this issue...
So now my main concerns:

AT THE MOMENT, the extracted text is NOT a 1-1 copy of the input, as all hyphens are removed.
I assume we need a special mechanism in FoLiA to achieve that.
I will not address that any further here.
Can we assume that the - and the ¬ always play the same role, and can be handled alike?
Or does a mixed text like in the example suggest DIFFERENT roles, where the - has to be preserved in the text?

I assume we have 3 cases:

1. there are soft hyphens that are to be removed, and normal ones that are to be preserved
2. there are only normal hyphens, which are to be removed
3. there are both soft and normal hyphens, which are all to be removed

NOTE: We only speak of terminating hyphens, followed by a break/newline. Embedded hyphens should always be preserved, I think.

I am afraid that the only way to handle this is to have runtime options in FoLiA-txt to choose the strategy, and unfortunately there is no default case that is always the best. Older corpora do have the ¬ as a soft hyphen, but modern text often has the - at the line end in the same role.

@proycon and @pirolen ANY comments?

pirolen · 2023-01-30T10:59:03Z

I'd say semantic interpretation of hyphen types is not needed in the tools, so the case to be covered would be case 1 above.

There are often normal hyphens at the end of line too, e.g. in long German compounds. These should be kept.

My previous experience was: texts are OCR-ed with Abbyy, and Abbyy decides whether the hyphen at EOL is a soft one or not. (It can make wrong decisions, so this sometimes needs to be hand-corrected.) So Abbyy sometimes produced normal hyphens and sometimes soft hyphens at EOL. FoLiA-abby could interpret this, and ucto worked well on that FoLiA file.

I guess it would be fine to keep this strategy.

kosloot · 2023-01-30T14:21:19Z

OK, I have now implemented scenario 1: always removing soft hyphens.
I also added a --remove-end-hyphens option, to handle trailing hyphens just like soft hyphens.
So scenario 3 is handled too, and implicitly scenario 2.
Now in git.
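
As a rough illustration of the decision logic described here, the following Python sketch joins plain text lines. It is not the FoLiA-txt implementation: the function join_lines and its remove_end_hyphens parameter are hypothetical stand-ins for the behaviour and the option named above, and turning a kept end-of-line hyphen plus line break into "hyphen, space" is an arbitrary choice made for this sketch.

```python
SOFT_HYPHEN = "¬"  # the soft-hyphen mark used in the examples of this issue

def join_lines(lines, *, remove_end_hyphens):
    """Join lines into one string, resolving hyphens at line ends.

    Soft hyphens at a line end are always removed (scenario 1).
    Normal hyphens at a line end are removed only on request (scenarios 2 and 3).
    Hyphens inside a line are never touched.
    """
    out = ""
    for raw in lines:
        line = raw.rstrip()
        if line.endswith(SOFT_HYPHEN):
            out += line[:-1]          # glue to the next line
        elif remove_end_hyphens and line.endswith("-"):
            out += line[:-1]          # treat like a soft hyphen
        else:
            out += line + " "         # ordinary line break becomes a space
    return out.rstrip()

lines = ["Dit is een tekst-", "je met een hyphen", "en met een soft¬", "hyphen"]
print(join_lines(lines, remove_end_hyphens=False))
# Dit is een tekst- je met een hyphen en met een softhyphen
print(join_lines(lines, remove_end_hyphens=True))
# Dit is een tekstje met een hyphen en met een softhyphen
```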

dirkroorda · 2023-02-14T22:21:14Z

For what it is worth: in my conversion from the Mondriaan letters to Text-Fabric (radical stand-off), I have nodes for each character, but also nodes for each word. Each character node has a feature ch with the character as value. Every character ends up there, hyphens, soft or not, whatever.

But I also have nodes for words. They are maximal spans of alphanumeric characters and hyphens. Word nodes have a feature str, which contains the text string of the corresponding word, but without the hyphens. Each word has also a feature after, which contains the string of characters after that word until the following word.

So if you want the raw text string, look up all character nodes, read the ch feature, and concatenate.

If you want a slightly more polished form of the text, look up all word nodes, read the str and after features, and concatenate them.

So: one dataset, several text representations that can be extracted from it.

And this is not the end. One could also compute slightly different variants of the str and after features in order to get a desired representation of the text.

But yes, what I did to Mondriaan handles embedded hyphens in the wrong way: I should not discard them from the str feature values.

So the coincidence of seeing this issue come up in a slack channel alerted me to a glitch in my code. Thanks!
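
The character/word model described in this comment could look roughly like the Python sketch below. It uses plain lists of dicts rather than the Text-Fabric API, and the names build_nodes, raw_text, word_text and strip_hyphens are invented for the illustration. The strip_hyphens switch shows the two variants of the str feature: dropping hyphens (the behaviour identified above as wrong for embedded hyphens) and keeping them.

```python
import re

# A word is a maximal run of alphanumeric characters and hyphens.
WORD = re.compile(r"(?:[^\W_]|-)+")

def build_nodes(text, strip_hyphens=True):
    """Return (character nodes, word nodes) for a piece of text."""
    chars = [{"ch": c} for c in text]
    words = []
    matches = list(WORD.finditer(text))
    for i, m in enumerate(matches):
        # 'after' runs from the end of this word to the start of the next one.
        nxt = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        token = m.group()
        words.append({
            "str": token.replace("-", "") if strip_hyphens else token,
            "after": text[m.end():nxt],
        })
    return chars, words

def raw_text(chars):
    return "".join(node["ch"] for node in chars)

def word_text(words):
    return "".join(w["str"] + w["after"] for w in words)

text = "een soft-hyphen test."
chars, words = build_nodes(text)
print(raw_text(chars))   # een soft-hyphen test.
print(word_text(words))  # een softhyphen test.
```

Text before the first word is not covered by any after feature in this sketch, so the word-based representation assumes the text starts with a word.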

kosloot · 2023-02-15T16:05:46Z

@dirkroorda Thanks for your input.
I think we have a working solution now for FoLiA. In fact, it is possible to have several variants of a text in one FoLiA document too (using the textclass attribute).
As a last resort we might look into that. But when handing over to next stages, it is desirable to stick to one textclass.
E.g. tokenization of different texts inside one document is not supported.

pirolen · 2023-03-09T22:51:42Z

Apologies: I am very confused about how I can access hyphenated tokens from a FoLiA-txt output file, i.e. my original question LanguageMachines/ucto#90

Text (.txt) before conversion:

After conversion by FoLiA-txt (I pulled the latest container image):
FoLiA-txt --remove-end-hyphens yes -O tmp/. VMC_Sin_990_Januar_330-636_unbearbeitet.txt

I expected that FoLiA-txt would produce untokenized paragraphs on which I can run python-ucto that would understand the break notation and would give me the concatenated tokens.

Do I see correctly that FoLiA-txt produced untokenized paragraphs with break notation, plus the tokens that are split at the end hyphens? Did I maybe set some flag incorrectly when calling FoLiA-txt?

kosloot · 2023-03-11T12:20:00Z

AH, ok this seems to be an oversight.
When "interpreting" the hyphens, it seems to be a bad idea to add an explicit break after them.
(the <br space="no"/> nodes)
I assume they must be removed. That is easy to do, but I wonder if this would be a good strategy in general.
Otherwise this should (again) be another option. I will give it some thought first.

kosloot · 2023-03-12T12:06:03Z

Well, this is solved now, but I had to add a small fix in libfolia too.
That fix only affects FoLiA-2text.
The patch I added for FoLiA-txt should be enough to get ucto working as expected.
So when using the git versions, you can test.
New official releases are expected in the coming week.

NOTE: the <str> nodes in the FoLiA are there only to "document" the original text. Neither ucto nor any other tool uses them.

pirolen · 2023-03-13T21:43:41Z

Thanks a lot, the git version works as expected!

proycon · 2023-03-13T21:45:55Z

Ko has released foliautils 0.20 today so the changes are released now. I updated the docker version accordingly.

pirolen · 2023-03-13T21:47:37Z

Good to know, thanks for the quick action! I was not sure whether that release solves it. At least I practiced building the container too :-) Next, I will test FoLiA-page :-)

kosloot · 2023-03-14T07:45:18Z

I assume that some "recent" improvements can be implemented in other modules like FoLiA-page, FoLiA-hocr etc. too.
Not sure how feasible or workable that is.
Might take some time.

pirolen · 2023-07-10T14:57:47Z

> Ko has released foliautils 0.20 today so the changes are released now. I updated the docker version accordingly.

I pulled the latest container now (proycon/foliautils latest 493744d1613a 4 months ago 164MB).
If I do FoLiA-txt --version I get: foliautils 0.19. :-/ and no EOL hyphen handling, how come?

The container was pulled using docker pull proycon/foliautils -- perhaps one needs to specify a tag?

Or do I still need to build the container myself after pulling the image?

pirolen · 2023-07-10T15:58:43Z

I built a container based on the git repo, so now I have foliautils 0.21.
But if I run FoLiA-txt --remove-end-hyphens yes on a file containing normal hyphens at a linebreak, e.g.

, I still get this:

kosloot · 2023-07-10T21:08:37Z

Yes. This is intended behavior. (--remove-end-hyphens=yes is even the default!)

Please note that when this is further processed by other libfolia based tools, the <t-hbr> is ignored.

I tested with the line:

Dit is een test-
je.

The FoLiA created is:

   <p xml:id="testfile.p.1">
      <t class="FoLiA-txt">Dit is een test<t-hbr>-</t-hbr>je.<br/></t>
      <str xml:id="testfile.p.1.str.1">
...

running ucto on this line, you will get:
> ucto -Lnld --inputclass FoLiA-txt hmm/testfile.folia.xml

      <s xml:id="testfile.p.1.s.1">
        <w xml:id="testfile.p.1.s.1.w.1" class="WORD">
          <t>Dit</t>
        </w>
        <w xml:id="testfile.p.1.s.1.w.2" class="WORD">
          <t>is</t>
        </w>
        <w xml:id="testfile.p.1.s.1.w.3" class="WORD">
          <t>een</t>
        </w>
        <w xml:id="testfile.p.1.s.1.w.4" class="WORD" space="no">
          <t>testje</t>
        </w>
        <w xml:id="testfile.p.1.s.1.w.5" class="PUNCTUATION">
          <t>.</t>
        </w>
      </s>

Running FoLiA-2text on this file yields:
Dit is een testje.

This seems correct imho.
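
For readers who want to check the tokenized result themselves, here is a small Python sketch that rebuilds the sentence text from the <w> tokens shown above, honouring space="no". It works on a namespace-free fragment copied from the output above; a real FoLiA file declares an XML namespace, so the tag lookups would need adjusting. The function name detokenize is made up for this example and is not part of ucto or libfolia.

```python
import xml.etree.ElementTree as ET

# Fragment of the ucto output shown above (without the FoLiA namespace).
SENTENCE = """
<s xml:id="testfile.p.1.s.1">
  <w xml:id="testfile.p.1.s.1.w.1" class="WORD"><t>Dit</t></w>
  <w xml:id="testfile.p.1.s.1.w.2" class="WORD"><t>is</t></w>
  <w xml:id="testfile.p.1.s.1.w.3" class="WORD"><t>een</t></w>
  <w xml:id="testfile.p.1.s.1.w.4" class="WORD" space="no"><t>testje</t></w>
  <w xml:id="testfile.p.1.s.1.w.5" class="PUNCTUATION"><t>.</t></w>
</s>
"""

def detokenize(s_xml: str) -> str:
    """Join the token texts of a sentence, using space="no" as glue hint."""
    sentence = ET.fromstring(s_xml)
    out = []
    for w in sentence.iter("w"):
        out.append(w.findtext("t", default=""))
        if w.get("space") != "no":
            out.append(" ")  # default: a space follows the token
    return "".join(out).strip()

print(detokenize(SENTENCE))
# Dit is een testje.
```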

pirolen · 2023-07-10T21:44:46Z

You are right of course! Apologies, apparently I had not properly documented for myself that the concatenated token is to be obtained using ucto. Many thanks for the quick answer!

proycon · 2023-07-17T10:34:50Z

> I pulled the latest container now (proycon/foliautils latest 493744d1613a 4 months ago 164MB).
> If I do FoLiA-txt --version I get: foliautils 0.19.

Hmm, something must have gone wrong or been forgotten after the last release. I have now pushed foliautils 0.20 (latest release) to Docker Hub. Of course, your own git build is more recent, as that contains changes that haven't been released yet.

kosloot assigned proycon and kosloot Jan 30, 2023

pirolen mentioned this issue Mar 13, 2023

FoLiA-page: add support for linebreaks #65

Closed

kosloot closed this as completed Mar 14, 2023
