Handling of different types of hyphens in text. #67

Closed
kosloot opened this issue Jan 30, 2023 · 16 comments
@kosloot
Contributor

kosloot commented Jan 30, 2023

As found in LanguageMachines/ucto#90, there are some difficulties in handling hyphens in text.
How can we represent them in FoLiA in such a way that reconstructing the original text (with FoLiA-2text) or further processing (e.g. with ucto) is possible and does the right thing?

I will start with a long introduction:

Dit is een tekst-
je met een hyphen
en met een soft¬
hyphen

A soft hyphen ¬ indicates that a single word is split over two lines.
So

soft¬
hyphen

should read softhyphen.

In FoLiA this is represented as <t-hbr class="¬"/>

In a lot of (modern) text (in Dutch for sure) a 'normal' hyphen is used.
So

tekst-
je

reads tekstje
In FoLiA this can be represented as <t-hbr class="-"/>

So the whole text would be:
<t>Dit is een tekst<t-hbr class="-"/>je met een hyphen en met een soft<t-hbr class="¬"/>hyphen</t>
And the extracted text would be:
Dit is een tekstje met een hyphen en met een softhyphen

For a lot of applications this is just right.
So far so good. But as we see: the value inside a <t-hbr/> (also the class value) is NO LONGER part of the text.
And the formatting (the line breaks) is gone too.

We have a solution for that too: the <br/> nodes. So let's add them.
The FoLiA text then becomes:
<t>Dit is een tekst<t-hbr class="-"/><br space="no"/>je met een hyphen en met een soft<t-hbr class="¬"/><br space="no"/>hyphen</t>

On output this would reproduce the original text.
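
To make the two readings of this markup concrete, here is a minimal Python sketch. It is not how libfolia or FoLiA-2text work internally; it only walks the example <t> element with the standard library. The function name render and the keep_layout switch are made up for this illustration.

```python
import xml.etree.ElementTree as ET

# The example <t> element from above (no namespace, as in the snippet).
SNIPPET = (
    '<t>Dit is een tekst<t-hbr class="-"/><br space="no"/>je met een hyphen '
    'en met een soft<t-hbr class="¬"/><br space="no"/>hyphen</t>'
)

def render(t_xml: str, keep_layout: bool) -> str:
    """Flatten a <t> element to a string.

    keep_layout=False: skip <t-hbr/> and <br/>, giving the joined words.
    keep_layout=True:  emit the class value of <t-hbr/> and a newline for <br/>.
    """
    root = ET.fromstring(t_xml)
    parts = [root.text or ""]
    for child in root:
        if keep_layout and child.tag == "t-hbr":
            parts.append(child.get("class", ""))  # the hyphen character
        elif keep_layout and child.tag == "br":
            parts.append("\n")                    # the line break
        parts.append(child.tail or "")            # text following the element
    return "".join(parts)

print(render(SNIPPET, keep_layout=False))
print(render(SNIPPET, keep_layout=True))
```

With keep_layout=False the hyphens and breaks disappear and the split words are joined; with keep_layout=True the hyphen characters and line breaks come back at the two marked positions.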

OK, problem solved, you would think. Close this issue...
So now my main concerns:

  • AT THE MOMENT, the extracted text is NOT a 1-1 copy of the input, as all hyphens are removed.
    I assume we need a special mechanism in FoLiA to achieve that.
    I will not address that any further here.

  • Can we assume that the - and the ¬ always play the same role, and can be handled alike?

  • Or does a mixed text like in the example suggest DIFFERENT roles? Where the - has to be preserved in the text?

I assume we have 3 cases:

  1. there are soft hyphens that are to be removed, and normal ones to preserve
  2. there are only normal hyphens which are to be removed
  3. there are both soft and normal hyphens which are removed

NOTE: We only speak of terminating hyphens, followed by a break/newline. Embedded hyphens should always be preserved, I think.

I am afraid that the only way to handle this is to have runtime options in FoLiA-txt to choose the strategy, and unfortunately there is no default case that is always the best. Older corpora do have the ¬ as a soft hyphen, but modern text often has the - at the line end in the same role.
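
The three cases can be sketched on plain text in Python. This is only an illustration of the strategies, not FoLiA-txt code: the function join_lines and its removable parameter are hypothetical, and gluing a preserved line-end hyphen directly onto the next line is an assumption of this sketch.

```python
SOFT = "¬"  # soft hyphen marker as used in the example text
HARD = "-"  # normal hyphen

def join_lines(text: str, removable: str) -> str:
    """Join the lines of `text` into one string.

    A line ending in SOFT or HARD is glued to the next line without a space.
    The trailing marker is dropped only if it occurs in `removable`.
    Embedded hyphens are never touched; other lines are joined with a space.
    """
    out = ""
    for line in text.split("\n"):
        if line.endswith((SOFT, HARD)):
            out += line[:-1] if line[-1] in removable else line
        else:
            out += line + " "
    return out.rstrip()

SAMPLE = "Dit is een tekst-\nje met een hyphen\nen met een soft¬\nhyphen"

print(join_lines(SAMPLE, removable=SOFT))         # case 1
print(join_lines(SAMPLE, removable=HARD))         # case 2
print(join_lines(SAMPLE, removable=SOFT + HARD))  # case 3
```

Case 1 keeps the normal hyphen (tekst-je) and removes the soft one, case 3 removes both, and case 2 only makes sense for input without soft hyphens, since it leaves any ¬ in place.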

@proycon and @pirolen ANY comments?

@pirolen

pirolen commented Jan 30, 2023

I'd say semantic interpretation of hyphen types is not needed in the tools, so the case to be covered would be case 1 above.

There are often normal hyphens at the end of line too, e.g. in long German compounds. These should be kept.

My previous experience was: texts are OCR-ed with Abbyy, and Abbyy decides whether the hyphen at EOL is a soft one or not. (It can make wrong decisions, so this sometimes needs to be hand-corrected.) So Abbyy sometimes produced normal hyphens and sometimes soft hyphens at EOL. FoLiA-abby could interpret this, and ucto worked well on that FoLiA file.

I guess it would be fine to keep this strategy.

@kosloot
Contributor Author

kosloot commented Jan 30, 2023

Ok, I have now implemented scenario 1: always removing soft hyphens.
I also added a --remove-end-hyphens option, to handle trailing hyphens just like soft hyphens.
So scenario 3 is handled too, and implicitly 2.
Now in git.

@dirkroorda

For what it is worth: in my conversion from the Mondriaan letters to Text-Fabric (radical stand-off), I have nodes for each character, but also nodes for each word. Each character node has a feature ch with the character as value. Every character ends up there, hyphens, soft or not, whatever.

But I also have nodes for words. They are maximal spans of alphanumeric characters and hyphens. Word nodes have a feature str, which contains the text string of the corresponding word, but without the hyphens. Each word has also a feature after, which contains the string of characters after that word until the following word.

So if you want the raw text string, look up all character nodes, read the ch feature, and concatenate.

If you want a slightly more polished form of the text, look up all word nodes, read the str and after features, and concatenate them.

So: one dataset, several text representations that can be extracted from it.

And this is not the end. One could also compute slightly different variants of the str and after features in order to get a desired representation of the text.
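
A rough Python sketch of this idea, using plain lists of dicts rather than the Text-Fabric API (the function build and the regular expression are assumptions made for illustration; only the feature names ch, str and after come from the description above):

```python
import re

# A word is a maximal run of alphanumeric characters and hyphens.
WORD = re.compile(r"(?:[^\W_]|-)+")

def build(text: str):
    """Return (character nodes, word nodes) for `text`."""
    chars = [{"ch": c} for c in text]  # every character ends up here
    matches = list(WORD.finditer(text))
    words = []
    for m, nxt in zip(matches, matches[1:] + [None]):
        after_end = nxt.start() if nxt else len(text)
        words.append({
            "str": m.group().replace("-", ""),  # word text without hyphens
            "after": text[m.end():after_end],   # material up to the next word
        })
    return chars, words

chars, words = build("Dit is een tekst-je.")
raw = "".join(c["ch"] for c in chars)
polished = "".join(w["str"] + w["after"] for w in words)
print(raw)
print(polished)
```

Note that this sketch strips every hyphen from str, embedded ones included, and it ignores any text before the first word.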

But yes, what I did to Mondriaan handles embedded hyphens in the wrong way: I should not discard them from the str feature values.

So the coincidence of seeing this issue come up in a slack channel alerted me to a glitch in my code. Thanks!

@kosloot
Contributor Author

kosloot commented Feb 15, 2023

@dirkroorda Thanks for your input.
I think we have a working solution now for FoLiA. In fact it is possible to have several variants of a text in one FoLiA document too (using the textclass attribute).
As a last resort we might look into that. But when handing over to next stages, it is desirable to stick to one textclass.
E.g. tokenization of different texts inside one document is not supported.

@pirolen

pirolen commented Mar 9, 2023

Apologies: I am very confused about how I can access hyphenated tokens from a FoLiA-txt output file, i.e. my original question LanguageMachines/ucto#90

Txt text before conversion:

Screen Shot 2023-03-09 at 23 53 17

After conversion by FoLiA-txt (I pulled the latest container image):
FoLiA-txt --remove-end-hyphens yes -O tmp/. VMC_Sin_990_Januar_330-636_unbearbeitet.txt

Screen Shot 2023-03-09 at 23 53 44

Screen Shot 2023-03-09 at 23 54 19

I expected that FoLiA-txt would produce untokenized paragraphs on which I can run python-ucto that would understand the break notation and would give me the concatenated tokens.

Do I see correctly that FoLiA-txt produced untokenized paragraphs with break notation, plus the tokens that are split at the end hyphens? Did I maybe set some flag incorrectly when calling FoLiA-txt?

@kosloot
Contributor Author

kosloot commented Mar 11, 2023

AH, ok this seems to be an oversight.
When "interpreting" the hyphens, it seems to be a bad idea to add an explicit break after them.
(the <br space="no"/> nodes)
I assume they must be removed. That is easy to do, but I wonder if this would be a good strategy in general.
Otherwise this should (again) be another option. I will give it some thought first.

@kosloot
Contributor Author

kosloot commented Mar 12, 2023

Well, this is solved now, but I had to add a small fix in libfolia too.
But this only affects FoLiA-2text.
The patch I added for FoLiA-txt should be enough to get ucto working as expected.
So when using the git versions, you can test.
New official releases are expected in the coming week.

NOTE: the <str> nodes in the FoLiA are there only to "document" the original text. Neither ucto nor any other tool uses those.

@pirolen

pirolen commented Mar 13, 2023

Thanks a lot, the git version works as expected!

@proycon
Member

proycon commented Mar 13, 2023

Ko has released foliautils 0.20 today so the changes are released now. I updated the docker version accordingly.

@pirolen

pirolen commented Mar 13, 2023

Good to know, thanks for the quick action! I was not sure whether that release solves it. At least I practiced building the container too :-) Next, I will test FoLiA-page :-)

@kosloot
Contributor Author

kosloot commented Mar 14, 2023

I assume that some "recent" improvements can be implemented in other modules like FoLiA-page, FoLiA-hocr etc. too.
Not sure how feasible or workable that is.
Might take some time.

@kosloot kosloot closed this as completed Mar 14, 2023
@pirolen

pirolen commented Jul 10, 2023

Ko has released foliautils 0.20 today so the changes are released now. I updated the docker version accordingly.

I pulled the latest container now (proycon/foliautils latest 493744d1613a 4 months ago 164MB).
If I do FoLiA-txt --version I get: foliautils 0.19. :-/ and no EOL hyphen handling, how come?

The container was pulled using docker pull proycon/foliautils -- perhaps one needs to specify a tag?

Screen Shot 2023-07-10 at 17 12 56

Or do I still need to build the container myself after pulling the image?

@pirolen

pirolen commented Jul 10, 2023

I built a container based on the git repo, so now I have foliautils 0.21.
But if I run FoLiA-txt --remove-end-hyphens yes on a file containing normal hyphens at a linebreak, e.g.

Screen Shot 2023-07-10 at 17 56 41

, I still get this:

Screen Shot 2023-07-10 at 17 57 06

@kosloot
Contributor Author

kosloot commented Jul 10, 2023

Yes. This is intended behavior. (--remove-end-hyphens=yes is even the default!)

Please note that when this is further processed by other libfolia based tools, the <t-hbr> is ignored.

I tested with the line:

Dit is een test-
je.

The FoLiA created is:

   <p xml:id="testfile.p.1">
      <t class="FoLiA-txt">Dit is een test<t-hbr>-</t-hbr>je.<br/></t>
      <str xml:id="testfile.p.1.str.1">
...

running ucto on this line, you will get:
> ucto -Lnld --inputclass FoLiA-txt hmm/testfile.folia.xml

      <s xml:id="testfile.p.1.s.1">
        <w xml:id="testfile.p.1.s.1.w.1" class="WORD">
          <t>Dit</t>
        </w>
        <w xml:id="testfile.p.1.s.1.w.2" class="WORD">
          <t>is</t>
        </w>
        <w xml:id="testfile.p.1.s.1.w.3" class="WORD">
          <t>een</t>
        </w>
        <w xml:id="testfile.p.1.s.1.w.4" class="WORD" space="no">
          <t>testje</t>
        </w>
        <w xml:id="testfile.p.1.s.1.w.5" class="PUNCTUATION">
          <t>.</t>
        </w>
      </s>

Running FoLiA-2text on this file yields:
Dit is een testje.

This seems correct imho.
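
For anyone who wants to check this on their own ucto output, the following Python sketch rebuilds the sentence from the <w> tokens shown above. It is a standard-library illustration, not libfolia or FoLiA-2text code; the function detokenize is a made-up name, and the snippet has no namespace declaration, unlike a complete FoLiA document.

```python
import xml.etree.ElementTree as ET

# The <s> element from the ucto output above, in compact form.
SENTENCE = """<s xml:id="testfile.p.1.s.1">
  <w xml:id="testfile.p.1.s.1.w.1" class="WORD"><t>Dit</t></w>
  <w xml:id="testfile.p.1.s.1.w.2" class="WORD"><t>is</t></w>
  <w xml:id="testfile.p.1.s.1.w.3" class="WORD"><t>een</t></w>
  <w xml:id="testfile.p.1.s.1.w.4" class="WORD" space="no"><t>testje</t></w>
  <w xml:id="testfile.p.1.s.1.w.5" class="PUNCTUATION"><t>.</t></w>
</s>"""

def detokenize(s_xml: str) -> str:
    """Concatenate token texts, adding a space unless the token has space="no"."""
    out = ""
    for w in ET.fromstring(s_xml).iter("w"):
        out += w.findtext("t", default="")
        if w.get("space") != "no":
            out += " "
    return out.strip()

print(detokenize(SENTENCE))
```

The joined token testje is already a single <w>, so nothing of the <t-hbr> is left at this stage; the space="no" attribute only controls the missing space before the full stop.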

@pirolen

pirolen commented Jul 10, 2023

You are right of course! Apologies, apparently I have not properly documented it for myself, that the concatenated token is to be obtained using ucto. Much thanks for the quick answer!

@proycon
Member

proycon commented Jul 17, 2023

I pulled the latest container now (proycon/foliautils latest 493744d1613a 4 months ago 164MB).
If I do FoLiA-txt --version I get: foliautils 0.19.

Hmm, something must have gone wrong or been forgotten after the last release. I have now pushed foliautils 0.20 (latest release) to Docker Hub. Of course, your own git build is more recent, as that contains changes that haven't been released yet.
