Handling of different types of hyphens in text. #67
Comments
I'd say semantic interpretation of hyphen types is not needed in the tools, so the case to be covered would be case 1 above. There are often normal hyphens at the end of a line too, e.g. in long German compounds. These should be kept. My previous experience was: texts are OCR-ed with Abbyy, and Abbyy decides whether the hyphen at EOL is a soft one or not. (It can make wrong decisions, so this sometimes needs to be hand-corrected.) So Abbyy sometimes produced normal hyphens and sometimes soft hyphens at EOL. FoLiA-abby could interpret this, and ucto worked well on that FoLiA file. I guess it would be fine to keep this strategy. |
OK, I have now implemented scenario 1: always removing soft hyphens. |
For what it is worth: in my conversion from the Mondriaan letters to Text-Fabric (radical stand-off), I have nodes for each character, but also nodes for each word. Each character node has a feature But I also have nodes for words. They are maximal spans of alphanumeric characters and hyphens. Word nodes have a feature So if you want the raw text string o, look up all character nodes, read the If you want a slightly more polished form of the text, look up all word nodes, read the So: one dataset, several text representations that can be extracted from it. And this is not the end. One could also compute slightly different variants of the But yes, what I did to Mondriaan handles embedded hyphens in the wrong way: I should not discard them from the So the coincidence of seeing this issue come up in a slack channel alerted me to a glitch in my code. Thanks! |
@dirkroorda Thanx for your input. |
Apologies: I am very confused about how I can access hyphenated tokens from a FoLiA-txt output file, i.e. my original question LanguageMachines/ucto#90 Txt text before conversion: After conversion by FoLiA-txt (I pulled the latest container image): I expected that FoLiA-txt would produce untokenized paragraphs on which I can run python-ucto, which would understand the break notation and give me the concatenated tokens. Do I see correctly that FoLiA-txt produced untokenized paragraphs with break notation, plus the tokens that are split at the end hyphens? Did I maybe set some flag incorrectly when calling FoLiA-txt? |
AH, ok this seems to be an oversight. |
Well, this is solved now, but I had to add a small fix in libfolia too. NOTE: the |
Thanks a lot, the git version works as expected! |
Ko has released foliautils 0.20 today so the changes are released now. I updated the docker version accordingly. |
Good to know, thanks for the quick action! I was not sure whether that release solves it. At least I practiced building the container too :-) Next, I will test FoLiA-page :-) |
I assume that some "recent" improvements can be implemented in other modules like FoLiA-page, FoLiA-hocr etc. too |
Yes. This is intended behavior. (--remove-end-hyphens=yes is even the default!) Please note that when this is further processed by other libfolia based tools, the I tested with the line:
The FoLiA created is:
<p xml:id="testfile.p.1">
<t class="FoLiA-txt">Dit is een test<t-hbr>-</t-hbr>je.<br/></t>
<str xml:id="testfile.p.1.str.1">
... running ucto on this line, you will get:
<s xml:id="testfile.p.1.s.1">
<w xml:id="testfile.p.1.s.1.w.1" class="WORD">
<t>Dit</t>
</w>
<w xml:id="testfile.p.1.s.1.w.2" class="WORD">
<t>is</t>
</w>
<w xml:id="testfile.p.1.s.1.w.3" class="WORD">
<t>een</t>
</w>
<w xml:id="testfile.p.1.s.1.w.4" class="WORD" space="no">
<t>testje</t>
</w>
<w xml:id="testfile.p.1.s.1.w.5" class="PUNCTUATION">
<t>.</t>
</w>
</s>
Running FoLiA-2text on this file yields:
This seems correct imho. |
You are right of course! Apologies, apparently I have not properly documented it for myself, that the concatenated token is to be obtained using ucto. Much thanks for the quick answer! |
Hmm, something must have gone wrong or been forgotten after the last release. I have now pushed foliautils 0.20 (latest release) to Docker Hub. Of course, your own git build is more recent, as that contains changes that haven't been released yet. |
As found in LanguageMachines/ucto#90, there are some difficulties in handling hyphens in text.
How can we represent them in FoLiA in such a way that reconstructing the original text (with FoLiA-2text) or further processing (e.g. with ucto) is possible and does the right thing?
I will start with a long introduction:
A soft hyphen ¬ indicates that a single word is split over two lines. So
should read softhyphen.
In FoLiA:
<t-hbr class="¬"/>
In a lot of (modern) text (in Dutch for sure) a 'normal' hyphen is used.
So
reads tekstje.
In FoLiA this can be represented as
<t-hbr class="-"/>
So the whole text would be:
<t>Dit is een tekst<t-hbr class="-"/>je met een hyphen en met een soft<t-hbr class="¬"/>hyphen</t>
And the extracted text would be:
Dit is een tekstje met een hyphen en met een softhyphen
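The extraction described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how FoLiA-2text or libfolia implement it: the helper name `extract_text` is hypothetical, and the fragment is parsed without the FoLiA XML namespace, exactly as it is written in this issue.

```python
import xml.etree.ElementTree as ET


def extract_text(t_xml: str) -> str:
    """Return the text of a <t> fragment, skipping every <t-hbr> marker.

    Hypothetical helper for illustration; real FoLiA documents are
    namespaced, which this sketch deliberately ignores.
    """
    root = ET.fromstring(t_xml)
    parts = [root.text or ""]
    for child in root:
        if child.tag != "t-hbr":
            # Keep the text of any other inline element.
            parts.append("".join(child.itertext()))
        # The tail is the text that follows the child element.
        parts.append(child.tail or "")
    return "".join(parts)


sample = (
    '<t>Dit is een tekst<t-hbr class="-"/>je met een hyphen '
    'en met een soft<t-hbr class="¬"/>hyphen</t>'
)
print(extract_text(sample))
# prints: Dit is een tekstje met een hyphen en met een softhyphen
```

Because the hyphen only lives in the class attribute of the marker, dropping the marker is enough to join the two word halves.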
For a lot of applications this is just right.
So far so good. But as we see: the value inside a <t-hbr/> (also the class value) is NO LONGER part of the text. And also the formatting (line breaks) is gone.
We have a solution for that too: the <br/> nodes. So let's add them. The FoLiA text then becomes:
<t>Dit is een tekst<t-hbr class="-"/><br space="no"/>je met een hyphen en met een soft<t-hbr class="¬"/><br space="no"/>hyphen</t>
On output this would reproduce the original text.
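One possible reading of that idea, sketched in Python: re-emit the class value of each <t-hbr> as the hyphen character and turn each <br/> into a newline. This is an assumption about the intended output, not a description of any existing tool; `reconstruct_layout` is a hypothetical name and the fragment is again parsed without the FoLiA namespace.

```python
import xml.etree.ElementTree as ET


def reconstruct_layout(t_xml: str) -> str:
    """Rebuild the line layout of a <t> fragment (illustrative sketch only).

    Assumption: the class of a <t-hbr> holds the original hyphen character
    and a <br/> stands for exactly one newline.
    """
    root = ET.fromstring(t_xml)
    parts = [root.text or ""]
    for child in root:
        if child.tag == "t-hbr":
            parts.append(child.get("class", ""))  # put the hyphen back
        elif child.tag == "br":
            parts.append("\n")  # restore the line break
        else:
            parts.append("".join(child.itertext()))
        parts.append(child.tail or "")
    return "".join(parts)


sample = (
    '<t>Dit is een tekst<t-hbr class="-"/><br space="no"/>je met een hyphen '
    'en met een soft<t-hbr class="¬"/><br space="no"/>hyphen</t>'
)
print(reconstruct_layout(sample))
```

With the sample above this yields three lines, the first ending in `tekst-` and the second in `soft¬`.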
OK, problem solved, you would think. Close this issue...
So now my main concerns:
AT THE MOMENT, the extracted text is NOT a 1-1 copy of the input, as all hyphens are removed.
I assume we need a special mechanism in FoLiA to achieve that.
I will not address that any further here.
Can we assume that the - and the ¬ always play the same role, and can be handled alike? Or does a mixed text like in the example suggest DIFFERENT roles, where the - has to be preserved in the text?
I assume we have 3 cases:
NOTE: We only speak of terminating hyphens, followed by a break/newline. Embedded hyphens should always be preserved, I think.
I am afraid that the only way to handle this is to have runtime options in FoLiA-txt to choose the strategy, and unfortunately there is no default case that is always the best. Older corpora do have the ¬ as a soft hyphen, but modern text often has the - at the line end in the same role.
@proycon and @pirolen ANY comments?
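To make the trade-off concrete, here is a minimal Python sketch of a caller-selectable strategy on plain text. It does not mirror the actual FoLiA-txt options: the function `join_line_end_hyphens` and its `hyphen_chars` parameter are hypothetical, and only hyphens directly followed by a newline are considered, so embedded hyphens are left alone.

```python
import re


def join_line_end_hyphens(text: str, hyphen_chars: str = "¬") -> str:
    """Join words that are split over two lines by one of `hyphen_chars`.

    Hypothetical helper: `hyphen_chars` lists the characters that the
    caller wants to treat as soft hyphens at the end of a line.
    """
    pattern = re.compile(
        r"(?<=\w)[" + re.escape(hyphen_chars) + r"]\n(?=\w)"
    )
    # Remove the hyphen together with the line break that follows it.
    return pattern.sub("", text)


text = "Dit is een tekst-\nje met een soft¬\nhyphen en een e-mail"

# Older corpora: only ¬ is a soft hyphen, the normal hyphen is kept.
print(join_line_end_hyphens(text))

# Modern text: treat - at the line end as a soft hyphen as well.
print(join_line_end_hyphens(text, hyphen_chars="¬-"))
```

The second call also joins genuine compound hyphens that happen to fall at a line end (the long German compounds mentioned in the comments), which is exactly why no single default fits every corpus.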