Question: Abbreviations list #9

pirolen · 2021-05-10T10:07:45Z

What is the best way to supply a list of known abbreviations to python-ucto and ucto in LaMachine?

proycon · 2021-05-10T11:35:14Z

Those lists are part of the uctodata repository, and referred to from the individual configuration files (like tokconfig-nld). Contributions there are welcome of course! (you can just send a pull request)

pirolen · 2021-05-10T12:06:26Z

Thanks! I have a long list of idiosyncratic and critical-edition-specific abbreviations; I guess most of them are not useful to include in general, e.g. 'Corp.Inscr. Graec.' or 'Abh. der Sächs. Ges.d.Wiss.' or S.A. (=Sonnenaufgang). Sure, I will suggest generic ones if they emerge.

Bigger context of the question: for 'Abbreviations with multiple periods' in FoLiA documents, I should make sure that
- a short (h)space is inserted after the period,
- the abbreviations are treated in the text as a unit, i.e. non-splittable at line/page end.

I was wondering how best to use ucto to this end. The language is mostly German and Latin, so I set the config to German and (as a hopeful fallback) French.


pirolen · 2021-05-10T12:27:21Z

How is "z.B." supposed to be treated by ucto?
It is not part of the list in https://github.com/LanguageMachines/uctodata/blob/master/config/tokconfig-deu,
and is tokenized with the classes "ABBREVIATION-KNOWN" plus "INITIAL".

Isn't it supposed to be a unit and an instance of 'Abbreviations with multiple periods'?
Maybe @kosloot can tell why; I don't know ucto that well yet.

<w xml:id="FA-b1_3_1_mwtext_ostpreuss_pp109_277_006_abpproc.text.div1.p4.s.4.w.43" class="ABBREVIATION-KNOWN" set="tokconfig-deu" textclass="OCR">
  <t class="OCR">z.</t>
</w>
<w xml:id="FA-b1_3_1_mwtext_ostpreuss_pp109_277_006_abpproc.text.div1.p4.s.4.w.44" class="INITIAL" set="tokconfig-deu" textclass="OCR">
  <t class="OCR">B.</t>
</w>

kosloot · 2021-05-10T16:02:34Z

That all depends on the ucto rules for German.

In this case z.B. is NOT a known abbreviation. But 'z' is. So 'z.' is tagged as an abbreviation.

There is also a rule that interprets an uppercase 'B.' as an initial:

#retain initials
INITIAL=^(?:\p{Lt}|\p{Lu})\.$

Hence the split into 2 words.
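
To see why "z. B." ends up as two differently classified tokens, the INITIAL rule above can be imitated in plain Python. This is only a sketch: Python's built-in re module has no \p{Lu} or \p{Lt}, so the check uses unicodedata instead, and is_initial is a hypothetical helper name, not part of ucto.

```python
import unicodedata

def is_initial(token: str) -> bool:
    # Imitates INITIAL=^(?:\p{Lt}|\p{Lu})\.$ : exactly one uppercase or
    # titlecase letter followed by a period.
    return (
        len(token) == 2
        and token[1] == "."
        and unicodedata.category(token[0]) in ("Lu", "Lt")
    )

# The input "z. B." reaches the rules as two whitespace-separated tokens.
for token in "z. B.".split():
    print(token, is_initial(token))
```

Only 'B.' satisfies the check; 'z.' has to be picked up by another rule, which here is the known-abbreviation list.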

Adding z.B. to the abbreviation list should be enough.
My advice is to get a copy of uctodata (from Git), make the additions you would like to see, and install them on your system ('make install').

If you are satisfied, a pull request is welcome

kosloot · 2021-05-10T19:33:12Z

To make life easier, I separated the German abbreviations into a separate file you can edit: deu.abr

pirolen · 2021-05-12T10:39:14Z

OK, for the pull request I would only add generic abbreviations, right? (Attached an illustration.)

The multi-element abbreviations like d.h. or z.B. need to be in the list as d\.h and z\.B?

kosloot · 2021-05-12T11:11:39Z

Ok, I am a bit surprised now, as I tested it myself with the current setup:

ucto -Ldeu -v
ucto: inputfile = 
ucto: outputfile = 
ucto: textcat configured from: /home/sloot/usr/local/share/ucto/textcat.cfg
ucto: configured for languages: [deu]
ucto> Dass ist z.B. falsch . <utt> 
Dass	WORD	BEGINOFSENTENCE NEWPARAGRAPH 
ist	WORD	
z.B.	ABBREVIATION	
falsch	WORD	
.	PUNCTUATION	ENDOFSENTENCE

And:

ucto> d.h. dass es gut geht?
d.h.	ABBREVIATION	BEGINOFSENTENCE 
dass	WORD	
es	WORD	
gut	WORD	
geht	WORD	NOSPACE 
?	PUNCTUATION	ENDOFSENTENCE

So it should work out of the box for those examples.

So there must be some other problem here?? I would need more context.

Despite that: more abbreviations are welcome. (you may also send me a file)
Remember that abbreviations that can be mistaken with real words must be marked.
Like :
dass\.1 and NOT dass

pirolen · 2021-05-12T11:29:32Z

I ran the LaMachine command-line ucto (not python-ucto) with --uselanguages=deu,fra when I got the above tokenization (#9 (comment)).

In the untokenized text there is a space between the parts of the abbreviation: "wie z. B. im Kreise".

kosloot · 2021-05-12T11:49:29Z

In the untokenized text there is a space between the parts of the abbreviation: "wie z. B. im Kreise".

AH!
Well it cannot be solved by Ucto in that case. A space is the major (and only unchangeable) separator between tokens in Ucto.

Sorry for that limitation.

pirolen · 2021-05-12T11:51:22Z

Any type of space (e.g. short (h)space too) counts as a separator?

kosloot · 2021-05-12T11:57:52Z

I assume so. But determining spaces is hell :{
Ucto uses ICU's u_isspace() function to do so.
See:
https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/uchar_8h.html#a48dd198b451e691cf81eb41831474ddc
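
For the short-space question, a rough check is possible without ICU. As far as I can tell from the ICU documentation, u_isspace() is true for the Unicode separator categories (Zs, Zl, Zp) plus a few control characters. The Python function below imitates that description; it is an approximation for illustration, not a binding to ICU or ucto, and approx_u_isspace is a made-up name.

```python
import unicodedata

def approx_u_isspace(ch: str) -> bool:
    # Approximation of ICU's u_isspace(): separator categories plus the
    # control characters TAB..CR, U+001C..U+001F and U+0085.
    cp = ord(ch)
    if unicodedata.category(ch) in ("Zs", "Zl", "Zp"):
        return True
    return 0x09 <= cp <= 0x0D or 0x1C <= cp <= 0x1F or cp == 0x85

for name, ch in [
    ("SPACE", " "),
    ("NO-BREAK SPACE", "\u00a0"),
    ("THIN SPACE", "\u2009"),
    ("NARROW NO-BREAK SPACE", "\u202f"),
    ("ZERO WIDTH SPACE", "\u200b"),
]:
    print(f"U+{ord(ch):04X} {name}: {approx_u_isspace(ch)}")
```

Under this approximation thin and narrow no-break spaces count as separators, while ZERO WIDTH SPACE (category Cf) does not.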

pirolen · 2021-05-12T17:32:26Z

OK, good to know.

Back to the deu.abr list: "d.h." is not in this list. How come the command line ucto covers it?

kosloot · 2021-05-12T20:07:57Z

That is covered by the ABBREVIATION rule in the 'tokconfig-deu' file

#Abbreviations with multiple periods
ABBREVIATION=^(?:[{(\[<]?)(\p{L}{1,3}(?:\.\p{L}{1,3})+\.?)(?:\Z|[,:;})\]>])

This regexp says something along the line:
dot-separated sequences of 1-3 characters are considered an abbreviation;
even when placed between brackets like '{ }' '[ ]' or '( )'.
And optionally ending with a ',' ':' or ';'

This will catch 'z.B.' and 'd.h.', and also '(z.B.' and '{d.h.}', but NOT 'z. B.' or 'd. h.'.
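
The behaviour described above can be tried out in Python. This is a sketch under assumptions: the pattern is written with the periods and square brackets backslash-escaped (which is what the description implies), [^\W\d_] stands in for \p{L} because Python's re module lacks Unicode property classes, and the helper name is made up.

```python
import re

# Stand-in for \p{L}: any word character that is neither a digit nor "_".
LETTER = r"[^\W\d_]"

ABBREVIATION = re.compile(
    r"^(?:[{(\[<]?)("
    + LETTER + r"{1,3}(?:\." + LETTER + r"{1,3})+\.?"
    + r")(?:\Z|[,:;})\]>])"
)

def is_multi_period_abbreviation(token: str) -> bool:
    return ABBREVIATION.match(token) is not None

# Tokens without inner whitespace are tested as a whole ...
for token in ["z.B.", "d.h.", "(z.B.", "{d.h.}", "v.Chr.", "Jhs."]:
    print(token, is_multi_period_abbreviation(token))

# ... but "z. B." is split on the space first, so each part is tested alone.
for token in "z. B.".split():
    print(token, is_multi_period_abbreviation(token))
```

A single-period form such as 'Jhs.' does not fit this pattern, which is why it has to come from the abbreviation list instead.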

pirolen · 2021-05-31T18:19:43Z

I suggested some items for inclusion in the German abbrev list: LanguageMachines/uctodata@master...pirolen:patch-1

(several are actually of Latin origin...)

kosloot · 2021-06-01T10:38:49Z

They seem OK to me. So I merged them

pirolen · 2021-07-06T10:58:19Z

Adding z.B. to the abbreviation list should be enough.
My advice is to get a copy of uctodata (from Git), make the additions you would like to see, and install them on your system ('make install').

I'd like to supply custom abbreviations for python-ucto in my dev LaMachine.
Would it work to add them to lmdev/src/uctodata/config/deu.abr and then take some action to refresh the tool?

proycon · 2021-07-06T11:42:15Z

Simply adding them to deu.abr should work, yes, but those changes may be overwritten again on a LaMachine update. Alternatively you could make your own ucto configuration (copy tokconfig-deu) and refer to your own abbreviation file. No need to refresh the tool; the data is loaded dynamically when the tokeniser binding is instantiated.

pirolen · 2021-07-06T12:07:48Z

Alternatively you could make your own ucto configuration (copy tokconfig-deu) and refer to your own abbreviation file.

Thanks!
The copy should be still called tokconfig-deu (and saved somewhere)? Or renamed as e.g. my-tokconfig-deu?
Would it (also) work to refer to my own abbrev list? e.g.:
[ABBREVIATIONS]
%include my-deu

Yet another Q:
(Where) Would it be possible to address phrases such as "der dort in die Zeit des 1. Jhs. v.Chr. eingeordnet wird"?
I added Jhs, Chr and v.Chr to deu.abr.
In tokconfig-deu, the line of #NUMBER-ORDINAL is commented out.
If I uncomment it, the sentence still gets split after '1.' and after 'Jhs.' .

(I would be happy to add more abbrevs to https://github.com/LanguageMachines/uctodata/blob/master/config/deu.abr, but some are really domain-specific and I am not sure how much of that you'd like to have.)

kosloot · 2021-07-06T12:49:24Z

Well, tokconfig-deu will be overwritten on a LaMachine update too. So:

The copy should be still called tokconfig-deu (and saved somewhere)? Or renamed as e.g. my-tokconfig-deu?
Would it (also) work to refer to my own abbrev list? e.g.:
[ABBREVIATIONS]
%include my-deu

Yes that's the way to do it. And you should run Ucto using the '-c' option to refer to your own config:

ucto -c my-tokconfig-deu ...

Your other question needs some more thinking. What exactly would you like to come out? Keeping 1. Jhs. together as one token?

pirolen · 2021-07-06T12:58:25Z

Thanks!
For the current use case I only need sentence segmentation, and so that the sentence does not get chopped up due to these abbreviations (the entire sentence here is: "Im Talmud erwähnter charismatischer Wundermann, der dort in die Zeit des 1. Jhs. v.Chr. eingeordnet wird und besonders als Regenmacher bekannt war.".)

In fact, it would be nice if there would be an option to access all sentences without tokenization. To achieve this, after getting the sentences from the wrapper by tokenizer.sentences(), I am simply re-joining the punctuation marks with the tokens, using string.punctuation... :-o
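
A minimal sketch of that re-joining step, assuming each sentence is already available as a list of token strings (how they are obtained from python-ucto is not shown here). rejoin is a hypothetical helper name, and the rule is deliberately naive: every token made up only of string.punctuation characters is glued to the preceding token.

```python
import string

def rejoin(tokens):
    # Glue punctuation-only tokens to the text before them; put a single
    # space in front of every other token except the first.
    out = ""
    for tok in tokens:
        if out and all(ch in string.punctuation for ch in tok):
            out += tok
        else:
            out += (" " if out else "") + tok
    return out

print(rejoin(["Dass", "ist", "z.B.", "falsch", "."]))
```

Opening brackets and quotes would also be attached to the preceding word this way, so it is only a stopgap.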

kosloot · 2021-07-06T13:32:14Z

OK, so your problem is that this utterance is split into 3 sentences. Hmm. That might not be that easy...
In fact, quite hard.

In fact, it would be nice if there would be an option to access all sentences without tokenization.

Well for NON-FoLiA files there is the undocumented --split option which does this. But still would give 3 sentences on this input:

$ more piro
Im Talmud erwähnter charismatischer Wundermann, der dort in die Zeit des 1. Jhs. v.Chr. eingeordnet wird und besonders als Regenmacher bekannt war.

$ ucto -Ldeu piro --split
Im Talmud erwähnter charismatischer Wundermann, der dort in die Zeit des 1. <utt> 
Jhs. <utt> 
v.Chr. eingeordnet wird und besonders als Regenmacher bekannt war. <utt>

Without --split:

$ ucto -Ldeu piro 

Im Talmud erwähnter charismatischer Wundermann , der dort in die Zeit des 1 . <utt> Jhs . <utt> v.Chr. eingeordnet wird und besonders als Regenmacher bekannt war . <utt>

But even if we fix sentence splitting to get just one sentence, this won't work for FoLiA files (not implemented at all).

kosloot · 2021-07-06T13:59:34Z

@pirolen A quick fix might be this:

In your my-tokconfig-deu file replace:

#retain digits, including those starting with initial period (.22), and negative numbers
NUMBER=-?(?:[\.,]?\p{N}+)

by

#retain digits, including those starting with initial period (.22) or ending with a period (1.), and  also negative numbers
NUMBER=-?(?:[\.,]?\p{N}+)(?:[\.])?

AND: be sure to add Jhs. to your abbreviation list.

This seems to do the trick:

$ ucto -c my-deu piro 
die Zeit des 1. Jhs. v.Chr. eingeordnet <utt>

I'm not sure whether this will disturb tokenization otherwise; at first glance all seems OK.

One thing that might bite you: A sentence ending on such a number will no longer be detected as such.
So: "Siehe Seite 5. Alles Gute" will be 1 sentence.
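
The difference between the two NUMBER rules can be illustrated in Python. This is a sketch, not ucto's matching engine: \d stands in for \p{N} (Python's re has no Unicode property classes, and \d covers only decimal digits), and matched_prefix is a hypothetical helper that shows how much of a token each pattern consumes from the start.

```python
import re

# \d replaces \p{N}; otherwise both patterns are copied from the rules above.
NUMBER_OLD = re.compile(r"-?(?:[\.,]?\d+)")
NUMBER_NEW = re.compile(r"-?(?:[\.,]?\d+)(?:[\.])?")

def matched_prefix(pattern, token):
    # Return the part of the token matched from its start, or None.
    m = pattern.match(token)
    return m.group(0) if m else None

for token in ["1.", ".22", "-3", "5."]:
    print(token, matched_prefix(NUMBER_OLD, token), matched_prefix(NUMBER_NEW, token))
```

With the old pattern the trailing period of '1.' is left over as a separate piece; with the new one it stays attached to the number, which is also why 'Seite 5.' no longer ends a sentence.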

pirolen · 2021-07-06T15:31:11Z

Thanks!!
I need to stay in FoLiA.

For me the renamed/customised config file does not work, neither with ucto -c on the CLI as in your example, nor with python-ucto in a script as usual, i.e.

import ucto

configurationfile = "my-tokconfig-deu"
tokenizer = ucto.Tokenizer(configurationfile)

In both cases I get:
ucto:Unable to open configfile:
ucto:Cannot read Tokenizer settingsfile my-tokconfig-deu
ucto:Unsupported language? (Did you install the uctodata package?)

:-(

kosloot · 2021-07-06T15:55:33Z

Hmm, works for me on the command-line. So maybe a LaMachine oddity?

ucto -c my-tokconfig.deu 
ucto: inputfile = 
ucto: outputfile = 
ucto: textcat configured from: /home/sloot/usr/local/share/ucto/textcat.cfg
ucto: configured from file: my-tokconfig.deu
ucto> Siehe Seite 5. Alles Gute
Siehe Seite 5. Alles Gute <utt>

pirolen · 2021-07-06T16:43:29Z

Or I don't know something about the extension convention? In LM there is no .deu extension if I see it well.

kosloot · 2021-07-06T18:27:47Z

No, the extension doesn't matter.
The abbreviation file should have the extension .abr, though.

pirolen · 2021-09-27T11:01:12Z

Using the custom config file with the command line ucto in LaMachine works if:

- the custom abbreviation list is referenced from the config file (e.g. 'my-tokconfig.deu') using the full path, e.g.
  [ABBREVIATIONS]
  %include /home/ubuntu/lama/src/uctodata/config/my-deu.abr
- the '-c' option is used to refer to 'my-tokconfig.deu' with its full file path, and
- the --uselanguages option is not specified.
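
The first condition can be automated when preparing a copy of the config. The following is a sketch only: point_abbreviations_at is a hypothetical helper, the SAMPLE text (including the [OTHER] section) is invented for the demonstration, and the only thing taken from this thread is the '[ABBREVIATIONS]' / '%include' form.

```python
from pathlib import Path

def point_abbreviations_at(config_text: str, abbr_file: str) -> str:
    # Replace every %include line inside [ABBREVIATIONS] with one that
    # uses the absolute path of abbr_file; leave other sections untouched.
    absolute = Path(abbr_file).resolve()
    out, in_section = [], False
    for line in config_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("[") and stripped.endswith("]"):
            in_section = stripped == "[ABBREVIATIONS]"
        elif in_section and stripped.startswith("%include"):
            line = f"%include {absolute}"
        out.append(line)
    return "\n".join(out) + "\n"

# Invented sample config text, for illustration only.
SAMPLE = "[OTHER]\n%include foo\n[ABBREVIATIONS]\n%include my-deu\n"
print(point_abbreviations_at(SAMPLE, "my-deu.abr"))
```

The result would still have to be saved and passed to ucto with '-c' and its own full path, as described above.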

pirolen · 2021-09-27T11:13:04Z

Is lower/uppercasing retained, i.e. 'Vgl' and 'vgl' are different for ucto?

kosloot · 2021-09-27T18:39:01Z

Is lower/uppercasing retained, i.e. 'Vgl' and 'vgl' are different for ucto?

Yes, these are different. You can change that in the ABBREVIATION-KNOWN rule by using the ignore-case flag '(?i)' in the regexp.
That would render ALL abbreviations case-insensitive.
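
The effect of '(?i)' can be shown with Python's re module. The patterns below are hypothetical stand-ins; the real ABBREVIATION-KNOWN rule is not shown in this thread, so this only illustrates the flag, not ucto's rule.

```python
import re

# Hypothetical stand-in for a known-abbreviation pattern; the real
# ABBREVIATION-KNOWN rule is not shown in this thread.
CASE_SENSITIVE = re.compile(r"^(?:vgl|Jhs)\.$")
CASE_INSENSITIVE = re.compile(r"(?i)^(?:vgl|Jhs)\.$")

for token in ["vgl.", "Vgl.", "JHS."]:
    print(
        token,
        bool(CASE_SENSITIVE.match(token)),
        bool(CASE_INSENSITIVE.match(token)),
    )
```

Note that the flag applies to the whole pattern, so forms like 'JHS.' are accepted too, which matches the warning that all abbreviations become case-insensitive.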

proycon added the question label Jul 6, 2021
