Question: Abbreviations list #9

Open
pirolen opened this issue May 10, 2021 · 29 comments

@pirolen

pirolen commented May 10, 2021

What is the best way to supply a list of known abbreviations to python-ucto and ucto in LaMachine?

@proycon
Owner

proycon commented May 10, 2021

Those lists are part of the uctodata repository, and referred to from the individual configuration files (like tokconfig-nld). Contributions there are welcome of course! (you can just send a pull request)

@pirolen
Author

pirolen commented May 10, 2021 via email

@pirolen
Author

pirolen commented May 10, 2021

How is "z.B." supposed to be treated by ucto?
It is not part of the list in https://github.com/LanguageMachines/uctodata/blob/master/config/tokconfig-deu,
and is tokenized with the classes "ABBREVIATION-KNOWN" plus "INITIAL".

Isn't it supposed to be somehow a unit and an instance of 'Abbreviations with multiple periods’?
Maybe @kosloot can tell why, I don't know ucto yet that much.

 <w xml:id="FA-b1_3_1_mwtext_ostpreuss_pp109_277_006_abpproc.text.div1.p4.s.4.w.43" class="ABBREVIATION-KNOWN" set="tokconfig-deu" textclass="OCR">
            <t class="OCR">z.</t>
          </w>
          <w xml:id="FA-b1_3_1_mwtext_ostpreuss_pp109_277_006_abpproc.text.div1.p4.s.4.w.44" class="INITIAL" set="tokconfig-deu" textclass="OCR">
            <t class="OCR">B.</t>
          </w>

@kosloot

kosloot commented May 10, 2021

That all depends on the ucto rules for German.

In this case z.B. is NOT a known abbreviation. But 'z' is. So 'z.' is tagged as an abbreviation.

Also, there is a rule that interprets an uppercase B. as an initial:

#retain initials
INITIAL=^(?:\p{Lt}|\p{Lu})\.$

Hence the split into 2 words.

Adding z.B. to the abbreviation list should be enough.
My advice is to get a copy of uctodata (from Git), make the additions you would like to see, and install them on your system
('make install').

If you are satisfied, a pull request is welcome
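
The split described above can be illustrated with a small sketch. It is not ucto's code: Python's standard `re` module has no `\p{Lu}`/`\p{Lt}` classes, so the INITIAL rule is approximated here with `unicodedata` categories, and the function name `is_initial` is made up for this example.

```python
import unicodedata

def is_initial(chunk: str) -> bool:
    # Approximates the rule  INITIAL=^(?:\p{Lt}|\p{Lu})\.$  quoted above:
    # exactly one uppercase (Lu) or titlecase (Lt) letter followed by a period.
    # Assumption: this only mimics the regex, not ucto's rule engine.
    return (
        len(chunk) == 2
        and chunk[1] == "."
        and unicodedata.category(chunk[0]) in ("Lu", "Lt")
    )

if __name__ == "__main__":
    for chunk in ["B.", "z.", "z.B."]:
        print(chunk, is_initial(chunk))
```

Under this approximation 'B.' qualifies as an initial while 'z.' does not, which matches the two tokens in the FoLiA output earlier in the thread.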

@kosloot

kosloot commented May 10, 2021

To make life easier, I separated the German abbreviations into a separate file you can edit: deu.abr

@pirolen
Author

pirolen commented May 12, 2021

OK, for the pull request I would only add generic abbreviations, right? (Attached an illustration.)

Do multi-element abbreviations like d.h. or z.B. need to be in the list as d\.h and z\.B?

Screenshot 2021-05-12 at 12 36 28

@kosloot

kosloot commented May 12, 2021

Ok, I am a bit surprised now, as I tested it myself with the current setup:

ucto -Ldeu -v
ucto: inputfile = 
ucto: outputfile = 
ucto: textcat configured from: /home/sloot/usr/local/share/ucto/textcat.cfg
ucto: configured for languages: [deu]
ucto> Dass ist z.B. falsch . <utt> 
Dass	WORD	BEGINOFSENTENCE NEWPARAGRAPH 
ist	WORD	
z.B.	ABBREVIATION	
falsch	WORD	
.	PUNCTUATION	ENDOFSENTENCE 

And:

ucto> d.h. dass es gut geht?
d.h.	ABBREVIATION	BEGINOFSENTENCE 
dass	WORD	
es	WORD	
gut	WORD	
geht	WORD	NOSPACE 
?	PUNCTUATION	ENDOFSENTENCE 

So it should work out of the box for those examples.

So there must be some other problem here?? I would need more context.

Despite that: more abbreviations are welcome. (you may also send me a file)
Remember that abbreviations that can be mistaken for real words must be marked, like:
dass\.1 and NOT dass

@pirolen
Author

pirolen commented May 12, 2021

I ran the LaMachine command-line ucto (not python-ucto) with --uselanguages=deu,fra when I got the above tokenization (#9 (comment)).

In the untokenized text there is a space between the parts of the abbreviation: "wie z. B. im Kreise".

@kosloot

kosloot commented May 12, 2021

In the untokenized text there is a space between the parts of the abbreviation: "wie z. B. im Kreise".

AH!
Well it cannot be solved by Ucto in that case. A space is the major (and only unchangeable) separator between tokens in Ucto.

Sorry for that limitation.

@pirolen
Author

pirolen commented May 12, 2021

Any type of space (e.g. short (h)space too) counts as a separator?

@kosloot

kosloot commented May 12, 2021

I assume so. But determining spaces is hell :{
Ucto uses the ICU u_isspace() function to do so.
See:
https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/uchar_8h.html#a48dd198b451e691cf81eb41831474ddc
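
To get a rough feel for which characters are likely to count as separators, here is a sketch using Python's `str.isspace()`. This is an approximation only: Python's notion of whitespace is not ICU's `u_isspace()`, although both cover the Unicode space-separator category (Zs). The character selection and the name `whitespace_report` are assumptions of this example; ucto itself is the final judge.

```python
# Candidate separators, keyed by their Unicode names.
CANDIDATES = {
    "SPACE": "\u0020",
    "NO-BREAK SPACE": "\u00a0",
    "THIN SPACE": "\u2009",
    "NARROW NO-BREAK SPACE": "\u202f",
    "ZERO WIDTH SPACE": "\u200b",  # category Cf, not a space separator
}

def whitespace_report(chars=CANDIDATES):
    # Map each name to whether Python treats the character as whitespace.
    return {name: ch.isspace() for name, ch in chars.items()}

if __name__ == "__main__":
    for name, flag in whitespace_report().items():
        print(f"{name}: {flag}")
    # A thin space between 'z.' and 'B.' still separates them for str.split():
    print("wie z.\u2009B. im Kreise".split())
```

If ucto behaves like this approximation, a thin space inside "z. B." splits the abbreviation just as a normal space does.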

@pirolen
Author

pirolen commented May 12, 2021

OK, good to know.

Back to the deu.abr list: "d.h." is not in this list. How come the command line ucto covers it?

@kosloot

kosloot commented May 12, 2021

That is covered by the ABBREVIATION rule in the 'tokconfig-deu' file

#Abbreviations with multiple periods
ABBREVIATION=^(?:[{(\[<]?)(\p{L}{1,3}(?:\.\p{L}{1,3})+\.?)(?:\Z|[,:;})\]>])

This regexp says something along these lines:
dot-separated sequences of 1-3 letters are considered an abbreviation,
even when placed between brackets like '{ }', '[ ]' or '( )',
and optionally followed by a ',', ':' or ';'.

This will catch 'z.B.' and 'd.h.', and also '(z.B.' and '{d.h.}', but NOT 'z. B.' or 'd. h.'.
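
The behaviour of this rule can be tried out with a small Python sketch. Two assumptions are involved: the dots and square brackets in the rule are taken to be backslash-escaped in the real config file, and `\p{L}` is replaced by `[^\W\d_]` because Python's standard `re` module lacks Unicode property classes. The helper name is invented for this example, and the chunks are produced by splitting on whitespace, which is how the thread describes ucto's first step.

```python
import re

LETTER = r"[^\W\d_]"  # stand-in for \p{L} (any Unicode letter)

# Approximation of the "Abbreviations with multiple periods" rule.
ABBREVIATION = re.compile(
    r"^(?:[{(\[<]?)"                                        # optional opening bracket
    r"(" + LETTER + r"{1,3}(?:\." + LETTER + r"{1,3})+\.?)"  # z.B. / d.h. / v.Chr.
    r"(?:\Z|[,:;})\]>])"                                     # end, or closing punctuation
)

def is_multi_period_abbreviation(chunk: str) -> bool:
    return ABBREVIATION.match(chunk) is not None

if __name__ == "__main__":
    for text in ["wie z.B. im Kreise", "wie z. B. im Kreise"]:
        hits = [c for c in text.split() if is_multi_period_abbreviation(c)]
        print(repr(text), "->", hits)
```

With the space inside "z. B.", neither 'z.' nor 'B.' satisfies the rule on its own, so each falls through to other rules (known abbreviation, initial).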

@pirolen
Author

pirolen commented May 31, 2021

I suggested some items for inclusion in the German abbrev list: LanguageMachines/uctodata@master...pirolen:patch-1

(several are actually of Latin origin...)

@kosloot

kosloot commented Jun 1, 2021

They seem OK to me. So I merged them

@pirolen
Author

pirolen commented Jul 6, 2021

Adding z.B. to the abbreviation list should be enough.
My advice is to get a copy of uctodata (from Git), make the additions you would like to see, and install them on your system
('make install')

I'd like to supply custom abbreviations for python-ucto in my dev LaMachine.
Would it work to add them to lmdev/src/uctodata/config/deu.abr and some action to refresh the tool?

@proycon
Owner

proycon commented Jul 6, 2021

Simply adding them to deu.abr should work, yes, but those changes may be overwritten again on a LaMachine update. Alternatively you could make your own ucto configuration (copy tokconfig-deu) and refer to an abbreviation file of your own. No need to refresh the tool; the data will be loaded dynamically when the tokeniser binding instantiates.

@pirolen
Author

pirolen commented Jul 6, 2021

Alternatively you could make your own ucto configuration (copy tokconfig-deu) and refer to an abbreviation file of your own.

Thanks!
The copy should be still called tokconfig-deu (and saved somewhere)? Or renamed as e.g. my-tokconfig-deu?
Would it (also) work to refer to my own abbrev list? e.g.:
[ABBREVIATIONS]
%include my-deu

Yet another Q:
(Where) Would it be possible to address phrases such as "der dort in die Zeit des 1. Jhs. v.Chr. eingeordnet wird"?
I added Jhs, Chr and v.Chr to deu.abr.
In tokconfig-deu, the line of #NUMBER-ORDINAL is commented out.
If I uncomment it, the sentence still gets split after '1.' and after 'Jhs.' .

(I would be happy to add more abbrevs to https://github.com/LanguageMachines/uctodata/blob/master/config/deu.abr, but some are really domain-specific and I am not sure how much of that you'd like to have.)

@kosloot

kosloot commented Jul 6, 2021

Well, tokconfig-deu will be overwritten on a LaMachine update too. So:

The copy should be still called tokconfig-deu (and saved somewhere)? Or renamed as e.g. my-tokconfig-deu?
Would it (also) work to refer to my own abbrev list? e.g.:
[ABBREVIATIONS]
%include my-deu

Yes that's the way to do it. And you should run Ucto using the '-c' option to refer to your own config:

ucto -c my-tokconfig-deu ...

Your other question needs some more thinking. What exactly would you like to come out? Keeping 1. Jhs. together as one token?

@pirolen
Author

pirolen commented Jul 6, 2021

Thanks!
For the current use case I only need sentence segmentation, and so that the sentence does not get chopped up due to these abbreviations (the entire sentence here is: "Im Talmud erwähnter charismatischer Wundermann, der dort in die Zeit des 1. Jhs. v.Chr. eingeordnet wird und besonders als Regenmacher bekannt war.".)

In fact, it would be nice if there would be an option to access all sentences without tokenization. To achieve this, after getting the sentences from the wrapper by tokenizer.sentences(), I am simply re-joining the punctuation marks with the tokens, using string.punctuation... :-o
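
The re-joining workaround described above might look roughly like the following sketch. It works on a plain list of token strings rather than on python-ucto objects, and the function name `rejoin` is made up here. It is deliberately crude: any token consisting only of `string.punctuation` characters is glued to the preceding token, so opening brackets and quotes end up attached to the wrong side.

```python
import string

def rejoin(tokens):
    # Concatenate tokens with single spaces, except that a token made up
    # entirely of ASCII punctuation is attached to the previous token.
    out = ""
    for tok in tokens:
        only_punct = all(ch in string.punctuation for ch in tok)
        if out and not only_punct:
            out += " "
        out += tok
    return out

if __name__ == "__main__":
    print(rejoin(["d.h.", "dass", "es", "gut", "geht", "?"]))
```

Tokens such as 'd.h.' contain letters, so they are not treated as punctuation and keep their leading space.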

@kosloot

kosloot commented Jul 6, 2021

OK, so your problem is that this utterance is split into 3 sentences. Hmm. That might not be that easy...
In fact, quite hard.

In fact, it would be nice if there would be an option to access all sentences without tokenization.

Well for NON-FoLiA files there is the undocumented --split option which does this. But still would give 3 sentences on this input:

$ more piro
Im Talmud erwähnter charismatischer Wundermann, der dort in die Zeit des 1. Jhs. v.Chr. eingeordnet wird und besonders als Regenmacher bekannt war.

$ ucto -Ldeu piro --split
Im Talmud erwähnter charismatischer Wundermann, der dort in die Zeit des 1. <utt> 
Jhs. <utt> 
v.Chr. eingeordnet wird und besonders als Regenmacher bekannt war. <utt> 

Without --split:

$ ucto -Ldeu piro 

Im Talmud erwähnter charismatischer Wundermann , der dort in die Zeit des 1 . <utt> Jhs . <utt> v.Chr. eingeordnet wird und besonders als Regenmacher bekannt war . <utt> 

But even if we fix sentence splitting to get just one sentence, this won't work for FoLiA files (not implemented at all).

@kosloot

kosloot commented Jul 6, 2021

@pirolen A quick fix might be this:

In your my-tokconfig-deu file replace:

#retain digits, including those starting with initial period (.22), and negative numbers
NUMBER=-?(?:[\.,]?\p{N}+)

by

#retain digits, including those starting with initial period (.22) or ending with a period (1.), and also negative numbers
NUMBER=-?(?:[\.,]?\p{N}+)(?:[\.])?

AND: be sure to add Jhs. to your abbreviation list.

This seems to do the trick:

$ ucto -c my-deu piro 
die Zeit des 1. Jhs. v.Chr. eingeordnet <utt> 

I'm not sure if this will disturb tokenization otherwise; at first glance all seems OK.

One thing that might bite you: A sentence ending on such a number will no longer be detected as such.
So: "Siehe Seite 5. Alles Gute" will be 1 sentence.
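
The effect of the changed NUMBER rule, including the caveat, can be seen in a small comparison. This is a sketch under assumptions: `\p{N}` is replaced by `\d` (decimal digits only) because Python's `re` has no Unicode property classes, and "the rule covers the whole chunk" is tested with `fullmatch`, which is a simplification of however ucto applies its rules internally.

```python
import re

# Original rule:  NUMBER=-?(?:[\.,]?\p{N}+)
NUMBER_OLD = re.compile(r"-?(?:[\.,]?\d+)")
# Modified rule:  NUMBER=-?(?:[\.,]?\p{N}+)(?:[\.])?
NUMBER_NEW = re.compile(r"-?(?:[\.,]?\d+)(?:[\.])?")

def covers_whole_chunk(rule, chunk: str) -> bool:
    # True when the rule accounts for the entire chunk, trailing period included.
    return rule.fullmatch(chunk) is not None

if __name__ == "__main__":
    for chunk in ["1.", "5.", ".22", "-7"]:
        print(chunk,
              "old:", covers_whole_chunk(NUMBER_OLD, chunk),
              "new:", covers_whole_chunk(NUMBER_NEW, chunk))
```

The modified rule swallows the period of '1.' (wanted for "1. Jhs.") but equally that of '5.' in "Siehe Seite 5. Alles Gute", which is exactly the lost sentence boundary mentioned above.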

@pirolen
Author

pirolen commented Jul 6, 2021

Thanks!!
I need to stay in FoLiA.

For me the renamed/customised config file does not work, neither with ucto -c on the CLI as in your example, nor with python-ucto in a script as usual, i.e.

configurationfile = "my-tokconfig-deu"
tokenizer = ucto.Tokenizer(configurationfile)

In both cases I get:
ucto:Unable to open configfile:
ucto:Cannot read Tokenizer settingsfile my-tokconfig-deu
ucto:Unsupported language? (Did you install the uctodata package?)

:-(

@kosloot

kosloot commented Jul 6, 2021

Hmm, works for me on the command-line. So maybe a LaMachine oddity?

ucto -c my-tokconfig.deu 
ucto: inputfile = 
ucto: outputfile = 
ucto: textcat configured from: /home/sloot/usr/local/share/ucto/textcat.cfg
ucto: configured from file: my-tokconfig.deu
ucto> Siehe Seite 5. Alles Gute
Siehe Seite 5. Alles Gute <utt> 

@pirolen
Author

pirolen commented Jul 6, 2021

Or is there something I don't know about the extension convention? In LaMachine there is no .deu extension, as far as I can see.

@kosloot

kosloot commented Jul 6, 2021

No, the extension doesn't matter.
The abbreviation file should have the extension .abr though.

@pirolen
Author

pirolen commented Sep 27, 2021

Using the custom config file with the command line ucto in LaMachine works if:

  • the custom abbreviation list is referenced from the config file (e.g. 'my-tokconfig.deu') using the full path, e.g.
    [ABBREVIATIONS]
    %include /home/ubuntu/lama/src/uctodata/config/my-deu.abr

  • and the '-c' option is used to refer to 'my-tokconfig.deu' with its full filepath

  • and the --uselanguages option is not specified.
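
The first condition in the list above (a full path in the `%include` line) can be applied to a copied config with a few lines of Python. This is only a sketch: the function name `point_abbreviations_at` and the sample config text are made up for illustration, and the real tokconfig-deu contains more sections than shown. Only the `[ABBREVIATIONS]` section name and the `%include` syntax are taken from this thread.

```python
from pathlib import PurePosixPath

def point_abbreviations_at(config_text: str, abbr_file: str) -> str:
    # Return config_text with every %include line inside the [ABBREVIATIONS]
    # section replaced by one that uses the given full path.
    abbr_path = PurePosixPath(abbr_file)
    if not abbr_path.is_absolute():
        raise ValueError("use the full path to the .abr file")
    out, in_section = [], False
    for line in config_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("[") and stripped.endswith("]"):
            in_section = stripped == "[ABBREVIATIONS]"
        elif in_section and stripped.startswith("%include"):
            line = f"%include {abbr_path}"
        out.append(line)
    return "\n".join(out) + "\n"

# Hypothetical, heavily shortened stand-in for a copy of tokconfig-deu.
SAMPLE = r"""[RULES]
INITIAL=^(?:\p{Lt}|\p{Lu})\.$

[ABBREVIATIONS]
%include deu
"""

if __name__ == "__main__":
    print(point_abbreviations_at(
        SAMPLE, "/home/ubuntu/lama/src/uctodata/config/my-deu.abr"))
```

The resulting text would be saved under a new name (e.g. 'my-tokconfig.deu') and passed to ucto via '-c' with its full file path, without --uselanguages.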

@pirolen
Author

pirolen commented Sep 27, 2021

Is lower/uppercasing retained, i.e. 'Vgl' and 'vgl' are different for ucto?

@kosloot

kosloot commented Sep 27, 2021

Is lower/uppercasing retained, i.e. 'Vgl' and 'vgl' are different for ucto?

Yes, these are different. You can change that in the ABBREVIATION-KNOWN rule by using the 'ignore case' flag, (?i), in the regexp.
That would render ALL abbreviations case-insensitive.
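
What the `(?i)` flag does can be shown with a stand-in pattern. The actual ABBREVIATION-KNOWN rule is not quoted in this thread, so the regex built below is purely hypothetical, as are the sample list and the function name. Only the effect of the flag is the point.

```python
import re

KNOWN = ["vgl", "Jhs", "Chr"]  # hypothetical sample of an abbreviation list

def known_abbreviation_rule(abbrs, ignore_case=False):
    # Build a stand-in rule: one of the listed abbreviations plus a period.
    # With ignore_case=True the inline flag (?i) is placed at the very start.
    alternation = "|".join(re.escape(a) for a in abbrs)
    prefix = "(?i)" if ignore_case else ""
    return re.compile(prefix + r"^(?:" + alternation + r")\.$")

if __name__ == "__main__":
    sensitive = known_abbreviation_rule(KNOWN)
    insensitive = known_abbreviation_rule(KNOWN, ignore_case=True)
    for chunk in ["vgl.", "Vgl.", "JHS."]:
        print(chunk,
              "case-sensitive:", bool(sensitive.match(chunk)),
              "with (?i):", bool(insensitive.match(chunk)))
```

As noted above, the flag is all-or-nothing: once it is set, 'JHS.' is accepted just like 'Vgl.'. The alternative is to list both 'Vgl' and 'vgl' in the .abr file.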
