Question: Abbreviations list #9
Comments
Those lists are part of the uctodata repository, and referred to from the individual configuration files (like tokconfig-nld). Contributions there are welcome of course! (you can just send a pull request) |
Thanks!
I have a long list of idiosyncratic and critical-edition-specific abbreviations. I guess most of them are not useful for general inclusion, e.g. 'Corp.Inscr. Graec.' or 'Abh. der Sächs. Ges.d.Wiss.' or 'S.A.' (= Sonnenaufgang, 'sunrise').
Sure, I will suggest generic ones if they emerge.
Bigger context of the question:
For 'abbreviations with multiple periods' in FoLiA documents, I should make sure that
- a short (h)space is inserted after the period
- the abbreviations are treated in the text as a unit, i.e. non-splittable at line/page end.
I was wondering how to best use ucto to this end.
The language is mostly German and Latin, so I set the config to German and (as a hopeful fallback) French.
… On 10. May 2021, at 13:35, Maarten van Gompel ***@***.***> wrote:
Those lists are part of the uctodata repository, and referred to from the individual configuration files (like tokconfig-nld). Contributions there are welcome of course! (you can just send a pull request)
|
How is "z.B." supposed to be treated by ucto? Isn't it supposed to be a unit, and an instance of 'abbreviations with multiple periods'?
|
That all depends on the ucto rules for German. In this case 'z.B.' is NOT a known abbreviation, but 'z' is, so 'z.' is tagged as an abbreviation. There is also a rule that interprets an uppercase 'B.' as an initial:
Hence the split into 2 words. Adding z.B. to the abbreviation list should be enough. If you are satisfied, a pull request is welcome |
To make life easier, I separated the German abbreviations into a separate file you can edit: deu.abr |
Ok, I am a bit surprised now, as I tested it myself with the current setup:
And:
So it should work out of the box for those examples. There must be some other problem here, and I would need more context. That aside, more abbreviations are welcome (you may also send me a file). |
I ran the LaMachine command-line ucto (not python-ucto) with --uselanguages=deu,fra when I got the above tokenization (#9 (comment)). In the untokenized text there is a space between the parts of the abbreviation: "wie z. B. im Kreise". |
AH! Sorry for that limitation. |
Any type of space (e.g. short (h)space too) counts as a separator? |
I assume so. But determining spaces is hell :{ |
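Since any whitespace inside an abbreviation appears to act as a separator, one possible workaround is to normalise the text before it reaches the tokenizer. The following is only a sketch under assumptions: the abbreviation list and the function names are hypothetical, it is not part of ucto, and it changes the input text (the spacing would have to be restored afterwards if it matters).

```python
import re

# Hypothetical sample list; not taken from deu.abr.
ABBREVIATIONS = ["z.B.", "d.h.", "u.a."]


def build_pattern(abbreviations):
    """Build a regex that matches each abbreviation, allowing optional
    whitespace after the inner periods (e.g. 'z. B.' for 'z.B.')."""
    alternatives = []
    for abbr in abbreviations:
        parts = [p for p in abbr.split(".") if p]
        # \s also matches thin and no-break spaces in Python 3 strings.
        alternatives.append(r"\.\s*".join(re.escape(p) for p in parts) + r"\.")
    return re.compile(r"(?<!\w)(?:" + "|".join(alternatives) + ")")


def collapse_abbreviations(text, abbreviations=ABBREVIATIONS):
    """Remove whitespace inside known abbreviations; leave other text as-is."""
    pattern = build_pattern(abbreviations)
    return pattern.sub(lambda m: re.sub(r"\s+", "", m.group(0)), text)


print(collapse_abbreviations("wie z. B. im Kreise"))
```

The match is case-sensitive on purpose, so 'Z. B.' would need its own entry in the list.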
OK, good to know. Back to the deu.abr list: "d.h." is not in this list. How come the command line ucto covers it? |
That is covered by the ABBREVIATION rule in the 'tokconfig-deu' file
This regexp says something along these lines: This will catch 'z.B.' and 'd.h.', and also '(z.B.' and '{d.h.}', but NOT 'z. B.' or 'd. h.' |
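To illustrate the behaviour described here, a rough Python approximation follows. It is NOT the actual ABBREVIATION rule from tokconfig-deu (ucto rules have their own syntax); the pattern and the function name are assumptions chosen only to reproduce the listed examples.

```python
import re

# Illustrative approximation only; not the rule shipped in tokconfig-deu.
# Two or more groups of 1-3 letters, each followed by a period, with no
# whitespace in between and no word character directly before.
MULTI_PERIOD = re.compile(r"(?<!\w)(?:[^\W\d_]{1,3}\.){2,}")


def find_abbreviations(text):
    """Return all multi-period abbreviation candidates found in text."""
    return [m.group(0) for m in MULTI_PERIOD.finditer(text)]


print(find_abbreviations("wie z.B. im Kreise"))   # unspaced form
print(find_abbreviations("wie z. B. im Kreise"))  # spaced form
```

Because the pattern requires the letter groups to be directly adjacent, the spaced form 'z. B.' yields no match, which mirrors the limitation discussed above.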
I suggested some items for inclusion in the German abbrev list: LanguageMachines/uctodata@master...pirolen:patch-1 (several are actually of Latin origin...) |
They seem OK to me. So I merged them |
I'd like to supply custom abbreviations for python-ucto in my dev LaMachine. |
Simply adding them to |
Thanks! Yet another Q: (I would be happy to add more abbrevs to https://github.com/LanguageMachines/uctodata/blob/master/config/deu.abr, but some are really domain-specific and I am not sure how much of that you'd like to have.) |
Well: tokconfig-deu will be overwritten on a LaMachine update too, so:
Yes that's the way to do it. And you should run Ucto using the '-c' option to refer to your own config:
Your other question needs some more thinking. What exactly would you like to come out? Keeping |
Thanks! In fact, it would be nice if there were an option to access all sentences without tokenization. To achieve this, after getting the sentences from the wrapper via tokenizer.sentences(), I am simply re-joining the punctuation marks with the tokens, using string.punctuation... :-o |
OK, so your problem is that this utterance is split into 3 sentences. Hmm. That might not be that easy...
Well, for NON-FoLiA files there is the undocumented --split option which does this. But it would still give 3 sentences on this input:
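The re-joining step mentioned here could look roughly like the sketch below. It works on a plain list of token strings and does not call python-ucto; the function name and the set of opening brackets are assumptions, and the heuristic will not restore the original spacing in every case.

```python
import string

# Opening brackets attach to the following token rather than the previous one.
OPENING = set("([{")


def rejoin(tokens):
    """Join token strings into one sentence string, attaching punctuation
    to the neighbouring word instead of separating it with a space."""
    out = ""
    prev = None
    for tok in tokens:
        # A token counts as closing punctuation if it consists only of
        # punctuation characters and is not an opening bracket.
        is_closing_punct = (
            all(ch in string.punctuation for ch in tok) and tok not in OPENING
        )
        if prev is None or is_closing_punct or prev in OPENING:
            out += tok
        else:
            out += " " + tok
        prev = tok
    return out


print(rejoin(["wie", "z.B.", "im", "Kreise", "."]))
```

Tokens such as 'z.B.' contain letters, so they are treated as ordinary words and keep their leading space.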
Without --split:
But even if we fix sentence splitting to get just one sentence, this won't work for FoLiA files (not implemented at all). |
@pirolen A quick fix might be this: In your my-tokconfig-deu file replace:
by
AND: be sure to add This seems to do the trick:
I'm not sure if this will disturb tokenization otherwise; at first glance all seems OK. One thing that might bite you: a sentence ending in such a number will no longer be detected as such. |
Thanks!! For me the renamed/customised config file does not work, neither with ucto -c on the CLI as in your example, nor for python-ucto in a script as usual, i.e. configurationfile = "my-tokconfig-deu". In both cases I get: :-( |
Hmm, works for me on the command-line. So maybe a LaMachine oddity?
|
Or am I missing something about the extension convention? In LaMachine there is no .deu extension, if I see it correctly. |
No, the extension doesn't matter. |
Using the custom config file with the command line ucto in LaMachine works if:
|
Is lower/uppercasing retained, i.e. 'Vgl' and 'vgl' are different for ucto? |
Yes, these are different. You can change that in the ABBREVIATION-KNOWN rule, using 'ignore case (
What is the best way to supply a list of known abbreviations to python-ucto and ucto in LaMachine?