"expression" lemmas #1
Comments
Thanks for pointing this out. We will aim to fix the lemmas in the next
release.
As for the code switching, it is more difficult to decide how to handle
it. We could open a universal issue on this if there are other treebanks
with this problem. In the meantime, I think it is OK to flatten the
dependencies in the UD conversion: the information is in any case
preserved in the original treebank, whereas the UD version is more
likely to be used for training parsers, in which case including Greek
dependencies might just confuse the parser.
Dag
On 06/09/2017 11:01 AM, Martin Popel wrote:
In UD_Latin-PROIEL v2.0, about 0.5% of words have an artificial lemma.
In train+dev, there are
* 410 |greek.expression|
* 149 |expression|
* 138 |calendar|
* 11 |monetary|
* 9 |calendar.expression|
* 3 |monetary.expression|
For example,
19	esse	sum	AUX	V-	Tense=Pres|VerbForm=Inf|Voice=Act	20	cop	_	ref=1.1.2
20	ἀδύνατον	greek.expression	X	F-	_	17	xcomp	_	ref=1.1.2
21	Curium	Curius	PROPN	Ne	Case=Acc|Gender=Masc|Number=Sing	22	obj:dir	_	ref=1.1.2

22	tribuniciis	tribunicius	ADJ	A-	Case=Abl|Degree=Pos|Number=Plur	21	amod	_	ref=1.1.1
23	a	calendar	ADV	Df	_	21	amod	_	ref=1.1.1
24	d	expression	ADV	Df	_	23	flat	_	ref=1.1.1
25	xvi	xvi	ADV	Df	_	23	flat	_	ref=1.1.1
26	Kalend	Kalend	ADV	Df	_	23	flat	_	ref=1.1.1
27	Sextilis	Sextilis	ADV	Df	_	23	flat	_	ref=1.1.1

15	HS	monetary	ADV	Df	_	14	advmod	_	ref=1.6.1
16	CCCIↃↃↃX̅X̅X̅	expression	ADV	Df	_	15	flat	_	ref=1.6.1
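The counts listed above can be reproduced along these lines. This is only a minimal sketch: it assumes tab-separated CoNLL-U input with 10 columns, and the name `count_artificial_lemmas` and the set `ARTIFICIAL_LEMMAS` are made up for illustration (the set simply collects the six labels listed in this issue).

```python
from collections import Counter

# Assumption: the six placeholder lemmas reported in this issue.
ARTIFICIAL_LEMMAS = {
    "greek.expression", "expression", "calendar",
    "monetary", "calendar.expression", "monetary.expression",
}

def count_artificial_lemmas(conllu_lines):
    """Return (Counter of artificial lemmas, number of word lines seen)."""
    counts = Counter()
    total = 0
    for line in conllu_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # sentence boundary or comment line
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # skip multiword-token ranges and empty nodes
        total += 1
        if cols[2] in ARTIFICIAL_LEMMAS:  # column 3 is LEMMA
            counts[cols[2]] += 1
    return counts, total

# Hypothetical usage (the file name is an assumption):
# with open("la_proiel-ud-train.conllu", encoding="utf-8") as f:
#     counts, total = count_artificial_lemmas(f)
#     print(counts.most_common(), sum(counts.values()) / total)
```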
The guidelines
<http://universaldependencies.org/u/overview/morphology.html#lemmas> say
that "The LEMMA field should not be used to encode features or other
similar properties of the word (use FEATS and MISC instead; see format)."
Moreover, the word form should be uniquely defined by the lemma and
FEATS (except for capitalization and other orthographic synonyms).
Thus I suggest
* Keep the lemma equal to the form in these cases.
* For foreign phrases, use the standard feature Foreign=Yes
<http://universaldependencies.org/u/feat/Foreign.html> and if they
span multiple words, use the flat
<http://universaldependencies.org/u/dep/flat.html#foreign-phrases>
deprel.
* For calendar and monetary expressions, design language-specific
guidelines that are consistent with the universal guidelines
<http://universaldependencies.org/u/dep/flat.html#dates-and-complex-numerals>.
(I think no change is needed here except for fixing the lemmas.)
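The first two suggestions could be applied per token roughly as follows. This is a sketch, not the treebank's actual conversion code: `fix_token` is a hypothetical helper, and treating exactly the `greek.expression` tokens as foreign is an assumption based on the label name.

```python
# Assumption: the six placeholder lemmas reported in this issue.
ARTIFICIAL_LEMMAS = {
    "greek.expression", "expression", "calendar",
    "monetary", "calendar.expression", "monetary.expression",
}

def fix_token(cols):
    """Return a copy of a 10-column CoNLL-U token with the lemma repaired."""
    cols = list(cols)
    form, lemma, feats = cols[1], cols[2], cols[5]
    if lemma not in ARTIFICIAL_LEMMAS:
        return cols  # ordinary lemma, leave untouched
    cols[2] = form  # keep the lemma equal to the form
    if lemma == "greek.expression":
        # Mark the foreign word in FEATS instead of in LEMMA.
        pairs = [] if feats == "_" else feats.split("|")
        if "Foreign=Yes" not in pairs:
            pairs.append("Foreign=Yes")
        # FEATS are kept sorted by feature name, ignoring case.
        cols[5] = "|".join(sorted(pairs, key=str.lower))
    return cols
```

The dependency relations are deliberately left alone here, since how to attach multi-word foreign phrases is the open question discussed next.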
I admit, I feel a bit uneasy with the suggestion to use a flat structure
for all foreign phrases because, in the case of UD_Latin-PROIEL, it would
mean a loss of information. Currently, some Greek words are annotated
with the "correct" dependencies, e.g.:
9	ignoscendum	ignosco	VERB	V-	Case=Acc|Gender=Neut|Number=Sing|VerbForm=Gdv	3	ccomp	_	ref=1.1.4
10	esse	sum	AUX	V-	Tense=Pres|VerbForm=Inf|Voice=Act	9	cop	_	ref=1.1.4
11	ἐπεὶ	greek.expression	X	F-	_	9	advmod	_	ref=1.1.4
12	οὐχ	greek.expression	X	F-	_	14	flat:foreign	_	ref=1.1.4
13	ἱερήϊον	greek.expression	X	F-	_	14	obj:dir	_	ref=1.1.4
14	οὐδὲ	greek.expression	X	F-	_	11	advmod	_	ref=1.1.4
15	βοεΐην	greek.expression	X	F-	_	14	obj:dir	_	ref=1.1.4
Feel free to open a "universal" issue
<https://github.com/universaldependencies/docs/issues> to discuss the
cases where the foreign phrase is expected to be understood by the
readers, so that it is really a case of code switching
<https://en.wikipedia.org/wiki/Code-switching>.
I think in such cases, we can keep the correct dependencies (and
deprels) and just use |Foreign=Yes|.
However, the current UD_Latin-PROIEL is not consistent in this, as shown
in the example above: it uses |flat:foreign|, but only for some words
in the Greek phrases, and so goes against the guidelines, which prescribe
that "/all subsequent/ words in the expression are attached to the
/first/ one".
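One part of this inconsistency can be detected mechanically. Since the guidelines attach all later words of a flat expression to the first one, a word with a `flat` or `flat:*` relation should never have its head to its right. The sketch below checks only that one necessary condition, not the whole guideline; `right_headed_flat` is a hypothetical helper working on the 10-column tokens of a single sentence.

```python
def right_headed_flat(tokens):
    """Return IDs of flat/flat:* tokens whose head follows them in the sentence."""
    bad = []
    for cols in tokens:
        if not cols[0].isdigit():
            continue  # skip multiword-token ranges and empty nodes
        deprel = cols[7]
        if deprel == "flat" or deprel.startswith("flat:"):
            # A flat dependent must come after its head (HEAD < ID).
            if int(cols[6]) > int(cols[0]):
                bad.append(cols[0])
    return bad
```

On the Greek example above, this flags token 12 (οὐχ), which is attached as |flat:foreign| to token 14.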
Regarding conventions for date and value expressions, see UniversalDependencies/docs#455