Gender=Unsp #780

Closed
arademaker opened this issue Apr 29, 2021 · 7 comments

Comments

@arademaker
Contributor

arademaker commented Apr 29, 2021

In our article "Universal Dependencies for Portuguese" we argue that an extra value for Gender is necessary (https://www.aclweb.org/anthology/W17-6523/):

There are adjectives such as grande (‘big’) or feliz (‘happy’) that have only one form for both genders. So we cannot tell whether they are masculine or feminine unless we see the context they appear in. In many cases, even looking at the full sentence, one cannot tell if the word is masculine or feminine.

Revisiting this topic, I wonder what other treebanks are doing. I see two possible solutions when a word form can have more than one gender:

  1. If the context provides enough information, assign the right gender, and use Unsp only for cases where the context does not provide enough information.
  2. Use Com or Neut for such words unconditionally, regardless of the context (https://universaldependencies.org/u/feat/Gender.html).

Comments?

@dan-zeman
Member

I would strongly advise against feature values that say "None", "Unsp(ecified)", and the like, even if it is technically possible to define them at the language-specific level. The correct UD way is to omit the feature completely from the word's annotation. This is also stated in the guidelines: "Not mentioning a feature in the data implies the empty value, which means that the feature is either irrelevant for this part of speech, or its value cannot be determined for this word form due to language-specific reasons."

Furthermore, this particular case (or, more precisely, its Spanish counterpart) is discussed here: "For example, in Spanish, nouns distinguish two genders, masculine and feminine, and every noun can be classified as either Masc or Fem. Adjectives are supposed to agree with nouns in gender (and number), which they typically achieve by alternating -o / -a. But then there are adjectives such as grande or feliz that have only one form for both genders. So we cannot tell whether they are masculine or feminine unless we see the context. Yet they are either masculine or feminine (feminine in una ciudad grande, masculine in un puerto grande). Therefore in Spanish we should not tag grande with Gender=Com. Instead, we should either drop the gender feature entirely (suggesting that this word does not inflect for gender) or tag individual instances of grande as either masculine or feminine, depending on context."
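To make the two recommended options concrete, here is a minimal Python sketch, assuming the standard CoNLL-U FEATS encoding (`Name=Value` pairs separated by `|`, multiple values separated by `,`, and `_` for an empty feature set). The function name `parse_feats` and the example feature strings are hypothetical and serve only as illustration; they are not taken from any treebank.

```python
def parse_feats(feats):
    """Parse a CoNLL-U FEATS string into a dict mapping feature name -> set of values."""
    if not feats or feats == "_":
        return {}
    parsed = {}
    for pair in feats.split("|"):
        name, _, values = pair.partition("=")
        parsed[name] = set(values.split(","))
    return parsed


# Option A: drop Gender entirely (the word form does not inflect for gender).
grande_no_gender = parse_feats("Number=Sing")

# Option B: disambiguate each instance in context, e.g. "una ciudad grande".
grande_in_context = parse_feats("Gender=Fem|Number=Sing")

# A feature that is not mentioned simply has the empty value; no "Unsp" is needed.
print(grande_no_gender.get("Gender", set()))
print(grande_in_context.get("Gender", set()))
```

With this representation, "no value" is just a missing key, so no dedicated placeholder value has to be defined.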

@arademaker
Contributor Author

Thank you @dan-zeman, I had not paid attention to the end of the documentation.

@Stormur
Contributor

Stormur commented Apr 30, 2021

By the way, from discussions here I came to understand that more than one value for a feature signals ambiguity or indecision: so I think that the best fit for such "desperate" cases is Gender=Fem,Masc, which means that it can be either, but we cannot decide.

Adjectives in Portuguese (and in other languages) always bear a Gender category, so it should be present; and Gender=Com has a deceptive name: it really is something else, and it is language-specific.

@dan-zeman
Member

@Stormur You are right that multiple values of a feature are possible, but the guidelines also say that if the multi-value would list all values that are relevant in the given language, then the feature should be dropped instead.
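The rule described above can be sketched as a small normalization step. This is only an illustration under stated assumptions: features are held as a dict of value sets, the function name `drop_exhaustive_multivalue` is hypothetical, and the per-language gender inventories are supplied by the caller (two values for Portuguese and three for Latin, as mentioned in this thread).

```python
def drop_exhaustive_multivalue(feats, name, language_values):
    """Return a copy of feats without `name` if its multi-value lists every
    value relevant in the language; otherwise return an unchanged copy."""
    values = feats.get(name)
    if values is not None and set(values) >= set(language_values):
        return {k: v for k, v in feats.items() if k != name}
    return dict(feats)


# Assumed inventories, for illustration only.
PORTUGUESE_GENDERS = {"Fem", "Masc"}
LATIN_GENDERS = {"Fem", "Masc", "Neut"}

ambiguous = {"Gender": {"Fem", "Masc"}, "Number": {"Sing"}}

# In Portuguese, Fem,Masc exhausts the inventory, so the feature is dropped.
print(drop_exhaustive_multivalue(ambiguous, "Gender", PORTUGUESE_GENDERS))

# In Latin, Fem,Masc still excludes Neut, so the multi-value stays informative.
print(drop_exhaustive_multivalue(ambiguous, "Gender", LATIN_GENDERS))
```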

@Stormur
Contributor

Stormur commented Apr 30, 2021

> @Stormur You are right that multiple values of a feature are possible, but the guidelines also say that if the multi-value would list all values that are relevant in the given language, then the feature should be dropped instead.

Right, it is true that I was mainly thinking of Latin, where, unlike most modern Romance languages, we also have the neuter gender, and the ambiguity (apart possibly from really weird cases that I am not aware of) is only between Fem and Masc.

I forgot the exact point you mention, but is it not better to still annotate Gender in such cases? I mean, this is still useful for distinguishing the cases where there is an ambiguity for some reason from those where the word really neither inflects for nor expresses the category.

@dan-zeman
Member

It would be useful to know the disambiguated value, if there is manpower to do it reliably. If it cannot be disambiguated, then I'm not sure about possible benefits of knowing the fine-grained reason of why it cannot be disambiguated.

@Stormur
Contributor

Stormur commented May 4, 2021

I think that the rationale is, rather than knowing the reasons for such an ambiguity, to keep annotational coherence for those word classes which normally do express a Gender. I think that the main reason this issue apparently keeps resurfacing is this: one would like to still say that a word has a Gender, even if it is not possible to determine it. This should be a very rare occurrence, anyway.

On the one hand, I think that a feature like InflClass helps in keeping this kind of coherence, because we would still be able to make a difference between truly indeclinable (InflClass=Ind) elements and others which just happen to not have been annotated for a feature linked to inflection, like Gender.

On the other hand, if I were to do a search for elements with e.g. Gender=Fem, I think I would want to include those ambiguous cases, too.

In general, in my opinion the problem is that the undecidedness of a feature is different from the absence of it, so in such non-systematic cases it still makes sense to have a "neutral" annotation rather than not having it at all; systematic is the key word here. And in this context, listing all possible values would surely be better than using a specific "negative feature".
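The search scenario mentioned above (looking for Gender=Fem and also catching ambiguous tokens) can be sketched as follows. This is a minimal illustration, assuming the standard CoNLL-U FEATS encoding with comma-separated multi-values; the function name `gender_matches` and the example strings are hypothetical.

```python
def gender_matches(feats, wanted):
    """Return True if the FEATS string lists `wanted` among its Gender values."""
    for pair in feats.split("|"):
        name, _, values = pair.partition("=")
        if name == "Gender":
            return wanted in values.split(",")
    return False  # Gender not mentioned: the empty value matches nothing


examples = [
    "Gender=Fem|Number=Sing",       # disambiguated
    "Gender=Fem,Masc|Number=Sing",  # undecided, all candidate values listed
    "Number=Sing",                  # Gender omitted
]
for feats in examples:
    print(feats, gender_matches(feats, "Fem"))
```

A multi-valued annotation is found by such a query, whereas a token with the feature omitted is not, which is the practical difference between the two encodings discussed here.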


3 participants