Harmonize reason attributes #2580

Open
rettinghaus opened this issue Jul 26, 2024 · 2 comments · May be fixed by #2581

Comments

@rettinghaus
Contributor

@reason is available for the elements gap, secl, supplied, surplus, and unclear.
Most of them are based on teidata.word; however, the first and the last one (gap and unclear) take the long way, using teidata.enumerated instead.
Is there a reason for this? I do not see any, so maybe it would be good to have all @reason attributes modeled in the same way.

@sydb
Member

sydb commented Sep 5, 2024

Good point — it is probably better if those five cases of @reason were defined consistently.
On the other hand, the two that are defined as enumerations are, in fact, enumerations (for <gap> the list of enumerations is "cancelled", "deleted", "editorial", "illegible", "inaudible", "irrelevant", and "sampling"; for <unclear> it is "illegible", "inaudible", "faded", "background_noise", and "eccentric_ductus"; in both cases the list is of “suggested values”).
So those two should remain defined as teidata.enumerated. It seems to me the other three probably should be teidata.enumerated too, both to match and because enumerations for these make sense (to me) and can provide tighter validation and thus better encoding.

It is worth reviewing (my interpretation of) the semantics of these two datatypes. The teidata.word datatype was originally intended as nothing more than a way to provide a “string without funny characters that are likely to be problematic when parsing” sort of datatype, roughly analogous to (but with not quite the same restrictions as) the Nmtoken of XML 4th edition. Because the words “string”, “text”, and “token” were already taken, it was named (poorly, in retrospect) “word”. However, within a few years the meaning morphed into more of a “single token that has its own semantics” sort of thing. In any case, its meaning is quite distinct from teidata.enumerated, which represents the exact same syntax, but which means “there is (or should be) a controlled vocabulary for this”.
The controlled vocabularies are provided with the <valList> element, and come in three flavors:

  • closed (“legal values are”) — this is the list of possible values, thou shalt not use any others
  • semi (“suggested values include”) — this is a list of applicable values. If your case matches one of these cases, you should use the suggested value. If your case does not match any of these cases, you should make up your own value in the same vein.
  • open (“sample values include”) — this is a list of sample values. You might want to use these, you might not.
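
In ODD source the flavor is carried by the type attribute of <valList>. As a rough sketch (not the actual Guidelines source), the @reason of <unclear> with its suggested values might be declared like this. The <desc> wording is placeholder text, and the occurrence indicators on <datatype> are my assumption about how multiple reasons would be allowed:

```xml
<attDef ident="reason" usage="opt">
  <desc>indicates why the material is hard to transcribe (placeholder wording)</desc>
  <datatype minOccurs="1" maxOccurs="unbounded">
    <!-- teidata.enumerated signals that a controlled vocabulary is expected -->
    <dataRef key="teidata.enumerated"/>
  </datatype>
  <!-- type is one of "closed", "semi", or "open"; "semi" = suggested values -->
  <valList type="semi">
    <valItem ident="illegible"><desc>placeholder description</desc></valItem>
    <valItem ident="inaudible"><desc>placeholder description</desc></valItem>
    <valItem ident="faded"><desc>placeholder description</desc></valItem>
    <valItem ident="background_noise"><desc>placeholder description</desc></valItem>
    <valItem ident="eccentric_ductus"><desc>placeholder description</desc></valItem>
  </valList>
</attDef>
```

Changing type="semi" to "closed" would turn the same list into the only legal values; changing it to "open" would make the values mere samples.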

In some cases the Guidelines do not actually provide a controlled vocabulary at all. I think the semantics of these cases is “you, the customizer writing the ODD customization for a TEI project, should provide a controlled vocabulary for this (but we’re not going to provide any helpful suggestions)”.

Of the 169 cases of teidata.enumerated in the Guidelines,

  • 135 have a controlled vocabulary (48 closed, 45 open, and 41 semi), and
  • 34 do not.

Given that the @reason of <gap> and <unclear> are already enumerations, and that the descriptions of @reason of <secl>, <supplied>, and <surplus> each include at least one sample value (and clearly are not intended to be plain text), I am (pretty strongly) of the opinion they should all be teidata.enumerated. And, in general, it seems to me to make sense for the Guidelines to provide vocabulary lists wherever possible. (I am not personally qualified to come up with a list for <secl>, but could probably handle the other two.)
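
To make that concrete, here is a hedged sketch of what the change could look like for <supplied>, written as a customization ODD fragment. The two values and all <desc> text are illustrative placeholders, not an agreed vocabulary, and the occurrence indicators are an assumption on my part:

```xml
<elementSpec ident="supplied" module="transcr" mode="change">
  <attList>
    <!-- replace the whole attribute definition so the datatype and list travel together -->
    <attDef ident="reason" mode="replace">
      <desc>one or more tokens indicating why the text had to be supplied (placeholder wording)</desc>
      <datatype minOccurs="1" maxOccurs="unbounded">
        <!-- was teidata.word; teidata.enumerated says a vocabulary is expected -->
        <dataRef key="teidata.enumerated"/>
      </datatype>
      <!-- illustrative values only; a real list would need to be agreed on -->
      <valList type="semi">
        <valItem ident="faded-ink"><desc>placeholder description</desc></valItem>
        <valItem ident="lost-folio"><desc>placeholder description</desc></valItem>
      </valList>
    </attDef>
  </attList>
</elementSpec>
```

A semi list seems like the natural choice here, since it matches what <gap> and <unclear> already do and still lets encoders coin their own values.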

But the other question to ponder is whether or not @reason should be in an attribute class of which these 5 elements would be members, each providing its own <valList>.
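
A rough sketch of that alternative, assuming a hypothetical class name att.reasoned (no such class exists in the Guidelines as far as I know). The class would declare @reason once, and each member element would override only the value list. The override via mode="change" is my assumption about how it would be expressed, and the <desc> text is placeholder wording:

```xml
<specGrp>
  <!-- hypothetical attribute class; the name att.reasoned is invented for illustration -->
  <classSpec ident="att.reasoned" type="atts">
    <desc>provides an attribute for stating why an editorial intervention was made (placeholder wording)</desc>
    <attList>
      <attDef ident="reason" usage="opt">
        <desc>indicates the reason for the intervention (placeholder wording)</desc>
        <datatype minOccurs="1" maxOccurs="unbounded">
          <dataRef key="teidata.enumerated"/>
        </datatype>
      </attDef>
    </attList>
  </classSpec>

  <elementSpec ident="gap" module="core" mode="change">
    <classes mode="change">
      <memberOf key="att.reasoned"/>
    </classes>
    <attList>
      <!-- the element supplies its own list for the class-level attribute -->
      <attDef ident="reason" mode="change">
        <valList type="semi">
          <valItem ident="illegible"><desc>placeholder description</desc></valItem>
          <valItem ident="inaudible"><desc>placeholder description</desc></valItem>
          <!-- remaining suggested values for gap omitted from this sketch -->
        </valList>
      </attDef>
    </attList>
  </elementSpec>
</specGrp>
```

The other four elements would follow the same pattern as <gap>, each with its own <valList>.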

@rettinghaus
Contributor Author

@sydb Thanks for sharing your thoughts. Based on your explanation I agree that teidata.enumerated is the better datatype here.
