Pipe (bar) character (“|”) in MISC values #569

Open
msklvsk opened this issue Sep 15, 2018 · 9 comments

@msklvsk
Member

msklvsk commented Sep 15, 2018

The guidelines say that the MISC field

has to be formatted as a list that can be split on the bar character (|) without special escaping.

What if a value in MISC contains the bar character, as in LTranslit=||Translit=|?
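
To make the ambiguity concrete, here is a minimal Python sketch of what a reader sees if it splits the field on every bar, as the guidelines describe. The variable names are illustrative only.

```python
# The MISC field from the question: two attributes whose intended values are "|".
misc = "LTranslit=||Translit=|"

# Splitting on every bar loses both values and produces empty items instead.
items = misc.split("|")
print(items)  # ['LTranslit=', '', 'Translit=', '']
```

Both intended values come back empty, and two stray empty items appear in the list.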

@foxik
Member

foxik commented Sep 15, 2018

I believe the guidelines are unambiguous -- all bar characters are split points.

In UDPipe, we use an internal escaping mechanism to allow bars in values when needed (if you want details, we use C-string-like formatting, where \p is a bar character [pipe], \n is a newline, \s is a space, etc., and \\ is a backslash).
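
A rough Python sketch of such a C-string-like scheme might look like the following. It is only an illustration based on the description above, not UDPipe's actual code: the function names and the two lookup tables are hypothetical, and only the four escapes named in the comment are covered.

```python
import re

# Hypothetical tables covering only the escapes named above (\p, \n, \s, \\).
_ESCAPES = {"\\": "\\\\", "|": "\\p", "\n": "\\n", " ": "\\s"}
_UNESCAPES = {"\\": "\\", "p": "|", "n": "\n", "s": " "}


def escape_misc_value(value: str) -> str:
    """Replace each special character with its backslash escape."""
    return "".join(_ESCAPES.get(ch, ch) for ch in value)


def unescape_misc_value(value: str) -> str:
    """Decode in a single left-to-right pass, so that an escaped backslash
    followed by 'p' is not misread as an escaped pipe.
    Unknown escape sequences are left untouched."""
    return re.sub(
        r"\\(.)",
        lambda m: _UNESCAPES.get(m.group(1), m.group(0)),
        value,
        flags=re.DOTALL,
    )


print(escape_misc_value("|"))        # \p
print(unescape_misc_value("a\\pb"))  # a|b
```

Because the escaped form never contains a bar, a plain split on "|" still works on the whole field.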

@dan-zeman
Member

Then I'm afraid some kind of escaping is necessary. In the case of LTranslit and Translit, it might be sensible to simply omit the features (provided they're just copies of LEMMA and FORM, not really transliterations of some non-Latin character). But I have to admit that my own code does not handle such cases correctly.

@msklvsk
Member Author

msklvsk commented Sep 15, 2018

I guess I’ll omit the translit for “|” for the time being, leaving it as a surprise for downstream applications.
There is no real solution without an escaping standard, which is inevitable for any such format.

msklvsk added a commit to UniversalDependencies/UD_Ukrainian-IU that referenced this issue Sep 15, 2018
@amir-zeldes
Contributor

Another option is to allow XML-style Unicode escapes, i.e. &#124; == |
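
As a sketch of this option, assume that "XML-style Unicode escapes" means numeric character references such as `&#124;` (the vertical bar is U+007C, decimal 124). The helper names below are hypothetical. Escaping `&` as well is an extra assumption, needed so that a literal `&#124;` in the data survives a round trip.

```python
import html


def escape_xml_style(value: str) -> str:
    # Escape "&" first, then the bar; the result contains no "|" at all.
    return value.replace("&", "&amp;").replace("|", "&#124;")


def unescape_xml_style(value: str) -> str:
    # html.unescape resolves numeric character references and named entities.
    return html.unescape(value)


item = "Translit=" + escape_xml_style("|")
print(item)                      # Translit=&#124;
print(unescape_xml_style(item))  # Translit=|
```

Note that `html.unescape` also decodes entities this escaper never produces (for example `&lt;`), so a stricter decoder may be preferable in practice.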

@msklvsk
Member Author

msklvsk commented Nov 13, 2018

The problem isn’t solved. Is there no intention to introduce escaping?

@dan-zeman
Member

@msklvsk : Do you mean to standardize escaping in the specification of the CoNLL-U format?

@dan-zeman dan-zeman reopened this Nov 13, 2018
@dan-zeman dan-zeman added this to the later milestone Nov 13, 2018
@msklvsk
Member Author

msklvsk commented Nov 13, 2018

Yes, I mean to change this part:

it has to be formatted as a list that can be split on the bar character (|) without special escaping

@dan-zeman
Member

dan-zeman commented Nov 13, 2018

I think that the intended meaning of "without special escaping" was not to ban escaping of the vertical bar inside individual items of the list. It was meant to say that the escaping must not involve the "|" character itself (e.g. with backslash, "\|") because then a simple split("|") would not work.
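
The difference can be illustrated with a short Python sketch. The two example fields are made up for illustration, and `SpaceAfter=No` simply stands in for any second attribute.

```python
# Backslash-bar escaping: the escaped bar is still a split point.
with_backslash_bar = r"Translit=\||SpaceAfter=No"
broken = with_backslash_bar.split("|")
print(broken)  # ['Translit=\\', '', 'SpaceAfter=No']

# Bar-free escaping (here \p): a simple split yields exactly the two items.
with_bar_free_escape = r"Translit=\p|SpaceAfter=No"
ok = with_bar_free_escape.split("|")
print(ok)  # ['Translit=\\p', 'SpaceAfter=No']
```

In the first case the reader would need a more complex tokenizer than `split("|")`. In the second case it only has to unescape each item after splitting.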

@msklvsk
Member Author

msklvsk commented Nov 13, 2018

Understood. Then we should agree on an escaping convention that doesn’t involve the pipe, like what foxik or amir-zeldes suggested. I like \p for pipe and \\ for backslash. XML-style would mean lots of & and may be confusing because actual XML special chars are not escaped.
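
For illustration, a whole-field round trip under this proposed convention (only \p and \\) could be sketched in Python as follows. Nothing here is part of the CoNLL-U specification: the function names are hypothetical, and treating a bare key without "=" as a value of `None` is an assumption of this sketch.

```python
from typing import Dict, Optional


def escape_value(value: str) -> str:
    # Escape backslashes first so the backslash added for "\p" is not doubled.
    return value.replace("\\", "\\\\").replace("|", "\\p")


def unescape_value(value: str) -> str:
    out = []
    i = 0
    while i < len(value):
        ch = value[i]
        if ch == "\\" and i + 1 < len(value):
            nxt = value[i + 1]
            if nxt == "p":
                out.append("|")
            elif nxt == "\\":
                out.append("\\")
            else:
                out.append(ch + nxt)  # unknown escape: keep as written
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)


def parse_misc(field: str) -> Dict[str, Optional[str]]:
    if field == "_":  # "_" marks an empty MISC field in CoNLL-U
        return {}
    attrs: Dict[str, Optional[str]] = {}
    for item in field.split("|"):  # the simple split keeps working
        key, sep, value = item.partition("=")
        attrs[key] = unescape_value(value) if sep else None
    return attrs


def format_misc(attrs: Dict[str, Optional[str]]) -> str:
    if not attrs:
        return "_"
    return "|".join(
        key if value is None else f"{key}={escape_value(value)}"
        for key, value in attrs.items()
    )


print(format_misc({"LTranslit": "|", "Translit": "|"}))  # LTranslit=\p|Translit=\p
print(parse_misc(r"LTranslit=\p|Translit=\p"))  # {'LTranslit': '|', 'Translit': '|'}
```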
