over-specification in Conformance: Unicode normalization #483
Comments
I tried to find the history behind this and it seems like it was mostly to make sure that Personally, I don't really know much about Unicode normalization but it may make sense to allow it for the cue text? Given that WebVTT is very strict on not normalizing right now, it's a lot easier to allow rather than if it was the reverse. I'll have to read up on charmod-norm. Finally, WebVTT does specify that the file be UTF-8 encoded.
I think the hint is "such as the matching of cue identifiers" in Addison's comment, and that we wanted matching to be done by byte equality rather than string equivalence. For the text of cues, I agree, I can't think why we would care.
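To illustrate the distinction drawn in this comment, here is a minimal Python sketch. It is not taken from the WebVTT spec: the identifier strings and the helper names `ids_match` and `ids_match_normalized` are hypothetical. It shows that two canonically equivalent identifiers are unequal under byte-for-byte comparison and only match once a normalization step is applied.

```python
import unicodedata

# Two hypothetical cue identifiers that render identically ("café").
precomposed = "caf\u00e9"   # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "cafe\u0301"   # "e" followed by U+0301 COMBINING ACUTE ACCENT


def ids_match(a: str, b: str) -> bool:
    """Case-sensitive, non-normalizing match: compare the UTF-8 bytes directly."""
    return a.encode("utf-8") == b.encode("utf-8")


def ids_match_normalized(a: str, b: str) -> bool:
    """For contrast only: compare after converting both strings to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(ids_match(precomposed, decomposed))             # False
print(ids_match_normalized(precomposed, decomposed))  # True
```

Under the byte-equality approach, an author who writes an identifier in one form and references it in the other would simply get no match, which is the trade-off being discussed here.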
This was brought in quite early on, before my time as editor. From my reading, it doesn't just relate to identifiers, but to all of WebVTT parsing. I believe it may be that a lot of the parsing rules rely on byte equality, as Dave is saying. Note that WebVTT is not XML but a text format with some markup, and this is quite strict. What problems do you see arising from being this strict?
The Timed Text Working Group just discussed
The full IRC log of that discussion
<nigel> Topic: WebVTT: over-specification in Conformance: Unicode normalization #483
<nigel> github: https://github.com//issues/483
<nigel> Gary: This issue came from the i18n working group, about Unicode normalisation.
<nigel> .. WebVTT specifically disallows this, and says to compare the bytes directly.
<nigel> .. The issue raised is that it is not what we want, potentially.
<nigel> .. I don't have much knowledge personally of why you would want or not want to do it.
<nigel> .. From digging around in the history, it sounds like it was mostly to make sure that
<nigel> .. things that are required in WebVTT are easy to identify like the arrow in the time
<nigel> .. signature so that we aren't matching normalised Unicode and can find it more easily.
<nigel> .. I want to ask if anyone had more knowledge about it, or if TTML or IMSC handle
<nigel> .. Unicode normalisation.
<nigel> Nigel: I think in TTML it is delegated to XML so whatever XML says, which we assume is
<nigel> .. the correct thing, is what happens.
<nigel> Gary: Yes. It's relevant that WebVTT is not XML but a text format with markup.
<nigel> .. David Singer said that for the text of the cues we could do normalisation, but even that
<nigel> .. might be a bit more complicated because HTML tags are allowed to be used.
<nigel> Nigel: Also what about metadata payload in the cues?
<nigel> .. For example if it is JSON, does that specify Unicode normalisation? I do not know.
<atsushi> https://infra.spec.whatwg.org/#json
<nigel> Atsushi: In JSON I believe that it depends on the processor for values
<gkatsev> -> https://www.ecma-international.org/publications/files/ECMA-ST/ECMA-404.pdf json specification
<nigel> Gary: The spec is quite small and just says it is a sequence of Unicode code points.
<cyril> rrsagent, pointer
<RRSAgent> See https://www.w3.org/2020/04/30-tt-irc#T15-20-07
<nigel> Atsushi: I think currently in WebVTT, case sensitive non-normalised matching is defined.
<nigel> Nigel: The issue is that it is _not_ using that.
<nigel> Atsushi: I think the linked document was written after work on WebVTT began.
<nigel> .. The standard operation was written after WebVTT so maybe even if the result is the same
<nigel> .. but some text is over-specified in the current standard.
<nigel> Cyril: A different angle: do we have tests for this in WPT?
<nigel> Gary: For Unicode normalisation?
<nigel> Cyril: Yes, to match the MUST NOT in the spec.
<nigel> Gary: I'm not sure
<nigel> Nigel: Are you thinking about if we can establish what implementations do now via the tests?
<nigel> Cyril: Yes
<nigel> Gary: From a quick look I'm not seeing anything specific to Unicode.
<nigel> Nigel: I'm a bit confused about where the line is drawn between parsing the WebVTT
<nigel> .. document e.g. during processing, cue matching etc. and text presentation.
<nigel> .. If some payload text is passed onto a text renderer and there's a step that does
<nigel> .. normalise the text, is that broken, according to the spec text in §2.2?
<nigel> Gary: The example is about cue matching, which is very specific.
<nigel> Nigel: Is "processing" a defined term?
<nigel> Gary: It could refer to the "processing model" part of the spec.
<nigel> .. That would make sense because that's when you would be applying styling and whatnot.
<nigel> Atsushi: I am not sure that there is any case that is not covered by "case sensitive non-normalising"
<nigel> .. if there is no such case then I suppose it may be possible to write it into the standard
<nigel> .. in a simpler way.
<nigel> Gary: You mean to link to the charmod-norm spec to the section that matches what
<nigel> .. we want to do in WebVTT?
<nigel> Atsushi: Actually the character model normalisation is not a Rec track doc but a WG note
<nigel> .. so it cannot be normative. You would need to copy and paste the spec text.
<nigel> .. Recently there are several standards that say this kind of thing so having this kind of
<nigel> .. spec may be easier for readers and may not have some strange cases.
<nigel> .. The last point of the issue comment is for character encoding, but I'm not sure if we need
<nigel> .. to have this strong restriction for later processing by scripts or web browser.
<nigel> scribe: [not sure I got that very well]
<nigel> Gary: You mean from cue text?
<nigel> Atsushi: Yes
<nigel> Nigel: Does the requirement that WebVTT is always UTF-8 make some of the concern
<nigel> .. disappear here?
<nigel> Atsushi: I need to think about that more.
<nigel> .. At this moment I don't see any difference between the suggestion and the current
<nigel> .. spec text and description.
<nigel> Nigel: Not sure how we move to a resolution on this. Gary?
<nigel> Gary: I think I need to read up on the charmod-norm first and it would be good to get
<nigel> .. clarification on how WebVTT being specified as UTF-8 affects/does not affect things.
<nigel> .. It does sound like it might be okay to change how we handle the cue text normalisation
<nigel> .. but we likely don't want to do that for other parts of WebVTT.
<nigel> SUMMARY: Investigation of impact to continue.
I spent a bit of time looking into this. There's a blog post from Anne van Kesteren about Unicode normalization, https://annevankesteren.nl/2009/02/unicode-normalization, which made a lot of sense to me. Its conclusion is not to normalize. Given that HTML and CSS also do not normalize, we should do the same. I think what we have now fits that criterion. Also, given that we have embedded CSS and HTML syntax in the cues, we should apply this to all the WebVTT text and not just the cue settings line. I've also re-read the original post; it sounds like the question here is about the specific language used rather than the substance of what is said?
Seems to match what we do, though it is definitely said in different words. Is the ask to update it based on language from charmod?
I think that the key points that the i18n WG wanted to make here are that:
The original trigger for the comment was:
This just seemed too general a statement. The note that follows goes on to mention identifiers, which is good, but the normative text is not precise enough. If it said something along the lines of:
that might address the issue. Does that help?
Sounds like the key issue is that "during processing" is too broad, and apparently excludes some reasonable processing that is out of scope of the spec. Downstream text operations such as searching or indexing the natural language text might well need to do some normalisation, depending on exactly what they are intending to achieve. Would it help to be really explicit that any hand-over of content originating in the WebVTT document from the WebVTT processor to some other downstream processor will not have had any normalisation applied, so that if they want/need to do it then they know the state of incoming data?
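As a rough sketch of that hand-over, assuming a hypothetical downstream search step (the name `search_key` and the sample strings are illustrative and come from no spec): the cue text arrives exactly as authored, and the downstream consumer decides for itself whether to normalise before comparing.

```python
import unicodedata


def search_key(text: str) -> str:
    """Hypothetical downstream step: convert to NFC, then case-fold for search."""
    return unicodedata.normalize("NFC", text).casefold()


# Cue text as handed over by the WebVTT processor: untouched, here in decomposed form.
cue_text = "Cafe\u0301 ope\u0301ra"

# A user's search query, typed with precomposed characters and in lower case.
query = "caf\u00e9 op\u00e9ra"

# The raw strings differ, so a non-normalising comparison finds no match.
raw_match = cue_text == query

# The downstream consumer applies its own normalisation and then finds the match.
normalised_match = search_key(cue_text) == search_key(query)
```

Here the normalisation happens entirely outside the WebVTT processor, so the consumer knows the incoming data has not been altered.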
That makes sense. My read of |
Section:
Conformance: Unicode normalization
https://www.w3.org/TR/webvtt/#unicode-normalization
The I18N WG noticed the above conformance requirement recently and discussed it in recent teleconferences.
Unicode normalization is only one consideration that affects processing of WebVTT and its operations (such as the matching of cue identifiers). While this requirement is consistent with our recommendations and intentions, we'd suggest that you consider a more expansive approach as described in our Charmod-norm document, particularly section 3.1. Unless there is a special reason that our WG is unaware of, WebVTT is not especially sensitive to variations, so a case-sensitive non-normalizing matching for cue identifiers makes sense to us.
The other concern we have is that this requirement forbids any and all normalization when processing a WebVTT document, not just when performing operations such as cue ID matching. Is there a reason to extend a processing requirement to the entire document? Or to forbid normalization when converting character encoding (although, if memory serves, WebVTT doesn't support encodings other than UTF-8, so this may not apply)?