Descriptive content #17
Conversation
This is looking good to me -- my only concern is with properly crediting @msporny's contributions. We should figure out an appropriate way to do that -- whether that be with "former editors" or something along those lines.
Yes, that's totally fine. Thanks!
I'm fine with keeping @msporny as a former editor, although that may be intended for former editors of Recs, more formally. I also suggested elsewhere that emerging practice may also list chairs as having a role someplace in the document header.
@msporny, I believe I got your w3cid right, but let me know. I used "41758" as your w3cid.
Mostly minor, except the i18n and L10n concerns
spec/index.html (outdated)
Compute the first degree hash for each blank node in the dataset using <a href="#hash-1d-quads" class="sectionRef"></a>.</li>
<li><strong>Canonically label unique nodes</strong>.
Assign canonical identifiers via <a href="#issue-identifier" class="sectionRef"></a>,
in lexicographical order, to each blank node whose first degree hash is unique.</li>
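As a rough illustration of the step being reviewed (a hypothetical sketch, not the normative algorithm, with invented helper and identifier names): after computing a first-degree hash for each blank node, canonical identifiers are assigned in sorted hash order, but only to blank nodes whose hash is unique.

```javascript
// Hypothetical sketch of "canonically label unique nodes".
// firstDegreeHashes: Map of blank node label -> hex hash string.
function labelUniqueNodes(firstDegreeHashes) {
  // Group blank nodes by their first-degree hash.
  const byHash = new Map();
  for (const [node, hash] of firstDegreeHashes) {
    if (!byHash.has(hash)) byHash.set(hash, []);
    byHash.get(hash).push(node);
  }
  const canonical = new Map();
  let counter = 0;
  // Sort the hashes; default string sort compares UTF-16 code units,
  // which coincides with code point order for ASCII hex digests.
  for (const hash of [...byHash.keys()].sort()) {
    const nodes = byHash.get(hash);
    if (nodes.length === 1) {
      // Only nodes with a unique hash get a canonical identifier here.
      canonical.set(nodes[0], `c14n${counter++}`);
    }
  }
  return canonical;
}
```

Nodes sharing a hash are left for the later, more expensive disambiguation steps.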
How is "lexicographical order" defined? If I understand this doc correctly as it stands, I think including this definition is important, especially because we're working with global data and users, so we need to consider the complications inherent in i18n and L10n (i.e., internationalization and localization, or internationalisation and localisation, depending on your locale ... noting also that W3C uses US English).
((tangent ... it's too bad that the "append ",spell" to a W3C URI to invoke W3C's spell checker" trick can't be brought to bear on GitHub-hosted documents. I wonder whether W3C's tech magicians could provide the dictionary on which that checker is based for use on GitHub and/or local docs and/or etc?))
The Unicode Collation Algorithm 15.0 defines a sorting algorithm, which can be pretty complex for encodings other than UTF-8.
Although I've always understood the meaning of "lexicographical order" as clear (there is a Wikipedia entry on it), we can certainly define the term and reference Unicode ordering.
Note that we used the term "lexicographical order" and similar in JSON-LD API with no apparent misunderstanding.
[@gkellogg]
The Unicode Collation Algorithm 15.0 defines a sorting algorithm, which can be pretty complex for encodings other than UTF-8.
Given that we're already going to require a number of transformations, I think requiring other encodings to be converted to UTF-8 is not unreasonable. Having skimmed the Unicode sorting algorithm, I'm afraid it will be pretty complex even for UTF-8 encoded data. (I certainly don't fully grasp it yet!) Their discussion of Deterministic Sorting seems particularly important for implementers of our work.
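To make the concern above concrete (an illustrative snippet for any modern JS engine with `Intl` support, not anything from the spec): a UCA/CLDR-backed, locale-aware comparison such as `localeCompare` disagrees with raw code unit order even for plain ASCII.

```javascript
// Locale-aware collation typically ranks letters case-insensitively
// first, so "a" sorts before "B" under the English collation.
const localeOrder = "a".localeCompare("B", "en"); // negative

// Raw comparison uses UTF-16 code units: "B" is 0x42, "a" is 0x61,
// so by code units "B" sorts before "a".
const codeUnitOrder = "a" < "B"; // false
```

This is exactly the kind of divergence a canonicalization spec has to pin down, since two conforming implementations using different notions of "order" would produce different canonical forms.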
[@gkellogg]
there is a Wikipedia entry on [Lexicographical order]
Indeed there is ... and its second paragraph starts with —
[Wikipedia]
There are several variants and generalizations of the lexicographical ordering.
— which is reinforced by the several See Also items —
[Wikipedia]
- Collation
- Kleene–Brouwer order
- Lexicographic preferences
- Lexicographic order topology on the unit square
- Lexicographic ordering in tensor abstract index notation
- Lexicographically minimal string rotation
- Long line (topology)
- Lyndon word
- Star product, a different way of combining partial orders
- Orders on the Cartesian product of totally ordered sets
[@gkellogg]
Note that we used the term "lexicographical order" and similar in JSON-LD API with no apparent misunderstanding.
This is the sort of thing that tends not to get raised as an issue until two people with different interpretations of the meaning of a term trade data and discover their unexpected differences with varying severity of impact. I regret that I was not involved sufficiently with the JSON-LD API work to have caught this as an issue.
[@gkellogg]
The Unicode Collation Algorithm 15.0 defines a sorting algorithm, which can be pretty complex for encodings other than UTF-8. Given that we're already going to require a number of transformations, I think requiring other encodings to be converted to UTF-8 is not unreasonable. Having skimmed the Unicode sorting algorithm, I'm afraid it will be pretty complex even for UTF-8 encoded data. (I certainly don't fully grasp it yet!) Their discussion of Deterministic Sorting seems particularly important for implementers of our work.
I expect String.sort() as implemented on most platforms to do the right thing. It's beyond our scope to exhaustively explore problem areas that belong to the Unicode specs.
See if the definitions I added adequately satisfy your concerns. I really think that we could go overboard trying to specify what lexicographical ordering of Unicode strings means.
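For what it's worth, the default JS sort behavior alluded to above is already locale-independent: `Array.prototype.sort` with no comparator converts elements to strings and compares UTF-16 code units, regardless of the system locale.

```javascript
// Default sort compares UTF-16 code units, so all uppercase ASCII
// letters (0x41-0x5A) sort before all lowercase ones (0x61-0x7A),
// no matter what the platform locale is.
const sorted = ["b", "A", "a", "B"].sort();
// -> ["A", "B", "a", "b"]
```

For strings containing no surrogate pairs, this code unit order also coincides with code point order, which is part of why the default often "does the right thing" in practice.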
Also, note that sorting is done after conversion to N-Quads, which requires quads to be encoded in UTF-8, which simplifies the collation problem.
Well, it looks like you can specify ducet (Default Unicode Collation Element Table (DUCET)) as the locale to use the Unicode Collation Algorithm in JavaScript ... if it's supported.
Unfortunately, it also looks like it's perhaps not supported in browsers (well, not in Firefox anyway, if MDN docs are accurate).
Also, for our particular use case, it's important that we don't sort based on platform / system locale, but rather, ensure that the sort order is the same regardless of platform / system locale. This points to using code points or code units to me.
IMO, our requirements are:
- Consistent sort order no matter the platform / system locale.
- Speed.
- Simplicity to implement.
- Matching (or getting as close as possible) to existing implementations.
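One locale-independent option meeting these requirements (a hypothetical helper for illustration, not text from the spec) is to compare strings code point by code point:

```javascript
// Compare two JS strings by Unicode code point rather than by UTF-16
// code unit. Deterministic and identical on every platform and locale.
function codePointCompare(a, b) {
  let i = 0, j = 0;
  while (i < a.length && j < b.length) {
    const ca = a.codePointAt(i);
    const cb = b.codePointAt(j);
    if (ca !== cb) return ca < cb ? -1 : 1;
    // Supplementary code points occupy two UTF-16 code units.
    i += ca > 0xFFFF ? 2 : 1;
    j += cb > 0xFFFF ? 2 : 1;
  }
  // Equal prefix: the shorter string sorts first.
  return (a.length - i) - (b.length - j);
}
```

The two orders only disagree when a supplementary character (requiring a surrogate pair) is compared against a BMP character above U+D7FF, e.g. U+FF61 vs. U+10000.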
Unicode collation is not helpful. It is locale sensitive and you must know the locale. It is not total.
https://unicode.org/reports/tr10/#Common_Misperceptions
"Swedish and German share most of the same characters, for example, but have very different sorting orders."
"Collation is not a property of strings."
"Collation order is not preserved when comparing sort keys generated from different collation sequences."
For @dlongley's points 1-4, the conclusion is either codepoint sorting (logically convert to 21-bit values, then sort) or code unit ordering for a specific code unit choice.
SPARQL is (strictly) codepoint (the abstract character) ordering.
Section 17.3 says use collation http://www.w3.org/2005/xpath-functions/collation/codepoint
(credit to Eric Prud'hommeaux).
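The Swedish/German divergence quoted above is directly observable via `Intl.Collator` (an illustration; results depend on the ICU/CLDR data shipped with the runtime):

```javascript
// Swedish treats "ä" as a distinct letter sorting after "z";
// German treats "ä" as a variant of "a", sorting well before "z".
const sv = new Intl.Collator("sv").compare("\u00E4", "z"); // positive
const de = new Intl.Collator("de").compare("\u00E4", "z"); // negative
```

Two conforming implementations picking up different system locales would therefore canonicalize the same dataset differently, which is why collation is unsuitable here.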
I also suspect we want to compare using code points. The only reason we'd choose code units would be if most languages already natively sort by UTF-16 code units, but I think this is unlikely.
For those languages that don't offer some fast, native support for code point sorting, but instead compare natively using UTF-16 code units, I imagine the input could be scanned just once to look for surrogate pairs. Only if any were detected would special comparison code need to be used.
This discussion is somewhat split between here and #18 -- I recommend we say in this PR that lexicographical order refers to Unicode code point order and then continue the discussion over there. My guess is that existing implementations already sort by Unicode code point order or are very close and just need a slight adjustment for some data because they compare using UTF-16 code units today.
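The scan-once idea above can be sketched as follows (an assumed optimization strategy, not normative text): strings containing no surrogates compare identically by UTF-16 code unit and by code point, so the slower code point comparison is only needed when surrogates are detected.

```javascript
// Returns true if any UTF-16 code unit is in the surrogate range.
function hasSurrogates(s) {
  for (let i = 0; i < s.length; i++) {
    const c = s.charCodeAt(i);
    if (c >= 0xD800 && c <= 0xDFFF) return true;
  }
  return false;
}

function compareByCodePoint(a, b) {
  if (!hasSurrogates(a) && !hasSurrogates(b)) {
    // Without surrogates, native code unit order equals code point order.
    return a < b ? -1 : a > b ? 1 : 0;
  }
  // Fall back to an explicit per-code-point comparison.
  const as = Array.from(a), bs = Array.from(b); // split into code points
  const n = Math.min(as.length, bs.length);
  for (let k = 0; k < n; k++) {
    const d = as[k].codePointAt(0) - bs[k].codePointAt(0);
    if (d !== 0) return d < 0 ? -1 : 1;
  }
  return as.length === bs.length ? 0 : as.length < bs.length ? -1 : 1;
}
```

The fast path covers the vast majority of real-world N-Quads data, which rarely contains supplementary-plane characters.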
Does DUCET canonicalize codepoints? (https://unicode.org/reports/tr10/#Contractions_DUCET)
If different ways of writing the same character (e.g. via combining characters, per Unicode Normalization) have the same weight, then two different xsd:strings are not totally ordered when one contains the character as one codepoint and the other uses two (the second being a combining character).
Not simple any more!
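The pitfall described above, concretely: "é" can be written as one code point (U+00E9) or as "e" plus a combining acute accent (U+0065 U+0301). The two render identically but are distinct strings under code point comparison unless one normalizes first.

```javascript
const precomposed = "\u00E9"; // é as a single code point
const decomposed = "e\u0301"; // e + combining acute, two code points

precomposed === decomposed;                  // false
precomposed === decomposed.normalize("NFC"); // true
```

So a pure code point order is total over code point sequences, but it distinguishes canonically-equivalent strings; whether that matters depends on whether inputs are normalized before canonicalization.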
Co-authored-by: Ted Thibodeau Jr <tthibodeau@openlinksw.com>
I've incorporated most of the text I thought was straightforward from Arnold/Longley. Still need examples and images, but I think this is a good point to finish this particular PR.
Either the ordering is important and necessary, or it isn't. If it isn't, then all reference to it should be removed. If it is, either we need to provide sufficient detail that the ordering is predictable and consistent, or we need to explain why (and how much of) the unpredictability and/or inconsistency is acceptable. Presuming the ordering is important and necessary, I think @dlongley's suggestion of UTF-8 codepoint order is likely to play an important part. (It was in my own thinking yesterday, but not firm enough to write up, given the other questions still in the air.) I'm pretty sure we'll need more discussion on this, but it probably does need its own issue(s). For now, I'm happy enough that it's on more people's radar than my own.
Ensuring proper ordering, whatever it may be, would be important regardless of what algorithm is chosen for actually navigating the quads to create hashes. I hadn't looked at the documents closely enough to be sensitive to the localization issues, and wasn't aware of the JavaScript implementation issues @dlongley pointed out. Given that this is something many different specifications would face, it's surprising that there is not a single solution useable across domains. This may be a case where we call for early feedback from either the Internationalization WG or the TAG. I'll create an issue to track (#18).
A couple of small tweaks.
@msporny I added the |
Oxford comma. Co-authored-by: Ted Thibodeau Jr <tthibodeau@openlinksw.com>
"Approved", but I think this is a good time to start removing the terminology "lexicographic order". It is confusing because it may be taken to imply the ordering works on characters, not codepoints (combining characters etc.), and hence that Unicode normalization and possibly locale-sensitive ordering is being specified when it's not.
I think we settled on needing a fixed total ordering for any strings (URIs or literals) by using codepoint ordering, which has no string-value or collation semantics.
For speed, add at least an issue box saying we will remove the use of "lexicographic order" and replace it with "total order of RDF terms".
I can do a commit changing "lexicographical order" (and related) to "total order" (or "codepoint order") later this weekend. @dlongley WDYT?
Yes, I think it's a good idea to change the language. My vote is for "Unicode code point order" for clarity. I think "total order" would only work if that's a term itself that we define in the terminology section as Unicode code point order.
No need to define it -- "total ordering" (wikipedia link) is the term for the requirement we need (and not some weird lattice comparison). "Unicode code point order" seems the best choice for a web standard.
… reference to "total ordering" from Wikipedia.
Approving with some suggestions to consider. Thanks!
Co-authored-by: Dave Longley <dlongley@digitalbazaar.com>
cc/ @dlongley