URI construction for DMLex fragments #111

Merged
merged 9 commits into from
May 20, 2024

vojtech-kovar
Contributor

No description provided.

@jmccrae
Contributor

jmccrae commented Apr 4, 2024

Is this related to #97?

@jmccrae jmccrae self-requested a review April 4, 2024 08:16
@DavidFatDavidF DavidFatDavidF added this to the 2nd public review milestone Apr 4, 2024
@vojtech-kovar
Contributor Author

Is this related to #97?

Yes -- sorry for not mentioning that before, and thanks for volunteering to review :) We had a discussion about that at the meeting today after you left, and there will be some changes -- so maybe wait with the review until I implement the changes (tomorrow or Monday, I hope).

@vojtech-kovar
Contributor Author

notes from today's meeting:

  • we want to do IRIs, not URIs
  • change the structure so that "Optional roots" is 3.1, Fragment identification is 3.2, Fragment URIs is 3.2.1, lexicographicResource is 3.3
  • reformulate/hedge the authoritativeness of the instructions, say something like it is "recommended for dictionaries living on-line" and "recommended method for inter-operability"
  • point to here from the linking module (again, not in any authoritative way, rather as a recommendation and for reader's convenience)

feel free to add if I forgot anything

@jmccrae
Contributor

jmccrae commented Apr 5, 2024

I have some doubts about this scheme:

  • Some elements can be assigned ambiguous empty IDs: collocateMarker and etymology both have one optional unique property, that may be missing, so in this case their identity translates to an empty string.
  • Some IDs will be very long: definition has only text as its unique property, this may lead to a very long identifier as definitions can be quite substantial in a dictionary
  • listingOrder is not used as a property, but would be the obvious choice for many elements. e.g. currently we would have something like http://www.example.com/lexicon/entry/cat~1~noun/sense/small+furry+animal and this could be simplified to http://www.example.com/lexicon/entry/cat~1~noun/sense/1
  • The order of elements with multiple properties is not clear, e.g., should it be cat~1~noun or cat~noun~1
  • It should be noted that the second case (single unique property, arity=1, value is an object) actually does not occur in the spec
  • We should give the result of applying this schema along with each element definition

@vojtech-kovar
Contributor Author

vojtech-kovar commented Apr 5, 2024

Thanks for the notes, let me add my thoughts:

  • Some elements can be assigned ambiguous empty IDs: collocateMarker and etymology both have one optional unique property, that may be missing, so in this case their identity translates to an empty string.

Not sure if I understand correctly: Do you mean e.g. two different etymology objects under one entry, both with missing description? According to my understanding of UNIQUEness, this should not be allowed -- because once two objects at the same level lack a UNIQUE identifier, it is no longer UNIQUE, and the objects cannot be distinguished by this property. (NB there is the same situation with sense, both UNIQUE properties are also OPTIONAL.) If a property is marked both UNIQUE and OPTIONAL, I understood it's because we want to allow a single etymology (or sense) without description (or definition) under each entry, not multiple. Am I reading it wrong?

It could anyway be stated more explicitly in the description of UNIQUEness.

  • Some IDs will be very long: definition has only text as its unique property, this may lead to a very long identifier as definitions can be quite substantial in a dictionary

Yes, that's right -- I've asked about it and we have discussed this at the meeting after you left, and even considered an option of some hashing, but we agreed we prefer readability and transparency to compression.

I am against using listingOrder -- you are right it would be easy to use (and short), but if you use it as a link and then the listing order changes without changing the link (which can happen anytime if the resource is not frozen), the link will still work (i.e. nobody will notice anything, everything will be valid etc.) but it will point to a wrong object. I think we want to avoid that.
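
A minimal Python sketch of the failure mode described above. The two senses and the helper `resolve_by_order` are hypothetical and not part of DMLex; the point is only that a position-based link keeps resolving after a reorder, but to a different object.

```python
# Hypothetical data: one entry whose senses are addressed only by position.
senses = ["small furry animal", "a spiteful person"]


def resolve_by_order(sense_list, listing_order):
    """Return the sense at a 1-based listing order (illustrative helper)."""
    return sense_list[listing_order - 1]


link = 1  # stored elsewhere as .../sense/1
before = resolve_by_order(senses, link)

# An editor swaps the two senses; the stored link is not updated.
senses.reverse()
after = resolve_by_order(senses, link)

# The link still resolves, so nothing flags the change, but the target differs.
print(before, "->", after)
```

Nothing in this sketch fails or raises, which is exactly the concern: the link stays valid while its meaning changes.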

  • The order of elements with multiple properties is not clear, e.g., should it be cat~1~noun or cat~noun~1

That's right, thanks for spotting -- I will state that explicitly.

  • It should be noted that the second case (single unique property, arity=1, value is an object) actually does not occur in the spec

OK

  • We should give the result of applying this schema along with each element definition

I can do that, too, I just didn't want this feature to be over-presented (maybe it's not that important :) ) -- what do others think?

@vojtech-kovar
Contributor Author

I have now implemented the changes we agreed on, please review if you can :)

@jmccrae
Contributor

jmccrae commented Apr 11, 2024

I understood it's because we want to allow a single etymology (or sense) without description (or definition) under each entry, not multiple. Am I reading it wrong?

In fact, it is possible to have multiple etymologies without description under the same entry, this is the problem.

<para><literal>lexicographicResource.uri/entry/entryID/sense/senseID/example/exampleID</literal></para>

<section id="objectids">
<title>Object IDs</title>
Contributor

Object ID is potentially ambiguous for etymology

@jmccrae
Contributor

jmccrae commented Apr 11, 2024

Another issue is that the fields are not identified, so in some cases the identifier may be ambiguous:

<entry>
  <headword>foo</headword>
  <sense>
    <indicator>x</indicator>
  </sense>
  <sense>
    <definition>x</definition>
  </sense>
</entry>

Both resolve to http://www.example.com/lexicographicResource/entry/foo/sense/x
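
The collision above can be reproduced with a short Python sketch. It assumes an object ID built by joining only the non-empty UNIQUE values with `~`, without naming the fields; `naive_sense_iri` is a hypothetical helper that mirrors the behaviour being criticised, not the scheme that was eventually merged.

```python
from urllib.parse import quote

BASE = "http://www.example.com/lexicographicResource"  # base IRI from the comment above


def naive_sense_iri(entry_headword, sense):
    """Join the non-empty unique values of a sense without naming the fields."""
    values = [v for v in (sense.get("indicator"), sense.get("definition")) if v]
    object_id = "~".join(quote(v, safe="") for v in values)
    return f"{BASE}/entry/{quote(entry_headword, safe='')}/sense/{object_id}"


first = naive_sense_iri("foo", {"indicator": "x"})
second = naive_sense_iri("foo", {"definition": "x"})

print(first)
print(first == second)  # the two distinct senses share one IRI
```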


<para>Every fragment <glossterm>should</glossterm> be assigned a unique IRI (Internationalized Resource Identifier [<link linkend="bib_rfc3987">RFC 3987</link>]), composed of <literal>lexicographicResource.uri</literal> and a sequence of identifiers that uniquely determines the path in the DMLex tree structure. The IRI of the root object <literal>lexicographicResource</literal> is the value of its attribute <literal>lexicographicResource.uri</literal>, converted to IRI according to the algorithm specified in [<link linkend="bib_rfc3987">RFC 3987</link>], if needed. The IRIs of its direct children are constructed as follows:</para>

<para><literal>lexicographicResource.uri/objectTypeName/objectID</literal></para>
Contributor

Shouldn't this be lexicographicResource.uri#objectTypeName/objectID?

Contributor

This PR claims to create fragments but does not. Fragments are the portion of an (HTTP) URI that occur after the # symbol. This is an important distinction as, for example this URL http://www.example.com/lexicon/lexicographicResource/entry/cat refers to a document that describes only the entry cat. In contrast http://www.example.com/lexicon#lexicographic/entry/cat refers to the identified section of the document http://www.example.com/lexicon.
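
The distinction can be seen by splitting the two example URLs with Python's standard `urllib.parse.urlsplit`. This is only an illustration of how a generic URI parser separates the path from the fragment; it says nothing about what a DMLex server should return.

```python
from urllib.parse import urlsplit

path_style = urlsplit("http://www.example.com/lexicon/lexicographicResource/entry/cat")
fragment_style = urlsplit("http://www.example.com/lexicon#lexicographic/entry/cat")

# Path style: the whole identifier is part of the path requested from the server.
print(path_style.path, repr(path_style.fragment))

# Fragment style: only /lexicon is requested; the fragment stays with the client.
print(fragment_style.path, repr(fragment_style.fragment))
```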

Contributor

I think this is just a terminological misunderstanding. "Fragment" in fragment identification does not refer to URI fragment. It's merely a fragment in the sense of part of the data.

Contributor

First, the term "fragment" is pretty widely understood and I wouldn't redefine it. Secondly, I think you do want fragments in this sense as otherwise it is very challenging to create URIs that resolve, and this would be a big technical issue.

Contributor Author

  • well, I was placing my content within the already existing section Fragment identification, so I guess the meaning of the word "fragment" was already (re)defined and it has the sense of (partial) DMLex objects. I could rename 3.2.1 Fragment IRIs to Object IRIs to avoid confusion but I don't want to touch 3.2 now as it was introduced before me (unless there is a consensus that it should be renamed)
  • I don't think we want URI fragments, and I don't understand how it is challenging; can you explain? My view on URI fragments is that they are anchors within the response of the previous part of the URI -- e.g. you download a web page based on the URI without the fragment and scroll down to the anchor defined by the fragment. Also, the RFC says:

The fragment's format and resolution is therefore dependent on the media type of a potentially retrieved representation, even though such a retrieval is only performed if the URI is dereferenced.

I don't think this matches our semantics, we want to identify the objects directly, not as anchors within the whole lexicographic resource. (But I don't think it's something extremely important, will not fight against fragments if more of you think it's better.)

Contributor

I do not think we ever discussed that we would want the IRIs to be usable as URLs so this is a far reaching implicit assumption that is false at this moment.

URIs starting with http are HTTP URLs. The examples you have given are HTTP URLs so the assumption is pretty clear.

But even if we agree that we want this, I strongly object to adhering to any such notion of document in the sense of what HTTP or similar protocols might have defined ages ago and no one actually follows.

HTTP is a foundational standard of the web, I don't understand why you think no-one follows it

The IRIs are strictly to be understood as links within DMLex-internal addressing mechanisms, and under no circumstances as addresses giving any expectations as to what they should return, particularly not within the framework of one arbitrary protocol such as HTTP.

I am not sure I agree on the need for an internal addressing mechanism, but if we do introduce a mechanism like this (once we have ironed out the bugs), it should not be a mechanism that looks like HTTP but does not function like HTTP. Creating our own rules that contradict one of the most widely deployed standards is only likely to lead to confusion and challenges in implementation.

Btw, even for HTTP, I think it is completely fine for implementers to return whole documents, or any portions of them -- DMLex is not designed to be a round-trip mechanism, this is simply out of the scope.

Returning the whole lexicon document for all URIs is the behaviour we get for free if we implement this using # fragments. Otherwise we put a lot of technical questions to implementers, such as whose job it is to validate these identifiers and how can this be implemented on widely-used servers (Apache, nginx, etc.).

I don't understand your resistance here. This comment is about changing one character in a URI to make it conformant with widely-used standards.

Contributor

(My comments about IRIs/URLs are mainly about resolvability: we do not enforce that, so in that sense that might not be a valid/usable URL (and it will frequently be the case it is not usable) --- but that's a minor point.)

The core of my objection is though the fact that the whole dictionary is not to be seen as one HTTP document -- this is completely up to the implementation what a document is in the context of HTTP.

So "Returning the whole lexicon document for all URIs is the behaviour we get for free if we implement this using # fragments" is very much an unwanted behaviour. Depending on the context, I want to be able to make different HTTP fragments over the same DMLex fragment. For example, you have links between entries or senses, but you want to navigate the user to a particular example or some other part within an HTML page -- making everything after the lexicographicResource a URL fragment makes it impossible to anchor anything within a particular entry.

Contributor

The core of my objection is though the fact that the whole dictionary is not to be seen as one HTTP document -- this is completely up to the implementation what a document is in the context of HTTP.

Our standard says that the lexicon (or entry) is a single document in XML and JSON serializations.

So "Returning the whole lexicon document for all URIs is the behaviour we get for free if we implement this using # fragments" is very much an unwanted behaviour. Depending on the context, I want to be able to make different HTTP fragments over the same DMLex fragment. For example, you have links between entries or senses, but you want to navigate the user to a particular example or some other part within an HTML page -- making everything after the lexicographicResource a URL fragment makes it impossible to anchor anything within a particular entry.

I am not really sure I understand... "navigating to a particular [element] within an HTML page" is the use case of fragments. A particular application could easily further extend this fragment scheme if they wish so there is no challenge with adding extra fragments to the "DMLex fragments", we are simply defining one mechanism within a DMLex document.

Contributor Author

I don't think the argument is about following or not following HTTP conventions -- we all want to follow that. I think the core disagreement is about this:

Returning the whole lexicon document for all URIs is the behaviour we get for free

That's right -- but I think this behaviour does not play well with DMLex principles -- I always felt that its nature is interlinked objects, with lexicographicResource being just one of them and nothing special. Using HTTP fragments would make it very special, and would (kind of) enforce downloading the whole lexicographic resource whenever asking for a single entry, sense, or even example. I don't like that. (But I am still new to these discussions, and yes, it's just one character, so do say and I will do as you say ;-))

Contributor

The core of my objection is though the fact that the whole dictionary is not to be seen as one HTTP document -- this is completely up to the implementation what a document is in the context of HTTP.

Our standard says that the lexicon (or entry) is a single document in XML and JSON serializations.

Yes, but that's a completely different thing and it is relevant only for those two particular serializations. The addressing mechanism is not serialization-specific, so this is not relevant.

So "Returning the whole lexicon document for all URIs is the behaviour we get for free if we implement this using # fragments" is very much an unwanted behaviour. Depending on the context, I want to be able to make different HTTP fragments over the same DMLex fragment. For example, you have links between entries or senses, but you want to navigate the user to a particular example or some other part within an HTML page -- making everything after the lexicographicResource a URL fragment makes it impossible to anchor anything within a particular entry.

I am not really sure I understand... "navigating to a particular [element] within an HTML page" is the use case of fragments. A particular application could easily further extend this fragment scheme if they wish so there is no challenge with adding extra fragments to the "DMLex fragments", we are simply defining one mechanism within a DMLex document.

That's not true -- the RFC for URI (https://datatracker.ietf.org/doc/html/rfc3986) is very clear that the fragment may not contain a hash sign.

The bottom line is:

  • the addressing mechanism is not serialization specific, there is no concept of a document
  • if (rarely) the URLs would be resolvable, we want to make it possible to anchor to arbitrary response parts, therefore we do not want to use the hash sign anywhere.
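
The RFC 3986 restriction cited above can be checked mechanically. The sketch below transcribes the `fragment` production from RFC 3986, section 3.5, into a regular expression; `is_valid_fragment` is a hypothetical helper, and the extra non-ASCII characters that RFC 3987 permits in IRIs are deliberately left out.

```python
import re

# fragment = *( pchar / "/" / "?" )
# pchar    = unreserved / pct-encoded / sub-delims / ":" / "@"
FRAGMENT = re.compile(r"(?:[A-Za-z0-9\-._~!$&'()*+,;=:@/?]|%[0-9A-Fa-f]{2})*")


def is_valid_fragment(text):
    """Return True if text matches the RFC 3986 fragment grammar."""
    return FRAGMENT.fullmatch(text) is not None


plain = is_valid_fragment("entry/cat~1~noun")
nested = is_valid_fragment("entry/cat~1~noun#example-2")  # second '#' inside the fragment

print(plain, nested)
```

A second `#` is rejected, so an identifier that already occupies the fragment leaves no room for a further application-level anchor unless it is percent-encoded.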

@jmccrae
Contributor

jmccrae commented Apr 11, 2024

Comment on empty specifiers should be added before acceptance of this PR

@jmccrae
Contributor

jmccrae commented Apr 11, 2024

A couple more potentially ambiguous results.

<entry homographNumber="0">
  <headword>test</headword>
</entry>
<entry>
  <headword>test</headword>
  <definition>0</definition>
</entry>
<pronunciation soundFile="x"/>
<pronunciation>
  <transcription>x</transcription>
</pronunciation>

I checked the others :)

@jmccrae
Contributor

jmccrae commented Apr 12, 2024

One further comment, not even sure if this is a bug, but it is not possible to construct a fragment identifier for member as there are no unique properties for relation

@vojtech-kovar
Contributor Author

One further comment, not even sure if this is a bug, but it is not possible to construct a fragment identifier for member as there are no unique properties for relation

Yes, that's correct -- the procedure cannot live without the UNIQUE identifiers. I tried to say it by the following sentence:

DMLex does not define the structure of IRIs for object types without UNIQUE properties.

should I add anything to it?

  - no empty object IDs (replace empty values with "0" and escape "0")
  - avoid conflicts "indicator:x, empty definition"
                vs. "empty indicator, definition:x"
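
A rough Python sketch of the idea in these commit notes: every UNIQUE property gets a fixed slot, and an empty slot is written as `0`. The slot order (indicator, then definition) follows the example IRIs elsewhere in this thread. How a literal `0` value is escaped is not stated here, so the `%30` used below is purely an assumption, and `slot` and `sense_id` are hypothetical names.

```python
from urllib.parse import quote

EMPTY = "0"  # placeholder for a missing value, as in the commit note above


def slot(value):
    """Encode one UNIQUE value for a fixed position in the object ID."""
    if not value:
        return EMPTY
    if value == EMPTY:
        # Assumption: a literal "0" is escaped as "%30"; the thread does not
        # say which escape the PR actually uses.
        return "%30"
    return quote(value, safe="")


def sense_id(indicator=None, definition=None):
    # Fixed slot order (indicator, then definition) keeps the fields apart.
    return "~".join([slot(indicator), slot(definition)])


only_indicator = sense_id(indicator="x")
only_definition = sense_id(definition="x")
literal_zero = sense_id(definition="0")

print(only_indicator, only_definition, literal_zero)
```

Note that `quote` leaves `~` unescaped, so a real implementation would also have to escape the separator when it occurs inside a value.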
@vojtech-kovar vojtech-kovar requested a review from jmccrae April 17, 2024 13:30
@jmccrae
Copy link
Contributor

jmccrae commented Apr 18, 2024

  1. The uniqueness issues seem to be fixed. Although we still need a resolution to Example A1.11 fails uniqueness validation #123 for senses.
  2. I find the choice of which elements can be addressed to be rather arbitrary and I cannot see how this fits with any use cases (e.g., why not relation?)
  3. I am against using listingOrder -- you are right it would be easy to use (and short), but if you use it as a link and then the listing order changes without changing the link (which can happen anytime if the resource is not frozen), the link will still work (i.e. nobody will notice anything, everything will be valid etc.) but it will point to a wrong object. I think we want to avoid that.

I see this problem with listingOrder, but currently we also change the URI every time a unique element (e.g., definition) changes, and this is tricky to implement in a dynamic web application use case. We could allow listingOrder to be used in an XPath-like syntax so we could refer to a sense as

EITHER
http://www.example.com/lexicon/entry/abandon~0~verb/sense/0~/to%20suddenly%20leave%20a%20place%20or%20a%20person/
OR
http://www.example.com/lexicon/entry[1]/sense[1]
  4. Are we concerned about the URL maximum length (2048 bytes)? It seems very easy to reach with this very verbose URL scheme
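
To give a feel for the length concern, the sketch below measures two IRIs shaped like the examples in this thread. The 2048-byte figure is the one quoted above; the 700-character definition is invented for illustration, and `sense_iri` is a hypothetical helper.

```python
from urllib.parse import quote

MAX_URL_BYTES = 2048  # limit mentioned in the comment above
BASE = "http://www.example.com/lexicon"


def sense_iri(entry_id, definition):
    """Build a content-based sense IRI with an empty indicator slot ("0")."""
    return f"{BASE}/entry/{entry_id}/sense/0~{quote(definition, safe='')}"


short = sense_iri("abandon~0~verb", "to suddenly leave a place or a person")

# Hypothetical long definition: 700 non-ASCII characters. Each takes two
# UTF-8 bytes and therefore six characters once percent-encoded.
long_ = sense_iri("abandon~0~verb", "é" * 700)

for iri in (short, long_):
    print(len(iri), len(iri) > MAX_URL_BYTES)
```

Percent-encoding multiplies the length of non-ASCII text, so definitions in many languages would reach the limit much sooner than their character count suggests.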

@jmccrae
Contributor

jmccrae commented Apr 22, 2024

I have thought about this over the weekend and I see four key issues with the proposal as it stands

  1. It doesn't satisfy some use cases: There are probably three applicable use cases here. Firstly, to support editing environments using a dynamic server, where each server should implement the URL scheme as defined here. However, most existing such interfaces I can find (e.g., https://en-word.net/, https://en.wiktionary.org) seem to use a mix of fragments and paths to identify content. I cannot find a single editor that provides a unique page for elements such as senses and definitions, which is implied by the @vojtech-kovar and @mjakubicek proposal. However, my fix does not solve this either, since forcing the lexicon to be edited on a single page is clearly not viable. The proposal of @vojtech-kovar and @mjakubicek seems incompatible with the use case of static hosting and exchange, and also with the use of conversion tools.
  2. It does not improve interoperability: The goal of this PR is to provide a "method for addressing DMLex objects present on-line" in order to improve "general interoperability". The overall goal of this standardisation is to help producers and consumers to work through a standard model. This PR requires data producers to adopt a particular IRI scheme; however, there is no clear idea of what these IRIs should resolve to under HTTP. As such, the usefulness for consumers is not clear. In other words, we are building an addressing system without knowing what is at these addresses! This puts a burden on producers without providing instructions that are helpful to consumers.
  3. The identifiers are unstable: As discussed with @vojtech-kovar in the argument against using listingOrder, changes to the data change the identifier, and so the identifiers are unstable. This is a general problem with this scheme. For example, a minor change to a definition would require updating the parent element's ID (sense), siblings' IDs (example) and all incoming links. I think this is technically very challenging (in conflict with @michmech's vision of the model) and can probably only be implemented by search/replace/hope or using another internal identifier scheme (in which case what is the point of this?).
  4. Identifiers are long and involve ugly tradeoffs: The identifiers this scheme proposes are very long and this may lead to technical issues. We also have to make some ugly tradeoffs to avoid ambiguity, for example including a homograph number in every identifier (word~0~noun) even when not needed and adding 0~ in many places. Aesthetics are not a showstopper, but they will certainly limit the adoption of this model
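
The instability in point 3 can be illustrated with a small Python sketch. It assumes content-based IDs shaped like the examples in this thread and an example object identified by its text; `sense_iri` and `example_iri` are hypothetical helpers, not taken from the PR.

```python
from urllib.parse import quote

BASE = "http://www.example.com/lexicon/entry/cat~1~noun"


def sense_iri(definition):
    """Content-based sense IRI with an empty indicator slot ("0")."""
    return f"{BASE}/sense/0~{quote(definition, safe='')}"


def example_iri(definition, example_text):
    # A child IRI embeds the parent's IRI, so it inherits every change to it.
    return f"{sense_iri(definition)}/example/{quote(example_text, safe='')}"


old = example_iri("small furry animal", "The cat sat.")
new = example_iri("small, furry animal", "The cat sat.")  # one comma added

print(old)
print(new)
print(old == new)
```

The example text itself is unchanged, yet its IRI differs, so every stored link to it would have to be rewritten after the edit.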

As solutions, I see the following approaches

  1. Ignore the problems: This scheme is marked as not required so we could just accept and move on and let the implementers handle any problems.
  2. Try to fix the issues: Issue 1 seems quite thorny as the position of the # symbol in a URI is important technically and I am not sure how we define enough modelling to allow implementers to put it anywhere. An alternative would be to not define full IRIs, but make this something like XPath for DMLEX, but not tied to the XML serialization. However, even if we solve Issue 1, Issue 2 seems fairly intractable and I don't think we avoid Issue 3 and 4 without a radical redesign
  3. Don't reinvent the wheel: We could instead just say that all identifiers are user-specified and documented in the data model. The advantages of this are:
    • It is already mostly implemented for sense, entry and collocateMarker
    • It solves the issues above
    • It conforms well with the xml:id in XML serialization
    • It conforms even better with the RDF serialization and avoids blank nodes
    • It is simple for us and implementers

@mjakubicek
Contributor

mjakubicek commented Apr 24, 2024

This discussion gets repetitive so let me just summarize why most of the objections are either false or largely missing the point of this PR.

First of all it needs to be emphasized that the specification is very clear about the fact that it describes an addressing mechanism on the model level and then there are serialization-specific addressing mechanisms which anyone is free to use (this would be e.g. XPath/XQuery for XML).

This answers Objection number 1, because if we are talking about static hosting of data files, those files are necessarily serialized in some format, and then a serialization-specific addressing mechanism should be used.
It is therefore false that this use case is not supported; on the contrary, there are a number of options to choose from, all depending on a particular serialization format. It should also be emphasized that coercing model-level descriptions towards a particular serialization is a malpractice to be avoided.

Objection number 2 says "PR requires data producers to adopt a particular IRI scheme", which is not true (it is optional), and it completely ignores the primary motivation behind a model-level addressing mechanism, i.e. being able to address without the restrictions of any particular serialization method. For reasons not explained, this objection instead keeps talking about a request-response processing mechanism, which, again, is not the primary motivation behind the addressing, and can be easily done using any serialization-specific addressing mechanism. Again, the primary motivation of the model-level addressing is to point to a particular DMLex object in a serialization-unspecific way, not to define a request-response round-trip.

The issues described in Objection number 3 were also discussed multiple times and they are not very relevant to this PR. All this is intentional and in line with best lexicographic as well as data maintenance practices to prevent unintentional data degradation. The principles of DMLex are to remove processing complexity where it is not necessary, not where we would arbitrarily wish to do so. The fact that many tools currently do not exercise these integrity checks suggests that it is all the more important to promote them in the standard.

Objection number 4 is true, but it is important to realize that the links are not meant to be human-processed, or human-presented in the full form. They would be machine-processed and visualized in implementation-specific ways that will suit the user/device/situation context. So yes, the links could sometimes be long and ugly, but also in many cases rather short and easy to interpret.

To sum up, I find all the objections completely invalid and do not understand the motivation behind bringing them again and again without any reasonable justification.

@jmccrae
Contributor

jmccrae commented Apr 24, 2024

You are asking why this is important, so I will try to reiterate this:

  1. I know that identifiers are defined at the model-level, which is an abstract level. Abstract models need to be instantiated, and my argument is that this proposal seems impossible to instantiate on any real-world, serialized data (all data is serialized somehow in the real world). You claim that serializations should use different mechanisms such as XPath, are you implying that this is a proposal with no real-world (i.e., serialized) applicability?
  2. This proposal defines IRI, which identify resources. Resources can be webpages or XML documents and usually are, but they can also be abstract concepts. For representing references to abstract concepts, the Resource Description Framework was invented. As an RDF expert, I have concerns about this proposal. In particular, addressing objects in a serialization-unspecific way usually requires methods like content negotiation to be implementable.
  3. I have implemented this proposal and it took nearly 800 lines of code. I had to change the proposal in a way you find unacceptable to make this work (in order to obtain valid relative URIs in RDF serialization). It is not simple and the implementation discovered several other bugs (etymon should be unique on etymology #116, Should collocateMarker have a uniqueness constraints? #122, Example A1.11 fails uniqueness validation #123 and several documented on this PR). There is also a very simple alternative proposal (Solution 3).

@vojtech-kovar
Contributor Author

In the beginning of all this we wanted recommended addresses for all DMLex objects, based on the data (and namely the values of the UNIQUE properties), not arbitrary IDs, nor a particular serialization. It was all about (and only about) suggesting unique identifiers, not prescribing how they should behave if used in HTTP requests or in any other particular scenario. I get it now that @jmccrae does not like this very principle (to put it mildly), on the other hand we agreed we will do it in a meeting with all of us present, so I took it as agreed.

It was crystal clear from the very beginning that it is not possible to devise a method of addressing that will guarantee that all the possible use cases will work out of the box. I am pretty sure that we cannot even predict any substantial part of the possible use case scenarios, we can just bring some arbitrary examples.

But now we are (John is) bringing one arbitrary use case after another and arguing it does not work out of the box for them. Well, it doesn't. It is not possible to satisfy everyone. (And I don't like trying to satisfy all the use cases we can think of, especially by complicating the DMLex model itself, like we did at the last meeting with the new property deciding if '/' or '#' is used. None of the use cases, nor the whole addressing itself, is so important that it would be worth making the model more complex.)

So, instead of fiddling with arbitrary use cases, I think we should answer the main question: "Do we want a model-level mechanism as described in the first paragraph, even though it does not satisfy all the use-cases perfectly?" Do we?

I think the model-level addressing brings a choice: either use this, even if it requires some extra effort with particular formats/setups, or use a serialization-specific addressing and/or their own IDs if it's more convenient. The advantage of the former option would be universality (independence from a particular resource, its serialization format and arbitrary IDs -- if you are e.g. a dictionary aggregator, this could make you happy) and readability (even if the address leads to nowhere, a human is able to decode/fix it, unlike an address with arbitrary IDs). Of course, we can as well decide to drop all this (John's option 3, and also the current status), which leaves only the latter option.

@michmech @DavidFatDavidF please comment

@jmccrae
Contributor

jmccrae commented Apr 25, 2024

I think that this is getting a bit out of hand for what is a small part of this overall great project. When summarising the issues discussed in this long thread I have been accused of "bringing them again and again without any reasonable justification" and by defining three use cases I am accused of "bringing one arbitrary use case after another". Can we chill it please?

As I have made clear, I am open to compromise (Option 2) although as is clear, my personal opinion is that user-defined identifiers (Option 3) would be superior to content-based ones.

These concerns are based on blocking technical issues that have become clear to me from implementing this system and I have outlined them clearly above.

To implement the compromise option (Option 2) I would propose the following text:

<para>Every top-level model object may be assigned one or more identifiers 
that uniquely determine the path in the DMLex tree structure. These can be used to construct IRIs by 
appending them to the IRI of the root object. The IRI of the root element is the value of its attribute <literal>lexicographicResource.uri</literal>, converted to IRI according to the algorithm specified in
 [<link linkend="bib_rfc3987">RFC 3987</link>]. IRIs can be constructed using schemes such as the 
following:</para>

<para><literal>lexicographicResource.uri/objectTypeName/objectID</literal></para>
<para><literal>lexicographicResource.uri#objectTypeName/objectID</literal></para>

<para>Other schemas may be adopted by applications. This standard does not mandate the adoption of any 
IRI schema or describe what kind of resources are located by IRIs constructed in this way.</para>

etc...

Then all examples are changed so that they do not include the HTTP URI (e.g., entry/cat~1~noun/sense/0~small%20furry%20animal instead of http://www.example.com/lexicon/entry/cat~1~noun/sense/0~small%20furry%20animal). We continue to define objectIDs but do not define IRIs based on them. Our identifiers no longer start with http and thus don't depend on a serialization.
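To make the shape of these object IDs concrete, here is a minimal sketch that reproduces the example string above. It assumes (this is not spelled out in this comment) that each step is the object type name followed by the object's identifying values joined with `~`, with each value percent-encoded. The function names `object_id` and `fragment_path` are hypothetical, and which properties contribute to an ID is left to the specification.

```python
from urllib.parse import quote


def object_id(values):
    """Join identifying values with '~', percent-encoding each one.

    Assumption: values are percent-encoded individually. Note that
    urllib's quote() leaves a literal '~' untouched, so this sketch does
    not handle values that themselves contain '~'.
    """
    return "~".join(quote(str(value), safe="") for value in values)


def fragment_path(steps):
    """Build a path from (object type name, identifying values) pairs."""
    return "/".join(f"{type_name}/{object_id(values)}" for type_name, values in steps)


path = fragment_path([
    ("entry", ["cat", 1, "noun"]),
    ("sense", [0, "small furry animal"]),
])
print(path)
```

Because no root IRI is prepended, the result is a plain string that does not depend on any protocol or serialization.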

This satisfies Problem 1, as it is much more vague and does not mandate a URI schema so more use cases can be satisfied. Problem 2 is mostly side-stepped as this proposal now doesn't require anything of producers or consumers of data. I also think it is closer to what @mjakubicek has in mind, as he doesn't want a "request-response" mechanism based on serialization, while an HTTP URI requires that you can make an HTTP request and receive a serialized response.

I would also reiterate the proposal to allow object IDs based on listingOrder:

EITHER
entry/abandon~0~verb/sense/0~to%20suddenly%20leave%20a%20place%20or%20a%20person
OR
entry_1/sense_1

The adoption of listing order as an alternative mechanism would solve Problem 4, and Problem 3 would be reduced as implementers can choose the option that is more stable for their application.

I am happy to turn this into a PR if others are happy with this.

@mjakubicek
Contributor

You are asking why this is important, so I will try to reiterate this:

  1. I know that identifiers are defined at the model-level, which is an abstract level. Abstract models need to be instantiated, and my argument is that this proposal seems impossible to instantiate on any real-world, serialized data (all data is serialized somehow in the real world). You claim that serializations should use different mechanisms such as XPath, are you implying that this is a proposal with no real-world (i.e., serialized) applicability?

This is utter nonsense, the fragment ID is just a string. That's it John, a string. You do whatever you like with it.

  1. This proposal defines IRI, which identify resources. Resources can be webpages or XML documents and usually are, but they can also be abstract concepts. For representing references to abstract concepts, the Resource Description Framework was invented. As an RDF expert, I have concerns about this proposal. In particular, addressing objects in a serialization-unspecific way usually requires methods like content negotiation to be implementable.

You see John, this is the problem. You're forcing in your world here, that we are not necessarily interested in. Making an IRI does not bring in RDF, nor does it bring in content negotiation. You have to live with the fact that others do not see things that way. An IRI is just a string. Nothing else.

To quote from https://www.ietf.org/rfc/rfc3987.txt:

"An IRI is a sequence of characters from the Universal Character Set (Unicode/ISO 10646)"

The standard also makes it absolutely clear, in multiple places, that IRIs are not bound to a protocol in this regard, e.g.

"Applications using IRIs as identity tokens with no relationship to a protocol MUST use the Simple String Comparison"

This is exactly our case, it's a string, it compares as a string, and it serves as identification of some DMLex entry part for us. We may call them "DMLex fragment identification strings" and not "IRIs", but given your attitude I doubt this would help here to move forward.
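As an illustration of what "compares as a string" implies, the following sketch treats two identifiers as the same only if they are equal character by character. The helper name `same_identifier` and the sample strings are made up for this example; the point is that no percent-decoding or Unicode normalization happens during comparison, so forms that look equivalent to a human can still compare as different.

```python
import unicodedata


def same_identifier(a: str, b: str) -> bool:
    """Character-by-character comparison, with no normalization step."""
    return a == b


literal = "entry/caf\u00e9"                          # precomposed e-acute
escaped = "entry/caf%C3%A9"                          # percent-encoded UTF-8
decomposed = unicodedata.normalize("NFD", literal)   # 'e' + combining accent

print(same_identifier(literal, literal))     # identical strings
print(same_identifier(literal, escaped))     # different characters
print(same_identifier(literal, decomposed))  # different code points
```

A practical consequence is that producers would need to agree on one escaping and normalization convention before comparing such strings across resources.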

  1. I have implemented this proposal and it took nearly 800 lines of code. I had to change the proposal in a way you find unacceptable to make this work (in order to obtain valid relative URIs in RDF serialization). It is not simple and the implementation discovered several other bugs (etymon should be unique on etymology #116, Should collocateMarker have a uniqueness constraints? #122, Example A1.11 fails uniqueness validation #123 and several documented on this PR). There is also a very simple alternative proposal (Solution 3).

Yes, all those are valid integrity checks that need to be performed, thank you for that. We all know we need to do more of them, to find all the small forgotten bugs in the spec here and there. None of that presents any substantial challenge.

In any case, this discussion leads nowhere. I find all the issues raised by John as void and none of the proposals by John are acceptable for me, particularly not the variant number 3, which is absolutely disastrous as discussed many times.

For the next meeting, I propose voting on this PR as is; and if it is not approved, we simply remove fragment identification from the specs completely and move on.

@mjakubicek
Contributor

This satisfies Problem 1, as it is much more vague and does not mandate a URI schema so more use cases can be satisfied. Problem 2 is mostly side-stepped as this proposal now doesn't require anything of producers or consumers of data. I also think it is closer to what @mjakubicek has in mind, as he doesn't want a "request-response" mechanism based on serialization, while an HTTP URI requires that you can make an HTTP request and receive a serialized response.

Lastly: it does NOT. "an HTTP URI requires that you can make an HTTP request". There is no "HTTP URI". Just "URI", and a URI (or IRI, in our case), unlike a URL, does not mandate that you be able to locate the resource. The name of the protocol does not affect this.

But if all that bugs you is the http:// scheme, we may just use urn: instead. It would perhaps fit better even from the theoretical perspective, though that's going to be a very subjective issue.

@jmccrae
Contributor

jmccrae commented Apr 25, 2024

@mjakubicek, you continue to make highly uncivil comments on a public forum.

We may call them "DMLex fragment identification strings" and not "IRIs", but given your attitude I doubt this would help here to move forward.

I think this is exactly what I just proposed, right?

There is no "HTTP URI"

HTTP URI is an established term. It is pretty clear it means URIs that use the http scheme.

But if all that bugs you is the http:// scheme, we may just use urn: instead. It would perhaps fit better even from the theoretical perspective, though that's going to be a very subjective issue.

I would support this, however I note that it requires a registration process with IANA as described in RFC 8141

@mjakubicek
Contributor

We may call them "DMLex fragment identification strings" and not "IRIs", but given your attitude I doubt this would help here to move forward.

I think this is exactly what I just proposed, right?

So if we keep everything else as is, and replace all occurrences of "IRI" in the spec with "DMLex fragment identification string", you will vote for this?

There is no "HTTP URI"

HTTP URI is an established term. It is pretty clear it means URIs that use the http scheme.

Yes, but not requiring that you can make an HTTP request, which is what you were saying, and I was refuting. It's not about quibbling, but about facts John. Facts that you present here that are simply not true, and you continue doing so despite being falsified multiple times.

But if all that bugs you is the http:// scheme, we may just use urn: instead. It would perhaps fit better even from the theoretical perspective, though that's going to be a very subjective issue.

I would support this, however I note that it requires a registration process with IANA as described in RFC 8141

Only if we would want to make our own namespace which we do not need to, there are other options (e.g. the tag namespace, maybe others too.) which require no central registration.

@jmccrae
Contributor

jmccrae commented Apr 25, 2024

We may call them "DMLex fragment identification strings" and not "IRIs", but given your attitude I doubt this would help here to move forward.

I think this is exactly what I just proposed, right?

So if we keep everything else as is, and replace all occurrences of "IRI" in the spec with "DMLex fragment identification string", you will vote for this?

I guess so, but I would prefer that they did not start with http as this would be confusing

Yes, but not requiring that you can make an HTTP request, which is what you were saying, and I was refuting. It's not about quibbling, but about facts John. Facts that you present here that are simply not true, and you continue doing so despite being falsified multiple times.

"The term "Uniform Resource Locator" (URL) refers to the subset of URIs that, in addition to identifying a resource, provide a means of locating the resource by describing its primary access mechanism (e.g., its network "location")" [RFC 3986]
"The HTTP URL scheme is used to designate Internet resources accessible using HTTP (HyperText Transfer Protocol)" [RFC 1738]

My facts are pretty clear.

@mjakubicek
Contributor

We may call them "DMLex fragment identification strings" and not "IRIs", but given your attitude I doubt this would help here to move forward.

I think this is exactly what I just proposed, right?

So if we keep everything else as is, and replace all occurrences of "IRI" in the spec with "DMLex fragment identification string", you will vote for this?

I guess so, but I would prefer that they did not start with http as this would be confusing

Fine, I think no one really worries about the scheme being used here, which I see as a completely arbitrary choice.
So, to avoid confusion, if this PR is changed so that all mentions of IRIs are replaced with "DMLex fragment identification string" and there is no "http://" prefix, you are happy with the rest and we can merge it and move on?

Yes, but not requiring that you can make an HTTP request, which is what you were saying, and I was refuting. It's not about quibbling, but about facts John. Facts that you present here that are simply not true, and you continue doing so despite being falsified multiple times.

"The term "Uniform Resource Locator" (URL) refers to the subset of URIs that, in addition to identifying a resource, provide a means of locating the resource by describing its primary access mechanism (e.g., its network "location")" [RFC 3986] "The HTTP URL scheme is used to designate Internet resources accessible using HTTP (HyperText Transfer Protocol)" [RFC 1738]

My facts are pretty clear.

Facts are clear in that you now for the first time talk about a URL (i.e. a Uniform Resource Locator, not URI which is Uniform Resource Identifier), which was never discussed and never considered and never mentioned before. What you were saying before was that "an HTTP URI requires that you can make an HTTP request" -- and this is simply not true, and thus all your seemingly necessary implications you were making thereof are not true as well.

@jmccrae
Contributor

jmccrae commented Apr 25, 2024

We may call them "DMLex fragment identification strings" and not "IRIs", but given your attitude I doubt this would help here to move forward.

I think this is exactly what I just proposed, right?

So if we keep everything else as is, and replace all occurrences of "IRI" in the spec with "DMLex fragment identification string", you will vote for this?

I guess so, but I would prefer that they did not start with http as this would be confusing

Fine, I think no one really worries about the scheme being used here, which I see as a completely arbitrary choice. So, to avoid confusion, if this PR is changed so that all mentions of IRIs are replaced with "DMLex fragment identification string" and there is no "http://" prefix, you are happy with the rest and we can merge it and move on?

You have exactly arrived at the solution I proposed this morning. Why would I object?

Of course, it needs to be implemented and #123 needs a resolution before this PR can be merged.

I also would like us to consider the use of listingOrder as an alternative mechanism, but I can make this a comment on the next CSD.

Yes, but not requiring that you can make an HTTP request, which is what you were saying, and I was refuting. It's not about quibbling, but about facts John. Facts that you present here that are simply not true, and you continue doing so despite being falsified multiple times.

"The term "Uniform Resource Locator" (URL) refers to the subset of URIs that, in addition to identifying a resource, provide a means of locating the resource by describing its primary access mechanism (e.g., its network "location")" [RFC 3986] "The HTTP URL scheme is used to designate Internet resources accessible using HTTP (HyperText Transfer Protocol)" [RFC 1738]
My facts are pretty clear.

Facts are clear in that you now for the first time talk about a URL (i.e. a Uniform Resource Locator, not URI which is Uniform Resource Identifier), which was never discussed and never considered and never mentioned before. What you were saying before was that "an HTTP URI requires that you can make an HTTP request" -- and this is simply not true, and thus all your seemingly necessary implications you were making thereof are not true as well.

We have already discussed URLs in fact:

I do not think we ever discussed that we would want the IRIs to be usable as URLs so this is a far reaching implicit assumption that is false at this moment. - @mjakubicek

URIs starting with http are HTTP URLs. The examples you have given are HTTP URLs so the assumption is pretty clear. - @jmccrae

That URLs designate such resources means that you only refer to resources that meet these requirements. Being accessible by HTTP means you can access them by making an HTTP request. Hence "an HTTP URL requires that you can make an HTTP request".

@mjakubicek
Contributor

Fine, I think no one really worries about the scheme being used here, which I see as a completely arbitrary choice. So, to avoid confusion, if this PR is changed so that all mentions of IRIs are replaced with "DMLex fragment identification string" and there is no "http://" prefix, you are happy with the rest and we can merge it and move on?

You have exactly arrived at the solution I proposed this morning. Why would I object?

Because this is not what your initial proposal was (this morning), as everyone can read up in the thread. I do not want the "#" to be part of "DMLex fragment identification strings", which is what your proposal starts with, and then continues on with other things, among others also mentioning this rename.

And that's why I'm double checking that we understand that the only change performed would be a wording issue solvable by a simple sed (i.e. find and replace command):

sed 's/IRI/DMLex fragment identification strings/g'

That's it.

We have already discussed URLs in fact:

I do not think we ever discussed that we would want the IRIs to be usable as URLs so this is a far reaching implicit assumption that is false at this moment. - @mjakubicek

Ok, you got me, we have already ruled them out once ;-)

@jmccrae
Contributor

jmccrae commented Apr 26, 2024

Fine, I think no one really worries about the scheme being used here, which I see as a completely arbitrary choice. So, to avoid confusion, if this PR is changed so that all mentions of IRIs are replaced with "DMLex fragment identification string" and there is no "http://" prefix, you are happy with the rest and we can merge it and move on?

You have exactly arrived at the solution I proposed this morning. Why would I object?

Because this is not what your initial proposal was (this morning), as everyone can read up in the thread. I do not want the "#" to be part of "DMLex fragment identification strings", which is what your proposal starts with, and then continues on with other things, among others also mentioning this rename.

And that's why I'm double checking that we understand that the only change performed would be a wording issue solvable by a simple sed (i.e. find and replace command):

sed 's/IRI/DMLex fragment identification strings/g'

That's it.

In principle that's right, although a quick look at the text shows that a little more care than a text replacement is needed!
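A small sketch of why a blind substitution needs care: the sed expression quoted above replaces the three letters "IRI" wherever they occur, so the plural "IRIs" ends up with a doubled "s". The functions `naive` and `careful` below are hypothetical illustrations written for this comment, not part of the PR.

```python
import re

SINGULAR = "DMLex fragment identification string"


def naive(text: str) -> str:
    """Mirror of the sed one-liner: replace every 'IRI' with the plural phrase."""
    return text.replace("IRI", SINGULAR + "s")


def careful(text: str) -> str:
    """Replace whole words only, and keep singular and plural apart."""
    def repl(match: re.Match) -> str:
        return SINGULAR + ("s" if match.group(1) else "")
    return re.sub(r"\bIRI(s?)\b", repl, text)


print(naive("IRIs can be constructed"))
print(careful("IRIs can be constructed"))
print(careful("the IRI of the root object"))
```

Even the word-boundary version leaves grammar to fix by hand, for example "an IRI" would become "an DMLex fragment identification string".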

The other part is removing the http:// prefix. I have a few suggestions here:

# Don't include lexicographicResource.uri at all (do we need it?)
entry/cat~1~noun

# Drop the http://
www.example.com/lexicon/entry/cat~1~noun

# Put the lexicographicResource.uri in brackets (one of the following)
[http://www.example.com/lexicon]entry/cat~1~noun
(http://www.example.com/lexicon)entry/cat~1~noun
<http://www.example.com/lexicon>entry/cat~1~noun

# Put the lexicographicResource.uri after the objectId
entry/cat~1~noun@http://www.example.com/lexicon

All seem good and avoid creating identifiers that are accidentally non-functioning URLs.
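For side-by-side comparison, the candidate forms listed above can be generated from the same two inputs. This is only a sketch: the function name `candidate_forms` and the dictionary keys are invented here, and none of these forms is endorsed over the others.

```python
def candidate_forms(resource_uri: str, object_id: str) -> dict:
    """Render each suggested layout for one resource URI and one object ID."""
    # Everything after the first '://' if a scheme is present.
    without_scheme = resource_uri.split("://", 1)[-1]
    return {
        "object_id_only": object_id,
        "scheme_dropped": f"{without_scheme}/{object_id}",
        "square_brackets": f"[{resource_uri}]{object_id}",
        "round_brackets": f"({resource_uri}){object_id}",
        "angle_brackets": f"<{resource_uri}>{object_id}",
        "uri_after_id": f"{object_id}@{resource_uri}",
    }


forms = candidate_forms("http://www.example.com/lexicon", "entry/cat~1~noun")
for name, value in forms.items():
    print(f"{name}: {value}")
```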

@vojtech-kovar
Contributor Author

I don't feel like adding more disagreement to this discussion, and nobody else wrote anything, so I did what you propose (i.e., renamed IRIs to "DMLex fragment identification strings" and removed the http:// prefix).

Just FTR: Though acceptable, I don't agree with it -- I think one of the reasons why we said first URIs and then IRIs is that they can be used as HTTP(S) URLs, which is an advantage, and we are now losing this option (kind of, as adding http:// is in fact not that complex an operation). At the same time, I am not bothered by many IRIs that don't work as HTTP URLs, or lead nowhere (I think I still don't fully understand John's reasons, but never mind).

I have also addressed the problem with #123, using listingOrder in cases where all the UNIQUE attributes are empty and there are more objects with duplicate IDs. (And the exact semantics of UNIQUEness still needs to be specified more precisely in the text somewhere around 1.3.5, I believe.)
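The fallback described here can be sketched roughly as follows. This is an assumption-laden illustration, not the wording merged into the spec: siblings are modelled as `(unique values, listingOrder)` pairs, the content-based ID is the values joined with `~` (escaping omitted for brevity), and the function name `sibling_ids` is hypothetical.

```python
from collections import Counter


def sibling_ids(siblings):
    """Return one ID per sibling object.

    Each sibling is a (unique_values, listing_order) pair. The
    content-based ID is used unless every unique value is empty AND
    another sibling has the same ID; then listingOrder is used instead.
    """
    content_ids = ["~".join(values) for values, _ in siblings]
    counts = Counter(content_ids)
    result = []
    for content_id, (values, listing_order) in zip(content_ids, siblings):
        all_empty = all(value == "" for value in values)
        if all_empty and counts[content_id] > 1:
            result.append(str(listing_order))
        else:
            result.append(content_id)
    return result


ids = sibling_ids([([""], 1), ([""], 2), (["Latin"], 3)])
print(ids)
```

A single sibling with empty values keeps its empty ID in this sketch, since there is nothing to disambiguate it from.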

@jmccrae
Contributor

jmccrae commented May 20, 2024

Okay, sounds like a good fix.

My objection is, I think, not that hard to understand: HTTP URLs that lead nowhere are called broken links and cause many problems, not just for the user experience but also for the SEO of websites. Implementing only working HTTP URLs ensures the global uniqueness of these identifiers and prevents malicious attacks.

@michmech michmech merged commit b310afc into oasis-tcs:master May 20, 2024
1 check passed

5 participants