YAML-LD datatypes (and tags for datatypes) #17

Open
VladimirAlexiev opened this issue May 31, 2022 · 27 comments
Labels
UCR Issue on Use Case/Recommendation
Milestone

Comments

@VladimirAlexiev
Contributor

VladimirAlexiev commented May 31, 2022

  • RDF uses explicitly tagged literals, in particular lang strings and XSD datatypes, including infinite precision integers and decimals.
  • JSON faithfully carries strings and small numbers; everything else must be represented as a string with a separate field to indicate the datatype (@type in JSON-LD). E.g. see "Elaborate on handling of JSON builtin types integer and double" (w3c/json-ld-syntax#387) for the pitfalls of using large integers or decimals.
  • YAML can use tags to carry literals faithfully (including infinite precision, "markers" like -.inf and .nan, datetimes), and even more complex structures. One could declare "YAML schemas" with additional tags, eg to represent all XSD datatypes

Why might we want more than "string plus @type"?

  • convenience (eg see dc:date below and many other examples)
  • normalization (reduce/eliminate lexical vs value space differences): it seems to me easier for a processor to normalize 02022-05-18 to 2022-05-18 if tagged as !xsd!date rather than looking at a parallel @type field.
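
To make the normalization point concrete, here is a minimal Python sketch that maps a lexical form to a canonical one when the datatype arrives attached to the value (as a tag would provide). It covers only xsd:integer and xsd:boolean for illustration; the function name `canonical_lexical` is made up for this example and is not part of any YAML-LD or JSON-LD API.

```python
import re

XSD = "http://www.w3.org/2001/XMLSchema#"


def canonical_lexical(datatype: str, lexical: str) -> str:
    """Return a canonical lexical form for a few XSD datatypes (illustrative only)."""
    text = lexical.strip()
    if datatype == XSD + "integer":
        if not re.fullmatch(r"[+-]?[0-9]+", text):
            raise ValueError(f"not an xsd:integer: {lexical!r}")
        # Python ints are arbitrary precision, so no digits are lost here.
        return str(int(text))
    if datatype == XSD + "boolean":
        if text in ("true", "1"):
            return "true"
        if text in ("false", "0"):
            return "false"
        raise ValueError(f"not an xsd:boolean: {lexical!r}")
    # Datatypes this sketch does not know about are passed through untouched.
    return lexical


print(canonical_lexical(XSD + "integer", "+0042"))  # 42
print(canonical_lexical(XSD + "boolean", "1"))      # true
```

Because the datatype travels with the scalar, the processor can pick the right rule without consulting a sibling field.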

Let's collect below examples of what we could want.


@gkellogg in ietf-wg-httpapi/mediatypes#8 (comment)

If I were to revisit anything in the JSON-LD data model, it would be the interpretation of JSON numbers to allow for decimal values. As it is now, JSON numbers are either interpreted as integers (long) or doubles based on the range of the number. But, in JSON-LD 1.1, we use The JSON Canonicalization Scheme (RFC8785) as a way to represent numbers in the rdf:JSON datatype serialization, which allows for a serialization form of either integer, decimal, or double. This really only comes into play in JSON-LD when creating RDF literals from native JSON numbers (something which is generally a bad design point, but is there to allow a reasonable interpretation of native JSON forms), but could also come into play when representing those numbers in the data model, and thus in serializations to forms such as YAML.


@VladimirAlexiev from #2:

  • Tags are comparable to datatypes.
  • The YAML JSON schema and core schema handle string, boolean, integer, float (the latter allows things like -.inf and .nan).
  • https://yaml.org/type/ handles a wider set, in particular dates and datetimes. But please note these are considered deprecated in 1.2 and are being removed in 1.3 (see "Remove timestamp examples from the 1.2 spec", yaml/yaml-spec#268 (comment)).
  • Maybe we should define a YAML schema to handle more xsd datatypes?
    • It should aim to eliminate problems related to the limited and non-standardized set of JSON literals. Eg the JSON number 12345678901234567890.12345 is converted to RDF literal "12345678901234567168"^^xsd:integer (see jsonld playground)
    • And could even work as a replacement of @type, eg
# short form using tags
dc:date: !xsd!date 2022-05-18

# instead of long form
dc:date: {"@type": xsd:date, "@value": 2022-05-18}
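
The number mangling mentioned above can be reproduced outside the JSON-LD playground. The following Python sketch (standard library only) parses the same JSON number once as a double and once as a decimal; it illustrates the precision loss and is not a description of how any JSON-LD processor is implemented.

```python
import json
from decimal import Decimal

SOURCE = '{"value": 12345678901234567890.12345}'

# Default behaviour: a JSON number with a fraction becomes an IEEE 754 double.
as_double = json.loads(SOURCE)["value"]

# Keeping the lexical form as a decimal preserves every digit.
as_decimal = json.loads(SOURCE, parse_float=Decimal)["value"]

print(int(as_double))  # 12345678901234567168
print(as_decimal)      # 12345678901234567890.12345
```

The first result is the nearest double to the original value, which is where the xsd:integer literal quoted above comes from.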

New ones:

  • is it at all feasible to write "foo"@en in YAML rather than a separate @language field?
  • JSON-LD cannot capture GeoJSON because that uses nested arrays. Can this be worked around somehow with a YAML tag for "2D array"?
@pchampin
Contributor

JSON-LD cannot capture GeoJSON because that uses nested arrays.

This is not the case anymore with JSON-LD 1.1 (example)

@ioggstream
Contributor

ioggstream commented May 31, 2022

This is another interesting direction to explore that does not seem to create inconsistencies with YAML spec, thanks Vladimir!
We could then ask the YAML community if it is possible to "register" in some way the xsd namespace to support this kind
of mappings and associate them to the yaml.org 1.2 namespace.

I suggest using full-URI tags in the examples for clarity, eg:

# see https://yaml.org/spec/1.2.2/#tag-directives
%TAG !xsd! tag:http://www.w3.org/2001/XMLSchema:
---
# short form using tags
dc:date: !xsd!date 2022-05-18

# instead of long form
dc:date: {"@type": xsd:date, "@value": 2022-05-18}

@anatoly-scherbakov
Contributor

I feel that manually specifying data types for each value is very tedious, and the tag syntax is not very intuitive. My feeling is this: why don't we delegate that task to the context?

The machine is smart enough to understand that a value of a dc:date is actually a literal with xsd:date datatype — and JSON-LD contexts can express that.

@ioggstream
Contributor

Can you post an example? Probably we should start collecting examples of "equivalence classes" of yaml files in this repo.

@VladimirAlexiev
Contributor Author

VladimirAlexiev commented Jun 1, 2022

@ioggstream
We should use the actual XSD namespace. The tag: URI scheme is recommended by the YAML people but is not mandatory, so I'd rather follow TimBL's principles of using resolvable URLs:
%TAG !xsd! http://www.w3.org/2001/XMLSchema#

https://yaml.org/spec/1.2.2/#104-other-schemas allows us to make an XSD YAML schema,
and we should ask the YAML people to publish it at https://yaml.org/type/

@anatoly-scherbakov Of course if a field ALWAYS uses the same datatype, the context can provide it. But dates in instance data often come in various granularities (same with numbers). So wouldn't it be nice to write this instead of the respective long forms?

dct:created:   !xsd!gYear    2000
dct:issued:    !xsd!date     2022-05-18
dct:modified:  !xsd!dateTime 2022-05-18T01:12:23

@pchampin
Contributor

pchampin commented Jun 1, 2022

@anatoly-scherbakov

My feeling is this: why don't we delegate that task to the context?

Of course we can, and that's an important role of JSON-LD contexts: making explicit some implicit constraints/dependencies (e.g. "this field expects this datatype").

However, we also need a way to make this information explicit (e.g. in the expanded form of JSON-LD). In JSON-LD, this is done with a value object {"@value": "...", "@type": "..." }. In YAML-LD, tags provide a more concise and more idiomatic way to do it.
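
As a rough illustration of that correspondence, the Python sketch below turns (property, datatype IRI, lexical form) triples, which is what a parser would hold after resolving tags, into JSON-LD value objects. The helper name `value_object` and the sample data are assumptions made for this example.

```python
import json

XSD = "http://www.w3.org/2001/XMLSchema#"


def value_object(datatype_iri: str, lexical: str) -> dict:
    """Build the expanded JSON-LD form of a typed literal."""
    return {"@type": datatype_iri, "@value": lexical}


# What a parser might hold after resolving tags on three date-like properties
# of different granularity (sample data, not taken from a real document).
tagged = [
    ("dct:created", XSD + "gYear", "2000"),
    ("dct:issued", XSD + "date", "2022-05-18"),
    ("dct:modified", XSD + "dateTime", "2022-05-18T01:12:23"),
]

expanded = {prop: value_object(datatype, lexical) for prop, datatype, lexical in tagged}
print(json.dumps(expanded, indent=2))
```

The tag form and the value object carry the same two pieces of information; only the surface syntax differs.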

Also, +1 to @VladimirAlexiev use-case above.

@VladimirAlexiev VladimirAlexiev mentioned this issue Jun 1, 2022
14 tasks
@anatoly-scherbakov
Contributor

@VladimirAlexiev @ioggstream that is an interesting point. When using JSON-LD, I always tried to ensure that a particular property always maps to a specific type, but I agree that this application of tags is compelling. 👍

@gkellogg gkellogg added the UCR Issue on Use Case/Recommendation label Jun 4, 2022
@gkellogg gkellogg closed this as completed Jun 4, 2022
@gkellogg gkellogg reopened this Jun 4, 2022
@gkellogg
Member

This was discussed during today's call: https://json-ld.org/minutes/2022-06-22/.

gkellogg added a commit that referenced this issue Jul 2, 2022
This was referenced Jul 4, 2022
@ioggstream ioggstream added this to the -future milestone Jul 5, 2022
@gkellogg
Member

This issue was discussed in today's meeting.

@gkellogg
Member

gkellogg commented Aug 6, 2022

I think this is a great candidate for something an extended profile could do, and something like the %TAG ! http://www.w3.org/2001/XMLSchema# seems like a great way to go.

In my mind, this isn't a direct replacement for the @type of JSON-LD value objects, but an extension of the JSON-LD internal representation, much the same way that booleans and numbers are treated in JSON-LD (specifically in the to/from RDF algorithms). Implementations would need to maintain the internally typed values when expanding/compacting/framing, represent them using the appropriate tag when serializing to YAML in extended mode, or expand them to value objects when serializing in the basic mode.

The toRdf and fromRdf algorithms would need to honor them when generating RDF or turning RDF back into the internal representation, again running with the appropriate processing mode.

Otherwise, this change should be fairly transparent. IMO, this is the primary motivation for an extended profile.
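
A small Python sketch of the serialization choice described above, under several assumptions: the internal typed value is modelled as a `TypedLiteral` dataclass, the mode names are the strings "basic" and "extended", and the output document is assumed to declare `%TAG !xsd! http://www.w3.org/2001/XMLSchema#`. None of these names come from a specification.

```python
import json
from dataclasses import dataclass

XSD = "http://www.w3.org/2001/XMLSchema#"


@dataclass(frozen=True)
class TypedLiteral:
    """Hypothetical internal representation of a typed scalar."""
    lexical: str
    datatype: str


def serialize(literal: TypedLiteral, mode: str) -> str:
    """Return YAML text for one literal in the given (assumed) processing mode."""
    as_value_object = json.dumps({"@type": literal.datatype, "@value": literal.lexical})
    if mode == "basic":
        # JSON is valid YAML flow syntax, so the value object can be emitted as-is.
        return as_value_object
    if mode == "extended":
        if literal.datatype.startswith(XSD):
            local_name = literal.datatype[len(XSD):]
            # Always double-quote the scalar so no implicit typing can interfere.
            return f"!xsd!{local_name} {json.dumps(literal.lexical)}"
        # Datatypes outside XSD fall back to the value object in this sketch.
        return as_value_object
    raise ValueError(f"unknown mode: {mode!r}")


issued = TypedLiteral("2022-05-18", XSD + "date")
print(serialize(issued, "extended"))  # !xsd!date "2022-05-18"
print(serialize(issued, "basic"))
```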

@rob-metalinkage
Contributor

So what is actually in play here is a profile of YAML itself - the profile for which JSON-LD translations are lossless, so we don't need a profile of YAML-LD, but YAML-LD is an extension of a "YAML-JSON-compatible" profile. Such a profile could be implicit - or made explicit if multiple YAML/JSON conversions are defined. Another reason to make it explicit would be to validate if a given YAML document is compatible with YAML-LD before defining the YAML-LD extended syntax for that YAML schema.

@gkellogg
Member

gkellogg commented Aug 7, 2022

I guess in my mind, the "YAML-JSON-compatible" profile is analogous to YAML using the JSON schema. This does not depend on explicit tags, but implicitly associates the values with tag:yaml.org,2002:null, tag:yaml.org,2002:bool, tag:yaml.org,2002:int, and tag:yaml.org,2002:float.

I think something like a "YAML-XSD-compatible" profile might require the use of a tag namespace such as suggested by @VladimirAlexiev: %TAG !xsd! tag:http://www.w3.org/2001/XMLSchema:, so a tagged value such as !xsd!dateTime 2022-05-18T01:12:23 would parse to a native DateTime literal, and the JSON-LD internal representation would be extended to support the various literal types from XSD.

If running in "extended", or "YAML-XSD-compatible" mode, a %TAG definition such as above would be legitimate. If not running in that mode, a processor may reject the input or use Postel's law and parse it, but it should not be emitted unless the profile is set accordingly.

In my mind, this and alias nodes are the primary things that would be enabled by an extended mode.

If a processor sees some other %TAG definition (or definitions outside of some pre-defined set) it should probably fail to process the document, which then acts as an extension point for processors to eventually support more values for %TAG in the future, but for RDF purposes, anything beyond the XSD set

Given this, I think we may be about ready to define the processing modes more completely.

@rob-metalinkage
Contributor

rob-metalinkage commented Aug 8, 2022

I'm thinking here about statements about conformance - :myresource dct:conformsTo - how do I know if a yaml resource is "YAML using the JSON schema." (the same holds true for the identifiers for YAML-LD and JSON-LD.)

The general use case is to be able to determine what an API supports in terms of interoperability of data payloads. Can anyone orient me to where this is being defined or discussed? I can see inline directives such as https://yaml.org/spec/1.2.2/#681-yaml-directives, @context where a URI is referenced and $schema directives - but not where such things are registered - we have a related in IANA profiles on media types for encodings, but what about information content profiles?

Is identification of the profile out-of-band using resolvable identifiers (i.e. not in syntax-specific directives using syntax-specific keywords and versioning) a factor in defining processing modes?

@TallTed

This comment was marked as resolved.

@gkellogg
Member

I've looked into this some more as part of trying to implement extended support for XSD scalar values in YAML. IMO, the appropriate %TAG value would be something like the following:

%TAG ! http://www.w3.org/2001/XMLSchema#

This would allow values such as !date 2022-08-08, which would expand as !<http://www.w3.org/2001/XMLSchema#date> "2022-08-08" and be a natural way to capture "2022-08-08"^^<http://www.w3.org/2001/XMLSchema#date>. However, I'm stymied by a bug in LibYAML, which Ruby and many other languages rely on for parsing YAML (yaml/libyaml#253), where # is not accepted as a URI character (really ns-uri-char). So far, the LibYAML team has been unresponsive, and the library shows very little activity in the last couple of years. Of course, we could hack this with some other URI, but that doesn't seem appropriate for this group.
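
For readers less familiar with tag resolution, the following Python sketch shows how a tag shorthand combines with %TAG directives to yield a full tag IRI. It is a simplification written for this discussion (it ignores %-escapes in the suffix and does no character validation), and `expand_tag` is a made-up helper rather than a function of any YAML library.

```python
XSD = "http://www.w3.org/2001/XMLSchema#"

# Defaults from the YAML spec: "!" is a local tag prefix, "!!" maps to yaml.org.
DEFAULT_HANDLES = {"!": "!", "!!": "tag:yaml.org,2002:"}


def expand_tag(shorthand: str, directives: dict) -> str:
    """Resolve a tag shorthand such as !date or !xsd!date against %TAG directives."""
    if not shorthand.startswith("!"):
        raise ValueError(f"not a tag: {shorthand!r}")
    if shorthand.startswith("!<") and shorthand.endswith(">"):
        return shorthand[2:-1]  # verbatim tag, already a full IRI
    handles = {**DEFAULT_HANDLES, **directives}
    if shorthand.startswith("!!"):
        handle, suffix = "!!", shorthand[2:]  # secondary handle
    else:
        end = shorthand.find("!", 1)
        if end == -1:
            handle, suffix = "!", shorthand[1:]  # primary handle
        else:
            handle, suffix = shorthand[: end + 1], shorthand[end + 1:]  # named handle
    if handle not in handles:
        raise ValueError(f"undeclared tag handle: {handle}")
    return handles[handle] + suffix


print(expand_tag("!date", {"!": XSD}))          # http://www.w3.org/2001/XMLSchema#date
print(expand_tag("!xsd!date", {"!xsd!": XSD}))  # http://www.w3.org/2001/XMLSchema#date
```

Both spellings resolve to the same IRI, which is why the choice of handle does not matter to the processing model.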

Other YAML tools show similar issues, I think largely due to the fact that the YAML spec only uses the tag scheme in its examples. Until this issue is resolved, I think we need to defer an extended mode for YAML-LD that would involve interpreting XSD datatype scalar values. The spec recommends the use of tag: (oddly), and if we were to go there, we would probably want to introduce something like %TAG ! tag:www.w3.org,2022:xsd/ but that seems quite arbitrary.

An example file I've been working with to exercise this variation is the following:

%YAML 1.2
%TAG ! http://www.w3.org/2001/XMLSchema#
---
"@context":
  "@vocab": http://xmlns.com/foaf/0.1/
name: !string Gregg Kellogg
homepage: https://greggkellogg.net/
depiction: http://www.gravatar.com/avatar/42f948adff3afaa52249d963117af7c8
date: !date 2022-08-08

(Note: the use of a specific tag handle shouldn't be significant. In this case, it's using the primary tag handle, but it could just as well be the secondary tag handle (!!) or a named tag handle (!xsd!) for our processing model.)

If we are to support XSD types, we probably want to white-list allowed datatype URIs to include most XSD types, in addition to tag:yaml.org,2002:str, tag:yaml.org,2002:null, tag:yaml.org,2002:int, tag:yaml.org,2002:float, and tag:yaml.org,2002:bool which would map more directly to the JSON-LD Internal Representation.
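
A white-list along those lines could look like the Python sketch below. The set of XSD local names is a small illustrative subset chosen for this example, not a proposed normative list, and `check_tag` is a hypothetical helper.

```python
XSD = "http://www.w3.org/2001/XMLSchema#"
YAML_CORE = "tag:yaml.org,2002:"

# Illustrative subset only; a real profile would enumerate the XSD types it supports.
ALLOWED_TAGS = frozenset(
    [YAML_CORE + name for name in ("str", "null", "int", "float", "bool")]
    + [XSD + name for name in ("string", "boolean", "integer", "decimal",
                               "double", "date", "dateTime", "gYear")]
)


def check_tag(tag_iri: str) -> str:
    """Return the tag IRI unchanged if it is white-listed, otherwise raise."""
    if tag_iri not in ALLOWED_TAGS:
        raise ValueError(f"unsupported tag: {tag_iri}")
    return tag_iri


print(check_tag(XSD + "date"))       # http://www.w3.org/2001/XMLSchema#date
print(check_tag(YAML_CORE + "int"))  # tag:yaml.org,2002:int
```

Rejecting everything outside the set keeps room to admit more tags later without changing how existing documents are interpreted.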

See also yaml/yaml-spec#268 (comment).

@gkellogg
Member

  • is it at all feasible to write "foo"@en in YAML rather than a separate @language field?

No, I don't believe it is, however, we could consider using a datatype form such as defined for the i18n namespace:

@prefix i18n: <https://www.w3.org/ns/i18n#> .

[ ex:title "foo"^^i18n:en ] .

Although it's defined to allow a combination of language and base-direction, it can be used for just language or base direction. Of course, we would need to define that literal values using an i18n datatype consisting of only language would be translated to language-tagged literals, and vice versa.
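
A sketch of that translation in Python, assuming the `language_direction` layout of the i18n datatype IRIs (for example i18n:en, i18n:ar-EG_rtl, i18n:_rtl). The function names are invented for this example, and the rule "language only becomes a language-tagged literal" is the proposal above, not settled behaviour.

```python
I18N = "https://www.w3.org/ns/i18n#"


def split_i18n_datatype(datatype_iri: str):
    """Split an i18n datatype IRI into (language, direction); either may be None."""
    if not datatype_iri.startswith(I18N):
        raise ValueError(f"not an i18n datatype: {datatype_iri}")
    # Language tags use hyphens, so the first underscore separates the direction.
    language, _, direction = datatype_iri[len(I18N):].partition("_")
    return (language or None, direction or None)


def to_value_object(lexical: str, datatype_iri: str) -> dict:
    """Language-only i18n datatypes become language-tagged values; others stay typed."""
    language, direction = split_i18n_datatype(datatype_iri)
    if language is not None and direction is None:
        return {"@value": lexical, "@language": language}
    return {"@value": lexical, "@type": datatype_iri}


print(to_value_object("foo", I18N + "en"))  # {'@value': 'foo', '@language': 'en'}
```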

@gkellogg gkellogg added the spec Issue on specification label Aug 17, 2022
@gkellogg gkellogg mentioned this issue Aug 26, 2022
@VladimirAlexiev
Contributor Author

VladimirAlexiev commented Sep 1, 2022

@gkellogg

  • I agree that the "YAML-JSON-compatible" profile should use the YAML JSON schema
    • with a warning that it may mangle numbers (then people come complaining "why is my 12.3 converted to "1.230000005e2"^^xsd:float?")
  • I like !date 2022-08-08 better than !xsd!date 2022-08-08
  • I like your extended suggestion YAML-LD IRI tags #79 but how do we tag URLs? Do we just mandate !id in our "YAML XSD Schema"?

@VladimirAlexiev
Contributor Author

VladimirAlexiev commented Sep 1, 2022

onlineyamltools.com allows # but then complains with:
Error: YAMLException: unknown tag !<http://www.w3.org/2001/XMLSchema#string> at line 6, column 28

Trying with explicit xsd tag gives the same error:

%YAML 1.2
%TAG !xsd! http://www.w3.org/2001/XMLSchema#
---
name: !xsd!string Gregg Kellogg

This tool can only use the "YAML JSON schema" builtin tags (and supports timestamp, although that has been deprecated).
As expected, it can mangle numbers:

%YAML 1.2
%TAG ! tag:yaml.org,2002:
---
name:   !str Gregg Kellogg
int:    !int 123
bigint: !int 123456789012345678901231                             # -> 1.2345678901234569e+23  ouch!
bigint: 123456789012345678901231                                  # -> 1.2345678901234569e+23  ouch!
float:  !float 1.235609853907835079889067406870964870956870967908 # -> 1.235609853907835
date:   !timestamp 2022-08-08                                     # -> 2022-08-08T00:00:00.000Z

@gkellogg
Member

gkellogg commented Sep 1, 2022

My implementation needed to use a lower-level parser that just transforms YAML to the Representation Graph without further interpretation. In Ruby Psych, this is done via Psych.parse_stream. That level shouldn't place constraints on any specific schema.

@gkellogg
Member

Discussed at TPAC F2F

@VladimirAlexiev
Contributor Author

VladimirAlexiev commented Sep 28, 2022

Beyond XSD: let's not forget custom datatypes, eg:

  • geo:wktLiteral, geo:gmlLiteral, and 5-10 more new ones in GeoSPARQL 1.1 (eg geo:geoJson)
  • cdt:ucum, eg !cdt!ucum 1.20 m is equal to (though not identical to) !cdt!ucum 120 cm
    see LINDT units of measure w3c/sparql-dev#129
  • the tentative rdf:JSON and rdf:YAML
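
The "equal though not identical" distinction for cdt:ucum can be sketched in Python as a comparison in value space versus lexical space. The conversion table below is a tiny hypothetical stand-in for real UCUM handling, and `length_in_metres` is a made-up helper.

```python
from decimal import Decimal

# Hypothetical, minimal conversion table to metres; real UCUM is far larger.
TO_METRES = {"m": Decimal("1"), "cm": Decimal("0.01"), "mm": Decimal("0.001")}


def length_in_metres(lexical: str) -> Decimal:
    """Map a '<number> <unit>' lexical form to its value in metres."""
    number, unit = lexical.split()
    return Decimal(number) * TO_METRES[unit]


a, b = "1.20 m", "120 cm"
print(a == b)                                      # False: different lexical forms
print(length_in_metres(a) == length_in_metres(b))  # True: same value
```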

@gkellogg
Member

gkellogg commented Sep 30, 2022

This was discussed on [2022-09-28](https://json-ld.org/minutes/2022-09-28/#16).
Pierre-Antoine Champin: The devil is in the details, and in the bnodes :-D
Vladimir Alexiev: I think we should use YAML tags in the form that datatypes are used for RDF.
... JSON-LD is more verbose, and the YAML syntax is more concise.
... In many cases the context will relieve you of this need, but there are cases where the graph is heterogeneous
... May be a problem with parsers.
... This also relates to YAML schemas, and how to attach types.
... YAML had a schema including dates, but have backed up.
... My proposal would be that the WG will declare a %TAG !xsd! ...
... But, implementers will need to use a better parser that supports tags.
... This is also important for numbers.
... We had trouble in xxx group, where the number would be mis-interpreted.
... Then we need to look at a YAML parsers matrix to determine how widely available it is.
Gregg Kellogg: The current "spec" refers to a basic profile, which doesn't include tags but only basic YAML values
... and an Extended profile that includes XSD datatypes, and tags for URLs (is it absolute, or relative...)
... Gregg has an implementation that uses the YAML parse tree.
... Also in JSON-LD (discussion between Gregg and Antoine at TPAC), there is a movement towards handling more datatypes, and not mangling literals with default treatment of numbers
Vladimir Alexiev: What about URLs?
... In a heterogeneous dataset, the same field could contain either a string or a resource.
... can we have a single tag !id or !uri that would handle absolute, relative and CURIEs?
Gregg Kellogg: We want to explore some more use cases of URLs before deciding
Vladimir Alexiev: Can we decide this issue?
... let's not forget custom datatypes, eg geo:wktLiteral, geo:gmlLiteral, 5-10 more in GeoSPARQL 1.1, and the tentative rdf:JSON and rdf:YAML
Gregg Kellogg: Questions of quoting: is !xsd!integer '123' the same as !xsd!integer 123 and same as 123, or different?
Niklas Lindström: Author: someone!tag-key => as if author was defined in the context with "`@type`": <tag-key>; then if e.g. someone!uri was encountered, *and* uri is defined as an alias of "`@id`", this is short for {"`@id`": "someone"}
... the tag comes before the value, eg !tag-key someone
Gregg Kellogg: Tags should be declared in %TAG not in context, else we'll go against the grain of YAML

@TallTed
Contributor

TallTed commented Sep 30, 2022

@gkellogg -- Several unfenced @ entities are in the last several lines of the bot-posted conversation #17 (comment) causing more unintended alerts to be fired in their direction.... Maybe the bot can be tweaked to codefence such entities going forward?

@gkellogg
Member

Sorry, must have been unfenced on IRC. I’ll fix them later

@TallTed
Contributor

TallTed commented Oct 1, 2022

Yeah, I'm sure they were unfenced on IRC. There's no consistent value to fencing there.

Weirdly, now that they're single-backtick fenced here, those backticks are showing as part of the text instead of being interpreted as markdown -- so, for instance, we now see (bold added here to help with clarity) {"`@id`": "someone"}, where we'd expect to see {"@id": "someone"}.

I suspect this won't be a quick or easy fix, but it should be raised with the folks running the (now several!) IRC/log-to-GitHub bots.

@gkellogg
Member

gkellogg commented Oct 2, 2022

Well, I handle the IRC log to HTML conversion for these minutes, which were inserted here. Perhaps I could detect some bare keywords. You're right that the result in the comment is wrongly interpreted, but that seems like a GH issue.

@TallTed
Contributor

TallTed commented Oct 2, 2022

I'd suggest wrapping the larger element including the @, so {"@id": "someone"}, which makes overall sense anyway, the larger element being code.
