
Datatype coercion of native types #98

Closed
gkellogg opened this issue Apr 3, 2012 · 18 comments

Comments

@gkellogg
Member

gkellogg commented Apr 3, 2012

We previously discussed this as issues #87 and #81.

What is the range of the coercion operator in JSON-LD? As indicated by issue 87, it is any value (not an object or an array). This would include boolean and numeric, in addition to string. One possibility is limiting this to string, or doing it on a case-by-case basis. (boolean could coerce numeric types based on 0 or not 0, integer or double could coerce boolean or other numeric).

I believe that the best thing for expansion, compaction (and possibly framing) is to not modify values expressed using native datatypes. Coercion needs to happen when turning into RDF triples (which, in my implementation, is used to flatten data for framing), but otherwise should be left alone.

The history of native datatype conversion is somewhat contradictory, and issue #87 would have all values coerced to %1.16E, which leads to ambiguities.

PROPOSAL: native datatypes (boolean and numeric) are preserved unchanged through expansion and compaction, even if the associated property has coercion.

PROPOSAL: native datatypes are converted to typed literals when converting to RDF, independent of coercion of the associated property, with native numbers having a fractional or exponential lexical representation converted using %1.15E. String representations of typed literals are not converted when transforming to RDF.

PROPOSAL: typed literals are converted to string form (@value and @datatype) when transforming RDF to JSON-LD.

OPEN: conversion/representation of native datatypes when framing.
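The lexical mapping in the second proposal can be sketched as follows. This is a minimal Python illustration; the function name and the pass-through of strings are my assumptions, not spec text:

```python
XSD = "http://www.w3.org/2001/XMLSchema#"

def to_typed_literal(value):
    """Map a native JSON value to a (lexical form, datatype IRI) pair."""
    # bool must be tested before int: in Python, bool is a subclass of int
    if isinstance(value, bool):
        return ("true" if value else "false", XSD + "boolean")
    if isinstance(value, int):
        return (str(value), XSD + "integer")
    if isinstance(value, float):
        # fractional/exponential numbers use the %1.15E form from the proposal
        return ("%1.15E" % value, XSD + "double")
    # string representations are not converted, per the proposal
    return (value, XSD + "string")
```

For example, `to_typed_literal(5.0)` yields the lexical form `5.000000000000000E+00` with datatype `xsd:double`.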

See also:
http://lists.w3.org/Archives/Public/public-linked-json/2012Mar/0027.html
http://lists.w3.org/Archives/Public/public-linked-json/2012Mar/0028.html
http://lists.w3.org/Archives/Public/public-linked-json/2012Mar/0029.html
http://lists.w3.org/Archives/Public/public-linked-json/2012Mar/0031.html

@lanthaler
Member

Just to make sure I understand your first proposal: you propose that

{
  "@context": {
    "prop": { ..., "@type": "xsd:double" }
  },
  "prop": 5
}

Would expand to:

[
  {
    "http://prop": [ 5 ]
  }
]

Instead of

[
  {
    "http://prop": [ { "@value": 5, "@type": "xsd:double" } ]
  }
]

(Please note, it's still a number and not a string)

I think that's problematic as it will lead to RDF data that has a "5"^^xsd:integer which doesn't match the author's intent anymore. I think we have to make sure that expansion never loses any information. If a conflict or an ambiguity occurs, the algorithm should throw an exception. That's fundamental since basically all other algorithms depend on expansion.

@gkellogg
Member Author

gkellogg commented Apr 3, 2012

We could decide to apply datatype coercion on transformation to RDF, which would deal with the 5.0 === 5 issue. Within expansion and compaction, I think we should stick with a native representation, without adding @value. This best reflects the author's meaning.

The value of @value is always a string; if we break that, I don't think it really buys us anything, and it becomes less useful for the author.

@lanthaler
Member

I agree we should move that round-tripping issue to the fromRDF()/toRDF() API calls as it really just appears there. Nevertheless, I still think we would need to expand to the expanded object form because otherwise we would lose the type-coercion data during expansion.

So I would propose the following

PROPOSAL: native datatypes (boolean and numeric) are preserved through expansion, compaction, and framing, but MIGHT be converted to the expanded object form (e.g. "@value": 5) if the associated property is type-coerced.

PROPOSAL: native datatypes are automatically converted to typed literals when converting to RDF. Independent of the value of @type, booleans are converted to the strings "true" and "false" and numbers to a string representation thereof (numbers having a fractional or exponential lexical representation are converted using %1.15E).

I haven't come to a conclusion yet on what to do in fromRDF(). I think we should restore xsd:boolean, xsd:double, and xsd:integer to native types whenever possible.

@dlongley
Member

I'd say that there is no such thing as leaving native datatypes as "unchanged"; given the most liberal use of that word, anyway.

JSON parsers will have different internal representations for numeric types that will affect how the data is reserialized. Most of the time, integers will be unchanged (but this won't necessarily be the case for large integers, especially in languages like javascript or 32-bit PHP), but doubles will inevitably look different on the way out. I don't think we can reasonably mandate that they be "unchanged" during expansion or compaction. We can say that their original types (boolean or numeric) will be preserved, but that's all. However, as Markus said, if an author specifies a type-coercion rule, we need to include that information in expanded form.
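The reserialization point is easy to demonstrate with any JSON parser; here is a quick check using Python's json module (purely illustrative):

```python
import json

# The numeric value survives a parse/serialize round trip,
# but the original lexical form does not.
doc = '{"prop": 5.10}'
print(json.dumps(json.loads(doc)))  # prints {"prop": 5.1}
```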

I would propose this:

PROPOSAL: native datatypes (boolean and numeric) are preserved through expansion, compaction, and framing, but MUST be converted to expanded object form (e.g. {"@value": "5", "@type": "http://www.w3.org/2001/XMLSchema#integer"}) if the associated property is type-coerced. Numerical values may change due to internal parser representations.

It doesn't matter what the type-coercion rule in the @context is: the string value of the native datatype should appear as the value of @value and the @type from the @context as the type. I think the @type should be treated as opaque.

This ensures that the @type information that the author specified in the @context is preserved -- and we don't need to start supporting more than just a string type for @value. I don't think it matters how the processor goes about changing a number to a string because we simply can't depend on it (and it is just as non-dependable as leaving the values in their native types, but this adds complexity to @value's current "string only" rule).

If we want to convert xsd:integer, xsd:double, and xsd:boolean to native datatypes when compacting, I will support that. I think that we should make a best effort to convert them to native datatypes (e.g. "12abc" => 12 for xsd:integer) and possibly have a "strict" mode that will throw exceptions if a clean conversion isn't possible.
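A rough sketch of this best-effort/strict idea for xsd:integer; the function name and the leading-digits salvage are illustrative assumptions, not an agreed algorithm:

```python
import re

def to_native_integer(lexical, strict=False):
    """Best-effort conversion of an xsd:integer lexical form to a native int."""
    try:
        return int(lexical)
    except ValueError:
        if strict:
            # "strict" mode: reject input that does not convert cleanly
            raise ValueError("not a clean xsd:integer: %r" % lexical)
        # best effort: salvage leading digits, as in the "12abc" => 12 example
        m = re.match(r"[+-]?\d+", lexical)
        return int(m.group(0)) if m else lexical
```

So `to_native_integer("12abc")` returns 12, while the same call with `strict=True` raises.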

@lanthaler
Member

I'd say that there is no such thing as leaving native datatypes as "unchanged"; given the most liberal use of that word, anyway.

Why not? Datatypes will remain the same; what might change is the value.

PROPOSAL: native datatypes (boolean and numeric) are preserved through expansion, compaction, and framing, but MUST be converted to expanded object form (e.g. {"@value": "5", "@type": "http://www.w3.org/2001/XMLSchema#integer"}) if the associated property is type-coerced. Numerical values may change due to internal parser representations.

The example in your proposal contradicts the prose. You state that "native datatypes (boolean and numeric) are preserved through expansion, compaction, and framing" but you convert a number to a string in the example: {"@value": "5", "@type": "xsd:integer"}.

Are you arguing to convert the datatype to string for everything or just to expand to the @value form?

If we want to convert xsd:integer, xsd:double, and xsd:boolean to native datatypes when compacting, I will support that. I think that we should make a best effort to convert them to native datatypes (eg: "12abc" => 12 for xsd:integer) and possibly have a "strict" mode that will throw exceptions if a clean conversion isn't possible.

I don't agree. We should NEVER convert 12abc to 12. Either we leave it in there as string (IMO the most sensible way to handle it) or we throw an exception that the conversion is not possible - but I don't really see a reason for that. We just keep the expanded @value form and the type coercion from the context won't apply anyway.

Compaction will always be a best-effort approach as we don't know what data a context is applied to, so I think it's completely fine to gracefully degrade the result in case of such mismatches instead of magically trying to solve everything; that will just lead to almost-impossible-to-find bugs (in the user's code).

@dlongley
Member

On 04/11/2012 11:41 PM, Markus Lanthaler wrote:

I'd say that there is no such thing as leaving native datatypes as "unchanged"; given the most liberal use of that word, anyway.
Why not? Datatypes will remain the same; what might change is the value.

I think this is probably just some confusion caused by the English language. Here I was interpreting "native datatypes" as "[values that have] native datatypes" and it sounds like you were not. I tried to convey that meaning by saying I was being liberal with the language -- because I was trying to make a point about how it is futile, to a practical degree, to specify how to convert native values to strings in an effort to achieve consistent round-tripping. We may already be in agreement on this point.

PROPOSAL: native datatypes (boolean and numeric) are preserved through expansion, compaction, and framing, but MUST be converted to expanded object form (e.g. {"@value": "5", "@type": "http://www.w3.org/2001/XMLSchema#integer"}) if the associated property is type-coerced. Numerical values may change due to internal parser representations.

The example in your proposal contradicts the prose. You state that "native datatypes (boolean and numeric) are preserved through expansion, compaction, and framing" but you convert a number to a string in the example: {"@value": "5", "@type": "xsd:integer"}.

The example follows a "but" that says that the native datatype value must be converted if the associated property is type-coerced. So, if there is no type-coercion rule, then the native datatype is preserved, but if there is a type-coercion rule, then the native datatype is converted to expanded object form (as in the example).

To be clear, when expanding the value {"foo": 5}, if there is no type-coercion rule, it expands to {"foo": 5}; if there is one, then it expands to {"foo": {"@value": "5", "@type": "http://www.w3.org/2001/XMLSchema#integer"}}.

Are you arguing to convert the datatype to string for everything or just to expand to the @value form?

I am arguing to convert the datatype to string for everything with a type-coercion rule (where the conversion is sensible). I think that the type for the value of '@value' should always be a string.

If we want to convert xsd:integer, xsd:double, and xsd:boolean to native datatypes when compacting, I will support that. I think that we should make a best effort to convert them to native datatypes (eg: "12abc" => 12 for xsd:integer) and possibly have a "strict" mode that will throw exceptions if a clean conversion isn't possible.
I don't agree. We should NEVER convert 12abc to 12. Either we leave it in there as string (IMO the most sensible way to handle it) or we throw an exception that the conversion is not possible - but I don't really see a reason for that. We just keep the expanded @value form and the type coercion from the context won't apply anyway.

I'm asserting that most people who are going to write applications that frame their JSON-LD input either won't care about bogus data conversions or they will want to bail out in a single location and reject the input. This means that I think that we need to do one of two things:

  1. Throw an exception when a conversion isn't sensible.
  2. Make a best-effort to convert.

If people are expecting their @context to be applied and want to code to a particular structure (especially important for framing and zero-edit backwards compatibility) then we must do one of these two options. Otherwise, they will have to bloat their code with conditional statements every time they want to access the values of a property. This is precisely what we want to avoid in framing output.

Compaction will always be a best-effort approach as we don't know on what data a context is applied - so I think it's completely fine to gracefully degrade the result in case of such mismatches instead of magically trying to solve everything; that will just lead to, almost impossible to find, bugs (in the user's code).

I think having the option to reject input that doesn't convert the way you want it to is very desirable. The use case I am thinking of is the developer who wants to abstract away the details of JSON-LD as much as possible within their application. This developer will receive JSON-LD input and then frame it so that it is in a dependable structure that their application can use without having to resort to conditional statements that look for JSON-LD idioms when accessing every property. This is the main use case for framing; it provides a general method for structurally-normalizing JSON-LD inputs for JSON developers such that most of them won't need to write their own layer to do it.

@lanthaler
Member

I think we should start to distinguish two use cases here, as I expect users of each to have very different needs. Simply speaking, we could talk about RDF people and JSON people (which might be over-simplistic, but I think it illustrates it quite nicely).

I would argue that JSON people simply DO NOT CARE about type coercions; they just don't want to see them appear in their data. This at least applies to datatypes that have a mapping to a native type other than string. It might still be useful to further type strings (think datetimes, coordinates, etc.).

On the other hand, RDF people DO CARE about type coercions and would like to fine-tune the RDF transformations. I would even go so far as to say that it is completely fine for them to work (most of the time) with strings instead of native datatypes.

So, what am I trying to say here? I think we should define a round-tripping to RDF which ensures that data will be converted back to the native datatypes without having to put type coercions into a context. That would satisfy JSON people and would require us to mostly define rules from native types to the xsd type system. No need to deal with invalid integers and the like. For RDF people, it would be fine to keep most of the data in the form of strings, so no need to automatically convert "ab12" to "12".

Summarizing, I would propose the following:

  • we automatically convert native numbers to xsd:integer/xsd:double and back in RDF round-tripping
  • we automatically convert booleans to xsd:boolean and back in RDF round-tripping
  • if there are type coercions, we expand to @value: string
  • if in compaction/framing there are type coercion rules, we drop the @type from @value but keep it as a string
  • if we encounter something we can't safely convert back to a native data type ("ab12") we leave it in there as a string

I think this would solve most (all?) corner cases, be very predictable, and satisfy both user groups.
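A sketch of what the fromRDF direction of these bullets might look like, under the assumption that unconvertible lexical forms simply remain strings (names are illustrative):

```python
XSD = "http://www.w3.org/2001/XMLSchema#"

def from_rdf_value(lexical, datatype):
    """Restore a native JSON value from an RDF literal when safe;
    otherwise leave the lexical form in place as a string."""
    if datatype == XSD + "boolean" and lexical in ("true", "false"):
        return lexical == "true"
    if datatype == XSD + "integer":
        try:
            return int(lexical)
        except ValueError:
            return lexical  # e.g. "ab12" stays a string
    if datatype == XSD + "double":
        try:
            return float(lexical)
        except ValueError:
            return lexical
    return lexical
```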

Thoughts?

@dlongley
Member

There's a danger that precision will be lost if "xsd:double" is in string form and is automatically converted to a native type. I don't think we should force that as automatic behavior. I think that's my only issue with your proposal; your proposal is essentially how this all worked when JSON-LD was first implemented.

Hmm, I will say that we should leave unsafe conversions alone only if we are using automatic type-coercion rules... but I'm not sure even that is satisfactory. If I'm used to letting everything happen automatically, then I wouldn't want to be writing code expecting integers (and thus not checking for strings or objects) and have something else pop up all of a sudden. I don't want to have to check for that case everywhere, and I presume the same to be true for most JSON people. Again, this just means that I support some kind of "strict" flag that would cause an exception to be thrown if a conversion can't be safely performed. I'd prefer to be alerted that the data has a problem with it in a single place.

@lanthaler
Member

There's a danger that precision will be lost if "xsd:double" is in string form and is automatically converted to a native type. I don't think we should force that as automatic behavior. I think that's my only issue with your proposal; your proposal is essentially how this all worked when JSON-LD was first implemented.

I'm not entirely sure I understand what you are saying here. I was talking about RDF round-tripping and in that case every xsd:double will always be in string form.

Hmm, I will say that we should leave unsafe conversions alone only if we are using automatic type-coercion rules... but I'm not sure even that is satisfactory. If I'm used to letting everything happen automatically, then I wouldn't want to be writing code expecting integers (and thus not checking for strings or objects) and have something else pop up all of a sudden. I don't want to have to check for that case everywhere and I presume the same to be true for most JSON people. Again, this just means that I support some kind of "strict" flag that would cause an exception to be thrown if a conversion can't be safely performed. I'd prefer to be alerted that the data has a problem with it in a single place.

That depends on whose data you are working with. If you work with somebody else's data, you will never know what you get back; if you work with your own data, you can rely on those automatic conversions, as nothing other than a native integer/double/boolean will be converted to the corresponding xsd typed literal.

@dlongley
Member

I interpreted your statement:

"we automatically convert native numbers to xsd:integer/xsd:double and back in RDF round-tripping" to mean the following:

If I get in a value of "12.512134123231" with a datatype of "xsd:double", it will be automatically converted to: a native double of 12.5121341232 (or some other close approximation) in JSON-LD.

Your statement says "convert native numbers to ... and back" (emphasis on "and back"). I took this to mean that any incoming xsd:doubles would be converted back to native numbers. If this wasn't what you intended by your statement, could you clarify?

That depends on whose data you are working with. If you work with somebody else's data you will never know what you get back; if you work with your own data, you can rely on those automatic conversions as nothing other than a native integer/double/boolean will be converted to the corresponding xsd typed literal.

It isn't a question as to whether or not you know the data values (if that's what you're saying here), it is a question as to how data values that don't match their types are handled. There are three ways we've discussed handling mismatches: doing a "best fit", leaving the value in an expanded form, or throwing an exception. If you leave the value in expanded form, then every place where a value is accessed must be checked for that extra form -- or you must run yet another process to convert the data to your liking. If your application is going to be checking for that extra form anyway that isn't an issue. But a lot of JSON people, I believe, will not want to write their applications that way; they will want to handle a single exception and just reject the data that doesn't match in a single spot. I feel like we may be miscommunicating here.

@lanthaler
Member

My previous proposal was:

  • we automatically convert native numbers to xsd:integer/xsd:double and back in RDF round-tripping
  • we automatically convert booleans to xsd:boolean and back in RDF round-tripping
  • if there are type coercions, we expand to @value: string
  • if in compaction/framing there are type coercion rules, we drop the @type from @value but keep it as a string
  • if we encounter something we can't safely convert back to a native data type ("ab12") we leave it in there as a string

@dlongley you said that you only have a problem with the automatic type conversion in RDF round-tripping. How would you prevent this? This will always happen (at least in toRDF), independent of whether the value is type-coerced or not in the JSON-LD document.

If I understand you correctly, you are concerned with the conversion in fromRDF to a native type as there might be a loss in precision, right? Does the "strict" flag I proposed (just) for fromRDF in issue #100 address this? Do you think we need another flag which disables these automatic type conversions? Something like no-type-conversion, which would basically mean that every value is a string (either directly or indirectly in expanded object form with typing attached).

(sorry posted that to the wrong issue before)

@lanthaler
Member

RESOLVED: When round-tripping xsd:boolean values from JSON-LD through expansion and back through compaction, a JSON-native boolean value with xsd:boolean datatype coercion will remain a JSON-native boolean value.

@lanthaler
Member

RESOLVED: @value supports native JSON datatypes such as number, boolean, string.

RESOLVED: During expansion, a native JSON value with type coercion applied gets expanded to the expanded object form where the value of @value is still in the native JSON form and @type is the type in the type coercion rule.
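Under these resolutions, type-coerced expansion keeps the value in native JSON form; a minimal illustration (function name assumed):

```python
def expand_coerced(value, datatype):
    # The native JSON value stays as-is inside @value;
    # @type comes from the coercion rule in the context.
    return {"@value": value, "@type": datatype}

expand_coerced(5, "http://www.w3.org/2001/XMLSchema#double")
# -> {"@value": 5, "@type": "http://www.w3.org/2001/XMLSchema#double"}
```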

@lanthaler
Member

RESOLVED: When compacting, if there is a direct match for @type for the property and @type for the property in the context, then the value of the property is replaced with the value of @value.

@lanthaler
Member

Example: this is your data: "foo": {"@value": true, "@type": "xsd:boolean"}; this is in the context: "foo": {"@id": "...", "@type": "xsd:boolean"}; then this is the result: "foo": true

@dlongley
Member

dlongley commented May 1, 2012

We could use the flags I proposed for issue #100 for fromRDF (we could apply the same to toRDF) and then simply treat every datatype as opaque during expansion and compaction. This means that:

When compacting, if an expanded form value (e.g. {"@value": 5, "@type": "my:something"}) has a @type that matches a coercion rule in the context, we use the value of @value. If not, we don't pick the term/prefix with that rule. We don't try to do any special conversions of native datatypes; they are always left alone. All we do is a string comparison on @type with coercion rules to decide if a compaction term is appropriate or not, and if so, we change from expanded form {"@value": 5, "@type": "my:something"} to compacted form 5. The native type of @value is irrelevant and unchanged.
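That behavior, an opaque string comparison on @type with the native type of @value left untouched, might be sketched as (illustrative names):

```python
def compact_value(term, expanded, coercions):
    """Compact expanded object form via a plain string comparison of
    @type against the context's coercion rule for the term."""
    if (isinstance(expanded, dict)
            and "@type" in expanded
            and expanded["@type"] == coercions.get(term)):
        return expanded.get("@value")  # native type of @value left untouched
    return expanded

compact_value("foo", {"@value": 5, "@type": "my:something"},
              {"foo": "my:something"})  # -> 5
```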

@lanthaler
Member

That's exactly how I would like it to behave.

@lanthaler
Member

I'm closing this issue as the specs have already been updated.
