Consider support for underscores in numeric literals #173

Open
niklasl opened this issue Dec 5, 2024 · 12 comments
Labels
spec:enhancement: Issue or proposed change to enhance the spec without changing the normative content substantively
spec:substantive: Issue or proposed change in the spec that changes its normative content

Comments

@niklasl

niklasl commented Dec 5, 2024

To aid readability, many programming languages allow underscores in numeric literals for separating groups of digits. A common use is to separate in groups of thousands, such as 42_000_000.

For example, this is supported in Java, Python, JavaScript, Rust, Erlang and Racket (see links for details and rationale).

Proposed new productions:

DIGITS    ::=   DIGIT ("_" | DIGIT)*
DIGIT     ::=   [0-9]

Updated productions:

INTEGER   ::=   DIGITS
DECIMAL   ::=   DIGITS? '.' DIGITS
DOUBLE    ::=   DIGITS '.' DIGITS? EXPONENT | '.' DIGITS EXPONENT | DIGITS EXPONENT

Example use:

SELECT * { ?s ?p ?o } LIMIT 1_000 OFFSET 20_000

Note that this would purely be syntactic sugar.

(Personally, I've only missed this feature in SPARQL when using LIMIT and/or OFFSET. But it may be nice to define this similarly for Turtle/TriG.)
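
As a rough illustration of the proposed productions, they can be transcribed into Python regular expressions. This is only a sketch: the `classify` helper is hypothetical, and it assumes EXPONENT stays as in SPARQL 1.1 (`[eE] [+-]? [0-9]+`), which the proposal does not state either way.

```python
import re
from typing import Optional

# Proposed production: DIGITS ::= DIGIT ("_" | DIGIT)*
DIGITS = r"[0-9][0-9_]*"
# Assumption: EXPONENT is unchanged from SPARQL 1.1.
EXPONENT = r"[eE][+-]?[0-9]+"

INTEGER = re.compile(DIGITS)
DECIMAL = re.compile(rf"(?:{DIGITS})?\.{DIGITS}")
DOUBLE = re.compile(
    rf"{DIGITS}\.(?:{DIGITS})?{EXPONENT}"
    rf"|\.{DIGITS}{EXPONENT}"
    rf"|{DIGITS}{EXPONENT}"
)


def classify(token: str) -> Optional[str]:
    """Return the name of the production matching the whole token, if any."""
    for name, pattern in (("INTEGER", INTEGER), ("DECIMAL", DECIMAL), ("DOUBLE", DOUBLE)):
        if pattern.fullmatch(token):
            return name
    return None
```

With this transcription, `classify("1_000")` yields `"INTEGER"` and `classify("_1")` yields `None`, since DIGITS must start with a digit.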

@afs
Contributor

afs commented Dec 5, 2024

I'd like to see it if we do it uniformly.

@afs
Contributor

afs commented Dec 5, 2024

> DIGITS    ::=   DIGIT ("_" | DIGIT)*
> DIGIT     ::=   [0-9]

Hmm, that allows a trailing _.

For internal _ only:

DIGITS    ::= DIGIT (("_" | DIGIT)* DIGIT)?

cf. BLANK_NODE_LABEL, which avoids a trailing "." in the same way.
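
The difference between the two DIGITS variants can be checked with a small Python sketch. The regular expressions are direct transcriptions of the two productions; the `accepts` helper is a hypothetical name used only for illustration.

```python
import re
from typing import Tuple

# Original proposal: DIGITS ::= DIGIT ("_" | DIGIT)*
DIGITS_ORIGINAL = re.compile(r"[0-9][0-9_]*")
# Internal "_" only: DIGITS ::= DIGIT (("_" | DIGIT)* DIGIT)?
DIGITS_INTERNAL = re.compile(r"[0-9](?:[0-9_]*[0-9])?")


def accepts(sample: str) -> Tuple[bool, bool]:
    """Return (accepted by original production, accepted by internal-only production)."""
    return (
        DIGITS_ORIGINAL.fullmatch(sample) is not None,
        DIGITS_INTERNAL.fullmatch(sample) is not None,
    )


for sample in ("1000", "1_000", "1__000", "1_000_", "_1000"):
    print(sample, accepts(sample))
```

Only `1_000_` separates the two variants: the original production accepts it and the internal-only one rejects it. Both still accept consecutive underscores such as `1__000`, and both reject a leading underscore.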

@afs
Contributor

afs commented Dec 6, 2024

Presumably we need to say the "_" is not part of the lexical form.

    FILTER ( sameTerm(?x, 1_000) )

@TallTed
Member

TallTed commented Dec 6, 2024

It seems very odd to support the underscore _ as thousands separator, when humans (who will continue to input much data) typically use dots . or commas , which are not supported. I would suggest supporting all three, or re-un-supporting _

@gkellogg
Member

gkellogg commented Dec 6, 2024

Dots and commas are already syntactic elements. :a :p 123,545 . is two statements.

@niklasl
Author

niklasl commented Dec 6, 2024

> It seems very odd to support the underscore _ as thousands separator, when humans (who will continue to input much data) typically use dots . or commas , which are not supported. I would suggest supporting all three, or re-un-supporting _

By "odd", are you arguing that SPARQL caters to a different audience than the languages referenced above? The use of _ to make numeric literals more readable is fairly ubiquitous in programming languages today. Even SQL nowadays supports this form. It is intended to help humans read and write queries, within the syntactic constraints of these specific languages. SPARQL too can adopt this recognized style.

(Some languages, e.g. Swedish, use comma , as decimal separator. But SPARQL is not a natural language, and it does not and reasonably cannot have localized notation for numbers: first because . and , are already used, and second because this variability would be unpredictable. The only place I've seen localized support was many years ago in Vimscript, where it broke a lot of things.)

@domel
Contributor

domel commented Dec 6, 2024

Agree with @gkellogg and @niklasl. In addition to the fact that dots and commas are already used, there are other inconsistencies with natural languages, e.g. in Polish, a comma is used to separate decimal parts, just like a dot in English. On the other hand, underscore is quite often used in a unified way in different languages.

@afs
Contributor

afs commented Dec 7, 2024

In @niklasl's language survey, the underscore is allowed in any position, including in fractional parts.
_ isn't a value-based separator, and that's a positive aspect because, e.g., the Indian numbering system does not follow the same grouping conventions.

@TallTed
Member

TallTed commented Dec 9, 2024

I think I'm understanding now, that this underscore support is only relevant with un-quoted numeric literals, so :a :p "123,545.0" . is one statement, and "123,545.0" gets interpreted as a decimal, while "123,545" would be interpreted as an integer, and both are equivalent (though not RDF equal) to 123545?

(How would "123,545." be interpreted?)

@gkellogg
Member

The triple :a :p "123,545.0" . is, of course, perfectly valid, as the object is a plain literal with no expectation of any format. However, if you expected it to be an xsd:double, the Lexical Mapping sets out the form of valid decimal numbers using the expression (\+|-)?([0-9]+(\.[0-9]*)?|\.[0-9]+)([Ee](\+|-)?[0-9]+)?|(\+|-)?INF|NaN, which does not allow a comma. So it does not get interpreted as a decimal. Neither would "123,545" be interpreted as anything; in any case, it does not match the pattern expected for an integer, whose lexical representation is "... a finite-length sequence of one or more decimal digits (#x30-#x39) with an optional leading sign. If the sign is omitted, "+" is assumed. For example: -1, 0, 12678967543233, +100000."

So, :a :p 123,545.0 . would be interpreted as two statements, the first with an integer object of 123 and the second with a decimal object of 545.0 (as the comma separates objects in an objectList).

The only interpretation of numeric literals that are not strings with a specific datatype is NumericLiteral. Literals of the form RDFLiteral can have a datatype, and this is where the conversation arises about optional support for literals with a recognized datatype which are invalid according to the value space description of the associated datatype.
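
To make the lexical check concrete, here is a minimal Python sketch that tests strings against the xsd:double expression quoted above and against the xsd:integer description (digits with an optional leading sign). The function name `is_valid_lexical` and the integer regex are illustrative assumptions, not taken from any specification text or library.

```python
import re

# xsd:double lexical pattern, as quoted in the comment above.
XSD_DOUBLE = re.compile(
    r"(\+|-)?([0-9]+(\.[0-9]*)?|\.[0-9]+)([Ee](\+|-)?[0-9]+)?|(\+|-)?INF|NaN"
)
# xsd:integer: one or more decimal digits with an optional leading sign
# (assumed transcription of the prose description).
XSD_INTEGER = re.compile(r"[+-]?[0-9]+")


def is_valid_lexical(lexical: str, pattern: "re.Pattern[str]") -> bool:
    """True if the whole string is in the lexical space described by pattern."""
    return pattern.fullmatch(lexical) is not None


for lexical in ("123545.0", "123,545.0", "123,545."):
    print(lexical, is_valid_lexical(lexical, XSD_DOUBLE))
for lexical in ("123545", "123,545", "123_456"):
    print(lexical, is_valid_lexical(lexical, XSD_INTEGER))
```

Under these patterns, none of the forms containing a comma or an underscore is a valid lexical form, which matches the point that such separators only make sense in the unquoted abbreviated syntax.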

@afs
Contributor

afs commented Dec 10, 2024

_ is nothing more than superficial padding for human readability.

In processing, the parser removes all _ in the abbreviated syntax form for numbers (i.e. written without datatype) when determining the lexical form.

Examples --

Valid ways to write the xsd:integer literal "123456"^^xsd:integer:

RDF Turtle 1.1

:s :p 123456 .

Proposed: each of the following denotes the same RDF term as 123456; the generated RDF term is "123456"^^xsd:integer

123_456
1_23_456

There is no proposal to change the meaning of datatypes, or the syntax that explicitly has the datatype, or the datatype lexical to value mapping:

Illegal:

"123_456"^^xsd:integer
"123,456"^^xsd:integer

In SPARQL 1.1, the abbreviated syntax form must be used in LIMIT and OFFSET; "123"^^xsd:integer is invalid syntax.

   LIMIT 1000
   LIMIT 1_000
   LIMIT 10_00
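
The parser behaviour described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not an implementation from any parser: `lexical_form` and `integer_term` are hypothetical names, an RDF literal is modelled simply as a (lexical form, datatype IRI) pair, and the token is assumed to have already been accepted by the grammar.

```python
from typing import Tuple

XSD_INTEGER_IRI = "http://www.w3.org/2001/XMLSchema#integer"


def lexical_form(token: str) -> str:
    """Drop every "_" from an abbreviated numeric token."""
    return token.replace("_", "")


def integer_term(token: str) -> Tuple[str, str]:
    """Model the RDF term for an abbreviated integer as (lexical form, datatype IRI)."""
    return (lexical_form(token), XSD_INTEGER_IRI)


# All three spellings produce the same term, whatever the grouping.
print(integer_term("123456") == integer_term("123_456") == integer_term("1_23_456"))
```

Because the underscore is removed before the term is built, a comparison such as sameTerm(?x, 1_000) would see exactly the same term as sameTerm(?x, 1000).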

afs added the spec:substantive and spec:enhancement labels Dec 10, 2024
@afs
Contributor

afs commented Dec 10, 2024

By the label descriptions, this seems to fall between
spec:enhancement "Issue or proposed change to enhance the spec without changing the normative content substantively" and
spec:substantive "Issue or proposed change in the spec that changes its normative content".

Should this be marked ms:future-work or is there advocacy for this in the RDF 1.2 publications?

IMO this is a small usability improvement. It is normative in the sense that there are grammar changes; it is localised and so arguably below the "substantively" level.
