[superseded] Adding support for user-defined number suffixes #17020
Conversation
While I didn't add this check to the lexer, I could see a benefit to it: I could easily forbid suffixes that start with the letters. Let me know if adding this restriction to the lexer (and adding it to the documentation) is desirable.
Two thoughts:

I'm not sure how it currently works, but floating-point values containing 'e' and hexadecimal numbers need to use the
How about requiring a pragma, which mostly addresses this concern and makes it more searchable, e.g.:

```nim
proc big(a: string) {.literal.} = # create a JsBigInt from a
  echo -123'big # ok

proc bad(a: string) = # create a JsBigInt from a
  echo -123'bad # error, bad must have {.literal.}
```

(`big` can still be used in other contexts; but the pragma also allows it to be used in a literal context)
A pragma doesn't address the scoping problem either. It's not that I want to annotate the suffix declaration; I am thinking about a scope extension. For example:

```nim
from bignums import nil
echo 355'bignum # works because it's turned into
echo bignums.bignum("355")
```
If it weren't so silly I would propose that
That's the exact same problem with MCS though. But here, you can use:

Option 1: no scope extension (but IIUC your problem is "small identifiers" polluting scope?)

```nim
from bignums import bignum
echo 355'bignum
```

Option 2: special literal rule

```nim
from bignums import 'bignum # makes it only usable as a literal
echo 355'bignum    # ok
echo bignum("355") # `bignum` not declared, was imported via 'bignum
```

But isn't option 1 good enough?
Oh, there is an option: suffix procs need to be declared as

```nim
proc `'bignum` (...)
```

And then the rewrite rule is:

And then we can import suffixes just like:
Seems good; in particular it helps with turning f32 (etc.) into a user-defined literal (defined in system.nim) via:

```nim
doAssert not declared(f32)
doAssert declared(`'f32`)
```

This also encourages one way to write things: a user could of course define both, but that'd be opt-in:

```nim
proc `'bignum`(...)
proc bignum(...)
```

One way to do things seems better (more searchable, etc.); see also my argument about the ambiguities this would otherwise cause (12e32 vs 12'e32, etc.). This would help turn the builtin f32 (etc.) into user-defined literals, with obvious advantages.
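A minimal sketch of what this backtick-quoted declaration style could look like (the names are illustrative, and the exact rewrite is up to the implementation; this PR elsewhere describes -128'i8 being transformed into i8("-128")):

```nim
# Declare the suffix proc with the quote as part of its name:
proc `'bignum`(s: string): string =
  "bignum(" & s & ")"

# A literal such as 355'bignum would then rewrite to the call `'bignum`("355"):
echo `'bignum`("355")

# And a suffix could be imported by its quoted name:
# from bignums import `'bignum`
```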
I like the two big ideas that have emerged from this discussion:
I'll start work on both, starting with the lexer itself. Slightly off-topic: after this change is released, I/we will need to encourage linters and code markup tools (such as prettifiers) to properly handle the unpaired "single quotes". (I use the Sublime Text editor, and I'll need to fix/hack that sooner rather than later.)
I doubt we can deprecate the
Started work on these changes. Just wrote the new test cases. But since we are talking "big picture", would there be value in also allowing more than [0-9a-fA-F] characters in a number? Perhaps with restrictions such as:

This would allow for useful things like this:

```nim
import korean
var a = 0k열여섯'hangul
```

or

```nim
import numericsystems
var a = 0s03jKi3h'base60
```

or

```nim
import roman
assert 256 == 0CCLVI'roman
```

or crazy stuff like:

```nim
import moneystuff
var a = 1,234,332.20'USD
```

I, myself, don't need this functionality personally, so I'm not pushing hard for it. But I can see it being useful for other libraries. I can also see danger in allowing any punctuation other than dots and underscores. So perhaps even if we allow more than just the digit/hex characters, punctuation should not be allowed.
The user can always write:

Commas would be more common and useful to support, but I don't see how to avoid the ambiguity this would lead to. In any case, this could be handled in future work.
We can always add it later.
@JohnAD what's the current state of this PR? I really want it... :-)
Started work again on this today. I have already added the expanded/corrected unit testing. I should have a new commit for the lexer/parser by the end of tomorrow.
Open question: the minus symbol (-).

If we were doing things in the higher levels of the language, I suspect I could easily distinguish between the binary and unary uses. But I'm trying to get the "negative" usage into the numeric literal itself, where it really makes sense. To do that, the logic needs to be in the lexer, which is literally a character-by-character scan. In the end, being conservative, I suspect any numeric library will still need to handle the unary negative anyway.

I just added the ability for the lexer to "remember" the previous tokType in the main loop. This solved multiple problems for me. So, if the previous tokType is in the negationPrefixes set, or the minus is preceded by whitespace, then a negative numeric literal is read.

Sound reasonable? Suggestions?

(Off-topic, but IMO the conversion to float/int (and the handling of different bases) for numeric literals should not be in the lexer; that is beyond the scope of a lexer. But that is me being nitpicky, and moving it would be a thought for another day.)
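As an illustration of the intended binding (a sketch only; the exact set of preceding token types is an implementation detail of the lexer, and the `'big` suffix proc is invented for this example):

```nim
func `'big`(s: string): string = s # hypothetical suffix proc

echo 5-12        # `-` directly follows a number token: binary minus
echo -12'big     # `-` in prefix position: part of the literal
echo(5, -12'big) # `-` after `,` and whitespace, digit follows: part of the literal
```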
Still to do:
Yes, this is the best approach IMO.

What about

Well, it's not really off-topic and relates to nim-lang/RFCs#228, but it can indeed be fixed in followup work.
I've got a new version ready for commit, but I'll pause for a bit just in case you are in the middle of adding a conversation item.
For bug 6, the relevant code to fix is in lib/packages/docutils/highlite.nim, e.g.:

```nim
of '\'':
  inc(pos)
  g.kind = gtCharLit
```
Fix for bug 10:

```diff
diff --git a/compiler/lexer.nim b/compiler/lexer.nim
index c0ba217b5..efd78f983 100644
--- a/compiler/lexer.nim
+++ b/compiler/lexer.nim
@@ -1339,7 +1339,7 @@ proc rawGetTok*(L: var Lexer, tok: var Token) =
     if L.buf[L.bufpos] in SymChars+{'_'}:
       lexMessage(L, errGenerated, "invalid token: no whitespace between number and identifier")
   of '-':
-    if ((tok.strongSpaceA > 0) or (L.previousTokType in negationPrefixes)) and (L.buf[L.bufpos + 1] in '0'..'9'):
+    if ((tok.strongSpaceA > 0) or (L.previousTokType in negationPrefixes) or L.buf[L.bufpos - 1] == ' ') and (L.buf[L.bufpos + 1] in '0'..'9'):
       getNumber(L, tok)
       if L.buf[L.bufpos] in SymChars+{'_'}:
         lexMessage(L, errGenerated, "invalid token: no whitespace between number and identifier")
```

Test for bug 10 (adapt as needed in your test file):

```nim
when true:
  func `'wrap`(a: string): string = "{" & a & "}"
  template toSuf1(): untyped =
    -12'wrap
  assert toSuf1() == "{-12}"
  template toSuf2(): untyped =
    var a = "adsf"
    -12'wrap
  assert toSuf2() == "{-12}"
```

=> fixed in devel, test in #17500
```nim
when true:
  func `'wrap`(a: string): string = "{" & a & "}"
  template toSuf(`'suf`): untyped =
    let x = -12'suf
    x
  assert toSuf(`'wrap`) == "{-12}"
```

(needs a test case; again, integrate as needed to reuse your

=> #17500
```nim
when defined(case40):
  import std/genasts # pending #17426
  import std/macros
  func `'wrap2`(a: string): string = "{" & a & "}"
  macro bar(): untyped =
    func wrap1(a: string): string = "{" & a & "}"
    func `'wrap3`(a: string): string = "{" & a & "}"
    result = genAst():
      let a1 = wrap1"-128" # ok
      let a2 = -128'wrap2
      # let a3 = -128'wrap3 # xxx still doesn't work
      let a3 = `'wrap3`("-128") # ok
  bar()
```
LGTM, remaining comments and bug fixes can be done in a followup PR.

/cc @Araq

(EDIT: see bug 11 I added, but it's more of a TODO for a followup PR)
```nim
if baseTokType == tkStrNumLit:
  result = dotExpr(p, result)
  result = parseGStrLit(p, result)
```
Seems hacky and has no corresponding grammar update. I'll see if it can be done differently.
I took over, see #17489
…ng#17488)

* make unary minus part of number literals, refs nim-lang#17020
* fixes regression
* user defined integer literals; refs nim-lang#17020
* updated renderer.nim
* use mlexerutils helper
* imported all test cases from nim-lang#17020
* final grammar updated
As a precursor to adding a decimal (base-10) library to Nim's stdlib, this PR adds support for user-defined suffixes on numeric literals.

nim-lang/RFCs#308 - Add decimal to standard library

This PR also solves:

nim-lang/RFCs#216 - User defined integer literals
nim-lang/RFCs#228 - user defined quote literals, e.g. -12.3E+4022'dec128; -128'i8 is transformed as i8("-128")

To distinguish between a typical string literal and a string literal that is the result of a numeric suffix, a new token was added to the lexer: tkStrNumLit. This token is only generated if, when processing a numeric literal, the suffix does not match one of the pre-defined suffixes used internally.

The rest of the work is done in the parser. When the tkStrNumLit token is detected, a "dot" is presumed to follow, and the following symbol is considered the identifier called in a "dotExpr" syntax. So, effectively, a literal such as -128'i8 becomes the equivalent of i8("-128").

With this, a library, including the up-and-coming decimal library, can parse numbers in source code that cannot be handled by even int64 or float64. (The decimal library plans on using m as the suffix, to match the convention seen in other languages such as C#.)

While I've written tests and don't know of any bugs, this is my first attempt at modifying the core lexer/parser of a compiler, so I've marked the PR as Work-In-Progress [WIP]. Any and all code review is very much appreciated.
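To make the mechanism concrete, here is a hypothetical sketch (the `'mm` suffix and Millimeter type are invented for illustration; per the transformation described in this PR, the suffix proc receives the literal's text as a string):

```nim
import std/strutils # for parseFloat

type Millimeter = distinct float

# A user-defined suffix proc: it receives the numeric literal
# (without the suffix) as a string and builds the target type.
proc `'mm`(s: string): Millimeter =
  Millimeter(parseFloat(s))

let width = 12.5'mm # equivalent to `'mm`("12.5")
```

Because the proc receives the raw text, it can accept values that would overflow int64 or lose precision in float64.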