-
-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add parseFloatThousandSep #15421
Add parseFloatThousandSep #15421
Conversation
I think this might be a good addition. However, if we are to have something like this I would trade correctness for speed. Perform a first pass over the string and make explicit checks, namely:
Have that pass over the string remove the separators and hand the result to normal |
@Vindaar |
This is for strings as formatted for humans with quirky punctuation, Thats why I named differently, because does different things than |
@juancarlospaco I'll write a couple of lines later if I don't forget (ping me otherwise). Not a parsing expert though. :P |
Probably neither the smartest, nor most efficient way, but this is what I came up with: edit: haha, I just noticed I forgot about both minus and exp notation. The latter isn't important (who in their right mind would combine the two, but the former kinda is). If you think this is useful, I can add it. But won't do it now. Doesn't change the idea / approach. import parseutils, strutils
proc parseFloatThousandSep(s: string,
sep: static char = ',',
decimalDot: static char = '.'): float = # maybe should not have a default sep?
## version of `parseFloat` which allows for thousand separators.
## The following assumptions / requirements must be met by the string
## - no separator before a digit
## - first separator can be anywhere after first digit, but no more than 3 characters
## - there ``has`` to be 3 digits between successive separators
## - and between the last separator and the decimal dot
## - no separator after decimal dot
## - no duplicate separators
## - floats without separator allowed
var
buf = s
idx = 0
successive = 0
afterDot = false
lastWasDot = false
lastWasSep = false
hasAnySep = false
template bail(msg: untyped): untyped =
raise newException(ValueError, "Invalid float containing thousand separator." &
" Reason: " & $msg)
while idx < buf.len:
case buf[idx]
of sep:
if idx == 0:
bail("String starts with thousand separator.")
elif lastWasSep:
bail("Two separators in a row.")
elif afterDot:
bail("Separator found after decimal dot.")
buf.delete(idx, idx)
lastWasSep = true
hasAnySep = true
successive = 0
of '0' .. '9':
if hasAnySep and successive > 2:
bail("More than 3 digits between thousand separators.")
lastWasSep = false
lastWasDot = false
inc successive
inc idx
of decimalDot:
if idx == 0:
bail("String starts with decimal dot.")
elif hasAnySep and successive != 3:
bail("Not 3 successive digits before decimal point, despite larger 1000!")
successive = 0
lastWasDot = true
afterDot = true
inc idx
else:
# NOTE: could also move separator logic here if we wanted runtime separator
# selection. Case needs CT info
bail("Invalid character in float: " & $buf[idx])
result = buf.parseFloat
doAssert parseFloatThousandSep("1.0") == 1.0
doAssert parseFloatThousandSep("1.000") == 1.0
doAssert parseFloatThousandSep("1,000") == 1000.0
doAssertRaises(ValueError):
# invalid because , not the sep
discard parseFloatThousandSep("1,000", sep = '\'')
# compile time error due to duplicate case label
# parseFloatThousandSep("1,000", sep = '.')
doAssert parseFloatThousandSep("10,000.000") == 10000.0
doAssertRaises(ValueError):
# thousand sep after decimal dot
discard parseFloatThousandSep("10.000,000")
doAssert parseFloatThousandSep("1,000,000.000") == 1000000.0
doAssert parseFloatThousandSep("10,000,000.000") == 10000000.0
doAssertRaises(ValueError):
# starts with sep
discard parseFloatThousandSep(",123.000")
doAssertRaises(ValueError):
# starts with decimal dot
discard parseFloatThousandSep(".000")
doAssertRaises(ValueError):
# duplicate thousand sep
discard parseFloatThousandSep("123,,100.0")
doAssertRaises(ValueError):
# sep before dot
discard parseFloatThousandSep("123,.0") Up to someone else to decide if the separator should be compile time, but this made the code a lot nicer imo. Feel free to take it or leave it. :) |
Cool, can we also get the opposite of this? :) |
…ousands-separators
…ousands-separators
…ousands-separators
doAssert parseFloatThousandSep("-Inf", {pfNanInf}) == -Inf | ||
doAssert parseFloatThousandSep("+Inf", {pfNanInf}) == +Inf | ||
doAssert parseFloatThousandSep("1000.000000E+90") == 1e93 | ||
doAssert parseFloatThousandSep("-10 000 000 000.0001", sep=' ') == -10000000000.0001 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
parseFloatThousandSep("1e1")
raises but should be accepted even without {pfDotOptional}
because:
- a user may not want to set
{pfDotOptional}
(too loose, eg would allow integers to be parsed as float) - nim: strutils.parseFloat accepts it
- D: accepts it too (
rdmd --eval 'writeln("1e1".to!double);'
) - python3: accepts it (
float("1e1")
)
@juancarlospaco I think https://github.com/nim-lang/Nim/pull/15421/files#r534562246 is my last comment, after that LGTM finally... unless i've missed some other case EDIT: just found another bug: https://github.com/nim-lang/Nim/pull/15421/files#r534567618 please run sanity checks (or rather increase test coverage) before the next PTAL |
doAssert parseFloatThousandSep("1000.000000E+90") == 1e93 | ||
doAssert parseFloatThousandSep("-10 000 000 000.0001", sep=' ') == -10000000000.0001 | ||
doAssert parseFloatThousandSep("-10 000 000 000,0001", sep=' ', decimalDot = ',') == -10000000000.0001 | ||
doAssert classify(parseFloatThousandSep("NaN", {pfNanInf})) == fcNan |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
bug, this shouldn't be accepted:
nim> echo parseFloatThousandSep("inf.0", {pfNanInf})
0.0
@@ -0,0 +1,59 @@ | |||
import strmisc, math |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please move the content of tests/stdlib/tstrmiscs.nim
into here (tstrmiscs.nim was badly named)
|
||
proc parseFloatThousandSepRaise(i: int; c: char; s: openArray[char]) {.noinline, noreturn.} = | ||
raise newException(ValueError, | ||
"Invalid float containing thousand separators, invalid char $1 at index $2 for input $3" % |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
raise newException(ValueError,
"Invalid float containing thousand separators, invalid char $1 at index $2 for input '$3'" % [$c, $i, s.join])
so that it shows as
'1,0000'
instead of
['1', ',', '0', '0', '0', '0']
parseFloatThousandSepRaise(0, sep, str) # "1,1" | ||
|
||
if (strLen == 3 or strLen == 4) and ( | ||
(str[0] in {'i', 'I'} and str[1] in {'n', 'N'} and str[2] in {'f', 'F'}) or |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cna you simplify this logic? it's buggy for 2 reasons:
- https://github.com/nim-lang/Nim/pull/15421/files#r534567618 wrongly accepted
- inconsistently defers error handling to parseFloat:
nim> echo parseFloatThousandSep("infx", {pfNanInf})
Error: unhandled exception: invalid float: infx [ValueError]
nim> echo parseFloatThousandSep("infxx", {pfNanInf})
Error: unhandled exception: Invalid float containing thousand separators, invalid char , at index 0 for input 'infxx' [ValueError]
pfNanInf ## Allow "NaN", "Inf", "-Inf", etc. | ||
|
||
func parseFloatThousandSep*(str: openArray[char]; options: set[ParseFloatOptions] = {}; | ||
sep = ','; decimalDot = '.'): float = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can be discussed in future work, but just curious if there's a reverse proc for this?
I just found about insertSep
, maybe they should cross-reference each other?
I'm not sure how robust insertSep
is though; there's also formatFloat
and formatEng
, maybe they should be extended to support formatting with thousand separators
Sorry, rejected. For multiple reasons:
|
/cc @Araq
these are good arguments; how about the following instead: proc parseIntCustom(a: string, start: int, T: typedesc[SomeInt], options: ParseOpt = {}): tuple[val: T, num: int] =
## val: parsed value; num: number of characters parsed on success or 0 on error
runnableExamples:
assert parseIntCustom("score: -1,234,567 name: foo", start = 5, int) == (-1,234,567, 10)
assert parseIntCustom("--1", start = 0, int) == (0, 0) # because --
assert parseIntCustom("-1", start = 0, uint) == (0, 0) # because uint
=> an API for that can be built on top of
proposed API fixes that
but the more hacky way doesn't help with input validation
I think |
parseFloat
designed to parse floats with thousand separators as found in the wild formatted for humans.runnableExamples
withdoAssert
,since
, changelog, etc.:)
links