isdigit(Char(::UInt8)) performance regression? #25883
Related: Is the omission of
This would allow
Yes, they used to be defined but were removed: #16024
Thanks for the reference @ararslan. The mapping between characters and integers [0:127] has been well defined since 1963. #16024 (comment) raises a related example. The RFCs for UTF-8-aware wire formats like HTTP and JSON are careful to ensure that most parsing can be done in the octet domain (ignoring UTF-8), leaving UTF-8 decoding to higher layers. However, in my parser module I can always define something like this:
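The code block that followed is a minimal sketch of such module-local definitions; the module and function names here are hypothetical, not from the thread:

```julia
module ByteParsing

# Predicates live on our own function names inside our own module,
# so no Base method is extended (no type piracy).
is_digit_byte(b::UInt8) = UInt8('0') <= b <= UInt8('9')
is_hex_byte(b::UInt8) =
    is_digit_byte(b) ||
    (UInt8('a') <= b <= UInt8('f')) ||
    (UInt8('A') <= b <= UInt8('F'))

end # module
```

Because these methods attach to new functions rather than to `Base.isdigit`, they cannot clash with definitions in other packages.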
The argument that stuff not needed by Base needn't be in Base is a sensible one. Maybe someone should build an ASCII.jl package that adds all the methods that "make obvious sense" when you are writing a byte parser, but are not helpful if you're doing linear algebra.
This is type piracy, which means it can easily conflict with definitions in other modules.
Well, I'm not importing and modifying
I think we just made a judgment call that characters and integers aren't the same kind of thing in general, and so should not be considered equal. It might make sense to add methods to
+1 for integer (or at least
Disambiguating question: when one writes
A codepoint. Fortunately, because of the self-synchronizing property of UTF-8, you will never get false positives for code units that are part of multi-byte encoded codepoints.
Right, the RFC says: "[UTF-8] has the quality of preserving the full US-ASCII [US-ASCII] range: US-ASCII characters are encoded in one octet having the normal US-ASCII value, and any octet with such a value can only stand for a US-ASCII character, and nothing else."
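That property is easy to check directly; a small illustration (the sample string is arbitrary):

```julia
s = "a€b"                      # '€' encodes as three octets
bytes = collect(codeunits(s))  # [0x61, 0xe2, 0x82, 0xac, 0x62]
# Octets belonging to multi-byte codepoints are all ≥ 0x80,
# so any octet < 0x80 can only be a US-ASCII character.
@assert all(b -> b >= 0x80, bytes[2:4])
@assert filter(b -> b < 0x80, bytes) == [UInt8('a'), UInt8('b')]
```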
It seems like we should just broaden the character predicates to take integer code point arguments. This has the advantage that for some integer types, like bytes, the answer is considerably easier to compute that way. There's probably also some optimization work to be done on conversion from `UInt8` to `Char`; for example:

```julia
function Char(b::UInt8)
    u = b % UInt32
    b < 0x80 && return reinterpret(Char, u << 24)
    m = 0xc2800000 + (UInt32(b ≥ 0xc0) << 24)
    reinterpret(Char, ((u & 0x3f) << 16) | m)
end
function Char(b::Int8)
    0 ≤ b || code_point_err(b % UInt32)::Union{}
    reinterpret(Char, (b % UInt32) << 24)
end
```

But the generated code is nearly identical to the above for the existing definitions.
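A sketch of what such a broadened predicate could look like (the name `isdigit_cp` is hypothetical, not Base API):

```julia
# For a byte -- or any integer codepoint -- a single range check
# suffices, with no conversion to Char at all.
isdigit_cp(c::Integer) = UInt32('0') <= c <= UInt32('9')
isdigit_cp(c::AbstractChar) = isdigit_cp(UInt32(c))
```

Julia's mixed signed/unsigned comparisons are value-correct, so negative integers fall out of the range as expected.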
It sounds too bad that we'd need to allow things like that. The ideal solution would be for the compiler to remove unnecessary conversions. Given how smart LLVM can be, I'm surprised that even simple conversions like this aren't optimized away.
It's because the
A thought that @Keno and I talked about at some point was having a
Oh yeah, this was basically the reason I wanted that (because I didn't like defining these predicate functions on integers).
@stevengj Note that
@StefanKarpinski, I haven't done a specific benchmark on this. I've been benchmarking and studying [Aside: I'm making progress, in both speed and memory allocation.]
Another consideration is how the definition works alongside other code. e.g.
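One plausible kind of interaction (using a hypothetical `my_isdigit`, since the original example was not captured) is generic code that runs over either characters or raw code units:

```julia
# With a byte method alongside the Char method, the same predicate
# works on a String and on its codeunits.
my_isdigit(c::Char)  = isdigit(c)
my_isdigit(x::UInt8) = x in UInt8('0'):UInt8('9')

digits_only(itr) = all(my_isdigit, itr)
```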
I'm guessing that if
Related:
I think that's because of branch prediction. To get a benchmark that benefits less from branch prediction (which makes repeating the same check on predictable data artificially cheap), try random bytes.
So, in the absence of branch prediction, you can get a 4–5× speedup (not counting the time for the conversion).
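A sketch of such a benchmark over random bytes, so the branch predictor cannot learn the input pattern (timing macros omitted):

```julia
# Compare the two implementations on the same random data.
bytes = rand(UInt8, 10^6)
count_via_char(v) = count(b -> isdigit(Char(b)), v)
count_via_byte(v) = count(b -> b in UInt8('0'):UInt8('9'), v)
# Time these with @time (or BenchmarkTools.@btime) to compare;
# both compute the same answer.
```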
In 0.7 it seems that

```julia
isdigit(Char(x::UInt8))
```

produces more than 3 times as many instructions as

```julia
x::UInt8 in UInt8('0'):UInt8('9')
```

Is there a good reason not to define

```julia
isdigit(x::UInt8) = x in UInt8('0'):UInt8('9')
```

in Base?