Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

isdigit(Char(::UInt8)) performance regression? #25883

Open
samoconnor opened this issue Feb 5, 2018 · 19 comments
Open

isdigit(Char(::UInt8)) performance regression? #25883

samoconnor opened this issue Feb 5, 2018 · 19 comments
Labels
strings "Strings!"

Comments

@samoconnor
Copy link
Contributor

In 0.7 it seems that isdigit(Char(x::Uint8)) produces more than 3 times as many instructions as
x::UInt8 in UInt8('0'):UInt8('9').

Is there a good reason not to define isdigit(x::UInt8) = x in UInt8('0'):UInt8('9') in Base?

julia> VERSION
v"0.6.2"

julia> code_native(x->isdigit(Char(x)), (UInt8,))
	.section	__TEXT,__text,regular,pure_instructions
	pushq	%rbp
	movq	%rsp, %rbp
	addb	$-48, %dil
	cmpb	$10, %dil
	setb	%al
	popq	%rbp
	retq
	nopw	%cs:(%rax,%rax)
julia> VERSION
v"0.7.0-DEV.3639"

# x in '0':'9' produces 5 instructions...
julia> code_native(x->x in UInt8('0'):UInt8('9'), (UInt8,))
	.section	__TEXT,__text,regular,pure_instructions
	addb	$-48, %dil
	cmpb	$10, %dil
	setb	%al
	retq
	nopl	(%rax)
;}

# ... but isdigit produces...
julia> code_native(x->isdigit(Char(x)), (UInt8,))
	.section	__TEXT,__text,regular,pure_instructions
	movzbl	%dil, %eax
	testb	%dil, %dil
	js	L22
	shll	$24, %eax
	cmpl	$805306368, %eax        ## imm = 0x30000000
	jae	L43
	xorl	%eax, %eax
	retq
L22:
	movl	%eax, %ecx
	andl	$63, %ecx
	andl	$192, %eax
	leal	(%rcx,%rax,4), %eax
	shll	$16, %eax
	orl	$3229614080, %eax       ## imm = 0xC0800000
L43:
	cmpl	$956301313, %eax        ## imm = 0x39000001
	setb	%al
	retq
	nopw	%cs:(%rax,%rax)
@samoconnor
Copy link
Contributor Author

Related: Is the omission of == and in methods for Integers and Chars deliberate?
i.e. should things like the following be defined?

==(i::T, c::Char) where T <: Integer = i == T(c)
in(a::T, b::StepRange{Char,S}) where {T <: Integer, S} = in(a, StepRange{T, S}(b))

This would allow 0x30 == '0' and 0x33 in '0':''9'.

@ararslan
Copy link
Member

ararslan commented Feb 5, 2018

Is the omission of == and in methods for Integers and Chars deliberate?

Yes, they used to be defined but were removed: #16024

@samoconnor
Copy link
Contributor Author

Thx for the reference @ararslan.
I think I tend to agree with @stevengj's comments in #16024.

The mapping between characters and integers [0:127] has been well defined since 1963.

If I write x == '\n', there is no ambiguity, I want to know if x is 0x0A.

#16024 (comment) raises the example of Set(Any[120, 'x']) == Set(['x']) == true as a problem. That particular use case seems unlikely. I would say that anyone putting 'x' in a Set as a special marker for something should use a Symbol :x instead. But maybe this was just intended as a minimal example and there are related real practical problems lurking...

The RFCs for UTF-8-aware wire-formats like HTTP and JSON are careful to ensure that most parsing can be done in the octet domain (ignoring UTF-8) and decoding of UTF-8 can be left to higher layers. The new CodeUnits wrapper is nice for parsers like this, but it isn't nice to have to write UInt8('[') every time I want to write c == '['.

However, in my parser module I can always define something like this:

==(a, b) = Base.isequal(a, b)
!=(a, b) = !(a == b)
==(i::T, c::Char) where T <: Integer = Base.isequal(i, T(c))
in(a, b) = Base.in(a, b)
in(a::T, b::StepRange{Char,S}) where {S, T <: Integer} = Base.in(a, StepRange{T, S}(b))

The argument that stuff not needed by base needn't be in Base is a sensible one. Maybe someone should build an ASCII.jl package that adds all the methods that "make obvious sense" when you are writing a byte parser, but are not helpful if you're doing linear algebra.

@nalimilan nalimilan added the strings "Strings!" label Feb 5, 2018
@stevengj
Copy link
Member

stevengj commented Feb 5, 2018

However, in my parser module I can always define something like this:

This is type piracy, which means it can easily conflict with definitions in other modules.

@samoconnor
Copy link
Contributor Author

Well, I'm not importing and modifying Base.==, I'm defining a non-expoted local MyModule.== who's default method is to call Base.==. If I understand correctly, this shouldn't have any impact outside MyModule.

@JeffBezanson
Copy link
Member

I think we just made a judgment call that characters and integers aren't the same kind of thing in general, and so should not be considered equal. It might make sense to add methods to isdigit et. al. for integers though, since those functions are explicitly for classifying characters --- there's nothing else the call could mean.

@stevengj
Copy link
Member

stevengj commented Feb 5, 2018

+1 for integer (or at least UInt8) versions of these functions

@StefanKarpinski
Copy link
Member

StefanKarpinski commented Feb 6, 2018

Disambiguating question: when one writes isdigit(b::UInt8) is b being interpreted as a code point or a code unit? In this particular case, they give the same answer; perhaps it doesn't matter.

@stevengj
Copy link
Member

stevengj commented Feb 6, 2018

A codepoint. Fortunately, because of the self-synchronizing property of UTF-8, you will never get false positives for codeunits that are part of multi-byte encoded codepoints.

@samoconnor
Copy link
Contributor Author

Right, the RFC says: "[UTF-8] has the quality of preserving the full US-ASCII [US-ASCII] range: US-ASCII characters are encoded in one octet having the normal US-ASCII value, and any octet with such a value can only stand for a US-ASCII character, and nothing else."

@StefanKarpinski
Copy link
Member

StefanKarpinski commented Feb 6, 2018

It seems like we should just broaden the character predicates to take integer code point arguments. This has the advantage that for some integer types like bytes, the answer is considerably easier to compute that way. There's probably also some optimization work to be done on conversion from UInt8 to Char. I did play around with some hand optimizations but it was not clearly better than what LLVM was generating already—LLVM is really good at partially specializing bit manipulation code on types and known values. Specifically, I tried these definitions:

function Char(b::UInt8)
    u = b % UInt32
    b < 0x80 && return reinterpret(Char, u << 24)
    m = 0xc2800000 + (UInt32(b  0xc0) << 24)
    reinterpret(Char, ((u & 0x3f) << 16) | m)
end

function Char(b::Int8)
    0  b || code_point_err(u)::Union{}
    reinterpret(Char, (b % UInt32) << 24)
end

But the generated code is nearly identical to the above for the existing definitions.

@nalimilan
Copy link
Member

It sounds too bad that we'd need to allow things like isdigit(1) === false because of performance. It could be confusing for people who don't know what's a codepoint and it's inconsistent with the separation of characters and integers in the rest of the API.

The ideal solution would be that the compiler removes unnecessary conversions. Given how smart LLVM can be, I'm surprised that even simple things like code_native(x->Char(x) == '0', (UInt8,)) generate so many instructions. (But I really can't help improving that...)

@StefanKarpinski
Copy link
Member

It's because the Int8 to Char conversion is no longer just a zero extension, although it should be possible for the compiler to figure out that for fixes characters to compare with a lot of this work doesn't matter. Given that the fast path here is just compare and shift, it's worth looking at if this is actually a performance issue or not. Did you benchmark it by any chance, @samoconnor?

@StefanKarpinski
Copy link
Member

A thought that @Keno and I talked about at some point was having a CodePoint type that represents a code point in a way that is decoupled from the representation of Char, which would allow expressing this as isdigit(CodePoint(b)). A slight wrinkle is that CodePoint is essentially the same as a as-yet-hypothetical CharU32 representing a character encoded in UTF-32.

@Keno
Copy link
Member

Keno commented Feb 6, 2018

Oh yeah, this was basically the reason I wanted that (because I didn't like defining these predicate functions on integers).

@stevengj
Copy link
Member

stevengj commented Feb 6, 2018

myisdigit(x::UInt8) = UInt8('0') ≤ x ≤ UInt8('9') seems to be about 1.7× faster than isdigit(Char(x)) on my machine with a recent Julia 0.7.

@samoconnor
Copy link
Contributor Author

@stevengj Note that x in UInt8('0'):UInt8('9') seems to produce less instructions than UInt8('0') ≤ x ≤ UInt8('9') (see dump in original post above).

@StefanKarpinski, I haven't done a specific benchmark on this. I've been benchmarking and studying code_native in my LazyJSON code. I noticed that a function I thought should reduce down to only a handful of instructions was a bit big and had large int constants like $3229614080 from the Unicode code.

[Aside: I'm making progress, in both speed and memory allocation:]

Access value close to start:
LazyJSON.jl:  0.000558 seconds (3.23 k allocations: 121.719 KiB)
JSON.jl:      1.025800 seconds (9.34 M allocations: 913.682 MiB, 12.53% gc time)

Access 2 values close to end:
LazyJSON.jl:  0.162446 seconds (2.66 k allocations: 71.250 KiB)
JSON.jl:      1.019581 seconds (9.34 M allocations: 913.682 MiB, 12.22% gc time)

Another consideration is how the definition works alongside other code. e.g.

if !isdigit(c)  ||
    isjsonws(c) ||
    c == '}'    ||
    c == ']'

I'm guessing that if isdigit has a nice simple definition the compiler/llvm has a better chance of reducing the whole expression to something optimal. I would worry that in the worst case a more complex unicode isdigit would not get inlined and you might end up with a function call in an inner loop.

@samoconnor
Copy link
Contributor Author

Related:

LLVM does (or did) have a pass to convert a sequence of integer comparisons to a jump table.
(from #18285 (comment))

@stevengj
Copy link
Member

stevengj commented Feb 9, 2018

x in UInt8('0'):UInt8('9') seems to produce less instructions than UInt8('0') ≤ x ≤ UInt8('9')

I think that's because ... ≤ x ≤ … is short-circuiting; range containment seems equivalent to ifelse(UInt8('0') ≤ x, x ≤ UInt8('9'), false). According to @btime, however, they are both the same speed, presumably due to branch prediction.

To get a benchmark that benefits less from branch prediction (which makes doing the same isdigit call over and over deceptive), I tried benchmarking sum(isdigit, b) for b = [UInt8(rand(Bool) ? '3' : 'a') for i=1:1000]. Results:

  • sum(b): 74ns
  • isdigit(Char(x)): 764ns
  • UInt8('0') ≤ x ≤ UInt8('9'): 285ns
  • x in UInt8('0'):UInt8('9') or ifelse version: 216ns

So, in the absence of branch prediction, you can get a 4–5× speedup (not counting the time for the sum operations) by operating directly on bytes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
strings "Strings!"
Projects
None yet
Development

No branches or pull requests

7 participants