Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ncodeunits(c::Char): fast equivalent of ncodeunits(string(c)) #29153

Merged
merged 2 commits into from
Sep 14, 2018

Conversation

StefanKarpinski
Copy link
Member

No description provided.

base/char.jl Outdated
@@ -134,7 +141,7 @@ function decode_overlong(c::Char)
end

"""
decode_overlong(c::AbstractChar)
decode_overlong(c::AbstractChar) -> UInt32
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this return type for all AbstractChar or only for Char?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The result is a code point which there's no reason to represent as anything but a UInt32.

base/char.jl Outdated
ncodeunits(c::Char) = max(1, 4 - (trailing_zeros(reinterpret(UInt32, c)) >> 3))

"""
codepoint(c::AbstractChar) -> UInt32
Copy link
Member

@KristofferC KristofferC Sep 12, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps UInt32->Integer if this is for AbstractChar or add an extra line

codepoint(c::AbstractChar) -> Integer
codepoint(c::Char) -> UInt32

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Likewise, there's not reason not to represent a code point as a UInt32.

Copy link
Member

@KristofferC KristofferC Sep 13, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I was thinking about the docs for this function that explicitly say

For `Char`, this is a `UInt32` value, but
`AbstractChar` types that represent only a subset of Unicode may
return a different-sized integer (e.g. `UInt8`).

The signature and the docs seem at odds now.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess someone could do that although it seems kind of silly to me. I didn't write these docs.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Who wrote the docs are not really that important though... Just that the end result is consistent and doesn't say in the signature that codepoint(c::AbstractChar) has to return a UInt32 and in the documentation just below it, that it can return a UInt8.

@stevengj
Copy link
Member

stevengj commented Sep 13, 2018

You could define a fallback method ncodeunits(c::AbstractChar) = write(devnull, c)? Maybe not, since write will return a count of bytes, which might not be code-units if c represents some other encoding?

@stevengj
Copy link
Member

Interestingly, according to @btime, the ncodeunits implementation here takes exactly the same amount of time as write(devnull, c).

@@ -91,7 +98,7 @@ end
# not to support malformed or overlong encodings.

"""
ismalformed(c::AbstractChar)
ismalformed(c::AbstractChar) -> Bool
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we still using this syntax or are we transitioning to ismalformed(c::AbstractChar)::Bool in documentation?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I'm aware we're still using this syntax most places.

@StefanKarpinski
Copy link
Member Author

This is pretty efficient:

julia> @code_native write(devnull, 'x')
    bswapl	%edi
    xorl	%eax, %eax
    nopw	%cs:(%rax,%rax)
L16:
    shrl	$8, %edi
    addq	$1, %rax
    testl	%edi, %edi
    jne	L16
    retq

Versus:

julia> @code_native ncodeunits('x')
	tzcntl	%edi, %eax
	shrl	$3, %eax
	movl	$4, %ecx
	subq	%rax, %rcx
	testq	%rcx, %rcx
	movl	$1, %eax
	cmovgq	%rcx, %rax
	retq

The write version might be slower if the loop branch is unpredictable, i.e. characters with different sizes. The ncodeunits version could be faster if I could get rid of the max in it. There may be some clever way to do that but I'm not sure that this really warrants that much micro-optimization.

@vtjnash
Copy link
Member

vtjnash commented Sep 13, 2018

Intel predicts the second is faster, by 10%:

$ ./usr/tools/llvm-mca
    bswapl	%edi
    xorl	%eax, %eax
    nopw	%cs:(%rax,%rax)
L16:
    shrl	$8, %edi
    addq	$1, %rax
    testl	%edi, %edi
    jne	L16
^D
Iterations:        100
Instructions:      700
Total Cycles:      205
Total uOps:        700

Dispatch Width:    6
uOps Per Cycle:    3.41
IPC:               3.41
Block RThroughput: 1.2


Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)

[1]    [2]    [3]    [4]    [5]    [6]    Instructions:
 1      1     0.50                        bswapl	%edi
 1      1     0.25                        xorl	%eax, %eax
 1      1     0.17                        nopw	%cs:(%rax,%rax)
 1      1     0.50                        shrl	$8, %edi
 1      1     0.25                        addq	$1, %rax
 1      1     0.25                        testl	%edi, %edi
 1      1     0.50                        jne	L16


Resources:
[0]   - SKXDivider
[1]   - SKXFPDivider
[2]   - SKXPort0
[3]   - SKXPort1
[4]   - SKXPort2
[5]   - SKXPort3
[6]   - SKXPort4
[7]   - SKXPort5
[8]   - SKXPort6
[9]   - SKXPort7


Resource pressure per iteration:
[0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    
 -      -     1.50   1.50    -      -      -     1.50   1.50    -     

Resource pressure by instruction:
[0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    Instructions:
 -      -      -     0.50    -      -      -     0.50    -      -     bswapl	%edi
 -      -     0.49    -      -      -      -     0.50   0.01    -     xorl	%eax, %eax
 -      -      -      -      -      -      -      -      -      -     nopw	%cs:(%rax,%rax)
 -      -     0.50    -      -      -      -      -     0.50    -     shrl	$8, %edi
 -      -      -     0.50    -      -      -     0.50    -      -     addq	$1, %rax
 -      -      -     0.50    -      -      -      -     0.50    -     testl	%edi, %edi
 -      -     0.51    -      -      -      -      -     0.49    -     jne	L16


$ ./usr/tools/llvm-mca
	tzcntl	%edi, %eax
	shrl	$3, %eax
	movl	$4, %ecx
	subq	%rax, %rcx
	testq	%rcx, %rcx
	movl	$1, %eax
	cmovgq	%rcx, %rax
^D
Iterations:        100
Instructions:      700
Total Cycles:      184
Total uOps:        700

Dispatch Width:    6
uOps Per Cycle:    3.80
IPC:               3.80
Block RThroughput: 1.2


Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)

[1]    [2]    [3]    [4]    [5]    [6]    Instructions:
 1      3     1.00                        tzcntl	%edi, %eax
 1      1     0.50                        shrl	$3, %eax
 1      1     0.25                        movl	$4, %ecx
 1      1     0.25                        subq	%rax, %rcx
 1      1     0.25                        testq	%rcx, %rcx
 1      1     0.25                        movl	$1, %eax
 1      1     0.50                        cmovgq	%rcx, %rax


Resources:
[0]   - SKXDivider
[1]   - SKXFPDivider
[2]   - SKXPort0
[3]   - SKXPort1
[4]   - SKXPort2
[5]   - SKXPort3
[6]   - SKXPort4
[7]   - SKXPort5
[8]   - SKXPort6
[9]   - SKXPort7


Resource pressure per iteration:
[0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    
 -      -     1.76   1.74    -      -      -     1.73   1.77    -     

Resource pressure by instruction:
[0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    Instructions:
 -      -      -     1.00    -      -      -      -      -      -     tzcntl	%edi, %eax
 -      -     0.17    -      -      -      -      -     0.83    -     shrl	$3, %eax
 -      -     0.02    -      -      -      -     0.96   0.02    -     movl	$4, %ecx
 -      -     0.41   0.17    -      -      -     0.03   0.39    -     subq	%rax, %rcx
 -      -     0.31   0.31    -      -      -     0.07   0.31    -     testq	%rcx, %rcx
 -      -     0.01   0.26    -      -      -     0.67   0.06    -     movl	$1, %eax
 -      -     0.84    -      -      -      -      -     0.16    -     cmovgq	%rcx, %rax

But that's actually only for my native CPU, if we look back in time (ivybridge, broadwell, haswell, nehalem), we see that the predicted performance of the first has been relatively unchanged over time, while the performance of the second has been steadily improving.

What's I think is likely happening is that the first loop is actually much cheaper for the processor to execute (much lower latency), so it has always done fairly well in a benchmarking loop. Whereas the second loop actually requires more transistors to reach the same level of performance (the above output is truncated, the full output includes some graphs to illustrate this point). I could be wrong, since I'm just reverse-engineering the output of a static-prediction tool, but that would be my analysis.

There was a non-public `codelen(c::Char)` method which previously did
this. This also replaces internal uses of this with `ncodeunits(c)`.
@StefanKarpinski
Copy link
Member Author

I replaced the internal-only Base.codelen function with this new method of ncodeunits.

@StefanKarpinski StefanKarpinski merged commit 3b02991 into master Sep 14, 2018
@StefanKarpinski StefanKarpinski deleted the sk/ncodeunits-char branch September 14, 2018 15:11
@StefanKarpinski
Copy link
Member Author

Also, went with the write(devnull, c) definition since I figure that something that works well on a larger span of older and newer CPUs is somewhat better. If this changes in the future we could always use a different definition.

fredrikekre added a commit that referenced this pull request Dec 5, 2018
changes between Julia 1.0 and 1.1, including:

- Custom .css-style for compat admonitions.

- Information about compat annotations to CONTRIBUTING.md.

- NEWS.md entry for PRs #30090, #30035, #30022, #29978,
  #29969, #29858, #29845, #29754, #29638, #29636, #29615,
  #29600, #29506, #29469, #29316, #29259, #29178, #29153,
  #29033, #28902, #28761, #28745, #28708, #28696, #29997,
  #28790, #29092, #29108, #29782

- Compat annotation for PRs #30090, #30013, #29978,
  #29890, #29858, #29827, #29754, #29679, #29636, #29623,
  #29600, #29440, #29316, #29259, #29178, #29157, #29153,
  #29033, #28902, #28878, #28761, #28708, #28156, #29733,
  #29670, #29997, #28790, #29092, #29108, #29782, #25278

- Documentation for broadcasting CartesianIndices (#30230).
- Documentation for Base.julia_cmd().
- Documentation for colon constructor of CartesianIndices (#29440).
- Documentation for ^(::Matrix, ::Number) and ^(::Number, ::Matrix).

- Run NEWS-update.jl.

Co-authored-by: Morten Piibeleht <morten.piibeleht@gmail.com>
Co-authored-by: Fredrik Ekre <ekrefredrik@gmail.com>
fredrikekre added a commit that referenced this pull request Dec 5, 2018
changes between Julia 1.0 and 1.1, including:

- Custom .css-style for compat admonitions.

- Information about compat annotations to CONTRIBUTING.md.

- NEWS.md entry for PRs #30090, #30035, #30022, #29978,
  #29969, #29858, #29845, #29754, #29638, #29636, #29615,
  #29600, #29506, #29469, #29316, #29259, #29178, #29153,
  #29033, #28902, #28761, #28745, #28708, #28696, #29997,
  #28790, #29092, #29108, #29782

- Compat annotation for PRs #30090, #30013, #29978,
  #29890, #29858, #29827, #29754, #29679, #29636, #29623,
  #29600, #29440, #29316, #29259, #29178, #29157, #29153,
  #29033, #28902, #28878, #28761, #28708, #28156, #29733,
  #29670, #29997, #28790, #29092, #29108, #29782, #25278

- Documentation for broadcasting CartesianIndices (#30230).
- Documentation for Base.julia_cmd().
- Documentation for colon constructor of CartesianIndices (#29440).
- Documentation for ^(::Matrix, ::Number) and ^(::Number, ::Matrix).

- Run NEWS-update.jl.

Co-authored-by: Morten Piibeleht <morten.piibeleht@gmail.com>
Co-authored-by: Fredrik Ekre <ekrefredrik@gmail.com>
fredrikekre added a commit that referenced this pull request Dec 5, 2018
Addition of NEWS and compat admonitions for important changes between Julia 1.0 and 1.1, including:

- Custom .css-style for compat admonitions.

- Information about compat annotations to CONTRIBUTING.md.

- NEWS.md entry for PRs #30090, #30035, #30022, #29978,
  #29969, #29858, #29845, #29754, #29638, #29636, #29615,
  #29600, #29506, #29469, #29316, #29259, #29178, #29153,
  #29033, #28902, #28761, #28745, #28708, #28696, #29997,
  #28790, #29092, #29108, #29782

- Compat annotation for PRs #30090, #30013, #29978,
  #29890, #29858, #29827, #29754, #29679, #29636, #29623,
  #29600, #29440, #29316, #29259, #29178, #29157, #29153,
  #29033, #28902, #28878, #28761, #28708, #28156, #29733,
  #29670, #29997, #28790, #29092, #29108, #29782, #25278

- Documentation for broadcasting CartesianIndices (#30230).
- Documentation for Base.julia_cmd().
- Documentation for colon constructor of CartesianIndices (#29440).
- Documentation for ^(::Matrix, ::Number) and ^(::Number, ::Matrix).

- Run NEWS-update.jl.


Co-authored-by: Morten Piibeleht <morten.piibeleht@gmail.com>
Co-authored-by: Fredrik Ekre <ekrefredrik@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants