-
Notifications
You must be signed in to change notification settings - Fork 23
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[RFC] [performance] {.inline.} guarantees, --stacktrace:inline
, and --ignoreInline
#198
Comments
--stacktrace:inline
, and --ignoreInline
--stacktrace:inline
, and --ignoreInline
--stacktrace:inline
, and --ignoreInline
So we make debugging harder, introduce a bunch of new arguments/pragmas that make everything unneccessarily more complex, instead of just using the already existing |
so your solution when a program (eg, nim itself) compiled with --stacktrace:on is too slow is to:
You can't just permanently add I don't see how that's practical, and that's obviously much more complex for users than what I've proposed and implemented: simply compile with
no, there's a mode that preserves pre-existing semantics; it's just a question of whether to default to old mode or to new mode. |
All that's really needed IMHO is that we map Nim's .inline to C++'s always_inline. |
well I've updated nim-lang/Nim#13582 so that existing semantics are unchanged, instead the new --stacktrace:noinline can give you large speedups while typically producing identical stracktrace, ie, a good compromise bw speed and debuggability, see nim-lang/Nim#13582 (comment) for details |
I did some further investigation, here are the conclusions using clang on OSX (should mostly generalize to gcc on linux):
here's a minimal but illustrative example: when defined case5:
import times
when defined case_normal:
proc fun[T](x, i: T): T = x + i
when defined case_inline:
proc fun[T](x, i: T): T {.inline.} = x + i
when defined case_noinline:
proc fun[T](x, i: T): T {.noinline.} = x + i
when defined case_alwaysinline:
# modify `N_INLINE` in nimbase.h and replace `inline` by `__attribute__((__always_inline__))`
proc fun[T](x, i: T): T {.inline.} = x + i
when defined case_template:
template fun[T](x, i: T): T = x + i
proc main() =
var x: uint = 1
let n: uint = 1_000_000_000
var i = uint(0)
while true:
x = fun(x,i)
if i == n: break
i = i + 1
doAssert x != 1234
let t = cpuTime()
main()
echo cpuTime() - t
when defined case5_runner:
import os,strformat
proc fun(opt: string) =
const nim = getCurrentCompilerExe()
let input = currentSourcePath
let cmd = fmt"{nim} c -r --skipParentCfg --skipUserCfg --hints:off -d:case5 --stacktrace:off -f {opt} {input}"
echo cmd
doAssert execShellCmd(cmd) == 0
for opt2 in ["", "--passC:-Og", "--passC:-O", "-d:danger"]:
echo "\nopt2: ", opt2
fun fmt"-d:case_normal {opt2}"
fun fmt"-d:case_inline {opt2}"
fun fmt"-d:case_noinline {opt2}"
fun fmt"-d:case_template {opt2}"
# fun fmt"-d:case_alwaysinline {opt2}" performance numbers for debug buildsas you can see:
links |
It's not about speed really, |
this would be bad for performance. It's not as simple as that; source code for a proc declaration has very limited context to decide on whether to inline, unlike C compiler which has much more relevant context (compile flags, params resolved to concrete types, candidate insertion inside a inlining context); hence those articles I linked in previous post which show that force_inline may be slower than In fact, the same I actually started with a branch (which i decided not to submit as PR) where I introduced another pragma
but I realized it's really not the main thing at play here, the main thing was passing --opt:speed (or --passC:-O or even Speed should be the main criterion here, correctness is already guaranteed by calling convention regardless of whether it ends up being inlined or not (eg, there won't be link or multiple definition errors regardless of whether it gets inlined or not). |
I have nothing to add really. inline should inline, period. It's a way to override the compiler's optimizer by saying "I do know better, trust me". There are no other reasons for inline's existance. |
yes there is a very important one: it tells the compiler that, when crossing module boundaries, the foreign function may be inlined.
here's a minimal test case showing this:
this shows so Otherwise you'll end up in a situation where there are only bad choices:
|
Well yes, I'm aware of the cross module inlining issue but I hope link-time optimization is good enough these days to make it a non-issue. |
As an alternative we could have |
furthermore,
conclusion:
I want the other way around, the common case is to let do the C compiler what it's good at (optimizing what should be inlined, based on whole context) and providing an option to force inlining for the very rare case where we want to override the compiler smarts. That's pretty much exactly what my unsubmitted PR is about: see timotheecour/Nim#57 (you can see a clean diff by selecting a range of the last few commits only) proc fun() {.inline.} = discard # same semantics as before, common case
proc fun() {.alwaysinline.} = discard # forces inlining, mapping to compiler-specific attributes maybe I can rename it to {.forceinline.}, it's a trivial change
|
No, I cannot provide such an example (for now!) so the status quo seems sufficient. But I personally do not like non-inlined inline functions, I think I know what I'm doing when I use the
Actually a factor of 3 is suprisingly good IMO. And compile-times are insignificant for release builds of software that is shipped to thousands of customers. |
At least we now agree we need 2 (at least 2) inlining annotations (whether it's my recommendation (hint = {.inline.}, force = {.forceinline.}) or (hint = {.hot.}, force = {.inline.}); for the specific case of allowing cross-module inlining.
pgo for inlining decisionsNim can learn some way to map a profiler output to force-inline or force-noinline or hint-inline (or other optimization strategies, eg branch prediction / likely / unlikely) on hotspots, that's definitely something that should be done at some point and wouldn't even be too hard to do. All you'd need is:
this can be made robust to avoid having to rerun pgo after source code changes (leading to different mangled cgen'd names), there's lot of possibilities there.
of course, but that's very use case dependent (eg during development you may want to balance compile time vs performance) links
|
Nim handles stack traces by explicitly adding redundant code that pushes and pops data, adding code size and memory traffic - in the world of stack traces, this is not at all necessary: the CPU already records all information necessary to generate a stack trace as part of its "normal" control flow implementation: this information can be extracted using https://github.com/status-im/nim-libbacktrace There's no way around that the current implementation will always be ridiculously slow compared to simply reading the stack with libbacktrace, thus the feature should really focus on debuggability rather than performance and as such, complicating things with special rules for inlining etc seems counterproductive. |
some of these (especially P1) will result in a large speedup for builds with --stacktrace:off (eg debug builds)
links
RFC proposals
P1 an {.inline.} proc should by default NOT generate
nimfr_
+popFrame
(regardless of -d:release or not) even with --stacktrace:on. Otherwise --stacktrace:on can be very slowP2 a new option should be introduced
--stacktrace:inline
(implies--stacktrace:on
) to generatenimfr_
+popFrame
even for{.inline.}
P3 currently
proc fun2x(a: int): int {.inline.}
generates:regardless of
-d:danger
or notaccording to https://stackoverflow.com/questions/25602813/force-a-function-to-be-inline-in-clang-llvm it seems
static
should maybe not be used:we should investigate to make sure it works as intended, and wheter
__attribute__((always_inline))
is more appropriate (cf hints vs guarantees)P4
#define N_INLINE_PTR(rettype, name) rettype (*name)
is dead code: no use ofN_INLINE_PTR
(and no idea what it's for)P5 we should have a compiler option
--ignoreInline
way to skip inlining (for debug and maybe even release builds, it's and orthogonal decision), helps w debugging esp stacktraces=> this would only control nim code that cgen's inline, and would NOT be conflated with C-specific flags that controls inlining, so as not to interfere with other libraries; user can always pass
--passC:XXX
in addition to that if neededP6
--ignoreInline
shall be usable in cmdline as well as in {.push.}/{.pop.} sections just like--rangeChecks
+ friendsP7 an imported {.inline.} proc currently gets its C source code duplicated in every module it's used, which could potentially slow down c code compilation (but no need to fix if benchmarking reveals it's not a bottleneck). So it's some kind of double inlining since we already have the
N_INLINE
macrostatic N_INLINE(NI, fun2x__rxUfcN4ZvcApCx0waSVH8At10303b)(NI a) {...}
Instead we can use what's recommended here https://gustedt.wordpress.com/2010/11/29/myth-and-reality-about-inline-in-c99/ and here https://stackoverflow.com/a/43442230/1426932 and introduce the inline proc definition in a header file,
inline fun2x__rxUfcN4ZvcApCx0waSVH8At10303b(NI a) { ... }
imported by all clients of it, and then useextern
in a single location in a source fileThe text was updated successfully, but these errors were encountered: