Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Primitive operator functions are not inlined #198

Closed
ChengCat opened this issue Sep 29, 2018 · 6 comments
Closed

Primitive operator functions are not inlined #198

ChengCat opened this issue Sep 29, 2018 · 6 comments

Comments

@ChengCat
Copy link
Contributor

Compile the following program with dalec a.dt -s ir -O4 --no-dale-stdlib (Dale is compiled with LLVM 6.0.1),

(import cstdio)

(def main (fn extern-c int (void)
  (def sum (var auto \ 0))
  (for (i \ 0) (< i 500000000) (incv i)
    (setv sum (+ sum i)))
  (printf "%d\n" sum)
  0))

I get:

[...]

; Function Attrs: alwaysinline norecurse nounwind readonly
define weak_odr i32 @"_Z1$2bii"(i32 %a, i32 %b) local_unnamed_addr #0 {
entry:
  %0 = add nsw i32 %b, %a
  ret i32 %0
}

[...]

; Function Attrs: nounwind
define i32 @main() local_unnamed_addr #2 {
entry:
  %0 = tail call i8 @"_Z1$3cii"(i32 0, i32 500000000)
  %1 = and i8 %0, 1
  %2 = icmp eq i8 %1, 0
  br i1 %2, label %_dale_internal_label_breaklabel_2, label %_dale_internal_label_continuelabel_1.preheader

_dale_internal_label_continuelabel_1.preheader:   ; preds = %entry
  br label %_dale_internal_label_continuelabel_1

_dale_internal_label_continuelabel_1:             ; preds = %_dale_internal_label_continuelabel_1.preheader, %_dale_internal_label_continuelabel_1
  %.03 = phi i32 [ %3, %_dale_internal_label_continuelabel_1 ], [ 0, %_dale_internal_label_continuelabel_1.preheader ]
  %.012 = phi i32 [ %4, %_dale_internal_label_continuelabel_1 ], [ 0, %_dale_internal_label_continuelabel_1.preheader ]
  %3 = tail call i32 @"_Z1$2bii"(i32 %.03, i32 %.012)
  %4 = tail call i32 @"_Z1$2bii"(i32 %.012, i32 1)
  %5 = tail call i8 @"_Z1$3cii"(i32 %4, i32 500000000)
  %6 = and i8 %5, 1
  %7 = icmp eq i8 %6, 0
  br i1 %7, label %_dale_internal_label_breaklabel_2, label %_dale_internal_label_continuelabel_1

_dale_internal_label_breaklabel_2:                ; preds = %_dale_internal_label_continuelabel_1, %entry
  %.0.lcssa = phi i32 [ 0, %entry ], [ %3, %_dale_internal_label_continuelabel_1 ]
  %8 = tail call i32 (i8*, ...) @printf(i8* getelementptr inbounds ([4 x i8], [4 x i8]* @_dvidrl0, i64 0, i64 0), i32 %.0.lcssa)
  ret i32 0
}

[...]

Note those _Z1$2bii calls, the primitive operator functions are not inlined, despite being marked as alwaysinline. This leads to very bad performance, and on my machine it's 10x slower than an equivalent C program compiled with gcc.

After some investigation, I think this is caused by inefficient use of LLVM. As indicated by LLVM command line tools, LLVM can easily optimize the whole thing into printing a constant. With the following commands,

> llvm-as a.dt.ll
> opt -O1 a.dt.bc -o O1.bc
> llvm-dis O1.bc

A file O1.ll is produced:

[...]

; Function Attrs: nounwind
define i32 @main() local_unnamed_addr #2 {
_dale_internal_label_breaklabel_2:
  %0 = tail call i32 (i8*, ...) @printf(i8* getelementptr inbounds ([4 x i8], [4 x i8]* @_dvidrl0, i64 0, i64 0), i32 1711656320)
  ret i32 0
}

[...]

I also think --no-dale-stdlib should always be enabled. Those Dale run-time functions are very small, and without them being properly inlined, performance of both compiled programs and compilation time (to run the macros) is largely degraded.

tomhrr added a commit that referenced this issue Oct 4, 2018
tomhrr added a commit that referenced this issue Oct 4, 2018
tomhrr added a commit that referenced this issue Oct 6, 2018
This means that those functions can always be linked statically, which
in turn means that they are inlined regardless of optimisation
settings.
tomhrr added a commit that referenced this issue Oct 6, 2018
tomhrr added a commit that referenced this issue Oct 6, 2018
@tomhrr
Copy link
Owner

tomhrr commented Oct 6, 2018

Thanks for this: I had noticed some of these problems when working on
some recent updates to support newer versions of LLVM, but forgot to
go back and fix them afterwards. The basic arithmetical/relational
functions are now always inlined, and for the example program
(a.dt), O1 or higher will cause it to be optimised into the
printing of a constant.

@tomhrr
Copy link
Owner

tomhrr commented Oct 7, 2018

There was an additional problem with those basic functions being retained
in the output even when they weren't needed, which has now been fixed.

@ChengCat
Copy link
Contributor Author

ChengCat commented Oct 8, 2018

Hi, I have just tried to help testing the fixes. Dale now works beautifully with that example. Thanks!

I noticed two minor issues.

; in a.dt
(import cstdio)
(import macros)
(import b)

(def m (macro extern (void)
  (def sum (var auto \ 0))
  (for (i \ 0) (< i 500000000) (incv i)
    (setv sum (+ sum i)))
  (std.macros.mnfv mc sum)))
(def f (fn extern int (void)
  (def sum (var auto \ 0))
  (for (i \ 0) (< i 500000000) (incv i)
    (setv sum (+ sum i)))
  sum))

(def main (fn extern-c int (void)
  (printf "%d\n" (m))
  (printf "%d\n" (f))
  (printf "%d\n" (b.m))
  (printf "%d\n" (b.f))))
; in b.dt
(module b)
(import stdlib)
(import macros)
(namespace b
  (def m (macro extern (void)
    (def sum (var auto \ 0))
    (for (i \ 0) (< i 500000000) (incv i)
      (setv sum (+ sum i)))
    (std.macros.mnfv mc sum)))
  (def f (fn extern int (void)
    (def sum (var auto \ 0))
    (for (i \ 0) (< i 500000000) (incv i)
      (setv sum (+ sum i)))
    sum)))

The above two files are compiled as:

dalec -c b.dt -O4
dalec a.dt -O4 -o a
  1. It is slow to use a macro from the same module, leading to slow compile time. Currently, it is already efficient to use macros from any other module, so I think this is minor issue.
  2. The basic operator functions are retained in libb.bc.

@tomhrr
Copy link
Owner

tomhrr commented Oct 11, 2018

Thanks for the extra comments. 2 is now fixed, but I don't think there's any way to fix 1, since the slowness appears to be due to compilation of the macro.

@ChengCat
Copy link
Contributor Author

I thought the slowness of 1 was mainly caused by that, JIT compilation is another code path, and inline is not properly handled in that code path. But I am not sure.

I am closing this issue for now, since it is not important anyway.

@ChengCat
Copy link
Contributor Author

I forgot to say.. Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants