--------------------------------------------
Notes on nofib programs
--------------------------------------------
Every time I do performance tuning on GHC I re-discover wrinkles in
particular nofib programs. I intend to record these wrinkles herein.
NOTE: With 4.09 we introduced Unicode. That means that (C# x) always allocates,
whereas it didn't before. So allocations go up a bit.
---------------------------------------
Imaginary suite
---------------------------------------
integrate
~~~~~~~~~
integrate1D is strict in its second argument 'u', but it also passes 'u' to
function 'f'. Hence it now does some reboxing, which pushes up allocation
slightly.
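For illustration, the shape of the problem is roughly this (a minimal sketch;
the body is made up, not the benchmark's actual definition):

   integrate1D :: Double -> Double -> (Double -> Double) -> Double
   integrate1D l u f = u `seq` (f l + f u) / 2
     -- The worker gets 'u' unboxed (it is scrutinised), but 'u' is
     -- also passed, boxed, to 'f', so the worker must rebox it.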
gen_regexps
~~~~~~~~~~~
I found that there were some very bad loss-of-arity cases in PrelShow.
In particular, we had:
showl "" = showChar '"' s
showl ('"':xs) = showString "\\\"" . showl xs
showl (x:xs) = showLitChar x . showl xs
Trouble is, we get
   showl = \xs -> case xs of
                    ...
                    (x:xs) -> let f = showLitChar x
                                  g = showl xs
                              in \s -> f (g s)
which is TERRIBLE. We can't spot that showLitChar has arity 2 because
it looks like this:
   ...other eqns...
   showLitChar c = showString ('\\' : asciiTab!!ord c)
Notice that the (asciiTab!!ord c) is outside the \s, so GHC can't rewrite it to
   showLitChar c = \s -> showString ('\\' : asciiTab!!ord c) s
So I've changed PrelShow.showLitChar to use explicit \s. Even then, showl
doesn't work, because GHC can't see that showl xs can be pushed inside the \s.
So I've put an explicit \s there too.
showl "" s = showChar '"' s
showl ('"':xs) s = showString "\\\"" (showl xs s)
showl (x:xs) s = showLitChar x (showl xs s)
Net result: imaginary/gen_regexps more than halves in allocation!
x2n1
~~~~
Relies quite heavily on specialisations for Num (Complex Double). If
this test suddenly gets worse by a factor of 2 or so, chances are that
specialisation is broken.
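For reference, the kind of specialisation involved can be requested explicitly
with a pragma (a hedged sketch with a made-up function; in the benchmark the
specialisations come from GHC's automatic specialiser, not from user pragmas):

   import Data.Complex

   sumSquares :: Num a => Int -> a
   sumSquares n = sum [ fromIntegral k * fromIntegral k | k <- [1..n] ]
   {-# SPECIALISE sumSquares :: Int -> Complex Double #-}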
---------------------------------------
Spectral suite
---------------------------------------
rewrite
~~~~~~~
It's important to inline p_ident.
There's a very delicate CSE in p_expr
   p_expr = seQ q_op [p_term1, p_op, p_term2] ## p_term3
(where p_term1,2,3 are really all just p_term).
This expands into
   p_expr s = case p_term1 s of
                Nothing     -> p_term3 s
                Just (v,s') -> case p_op s' of ...
                                 Nothing -> p_term3 s
Notice that if the bit before (##) fails, we call p_term3 s, and if CSE
notices, we can re-use the result of the first call. But that depends on
how much inlining takes place. In particular, the follow-up calls to
p_term3 can get hidden inside join points, which don't "see" the first
call.
cse
~~~
The functions 'go' and 'go1', called from draw, get arity 1, but they
should get arity 2. We need a better arity analysis, a la Gill. 'go' looks
like this:
  go_r1og =
    \ (ds_aGI :: [GHC.Base.Char]) ->
      case ds_aGI of wild_aGJ {
        [] -> z_r1oe;
        : y_aGN ys_aGO ->
          let {
            xs7_s1i8 :: GHC.Prim.Int# -> [GHC.Base.Char]
            [Str: DmdType]
            xs7_s1i8 = go_r1og ys_aGO
          } in
            \ (m_XWf :: GHC.Prim.Int#) ->
              case GHC.Prim.<=# m_XWf 1 of wild1_aSI {
                GHC.Base.False ->
                  GHC.Base.: @ GHC.Base.Char y_aGN (xs7_s1i8 (GHC.Prim.-# m_XWf 1));
                GHC.Base.True -> GHC.Base.: @ GHC.Base.Char y_aGN lvl13_r1n0
              }
      }
Notice the 'let', which stops the lambda from moving out.
Eliza
~~~~~
In June 2002, GHC 5.04 emitted four successive
   NOTE: Simplifier still going after 4 iterations; bailing out.
messages. I suspect that the simplifier is looping somehow.
Expert
~~~~~~
In spectral/expert/Search.ask there's a statically visible CSE. Catching this
depends almost entirely on chance, which is a pity.
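For illustration, a "statically visible" CSE opportunity is one where the
repeated expression appears textually in the source (a made-up example, not
the code in Search.ask):

   f :: Int -> (Int, Int)
   f x = (g x, negate (g x))   -- the two occurrences of 'g x' are textually
     where g y = y * y + 1     -- identical, so CSE can share one result,
                               -- but only if the simplifier presents both
                               -- occurrences to CSE at the same time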
Fish
~~~~
The performance of fish depends crucially on inlining scale_vec2.
It turns out to be right on the edge of GHC's normal threshold size, so
small changes to the compiler can knock it to and fro.
[Sept 08] The new instance-declaration story makes show slightly worse.
Look at Main.showl in the simplified core. You'll see this:
   a79_aRZ :: GHC.Types.Int
              -> (GHC.Types.Int, GHC.Types.Int, GHC.Types.Int, GHC.Types.Int)
              -> GHC.Show.ShowS
   [Str: DmdType]
   a79_aRZ =
     GHC.Show.showsPrec9 @ Int @ Int @ Int @ Int
       GHC.Show.$f16 GHC.Show.$f16 GHC.Show.$f16 GHC.Show.$f16

   Rec {
   showl_sWK :: [Main.Line_segment] -> String -> String
   [Arity 2
    Str: DmdType SL]
   showl_sWK =
     \ (ds_dQJ :: [Main.Line_segment]) (s_alB :: GHC.Base.String) ->
       case ds_dQJ of wild_X1d [ALWAYS Just A] {
         [] -> GHC.Types.: @ GHC.Types.Char lvl_sYT s_alB;
         : x_alD [ALWAYS Just L] xs_alF [ALWAYS Just L] ->
           GHC.Base.++
             @ GHC.Types.Char
             a_s12R
             (a79_aRZ a_s1uS x_alD (showl_sWK xs_alF s_alB))
       }
   end Rec }
Note the non-inlined call to GHC.Show.showsPrec9, which looks like this:
  showsPrec9 :: forall a b c d. (Show a, Show b, Show c, Show d) =>
                GHC.Types.Int -> (a, b, c, d) -> GHC.Show.ShowS
    {- Arity: 4  Strictness: LLLL
       Unfolding: (\ @ a b c d
                     $dShow  :: GHC.Show.Show a
                     $dShow1 :: GHC.Show.Show b
                     $dShow2 :: GHC.Show.Show c
                     $dShow3 :: GHC.Show.Show d ->
                     let {
                       shows1 :: d -> GHC.Show.ShowS = GHC.Show.shows @ d $dShow3
                     } in
                     let {
                       shows2 :: c -> GHC.Show.ShowS = GHC.Show.shows @ c $dShow2
                     } in
                     let {
                       shows3 :: b -> GHC.Show.ShowS = GHC.Show.shows @ b $dShow1
                     } in
                     let {
                       shows4 :: a -> GHC.Show.ShowS = GHC.Show.shows @ a $dShow
                     } in
                     \ ds :: GHC.Types.Int  ds1 :: (a, b, c, d)  s :: GHC.Base.String ->
                       case @ GHC.Base.String ds1 of wild { (a79, b, c, d) ->
                         GHC.Show.show_tuple
                           (GHC.Types.:
                              @ GHC.Show.ShowS
                              (shows4 a79)
                              (GHC.Types.:
                                 @ GHC.Show.ShowS
                                 (shows3 b)
                                 (GHC.Types.:
                                    @ GHC.Show.ShowS
                                    (shows2 c)
                                    (GHC.Types.:
                                       @ GHC.Show.ShowS
                                       (shows1 d)
                                       (GHC.Types.[] @ GHC.Show.ShowS)))))
                           s }) -}
We would do better to inline showsPrec9, but it looks too big. Before,
it was inlined regardless by the instance-decl stuff. So performance drops
slightly.
Integer
~~~~~~~
A good benchmark for beating on big-integer arithmetic.
Knights
~~~~~~~
In knights/KnightHeuristic, we don't find that possibleMoves is strict
(with important knock-on effects) unless we apply rules before floating
out the literal list [A,B,C...].
Similarly for f_se (F_Cmp ...) in listcompr, but with a smaller effect.
Lambda
~~~~~~
This program shows the cost of the non-eta-expanded lambdas that arise from
a state monad.
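A minimal sketch of the pattern (a hand-rolled state monad; the names are
illustrative, not the benchmark's own):

   newtype St s a = St { runSt :: s -> (a, s) }

   bindSt :: St s a -> (a -> St s b) -> St s b
   bindSt m k = St (\s -> case runSt m s of (a, s') -> runSt (k a) s')
     -- Unless GHC eta-expands through the St wrapper, each bind builds a
     -- fresh '\s -> ...' closure instead of a function of arity 2 that
     -- takes the state argument directly.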
mandel2
~~~~~~~
check_perim's several calls to point_colour lead to opportunities for CSE
which may be more or less well taken.
Mandel
~~~~~~
Relies heavily on having a specialised version of Complex.magnitude
(:: Complex Double -> Double) available.
Mandel.mandelset has a go-loop inside an enumDeltaToIntegerFB, out of which
4.08.2 managed to float a constant expression, but HEAD did not. I think
this is because the pre-let-floating simplification did too little inlining;
in particular, it did not inline windowToViewport.
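For illustration, the hot loop uses magnitude at Complex Double, roughly like
this (a hedged sketch, not the benchmark's own code):

   import Data.Complex

   inSet :: Complex Double -> Bool
   inSet c = all ((<= 2) . magnitude) (take 100 (iterate (\z -> z*z + c) 0))
     -- Without a specialised magnitude :: Complex Double -> Double, every
     -- iteration goes through the overloaded RealFloat version.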
Multiplier
~~~~~~~~~~
In spectral/multiplier, we have
   xor = lift21 forceBit f
     where f :: Bit -> Bit -> Bit
           f 0 0 = 0
           f 0 1 = 1
           f 1 0 = 1
           f 1 1 = 0
Trouble is, f is CPR'd, and that means that instead of returning the
constants I# 0 and I# 1, it returns the raw 0 and 1 and then boxes them.
So allocation goes up. I don't see a way around this.
Parstof
~~~~~~~
spectral/hartel/parstof ends up saying
   case (unpackCString "x") of { c:cs -> ... }
quite a bit. We should spot these and behave accordingly.
Power
~~~~~
With GHC 4.08, for some reason the arithmetic defaults to Double. The
right thing is to default to Rational, which accounts for the big increase
in runtime after 4.08.
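For reference, the intended defaulting can be pinned at the top level of a
module with a default declaration (a hedged sketch; the benchmark itself just
relies on GHC's defaulting rules rather than an explicit declaration):

   default (Rational, Double)   -- try Rational first, then Double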
Puzzle
~~~~~~
The main function is 'transfer'. It has some complicated join points, and
a big issue is that full laziness can float out many small MFEs that then
make much bigger closures. It's quite delicate: small changes can make
big differences, and I spent far too long gazing at it.
I found that in my experimental proto 4.09 compiler I had
   let ds = go xs in
   let $j = .... ds ... in
   case e of
     p1 -> $j
     p2 -> $j
     p3 -> ds
But ds wasn't getting inlined because we couldn't spot that it was actually
only being used once. Keith's analyser should find this!
Also, making concat into a good producer made a large gain.
My proto 4.09 still allocates more, partly because of more full laziness relative
to 4.08; I don't know why that happens.
Extra allocation is happening in 5.02 as well; perhaps for the same reasons. There is
at least one instance of floating that prevents fusion; namely the enumerated lists
in 'transfer'.
Sphere
~~~~~~
A key function is vecsub, which looks like this (after w/w)
  $wvecsub
    = \ ww :: Double ww1 :: Double ww2 :: Double
        ww3 :: Double ww4 :: Double ww5 :: Double ->
      let { a = case ww of wild { D# x ->
                case ww3 of wild1 { D# y ->
                let { a1 = -## x y
                } in $wD# a1
                } } } in
      let { a1 = case ww1 of wild { D# x ->
                 case ww4 of wild1 { D# y ->
                 let { a2 = -## x y
                 } in $wD# a2
                 } } } in
      let { a2 = case ww2 of wild { D# x ->
                 case ww5 of wild1 { D# y ->
                 let { a3 = -## x y
                 } in $wD# a3
                 } }
      } in (# a, a1, a2 #)
Currently it gets guidance: IF_ARGS 6 [2 2 2 2 2 2] 25 4
It occurs in a context like
case $wvecsub a b c d e f of (# p, q, r #) ->
case p of D# p' ->
case q of D# q' ->
case r of D# r' ->
So it would be much, much better to inline it. But the result-discounting
doesn't "see" that the three results are each taken apart, and that's the
problem.
One question is this: why did minusDouble get inlined? If it didn't, then
$wvecsub would look much smaller:
  $wvecsub
    = \ ww :: Double ww1 :: Double ww2 :: Double
        ww3 :: Double ww4 :: Double ww5 :: Double ->
      let { a  = minusDouble ww  ww3 } in
      let { a1 = minusDouble ww1 ww4 } in
      let { a2 = minusDouble ww2 ww5 } in
      (# a, a1, a2 #)
The reason minusDouble gets inlined is that we are trying to get update-in-place.
So I added a flag -funfolding-update-in-place to enable this "optimisation".
Omitting the flag gives much better inlining for $wvecsub, at least.
Sphere also does 60,000 calls to hPutStr, so I/O plays a major role. Currently
this I/O does a *lot* of allocation, much of it since the addition of thread-safety.
treejoin
~~~~~~~~
Does a lot of IO.readFile. In GHC.IO.Encoding.UTF8 the demand
analyser sees a strict function with type
   a_s1gj :: GHC.IO.Buffer.Buffer GHC.Word.Word8
             -> GHC.IO.Buffer.Buffer GHC.Types.Char
             -> GHC.Prim.State# GHC.Prim.RealWorld
             -> (# GHC.Prim.State# GHC.Prim.RealWorld,
                   (GHC.IO.Encoding.Types.CodingProgress,
                    GHC.IO.Buffer.Buffer GHC.Word.Word8,
                    GHC.IO.Buffer.Buffer GHC.Types.Char) #)
Unboxing both Buffer arguments makes a HUGE difference (halves
allocation); but that makes the worker function have 12 arguments. A
good reason for unboxing even if the worker gets a lot of args.
sorting
~~~~~~~
Same issue with GHC.IO.Encoding.UTF8 as treejoin
---------------------------------------
Real suite
---------------------------------------
gg
~~
Same issue with GHC.IO.Encoding.UTF8 as treejoin
Maillist
~~~~~~~~
Uses appendFile repeatedly rather than opening the output file once,
which leads to numerous file opens/closes. Allocations will rise with
the new I/O subsystem in 5.02 because the I/O buffer will be
re-allocated on the heap for each open, whereas previously it would be
allocated on the C heap and therefore not show up in the stats.
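For illustration, the contrast looks roughly like this (a hedged sketch; the
file name and driver functions are made up, not Maillist's own code):

   import System.IO

   -- Maillist's pattern: one open/close (and, in 5.02, one fresh heap
   -- buffer) per item written
   writeAll1 :: [String] -> IO ()
   writeAll1 = mapM_ (appendFile "out.txt")

   -- Opening the file once avoids the repeated opens and buffer allocation
   writeAll2 :: [String] -> IO ()
   writeAll2 xs = withFile "out.txt" AppendMode (\h -> mapM_ (hPutStr h) xs)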