<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8"/>
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<meta http-equiv="X-UA-Compatible" content="ie=edge">
<title>Pythran stories - Bye bye boost.simd, welcome xsimd</title>
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/normalize/8.0.1/normalize.min.css"/>
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/5.15.2/css/all.min.css"/>
<link rel="stylesheet" href="https://fonts.googleapis.com/css?family=Roboto+Slab|Ruda"/>
<link rel="stylesheet" type="text/css" href="./theme/css/main.css"/>
<link href="http://serge-sans-paille.github.io/pythran-stories/feeds/all.atom.xml"
type="application/atom+xml" rel="alternate" title="Pythran stories Atom Feed"/>
</head>
<body>
<style>.github-corner:hover .octo-arm {
animation: octocat-wave 560ms ease-in-out
}
@keyframes octocat-wave {
0%, 100% {
transform: rotate(0)
}
20%, 60% {
transform: rotate(-25deg)
}
40%, 80% {
transform: rotate(10deg)
}
}
@media (max-width: 500px) {
.github-corner:hover .octo-arm {
animation: none
}
.github-corner .octo-arm {
animation: octocat-wave 560ms ease-in-out
}
}</style><div id="container">
<header>
<h1><a href="./">Pythran stories</a></h1>
<ul class="social-media">
<li><a href="https://github.com/serge-sans-paille/pythran"><i class="fab fa-github fa-lg" aria-hidden="true"></i></a></li>
<li><a href="http://serge-sans-paille.github.io/pythran-stories/feeds/all.atom.xml"
type="application/atom+xml" rel="alternate"><i class="fa fa-rss fa-lg"
aria-hidden="true"></i></a></li>
</ul>
<p><em></em></p>
</header>
<nav>
<ul>
<li><a href="./category/benchmark.html"> benchmark </a></li>
<li><a href="./category/compilation.html"> compilation </a></li>
<li><a class="active" href="./category/engineering.html"> engineering </a></li>
<li><a href="./category/examples.html"> examples </a></li>
<li><a href="./category/mozilla.html"> mozilla </a></li>
<li><a href="./category/optimisation.html"> optimisation </a></li>
<li><a href="./category/release.html"> release </a></li>
</ul>
</nav>
<main>
<article>
<h1>Bye bye boost.simd, welcome xsimd</h1>
<aside>
<ul>
<li>
<time datetime="2018-10-31 00:00:00+01:00">Oct 31, 2018</time>
</li>
<li>
Categories:
<a href="./category/engineering.html"><em>engineering</em></a>
</li>
</ul>
</aside>
<p><a class="reference external" href="https://github.com/NumScale/boost.simd">boost.simd</a> provides a C++
abstraction over vector types, allowing for efficient vectorization of array
computations. It has long been (optionally) used as part of Pythran's
expression template engine, a great collaboration that led to several
patches in boost.simd, and great performance for Pythran.</p>
<p>Unfortunately, the project has been silent over the last few months (see for
instance <a class="reference external" href="https://github.com/NumScale/boost.simd/issues/546">this issue</a>), and
I had to maintain a few custom patches. It turns out the project has been
re-branded as <strong>bSIMD</strong>, with a more restrictive license, as detailed in <a class="reference external" href="https://github.com/NumScale/boost.simd/issues/545">another
issue</a>. From the Pythran
perspective, this is not good news.</p>
<p>Fortunately, the people from <a class="reference external" href="http://quantstack.net/">QuantStack</a> have put
tremendous effort into providing an equivalent to boost.simd, <a class="reference external" href="http://quantstack.net/xsimd.html">xsimd</a>. Their library actually provides some
improvements in the context of Pythran, and it's under a <em>BSD-3-Clause license</em>, so
when they proposed to fund the move to <em>xsimd</em>, it was just perfect.</p>
<p>So here is the deal: I do the port, report any API and/or performance issues,
and provide patches when relevant. That's what I did over the last
three months; let's have a look at the results.</p>
<div class="section" id="user-level-changes">
<h2>User-level Changes</h2>
<p>In order to activate explicit vectorization, one must pass <tt class="docutils literal"><span class="pre">-DUSE_XSIMD</span> <span class="pre">-march=native</span></tt> to the Pythran compiler, in place of <tt class="docutils literal"><span class="pre">-DUSE_BOOST_SIMD</span> <span class="pre">-march=native</span></tt>. Fair enough.</p>
<p>For instance, consider the following kernel, taken from the <a class="reference external" href="https://github.com/serge-sans-paille/numpy-benchmarks/">numpy benchmarks</a> suite I
maintain.</p>
<div class="highlight"><pre><span></span><span class="c1">#pythran export arc_distance(float64 [], float64[], float64[], float64[])</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>
<span class="k">def</span> <span class="nf">arc_distance</span><span class="p">(</span><span class="n">theta_1</span><span class="p">,</span> <span class="n">phi_1</span><span class="p">,</span>
<span class="n">theta_2</span><span class="p">,</span> <span class="n">phi_2</span><span class="p">):</span>
<span class="w"> </span><span class="sd">"""</span>
<span class="sd"> Calculates the pairwise arc distance between all points in vector a and b.</span>
<span class="sd"> """</span>
<span class="n">temp</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">sin</span><span class="p">((</span><span class="n">theta_2</span><span class="o">-</span><span class="n">theta_1</span><span class="p">)</span><span class="o">/</span><span class="mi">2</span><span class="p">)</span><span class="o">**</span><span class="mi">2</span><span class="o">+</span><span class="n">np</span><span class="o">.</span><span class="n">cos</span><span class="p">(</span><span class="n">theta_1</span><span class="p">)</span><span class="o">*</span><span class="n">np</span><span class="o">.</span><span class="n">cos</span><span class="p">(</span><span class="n">theta_2</span><span class="p">)</span><span class="o">*</span><span class="n">np</span><span class="o">.</span><span class="n">sin</span><span class="p">((</span><span class="n">phi_2</span><span class="o">-</span><span class="n">phi_1</span><span class="p">)</span><span class="o">/</span><span class="mi">2</span><span class="p">)</span><span class="o">**</span><span class="mi">2</span>
<span class="n">distance_matrix</span> <span class="o">=</span> <span class="mi">2</span> <span class="o">*</span> <span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">arctan2</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">sqrt</span><span class="p">(</span><span class="n">temp</span><span class="p">),</span><span class="n">np</span><span class="o">.</span><span class="n">sqrt</span><span class="p">(</span><span class="mi">1</span><span class="o">-</span><span class="n">temp</span><span class="p">)))</span>
<span class="k">return</span> <span class="n">distance_matrix</span>
</pre></div>
<p>When compiled with GCC 7.3 and benchmarked with the <a class="reference external" href="https://pypi.org/project/perf/">perf</a> module, one gets</p>
<div class="highlight"><pre><span></span><span class="nv">CC</span><span class="o">=</span>clang<span class="w"> </span><span class="nv">CXX</span><span class="o">=</span>clang++<span class="w"> </span>pythran<span class="w"> </span>arc_distance.py<span class="w"> </span>-O3<span class="w"> </span>-march<span class="o">=</span>native
python<span class="w"> </span>-m<span class="w"> </span>perf<span class="w"> </span>timeit<span class="w"> </span>-s<span class="w"> </span><span class="s1">'N = 10000 ; import numpy as np ; np.random.seed(0); t0, p0, t1, p1 = np.random.randn(N), np.random.randn(N), np.random.randn(N), np.random.randn(N); from arc_distance import arc_distance'</span><span class="w"> </span><span class="s1">'arc_distance(t0, p0, t1, p1)'</span>
.....................
Mean<span class="w"> </span>+-<span class="w"> </span>std<span class="w"> </span>dev:<span class="w"> </span><span class="m">1</span>.48<span class="w"> </span>ms<span class="w"> </span>+-<span class="w"> </span><span class="m">0</span>.01<span class="w"> </span>ms
</pre></div>
<p>That's our base line. If we recompile it with <tt class="docutils literal"><span class="pre">-DUSE_XSIMD</span></tt>, we get an extra speedup (AVX instructions are available on my laptop, and enabled by <tt class="docutils literal"><span class="pre">-march=native</span></tt>).</p>
<div class="highlight"><pre><span></span><span class="nv">CC</span><span class="o">=</span>clang<span class="w"> </span><span class="nv">CXX</span><span class="o">=</span>clang++<span class="w"> </span>python<span class="w"> </span>-m<span class="w"> </span>pythran.run<span class="w"> </span>arc_distance.py<span class="w"> </span>-O3<span class="w"> </span>-march<span class="o">=</span>native<span class="w"> </span>-DUSE_XSIMD
python<span class="w"> </span>-m<span class="w"> </span>perf<span class="w"> </span>timeit<span class="w"> </span>-s<span class="w"> </span><span class="s1">'N = 10000 ; import numpy as np ; np.random.seed(0); t0, p0, t1, p1 = np.random.randn(N), np.random.randn(N), np.random.randn(N), np.random.randn(N); from arc_distance import arc_distance'</span><span class="w"> </span><span class="s1">'arc_distance(t0, p0, t1, p1)'</span>
.....................
Mean<span class="w"> </span>+-<span class="w"> </span>std<span class="w"> </span>dev:<span class="w"> </span><span class="m">199</span><span class="w"> </span>us<span class="w"> </span>+-<span class="w"> </span><span class="m">4</span><span class="w"> </span>us
</pre></div>
<p>That's roughly 7 times faster. And using Pythran 0.8.7, the last release with boost.simd support, we have</p>
<div class="highlight"><pre><span></span><span class="nv">CC</span><span class="o">=</span>clang<span class="w"> </span><span class="nv">CXX</span><span class="o">=</span>clang++<span class="w"> </span>python<span class="w"> </span>-m<span class="w"> </span>pythran.run<span class="w"> </span>arc_distance.py<span class="w"> </span>-O3<span class="w"> </span>-march<span class="o">=</span>native<span class="w"> </span>-DUSE_BOOST_SIMD
python<span class="w"> </span>-m<span class="w"> </span>perf<span class="w"> </span>timeit<span class="w"> </span>-s<span class="w"> </span><span class="s1">'N = 10000 ; import numpy as np ; np.random.seed(0); t0, p0, t1, p1 = np.random.randn(N), np.random.randn(N), np.random.randn(N), np.random.randn(N); from arc_distance import arc_distance'</span><span class="w"> </span><span class="s1">'arc_distance(t0, p0, t1, p1)'</span>
.....................
Mean<span class="w"> </span>+-<span class="w"> </span>std<span class="w"> </span>dev:<span class="w"> </span><span class="m">284</span><span class="w"> </span>us<span class="w"> </span>+-<span class="w"> </span><span class="m">8</span><span class="w"> </span>us
</pre></div>
<p>This is slightly slower, but within the same order of magnitude. Out of curiosity, I ran the same three experiments using Clang 6 as the backend compiler and got the following timings:</p>
<pre class="code literal-block">
clang + boost.simd: 220 us +- 8 us
clang + xsimd: 273 us +- 11 us
clang: 1.41 ms +- 0.04 ms
</pre>
<p>Interestingly, <strong>on that example</strong>, Clang generates better code for the boost.simd version. So let's be wary of hasty conclusions and just state that with both engines, I can get efficient vectorization of my code.</p>
</div>
<div class="section" id="complex-numbers">
<h2>Complex Numbers</h2>
<p>Thanks to xsimd, Pythran is now able to <em>naively</em> support complex number
vectorization. I say <em>naively</em> because we don't support changing the internal
representation from array-of-structs to struct-of-arrays, as we stick to numpy's
layout. Still, that's something new for Pythran, as showcased by the following kernel:</p>
<div class="highlight"><pre><span></span><span class="c1">#pythran export normalize_complex_arr(complex[])</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>
<span class="k">def</span> <span class="nf">normalize_complex_arr</span><span class="p">(</span><span class="n">a</span><span class="p">):</span>
<span class="n">a_oo</span> <span class="o">=</span> <span class="n">a</span> <span class="o">-</span> <span class="n">a</span><span class="o">.</span><span class="n">real</span><span class="o">.</span><span class="n">min</span><span class="p">()</span> <span class="o">-</span> <span class="mi">1</span><span class="n">j</span><span class="o">*</span><span class="n">a</span><span class="o">.</span><span class="n">imag</span><span class="o">.</span><span class="n">min</span><span class="p">()</span> <span class="c1"># origin offsetted</span>
<span class="k">return</span> <span class="n">a_oo</span><span class="o">/</span><span class="n">np</span><span class="o">.</span><span class="n">abs</span><span class="p">(</span><span class="n">a_oo</span><span class="p">)</span><span class="o">.</span><span class="n">max</span><span class="p">()</span>
</pre></div>
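<p>To make the layout point concrete: numpy stores a complex array interleaved in memory (real, imaginary, real, imaginary, ...), and that array-of-structs layout is what the vectorized loads have to deal with. A small illustrative aside, not part of the original benchmarks:</p>
<pre class="code literal-block">
# numpy complex arrays are array-of-structs: re/im pairs interleaved in memory
import numpy as np
z = np.array([1 + 2j, 3 + 4j])
print(z.view(np.float64))  # [1. 2. 3. 4.] -- interleaved, not two separate planes
</pre>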
<p>Pythran provides a vectorized version of the <tt class="docutils literal">np.min</tt> and <tt class="docutils literal">np.max</tt> operators, so thanks to complex support, it should provide some decent acceleration. Note that the two calls to <tt class="docutils literal">np.min()</tt> do not involve complex numbers, but the remaining parts of the expression do. Let's check that!</p>
<p>First, the reference numpy version:</p>
<div class="highlight"><pre><span></span>python<span class="w"> </span>-m<span class="w"> </span>perf<span class="w"> </span>timeit<span class="w"> </span>-s<span class="w"> </span><span class="s1">'import numpy as np; np.random.seed(0); N = 100000; x = np.random.random(N) + 1j * np.random.random(N); from normalize_complex_arr import normalize_complex_arr'</span><span class="w"> </span><span class="s1">'normalize_complex_arr(x)'</span>
.....................
Mean<span class="w"> </span>+-<span class="w"> </span>std<span class="w"> </span>dev:<span class="w"> </span><span class="m">3</span>.19<span class="w"> </span>ms<span class="w"> </span>+-<span class="w"> </span><span class="m">0</span>.02<span class="w"> </span>ms
</pre></div>
<p>Then with Pythran, no explicit vectorization:</p>
<div class="highlight"><pre><span></span><span class="nv">CC</span><span class="o">=</span>gcc<span class="w"> </span><span class="nv">CXX</span><span class="o">=</span>g++<span class="w"> </span>pythran<span class="w"> </span>-march<span class="o">=</span>native<span class="w"> </span>-O3<span class="w"> </span>normalize_complex_arr.py
python<span class="w"> </span>-m<span class="w"> </span>perf<span class="w"> </span>timeit<span class="w"> </span>-s<span class="w"> </span><span class="s1">'import numpy as np; np.random.seed(0); N = 100000; x = np.random.random(N) + 1j * np.random.random(N); from normalize_complex_arr import normalize_complex_arr'</span><span class="w"> </span><span class="s1">'normalize_complex_arr(x)'</span>
.....................
Mean<span class="w"> </span>+-<span class="w"> </span>std<span class="w"> </span>dev:<span class="w"> </span><span class="m">2</span>.84<span class="w"> </span>ms<span class="w"> </span>+-<span class="w"> </span><span class="m">0</span>.01<span class="w"> </span>ms
</pre></div>
<p>And with vectorization on:</p>
<div class="highlight"><pre><span></span><span class="nv">CC</span><span class="o">=</span>gcc<span class="w"> </span><span class="nv">CXX</span><span class="o">=</span>g++<span class="w"> </span>pythran<span class="w"> </span>-march<span class="o">=</span>native<span class="w"> </span>-O3<span class="w"> </span>make_decision.py<span class="w"> </span>-DUSE_XSIMD
python<span class="w"> </span>-m<span class="w"> </span>perf<span class="w"> </span>timeit<span class="w"> </span>-s<span class="w"> </span><span class="s1">'import numpy as np; np.random.seed(0); N = 100000; x = np.random.random(N) + 1j * np.random.random(N); from normalize_complex_arr import normalize_complex_arr'</span><span class="w"> </span><span class="s1">'normalize_complex_arr(x)'</span>
.....................
Mean<span class="w"> </span>+-<span class="w"> </span>std<span class="w"> </span>dev:<span class="w"> </span><span class="m">723</span><span class="w"> </span>us<span class="w"> </span>+-<span class="w"> </span><span class="m">14</span><span class="w"> </span>us
</pre></div>
<p>Cool! Speedup for complex! As a reminder, the numpy reference ran at roughly <tt class="docutils literal">3.19 ms +- 0.02 ms</tt>, so that's more than a 4x improvement.</p>
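<p>Speed is only half the story, so it's worth checking that the vectorized build stays numerically faithful. Here is a minimal sanity check (a hypothetical snippet, not part of the original benchmark setup) comparing the compiled module against a plain numpy reimplementation of the same kernel:</p>
<pre class="code literal-block">
# sanity check: compiled Pythran module vs. plain numpy reference
import numpy as np
np.random.seed(0)
x = np.random.random(100000) + 1j * np.random.random(100000)

from normalize_complex_arr import normalize_complex_arr  # the compiled module

def reference(a):
    a_oo = a - a.real.min() - 1j * a.imag.min()
    return a_oo / np.abs(a_oo).max()

assert np.allclose(normalize_complex_arr(x), reference(x))
</pre>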
</div>
<div class="section" id="scalar-version">
<h2>Scalar Version</h2>
<p>That's probably a detail for many xsimd users, but thanks to this cooperation,
xsimd now exposes a scalar version of all the mathematical functions inside the
<tt class="docutils literal">xsimd::</tt> namespace. That way one can write higher-level functions based on
xsimd, and they work for both scalar and vector versions:</p>
<div class="highlight"><pre><span></span><span class="k">template</span><span class="o"><</span><span class="k">class</span><span class="w"> </span><span class="nc">T</span><span class="o">></span>
<span class="n">T</span><span class="w"> </span><span class="n">euclidian_distance_squared</span><span class="p">(</span><span class="n">T</span><span class="w"> </span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">T</span><span class="w"> </span><span class="n">y</span><span class="p">)</span>
<span class="p">{</span>
<span class="w"> </span><span class="k">auto</span><span class="w"> </span><span class="n">tmp</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">xsimd</span><span class="o">::</span><span class="n">hypot</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="p">);</span>
<span class="w"> </span><span class="k">return</span><span class="w"> </span><span class="n">tmp</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">tmp</span><span class="p">;</span>
<span class="p">}</span>
</pre></div>
<p>In the context of Pythran, this makes the expression template engine easier to
write. Good point.</p>
</div>
<div class="section" id="compilation-time">
<h2>Compilation Time</h2>
<p>Pythran is an <em>Ahead of Time</em> compiler, so compilation time is generally not a
good metric. But there's one situation where it matters to me: Continuous
Integration. Because Travis has time limits, the faster we compile, the more
tests we can pass! As Pythran validates for Python 2 and Python 3, for Clang and
GCC, with and without SIMD, with and without OpenMP, that's a lot of
configurations to test. Roughly... 20 hours of cumulative tests, actually; see
<a class="reference external" href="https://travis-ci.com/serge-sans-paille/pythran/builds/89663340">this recent build</a> for
instance.</p>
<p>In the pre-xsimd setting, compiling the above <tt class="docutils literal">arc_distance.py</tt> file in SIMD mode is relatively slow. As a reference, consider the compilation of the sequential version:</p>
<div class="highlight"><pre><span></span><span class="nb">time</span><span class="w"> </span>pythran<span class="w"> </span>-O3<span class="w"> </span>-march<span class="o">=</span>native<span class="w"> </span>normalize_complex_arr.py<span class="w"> </span>-E<span class="w"> </span><span class="c1"># generate the .cpp</span>
<span class="m">0</span>.91s<span class="w"> </span>user<span class="w"> </span><span class="m">0</span>.28s<span class="w"> </span>system<span class="w"> </span><span class="m">130</span>%<span class="w"> </span>cpu<span class="w"> </span><span class="m">0</span>.908<span class="w"> </span>total
<span class="nb">time</span><span class="w"> </span>pythran<span class="w"> </span>-O3<span class="w"> </span>-march<span class="o">=</span>native<span class="w"> </span>arc_distance.cpp
<span class="m">5</span>.67s<span class="w"> </span>user<span class="w"> </span><span class="m">0</span>.61s<span class="w"> </span>system<span class="w"> </span><span class="m">104</span>%<span class="w"> </span>cpu<span class="w"> </span><span class="m">6</span>.001<span class="w"> </span>total
</pre></div>
<p>OK, roughly six seconds in sequential mode. What about the vectorized version? With boost.simd, it's pretty damn slow:</p>
<div class="highlight"><pre><span></span><span class="nb">time</span><span class="w"> </span>pythran<span class="w"> </span>-O3<span class="w"> </span>-march<span class="o">=</span>native<span class="w"> </span>normalize_complex_arr.cpp<span class="w"> </span>-DUSE_BOOST_SIMD
<span class="m">12</span>.10s<span class="w"> </span>user<span class="w"> </span><span class="m">0</span>.79s<span class="w"> </span>system<span class="w"> </span><span class="m">102</span>%<span class="w"> </span>cpu<span class="w"> </span><span class="m">12</span>.616<span class="w"> </span>total
</pre></div>
<p>With xsimd, it's slightly faster (no boost dependencies, and less C++ magic):</p>
<div class="highlight"><pre><span></span><span class="nb">time</span><span class="w"> </span>pythran<span class="w"> </span>-O3<span class="w"> </span>-march<span class="o">=</span>native<span class="w"> </span>arc_distance.cpp<span class="w"> </span>-DUSE_XSIMD
<span class="m">10</span>.32s<span class="w"> </span>user<span class="w"> </span><span class="m">0</span>.65s<span class="w"> </span>system<span class="w"> </span><span class="m">102</span>%<span class="w"> </span>cpu<span class="w"> </span><span class="m">10</span>.688<span class="w"> </span>total
</pre></div>
</div>
<div class="section" id="performance-of-basic-functions">
<h2>Performance of Basic Functions</h2>
<p>Using <a class="reference external" href="https://github.com/airspeed-velocity/asv">airspeed velocity</a>, I've compared how well xsimd behaves for simple operations on 1D arrays. All the benchmarks hereafter have the following form:</p>
<div class="highlight"><pre><span></span><span class="c1">#pythran export cos_array(float64 [])</span>
<span class="c1">#setup: import numpy as np ; np.random.seed(0); N = 10000 ; x = np.random.random(N) * 2 * np.pi</span>
<span class="c1">#run: cos_array(x)</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>
<span class="k">def</span> <span class="nf">cos_array</span><span class="p">(</span><span class="n">x</span><span class="p">):</span>
<span class="k">return</span> <span class="n">np</span><span class="o">.</span><span class="n">cos</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
</pre></div>
<p>The results are obtained through the <tt class="docutils literal">asv compare commit_id0 commit_id1</tt> command.</p>
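<p>For reference, the <tt class="docutils literal">benchmarks.TimeSuite.time_*</tt> entries below come from an asv benchmark class. Here is a minimal sketch of such a suite (assuming the compiled <tt class="docutils literal">cos_array</tt> module is importable; the real suite covers every function listed):</p>
<pre class="code literal-block">
# asv discovers classes with time_* methods and times them after setup()
import numpy as np
from cos_array import cos_array  # the Pythran-compiled kernel shown above

class TimeSuite:
    def setup(self):
        np.random.seed(0)
        self.x = np.random.random(10000) * 2 * np.pi

    def time_cos_array(self):
        cos_array(self.x)
</pre>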
<pre class="code literal-block">
All benchmarks:
before after ratio
[99d8234f] [60632651]
9.90μs 9.89μs 1.00 benchmarks.TimeSuite.time_abs_array
+ 36.82μs 58.44μs 1.59 benchmarks.TimeSuite.time_acos_array
36.25μs 33.60μs 0.93 benchmarks.TimeSuite.time_asin_array
- 50.47μs 33.03μs 0.65 benchmarks.TimeSuite.time_atan_array
- 48.62μs 35.72μs 0.73 benchmarks.TimeSuite.time_cos_array
- 73.82μs 43.81μs 0.59 benchmarks.TimeSuite.time_cosh_array
- 47.55μs 35.52μs 0.75 benchmarks.TimeSuite.time_sin_array
- 91.45μs 47.86μs 0.52 benchmarks.TimeSuite.time_sinh_array
18.35μs 17.91μs 0.98 benchmarks.TimeSuite.time_sqrt_array
9.60μs 10.05μs 1.05 benchmarks.TimeSuite.time_square_array
- 71.71μs 33.35μs 0.47 benchmarks.TimeSuite.time_tan_array
- 84.63μs 42.28μs 0.50 benchmarks.TimeSuite.time_tanh_array
</pre>
<p>Looks pretty good! Apart from a regression on <tt class="docutils literal">acos</tt>, this is either on par with or faster than before.</p>
<p>Out of curiosity, I also ran the same benchmark, but using Clang as the back-end compiler.</p>
<pre class="code literal-block">
All benchmarks:
before after ratio
[99d8234f] [60632651]
9.57μs 10.00μs 1.05 benchmarks.TimeSuite.time_abs_array
+ 34.20μs 58.53μs 1.71 benchmarks.TimeSuite.time_acos_array
36.09μs 33.91μs 0.94 benchmarks.TimeSuite.time_asin_array
- 45.02μs 33.86μs 0.75 benchmarks.TimeSuite.time_atan_array
+ 39.44μs 45.48μs 1.15 benchmarks.TimeSuite.time_cos_array
- 65.98μs 44.78μs 0.68 benchmarks.TimeSuite.time_cosh_array
+ 39.39μs 45.48μs 1.15 benchmarks.TimeSuite.time_sin_array
- 110.62μs 48.44μs 0.44 benchmarks.TimeSuite.time_sinh_array
18.18μs 18.54μs 1.02 benchmarks.TimeSuite.time_sqrt_array
10.05μs 9.56μs 0.95 benchmarks.TimeSuite.time_square_array
- 56.82μs 45.32μs 0.80 benchmarks.TimeSuite.time_tan_array
- 98.85μs 44.16μs 0.45 benchmarks.TimeSuite.time_tanh_array
</pre>
<p>Wow, those are significant changes. The regressions on <tt class="docutils literal">cos</tt>, <tt class="docutils literal">sin</tt> and <tt class="docutils literal">acos</tt> are not good news.</p>
<p>What conclusion should we draw? My take on this is that these benchmarks are
not synthetic enough to state that <em>the xsimd implementation of function X is better or
worse than the boost.simd one</em>. But maybe there are bad interactions
with Pythran's expression templates? A single register spill can wreak havoc
on performance, and I know there is room for improvement there.</p>
</div>
<div class="section" id="conclusions">
<h2>Conclusions</h2>
<p>I'm indeed very happy with the changes. The xsimd team is very responsive, and it's
cool to chat with them about performance, Python, C++... And did I mention that xsimd
supports NEON and AVX512? I should try to run cross-compiled Pythran code on a
Raspberry Pi, but... that's for another story!</p>
<p>Again, thanks a lot to (in alphabetical order) <a class="reference external" href="https://twitter.com/JohanMabille">Johan</a>, <a class="reference external" href="https://twitter.com/renou_martin">Martin</a>, <a class="reference external" href="https://twitter.com/SylvainCorlay">Sylvain</a> and <a class="reference external" href="https://twitter.com/wuoulf">Wolf</a>.
Let's meet again in front of a generous choucroute!</p>
</div>
</article>
<section class="post-nav">
<div id="left-page">
<div id="left-link">
</div>
</div>
<div id="right-page">
<div id="right-link">
</div>
</div>
</section>
<div>
</div>
</main>
<footer>
<h6>
Rendered by <a href="http://getpelican.com/">Pelican</a> • Theme by <a
href="https://github.com/aleylara/Peli-Kiera">Peli-Kiera</a> • Copyright
© serge-sans-paille and other pythraners </h6>
</footer>
</div>
</body>
</html>