<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8"/>
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<meta http-equiv="X-UA-Compatible" content="ie=edge">
<title>Pythran stories - Bye bye boost.simd, welcome xsimd</title>
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/normalize/8.0.1/normalize.min.css"/>
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/5.15.2/css/all.min.css"/>
<link rel="stylesheet" href="https://fonts.googleapis.com/css?family=Roboto+Slab|Ruda"/>
<link rel="stylesheet" type="text/css" href="./theme/css/main.css"/>
<link href="http://serge-sans-paille.github.io/pythran-stories/feeds/all.atom.xml"
type="application/atom+xml" rel="alternate" title="Pythran stories Atom Feed"/>
</head>
<body>
<style>.github-corner:hover .octo-arm {
animation: octocat-wave 560ms ease-in-out
}
@keyframes octocat-wave {
0%, 100% {
transform: rotate(0)
}
20%, 60% {
transform: rotate(-25deg)
}
40%, 80% {
transform: rotate(10deg)
}
}
@media (max-width: 500px) {
.github-corner:hover .octo-arm {
animation: none
}
.github-corner .octo-arm {
animation: octocat-wave 560ms ease-in-out
}
}</style><div id="container">
<header>
<h1><a href="./">Pythran stories</a></h1>
<ul class="social-media">
<li><a href="https://github.com/serge-sans-paille/pythran"><i class="fab fa-github fa-lg" aria-hidden="true"></i></a></li>
<li><a href="http://serge-sans-paille.github.io/pythran-stories/feeds/all.atom.xml"
type="application/atom+xml" rel="alternate"><i class="fa fa-rss fa-lg"
aria-hidden="true"></i></a></li>
</ul>
<p><em></em></p>
</header>
<nav>
<ul>
<li><a href="./category/benchmark.html"> benchmark </a></li>
<li><a href="./category/compilation.html"> compilation </a></li>
<li><a class="active" href="./category/engineering.html"> engineering </a></li>
<li><a href="./category/examples.html"> examples </a></li>
<li><a href="./category/mozilla.html"> mozilla </a></li>
<li><a href="./category/optimisation.html"> optimisation </a></li>
<li><a href="./category/release.html"> release </a></li>
</ul>
</nav>
<main>
<article>
<h1>Bye bye boost.simd, welcome xsimd</h1>
<aside>
<ul>
<li>
<time datetime="2018-10-31 00:00:00+01:00">Oct 31, 2018</time>
</li>
<li>
Categories:
<a href="./category/engineering.html"><em>engineering</em></a>
</li>
</ul>
</aside>
<p><a class="reference external" href="https://github.com/NumScale/boost.simd">boost.simd</a> provides a C++
abstraction over vector types, allowing for efficient vectorization of array
computations. It has long been (optionally) used as part of Pythran's
expression template engine, a great collaboration that led to several
patches in boost.simd, and great performance for Pythran.</p>
<p>Unfortunately, the project has been silent over the last few months (see for
instance <a class="reference external" href="https://github.com/NumScale/boost.simd/issues/546">this issue</a>), and
I had to maintain a few custom patches. It turns out the project has been
re-branded as <strong>bSIMD</strong>, with a more restrictive license, as detailed in <a class="reference external" href="https://github.com/NumScale/boost.simd/issues/545">another
issue</a>. From the Pythran
perspective, this is not good news.</p>
<p>Fortunately, the people from <a class="reference external" href="http://quantstack.net/">QuantStack</a> have put
tremendous effort into providing an equivalent to boost.simd, <a class="reference external" href="http://quantstack.net/xsimd.html">xsimd</a>. Their library actually provides some
improvements in the context of Pythran, and it's under a <em>BSD-3-Clause license</em>, so
when they proposed to fund the move to <em>xsimd</em>, it was just perfect.</p>
<p>So here is the deal: I do the port, report any API and/or performance issues,
and provide patches when relevant. That's what I did over the last
three months; let's have a look at the results.</p>
<div class="section" id="user-level-changes">
<h2>User-level Changes</h2>
<p>In order to activate explicit vectorization, one must pass <tt class="docutils literal"><span class="pre">-DUSE_XSIMD</span> <span class="pre">-march=native</span></tt> to the Pythran compiler, in place of <tt class="docutils literal"><span class="pre">-DUSE_BOOST_SIMD</span> <span class="pre">-march=native</span></tt>. Fair enough.</p>
<p>For instance, consider the following kernel, taken from the <a class="reference external" href="https://github.com/serge-sans-paille/numpy-benchmarks/">numpy benchmarks</a> suite I
maintain.</p>
<div class="highlight"><pre><span></span><span class="c1">#pythran export arc_distance(float64 [], float64[], float64[], float64[])</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>
<span class="k">def</span> <span class="nf">arc_distance</span><span class="p">(</span><span class="n">theta_1</span><span class="p">,</span> <span class="n">phi_1</span><span class="p">,</span>
<span class="n">theta_2</span><span class="p">,</span> <span class="n">phi_2</span><span class="p">):</span>
<span class="w"> </span><span class="sd">"""</span>
<span class="sd"> Calculates the pairwise arc distance between all points in vector a and b.</span>
<span class="sd"> """</span>
<span class="n">temp</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">sin</span><span class="p">((</span><span class="n">theta_2</span><span class="o">-</span><span class="n">theta_1</span><span class="p">)</span><span class="o">/</span><span class="mi">2</span><span class="p">)</span><span class="o">**</span><span class="mi">2</span><span class="o">+</span><span class="n">np</span><span class="o">.</span><span class="n">cos</span><span class="p">(</span><span class="n">theta_1</span><span class="p">)</span><span class="o">*</span><span class="n">np</span><span class="o">.</span><span class="n">cos</span><span class="p">(</span><span class="n">theta_2</span><span class="p">)</span><span class="o">*</span><span class="n">np</span><span class="o">.</span><span class="n">sin</span><span class="p">((</span><span class="n">phi_2</span><span class="o">-</span><span class="n">phi_1</span><span class="p">)</span><span class="o">/</span><span class="mi">2</span><span class="p">)</span><span class="o">**</span><span class="mi">2</span>
<span class="n">distance_matrix</span> <span class="o">=</span> <span class="mi">2</span> <span class="o">*</span> <span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">arctan2</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">sqrt</span><span class="p">(</span><span class="n">temp</span><span class="p">),</span><span class="n">np</span><span class="o">.</span><span class="n">sqrt</span><span class="p">(</span><span class="mi">1</span><span class="o">-</span><span class="n">temp</span><span class="p">)))</span>
<span class="k">return</span> <span class="n">distance_matrix</span>
</pre></div>
<p>When compiled with GCC 7.3 and benchmarked with the <a class="reference external" href="https://pypi.org/project/perf/">perf</a> module, one gets</p>
<div class="highlight"><pre><span></span><span class="nv">CC</span><span class="o">=</span>clang<span class="w"> </span><span class="nv">CXX</span><span class="o">=</span>clang++<span class="w"> </span>pythran<span class="w"> </span>arc_distance.py<span class="w"> </span>-O3<span class="w"> </span>-march<span class="o">=</span>native
python<span class="w"> </span>-m<span class="w"> </span>perf<span class="w"> </span>timeit<span class="w"> </span>-s<span class="w"> </span><span class="s1">'N = 10000 ; import numpy as np ; np.random.seed(0); t0, p0, t1, p1 = np.random.randn(N), np.random.randn(N), np.random.randn(N), np.random.randn(N); from arc_distance import arc_distance'</span><span class="w"> </span><span class="s1">'arc_distance(t0, p0, t1, p1)'</span>
.....................
Mean<span class="w"> </span>+-<span class="w"> </span>std<span class="w"> </span>dev:<span class="w"> </span><span class="m">1</span>.48<span class="w"> </span>ms<span class="w"> </span>+-<span class="w"> </span><span class="m">0</span>.01<span class="w"> </span>ms
</pre></div>
<p>That's our base line. If we recompile it with <tt class="docutils literal"><span class="pre">-DUSE_XSIMD</span></tt>, we get an extra speedup (AVX instructions are available on my laptop, and enabled by <tt class="docutils literal"><span class="pre">-march=native</span></tt>).</p>
<div class="highlight"><pre><span></span><span class="nv">CC</span><span class="o">=</span>clang<span class="w"> </span><span class="nv">CXX</span><span class="o">=</span>clang++<span class="w"> </span>python<span class="w"> </span>-m<span class="w"> </span>pythran.run<span class="w"> </span>arc_distance.py<span class="w"> </span>-O3<span class="w"> </span>-march<span class="o">=</span>native<span class="w"> </span>-DUSE_XSIMD
python<span class="w"> </span>-m<span class="w"> </span>perf<span class="w"> </span>timeit<span class="w"> </span>-s<span class="w"> </span><span class="s1">'N = 10000 ; import numpy as np ; np.random.seed(0); t0, p0, t1, p1 = np.random.randn(N), np.random.randn(N), np.random.randn(N), np.random.randn(N); from arc_distance import arc_distance'</span><span class="w"> </span><span class="s1">'arc_distance(t0, p0, t1, p1)'</span>
.....................
Mean<span class="w"> </span>+-<span class="w"> </span>std<span class="w"> </span>dev:<span class="w"> </span><span class="m">199</span><span class="w"> </span>us<span class="w"> </span>+-<span class="w"> </span><span class="m">4</span><span class="w"> </span>us
</pre></div>
<p>That's roughly 7 times faster. And using Pythran 0.8.7, the last release with boost.simd support, we have</p>
<div class="highlight"><pre><span></span><span class="nv">CC</span><span class="o">=</span>clang<span class="w"> </span><span class="nv">CXX</span><span class="o">=</span>clang++<span class="w"> </span>python<span class="w"> </span>-m<span class="w"> </span>pythran.run<span class="w"> </span>arc_distance.py<span class="w"> </span>-O3<span class="w"> </span>-march<span class="o">=</span>native<span class="w"> </span>-DUSE_BOOST_SIMD
python<span class="w"> </span>-m<span class="w"> </span>perf<span class="w"> </span>timeit<span class="w"> </span>-s<span class="w"> </span><span class="s1">'N = 10000 ; import numpy as np ; np.random.seed(0); t0, p0, t1, p1 = np.random.randn(N), np.random.randn(N), np.random.randn(N), np.random.randn(N); from arc_distance import arc_distance'</span><span class="w"> </span><span class="s1">'arc_distance(t0, p0, t1, p1)'</span>
.....................
Mean<span class="w"> </span>+-<span class="w"> </span>std<span class="w"> </span>dev:<span class="w"> </span><span class="m">284</span><span class="w"> </span>us<span class="w"> </span>+-<span class="w"> </span><span class="m">8</span><span class="w"> </span>us
</pre></div>
<p>This is slightly slower, but within the same order of magnitude. Out of curiosity, I ran the same three experiments using Clang 6 as the backend compiler and got the following timings:</p>
<pre class="code literal-block">
clang + boost.simd: 220 us +- 8 us
clang + xsimd: 273 us +- 11 us
clang: 1.41 ms +- 0.04 ms
</pre>
<p>Interestingly, <strong>on that example</strong>, Clang generates better code for the boost.simd version. So let's be wary of hasty conclusions and just state that with both engines, I can get efficient vectorization of my code.</p>
</div>
<div class="section" id="complex-numbers">
<h2>Complex Numbers</h2>
<p>Thanks to xsimd, Pythran is now able to <em>naively</em> support complex number
vectorization. I say <em>naively</em> because we don't support changing the internal
representation from array-of-structs to struct-of-arrays, as we stick to numpy's
layout. Still, that's something new for Pythran, as showcased by the following kernel:</p>
<div class="highlight"><pre><span></span><span class="c1">#pythran export normalize_complex_arr(complex[])</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>
<span class="k">def</span> <span class="nf">normalize_complex_arr</span><span class="p">(</span><span class="n">a</span><span class="p">):</span>
<span class="n">a_oo</span> <span class="o">=</span> <span class="n">a</span> <span class="o">-</span> <span class="n">a</span><span class="o">.</span><span class="n">real</span><span class="o">.</span><span class="n">min</span><span class="p">()</span> <span class="o">-</span> <span class="mi">1</span><span class="n">j</span><span class="o">*</span><span class="n">a</span><span class="o">.</span><span class="n">imag</span><span class="o">.</span><span class="n">min</span><span class="p">()</span> <span class="c1"># origin offsetted</span>
<span class="k">return</span> <span class="n">a_oo</span><span class="o">/</span><span class="n">np</span><span class="o">.</span><span class="n">abs</span><span class="p">(</span><span class="n">a_oo</span><span class="p">)</span><span class="o">.</span><span class="n">max</span><span class="p">()</span>
</pre></div>
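<p>To make the layout point concrete: numpy stores a complex array interleaved in memory (real, imaginary, real, imaginary, ...), and that array-of-structs layout is what the vectorized loads have to deal with. A small illustrative aside, not part of the original benchmarks:</p>
<pre class="code literal-block">
# numpy complex arrays are array-of-structs: re/im pairs interleaved in memory
import numpy as np
z = np.array([1 + 2j, 3 + 4j])
print(z.view(np.float64))  # [1. 2. 3. 4.] -- interleaved, not two separate planes
</pre>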
<p>Pythran provides a vectorized version of the <tt class="docutils literal">np.min</tt> and <tt class="docutils literal">np.max</tt> operators, so thanks to complex support, it should provide some decent acceleration. Note that the two calls to <tt class="docutils literal">np.min()</tt> do not involve complex numbers, but the remaining parts of the expression do. Let's check that!</p>
<p>First, the reference numpy version:</p>
<div class="highlight"><pre><span></span>python<span class="w"> </span>-m<span class="w"> </span>perf<span class="w"> </span>timeit<span class="w"> </span>-s<span class="w"> </span><span class="s1">'import numpy as np; np.random.seed(0); N = 100000; x = np.random.random(N) + 1j * np.random.random(N); from normalize_complex_arr import normalize_complex_arr'</span><span class="w"> </span><span class="s1">'normalize_complex_arr(x)'</span>
.....................
Mean<span class="w"> </span>+-<span class="w"> </span>std<span class="w"> </span>dev:<span class="w"> </span><span class="m">3</span>.19<span class="w"> </span>ms<span class="w"> </span>+-<span class="w"> </span><span class="m">0</span>.02<span class="w"> </span>ms
</pre></div>
<p>Then with Pythran, no explicit vectorization:</p>
<div class="highlight"><pre><span></span><span class="nv">CC</span><span class="o">=</span>gcc<span class="w"> </span><span class="nv">CXX</span><span class="o">=</span>g++<span class="w"> </span>pythran<span class="w"> </span>-march<span class="o">=</span>native<span class="w"> </span>-O3<span class="w"> </span>normalize_complex_arr.py
python<span class="w"> </span>-m<span class="w"> </span>perf<span class="w"> </span>timeit<span class="w"> </span>-s<span class="w"> </span><span class="s1">'import numpy as np; np.random.seed(0); N = 100000; x = np.random.random(N) + 1j * np.random.random(N); from normalize_complex_arr import normalize_complex_arr'</span><span class="w"> </span><span class="s1">'normalize_complex_arr(x)'</span>
.....................
Mean<span class="w"> </span>+-<span class="w"> </span>std<span class="w"> </span>dev:<span class="w"> </span><span class="m">2</span>.84<span class="w"> </span>ms<span class="w"> </span>+-<span class="w"> </span><span class="m">0</span>.01<span class="w"> </span>ms
</pre></div>
<p>And with vectorization on:</p>
<div class="highlight"><pre><span></span><span class="nv">CC</span><span class="o">=</span>gcc<span class="w"> </span><span class="nv">CXX</span><span class="o">=</span>g++<span class="w"> </span>pythran<span class="w"> </span>-march<span class="o">=</span>native<span class="w"> </span>-O3<span class="w"> </span>make_decision.py<span class="w"> </span>-DUSE_XSIMD
python<span class="w"> </span>-m<span class="w"> </span>perf<span class="w"> </span>timeit<span class="w"> </span>-s<span class="w"> </span><span class="s1">'import numpy as np; np.random.seed(0); N = 100000; x = np.random.random(N) + 1j * np.random.random(N); from normalize_complex_arr import normalize_complex_arr'</span><span class="w"> </span><span class="s1">'normalize_complex_arr(x)'</span>
.....................
Mean<span class="w"> </span>+-<span class="w"> </span>std<span class="w"> </span>dev:<span class="w"> </span><span class="m">723</span><span class="w"> </span>us<span class="w"> </span>+-<span class="w"> </span><span class="m">14</span><span class="w"> </span>us
</pre></div>
<p>Cool! Speedup for complex! As a reminder, the numpy reference ran at roughly <tt class="docutils literal">3.19 ms +- 0.02 ms</tt>, so that's more than a 4x improvement.</p>
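<p>Speed is only half the story, so it's worth checking that the vectorized build stays numerically faithful. Here is a minimal sanity check (a hypothetical snippet, not part of the original benchmark setup) comparing the compiled module against a plain numpy reimplementation of the same kernel:</p>
<pre class="code literal-block">
# sanity check: compiled Pythran module vs. plain numpy reference
import numpy as np
np.random.seed(0)
x = np.random.random(100000) + 1j * np.random.random(100000)

from normalize_complex_arr import normalize_complex_arr  # the compiled module

def reference(a):
    a_oo = a - a.real.min() - 1j * a.imag.min()
    return a_oo / np.abs(a_oo).max()

assert np.allclose(normalize_complex_arr(x), reference(x))
</pre>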
</div>
<div class="section" id="scalar-version">
<h2>Scalar Version</h2>
<p>That's probably a detail for many xsimd users, but thanks to this cooperation,
xsimd now exposes a scalar version of all the mathematical functions inside the
<tt class="docutils literal">xsimd::</tt> namespace. That way one can write higher-level functions based on
xsimd, and they work for both scalar and vector versions:</p>
<div class="highlight"><pre><span></span><span class="k">template</span><span class="o"><</span><span class="k">class</span><span class="w"> </span><span class="nc">T</span><span class="o">></span>
<span class="n">T</span><span class="w"> </span><span class="n">euclidian_distance_squared</span><span class="p">(</span><span class="n">T</span><span class="w"> </span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">T</span><span class="w"> </span><span class="n">y</span><span class="p">)</span>
<span class="p">{</span>
<span class="w"> </span><span class="k">auto</span><span class="w"> </span><span class="n">tmp</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">xsimd</span><span class="o">::</span><span class="n">hypot</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="p">);</span>
<span class="w"> </span><span class="k">return</span><span class="w"> </span><span class="n">tmp</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">tmp</span><span class="p">;</span>
<span class="p">}</span>
</pre></div>
<p>In the context of Pythran, this makes the expression template engine easier to
write. Good point.</p>
</div>
<div class="section" id="compilation-time">
<h2>Compilation Time</h2>
<p>Pythran is an <em>Ahead of Time</em> compiler, so compilation time is generally not a
good metric. But there's one situation where it matters to me: Continuous
Integration. Because Travis has time limits, the faster we compile, the more
tests we can pass! As Pythran validates for Python 2 and Python 3, for Clang and
GCC, with and without SIMD, with and without OpenMP, that's a lot of
configurations to test. Roughly... 20 hours of cumulative tests, actually; see
<a class="reference external" href="https://travis-ci.com/serge-sans-paille/pythran/builds/89663340">this recent build</a> for
instance.</p>
<p>In the pre-xsimd setting, compiling the above <tt class="docutils literal">arc_distance.py</tt> file in SIMD mode is relatively slow. As a reference, consider the compilation of the sequential version:</p>
<div class="highlight"><pre><span></span><span class="nb">time</span><span class="w"> </span>pythran<span class="w"> </span>-O3<span class="w"> </span>-march<span class="o">=</span>native<span class="w"> </span>normalize_complex_arr.py<span class="w"> </span>-E<span class="w"> </span><span class="c1"># generate the .cpp</span>
<span class="m">0</span>.91s<span class="w"> </span>user<span class="w"> </span><span class="m">0</span>.28s<span class="w"> </span>system<span class="w"> </span><span class="m">130</span>%<span class="w"> </span>cpu<span class="w"> </span><span class="m">0</span>.908<span class="w"> </span>total
<span class="nb">time</span><span class="w"> </span>pythran<span class="w"> </span>-O3<span class="w"> </span>-march<span class="o">=</span>native<span class="w"> </span>arc_distance.cpp
<span class="m">5</span>.67s<span class="w"> </span>user<span class="w"> </span><span class="m">0</span>.61s<span class="w"> </span>system<span class="w"> </span><span class="m">104</span>%<span class="w"> </span>cpu<span class="w"> </span><span class="m">6</span>.001<span class="w"> </span>total
</pre></div>
<p>OK, roughly six seconds in sequential mode. What about the vectorized version? With boost.simd, it's pretty damn slow:</p>
<div class="highlight"><pre><span></span><span class="nb">time</span><span class="w"> </span>pythran<span class="w"> </span>-O3<span class="w"> </span>-march<span class="o">=</span>native<span class="w"> </span>normalize_complex_arr.cpp<span class="w"> </span>-DUSE_BOOST_SIMD
<span class="m">12</span>.10s<span class="w"> </span>user<span class="w"> </span><span class="m">0</span>.79s<span class="w"> </span>system<span class="w"> </span><span class="m">102</span>%<span class="w"> </span>cpu<span class="w"> </span><span class="m">12</span>.616<span class="w"> </span>total
</pre></div>
<p>With xsimd, it's slightly faster (no boost dependencies, and less C++ magic):</p>
<div class="highlight"><pre><span></span><span class="nb">time</span><span class="w"> </span>pythran<span class="w"> </span>-O3<span class="w"> </span>-march<span class="o">=</span>native<span class="w"> </span>arc_distance.cpp<span class="w"> </span>-DUSE_XSIMD
<span class="m">10</span>.32s<span class="w"> </span>user<span class="w"> </span><span class="m">0</span>.65s<span class="w"> </span>system<span class="w"> </span><span class="m">102</span>%<span class="w"> </span>cpu<span class="w"> </span><span class="m">10</span>.688<span class="w"> </span>total
</pre></div>
</div>
<div class="section" id="performance-of-basic-functions">
<h2>Performance of Basic Functions</h2>
<p>Using <a class="reference external" href="https://github.com/airspeed-velocity/asv">airspeed velocity</a>, I've compared how well xsimd behaves for simple operations on 1D arrays. All the benchmarks hereafter have the following form:</p>
<div class="highlight"><pre><span></span><span class="c1">#pythran export cos_array(float64 [])</span>
<span class="c1">#setup: import numpy as np ; np.random.seed(0); N = 10000 ; x = np.random.random(N) * 2 * np.pi</span>
<span class="c1">#run: cos_array(x)</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>
<span class="k">def</span> <span class="nf">cos_array</span><span class="p">(</span><span class="n">x</span><span class="p">):</span>
<span class="k">return</span> <span class="n">np</span><span class="o">.</span><span class="n">cos</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
</pre></div>
<p>The results are obtained through the <tt class="docutils literal">asv compare commit_id0 commit_id1</tt> command.</p>
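<p>For reference, the <tt class="docutils literal">benchmarks.TimeSuite.time_*</tt> entries below come from an asv benchmark class. Here is a minimal sketch of such a suite (assuming the compiled <tt class="docutils literal">cos_array</tt> module is importable; the real suite covers every function listed):</p>
<pre class="code literal-block">
# asv discovers classes with time_* methods and times them after setup()
import numpy as np
from cos_array import cos_array  # the Pythran-compiled kernel shown above

class TimeSuite:
    def setup(self):
        np.random.seed(0)
        self.x = np.random.random(10000) * 2 * np.pi

    def time_cos_array(self):
        cos_array(self.x)
</pre>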
<pre class="code literal-block">
All benchmarks:
before after ratio
[99d8234f] [60632651]
9.90μs 9.89μs 1.00 benchmarks.TimeSuite.time_abs_array
+ 36.82μs 58.44μs 1.59 benchmarks.TimeSuite.time_acos_array
36.25μs 33.60μs 0.93 benchmarks.TimeSuite.time_asin_array
- 50.47μs 33.03μs 0.65 benchmarks.TimeSuite.time_atan_array
- 48.62μs 35.72μs 0.73 benchmarks.TimeSuite.time_cos_array
- 73.82μs 43.81μs 0.59 benchmarks.TimeSuite.time_cosh_array
- 47.55μs 35.52μs 0.75 benchmarks.TimeSuite.time_sin_array
- 91.45μs 47.86μs 0.52 benchmarks.TimeSuite.time_sinh_array
18.35μs 17.91μs 0.98 benchmarks.TimeSuite.time_sqrt_array
9.60μs 10.05μs 1.05 benchmarks.TimeSuite.time_square_array
- 71.71μs 33.35μs 0.47 benchmarks.TimeSuite.time_tan_array
- 84.63μs 42.28μs 0.50 benchmarks.TimeSuite.time_tanh_array
</pre>
<p>Looks pretty good! Apart from a regression on <tt class="docutils literal">acos</tt>, this is either on par with or faster than before.</p>
<p>Out of curiosity, I also ran the same benchmark, but using Clang as the back-end compiler.</p>
<pre class="code literal-block">
All benchmarks:
before after ratio
[99d8234f] [60632651]
9.57μs 10.00μs 1.05 benchmarks.TimeSuite.time_abs_array
+ 34.20μs 58.53μs 1.71 benchmarks.TimeSuite.time_acos_array
36.09μs 33.91μs 0.94 benchmarks.TimeSuite.time_asin_array
- 45.02μs 33.86μs 0.75 benchmarks.TimeSuite.time_atan_array
+ 39.44μs 45.48μs 1.15 benchmarks.TimeSuite.time_cos_array
- 65.98μs 44.78μs 0.68 benchmarks.TimeSuite.time_cosh_array
+ 39.39μs 45.48μs 1.15 benchmarks.TimeSuite.time_sin_array
- 110.62μs 48.44μs 0.44 benchmarks.TimeSuite.time_sinh_array
18.18μs 18.54μs 1.02 benchmarks.TimeSuite.time_sqrt_array
10.05μs 9.56μs 0.95 benchmarks.TimeSuite.time_square_array
- 56.82μs 45.32μs 0.80 benchmarks.TimeSuite.time_tan_array
- 98.85μs 44.16μs 0.45 benchmarks.TimeSuite.time_tanh_array
</pre>
<p>Wow, those are significant changes. The regressions on <tt class="docutils literal">cos</tt>, <tt class="docutils literal">sin</tt> and <tt class="docutils literal">acos</tt> are not good news.</p>
<p>What conclusion should we draw? My take on this is that these benchmarks are
not synthetic enough to state that <em>the xsimd implementation of function X is better or
worse than the boost.simd one</em>. But maybe there are bad interactions
with Pythran's expression templates? A single register spill can wreak havoc
on performance, and I know there is room for improvement there.</p>
</div>
<div class="section" id="conclusions">
<h2>Conclusions</h2>
<p>I'm indeed very happy with the changes. The xsimd team is very responsive, and it's
cool to chat with them about performance, Python, C++... And did I mention that xsimd
supports NEON and AVX512? I should try to run cross-compiled Pythran code on a
Raspberry Pi, but... that's for another story!</p>
<p>Again, thanks a lot to (in alphabetical order) <a class="reference external" href="https://twitter.com/JohanMabille">Johan</a>, <a class="reference external" href="https://twitter.com/renou_martin">Martin</a>, <a class="reference external" href="https://twitter.com/SylvainCorlay">Sylvain</a> and <a class="reference external" href="https://twitter.com/wuoulf">Wolf</a>.
Let's meet again in front of a generous choucroute!</p>
</div>
</article>
<section class="post-nav">
<div id="left-page">
<div id="left-link">
</div>
</div>
<div id="right-page">
<div id="right-link">
</div>
</div>
</section>
<div>
</div>
</main>
<footer>
<h6>
Rendered by <a href="http://getpelican.com/">Pelican</a> • Theme by <a
href="https://github.com/aleylara/Peli-Kiera">Peli-Kiera</a> • Copyright
© serge-sans-paille and other pythraners </h6>
</footer>
</div>
</body>
</html>