<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8"/>
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<meta http-equiv="X-UA-Compatible" content="ie=edge">
<title>Pythran stories - Source-to-source transformation of a Python kernel</title>
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/normalize/8.0.1/normalize.min.css"/>
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/5.15.2/css/all.min.css"/>
<link rel="stylesheet" href="https://fonts.googleapis.com/css?family=Roboto+Slab|Ruda"/>
<link rel="stylesheet" type="text/css" href="./theme/css/main.css"/>
<link href="http://serge-sans-paille.github.io/pythran-stories/feeds/all.atom.xml"
type="application/atom+xml" rel="alternate" title="Pythran stories Atom Feed"/>
</head>
<body>
<style>.github-corner:hover .octo-arm {
animation: octocat-wave 560ms ease-in-out
}
@keyframes octocat-wave {
0%, 100% {
transform: rotate(0)
}
20%, 60% {
transform: rotate(-25deg)
}
40%, 80% {
transform: rotate(10deg)
}
}
@media (max-width: 500px) {
.github-corner:hover .octo-arm {
animation: none
}
.github-corner .octo-arm {
animation: octocat-wave 560ms ease-in-out
}
}</style><div id="container">
<header>
<h1><a href="./">Pythran stories</a></h1>
<ul class="social-media">
<li><a href="https://github.com/serge-sans-paille/pythran"><i class="fab fa-github fa-lg" aria-hidden="true"></i></a></li>
<li><a href="http://serge-sans-paille.github.io/pythran-stories/feeds/all.atom.xml"
type="application/atom+xml" rel="alternate"><i class="fa fa-rss fa-lg"
aria-hidden="true"></i></a></li>
</ul>
<p><em></em></p>
</header>
<nav>
<ul>
<li><a href="./category/benchmark.html"> benchmark </a></li>
<li><a class="active" href="./category/compilation.html"> compilation </a></li>
<li><a href="./category/engineering.html"> engineering </a></li>
<li><a href="./category/examples.html"> examples </a></li>
<li><a href="./category/mozilla.html"> mozilla </a></li>
<li><a href="./category/optimisation.html"> optimisation </a></li>
<li><a href="./category/release.html"> release </a></li>
</ul>
</nav>
<main>
<article>
<h1>Source-to-source transformation of a Python kernel</h1>
<aside>
<ul>
<li>
<time datetime="2019-03-06 00:00:00+01:00">Mar 06, 2019</time>
</li>
<li>
Categories:
<a href="./category/compilation.html"><em>compilation</em></a>
</li>
</ul>
</aside>
<p>If you're curious or genuinely interested in how Pythran transforms your
code, but not brave enough to dive into the generated C++ code, Pythran
provides a compilation switch to dump the refined Python code after
optimization, before it gets translated to C++. Internally, this relies on
the fact that Pythran has two backends: a C++ backend and a Python backend.</p>
<div class="section" id="using-pythran-as-a-source-to-source-compiler">
<h2>Using Pythran as a Source-to-Source Compiler</h2>
<p>Pythran can be used as a source-to-source engine through the <tt class="docutils literal"><span class="pre">-P</span></tt> flag.</p>
<div class="highlight"><pre><span></span>><span class="w"> </span>cat<span class="w"> </span>sample.py
def<span class="w"> </span>fibo<span class="o">(</span>n<span class="o">)</span>:
<span class="w"> </span><span class="k">return</span><span class="w"> </span>n<span class="w"> </span><span class="k">if</span><span class="w"> </span>n<span class="w"> </span><<span class="w"> </span><span class="m">2</span><span class="w"> </span><span class="k">else</span><span class="w"> </span>fibo<span class="o">(</span>n<span class="w"> </span>-<span class="w"> </span><span class="m">1</span><span class="o">)</span><span class="w"> </span>+<span class="w"> </span>fibo<span class="o">(</span>n<span class="w"> </span>-<span class="w"> </span><span class="m">2</span><span class="o">)</span>
def<span class="w"> </span>test<span class="o">()</span>:
<span class="w"> </span>print<span class="o">(</span>fibo<span class="o">(</span><span class="m">10</span><span class="o">))</span>
><span class="w"> </span>pythran<span class="w"> </span>-P<span class="w"> </span>sample.py
def<span class="w"> </span>fibo<span class="o">(</span>n<span class="o">)</span>:
<span class="w"> </span><span class="k">return</span><span class="w"> </span><span class="o">(</span>n<span class="w"> </span><span class="k">if</span><span class="w"> </span><span class="o">(</span>n<span class="w"> </span><<span class="w"> </span><span class="m">2</span><span class="o">)</span><span class="w"> </span><span class="k">else</span><span class="w"> </span><span class="o">(</span>fibo<span class="o">((</span>n<span class="w"> </span>-<span class="w"> </span><span class="m">1</span><span class="o">))</span><span class="w"> </span>+<span class="w"> </span>fibo<span class="o">((</span>n<span class="w"> </span>-<span class="w"> </span><span class="m">2</span><span class="o">))))</span>
def<span class="w"> </span>test<span class="o">()</span>:
<span class="w"> </span>__builtin__.print<span class="o">(</span><span class="m">55</span><span class="o">)</span>
<span class="w"> </span><span class="k">return</span><span class="w"> </span>__builtin__.None
</pre></div>
<p>What happened? Pythran analyzed the body of <tt class="docutils literal">fibo</tt> and found out it was a
pure function (no effect on global state or on its arguments) called with a literal,
so it performed aggressive constant propagation. It also computed def-use
chains, which help make every builtin explicit (<tt class="docutils literal">__builtin__.print</tt>). Based
on the control flow graph of each function, it also adds <tt class="docutils literal">return None</tt>
wherever Python would implicitly add it.</p>
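<p>Pythran's actual pass goes further (it evaluates calls to pure functions such as
<tt class="docutils literal">fibo(10)</tt>), but the flavour of this kind of AST rewrite can be sketched with
the standard <tt class="docutils literal">ast</tt> module. The sketch below only folds literal additions and is
not Pythran code:</p>
<div class="highlight"><pre>
# NOT Pythran's implementation: a minimal constant-folding sketch
# built on the standard ast module, for illustration only.
import ast

class FoldConstants(ast.NodeTransformer):
    def visit_BinOp(self, node):
        self.generic_visit(node)  # fold children first
        if (isinstance(node.left, ast.Constant)
                and isinstance(node.right, ast.Constant)
                and isinstance(node.op, ast.Add)):
            return ast.copy_location(
                ast.Constant(node.left.value + node.right.value), node)
        return node

tree = ast.parse("x = (1 + 2) + 3")
folded = ast.fix_missing_locations(FoldConstants().visit(tree))
print(ast.unparse(folded))  # prints: x = 6
</pre></div>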
</div>
<div class="section" id="advanced-transformations">
<h2>Advanced Transformations</h2>
<p>An alluring aspect of Python for scientists is the high-level constructs it
offers. For instance, the following code implements an (arguably) high-level
way of computing the weighted sum of five scalars:</p>
<div class="highlight"><pre><span></span><span class="c1"># wsum.py</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>
<span class="k">def</span> <span class="nf">wsum</span><span class="p">(</span><span class="n">v</span><span class="p">,</span> <span class="n">w</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">z</span><span class="p">):</span>
<span class="k">return</span> <span class="nb">sum</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([</span><span class="n">v</span><span class="p">,</span> <span class="n">w</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">z</span><span class="p">])</span> <span class="o">*</span> <span class="p">(</span><span class="mf">.1</span><span class="p">,</span> <span class="mf">.2</span><span class="p">,</span> <span class="mf">.3</span><span class="p">,</span> <span class="mf">.2</span><span class="p">,</span> <span class="mf">.1</span><span class="p">))</span>
</pre></div>
<p>This code is fine from a NumPy point of view, but how does Pythran handle it?
Surely, building a temporary array just for the sake of performing an
element-wise array operation is not the most efficient way of doing it!</p>
<div class="highlight"><pre><span></span>><span class="w"> </span>pythran<span class="w"> </span>-P<span class="w"> </span>wsum.py
import<span class="w"> </span>numpy<span class="w"> </span>as<span class="w"> </span>__pythran_import_numpy
def<span class="w"> </span>wsum<span class="o">(</span>v,<span class="w"> </span>w,<span class="w"> </span>x,<span class="w"> </span>y,<span class="w"> </span>z<span class="o">)</span>:
<span class="w"> </span><span class="k">return</span><span class="w"> </span>__builtin__.sum<span class="o">(((</span>v<span class="w"> </span>*<span class="w"> </span><span class="m">0</span>.1<span class="o">)</span>,<span class="w"> </span><span class="o">(</span>w<span class="w"> </span>*<span class="w"> </span><span class="m">0</span>.2<span class="o">)</span>,<span class="w"> </span><span class="o">(</span>x<span class="w"> </span>*<span class="w"> </span><span class="m">0</span>.3<span class="o">)</span>,<span class="w"> </span><span class="o">(</span>y<span class="w"> </span>*<span class="w"> </span><span class="m">0</span>.2<span class="o">)</span>,<span class="w"> </span><span class="o">(</span>z<span class="w"> </span>*<span class="w"> </span><span class="m">0</span>.1<span class="o">)))</span>
</pre></div>
<p>Fascinating! (Yes, I'm self-congratulating here.) Pythran understood that a
NumPy operation on a fixed-size array was involved, so it first performed the
broadcasting on its own, resulting in:</p>
<div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">__pythran_import_numpy</span>
<span class="k">def</span> <span class="nf">wsum</span><span class="p">(</span><span class="n">v</span><span class="p">,</span> <span class="n">w</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">z</span><span class="p">):</span>
<span class="k">return</span> <span class="n">__builtin__</span><span class="o">.</span><span class="n">sum</span><span class="p">(</span><span class="n">__pythran_import_numpy</span><span class="o">.</span><span class="n">array</span><span class="p">([(</span><span class="n">v</span> <span class="o">*</span> <span class="mf">0.1</span><span class="p">),</span> <span class="p">(</span><span class="n">w</span> <span class="o">*</span> <span class="mf">0.2</span><span class="p">),</span> <span class="p">(</span><span class="n">x</span> <span class="o">*</span> <span class="mf">0.3</span><span class="p">),</span> <span class="p">(</span><span class="n">y</span> <span class="o">*</span> <span class="mf">0.2</span><span class="p">),</span> <span class="p">(</span><span class="n">z</span> <span class="o">*</span> <span class="mf">0.1</span><span class="p">)]))</span>
</pre></div>
<p>Then it used the fact that <tt class="docutils literal">sum</tt> can take any iterable as a parameter to prune the
call to <tt class="docutils literal">np.array</tt>. The nice thing with a tuple of homogeneous type as a
parameter is that the C++ backend can use it to generate something equivalent
to <tt class="docutils literal"><span class="pre">std::array<double,</span> 5></tt>, avoiding a heap allocation.</p>
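<p>Even in plain Python, pruning the <tt class="docutils literal">np.array</tt> call pays off for such tiny
inputs. A small, home-made micro-benchmark (not part of the original experiment, and timings
will vary) makes the point:</p>
<div class="highlight"><pre>
# Illustration only: compare summing a temporary 5-element NumPy array
# with summing the equivalent tuple of products.
import timeit

setup = "import numpy as np; v, w, x, y, z = 1., 2., 3., 4., 5."
with_array = "sum(np.array([v, w, x, y, z]) * (.1, .2, .3, .2, .1))"
with_tuple = "sum((v * .1, w * .2, x * .3, y * .2, z * .1))"

for label, stmt in (("np.array", with_array), ("tuple", with_tuple)):
    best = min(timeit.repeat(stmt, setup=setup, number=100_000))
    print(f"{label:>8}: {best:.3f}s for 100k calls")
</pre></div>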
</div>
<div class="section" id="the-assembly-worker">
<h2>The Assembly Worker</h2>
<p>Let's inspect the assembly generated from the above code, instantiated with the
Pythran annotation <tt class="docutils literal">#pythran export wsum(float64, float64, float64, float64,
float64)</tt> and compiled with Clang 6.0:</p>
<div class="highlight"><pre><span></span>><span class="w"> </span><span class="nv">CXX</span><span class="o">=</span>clang++<span class="w"> </span>pythran<span class="w"> </span>wsum.py
><span class="w"> </span>objdump<span class="w"> </span>-S<span class="w"> </span>-C<span class="w"> </span>wsum.*.so
<span class="o">[</span>...<span class="o">]</span>
...<span class="w"> </span>movsd<span class="w"> </span>0x12d4<span class="o">(</span>%rip<span class="o">)</span>,%xmm0
...<span class="w"> </span>movsd<span class="w"> </span>0x18<span class="o">(</span>%rsp<span class="o">)</span>,%xmm2
...<span class="w"> </span>mulsd<span class="w"> </span>%xmm0,%xmm2
...<span class="w"> </span>movsd<span class="w"> </span>0x12ca<span class="o">(</span>%rip<span class="o">)</span>,%xmm1
...<span class="w"> </span>movsd<span class="w"> </span>0x10<span class="o">(</span>%rsp<span class="o">)</span>,%xmm3
...<span class="w"> </span>mulsd<span class="w"> </span>%xmm1,%xmm3
...<span class="w"> </span>movsd<span class="w"> </span>0x8<span class="o">(</span>%rsp<span class="o">)</span>,%xmm4
...<span class="w"> </span>mulsd<span class="w"> </span>0x12ba<span class="o">(</span>%rip<span class="o">)</span>,%xmm4
...<span class="w"> </span>movsd<span class="w"> </span><span class="o">(</span>%rsp<span class="o">)</span>,%xmm5
...<span class="w"> </span>mulsd<span class="w"> </span>%xmm0,%xmm5
...<span class="w"> </span>movsd<span class="w"> </span>0x20<span class="o">(</span>%rsp<span class="o">)</span>,%xmm0
...<span class="w"> </span>mulsd<span class="w"> </span>%xmm1,%xmm0
...<span class="w"> </span>addsd<span class="w"> </span>%xmm5,%xmm0
...<span class="w"> </span>addsd<span class="w"> </span>%xmm4,%xmm0
...<span class="w"> </span>addsd<span class="w"> </span>%xmm3,%xmm0
...<span class="w"> </span>addsd<span class="w"> </span>%xmm2,%xmm0
<span class="o">[</span>...<span class="o">]</span>
</pre></div>
<p>Not a single call to a memory allocator, no branching, just a plain sequence of
<tt class="docutils literal">movsd</tt>, <tt class="docutils literal">mulsd</tt> and <tt class="docutils literal">addsd</tt>. And quite some register pressure, but
that's how it is.</p>
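<p>The five multiplications and four additions map one-to-one onto the scalar
expression shown by <tt class="docutils literal"><span class="pre">-P</span></tt> above; here is a quick sanity check in plain
Python (with made-up inputs, not part of the original experiment):</p>
<div class="highlight"><pre>
# Scalar form of what the compiled kernel computes.
def wsum_scalar(v, w, x, y, z):
    return v * .1 + w * .2 + x * .3 + y * .2 + z * .1

print(wsum_scalar(1., 2., 3., 4., 5.))  # 2.7, up to floating-point rounding
</pre></div>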
</div>
<div class="section" id="just-perf-it">
<h2>Just <tt class="docutils literal">perf</tt> it</h2>
<p>As a tribute to Victor Stinner's <tt class="docutils literal">perf</tt> module, and as a conclusion to this
small experiment, let's ensure we get some speedup, even for such a small
kernel:</p>
<div class="highlight"><pre><span></span>><span class="w"> </span>rm<span class="w"> </span>*.so
><span class="w"> </span>python<span class="w"> </span>-m<span class="w"> </span>perf<span class="w"> </span>timeit<span class="w"> </span>-s<span class="w"> </span><span class="s1">'from wsum import wsum'</span><span class="w"> </span><span class="s1">'wsum(1.,2.,3.,4.,5.)'</span>
.....................
Mean<span class="w"> </span>+-<span class="w"> </span>std<span class="w"> </span>dev:<span class="w"> </span><span class="m">3</span>.73<span class="w"> </span>us<span class="w"> </span>+-<span class="w"> </span><span class="m">0</span>.11<span class="w"> </span>us
><span class="w"> </span><span class="nv">CXX</span><span class="o">=</span>clang++<span class="w"> </span>pythran<span class="w"> </span>wsum.py
><span class="w"> </span>python<span class="w"> </span>-m<span class="w"> </span>perf<span class="w"> </span>timeit<span class="w"> </span>-s<span class="w"> </span><span class="s1">'from wsum import wsum'</span><span class="w"> </span><span class="s1">'wsum(1.,2.,3.,4.,5.)'</span>
.....................
Mean<span class="w"> </span>+-<span class="w"> </span>std<span class="w"> </span>dev:<span class="w"> </span><span class="m">190</span><span class="w"> </span>ns<span class="w"> </span>+-<span class="w"> </span><span class="m">3</span><span class="w"> </span>ns
</pre></div>
<p>And out of curiosity, let's check the timing with the transformed Python kernel.</p>
<div class="highlight"><pre><span></span>><span class="w"> </span>rm<span class="w"> </span>*.so
><span class="w"> </span>pythran<span class="w"> </span>-P<span class="w"> </span>wsum.py<span class="w"> </span><span class="p">|</span><span class="w"> </span>sed<span class="w"> </span><span class="s1">'s,__builtin__.,,'</span><span class="w"> </span>><span class="w"> </span>wsum2.py
><span class="w"> </span>python<span class="w"> </span>-m<span class="w"> </span>perf<span class="w"> </span>timeit<span class="w"> </span>-s<span class="w"> </span><span class="s1">'from wsum2 import wsum'</span><span class="w"> </span><span class="s1">'wsum(1.,2.,3.,4.,5.)'</span>
.....................
Mean<span class="w"> </span>+-<span class="w"> </span>std<span class="w"> </span>dev:<span class="w"> </span><span class="m">308</span><span class="w"> </span>ns<span class="w"> </span>+-<span class="w"> </span><span class="m">7</span><span class="w"> </span>ns
</pre></div>
<p>Fine! Pythran did the job in both cases :-)</p>
</div>
<div class="section" id="final-words">
<h2>Final Words</h2>
<p>The optimisations done by Pythran are applied to its internal representation
so that the translated code compiles to an efficient native library.
Still, being able to inspect that representation at the Python level is very valuable,
and Pythran can even generate faster Python code in some cases!</p>
</div>
</article>
<section class="post-nav">
<div id="left-page">
<div id="left-link">
</div>
</div>
<div id="right-page">
<div id="right-link">
</div>
</div>
</section>
<div>
</div>
</main>
<footer>
<h6>
Rendered by <a href="http://getpelican.com/">Pelican</a> • Theme by <a
href="https://github.com/aleylara/Peli-Kiera">Peli-Kiera</a> • Copyright
© ‑ serge-sans-paille and other pythraners </h6>
</footer>
</div>
</body>
</html>