-
Notifications
You must be signed in to change notification settings - Fork 0
Description
算法与调度
当编写高性能计算的代码(例如图像处理,张量计算)时,我们需要为特定的硬件架构精心构造代码,包括但不限于使用以下方法:
- 将外层循环并行化 -> 利用多核 CPU
- 将内层循环向量化 -> 利用 CPU SIMD 指令
- 拆分并交换迭代顺序 -> 利用局部缓存
- ......
但是这会带来一系列问题
- 这些代码的可读性,可维护性都很差
- 每个不同的硬件架构,甚至缓存大小不同,都需要不同的代码实现,所以我们要维护多套代码实现
- 需要开发者非常了解硬件架构,并且进行大量测试
Halide 是一门新的编程语言,Halide 的核心理念是,之所以会存在这些问题,是因为之前的编程都混淆了算法(Algorithm)和调度(Scheduler)两个概念
之所以我们没有察觉,是因为这个问题在编写不需要高性能的普通应用程序时并不明显,我们关心的重点是程序逻辑,即算法的部分,调度无关紧要
而在编写高性能计算代码时,调度变得重要,我们就无法忽视它的存在了
举个例子,我们计算两个矩阵相加
算法就是:对于两个矩阵同一位置的两个元素 x, y,执行 f(x, y) = x + y,结果存放在目标矩阵相应位置
算法很好理解,矩阵加就是这么做的,那么调度是什么呢?先不着急,我们先实现一下这个算法
m = 4
n = 8
# 最常见的
arr1 = np.zeros((m, n))
for x in range(0, m):
for y in range(0, n):
arr1[x, y] = x + y
print(f"x = {x}, y = {y}, x + y = {arr1[x, y]}")
print(arr1)打印的矩阵为
[[ 0. 1. 2. 3. 4. 5. 6. 7.]
[ 1. 2. 3. 4. 5. 6. 7. 8.]
[ 2. 3. 4. 5. 6. 7. 8. 9.]
[ 3. 4. 5. 6. 7. 8. 9. 10.]]
上面的代码应该是大部分人都会实现的逻辑,但是也有部分同学会调换一下 x 和 y 先后迭代的顺序
# 交换 x, y 顺序
arr2 = np.zeros((m, n))
for y in range(0, n):
for x in range(0, m):
arr2[x, y] = x + y
print(f"x = {x}, y = {y}, x + y = {arr2[x, y]}")
print(arr2)打印的矩阵仍然是
[[ 0. 1. 2. 3. 4. 5. 6. 7.]
[ 1. 2. 3. 4. 5. 6. 7. 8.]
[ 2. 3. 4. 5. 6. 7. 8. 9.]
[ 3. 4. 5. 6. 7. 8. 9. 10.]]
这有什么不同吗?之所以我们认为"差不多",是因为这两个代码在算法上是一样的,不一样的地方就是调度
前者的迭代顺序如下
x = 0, y = 0, x + y = 0.0
x = 0, y = 1, x + y = 1.0
x = 0, y = 2, x + y = 2.0
x = 0, y = 3, x + y = 3.0
x = 0, y = 4, x + y = 4.0
x = 0, y = 5, x + y = 5.0
x = 0, y = 6, x + y = 6.0
x = 0, y = 7, x + y = 7.0
x = 1, y = 0, x + y = 1.0
x = 1, y = 1, x + y = 2.0
x = 1, y = 2, x + y = 3.0
x = 1, y = 3, x + y = 4.0
x = 1, y = 4, x + y = 5.0
x = 1, y = 5, x + y = 6.0
x = 1, y = 6, x + y = 7.0
x = 1, y = 7, x + y = 8.0
x = 2, y = 0, x + y = 2.0
x = 2, y = 1, x + y = 3.0
x = 2, y = 2, x + y = 4.0
x = 2, y = 3, x + y = 5.0
x = 2, y = 4, x + y = 6.0
x = 2, y = 5, x + y = 7.0
x = 2, y = 6, x + y = 8.0
x = 2, y = 7, x + y = 9.0
x = 3, y = 0, x + y = 3.0
x = 3, y = 1, x + y = 4.0
x = 3, y = 2, x + y = 5.0
x = 3, y = 3, x + y = 6.0
x = 3, y = 4, x + y = 7.0
x = 3, y = 5, x + y = 8.0
x = 3, y = 6, x + y = 9.0
x = 3, y = 7, x + y = 10.0
后者如下
x = 0, y = 0, x + y = 0.0
x = 1, y = 0, x + y = 1.0
x = 2, y = 0, x + y = 2.0
x = 3, y = 0, x + y = 3.0
x = 0, y = 1, x + y = 1.0
x = 1, y = 1, x + y = 2.0
x = 2, y = 1, x + y = 3.0
x = 3, y = 1, x + y = 4.0
x = 0, y = 2, x + y = 2.0
x = 1, y = 2, x + y = 3.0
x = 2, y = 2, x + y = 4.0
x = 3, y = 2, x + y = 5.0
x = 0, y = 3, x + y = 3.0
x = 1, y = 3, x + y = 4.0
x = 2, y = 3, x + y = 5.0
x = 3, y = 3, x + y = 6.0
x = 0, y = 4, x + y = 4.0
x = 1, y = 4, x + y = 5.0
x = 2, y = 4, x + y = 6.0
x = 3, y = 4, x + y = 7.0
x = 0, y = 5, x + y = 5.0
x = 1, y = 5, x + y = 6.0
x = 2, y = 5, x + y = 7.0
x = 3, y = 5, x + y = 8.0
x = 0, y = 6, x + y = 6.0
x = 1, y = 6, x + y = 7.0
x = 2, y = 6, x + y = 8.0
x = 3, y = 6, x + y = 9.0
x = 0, y = 7, x + y = 7.0
x = 1, y = 7, x + y = 8.0
x = 2, y = 7, x + y = 9.0
x = 3, y = 7, x + y = 10.0
这里你可能认为,这只是 a + b 和 b + a 的区别,结果是一样的,算法复杂度没有任何区别
一点没错,实际上,调度就是 a + b 和 b + a 的区别,但是这里的 a + b 和 b + a 是严重影响程序效率的,因为算法复杂度并不是实际的执行效率,相同算法复杂度的程序,一个利用了并行,缓存,SIMD,一个没有,效率是天差地别的,调度关心的就是这些。
接下来再看一些调度,我们还可以把循环拆分
# 拆分
arr3 = np.zeros((m, n))
n_inner = 2
n_outer = n // 2
for x in range(0, m):
for y_outer in range(0, n_outer):
for y_inner in range(0, n_inner):
y = y_outer * n_inner + y_inner
arr3[x, y] = x + y
print(f"x = {x}, y = {y}, x + y = arr3[x, y]")
print(arr3)单独拆分没什么意思,迭代顺序还是一样的,不过拆分后还可以再排序
arr3 = np.zeros((m, n))
n_inner = 2
n_outer = n // 2
for y_outer in range(0, n_outer):
for x in range(0, m):
for y_inner in range(0, n_inner):
y = y_outer * n_inner + y_inner
arr3[x, y] = x + y
print(f"x = {x}, y = {y}, x + y = arr3[x, y]")
print(arr3)可以展开内层循环
# unrolling
arr5 = np.zeros((m, n))
n_inner = 2
n_outer = n // 2
for y_outer in range(0, n_outer):
for x in range(0, m):
y = y_outer * n_inner + 0
arr5[x, y] = x + y
print(f"x = {x}, y = {y}, x + y = {arr5[x, y]}")
y = y_outer * n_inner + 1
arr5[x, y] = x + y
print(f"x = {x}, y = {y}, x + y = {arr5[x, y]}")
print(arr5)可以合并循环
# fusing
arr6 = np.zeros((m, n))
for xy in range(0, m*n):
x = xy // n
y = xy % n
arr6[x, y] = x + y
print(f"x = {x}, y = {y}, x + y = {arr6[x, y]}")
print(arr6)以上展示了一个算法的许多版本的调度,接下来我们看一下 Halide 如何解决这些问题的