Skip to content

算子优化一:调度与算法分离 #9

@xtyi

Description

@xtyi

算法与调度

当编写高性能计算的代码(例如图像处理,张量计算)时,我们需要为特定的硬件架构精心构造代码,包括但不限于使用以下方法:

  • 将外层循环并行化 -> 利用多核 CPU
  • 将内层循环向量化 -> 利用 CPU SIMD 指令
  • 拆分并交换迭代顺序 -> 利用局部缓存
  • ......

但是这会带来一系列问题

  • 这些代码的可读性,可维护性都很差
  • 每个不同的硬件架构,甚至缓存大小不同,都需要不同的代码实现,所以我们要维护多套代码实现
  • 需要开发者非常了解硬件架构,并且进行大量测试

Halide 是一门新的编程语言,Halide 的核心理念是,之所以会存在这些问题,是因为之前的编程都混淆了算法(Algorithm)和调度(Scheduler)两个概念

之所以我们没有察觉,是因为这个问题在编写不需要高性能的普通应用程序时并不明显,我们关心的重点是程序逻辑,即算法的部分,调度无关紧要

而在编写高性能计算代码时,调度变得重要,我们就无法忽视它的存在了

举个例子,我们计算两个矩阵相加

算法就是:对于两个矩阵同一位置的两个元素 x, y,执行 f(x, y) = x + y,结果存放在目标矩阵相应位置

算法很好理解,矩阵加就是这么做的,那么调度是什么呢?先不着急,我们先实现一下这个算法

m = 4
n = 8

# 最常见的
arr1 = np.zeros((m, n))

for x in range(0, m):
    for y in range(0, n):
        arr1[x, y] = x + y
        print(f"x = {x}, y = {y}, x + y = {arr1[x, y]}")

print(arr1)

打印的矩阵为

[[ 0.  1.  2.  3.  4.  5.  6.  7.]
 [ 1.  2.  3.  4.  5.  6.  7.  8.]
 [ 2.  3.  4.  5.  6.  7.  8.  9.]
 [ 3.  4.  5.  6.  7.  8.  9. 10.]]

上面的代码应该是大部分人都会实现的逻辑,但是也有部分同学会调换一下 x 和 y 先后迭代的顺序

# 交换 x, y 顺序
arr2 = np.zeros((m, n))

for y in range(0, n):
    for x in range(0, m):
        arr2[x, y] = x + y
        print(f"x = {x}, y = {y}, x + y = {arr2[x, y]}")

print(arr2)

打印的矩阵仍然是

[[ 0.  1.  2.  3.  4.  5.  6.  7.]
 [ 1.  2.  3.  4.  5.  6.  7.  8.]
 [ 2.  3.  4.  5.  6.  7.  8.  9.]
 [ 3.  4.  5.  6.  7.  8.  9. 10.]]

这有什么不同吗?之所以我们认为"差不多",是因为这两个代码在算法上是一样的,不一样的地方就是调度

前者的迭代顺序如下

x = 0, y = 0, x + y = 0.0
x = 0, y = 1, x + y = 1.0
x = 0, y = 2, x + y = 2.0
x = 0, y = 3, x + y = 3.0
x = 0, y = 4, x + y = 4.0
x = 0, y = 5, x + y = 5.0
x = 0, y = 6, x + y = 6.0
x = 0, y = 7, x + y = 7.0
x = 1, y = 0, x + y = 1.0
x = 1, y = 1, x + y = 2.0
x = 1, y = 2, x + y = 3.0
x = 1, y = 3, x + y = 4.0
x = 1, y = 4, x + y = 5.0
x = 1, y = 5, x + y = 6.0
x = 1, y = 6, x + y = 7.0
x = 1, y = 7, x + y = 8.0
x = 2, y = 0, x + y = 2.0
x = 2, y = 1, x + y = 3.0
x = 2, y = 2, x + y = 4.0
x = 2, y = 3, x + y = 5.0
x = 2, y = 4, x + y = 6.0
x = 2, y = 5, x + y = 7.0
x = 2, y = 6, x + y = 8.0
x = 2, y = 7, x + y = 9.0
x = 3, y = 0, x + y = 3.0
x = 3, y = 1, x + y = 4.0
x = 3, y = 2, x + y = 5.0
x = 3, y = 3, x + y = 6.0
x = 3, y = 4, x + y = 7.0
x = 3, y = 5, x + y = 8.0
x = 3, y = 6, x + y = 9.0
x = 3, y = 7, x + y = 10.0

后者如下

x = 0, y = 0, x + y = 0.0
x = 1, y = 0, x + y = 1.0
x = 2, y = 0, x + y = 2.0
x = 3, y = 0, x + y = 3.0
x = 0, y = 1, x + y = 1.0
x = 1, y = 1, x + y = 2.0
x = 2, y = 1, x + y = 3.0
x = 3, y = 1, x + y = 4.0
x = 0, y = 2, x + y = 2.0
x = 1, y = 2, x + y = 3.0
x = 2, y = 2, x + y = 4.0
x = 3, y = 2, x + y = 5.0
x = 0, y = 3, x + y = 3.0
x = 1, y = 3, x + y = 4.0
x = 2, y = 3, x + y = 5.0
x = 3, y = 3, x + y = 6.0
x = 0, y = 4, x + y = 4.0
x = 1, y = 4, x + y = 5.0
x = 2, y = 4, x + y = 6.0
x = 3, y = 4, x + y = 7.0
x = 0, y = 5, x + y = 5.0
x = 1, y = 5, x + y = 6.0
x = 2, y = 5, x + y = 7.0
x = 3, y = 5, x + y = 8.0
x = 0, y = 6, x + y = 6.0
x = 1, y = 6, x + y = 7.0
x = 2, y = 6, x + y = 8.0
x = 3, y = 6, x + y = 9.0
x = 0, y = 7, x + y = 7.0
x = 1, y = 7, x + y = 8.0
x = 2, y = 7, x + y = 9.0
x = 3, y = 7, x + y = 10.0

这里你可能认为,这只是 a + bb + a 的区别,结果是一样的,算法复杂度没有任何区别

一点没错,实际上,调度就是 a + bb + a 的区别,但是这里的 a + bb + a 是严重影响程序效率的,因为算法复杂度并不是实际的执行效率,相同算法复杂度的程序,一个利用了并行,缓存,SIMD,一个没有,效率是天差地别的,调度关心的就是这些。

接下来再看一些调度,我们还可以把循环拆分

# 拆分
arr3 = np.zeros((m, n))

n_inner = 2
n_outer = n // 2

for x in range(0, m):
    for y_outer in range(0, n_outer):
        for y_inner in range(0, n_inner):
            y = y_outer * n_inner + y_inner
            arr3[x, y] = x + y
            print(f"x = {x}, y = {y}, x + y = arr3[x, y]")

print(arr3)

单独拆分没什么意思,迭代顺序还是一样的,不过拆分后还可以再排序

arr3 = np.zeros((m, n))

n_inner = 2
n_outer = n // 2

for y_outer in range(0, n_outer):
    for x in range(0, m):
        for y_inner in range(0, n_inner):
            y = y_outer * n_inner + y_inner
            arr3[x, y] = x + y
            print(f"x = {x}, y = {y}, x + y = arr3[x, y]")

print(arr3)

可以展开内层循环

# unrolling
arr5 = np.zeros((m, n))

n_inner = 2
n_outer = n // 2

for y_outer in range(0, n_outer):
    for x in range(0, m):
        y = y_outer * n_inner + 0
        arr5[x, y] = x + y
        print(f"x = {x}, y = {y}, x + y = {arr5[x, y]}")

        y = y_outer * n_inner + 1
        arr5[x, y] = x + y
        print(f"x = {x}, y = {y}, x + y = {arr5[x, y]}")

print(arr5)

可以合并循环

# fusing

arr6 = np.zeros((m, n))

for xy in range(0, m*n):
    x = xy // n
    y = xy % n
    arr6[x, y] = x + y
    print(f"x = {x}, y = {y}, x + y = {arr6[x, y]}")

print(arr6)

以上展示了一个算法的许多版本的调度,接下来我们看一下 Halide 如何解决这些问题的

Halide

Halide 简单介绍

Halide 示例

Metadata

Metadata

Assignees

No one assigned

    Labels

    dl深度学习

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions