Run nn.Graph by VM #9884

daquexian · 2023-02-21T09:19:00Z

Related issue: https://github.com/Oneflow-Inc/OneTeam/issues/1657

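This PR implements an experimental feature: when ONEFLOW_RUN_GRAPH_BY_VM=1 is set, nn.Graph is run by the VM, which lets nn.Graph accept dynamic input shapes (only single-device execution is supported). At this stage the approach is not fully reliable, because we cannot rule out that some op or graph optimization depends strongly on the input shapes seen when the graph was built. That can only be fully resolved once there is solid symbolic shape support.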
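Because the feature is switched by an environment variable, a script or test that turns it on should put the previous value back afterwards. A minimal sketch using only the standard library follows; the helper name `env_var` is hypothetical, and when exactly OneFlow reads the flag (import, graph build, or graph call) is not stated here, so where to place the `with` block is an assumption.

```python
import os
from contextlib import contextmanager


@contextmanager
def env_var(name: str, value: str):
    """Temporarily set an environment variable, then restore the old state."""
    old = os.environ.get(name)
    os.environ[name] = value
    try:
        yield
    finally:
        if old is None:
            # The variable did not exist before, so remove it again.
            os.environ.pop(name, None)
        else:
            os.environ[name] = old


with env_var("ONEFLOW_RUN_GRAPH_BY_VM", "1"):
    # Build and call the nn.Graph here (omitted; assumed to happen
    # while the flag is set).
    enabled = os.environ["ONEFLOW_RUN_GRAPH_BY_VM"]
```

Restoring in a `finally` clause keeps the flag from leaking into later tests even when the body raises.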

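Tested on SD 1.5, running the Graph by VM and by actor shows no significant difference in speed, though the VM uses slightly more GPU memory:

|        | VM | actor |
| --- | --- | --- |
| SD 1.5 | 17.70 it/s, 6968 MB | 17.75 it/s, 6292 MB |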
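To put the two measurements side by side, the snippet below computes the relative differences from the numbers reported above. It is only arithmetic on those four values (variable names are illustrative): the VM path comes out about 0.3% slower and uses 676 MB, roughly 11%, more memory.

```python
# Numbers copied from the SD 1.5 measurement above.
vm = {"it_per_s": 17.70, "mem_mb": 6968}
actor = {"it_per_s": 17.75, "mem_mb": 6292}

# Relative throughput change of VM versus actor, in percent.
speed_delta_pct = (vm["it_per_s"] - actor["it_per_s"]) / actor["it_per_s"] * 100

# Absolute and relative memory overhead of VM versus actor.
mem_delta_mb = vm["mem_mb"] - actor["mem_mb"]
mem_delta_pct = mem_delta_mb / actor["mem_mb"] * 100

print(f"speed: {speed_delta_pct:+.2f}%")
print(f"memory: +{mem_delta_mb} MB ({mem_delta_pct:+.1f}%)")
```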

Signed-off-by: daquexian <daquexian566@gmail.com>

oneflow/core/job/job_interpreter.cpp


strint · 2023-02-23T04:29:45Z

python/oneflow/test/graph/test_run_graph_by_vm.py

+    print(g)
+    assert "broadcast_sub" not in capsys.readouterr().out
+    assert "cast" not in capsys.readouterr().out
+    assert "broadcast_mul" not in capsys.readouterr().out


This doesn't look like a standard unittest. Can CI actually run this case?

Yes. This is pytest style, which is much nicer to use than Python's built-in unittest, and CI already runs tests with pytest.
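One detail of the pytest style quoted above is worth noting: `capsys.readouterr()` returns the captured text and resets the capture buffer, so calling it again without printing in between yields an empty string, and a `not in` check on that second result passes trivially. A sketch of reading the output once and reusing it is shown below. The helper `assert_ops_absent` and the printed text are hypothetical stand-ins, not code from this PR.

```python
def assert_ops_absent(printed: str, op_names):
    """Fail if any of the given op names appears in the printed graph text."""
    present = [name for name in op_names if name in printed]
    assert not present, f"unexpected ops in graph: {present}"


def test_ops_absent(capsys):
    # Stand-in for `print(g)`; a real test would print the nn.Graph here.
    print("(hypothetical graph repr) some_fused_op")
    # Read the captured output exactly once and keep the string.
    out = capsys.readouterr().out
    assert_ops_absent(out, ["broadcast_sub", "cast", "broadcast_mul"])
```

Note that a plain substring check for "cast" also matches "broadcast", so matching on whole op names may be stricter than intended or not strict enough, depending on how the graph is printed.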

strint · 2023-02-23T04:40:12Z

oneflow/core/job/job_interpreter.cpp

+  const auto& job = graph->job();
+  auto env = *JUST(InitEnv(graph_inputs, graph));
+
+  const auto dead_tensors = GetDeadTensorVector(job);


What does "dead tensor" mean? Could you add a comment?

OK. There is a comment at the definition of GetDeadTensorVector; I'll point to it here as well.


github-actions · 2023-02-23T04:47:33Z

Code got formatted by CI. Please request CI again if you still want to have this PR merged. If the PR is from a forked repo, please download the patch files from the GitHub Actions web page and apply them locally.

strint · 2023-02-23T04:52:46Z

oneflow/core/job/job_interpreter.cpp

+
+// tensors in dead_tensors[i] will not be accessed any more after i-th op
+// so they can be released once i-th op's execution finishes.
+std::vector<std::vector<std::string>> GetDeadTensorVector(const Job& job) {


So a dead tensor is basically an activation tensor that becomes outdated?

Yes, dead_tensors[i] holds the tensors that become dead after the i-th op. Suggestions for a better name are welcome.

OutdatedTensorAfterOp?

Sure :good: Done.
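The idea discussed in this thread (entry i lists the tensors that are not accessed after the i-th op, so they can be released once that op finishes) can be sketched in plain Python as a last-use analysis. This is an illustration only, not the C++ implementation in job_interpreter.cpp; the `Op` type, the function name, and the toy graph are assumptions, and the sketch treats weights and graph inputs like any other tensor, which a real interpreter may not do.

```python
from typing import Dict, List, NamedTuple


class Op(NamedTuple):
    name: str
    inputs: List[str]
    outputs: List[str]


def outdated_tensors_after_op(ops: List[Op], graph_outputs: List[str]) -> List[List[str]]:
    """result[i] lists tensors whose last access is op i."""
    last_use: Dict[str, int] = {}
    for i, op in enumerate(ops):
        for tensor in op.inputs + op.outputs:
            last_use[tensor] = i  # later ops overwrite earlier indices
    keep = set(graph_outputs)  # graph outputs must outlive the last op
    result: List[List[str]] = [[] for _ in ops]
    for tensor, i in last_use.items():
        if tensor not in keep:
            result[i].append(tensor)
    return result


# Toy graph: y = relu(x @ w) + x
ops = [
    Op("matmul", ["x", "w"], ["a"]),
    Op("relu", ["a"], ["b"]),
    Op("add", ["b", "x"], ["y"]),
]
dead = outdated_tensors_after_op(ops, graph_outputs=["y"])
print(dead)  # [['w'], ['a'], ['x', 'b']]
```

Here `x` stays alive until the final `add` because it is read again there, while `a` can be dropped as soon as `relu` has run.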

strint

LGTM


github-actions · 2023-02-23T14:55:47Z

Code got formatted by CI. Please request CI again if you still want to have this PR merged. If the PR is from a forked repo, please download the patch files from the GitHub Actions web page and apply them locally.

github-actions · 2023-02-24T03:06:51Z

Speed stats:

GPU Name: GeForce GTX 1080 

❌ OneFlow resnet50 time: 141.0ms (= 14098.1ms / 100, input_shape=[16, 3, 224, 224])
PyTorch resnet50 time: 146.5ms (= 14654.0ms / 100, input_shape=[16, 3, 224, 224])
❌ Relative speed: 1.04 (= 146.5ms / 141.0ms)

OneFlow resnet50 time: 80.5ms (= 8047.5ms / 100, input_shape=[8, 3, 224, 224])
PyTorch resnet50 time: 83.9ms (= 8386.3ms / 100, input_shape=[8, 3, 224, 224])
✔️ Relative speed: 1.04 (= 83.9ms / 80.5ms)

OneFlow resnet50 time: 48.4ms (= 9687.6ms / 200, input_shape=[4, 3, 224, 224])
PyTorch resnet50 time: 61.0ms (= 12201.5ms / 200, input_shape=[4, 3, 224, 224])
✔️ Relative speed: 1.26 (= 61.0ms / 48.4ms)

OneFlow resnet50 time: 32.2ms (= 6431.3ms / 200, input_shape=[2, 3, 224, 224])
PyTorch resnet50 time: 48.8ms (= 9750.4ms / 200, input_shape=[2, 3, 224, 224])
✔️ Relative speed: 1.52 (= 48.8ms / 32.2ms)

OneFlow resnet50 time: 24.9ms (= 4986.6ms / 200, input_shape=[1, 3, 224, 224])
PyTorch resnet50 time: 39.1ms (= 7826.4ms / 200, input_shape=[1, 3, 224, 224])
✔️ Relative speed: 1.57 (= 39.1ms / 24.9ms)

OneFlow swin dataloader time: 0.238s (= 47.591s / 200, num_workers=1)
PyTorch swin dataloader time: 0.156s (= 31.283s / 200, num_workers=1)
Relative speed: 0.657 (= 0.156s / 0.238s)

OneFlow swin dataloader time: 0.066s (= 13.223s / 200, num_workers=4)
PyTorch swin dataloader time: 0.043s (= 8.540s / 200, num_workers=4)
Relative speed: 0.646 (= 0.043s / 0.066s)

OneFlow swin dataloader time: 0.041s (= 8.211s / 200, num_workers=8)
PyTorch swin dataloader time: 0.022s (= 4.416s / 200, num_workers=8)
Relative speed: 0.538 (= 0.022s / 0.041s)

❌ OneFlow resnet50 time: 152.3ms (= 15232.0ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 159.9ms (= 15988.0ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
❌ Relative speed: 1.05 (= 159.9ms / 152.3ms)

OneFlow resnet50 time: 90.7ms (= 9072.0ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 103.1ms (= 10306.0ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.14 (= 103.1ms / 90.7ms)

OneFlow resnet50 time: 58.9ms (= 11786.1ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 77.7ms (= 15539.8ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.32 (= 77.7ms / 58.9ms)

OneFlow resnet50 time: 42.1ms (= 8412.8ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 72.3ms (= 14467.9ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.72 (= 72.3ms / 42.1ms)

OneFlow resnet50 time: 35.5ms (= 7096.6ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 69.7ms (= 13944.9ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.97 (= 69.7ms / 35.5ms)
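For readers of the report above, each line can be reproduced from the numbers it prints: the per-iteration time is the total time divided by the iteration count, and the relative speed is the PyTorch time divided by the OneFlow time, so values above 1 mean OneFlow was faster. The sketch below recomputes the first resnet50 entry; the function names are illustrative and not taken from the CI scripts.

```python
def per_iter_ms(total_ms: float, iters: int) -> float:
    """Average time per iteration in milliseconds."""
    return total_ms / iters


def relative_speed(pytorch_ms: float, oneflow_ms: float) -> float:
    """PyTorch time over OneFlow time; > 1 means OneFlow is faster."""
    return pytorch_ms / oneflow_ms


# First entry of the report: input_shape=[16, 3, 224, 224], 100 iterations.
oneflow_ms = per_iter_ms(14098.1, 100)
pytorch_ms = per_iter_ms(14654.0, 100)
ratio = relative_speed(pytorch_ms, oneflow_ms)
print(f"{oneflow_ms:.1f}ms {pytorch_ms:.1f}ms {ratio:.2f}")
```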

github-actions · 2023-02-24T03:11:51Z

CI failed when running job: cuda-misc. PR label automerge has been removed


github-actions · 2023-02-24T05:59:43Z

Code got formatted by CI. Please request CI again if you still want to have this PR merged. If the PR is from a forked repo, please download the patch files from the GitHub Actions web page and apply them locally.

github-actions · 2023-02-24T06:27:07Z

Speed stats:

GPU Name: GeForce GTX 1080 

❌ OneFlow resnet50 time: 141.0ms (= 14100.5ms / 100, input_shape=[16, 3, 224, 224])
PyTorch resnet50 time: 144.1ms (= 14412.8ms / 100, input_shape=[16, 3, 224, 224])
❌ Relative speed: 1.02 (= 144.1ms / 141.0ms)

OneFlow resnet50 time: 80.7ms (= 8070.7ms / 100, input_shape=[8, 3, 224, 224])
PyTorch resnet50 time: 85.0ms (= 8503.1ms / 100, input_shape=[8, 3, 224, 224])
✔️ Relative speed: 1.05 (= 85.0ms / 80.7ms)

OneFlow resnet50 time: 50.0ms (= 9998.2ms / 200, input_shape=[4, 3, 224, 224])
PyTorch resnet50 time: 55.7ms (= 11130.8ms / 200, input_shape=[4, 3, 224, 224])
✔️ Relative speed: 1.11 (= 55.7ms / 50.0ms)

OneFlow resnet50 time: 33.4ms (= 6688.9ms / 200, input_shape=[2, 3, 224, 224])
PyTorch resnet50 time: 43.0ms (= 8596.7ms / 200, input_shape=[2, 3, 224, 224])
✔️ Relative speed: 1.29 (= 43.0ms / 33.4ms)

OneFlow resnet50 time: 24.9ms (= 4975.8ms / 200, input_shape=[1, 3, 224, 224])
PyTorch resnet50 time: 37.5ms (= 7496.2ms / 200, input_shape=[1, 3, 224, 224])
✔️ Relative speed: 1.51 (= 37.5ms / 24.9ms)

OneFlow swin dataloader time: 0.237s (= 47.377s / 200, num_workers=1)
PyTorch swin dataloader time: 0.148s (= 29.675s / 200, num_workers=1)
Relative speed: 0.626 (= 0.148s / 0.237s)

OneFlow swin dataloader time: 0.072s (= 14.341s / 200, num_workers=4)
PyTorch swin dataloader time: 0.041s (= 8.124s / 200, num_workers=4)
Relative speed: 0.566 (= 0.041s / 0.072s)

OneFlow swin dataloader time: 0.039s (= 7.778s / 200, num_workers=8)
PyTorch swin dataloader time: 0.022s (= 4.478s / 200, num_workers=8)
Relative speed: 0.576 (= 0.022s / 0.039s)

❌ OneFlow resnet50 time: 152.4ms (= 15242.1ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 161.6ms (= 16159.0ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
❌ Relative speed: 1.06 (= 161.6ms / 152.4ms)

OneFlow resnet50 time: 91.3ms (= 9127.2ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 101.3ms (= 10128.5ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.11 (= 101.3ms / 91.3ms)

OneFlow resnet50 time: 59.4ms (= 11887.5ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 77.9ms (= 15583.3ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.31 (= 77.9ms / 59.4ms)

OneFlow resnet50 time: 42.5ms (= 8495.2ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 69.8ms (= 13957.3ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.64 (= 69.8ms / 42.5ms)

OneFlow resnet50 time: 35.8ms (= 7158.3ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 72.5ms (= 14495.2ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 2.02 (= 72.5ms / 35.8ms)

github-actions · 2023-02-24T06:37:26Z

View latest API docs preview at: https://staging.oneflow.info/docs/Oneflow-Inc/oneflow/pr/9884/

github-actions · 2023-02-25T11:54:32Z

CI failed when running job: cuda-module. PR label automerge has been removed


github-actions · 2023-02-25T22:01:26Z

Speed stats:

GPU Name: GeForce GTX 1080 

❌ OneFlow resnet50 time: 140.9ms (= 14093.8ms / 100, input_shape=[16, 3, 224, 224])
PyTorch resnet50 time: 144.5ms (= 14454.9ms / 100, input_shape=[16, 3, 224, 224])
❌ Relative speed: 1.03 (= 144.5ms / 140.9ms)

OneFlow resnet50 time: 80.6ms (= 8060.8ms / 100, input_shape=[8, 3, 224, 224])
PyTorch resnet50 time: 83.7ms (= 8374.7ms / 100, input_shape=[8, 3, 224, 224])
✔️ Relative speed: 1.04 (= 83.7ms / 80.6ms)

OneFlow resnet50 time: 49.7ms (= 9942.3ms / 200, input_shape=[4, 3, 224, 224])
PyTorch resnet50 time: 60.8ms (= 12156.7ms / 200, input_shape=[4, 3, 224, 224])
✔️ Relative speed: 1.22 (= 60.8ms / 49.7ms)

OneFlow resnet50 time: 33.8ms (= 6762.4ms / 200, input_shape=[2, 3, 224, 224])
PyTorch resnet50 time: 49.1ms (= 9816.7ms / 200, input_shape=[2, 3, 224, 224])
✔️ Relative speed: 1.45 (= 49.1ms / 33.8ms)

OneFlow resnet50 time: 24.4ms (= 4871.8ms / 200, input_shape=[1, 3, 224, 224])
PyTorch resnet50 time: 44.9ms (= 8988.1ms / 200, input_shape=[1, 3, 224, 224])
✔️ Relative speed: 1.84 (= 44.9ms / 24.4ms)

OneFlow swin dataloader time: 0.237s (= 47.396s / 200, num_workers=1)
PyTorch swin dataloader time: 0.149s (= 29.704s / 200, num_workers=1)
Relative speed: 0.627 (= 0.149s / 0.237s)

OneFlow swin dataloader time: 0.068s (= 13.693s / 200, num_workers=4)
PyTorch swin dataloader time: 0.042s (= 8.465s / 200, num_workers=4)
Relative speed: 0.618 (= 0.042s / 0.068s)

OneFlow swin dataloader time: 0.042s (= 8.476s / 200, num_workers=8)
PyTorch swin dataloader time: 0.022s (= 4.403s / 200, num_workers=8)
Relative speed: 0.519 (= 0.022s / 0.042s)

❌ OneFlow resnet50 time: 152.5ms (= 15250.2ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 162.3ms (= 16229.1ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
❌ Relative speed: 1.06 (= 162.3ms / 152.5ms)

OneFlow resnet50 time: 90.9ms (= 9094.0ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 102.0ms (= 10196.6ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.12 (= 102.0ms / 90.9ms)

OneFlow resnet50 time: 59.1ms (= 11814.9ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 78.3ms (= 15654.1ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.32 (= 78.3ms / 59.1ms)

OneFlow resnet50 time: 42.3ms (= 8457.1ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 70.2ms (= 14043.9ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.66 (= 70.2ms / 42.3ms)

OneFlow resnet50 time: 36.5ms (= 7292.4ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 69.7ms (= 13948.5ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.91 (= 69.7ms / 36.5ms)

github-actions · 2023-02-25T22:12:58Z

View latest API docs preview at: https://staging.oneflow.info/docs/Oneflow-Inc/oneflow/pr/9884/


daquexian added 4 commits February 19, 2023 16:38

init InterpretJob

6658783

Signed-off-by: daquexian <daquexian566@gmail.com>

support sd

69decc3

Signed-off-by: daquexian <daquexian566@gmail.com>

cache op exprs

0314b66

Signed-off-by: daquexian <daquexian566@gmail.com>

refine, add test

96addf4

Signed-off-by: daquexian <daquexian566@gmail.com>

daquexian added the feature, automerge, graph, and graph mode labels Feb 21, 2023

daquexian marked this pull request as ready for review February 21, 2023 09:19

daquexian requested review from chengtbf, strint and BBuf as code owners February 21, 2023 09:19

jackalcooper reviewed Feb 21, 2023

oneflow/core/job/job_interpreter.cpp (outdated, resolved)

jackalcooper reviewed Feb 21, 2023

oneflow/core/job/job_interpreter.cpp (outdated, resolved)

jackalcooper reviewed Feb 21, 2023

oneflow/core/job/job_interpreter.cpp (outdated, resolved)

daquexian added 2 commits February 22, 2023 16:17

Merge branch 'master' into job_vm

f543907

refine

f527293

Signed-off-by: daquexian <daquexian566@gmail.com>

daquexian force-pushed the job_vm branch from 8ba5fa8 to f527293 Compare February 22, 2023 13:08

jackalcooper approved these changes Feb 22, 2023


clackhan approved these changes Feb 23, 2023


Merge branch 'master' into job_vm

0e7097c

daquexian requested a review from oneflow-ci-bot February 23, 2023 04:20

strint reviewed Feb 23, 2023


daquexian and others added 2 commits February 23, 2023 12:45

add more comments

ed42968

Signed-off-by: daquexian <daquexian566@gmail.com>

auto format by CI

2179698

strint reviewed Feb 23, 2023


strint approved these changes Feb 23, 2023


daquexian and others added 2 commits February 23, 2023 13:18

rename dead_tensors -> outdated_tensors_after_op

fa3e6f1

Signed-off-by: daquexian <daquexian566@gmail.com>

auto format by CI

6a99a3c

auto format by CI

c0ec050

daquexian requested review from oneflow-ci-bot and removed request for oneflow-ci-bot February 23, 2023 15:00

github-actions bot removed the automerge label Feb 24, 2023

daquexian and others added 2 commits February 24, 2023 13:57

restore env var when test finishes

b4a18e6

Signed-off-by: daquexian <daquexian566@gmail.com>

auto format by CI

51846ce

daquexian mentioned this pull request Feb 24, 2023

symbolic shape #9902

Open

3 tasks

daquexian added the automerge label Feb 25, 2023

mergify bot added 3 commits February 25, 2023 06:43

Merge branch 'master' into job_vm

2904eaa

Merge branch 'master' into job_vm

21e8b3e

Merge branch 'master' into job_vm

b015649

github-actions bot removed the automerge label Feb 25, 2023

daquexian added the automerge label Feb 25, 2023

Merge branch 'master' into job_vm

26d5aaa

mergify bot merged commit 7df12c3 into master Feb 25, 2023

mergify bot deleted the job_vm branch February 25, 2023 22:46

rejoicesyc mentioned this pull request May 11, 2023

Global Interpreter #10048

Merged

rejoicesyc added a commit that referenced this pull request Jun 5, 2023

Global Interpreter (#10048)

89b6916

Running global nn.Graph by vm, following #9884 --------- Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: daquexian <daquexian566@gmail.com>
