
DataLoader threads affinitization #6005

Merged: 36 commits merged into pyg-team:master from dl_affinity on Nov 25, 2022

Conversation

@JakubPietrakIntel (Contributor) commented Nov 18, 2022

Feature description:
DataLoader threads affinitization can be enabled with NodeLoader.enable_cpu_affinity() used as a context manager.
It can affinitize the workers to a user-provided list of CPU IDs, or by default to the cores of NUMA node 0 starting at core 1; core 0 is left free for any other background threads (system-level, OMP, etc.). The affinitization is implemented with a psutil call and applied to each worker_id at initialization time:
https://github.com/JakubPietrakIntel/pytorch_geometric/blob/28a4694a85914196962cb9363e53bbd33b5eebba/torch_geometric/loader/node_loader.py#L173
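A minimal usage sketch (the `loader_cores` argument name and the NeighborLoader setup are illustrative assumptions, not copied verbatim from the PR):

```python
from torch_geometric.loader import NeighborLoader

# `data` is assumed to be an already-loaded graph (e.g. ogbn-products):
loader = NeighborLoader(data, num_neighbors=[10, 10], batch_size=1024,
                        num_workers=4)

# Pin the 4 DataLoader workers to explicit CPU IDs; calling the context
# manager without arguments falls back to NUMA node 0 cores from core 1:
with loader.enable_cpu_affinity(loader_cores=[1, 2, 3, 4]):
    for batch in loader:
        ...  # sampling happens on the pinned worker processes
```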

An additional helper function get_numa_nodes() has been added to utils. It reads CPU core IDs from /sys/devices/system/node/ on Linux machines; on other operating systems, the standard psutil.cpu_count(logical=False) is applied instead.

https://github.com/JakubPietrakIntel/pytorch_geometric/blob/28a4694a85914196962cb9363e53bbd33b5eebba/torch_geometric/loader/utils.py#L210-L253
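For illustration, a simplified sketch of reading NUMA core IDs from sysfs on Linux; it mirrors the idea of get_numa_nodes() but is not the PR's exact implementation, and the helper name numa_node_cores is hypothetical:

```python
import glob
import os
import re

def numa_node_cores():
    """Map NUMA node id -> sorted physical core ids, read from sysfs (Linux)."""
    nodes = {}
    for node_path in glob.glob('/sys/devices/system/node/node[0-9]*'):
        node_id = int(re.findall(r'\d+', os.path.basename(node_path))[0])
        cores = set()
        for cpu_path in glob.glob(os.path.join(node_path, 'cpu[0-9]*')):
            # 'thread_siblings_list' groups the hyper-threads of one physical
            # core, e.g. "0,56" or "0-1"; keeping the first entry yields one
            # logical CPU per physical core:
            siblings = os.path.join(cpu_path, 'topology/thread_siblings_list')
            with open(siblings) as f:
                cores.add(int(f.read().split(',')[0].split('-')[0]))
        nodes[node_id] = sorted(cores)
    return nodes

print(numa_node_cores())  # e.g. {0: [0, 1, ..., 27], 1: [28, ..., 55]}
```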

Benchmarking:
To test the feature, extensive benchmarks were run on a dual-socket system with two Intel(R) Xeon(R) Gold 6348 CPUs @ 2.60GHz (ICX), 28 cores each, and 512 GB of DDR4 RAM. Inference time was measured with inference_benchmark.py using a GCN model and the ogbn-products dataset over a range of hyperparameters:

  • hidden feature size [16, 128]
  • number of layers [2, 3]
  • hyperthreading (HT) on/off
  • sparse or dense tensor matrix multiplication (spmm/scatter_add)
  • number of DataLoader workers [1, 2, 4, 8, 16]
  • batch_size [512, 1024, 2048, 4096, 8192]

The plots at the bottom of this page show the results for the CPU setup without hyperthreading, which yielded overall better results than with HT on. In this scenario, three example affinitization configurations are compared with the baseline config:

  • DL - DataLoader affinitization only
  • DL + CPU All - compute threads affinitized to all cores not used by the DataLoader
  • DL + CPU1 - a decreased number of compute threads running on the same NUMA node/CPU socket as the DataLoader

Results discussion:
In the majority of cases, DL threads affinitization yields a measurable improvement, up to 2x for small hidden feature sizes and large batch sizes. Inference time can be optimized further with compute threads affinitization, which as of now can only be done by setting additional OMP parameters prior to launching the Python script: limiting the number of threads with OMP_NUM_THREADS and pinning them to specific CPUs with GOMP_CPU_AFFINITY (see the CPU All and CPU1 cases, and the sketch below).
There is no single one-size-fits-all CPU configuration that can be applied to all models and architectures. The complexity of the problem arises from the mutual dependency between the performance of the data-loading and compute threads, and from the HW architecture of the CPU model. Therefore, further research will be performed to develop more user-friendly interfaces for thread management.
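As a rough illustration of the environment-variable approach: the core ranges below are assumptions for the 28-cores-per-socket machine above (core 0 free, cores 1-4 for DataLoader workers, cores 5-27 for compute). The variables must be set before OpenMP initializes, which is why the benchmarks set them in the shell before launching the script; setting them at the very top of the script, before importing torch, is shown here only as a sketch:

```python
import os

# Hypothetical layout: 23 compute threads pinned to cores 5-27.
# Shell equivalent used for the benchmarks:
#   OMP_NUM_THREADS=23 GOMP_CPU_AFFINITY="5-27" python inference_benchmark.py
os.environ['OMP_NUM_THREADS'] = '23'
os.environ['GOMP_CPU_AFFINITY'] = '5-27'

import torch  # noqa: E402  (imported after setting the OpenMP variables)
```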

Further work:

  • Research into the possibility of compute threads affinitization from within the PyTorch multiprocessing library, without external OMP commands.
  • Increasing dual-socket CPU utilization and performance with multiple DataLoader workers acting in parallel.

Plots:
[SPMM plots: spmm-feat16HT0, spmm-feat128HT0]
[SCATTER plots: scatter-feat16HT0, scatter-feat128HT0]

@codecov bot commented Nov 18, 2022

Codecov Report

Merging #6005 (da35072) into master (d6a8f67) will decrease coverage by 0.01%.
The diff coverage is 83.11%.

@@            Coverage Diff             @@
##           master    #6005      +/-   ##
==========================================
- Coverage   84.28%   84.27%   -0.01%     
==========================================
  Files         362      362              
  Lines       20382    20456      +74     
==========================================
+ Hits        17178    17239      +61     
- Misses       3204     3217      +13     
Impacted Files Coverage Δ
torch_geometric/profile/profile.py 36.27% <25.00%> (-0.47%) ⬇️
torch_geometric/loader/base.py 75.00% <66.66%> (-3.58%) ⬇️
torch_geometric/loader/node_loader.py 87.77% <85.71%> (-2.42%) ⬇️
torch_geometric/loader/utils.py 84.42% <92.00%> (+1.08%) ⬆️


@JakubPietrakIntel force-pushed the dl_affinity branch 3 times, most recently from 7e3c640 to 28a4694 on November 18, 2022 16:15
@JakubPietrakIntel changed the title from "DataLoader threads affinitization" to "[feature] DataLoader threads affinitization" on Nov 18, 2022
@JakubPietrakIntel force-pushed the dl_affinity branch 2 times, most recently from 2b10686 to 6b76995 on November 18, 2022 17:43
@JakubPietrakIntel marked this pull request as ready for review on November 18, 2022 18:22
@yanbing-j (Contributor)

Hi @JakubPietrakIntel, thanks for your hard work!
I have some questions about this PR.

  1. Is this PR a replacement for "Added cpu_worker_affinity_cores attribute to node_loader init" #5746? What's the difference between them?
  2. This PR requires at least 1 DL worker. https://github.com/pyg-team/pytorch_geometric/pull/6005/files#diff-7f627486416900d3cbaa74871470419e5a15c044e1902acd4e0921fbcc684755R200-R203 Is this reasonable? And what about the performance of num_workers = 0?
  3. Did you test on other, smaller datasets, and did you get the same improvement as on ogbn-products?
  4. From the plots, CPU1 gets the smallest inference time relatively. So running on a single socket can perform better than running on two sockets, right?

@JakubPietrakIntel (Contributor, Author) commented Nov 21, 2022

Hi @yanbing-j! Let me answer your questions.

  1. Is this PR a replacement for "Added cpu_worker_affinity_cores attribute to node_loader init" #5746? What's the difference between them?

This PR is a replacement for #5746 and implements a more complete solution that uses a contextmanager to expose the affinitization function. It also automatically retrieves information about the HT setting and the NUMA core IDs.

  2. This PR requires at least 1 DL worker. https://github.com/pyg-team/pytorch_geometric/pull/6005/files#diff-7f627486416900d3cbaa74871470419e5a15c044e1902acd4e0921fbcc684755R200-R203 Is this reasonable? And what about the performance of num_workers = 0?

In most cases included in my benchmarks, the DL operating in the main process (num_workers=0) is worse than using separate processes for the DL. However, for some corner cases, e.g. large hidden feature size with sparse data representation and HT off, I've noticed that the main-process DL performs better. This indicates a perf bottleneck that needs to be eliminated.

  3. Did you test on other, smaller datasets, and did you get the same improvement as on ogbn-products?

Initially, I performed some manual runs indicating that there will also be a perf boost when running with a smaller dataset. I will provide benchmark results for a smaller dataset, i.e. Reddit, soon.

  4. From the plots, CPU1 gets the smallest inference time relatively. So running on a single socket can perform better than running on two sockets, right?

Correct, in the majority of cases running on a single socket leads to better performance, because it eliminates remote memory access, which causes a significant memory bottleneck. I will be investigating how to better utilize dual-socket CPUs; it will most likely require compute threads affinitization & binding. This is also ongoing research.

@rusty1s (Member) left a comment:

This looks great, thanks for all the hard effort. Most comments are only super nit :)

Review threads (now resolved) on benchmark/inference/inference_benchmark.py and torch_geometric/loader/node_loader.py.
@JakubPietrakIntel (Contributor, Author)

@yanbing-j I am planning to develop this feature further to include compute threads affinitization. When I have all the results ready, I will write a user-friendly guide on "CPU best practices for PyG". Please stay tuned.

@JakubPietrakIntel (Contributor, Author)

The test results for loader affinitization are unstable and I will need to investigate this case further before merging the PR. More info to follow.

@JakubPietrakIntel (Contributor, Author)

> The test results for loader affinitization are unstable and I will need to investigate this case further before merging the PR. More info to follow.

The problem has been fixed by adding sleep(1) in https://github.com/JakubPietrakIntel/pytorch_geometric/blob/a705d39d5f48b156c53be71bf5bedf74cec3b3de/test/loader/test_neighbor_loader.py#L594

The issue originates from the fact that a DataLoader worker needs time to initialize completely in a parallel subprocess (with its own worker PID). When the subprocess.Popen(taskset ...) check runs before the worker subprocess has finished initializing, it reads out a wrong affinity value. This PR is ready to be merged.
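For context, a minimal sketch of the psutil-based pinning that runs per worker; the pin_worker hook, the toy dataset, and the core layout are illustrative assumptions (the PR wires this up inside the loader rather than via a user-supplied worker_init_fn):

```python
import psutil
import torch
from torch.utils.data import DataLoader

def pin_worker(worker_id: int) -> None:
    # Runs inside the freshly spawned worker process and pins it to a single
    # core; core 0 is skipped so system/background threads keep a free core.
    psutil.Process().cpu_affinity([worker_id + 1])

# A stand-in dataset; any indexable torch Dataset works the same way:
dataset = torch.arange(10_000)
loader = DataLoader(dataset, batch_size=1024, num_workers=4,
                    worker_init_fn=pin_worker)
```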

@mszarma (Contributor) left a comment:

LGTM.

Review thread (now resolved) on torch_geometric/loader/base.py.
@JakubPietrakIntel (Contributor, Author)

@yanbing-j @mszarma @rusty1s Thank you all for the review!
This feature is now merged.

@JakubPietrakIntel merged commit 30ed977 into pyg-team:master on Nov 25, 2022