
DataLoader threads affinitization #6005

Merged: 36 commits merged into pyg-team:master from dl_affinity on Nov 25, 2022

Conversation

@JakubPietrakIntel (Contributor) commented Nov 18, 2022

Feature description:
DataLoader threads affinitization can be enabled with NodeLoader.enable_cpu_affinity() used as a context manager.
It can affinitize the workers to a user-provided list of CPU IDs, or by default to the cores of NUMA node 0 starting at core 1; core 0 is left free for any other background threads (system-level, OMP, etc.). The affinitization is implemented with a psutil call and applied to each worker_id at initialization time:
https://github.com/JakubPietrakIntel/pytorch_geometric/blob/28a4694a85914196962cb9363e53bbd33b5eebba/torch_geometric/loader/node_loader.py#L173
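A minimal usage sketch (the `loader_cores` argument name and the NeighborLoader setup are illustrative assumptions, not copied verbatim from the PR):

```python
from torch_geometric.loader import NeighborLoader

# `data` is assumed to be an already-loaded graph (e.g. ogbn-products):
loader = NeighborLoader(data, num_neighbors=[10, 10], batch_size=1024,
                        num_workers=4)

# Pin the 4 DataLoader workers to explicit CPU IDs; calling the context
# manager without arguments falls back to NUMA node 0 cores from core 1:
with loader.enable_cpu_affinity(loader_cores=[1, 2, 3, 4]):
    for batch in loader:
        ...  # sampling happens on the pinned worker processes
```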

An additional helper function get_numa_nodes() has been added to utils. It reads CPU core IDs from /sys/devices/system/node/ on Linux machines; on other operating systems, the standard psutil.cpu_count(logical=False) is applied instead.

https://github.com/JakubPietrakIntel/pytorch_geometric/blob/28a4694a85914196962cb9363e53bbd33b5eebba/torch_geometric/loader/utils.py#L210-L253
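For illustration, a simplified sketch of reading NUMA core IDs from sysfs on Linux; it mirrors the idea of get_numa_nodes() but is not the PR's exact implementation, and the helper name numa_node_cores is hypothetical:

```python
import glob
import os
import re

def numa_node_cores():
    """Map NUMA node id -> sorted physical core ids, read from sysfs (Linux)."""
    nodes = {}
    for node_path in glob.glob('/sys/devices/system/node/node[0-9]*'):
        node_id = int(re.findall(r'\d+', os.path.basename(node_path))[0])
        cores = set()
        for cpu_path in glob.glob(os.path.join(node_path, 'cpu[0-9]*')):
            # 'thread_siblings_list' groups the hyper-threads of one physical
            # core, e.g. "0,56" or "0-1"; keeping the first entry yields one
            # logical CPU per physical core:
            siblings = os.path.join(cpu_path, 'topology/thread_siblings_list')
            with open(siblings) as f:
                cores.add(int(f.read().split(',')[0].split('-')[0]))
        nodes[node_id] = sorted(cores)
    return nodes

print(numa_node_cores())  # e.g. {0: [0, 1, ..., 27], 1: [28, ..., 55]}
```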

Benchmarking:
To test the feature, extensive benchmarks were run on a dual-socket system with two Intel(R) Xeon(R) Gold 6348 CPUs @ 2.60GHz (ICX), 28 cores each, and 512 GB of DDR4 RAM. Inference time was measured with inference_benchmark.py using a GCN model and the ogbn-products dataset over a range of hyperparameters:

  • hidden feature size [16, 128]
  • number of layers [2, 3]
  • hyperthreading (HT) on/off
  • sparse or dense tensor matrix multiplication (spmm/scatter_add)
  • number of DataLoader workers [1, 2, 4, 8, 16]
  • batch_size [512, 1024, 2048, 4096, 8192]

The plots at the bottom of this page show the results for the CPU setup without hyperthreading, which yielded overall better results than with HT on. In this scenario, three example affinitization configurations are compared with the baseline config:

  • DL - DataLoader affinitization only
  • DL + CPU All - compute threads affinitized to all cores not used by the DataLoader
  • DL + CPU1 - a decreased number of compute threads running on the same NUMA node/CPU socket as the DataLoader

Results discussion:
In the majority of cases, DL threads affinitization yields a measurable improvement, up to 2x for small hidden feature sizes and large batch sizes. Inference time can be optimized further with compute threads affinitization, which as of now can only be done by setting additional OMP parameters prior to launching the Python script: limiting the number of threads with OMP_NUM_THREADS and pinning them to specific CPUs with GOMP_CPU_AFFINITY (see the CPU All and CPU1 cases, and the sketch below).
There is no single one-size-fits-all CPU configuration that can be applied to all models and architectures. The complexity of the problem arises from the mutual dependency between the performance of the data-loading and compute threads, and from the HW architecture of the CPU model. Therefore, further research will be performed to develop more user-friendly interfaces for thread management.
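As a rough illustration of the environment-variable approach: the core ranges below are assumptions for the 28-cores-per-socket machine above (core 0 free, cores 1-4 for DataLoader workers, cores 5-27 for compute). The variables must be set before OpenMP initializes, which is why the benchmarks set them in the shell before launching the script; setting them at the very top of the script, before importing torch, is shown here only as a sketch:

```python
import os

# Hypothetical layout: 23 compute threads pinned to cores 5-27.
# Shell equivalent used for the benchmarks:
#   OMP_NUM_THREADS=23 GOMP_CPU_AFFINITY="5-27" python inference_benchmark.py
os.environ['OMP_NUM_THREADS'] = '23'
os.environ['GOMP_CPU_AFFINITY'] = '5-27'

import torch  # noqa: E402  (imported after setting the OpenMP variables)
```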

Further work:

  • Research into the possibility of compute threads affinitization from within the PyTorch multiprocessing library, without external OMP commands.
  • Increasing dual-socket CPU utilization and performance with multiple DataLoader workers acting in parallel.

Plots:
[SPMM plots: spmm-feat16HT0, spmm-feat128HT0]
[SCATTER plots: scatter-feat16HT0, scatter-feat128HT0]

@codecov bot commented Nov 18, 2022

Codecov Report

Merging #6005 (da35072) into master (d6a8f67) will decrease coverage by 0.01%.
The diff coverage is 83.11%.

@@            Coverage Diff             @@
##           master    #6005      +/-   ##
==========================================
- Coverage   84.28%   84.27%   -0.01%     
==========================================
  Files         362      362              
  Lines       20382    20456      +74     
==========================================
+ Hits        17178    17239      +61     
- Misses       3204     3217      +13     
Impacted Files Coverage Δ
torch_geometric/profile/profile.py 36.27% <25.00%> (-0.47%) ⬇️
torch_geometric/loader/base.py 75.00% <66.66%> (-3.58%) ⬇️
torch_geometric/loader/node_loader.py 87.77% <85.71%> (-2.42%) ⬇️
torch_geometric/loader/utils.py 84.42% <92.00%> (+1.08%) ⬆️


@JakubPietrakIntel force-pushed the dl_affinity branch 3 times, most recently from 7e3c640 to 28a4694 on November 18, 2022 16:15
@JakubPietrakIntel changed the title from "DataLoader threads affinitization" to "[feature] DataLoader threads affinitization" on Nov 18, 2022
@JakubPietrakIntel force-pushed the dl_affinity branch 2 times, most recently from 2b10686 to 6b76995 on November 18, 2022 17:43
@JakubPietrakIntel marked this pull request as ready for review on November 18, 2022 18:22
@yanbing-j (Contributor)

Hi @JakubPietrakIntel, thanks for your hard work!
I have some questions about this PR.

  1. Is this PR a replacement for "Added cpu_worker_affinity_cores attribute to node_loader init" #5746? What's the difference between them?
  2. This PR requires at least 1 DL worker. https://github.com/pyg-team/pytorch_geometric/pull/6005/files#diff-7f627486416900d3cbaa74871470419e5a15c044e1902acd4e0921fbcc684755R200-R203 Is this reasonable? And what about the performance of num_workers = 0?
  3. Did you test on other, smaller datasets, and did you get the same improvement as on ogbn-products?
  4. From the plots, CPU1 gets the smallest inference time relatively. So running on a single socket can perform better than running on two sockets, right?

@JakubPietrakIntel (Contributor, Author) commented Nov 21, 2022

Hi @yanbing-j! Let me answer your questions.

  1. Is this PR a replacement for "Added cpu_worker_affinity_cores attribute to node_loader init" #5746? What's the difference between them?

This PR is a replacement for #5746 and implements a more complete solution that uses a contextmanager to expose the affinitization function. It also automatically retrieves information about the HT setting and the NUMA core IDs.

  2. This PR requires at least 1 DL worker. https://github.com/pyg-team/pytorch_geometric/pull/6005/files#diff-7f627486416900d3cbaa74871470419e5a15c044e1902acd4e0921fbcc684755R200-R203 Is this reasonable? And what about the performance of num_workers = 0?

In most cases included in my benchmarks, the DL operating in the main process (num_workers=0) is worse than using separate processes for the DL. However, for some corner cases, e.g. large hidden feature size with sparse data representation and HT off, I've noticed that the main-process DL performs better. This indicates a perf bottleneck that needs to be eliminated.

  3. Did you test on other, smaller datasets, and did you get the same improvement as on ogbn-products?

Initially, I performed some manual runs indicating that there will also be a perf boost when running with a smaller dataset. I will provide benchmark results for a smaller dataset, i.e. Reddit, soon.

  4. From the plots, CPU1 gets the smallest inference time relatively. So running on a single socket can perform better than running on two sockets, right?

Correct, in the majority of cases running on a single socket leads to better performance, because it eliminates remote memory access, which causes a significant memory bottleneck. I will be investigating how to better utilize dual-socket CPUs; it will most likely require compute threads affinitization & binding. This is also ongoing research.

@rusty1s (Member) left a comment:

This looks great, thanks for all the hard effort. Most comments are only super nit :)

Review threads (now resolved) on benchmark/inference/inference_benchmark.py and torch_geometric/loader/node_loader.py.
@JakubPietrakIntel (Contributor, Author)

@yanbing-j I am planning to develop this feature further to include compute threads affinitization. When I have all the results ready, I will write a user-friendly guide on "CPU best practices for PyG". Please stay tuned.

@JakubPietrakIntel (Contributor, Author)

The test results for loader affinitization are unstable and I will need to investigate this case further before merging the PR. More info to follow.

@JakubPietrakIntel (Contributor, Author)

> The test results for loader affinitization are unstable and I will need to investigate this case further before merging the PR. More info to follow.

The problem has been fixed by adding sleep(1) in https://github.com/JakubPietrakIntel/pytorch_geometric/blob/a705d39d5f48b156c53be71bf5bedf74cec3b3de/test/loader/test_neighbor_loader.py#L594

The issue originates from the fact that a DataLoader worker needs time to initialize completely in a parallel subprocess (with its own worker PID). When the subprocess.Popen(taskset ...) check runs before the worker subprocess has finished initializing, it reads out a wrong affinity value. This PR is ready to be merged.
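For context, a minimal sketch of the psutil-based pinning that runs per worker; the pin_worker hook, the toy dataset, and the core layout are illustrative assumptions (the PR wires this up inside the loader rather than via a user-supplied worker_init_fn):

```python
import psutil
import torch
from torch.utils.data import DataLoader

def pin_worker(worker_id: int) -> None:
    # Runs inside the freshly spawned worker process and pins it to a single
    # core; core 0 is skipped so system/background threads keep a free core.
    psutil.Process().cpu_affinity([worker_id + 1])

# A stand-in dataset; any indexable torch Dataset works the same way:
dataset = torch.arange(10_000)
loader = DataLoader(dataset, batch_size=1024, num_workers=4,
                    worker_init_fn=pin_worker)
```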

@mszarma (Contributor) left a comment:

LGTM.

Review thread (now resolved) on torch_geometric/loader/base.py.
@JakubPietrakIntel (Contributor, Author)

@yanbing-j @mszarma @rusty1s Thank you all for the review!
This feature is now merged.

@JakubPietrakIntel merged commit 30ed977 into pyg-team:master on Nov 25, 2022