DataLoader threads affinitization #6005
Conversation
Codecov Report
@@            Coverage Diff             @@
##           master    #6005      +/-   ##
==========================================
- Coverage   84.28%   84.27%    -0.01%
==========================================
  Files         362      362
  Lines       20382    20456       +74
==========================================
+ Hits        17178    17239       +61
- Misses       3204     3217       +13
Hi @JakubPietrakIntel, thanks for your hard work!
Hi @yanbing-j! Let me answer your questions.
This PR is a replacement for #5746 and implements a more complete solution that uses a context manager to facilitate the affinitization function. It also automatically retrieves information about the hyperthreading (HT) setting and NUMA core IDs.
In most cases covered by my benchmarks, running the DataLoader in the main process (num_workers=0) is worse than using separate processes for it. However, for some corner cases, e.g. a large hidden feature size with sparse data representation and HT off, I noticed that the main-process DataLoader performs better. This indicates a performance bottleneck that needs to be eliminated.
Initially, I performed some manual runs indicating that there will also be a performance boost when running with a smaller dataset. I will provide benchmark results for a smaller dataset, i.e. Reddit, soon.
Correct, in the majority of cases running on a single socket leads to better performance, because it eliminates remote memory access, which causes a significant memory bound. I will be investigating how to utilize dual-socket CPUs better; it will most likely require compute-thread affinitization and binding. This is also ongoing research.
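(For reference, one standard way to restrict a run to a single socket on Linux is `numactl`; this is an illustration of the single-socket setup described above, not part of this PR:)

```bash
# Bind both compute threads and memory allocations to socket 0,
# eliminating remote (cross-socket) memory accesses.
numactl --cpunodebind=0 --membind=0 python inference_benchmark.py
```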
This looks great, thanks for all the hard effort. Most comments are only super nits :)
@yanbing-j I am planning to develop this feature further to include compute-thread affinitization. When I have all the results ready, I will write a user-friendly guide on "CPU best practices for PyG". Please stay tuned.
The test results for loader affinitization are unstable, and I will need to investigate this case further before merging the PR. More info to follow.
The problem has been fixed. The issue originates from the fact that a DataLoader worker needs time to initialize completely in a parallel subprocess (with its own worker PID).
LGTM.
update `WorkerInitWrapper` description
@yanbing-j @mszarma @rusty1s Thank you all for the review!
Feature description:
DataLoader threads affinitization can be enabled with
NodeLoader.enable_cpu_affinity()
used as a context manager. It can affinitize workers to a user-provided list of CPU IDs or, by default, to the cores of NUMA node 0 starting at core 1, where core 0 is left out for running any other background threads (system-level, OMP, etc.). The affinitization is implemented with psutil's CPU-affinity method and applied to each worker at initialization time:
https://github.com/JakubPietrakIntel/pytorch_geometric/blob/28a4694a85914196962cb9363e53bbd33b5eebba/torch_geometric/loader/node_loader.py#L173
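A minimal usage sketch (the dataset and loader arguments below are placeholders, and NeighborLoader stands in as an example of a NodeLoader-based loader):

```python
from torch_geometric.loader import NeighborLoader

# Placeholder loader setup: `data` and all arguments are illustrative only.
loader = NeighborLoader(data, num_neighbors=[10, 10], batch_size=512,
                        num_workers=4)

# Pin the four DataLoader workers to cores 1-4; omitting `loader_cores`
# falls back to the default assignment on NUMA node 0, starting at core 1.
with loader.enable_cpu_affinity(loader_cores=[1, 2, 3, 4]):
    for batch in loader:
        ...  # run inference/training on the batch
```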
An additional helper function
get_numa_nodes()
has been added to utils. It is responsible for reading CPU core IDs from /sys/devices/system/node/
on Linux machines. For other operating systems, the standard psutil.cpu_count(logical=False)
will be applied.
https://github.com/JakubPietrakIntel/pytorch_geometric/blob/28a4694a85914196962cb9363e53bbd33b5eebba/torch_geometric/loader/utils.py#L210-L253
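A simplified sketch of the sysfs-reading idea (an approximation, not the PR's exact code; the real helper also has to account for hyperthreading sibling threads):

```python
import os
import re

import psutil


def get_numa_nodes_sketch():
    """Return a list of CPU id lists, one per NUMA node (simplified)."""
    numa_path = '/sys/devices/system/node/'
    if os.path.exists(numa_path):  # Linux exposing NUMA topology via sysfs
        nodes = []
        for node_dir in sorted(os.listdir(numa_path)):
            if re.fullmatch(r'node\d+', node_dir):
                # Each NUMA node directory contains one `cpuN` entry per CPU.
                cpus = [
                    int(entry[3:])
                    for entry in os.listdir(os.path.join(numa_path, node_dir))
                    if re.fullmatch(r'cpu\d+', entry)
                ]
                nodes.append(sorted(cpus))
        return nodes
    # Fallback for non-Linux systems: treat all physical cores as one node.
    return [list(range(psutil.cpu_count(logical=False)))]
```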
Benchmarking:
To test the feature, extensive benchmarks were run on a dual-socket system with Intel(R) Xeon(R) Gold 6348 CPUs @ 2.60GHz (ICX), 28 cores each, and 512 GB of DDR4 RAM. The script used to measure inference time was
inference_benchmark.py
with a GCN model and the ogbn-products dataset, over a range of hyperparameters, including sparse vs. dense data representation (spmm / scatter_add).
In the plots presented at the bottom of this page, the results are shown for the CPU setup without hyperthreading, which yielded overall better results than with HT on. In this scenario, three example affinitization configurations are compared against the baseline configuration.
Results discussion:
In the majority of cases, DataLoader threads affinitization yields a measurable improvement, up to 2x for a small hidden feature size and a large batch size. The inference time can be optimized further using compute-thread affinitization which, as of now, can only be done by setting additional OMP parameters prior to launching the Python script: limiting the number of threads with OMP_NUM_THREADS and pinning them to specific CPUs with GOMP_CPU_AFFINITY (see the CPU All and CPU1 cases, and the sketch below).
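For example (a sketch: the core split below is an assumption for a single 28-core socket, not a setting taken from this PR):

```bash
# Hypothetical split: reserve cores 0-3 for DataLoader workers and
# pin the remaining 24 compute threads to cores 4-27 before launch.
export OMP_NUM_THREADS=24
export GOMP_CPU_AFFINITY="4-27"
python inference_benchmark.py
```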
There is no single one-size-fits-all CPU configuration that can be applied to all models and architectures. The complexity of the problem arises from the mutual dependency between the performance of data-loading threads and compute threads, and from the hardware architecture of the CPU model. Therefore, further research will be performed to develop more user-friendly interfaces for thread management.
Further work:
Plots: [benchmark result plots for the SPMM and SCATTER configurations]