[SYSTEMDS-2951] Multi-GPU Support for End-to-End ML Pipelines #2050
Conversation
src/main/java/org/apache/sysds/runtime/controlprogram/caching/CacheableData.java
Thanks, @WDRshadow, for initiating the project.
Thank you for your comment. My partner @KexingLi22 is writing the test classes; we will see them soon. For DNN testing, we were faced with the awkward situation of not having enough suitable GPUs for testing. As I mentioned above, newer graphics cards cannot run with JCuda.
Thanks for clarifying. Unfortunately, at this point we cannot provide a setup. Once you are done with the project, I can run some performance tests along with our performance test suites. But during the development period, it is not feasible to try every change on our shared node.
We got a dual RTX 2080 Ti server and tested the scripts there.
Thanks. You do not have to optimize all NN workloads for multi-GPU. Just implementing robust parfor support is sufficient for this project.
Thanks for the suggestion. It will be helpful for @KexingLi22 when writing the test instances. There is no doubt that SystemDS uses multiple GPUs for the parfor computation. We have used two ways to prove this:
We will demonstrate both in our test code.
Thanks for your suggestion, @phaniarnab. We have written a test class MultiGPUTest.java with a single-GPU test case and a multi-GPU test case. Everything works well: the execution time with a single GPU is 35 s 121 ms, and with multiple GPUs it is 27 s 378 ms. Following the advice from @WDRshadow, I also tried to add a logger instance to both ParForBody and GPUContext to trace the threads and the GPUContexts, and I have already added this to log4j.properties:

```properties
# Enable detailed logging for specific classes
log4j.logger.org.apache.sysds.runtime.controlprogram.parfor.ParForBody=DEBUG
```

But when I run the test DML script with the parfor function, none of the expected log output shows up. How can I solve this problem?
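For reference, a common reason class-level DEBUG entries never appear is that no appender is configured to pass DEBUG events through, or the logger name does not match the class's fully qualified name. A minimal `log4j.properties` sketch (the appender name and pattern here are illustrative, not taken from the repository; package paths should be checked against the source tree, and the class must actually emit `LOG.debug(...)` calls):

```properties
# Root logger at INFO, attached to a console appender.
log4j.rootLogger=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{ISO8601} %-5p [%t] %c: %m%n

# Class-specific DEBUG logging; the logger name must match the class's
# fully qualified name exactly.
log4j.logger.org.apache.sysds.runtime.controlprogram.parfor.ParForBody=DEBUG
```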
Thanks. The numbers do not look very good. Train just once and write the model to disk. In a separate script, read the model and infer the test instances within a parfor loop. Here is an example script [1]. You can even use a randomly initialized model, as we are not measuring the accuracy here. I expect at least a 2x improvement. Vary the test size (i.e., the number of iterations of the parfor loop) from 10k to 100k. First focus on the development, unit testing, and experiments. The logger can be delayed. Instead, extend the ParForStatistics class to report the number of GPUs used by the parfor and other relevant details. These will be printed when -stats is passed.
Codecov Report
Attention: Patch coverage is
Additional details and impacted files:

```
@@             Coverage Diff              @@
##               main     #2050      +/-  ##
============================================
- Coverage     68.84%    68.82%    -0.02%
- Complexity    40711     40756       +45
============================================
  Files          1440      1440
  Lines        161565    161693      +128
  Branches      31418     31450       +32
============================================
+ Hits         111232    111292       +60
- Misses        41258     41346       +88
+ Partials       9075      9055       -20
```

View full report in Codecov by Sentry.
```java
if (multiGPUs) {
	assert extractedNumThreads > 1 : "Test failed: _numThreads is not greater than 1";
} else {
	assert extractedNumThreads == 1 : "Test failed: _numThreads is not equal to 1";
}
```
How does this assertion confirm the use of multiple GPUs?
src/test/scripts/gpu/GPUTest.dml
```
parfor(i in 1:iters) {
  beg = ((i-1) * batch_size) %% N + 1
  end = min(N, beg + batch_size - 1)
  X_batch = images[beg:end,]
  y_batch = labels[beg:end,]

  pred = eff::netPredict(X_batch, model, 1, 28, 28)
  partial_accuracies[i,1] = mean(rowIndexMax(pred) == rowIndexMax(y_batch))
}
```
This is good. Run it from 10k to 100k mini-batches and plot the execution time. Compare single-GPU parfor and multi-GPU parfor. For one of the data points (e.g., 10k batches), also report the CPU time. Use all the available CPU cores.
…STEMDS-2951-dev-batch
@phaniarnab We have implemented test cases based on
Test environment:
@WDRshadow, thanks for putting the numbers here. The speedup from 2 GPUs is way less than I expected. Can you explain why the speedup is not consistently 2x? If you are scoring n images, then each GPU gets n/2 images, which should lead to a 2x speedup. I do not anticipate any additional overhead for two GPUs in this use case.
Thanks. Your assumptions are inaccurate. This time is the total execution time, which includes exactly the same training process before the execution of the parfor loop.
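The gap can be explained with simple arithmetic: if only the inference phase scales across GPUs while training runs once on a single device, the end-to-end speedup is bounded well below 2x. A small Amdahl's-law-style illustration with hypothetical timings (not the measured numbers from this thread):

```python
def overall_speedup(t_train: float, t_inf: float, n_gpus: int = 2) -> float:
    """End-to-end speedup when only the inference phase scales across GPUs."""
    return (t_train + t_inf) / (t_train + t_inf / n_gpus)

# Hypothetical split: 25 s of unparallelized training, 10 s of inference.
print(round(overall_speedup(25.0, 10.0), 3))  # well below the ideal 2.0
```

As the training share shrinks (e.g., by timing only the parfor loop), the observed speedup approaches the ideal n_gpus.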
Okay. In that case, try one of the two options: (1) write the model to disk, and create a separate dml script for inference where you read the model and immediately start the parfor loop. You can find plenty of read/write examples in the test scripts and the reproducibility scripts I shared with you. (2) use the time() method before and after the parfor and report only the inference time. You can find an example of using time() here: https://github.com/damslab/reproducibility/blob/master/vldb2022-UPLIFT-p2528/FTBench/systemds/T1.dml For either option, make sure the intermediates are already materialized before the loop starts. The SystemDS compiler sometimes delays operations until they are used. You can print the sum of a matrix to force materialization.
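A minimal DML sketch of option (2), with the loop body elided and names taken from the test script above; the division by 1e9 assumes time() returns nanoseconds, as in the linked T1.dml example:

```
# Force materialization of the inputs before timing starts; printing an
# aggregate prevents the compiler from delaying the reads into the loop.
print(sum(images))
print(sum(labels))

t_start = time()
parfor(i in 1:iters) {
  # ... batch extraction and eff::netPredict(...) as before ...
}
t_end = time()
print("parfor inference time [s]: " + ((t_end - t_start) / 1e9))
```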
@phaniarnab We have changed our code and tested again. The time now only includes
Test environment:
Comments: From the table it can be seen that the
Okay. Thanks, @WDRshadow, @KexingLi22.
The GPU driver and the CUDA version are shared by all GPUs in the same device. So, yes, both GPUs are set up with CUDA version
Looks like all tests are passing now.
Main Updates:

- Fixed a bug in `ParForProgramBlock` where the GPU memory is modified during the process of freeing. Multi-GPU can now be used to accelerate the `parfor` function and other functions that use multiple workers and threads.
- Fixed a bug so that `_numThreads` and the actual number of threads are resolved correctly when multiple GPU devices are present but only a single GPU is allowed (`sysds.gpu.availableGPUs=1`), letting `parfor` functions run with multiple workers and multiple threads.

Other bugs in the multi-GPU process:

- In `GPUContext`, with `cudnnHandle`, `cublasHandle`, and `cusparseHandle` under JCuda version `10.2.0`, the native code freezes when execution enters the JCuda code, and the program cannot continue. This error happens easily in the "4070+1080" dual-graphics-card test environment, though sometimes it works fine. The bug is not present in JCuda version `11.8.0`. It is presumed to be a JCuda issue and cannot be fixed by SystemDS.
- JCuda `10.2.0` does not support GPUs with the Ampere architecture or higher (A100, H100 or products of the same or a later period, RTX 30 and 40 series or higher).

Other bugs outside the multi-GPU process we found:

- `scripts/nn/examples/AttentionExample.dml` cannot be run with even one GPU. The error message is `RuntimeException -- Unsupported operator: MAP`. We found that a function (with a high probability the `map` function) passes an operator `_map` to the `TernaryOp` class, where it is not categorized as a GPU operation.

TODO List:

- Confirm that multi-GPU `parfor` is actually being implemented.