
nnet1: improving the GPU diagnostics #1532

Merged: 1 commit, Apr 5, 2017

Conversation

KarelVesely84
Contributor

  • we auto-detect 'compute capability' problems (these appear as the 'invalid device function' error),
  • we also provide guidelines on what to try before posting to the forum, and on which info to send to us,

@KarelVesely84
Contributor Author

I successfully tested this in 3 situations:

  • server without GPU,
  • GPU server with correct setup,
  • GPU server with a 'compute capability' mismatch,

All 3 situations were detected and handled correctly.

@@ -41,6 +53,34 @@ int main(int argc, char *argv[]) try {
CuDevice::Instantiate().SelectGpuId("yes");
std::cerr
<< "### HURRAY, WE GOT A CUDA GPU FOR COMPUTATION!!! ###"
<< std::endl << std::endl;
std::cerr
Contributor


There is a lot of std::cerr here.
Can you please use KALDI_LOG or KALDI_WARN or KALDI_ERR at least some of the time, so that
the version-number info and the name of the binary will be printed?
They work fine with multi-line error messages.

Contributor Author


Hi, okay, I will add a KALDI_LOG there at the very beginning.

(Otherwise I am consistently using std::cerr, which is necessary in the catch (...) { ... } part. I remember from lessons that it is better not to mix fprintf with output to std::cout/std::cerr, and fprintf is used in KALDI_LOG, KALDI_WARN and KALDI_ERR.)
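For reference, the mixing concern can be made concrete with a minimal stand-alone sketch (not Kaldi code): by default std::ios_base::sync_with_stdio(true) is in effect, so C stdio and C++ iostream writes to the same standard stream interleave in program order; only after sync_with_stdio(false) does their relative ordering become unspecified.

```cpp
#include <cstdio>
#include <iostream>

// Hypothetical illustration: interleaved stdio/iostream writes to stderr.
// With the default synchronization these appear in program order; after
// std::ios_base::sync_with_stdio(false) the ordering is no longer guaranteed,
// which is the hazard the comment above recalls.
void MixedLogging() {
  std::fprintf(stderr, "from fprintf (KALDI_LOG-style)\n");
  std::cerr << "from std::cerr (guideline-message style)\n";
}
```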

@danpovey
Contributor

danpovey commented Apr 4, 2017 via email

@KarelVesely84
Contributor Author

Yes, I know, the 3 Kaldi macros are what is usually used.

Here it is really an exceptional case. We need to be able to print the guideline messages in the catch (...) { ... } parts, in which we cannot use the standard Kaldi macros (that would throw an exception from within an exception handler).

Similarly, at the end of every binary we print to std::cerr from the exception handler:

/src$ tail -n7 gmmbin/gmm-est.cc
  } catch(const std::exception &e) {
    std::cerr << e.what() << '\n';
    return -1;
  }
}

I know it does not follow the usual policy, but in some sense it is consistent for this particular binary to print to std::cerr, if we need to print from the catch (...) { ... } parts. This is really the single case where it is like that...
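The control flow being argued for can be sketched as follows. This is a minimal stand-alone sketch, not Kaldi code: SelectGpu and RunGpuCheck are hypothetical stand-ins for CuDevice::Instantiate().SelectGpuId("yes") and the binary's main().

```cpp
#include <iostream>
#include <stdexcept>

// Hypothetical stand-in for GPU selection: the real call throws when setup
// fails (a 'compute capability' mismatch surfaces as the CUDA error string
// "invalid device function").
void SelectGpu(bool gpu_ok) {
  if (!gpu_ok) throw std::runtime_error("invalid device function");
}

// Sketch of the binary's flow: diagnostics go to std::cerr even inside the
// handler, because a KALDI_ERR-style macro would throw a fresh exception
// out of the handler instead of letting the guideline message be printed.
int RunGpuCheck(bool gpu_ok) {
  try {
    SelectGpu(gpu_ok);
    std::cerr << "### HURRAY, WE GOT A CUDA GPU FOR COMPUTATION!!! ###\n";
    return 0;
  } catch (const std::exception &e) {
    std::cerr << "### GPU SELECTION FAILED: " << e.what() << '\n'
              << "### Check the cuda-toolkit / gpu-driver setup "
                 "before posting to the forum.\n";
    return 1;
  }
}
```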

@danpovey
Contributor

danpovey commented Apr 4, 2017 via email

@KarelVesely84
Contributor Author

KarelVesely84 commented Apr 4, 2017

That is true, only KALDI_ERR throws an exception. Actually, I have now noticed that the e.what() message is always empty. If we were purists, we should not mix fprintf output with STL streams.

Well, I could rewrite it a bit... But, on the other hand, in this single case I like the messages uncluttered by the macro prefixes, because they will always appear directly in the main training log. You can compare the readability of the two variants of the log output:

Current:

LOG ([5.1.54~1-c2667]:main():cuda-gpu-available.cc:49) ...

### IS CUDA GPU AVAILABLE? 'dellgpu1.fit.vutbr.cz' ###
LOG ([5.1.54~1-c2667]:IsComputeExclusive():cu-device.cc:263) CUDA setup operating under Compute Exclusive Process Mode.
LOG ([5.1.54~1-c2667]:FinalizeActiveGpu():cu-device.cc:225) The active GPU is [2]: GeForce GTX 980      free:3932M, used:104M, total:4037M, free/total:0.974056 version 5.2
### HURRAY, WE GOT A CUDA GPU FOR COMPUTATION!!! ###

### Testing CUDA setup with a small computation (setup = cuda-toolkit + gpu-driver + kaldi):
### Test OK!

With kaldi macros:

LOG ([5.1.54~1-c2667]:main():cuda-gpu-available.cc:49) ### IS CUDA GPU AVAILABLE? 'dellgpu1.fit.vutbr.cz' ###
LOG ([5.1.54~1-c2667]:IsComputeExclusive():cu-device.cc:263) CUDA setup operating under Compute Exclusive Process Mode.
LOG ([5.1.54~1-c2667]:FinalizeActiveGpu():cu-device.cc:225) The active GPU is [2]: GeForce GTX 980      free:3932M, used:104M, total:4037M, free/total:0.974056 version 5.2
LOG ([5.1.54~1-c2667]:main():cuda-gpu-available.cc:68) ### HURRAY, WE GOT A CUDA GPU FOR COMPUTATION!!! ###

LOG ([5.1.54~1-c2667]:main():cuda-gpu-available.cc:123) ### Testing CUDA setup with a small computation (setup = cuda-toolkit + gpu-driver + kaldi):
LOG ([5.1.54~1-c2667]:main():cuda-gpu-available.cc:156) ### Test OK!

Is the second really more readable? If you think so, I will change it...

@danpovey
Contributor

danpovey commented Apr 4, 2017 via email

@KarelVesely84
Contributor Author

Okay, I'll do one more update :)

@KarelVesely84
Contributor Author

KarelVesely84 commented Apr 4, 2017

Hm, it seems that Travis failed on 'logistic-regression-test'.

Running logistic-regression-test .../bin/bash: line 1: 81061 Aborted                 (core dumped) ./$x > $x.testlog 2>&1
 1s... FAIL logistic-regression-test

For me, locally, it passed... Any idea what might be the problem?
(I'll do a rebase to re-run the Travis)

@KarelVesely84
Contributor Author

Good! Now the Travis test passed!

@danpovey danpovey merged commit 0157686 into kaldi-asr:master Apr 5, 2017
david-ryan-snyder pushed a commit to david-ryan-snyder/kaldi that referenced this pull request Apr 12, 2017
@KarelVesely84 KarelVesely84 deleted the nnet1_detect_gpu_problem branch May 17, 2017 08:29
Skaiste pushed a commit to Skaiste/idlak that referenced this pull request Sep 26, 2018