
nnet1: improving the GPU diagnostics #1532

Merged: 1 commit, Apr 5, 2017

Conversation

KarelVesely84
Contributor

  • we auto-detect 'compute capability' problems (these appear as the 'invalid device function' error),
  • we also provide guidelines on what to try before posting to the forum, and on which info to send to us,

@KarelVesely84
Contributor Author

I successfully tested this in 3 situations:

  • server without GPU,
  • GPU server with correct setup,
  • GPU server with a 'compute capability' mismatch,

All 3 situations were detected and handled correctly.

@@ -41,6 +53,34 @@ int main(int argc, char *argv[]) try {
CuDevice::Instantiate().SelectGpuId("yes");
std::cerr
<< "### HURRAY, WE GOT A CUDA GPU FOR COMPUTATION!!! ###"
<< std::endl << std::endl;
std::cerr
Contributor


There is a lot of std::cerr here.
Can you please use KALDI_LOG or KALDI_WARN or KALDI_ERR at least some of the time, so that
the version-number info and the name of the binary will be printed?
They work fine with multi-line error messages.

Contributor Author


Hi, okay, I will add a KALDI_LOG there at the very beginning.

(Otherwise I am consistently using std::cerr, which is necessary in the catch (...) { ... } part. I remember from lessons that it is better not to mix fprintf with output to std::cout/std::cerr, and fprintf is used in KALDI_LOG, KALDI_WARN and KALDI_ERR.)
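For reference, the mixing concern can be made concrete with a minimal stand-alone sketch (not Kaldi code): by default std::ios_base::sync_with_stdio(true) is in effect, so C stdio and C++ iostream writes to the same standard stream interleave in program order; only after sync_with_stdio(false) does their relative ordering become unspecified.

```cpp
#include <cstdio>
#include <iostream>

// Hypothetical illustration: interleaved stdio/iostream writes to stderr.
// With the default synchronization these appear in program order; after
// std::ios_base::sync_with_stdio(false) the ordering is no longer guaranteed,
// which is the hazard the comment above recalls.
void MixedLogging() {
  std::fprintf(stderr, "from fprintf (KALDI_LOG-style)\n");
  std::cerr << "from std::cerr (guideline-message style)\n";
}
```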

@danpovey
Contributor

danpovey commented Apr 4, 2017 via email

@KarelVesely84
Contributor Author

Yes, I know, the 3 Kaldi macros are what is usually used.

Here it is really an exceptional case. We need to be able to print the guideline messages in the catch (...) { ... } parts, in which we cannot use the standard Kaldi macros (that would throw an exception from within an exception handler).

Similarly, at the end of every binary we print to std::cerr from the exception handler:

/src$ tail -n7 gmmbin/gmm-est.cc
  } catch(const std::exception &e) {
    std::cerr << e.what() << '\n';
    return -1;
  }
}

I know it does not follow the usual policy, but in some sense it is consistent for this particular binary to print to std::cerr, if we need to print from the catch (...) { ... } parts. This is really the single case where it is like that...
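The control flow being argued for can be sketched as follows. This is a minimal stand-alone sketch, not Kaldi code: SelectGpu and RunGpuCheck are hypothetical stand-ins for CuDevice::Instantiate().SelectGpuId("yes") and the binary's main().

```cpp
#include <iostream>
#include <stdexcept>

// Hypothetical stand-in for GPU selection: the real call throws when setup
// fails (a 'compute capability' mismatch surfaces as the CUDA error string
// "invalid device function").
void SelectGpu(bool gpu_ok) {
  if (!gpu_ok) throw std::runtime_error("invalid device function");
}

// Sketch of the binary's flow: diagnostics go to std::cerr even inside the
// handler, because a KALDI_ERR-style macro would throw a fresh exception
// out of the handler instead of letting the guideline message be printed.
int RunGpuCheck(bool gpu_ok) {
  try {
    SelectGpu(gpu_ok);
    std::cerr << "### HURRAY, WE GOT A CUDA GPU FOR COMPUTATION!!! ###\n";
    return 0;
  } catch (const std::exception &e) {
    std::cerr << "### GPU SELECTION FAILED: " << e.what() << '\n'
              << "### Check the cuda-toolkit / gpu-driver setup "
                 "before posting to the forum.\n";
    return 1;
  }
}
```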

@danpovey
Contributor

danpovey commented Apr 4, 2017 via email

@KarelVesely84
Contributor Author

KarelVesely84 commented Apr 4, 2017

That is true, only KALDI_ERR throws an exception. Actually, I have now noticed that the e.what() message is always empty. If we were purists, we should not mix fprintf output with STL streams.

Well, I could rewrite it a bit... But, on the other hand, in this single case I like the messages uncluttered by the macro prefixes, because they will always appear directly in the main training log. You can compare the readability of the two variants of the log output:

Current:

LOG ([5.1.54~1-c2667]:main():cuda-gpu-available.cc:49) ...

### IS CUDA GPU AVAILABLE? 'dellgpu1.fit.vutbr.cz' ###
LOG ([5.1.54~1-c2667]:IsComputeExclusive():cu-device.cc:263) CUDA setup operating under Compute Exclusive Process Mode.
LOG ([5.1.54~1-c2667]:FinalizeActiveGpu():cu-device.cc:225) The active GPU is [2]: GeForce GTX 980      free:3932M, used:104M, total:4037M, free/total:0.974056 version 5.2
### HURRAY, WE GOT A CUDA GPU FOR COMPUTATION!!! ###

### Testing CUDA setup with a small computation (setup = cuda-toolkit + gpu-driver + kaldi):
### Test OK!

With kaldi macros:

LOG ([5.1.54~1-c2667]:main():cuda-gpu-available.cc:49) ### IS CUDA GPU AVAILABLE? 'dellgpu1.fit.vutbr.cz' ###
LOG ([5.1.54~1-c2667]:IsComputeExclusive():cu-device.cc:263) CUDA setup operating under Compute Exclusive Process Mode.
LOG ([5.1.54~1-c2667]:FinalizeActiveGpu():cu-device.cc:225) The active GPU is [2]: GeForce GTX 980      free:3932M, used:104M, total:4037M, free/total:0.974056 version 5.2
LOG ([5.1.54~1-c2667]:main():cuda-gpu-available.cc:68) ### HURRAY, WE GOT A CUDA GPU FOR COMPUTATION!!! ###

LOG ([5.1.54~1-c2667]:main():cuda-gpu-available.cc:123) ### Testing CUDA setup with a small computation (setup = cuda-toolkit + gpu-driver + kaldi):
LOG ([5.1.54~1-c2667]:main():cuda-gpu-available.cc:156) ### Test OK!

Is the second really more readable? If you think so, I will change it...

@danpovey
Contributor

danpovey commented Apr 4, 2017 via email

@KarelVesely84
Contributor Author

Okay, I'll do one more update :)

@KarelVesely84
Contributor Author

KarelVesely84 commented Apr 4, 2017

Hm, it seems that Travis failed on 'logistic-regression-test'.

Running logistic-regression-test .../bin/bash: line 1: 81061 Aborted                 (core dumped) ./$x > $x.testlog 2>&1
 1s... FAIL logistic-regression-test

For me, locally, it passed... Any idea what might be the problem?
(I'll do a rebase to re-run the Travis)

@KarelVesely84
Contributor Author

Good! Now the Travis test passed!

@danpovey danpovey merged commit 0157686 into kaldi-asr:master Apr 5, 2017
david-ryan-snyder pushed a commit to david-ryan-snyder/kaldi that referenced this pull request Apr 12, 2017
@KarelVesely84 KarelVesely84 deleted the nnet1_detect_gpu_problem branch May 17, 2017 08:29
Skaiste pushed a commit to Skaiste/idlak that referenced this pull request Sep 26, 2018