[Temporary] Multi-GPU predictor #3819

Closed
wants to merge 13 commits into from


trivialfis
Member

Please ignore this PR. I'm attempting to resolve the bug in #3738. Creating a temporary PR allows me to peek at Jenkins without polluting @canonizer's branch.

@hcho3
Collaborator

hcho3 commented Oct 23, 2018

@trivialfis FYI, now that you are a committer, you have the ability to interrupt any CI jobs as well. This may be helpful when you run into executor starvation on Jenkins.

@trivialfis
Member Author

@hcho3 I looked around the Jenkins interface; the only relevant button I saw is "Restart Jenkins: Build & Test", which doesn't do anything.

@hcho3
Collaborator

hcho3 commented Oct 23, 2018

@trivialfis Look for the square stop icon on the top right corner:
screen shot 2018-10-22 at 10 53 21 pm

@hcho3
Collaborator

hcho3 commented Oct 23, 2018

Also, the current setup involves 3 retries, so you may have to press the stop button multiple times.

@trivialfis
Member Author

@hcho3
screenshot from 2018-10-23 18-55-39

@hcho3
Collaborator

hcho3 commented Oct 23, 2018

@trivialfis Ugh, I thought all committers had admin rights on Jenkins, but it doesn't seem to be the case. Let me fix this.

@trivialfis
Member Author

trivialfis commented Oct 23, 2018

@hcho3 Thanks, that will be a great help!

@hcho3
Collaborator

hcho3 commented Oct 23, 2018

@trivialfis Looks like I had to manually add you to the admin list. Try it again.

@hcho3
Collaborator

hcho3 commented Oct 23, 2018

Actually, with a recent Jenkins update, the Stop button now properly works with the retry block. Pressing it once should stop all jobs.

@trivialfis
Member Author

@hcho3 I tried it again, it's "HTTP ERROR 404" now.
screenshot from 2018-10-23 20-38-30

@hcho3
Collaborator

hcho3 commented Oct 23, 2018

@trivialfis
Member Author

@hcho3 Yes, this one works.

@trivialfis
Member Author

@hcho3 But still no access permission (the stop button isn't there).

@hcho3
Collaborator

hcho3 commented Oct 23, 2018

Try stopping it and see if it works.

@trivialfis
Member Author

@hcho3 Oh, after a refresh now I see it. It works. Thanks! :)

@hcho3
Collaborator

hcho3 commented Oct 23, 2018

Good to hear that. You are now able to stop and restart any CI jobs. Now let's see if we can figure out why the multi-GPU test is failing.

@trivialfis
Member Author

@hcho3 I'm able to stop it, but not restart it. You can see from the log that a Java exception, "java.lang.NullPointerException: Cannot invoke method getBuildName() on null object", is thrown after the restart. But I will trigger the build by amending the commit, so it doesn't really matter.

Thanks a lot! Let me try to figure something out.

@hcho3
Collaborator

hcho3 commented Oct 23, 2018

@trivialfis That's curious. For now, let's focus on the multi-GPU test. The Jenkins setup is relatively new, and it can be improved in the future.

@hcho3
Collaborator

hcho3 commented Oct 23, 2018

@trivialfis FYI, there are two buttons for restart. Make sure to use the one at the top (without text), not the one at the bottom with text. I don't know why, but the bottom one never worked for me.
screen shot 2018-10-23 at 12 54 43 am

@trivialfis
Member Author

@hcho3 Got it.

@trivialfis
Member Author

@hcho3 Can AppVeyor be cancelled?

@hcho3
Collaborator

hcho3 commented Oct 23, 2018

@trivialfis Yes. Do you see this button on the top right?
screen shot 2018-10-23 at 1 02 31 am
If not, your account is not yet under the tqchen group. For this, we'll have to reach out to Tianqi.

@hcho3
Collaborator

hcho3 commented Oct 23, 2018

In the long run, I'm looking to migrate all Windows tests to Jenkins. You can run many tests in parallel, and each worker can be customized.

@trivialfis
Member Author

@hcho3 After logging out and back in, I can cancel AppVeyor builds now. :)

That's a good plan. I am also wondering whether we can reduce or combine some of the expensive tests. We can discuss that later on.

@trivialfis
Member Author

@hcho3 The last commit in this PR solved the problem. I will push it to the original PR tomorrow. Thanks for all the help!

@hcho3
Collaborator

hcho3 commented Oct 23, 2018

@trivialfis No problem. BTW, does your fix work when n_row = 8? Otherwise we should raise an error whenever [number of data rows] < [number of GPUs]. Alternatively, we can proactively reduce the number of GPUs to fit the number of data rows.
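
To make the two options concrete, here is a rough sketch in Python. It is not code from this PR or from XGBoost; `resolve_n_gpus`, `shard_sizes`, and the `strict` flag are hypothetical names used only for illustration.

```python
from typing import List


def resolve_n_gpus(n_rows: int, n_gpus: int, strict: bool = False) -> int:
    """Pick how many GPUs to use for a batch of n_rows data rows.

    Hypothetical helper: with strict=True it raises when there are fewer
    rows than GPUs; otherwise it shrinks the GPU count to the row count.
    """
    if n_rows < 1 or n_gpus < 1:
        raise ValueError("n_rows and n_gpus must both be positive")
    if n_rows >= n_gpus:
        return n_gpus
    if strict:
        raise ValueError(
            f"{n_rows} data rows cannot be split across {n_gpus} GPUs"
        )
    # Fewer rows than devices: use one device per row.
    return n_rows


def shard_sizes(n_rows: int, n_gpus: int) -> List[int]:
    """Split n_rows as evenly as possible; the first shards get the remainder."""
    base, extra = divmod(n_rows, n_gpus)
    return [base + (1 if i < extra else 0) for i in range(n_gpus)]


# Usage: clamp first, then shard, so no device receives an empty shard.
used = resolve_n_gpus(n_rows=3, n_gpus=8)
sizes = shard_sizes(3, used)
```

Clamping before sharding guarantees every device in use receives at least one row, which is the situation the check is meant to protect.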

@trivialfis
Member Author

@hcho3 Good point. Let me keep looking.

@trivialfis trivialfis closed this Oct 23, 2018
@hcho3
Collaborator

hcho3 commented Oct 23, 2018

@trivialfis I found out that Jenkins was showing 404 for anonymous users. I fixed the configuration to address this.

@trivialfis
Member Author

@hcho3 Glad to hear that. :)

@trivialfis trivialfis deleted the mgpu-predictor branch October 25, 2018 01:33
@lock lock bot locked as resolved and limited conversation to collaborators Jan 23, 2019