Friction log - xgboost_synthetic on 0.6.2 RC #619
The next problem I hit is that instantiating the ModelServe class fails with a connection timeout. It looks like it's trying to create a client for the metadata server.
The issue talking to the metadata server appears to be transient; after retrying a couple of times it succeeds. /cc @zhenghuiwang
The next error I hit is
The problem is the namespace. If we delete that line, the job runs in the namespace the notebook is in, and it works.
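For context, in-cluster clients typically detect the current namespace from the service-account mount. A minimal sketch of that fallback (the `current_namespace` helper is illustrative, not fairing's actual detection code; the mount path is the standard Kubernetes one):

```python
def current_namespace(
    path="/var/run/secrets/kubernetes.io/serviceaccount/namespace",
):
    """Return the namespace this pod runs in, the way in-cluster clients
    usually detect it; fall back to 'default' when running outside a
    cluster (e.g. in a local notebook)."""
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return "default"
```

Hard-coding the namespace in the notebook defeats this detection, which is why deleting that line fixes the job.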
The deployer job fails
I think we are missing kfmd from requirements.txt.
I added kfmd to requirements.txt, but I also needed to edit the notebook because requirements.txt wasn't included by the preprocessor in the files uploaded to the builder. As a result, the kaniko context didn't include it, and the Dockerfile skips requirements.txt if it doesn't exist.
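The failure mode can be sketched in a few lines: any file missing from the set handed to the builder simply never reaches kaniko, and the Dockerfile's conditional install then skips it silently. (`context_files` is a hypothetical helper for illustration, not the fairing preprocessor API.)

```python
import os

def context_files(notebook, extra=("requirements.txt",)):
    """Collect files for the build context, warning about any expected
    file that is absent -- the Dockerfile silently skips
    requirements.txt when it is missing from the kaniko context."""
    files = [notebook]
    for name in extra:
        if os.path.exists(name):
            files.append(name)
        else:
            print(f"warning: {name} not found; it will be missing from the kaniko context")
    return files
```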
Building with kaniko gives an error.
I think this might be a permission issue inside the docker image; /opt/conda might be owned by jovyan, but the Dockerfile is trying to switch to root. It also looks like it might be an issue with the base container image. I updated and simplified the docker image and that appears to fix it. Will send out a PR soon.
Curl from the notebook appears to work.
I added retries around connecting to the client and that seems to have fixed the problem. My suspicion is that since we are using Istio, we need to wait for the Istio sidecar to start.
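The retry can be sketched as a plain timed loop (the function name and the 3-minute budget mirror the description in this thread; this is a sketch, not the exact fairing code):

```python
import time

def wait_for(connect, timeout_s=180, interval_s=5, sleep=time.sleep):
    """Keep calling `connect` until it succeeds or `timeout_s` elapses --
    e.g. wait for the Istio sidecar to come up before talking to the
    metadata server. Re-raises the last error once the deadline passes."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            return connect()
        except Exception:
            if time.monotonic() >= deadline:
                raise
            sleep(interval_s)
```

In the K8s-job case, `connect` would be the call that constructs the metadata-server client.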
* Need to add kfmd to requirements.txt because the training code now uses kfmd to log data.
* The Dockerfile didn't build with kaniko; it looks like a permission problem trying to install python files into the conda directory. The problem appears to be fixed by not switching to user root.
* Update the base docker image to 1.13.
* Remove some references in the notebook to the namespace, because the fairing code should now detect the namespace automatically and the notebook will no longer be running in the kubeflow namespace.
* When running training in a K8s job, the code will now try to contact the metadata server, but this can fail if the Istio sidecar hasn't started yet. So we need to wait for Istio to start; we do this by trying to contact the metadata server for up to 3 minutes.
* Add a lot more explanation in the notebook of what is happening.
* Related to kubeflow#619
/assign @jlewi |
@jlewi I encountered the same problem when running the following to train the model locally.
It appears intermittent and often starts working after several tries. Could it be a race condition, i.e. waiting for an available connection? The same happens in the subsequent statement for local prediction:
Even with the training statement passing, the predict line still fails intermittently, and may only work after manual retries (or perhaps just because of the waiting period).
I already tried updating fairing with the following at the beginning of the sample notebook:
and this is the current fairing version:
@jhua18 what is the failure mode for predict? That function is just running locally in a notebook so it shouldn't be dependent on ISTIO or networking. Can you open up a separate issue and try to include instructions to reproduce? |
I retried the xgboost_synthetic example: the notebook now runs successfully to completion. I'm still hitting two issues with the Kubeflow setup using the 0.6.2 RC: kubeflow/kubeflow#3900. Will leave this issue open until those issues are fixed and I've verified the fix.
@jlewi I pulled the latest sample code, which includes some recent changes (compared to what I pulled last week). I can confirm that it now runs to the end successfully. In addition to the changes in the xgboost_synthetic sample, I also noticed differences in the fairing package, which the sample depends on.
I looked into the package in two notebook servers' environments and noticed significant differences in the code and the dependencies.
@jhua18 It's because we are installing from master, and we only bump the version when we do a release. Do you want to open an issue in kubeflow/fairing? We should try to set the version on master to indicate it is from master and not a release.
Closing this. All the major issues have been addressed. |
I tried running the xgboost example following the release demo script.
http://bit.ly/2ZgOE1v
The first problem I ran into is that the stock Jupyter image has an older version of fairing installed which doesn't have fire.
We should add a line to install from GitHub.
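A guarded install along these lines would work as a notebook cell (the `ensure_module` helper is illustrative; in practice you might just run `pip install git+https://github.com/kubeflow/fairing.git` directly):

```python
import importlib
import subprocess
import sys

FAIRING_GIT = "git+https://github.com/kubeflow/fairing.git"

def ensure_module(module, pip_spec, install=None):
    """Try to import `module`; if that fails, pip-install `pip_spec`.
    Used here to get a fairing build that includes the `fire` dependency
    missing from the stock Jupyter image."""
    if install is None:
        install = lambda spec: subprocess.check_call(
            [sys.executable, "-m", "pip", "install", spec])
    try:
        importlib.import_module(module)
        return "present"
    except ImportError:
        install(pip_spec)
        return "installed"
```

Checking for `fire` (rather than fairing itself) distinguishes the stale preinstalled fairing from a fresh install off master.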