
Friction log - xgboost_synthetic on 0.6.2 RC #619

Closed
jlewi opened this issue Aug 16, 2019 · 18 comments

jlewi commented Aug 16, 2019

I tried running the xgboost example following the release demo script.
http://bit.ly/2ZgOE1v

The first problem I ran into is that the stock Jupyter image has an older version of fairing installed which doesn't have fire.

We should add a line to install from github

!pip3 install git+git://github.com/kubeflow/fairing.git@c6c075dece72135f5883abfe2a296894d74a2367


jlewi commented Aug 16, 2019

The next problem I hit is that instantiating the ModelServe class fails with a connection timeout. It looks like it's trying to create a client for the metadata server.

---------------------------------------------------------------------------
RemoteDisconnected                        Traceback (most recent call last)
/opt/conda/lib/python3.6/site-packages/urllib3/connectionpool.py in urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
    599                                                   body=body, headers=headers,
--> 600                                                   chunked=chunked)
    601 

/opt/conda/lib/python3.6/site-packages/urllib3/connectionpool.py in _make_request(self, conn, method, url, timeout, chunked, **httplib_request_kw)
    383                     # otherwise it looks like a programming error was the cause.
--> 384                     six.raise_from(e, None)
    385         except (SocketTimeout, BaseSSLError, SocketError) as e:

/opt/conda/lib/python3.6/site-packages/urllib3/packages/six.py in raise_from(value, from_value)

/opt/conda/lib/python3.6/site-packages/urllib3/connectionpool.py in _make_request(self, conn, method, url, timeout, chunked, **httplib_request_kw)
    379                 try:
--> 380                     httplib_response = conn.getresponse()
    381                 except Exception as e:

/opt/conda/lib/python3.6/http/client.py in getresponse(self)
   1330             try:
-> 1331                 response.begin()
   1332             except ConnectionError:

/opt/conda/lib/python3.6/http/client.py in begin(self)
    296         while True:
--> 297             version, status, reason = self._read_status()
    298             if status != CONTINUE:

/opt/conda/lib/python3.6/http/client.py in _read_status(self)
    265             # sending a valid response.
--> 266             raise RemoteDisconnected("Remote end closed connection without"
    267                                      " response")

RemoteDisconnected: Remote end closed connection without response

During handling of the above exception, another exception occurred:

ProtocolError                             Traceback (most recent call last)
<ipython-input-9-869aa6a0a691> in <module>
----> 1 ModelServe(model_file="mockup-model.dat").train()

<ipython-input-8-196919b5df30> in __init__(self, model_file)
     17 
     18         self.model = None
---> 19         self.exec = self.create_execution()
     20 
     21     def train(self):

<ipython-input-8-196919b5df30> in create_execution(self)
     86             workspace=workspace,
     87             run=r,
---> 88             description="execution for training xgboost-synthetic")

/opt/conda/lib/python3.6/site-packages/kfmd/metadata.py in __init__(self, name, workspace, run, description)
    151     response = self.workspace.client.create_execution(
    152         parent=self.EXECUTION_TYPE_NAME,
--> 153         body=self.serialized(),
    154     )
    155     self.id = response.execution.id

/opt/conda/lib/python3.6/site-packages/kfmd/openapi_client/api/metadata_service_api.py in create_execution(self, parent, body, **kwargs)
    398         """
    399         kwargs['_return_http_data_only'] = True
--> 400         return self.create_execution_with_http_info(parent, body, **kwargs)  # noqa: E501
    401 
    402     def create_execution_with_http_info(self, parent, body, **kwargs):  # noqa: E501

/opt/conda/lib/python3.6/site-packages/kfmd/openapi_client/api/metadata_service_api.py in create_execution_with_http_info(self, parent, body, **kwargs)
    491             _preload_content=local_var_params.get('_preload_content', True),
    492             _request_timeout=local_var_params.get('_request_timeout'),
--> 493             collection_formats=collection_formats)
    494 
    495     def create_execution_type(self, body, **kwargs):  # noqa: E501

/opt/conda/lib/python3.6/site-packages/kfmd/openapi_client/api_client.py in call_api(self, resource_path, method, path_params, query_params, header_params, body, post_params, files, response_type, auth_settings, async_req, _return_http_data_only, collection_formats, _preload_content, _request_timeout, _host)
    339                                    response_type, auth_settings,
    340                                    _return_http_data_only, collection_formats,
--> 341                                    _preload_content, _request_timeout, _host)
    342         else:
    343             thread = self.pool.apply_async(self.__call_api, (resource_path,

/opt/conda/lib/python3.6/site-packages/kfmd/openapi_client/api_client.py in __call_api(self, resource_path, method, path_params, query_params, header_params, body, post_params, files, response_type, auth_settings, _return_http_data_only, collection_formats, _preload_content, _request_timeout, _host)
    170             post_params=post_params, body=body,
    171             _preload_content=_preload_content,
--> 172             _request_timeout=_request_timeout)
    173 
    174         self.last_response = response_data

/opt/conda/lib/python3.6/site-packages/kfmd/openapi_client/api_client.py in request(self, method, url, query_params, headers, post_params, body, _preload_content, _request_timeout)
    384                                          _preload_content=_preload_content,
    385                                          _request_timeout=_request_timeout,
--> 386                                          body=body)
    387         elif method == "PUT":
    388             return self.rest_client.PUT(url,

/opt/conda/lib/python3.6/site-packages/kfmd/openapi_client/rest.py in POST(self, url, headers, query_params, post_params, body, _preload_content, _request_timeout)
    274                             _preload_content=_preload_content,
    275                             _request_timeout=_request_timeout,
--> 276                             body=body)
    277 
    278     def PUT(self, url, headers=None, query_params=None, post_params=None,

/opt/conda/lib/python3.6/site-packages/kfmd/openapi_client/rest.py in request(self, method, url, query_params, headers, body, post_params, _preload_content, _request_timeout)
    166                         preload_content=_preload_content,
    167                         timeout=timeout,
--> 168                         headers=headers)
    169                 elif headers['Content-Type'] == 'application/x-www-form-urlencoded':  # noqa: E501
    170                     r = self.pool_manager.request(

/opt/conda/lib/python3.6/site-packages/urllib3/request.py in request(self, method, url, fields, headers, **urlopen_kw)
     70             return self.request_encode_body(method, url, fields=fields,
     71                                             headers=headers,
---> 72                                             **urlopen_kw)
     73 
     74     def request_encode_url(self, method, url, fields=None, headers=None,

/opt/conda/lib/python3.6/site-packages/urllib3/request.py in request_encode_body(self, method, url, fields, headers, encode_multipart, multipart_boundary, **urlopen_kw)
    148         extra_kw.update(urlopen_kw)
    149 
--> 150         return self.urlopen(method, url, **extra_kw)

/opt/conda/lib/python3.6/site-packages/urllib3/poolmanager.py in urlopen(self, method, url, redirect, **kw)
    322             response = conn.urlopen(method, url, **kw)
    323         else:
--> 324             response = conn.urlopen(method, u.request_uri, **kw)
    325 
    326         redirect_location = redirect and response.get_redirect_location()

/opt/conda/lib/python3.6/site-packages/urllib3/connectionpool.py in urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
    636 
    637             retries = retries.increment(method, url, error=e, _pool=self,
--> 638                                         _stacktrace=sys.exc_info()[2])
    639             retries.sleep()
    640 

/opt/conda/lib/python3.6/site-packages/urllib3/util/retry.py in increment(self, method, url, response, error, _pool, _stacktrace)
    366             # Read retry?
    367             if read is False or not self._is_method_retryable(method):
--> 368                 raise six.reraise(type(error), error, _stacktrace)
    369             elif read is not None:
    370                 read -= 1

/opt/conda/lib/python3.6/site-packages/urllib3/packages/six.py in reraise(tp, value, tb)
    683             value = tp()
    684         if value.__traceback__ is not tb:
--> 685             raise value.with_traceback(tb)
    686         raise value
    687 

/opt/conda/lib/python3.6/site-packages/urllib3/connectionpool.py in urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
    598                                                   timeout=timeout_obj,
    599                                                   body=body, headers=headers,
--> 600                                                   chunked=chunked)
    601 
    602             # If we're going to release the connection in ``finally:``, then

/opt/conda/lib/python3.6/site-packages/urllib3/connectionpool.py in _make_request(self, conn, method, url, timeout, chunked, **httplib_request_kw)
    382                     # Remove the TypeError from the exception chain in Python 3;
    383                     # otherwise it looks like a programming error was the cause.
--> 384                     six.raise_from(e, None)
    385         except (SocketTimeout, BaseSSLError, SocketError) as e:
    386             self._raise_timeout(err=e, url=url, timeout_value=read_timeout)

/opt/conda/lib/python3.6/site-packages/urllib3/packages/six.py in raise_from(value, from_value)

/opt/conda/lib/python3.6/site-packages/urllib3/connectionpool.py in _make_request(self, conn, method, url, timeout, chunked, **httplib_request_kw)
    378             except TypeError:  # Python 3
    379                 try:
--> 380                     httplib_response = conn.getresponse()
    381                 except Exception as e:
    382                     # Remove the TypeError from the exception chain in Python 3;

/opt/conda/lib/python3.6/http/client.py in getresponse(self)
   1329         try:
   1330             try:
-> 1331                 response.begin()
   1332             except ConnectionError:
   1333                 self.close()

/opt/conda/lib/python3.6/http/client.py in begin(self)
    295         # read until we get a non-100 response
    296         while True:
--> 297             version, status, reason = self._read_status()
    298             if status != CONTINUE:
    299                 break

/opt/conda/lib/python3.6/http/client.py in _read_status(self)
    264             # Presumably, the server closed the connection before
    265             # sending a valid response.
--> 266             raise RemoteDisconnected("Remote end closed connection without"
    267                                      " response")
    268         try:

ProtocolError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response',))


jlewi commented Aug 16, 2019

The issue talking to the metadata server appears to be transient; after retrying a couple of times, it succeeds.

/cc @zhenghuiwang


jlewi commented Aug 16, 2019

The next error I hit is

cluster_builder = cluster.cluster.ClusterBuilder(registry=DOCKER_REGISTRY,
                                                 base_image=base_image,
                                                 namespace='kubeflow',
                                                 preprocessor=preprocessor,
                                                 pod_spec_mutators=[fairing.cloud.gcp.add_gcp_credentials_if_exists],
                                                 context_source=cluster.gcs_context.GCSContextSource())
cluster_builder.build()
Building image using cluster builder.
Creating docker context: /tmp/fairing_context_2p1gywdv
Converting build-train-deploy.ipynb to build-train-deploy.py
Creating entry point for the class name ModelServe
could not check for secret: (403)
Reason: Forbidden
HTTP response headers: HTTPHeaderDict({'Audit-Id': '48941d65-d645-428c-b9f6-303b8ef0a160', 'Content-Type': 'application/json', 'X-Content-Type-Options': 'nosniff', 'Date': 'Fri, 16 Aug 2019 17:39:19 GMT', 'Content-Length': '306'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"secrets is forbidden: User \"system:serviceaccount:kubeflow-jlewi:default-editor\" cannot list resource \"secrets\" in API group \"\" in the namespace \"kubeflow\"","reason":"Forbidden","details":{"kind":"secrets"},"code":403}

The problem is the hard-coded namespace. If we delete that line, the build runs in the namespace the notebook is in, and it works.
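For reference, this is roughly what "detect the notebook's own namespace" means in Kubernetes: an in-cluster pod can read its namespace from the standard service-account mount instead of hard-coding `namespace='kubeflow'`. This is a generic sketch of that pattern, not fairing's actual implementation; the function name is illustrative.

```python
# Sketch: auto-detect the pod's namespace instead of hard-coding it.
# In-cluster pods have their namespace mounted at this standard path.
NAMESPACE_FILE = "/var/run/secrets/kubernetes.io/serviceaccount/namespace"

def current_namespace(path=NAMESPACE_FILE, default="default"):
    """Return the pod's namespace, or `default` when running off-cluster."""
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        # Not running inside a cluster (or the mount is absent).
        return default
```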


jlewi commented Aug 16, 2019

The deployer job fails

Traceback (most recent call last):
  File "/app/build-train-deploy.py", line 27, in <module>
    import kfmd
ModuleNotFoundError: No module named 'kfmd'

I think we are missing kfmd from requirements.txt


jlewi commented Aug 16, 2019

I added kfmd to requirements.txt, but it turns out I also needed to edit the notebook: the preprocessor wasn't including requirements.txt in the files uploaded to the builder. So the kaniko context didn't contain it, and the Dockerfile skips requirements.txt when it doesn't exist.
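This failure mode (a Dockerfile silently skipping a missing file) is easy to guard against. Below is a hypothetical helper, not part of fairing, that fails fast when an expected file is absent from the build context directory before the build is submitted.

```python
# Hypothetical guard (not fairing's API): verify the docker build context
# contains the files the Dockerfile expects before kicking off the build.
from pathlib import Path

def check_build_context(context_dir, required=("requirements.txt",)):
    """Raise FileNotFoundError if any `required` file is absent."""
    present = {p.name for p in Path(context_dir).iterdir()}
    missing = [name for name in required if name not in present]
    if missing:
        raise FileNotFoundError(f"build context missing: {missing}")
```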


jlewi commented Aug 16, 2019

Building with kaniko gives an error

    copying build/lib/tests/test.py -> /opt/conda/lib/python3.6/site-packages/tests
    error: could not create '/opt/conda/lib/python3.6/site-packages/tests/test.py': Permission denied

    ----------------------------------------
Command "/opt/conda/bin/python -u -c "import setuptools, tokenize;__file__='/tmp/pip-install-ainvjic7/kfmd/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /tmp/pip-record-x5u7xsj0/install-record.txt --single-version-externally-managed --compile" failed with error code 1 in /tmp/pip-install-ainvjic7/kfmd/
You are using pip version 19.0.1, however version 19.2.2 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
error building image: error building stage: waiting for process to exit: exit status 1


jlewi commented Aug 16, 2019

I think this might be a permission issue inside the docker image; /opt/conda might be owned by jovyan, but the Dockerfile is trying to switch to root.

It also looks like it might be an issue with the base container image. I updated and simplified the docker image and that appears to fix it. Will send out a PR soon.


jlewi commented Aug 16, 2019

curl from the notebook appears to work:

!curl http://metadata-service.kubeflow:8080/api/v1alpha1/execution_types
!curl http://metadata-service.kubeflow.svc.cluster.local:8080/api/v1alpha1/execution_types


jlewi commented Aug 16, 2019

I added retries around connecting to the client, and that seems to have fixed the problem.

My suspicion is that since we are using Istio, we need to wait for the Istio sidecar to start.
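The retry idea can be sketched generically: keep probing the metadata service until the sidecar is routing traffic, and give up after a deadline. The names here are illustrative, not fairing's actual API.

```python
# Sketch of "retry until the Istio sidecar is ready": call `probe()` until
# it succeeds or `timeout_s` elapses. `probe` should raise on failure
# (e.g. a connection error) and return a value on success.
import time

def wait_for_service(probe, timeout_s=180, interval_s=5):
    deadline = time.time() + timeout_s
    last_err = None
    while time.time() < deadline:
        try:
            return probe()
        except Exception as err:  # retry on any failure until the deadline
            last_err = err
            time.sleep(interval_s)
    raise TimeoutError(f"service not reachable after {timeout_s}s") from last_err
```

With `timeout_s=180` this matches the "try to contact the metadata server for up to 3 minutes" behavior described in the commit below.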

jlewi added a commit to jlewi/examples that referenced this issue Aug 18, 2019
* Need to add kfmd to requirements.txt because the training code now uses
  kfmd to log data.

* The Dockerfile didn't build with kaniko; it looks like a permission problem
  trying to install python files into the conda directory. The problem appears
  to be fixed by not switching to user root.

* Update the base docker image to 1.13.

* Remove some references in the notebook to namespace, because the fairing
  code should now detect the namespace automatically and the notebook will no
  longer be running in the namespace kubeflow.

* When running training in a K8s job, the code will now try to contact the
  metadata server, but this can fail if the Istio sidecar hasn't started yet.
  So we need to wait for Istio to start; we do this by retrying the
  metadata server for up to 3 minutes.

* Add a lot more explanation in the notebook to explain what is happening.

* Related to kubeflow#619

jlewi commented Aug 18, 2019

/assign @jlewi


jhua18 commented Aug 19, 2019

@jlewi I encountered the same problem when running the following to train model locally

ModelServe(model_file="mockup-model.dat").train()

It appears intermittent and often starts working after several tries. Could it be a race condition, i.e. waiting for an available connection?

And similarly in the subsequent statement for local prediction:

(train_X, train_y), (test_X, test_y) = read_synthetic_input()

ModelServe().predict(test_X, None)

Even after the training statement passes, the predict line still fails intermittently, and can work after manual retries (or perhaps just because of the waiting period).

jhua18 commented Aug 19, 2019

I already tried updating fairing with the following at the beginning of the sample notebook:

!pip3 install -U fairing

The current fairing version:

Requirement already up-to-date: fairing in /opt/conda/lib/python3.6/site-packages (0.5.3)
...

k8s-ci-robot pushed a commit that referenced this issue Aug 19, 2019
(same commit message as above)

jlewi commented Aug 21, 2019

@jhua18 what is the failure mode for predict? That function just runs locally in the notebook, so it shouldn't depend on Istio or networking. Can you open a separate issue and try to include instructions to reproduce?


jlewi commented Aug 21, 2019

I retried the xgboost_synthetic example:
http://bit.ly/2ZgOE1v

The notebook now runs successfully to completion.

I'm still hitting two issues with Kubeflow setup using the 0.6.2 RC.

kubeflow/kubeflow#3900
kubeflow/kubeflow#3875

Will leave this issue open until those issues are fixed and I've verified the fix.


jhua18 commented Aug 22, 2019

@jlewi I pulled the latest sample code, which includes some recent changes (compared to what I pulled last week). I can confirm that it now runs to the end successfully.

In addition to the changes in the xgboost_synthetic sample, I also noticed differences in the fairing package, which the sample depends on.
Previously the sample used #!pip3 install fairing
and lately it pulls the same lib with !pip3 install git+git://github.com/kubeflow/fairing.git@c6c075dece72135f5883abfe2a296894d74a2367
Both get fairing-0.5.3. However, the first can cause a "module object is not callable" error when fairing starts building the docker image:

preprocessor = ConvertNotebookPreprocessorWithFire("ModelServe")

I looked into the package in the two notebook servers' environments and noticed significant differences in the code and the dependencies.
It's strange to see the same package version apparently coming from different releases; this might be a known issue.
[screenshot: KF-Example-xgboost-FairingVersionDiff-01]


jlewi commented Aug 23, 2019

@jhua18 It's because we are installing from master, and we only bump the version when we do a release.
https://github.com/kubeflow/fairing/blob/c6c075dece72135f5883abfe2a296894d74a2367/setup.py#L8

Could you open an issue in kubeflow/fairing? We should set the version on master to indicate it comes from master and not from a release.


jlewi commented Aug 26, 2019

Closing this. All the major issues have been addressed.
More detail in: http://bit.ly/2ZgOE1v

@jlewi jlewi closed this as completed Aug 26, 2019