This repository has been archived by the owner on Jan 31, 2022. It is now read-only.

Use json logging to log information about the issue being processed when exceptions occur #77

Merged: 5 commits into kubeflow:master on Jan 19, 2020

Conversation

@jlewi (Contributor) commented Dec 25, 2019

Use json logging to log information about the issue being processed when exceptions occur

related to: #70 Combine repo specific and universal model


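For reference, a minimal sketch of the kind of structured logging the title describes, using only the Python standard library; the field names (repo_owner, repo_name, issue_num) and values are illustrative, not the PR's actual schema:

import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record, including selected extra fields, as one JSON line."""

    def format(self, record):
        payload = {
            "severity": record.levelname,
            "message": record.getMessage(),
        }
        # Copy the illustrative issue fields if the caller attached them via extra=.
        for key in ("repo_owner", "repo_name", "issue_num"):
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("label_microservice")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

try:
    raise RuntimeError("prediction failed")
except RuntimeError:
    # logger.exception records the traceback; extra= carries the issue context.
    logger.exception(
        "Exception while processing issue",
        extra={"repo_owner": "kubeflow", "repo_name": "kubeflow", "issue_num": 4601},
    )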

@jlewi (Contributor, Author) commented Dec 25, 2019

Generating inference requests with the universal model is currently failing with the stack trace shown below.

My suspicion is that there is a bug in the code for the universal model and that I'm not properly managing the TF graph.

I wonder if the graph is getting overwritten when we load the repo specific models?

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/flask/app.py", line 2463, in __call__
    return self.wsgi_app(environ, start_response)
  File "/usr/local/lib/python3.6/dist-packages/flask/app.py", line 2449, in wsgi_app
    response = self.handle_exception(e)
  File "/usr/local/lib/python3.6/dist-packages/flask/app.py", line 1866, in handle_exception
    reraise(exc_type, exc_value, tb)
  File "/usr/local/lib/python3.6/dist-packages/flask/_compat.py", line 39, in reraise
    raise value
  File "/usr/local/lib/python3.6/dist-packages/flask/app.py", line 2446, in wsgi_app
    response = self.full_dispatch_request()
  File "/usr/local/lib/python3.6/dist-packages/flask/app.py", line 1951, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/usr/local/lib/python3.6/dist-packages/flask/app.py", line 1820, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/usr/local/lib/python3.6/dist-packages/flask/_compat.py", line 39, in reraise
    raise value
  File "/usr/local/lib/python3.6/dist-packages/flask/app.py", line 1949, in full_dispatch_request
    rv = self.dispatch_request()
  File "/usr/local/lib/python3.6/dist-packages/flask/app.py", line 1935, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/py/label_microservice/app.py", line 83, in predict
    predictions = model.predict_issue_labels(title, text)
  File "/py/label_microservice/universal_kind_label_model.py", line 76, in predict_issue_labels
    probs = self.model.predict(x=[vec_body, vec_title]).tolist()[0]
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training.py", line 908, in predict
    use_multiprocessing=use_multiprocessing)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training_arrays.py", line 723, in predict
    callbacks=callbacks)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training_arrays.py", line 394, in model_iteration
    batch_outs = f(ins_batch)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/backend.py", line 3476, in __call__
    run_metadata=self.run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1472, in __call__
    run_metadata_ptr)
tensorflow.python.framework.errors_impl.FailedPreconditionError: Error while reading resource variable dense_5/kernel from Container: localhost. This could mean that the variable was uninitialized. Not found: Resource localhost/dense_5/kernel/N10tensorflow3VarE does not exist.
	 [[{{node dense_5/MatMul/ReadVariableOp}}]]


@hamelsmu (Member) commented:

@jlewi regarding the stack trace, I vaguely remember encountering this the first time around with Issue Label Bot. Some things that helped me:

  1. Storing the graph in the application state and initializing the default graph (example)

  2. Doing inference within the default graph context (example)

I am not 100% sure this will resolve the problem, but it's worth a try.
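For reference, a minimal sketch of that pattern under TF 1.x-style graph handling (which the stack trace above suggests); the model path and function signature here are illustrative, not the actual app code:

import tensorflow as tf
from tensorflow import keras

# Load the model once at startup and remember the graph it was loaded into.
# The path here is a placeholder, not the actual model location.
model = keras.models.load_model("universal_kind_model.h5")
graph = tf.compat.v1.get_default_graph()


def predict_issue_labels(vec_title, vec_body):
    # Run inference inside the graph captured at load time, so a request
    # handled on a different thread doesn't hit an uninitialized default graph.
    with graph.as_default():
        return model.predict(x=[vec_body, vec_title]).tolist()[0]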

@@ -0,0 +1,24 @@
"""The models packages defines wrappers around different models."""

Reviewer comment (Member):

Consider using abstract classes to define python interfaces. Not required, but might help catch errors later
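For illustration, a small sketch of the abstract-class idea, assuming a predict_issue_labels(title, text) interface like the one in the stack trace above; the class names are hypothetical:

import abc


class IssueLabelModel(abc.ABC):
    """Hypothetical interface for the model wrappers in this package."""

    @abc.abstractmethod
    def predict_issue_labels(self, title: str, text: str) -> dict:
        """Return a mapping of label name to predicted probability."""


class UniversalKindLabelModel(IssueLabelModel):
    def predict_issue_labels(self, title: str, text: str) -> dict:
        # A real implementation would embed the text and call the Keras model.
        return {"kind/bug": 0.9}


# A subclass that forgets to implement predict_issue_labels cannot be
# instantiated; abc raises a TypeError at construction time.
model = UniversalKindLabelModel()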

results = {}
results.update(left)

for label, probability in right.items():
@hamelsmu (Member) commented Dec 25, 2019:

Consider using python groupby - example:

from itertools import groupby

dict1 = {'bug': .9,
         'feature': .74,
         'comment': .12}

dict2 = {'jupyter': .2,
         'feature': .8,
         'docs': .45}

combined_data = list(dict1.items()) + list(dict2.items())

combined_predictions = dict([max(v) for k,v in groupby(combined_data, key=lambda x: x[0])])

assert combined_predictions == {'bug': 0.9, 
                               'feature': 0.8,
                               'comment': 0.12, 
                               'docs': 0.45,  
                               'jupyter': 0.2}

Reviewer comment (Member):

[image]

@jlewi (Contributor, Author) replied:

Thanks. I initially tried it out, but then I realized that we need to sort the items before calling groupby. So it seems like it would be less efficient, because now we have to do a sort as opposed to just going through both lists, which is linear in time.
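For reference, a sketch of the linear merge described here, assuming two {label: probability} dicts; it keeps the higher probability per label in a single pass, without sorting:

def combine_predictions(left, right):
    """Merge two {label: probability} dicts, keeping the higher probability per label."""
    combined = dict(left)
    for label, probability in right.items():
        combined[label] = max(probability, combined.get(label, 0.0))
    return combined


assert combine_predictions(
    {"bug": 0.9, "feature": 0.74, "comment": 0.12},
    {"jupyter": 0.2, "feature": 0.8, "docs": 0.45},
) == {"bug": 0.9, "feature": 0.8, "comment": 0.12, "jupyter": 0.2, "docs": 0.45}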

@jlewi (Contributor, Author) commented Dec 25, 2019

It looks like the problem was that the flask server was using threads, and that causes problems; setting threaded=False appears to solve it.
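If the fix is the built-in Flask server's threading option, it would look roughly like this (host and port are placeholders):

from flask import Flask

app = Flask(__name__)

if __name__ == "__main__":
    # Handle requests on a single thread so every call into the TF model
    # runs in the same thread where the graph and session were created.
    app.run(host="0.0.0.0", port=8080, threaded=False)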

@jlewi (Contributor, Author) commented Dec 26, 2019

Now with the embedding service I'm getting an exception when parsing the data.

Exception is:
[issue-embedding-server-854c887f4c-9r42j app] ERROR|2019-12-26T17:52:40|Exception occurred in process_dict signal only works in main thread|/flask_app/inference.py|121|

Looks like this might be due to enabling debugging in the flask server.
https://stackoverflow.com/questions/53522052/flask-app-valueerror-signal-only-works-in-main-thread

The existing code had disabled debug mode and had a comment about the code not working in debug mode.

I had enabled debug mode to try to get the advantage of file sync with skaffold. To do that we need to retrigger app reloading.
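A common workaround along the lines of the linked Stack Overflow thread is to keep debug mode but skip the reloader, which is what installs the signal handler; note this also gives up the automatic reloading mentioned above:

from flask import Flask

app = Flask(__name__)

if __name__ == "__main__":
    # use_reloader=False avoids "signal only works in main thread" when the
    # server is started off the main thread, at the cost of auto-reload.
    app.run(debug=True, use_reloader=False)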

@jlewi (Contributor, Author) commented Dec 26, 2019

I think we need to use the older version of textacy (0.7.1); it looks like that one defines RE_EMAIL, but the newer one (0.9.1) does not.
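If this ends up as a runtime guard rather than just a requirements pin, a minimal sketch might look like the following; pkg_resources is used only to read the installed version, and the version bound is taken from the comment above:

import pkg_resources

# The preprocessing code relies on textacy's RE_EMAIL, which (per the comment
# above) exists in 0.7.1 but was dropped by 0.9.1.
installed = pkg_resources.get_distribution("textacy").version
if not installed.startswith("0.7."):
    raise RuntimeError(
        "textacy==%s is installed; pin textacy==0.7.1, which still defines RE_EMAIL"
        % installed
    )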

@hamelsmu (Member) commented Dec 26, 2019

@jlewi question about this PR - I understand that you are combining the predictions of both models. I thought that the global model was obsolete b/c of Issue Templates? Curious as to where you are thinking of taking this .... (or perhaps I'll wait until this PR is out of draft). 🙇

On a side note, in case you are interested: if you are trying to compare the probabilities from different models, you usually want to calibrate the probabilities first so they can be compared (even if they come from the same class of model). Resources: (1) an explanation of calibration and (2) isotonic regression, which can be used to calibrate. Just mentioning this; it's not absolutely critical.
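For reference, a toy sketch of the isotonic-regression calibration being suggested, using scikit-learn; the held-out scores and correctness labels are made up:

import numpy as np
from sklearn.isotonic import IsotonicRegression

# Toy held-out data: raw model scores and whether the predicted label was correct.
raw_scores = np.array([0.10, 0.30, 0.45, 0.60, 0.80, 0.95])
was_correct = np.array([0, 0, 1, 0, 1, 1])

# Fit a monotonic mapping from raw scores to calibrated probabilities.
calibrator = IsotonicRegression(out_of_bounds="clip")
calibrator.fit(raw_scores, was_correct)

# Scores from different models calibrated this way live on a comparable scale.
print(calibrator.predict(np.array([0.74, 0.80])))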

@jlewi (Contributor, Author) commented Dec 27, 2019

@hamelsmu even with the issue templates we still get a lot of issues which aren't being labeled e.g.

kubeflow/kubeflow#4601
kubeflow/kubeflow#4602

Looks like it might be because the person filing the bug is modifying the template.

So the primary goal of this PR is to re-enable the universal model for issue kind in repositories where we have repo-specific models enabled.

The larger goal is that we'd like to start using ML to auto-predict:

  • Issue kind
  • Issue area

This PR will hopefully make it easier for us to do things like tweak the thresholds for area labels to do a better job predicting them.

I'd expect we should be able to auto-assign area labels for a large number of issues.

So a next step after this PR (#79) is to fix the logging so we can easily track how often we add area and kind labels.

@jlewi (Contributor, Author) commented Dec 27, 2019

@hamelsmu btw I'll split this up into smaller PRs that are easier to review.

@jlewi (Contributor, Author) commented Dec 27, 2019

Fired off 3 initial PRs

#80 embedding_service changes
#81 autorestarter for use with skaffold
#82 the various classes to aid in combining models

@jlewi jlewi force-pushed the multi_model branch 2 times, most recently from f2621a8 to d3ce0b0 Compare December 31, 2019 02:18
@jlewi jlewi force-pushed the multi_model branch 2 times, most recently from bccdd9c to d6a9212 Compare January 18, 2020 00:49
@k8s-ci-robot commented:

The following users are mentioned in OWNERS file(s) but are not members of the kubeflow org.

Once all users have been added as members of the org, you can trigger verification by writing /verify-owners in a comment.

  • rileyjbauer
    • kubeflow_clusters/issue-label-bot/kustomize/metadata/OWNERS

@jlewi (Contributor, Author) commented Jan 18, 2020

The bulk of the changes in this PR have been split off and merged in separate PRs.

I've rebased the PR, and there are just a couple of cosmetic changes left.

* Attach extra metadata information in the log messages.

* Related to kubeflow#70 - use ensemble models
@jlewi jlewi marked this pull request as ready for review January 18, 2020 01:09
@jlewi (Contributor, Author) commented Jan 18, 2020

@hamelsmu this is ready for review. Most of the changes were already submitted in other PRs; the only thing that remains is some logging.

@jlewi jlewi changed the title Combine repo specific and general model Use json logging to log information about the issue being processed when exceptions occur Jan 18, 2020
@hamelsmu (Member) commented:

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot removed the lgtm label Jan 18, 2020
@hamelsmu (Member) commented:

/lgtm
/approve

@hamelsmu (Member) commented:

@jlewi looks like you have to fix the OWNERS file

@hamelsmu (Member) commented:

/verify-owners

@k8s-ci-robot k8s-ci-robot removed the lgtm label Jan 18, 2020
@hamelsmu (Member) commented:

/lgtm

@jlewi (Contributor, Author) commented Jan 18, 2020

Thanks
/approve

@k8s-ci-robot commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hamelsmu, jlewi

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@jlewi (Contributor, Author) commented Jan 19, 2020

/hold cancel

@k8s-ci-robot k8s-ci-robot merged commit c31d84e into kubeflow:master Jan 19, 2020