TFX trainer component running in Kubeflow fails although it was successful in the Interactive Context #6525
Comments
Can you also share the complete logs from Logs Explorer for this pipeline, or minimal pipeline code to recreate this issue? For the error you mentioned in the issue, the pipeline is saving the same artifacts to the same location, which causes this error. Complete logs will help us understand which artifact is causing the problem and how we can resolve it. Thanks. |
code updated
@singhniraj08 Thank you for the quick answer. I couldn't identify where the schema artifact (?) is written twice. Here is the code:
|
I tried running the pipeline code, but there were a lot of errors in it that need to be fixed, so I was not able to replicate the issue using the example code. I tried running the TFX BigQuery tutorial and it ran without any issues. |
@singhniraj08 My biggest problem is that I cannot view the logs from the component execution in the Vertex interface or in Logs Explorer, so I am debugging almost blindly. In other tests where I used only Kubeflow components instead of TFX, I was able to view them. Do you have any idea why this happens and what I can do to see the component logs? For this TFX pipeline, I followed these two tutorials:
I can run both of them successfully on their own. But when I combine them as in the code above, the second component (Trainer) fails. These are the differences between the code from tutorial #2 and my code:
def _make_keras_model() -> tf.keras.Model: Returns:
I will update the code in the issue to match the modified code
|
@singhniraj08 Hi, even without logging, I managed to identify the coding problems in the trainer module, and it works now. |
@crbl1122, Normally the component logs should appear in logs explorer, but I will try to verify it with the team and update this issue. |
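The thread leaves the missing-logs question open. As a minimal sketch only, assuming that whatever the training container writes to stdout is what ends up in Logs Explorer, a trainer module could route its own messages through an explicit stream handler so they are easier to tell apart from framework output. The logger name `trainer_module` and both helper functions are hypothetical and do not come from the issue or from TFX.

```python
import logging
import sys


def configure_trainer_logging(level=logging.INFO, stream=None):
    """Attach one stream handler to a dedicated logger (name is hypothetical)."""
    logger = logging.getLogger("trainer_module")
    logger.setLevel(level)
    # Do not forward to the root logger, to avoid duplicated lines.
    logger.propagate = False
    # Drop handlers from earlier calls so repeated setup stays idempotent.
    for old_handler in list(logger.handlers):
        logger.removeHandler(old_handler)
    target = stream if stream is not None else sys.stdout
    handler = logging.StreamHandler(target)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
    )
    logger.addHandler(handler)
    return logger


def guarded_step(logger, step_name, fn):
    """Run fn(), logging start, success, and any exception with its traceback."""
    logger.info("starting %s", step_name)
    try:
        result = fn()
    except Exception:
        # logger.exception records the full traceback at ERROR level.
        logger.exception("%s failed", step_name)
        raise
    logger.info("finished %s", step_name)
    return result
```

Inside a trainer module, the body of the training entry point could be wrapped with `guarded_step` so that a failure leaves a traceback under a searchable logger name. Whether this makes the messages visible in the Vertex dashboard was not verified in this thread.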
If the bug is related to a specific library below, please raise an issue in the
respective repo directly:
TensorFlow Data Validation Repo
TensorFlow Model Analysis Repo
TensorFlow Transform Repo
TensorFlow Serving Repo
System information
Interactive Notebook, Google Cloud, etc): Notebook
pip freeze
output):
Describe the current behavior
In GCP I run a Kubeflow ML pipeline with TFX components using a custom service account. The pipeline reads data from BigQuery and it has the following components: components = [example_gen, statistics_gen, schema_gen, transform, trainer, pusher]
The main problem is that it fails at the last "trainer" step, although I tested each step in the interactive context and all of them were OK. The secondary problem is that I cannot display log messages from the trainer module code in the main GCP pipeline dashboard (in the logs area), which complicates my debugging attempts. I can only view logs in Logs Explorer, and even there I do not see the messages from the Python trainer module code; they seem to be only framework messages. Among those messages I see only one type of error message. I identified that this operation uses the default service account (not the custom one), which might not have all the permissions needed. I tried to set the trainer component to use the custom SA, but it does not use it. How can I set the custom SA properly?
Please view details: https://stackoverflow.com/questions/77652732/tfx-trainer-component-running-in-kubeflow-fails-although-it-was-successful-in-th
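For the custom service account question, the sketch below builds a job specification dictionary that names a service account. It is an illustration under assumptions, not a confirmed fix: the field names (`worker_pool_specs`, `machine_spec`, `container_spec`, `service_account`) follow the Vertex AI CustomJobSpec schema, the helper name and all values are hypothetical placeholders, and the thread does not confirm that the Trainer here runs as a Vertex custom job or that TFX forwards the `service_account` field.

```python
from typing import Any, Dict


def build_vertex_job_spec(
    project: str,
    image_uri: str,
    service_account: str,
    machine_type: str = "n1-standard-4",
) -> Dict[str, Any]:
    """Return a CustomJobSpec-shaped dict (hypothetical helper, plain Python)."""
    if not service_account:
        raise ValueError("service_account must be a non-empty email address")
    return {
        "project": project,
        # Assumption: the training job should run as this account
        # instead of the project's default service account.
        "service_account": service_account,
        "worker_pool_specs": [
            {
                "machine_spec": {"machine_type": machine_type},
                "replica_count": 1,
                "container_spec": {"image_uri": image_uri},
            }
        ],
    }


# Placeholder values for illustration only; none come from the issue.
example_spec = build_vertex_job_spec(
    project="my-project",
    image_uri="gcr.io/my-project/my-tfx-image",
    service_account="custom-sa@my-project.iam.gserviceaccount.com",
)
```

In the TFX Vertex AI training tutorial, a dictionary of this shape is passed to `tfx.extensions.google_cloud_ai_platform.Trainer` through `custom_config` under `TRAINING_ARGS_KEY`. Whether adding `service_account` there resolves the behavior described above would need to be checked against the installed TFX version.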
Describe the expected behavior
Being able to successfully run the TFX Trainer component in a Kubeflow pipeline.
Standalone code to reproduce the issue
Providing a bare minimum test case or step(s) to reproduce the problem will
greatly help us to debug the issue. If possible, please share a link to
Colab/Jupyter/any notebook.
Name of your Organization (Optional)
Other info / logs
Include any logs or source code that would be helpful to diagnose the problem.
If including tracebacks, please include the full traceback. Large logs and files
should be attached.