Print cloud build logs when the task fails #1989

Merged: 2 commits merged on Sep 4, 2019


Bobgy
Contributor

Bobgy commented Aug 29, 2019

Fixes #1978

/area testing
/assign @Ark-kun



@Ark-kun
Contributor

Ark-kun commented Aug 29, 2019

Is there any particular reason to poll the service instead of just running it synchronously? The timeout can be specified with the `--timeout 600s` argument or in cloudbuild.yaml.
If you do not want to print the logs unless there is an error, you can write them to a file and print them only if there is a problem.
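The "write the logs to a file and print them only on failure" suggestion can be sketched roughly as below. `run_quietly` is a hypothetical helper name, not code from this PR; it simply runs whatever command it is given, keeps the combined output in a temporary file, and prints that output only when the command exits non-zero.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command with its output captured in a temp file,
# print that output only if the command fails, and return its exit status.
run_quietly() {
  local log_file status=0
  log_file="$(mktemp)"
  "$@" > "$log_file" 2>&1 || status=$?
  if [ "$status" -ne 0 ]; then
    echo "Command failed with exit status $status: $*" >&2
    cat "$log_file" >&2
  fi
  rm -f "$log_file"
  return "$status"
}
```

Illustrative usage, with the remaining build arguments left elided: `run_quietly gcloud builds submit --timeout=600s ...`. As noted above, the timeout could instead be set in cloudbuild.yaml.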

@Ark-kun
Contributor

Ark-kun commented Aug 29, 2019

/lgtm

@Bobgy
Contributor Author

Bobgy commented Aug 29, 2019

@Ark-kun I didn't realize this wasn't clear; let me add a comment later. It's meant to save time. When we are waiting for cloud build, we can deploy a cluster already. That saves ~4min.
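For comparison, a polling approach of the kind described here might look something like the sketch below. This is not the code merged in this PR. It assumes the build was started with `gcloud builds submit --async` and that its build ID is already known; `get_build_status`, `print_build_log`, and `wait_for_build` are hypothetical helper names. The `gcloud builds describe` and `gcloud builds log` calls are wrapped in functions so they can be replaced with stubs, and the status names checked are Cloud Build's build states.

```shell
#!/usr/bin/env bash
# Hypothetical helper: fetch a build's status (assumes gcloud is installed
# and authenticated). Can be redefined, e.g. in tests.
get_build_status() {
  gcloud builds describe "$1" --format='value(status)'
}

# Hypothetical helper: print a build's logs.
print_build_log() {
  gcloud builds log "$1"
}

# Poll until the build reaches a terminal state. On anything other than
# SUCCESS, print the build's logs to stderr and return 1.
wait_for_build() {
  local build_id="$1" interval="${2:-10}" status
  while true; do
    status="$(get_build_status "$build_id")" || return 1
    case "$status" in
      QUEUED|WORKING|PENDING) sleep "$interval" ;;
      SUCCESS) return 0 ;;
      *)
        echo "Build $build_id finished with status $status" >&2
        print_build_log "$build_id" >&2
        return 1
        ;;
    esac
  done
}
```

Under this arrangement, cluster creation would run between the asynchronous submit and the `wait_for_build "$build_id"` call, which is where the time saving would come from.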

k8s-ci-robot removed the lgtm label Aug 30, 2019
@Ark-kun
Contributor

Ark-kun commented Aug 30, 2019

/lgtm

@Ark-kun
Contributor

Ark-kun commented Aug 30, 2019

> When we are waiting for cloud build, we can deploy a cluster already. That saves ~4min.

I see. I often use shell parallelism for this kind of thing. For example:

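```shell
# Create the cluster in the background and always record its exit code.
{ gcloud container clusters create ...; echo "$?" > /tmp/cluster_exit_code; } &
cluster_pid=$!
# Meanwhile, run the build in the foreground.
gcloud builds submit ...
wait "$cluster_pid"
if [ "$(cat /tmp/cluster_exit_code)" != "0" ]; then
  exit 1
fi
```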
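A variant of the same idea can skip the temporary file, because `wait PID` returns the exit status of the background job it waited for. In the minimal sketch below, `create_cluster` and `submit_build` are hypothetical placeholder functions standing in for the real `gcloud` commands.

```shell
#!/usr/bin/env bash
# Hypothetical placeholders: replace the bodies with the real commands,
# e.g. `gcloud container clusters create ...` and `gcloud builds submit ...`.
create_cluster() { echo "creating cluster (placeholder)"; }
submit_build() { echo "submitting build (placeholder)"; }

# Start cluster creation in the background, run the build in the foreground,
# then collect the background job's exit status from `wait`.
setup_in_parallel() {
  create_cluster &
  local cluster_pid=$!
  local build_status=0 cluster_status=0
  submit_build || build_status=$?
  # `wait PID` returns the exit status of that background job.
  wait "$cluster_pid" || cluster_status=$?
  if [ "$build_status" -ne 0 ] || [ "$cluster_status" -ne 0 ]; then
    echo "build exit status: $build_status, cluster exit status: $cluster_status" >&2
    return 1
  fi
}
```

The foreground build and the background cluster creation are both checked, so a failure in either one makes `setup_in_parallel` return 1.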

@Bobgy
Contributor Author

Bobgy commented Aug 31, 2019 via email

@Bobgy
Contributor Author

Bobgy commented Sep 3, 2019

@Ark-kun Thanks for the advice, but I don't think I have enough time for this at the moment. After thinking about the refactor for a while, I realized it isn't as trivial as I thought.

  • There are two Cloud Build jobs plus the cluster setup that need to run in parallel, so we would have to make two of them async; making only the cluster setup async is not enough.
  • When we make the Cloud Build jobs async, we also need to wait for build_image.sh.
  • The upgrade tests I was implementing in [Testing] KFP standalone test infra for upgradability #1971 have a different workflow, which makes them more complex to refactor.
  • As before, we don't want to print Cloud Build logs unless a build fails.
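Several of the points above (two Cloud Build jobs plus the cluster setup running in parallel, with logs printed only on failure) could be combined roughly as in the sketch below. `run_parallel_quiet` is a hypothetical helper, not part of this PR: each argument is a command string run with `sh -c`, each job writes its output and exit code to its own files, and once all jobs finish only the logs of the failed ones are printed. It does not address the `build_image.sh` dependency or the upgrade-test workflow.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run each argument as a command in parallel, keep each
# job's output in its own log file, and after all jobs finish print the logs
# of the failed jobs only. Returns 1 if any job failed.
run_parallel_quiet() {
  local log_dir i cmd status failed
  log_dir="$(mktemp -d)"
  i=0
  for cmd in "$@"; do
    # Record the exit code in a file, as in the example above.
    ( rc=0; sh -c "$cmd" > "$log_dir/$i.log" 2>&1 || rc=$?; echo "$rc" > "$log_dir/$i.status" ) &
    i=$((i + 1))
  done
  wait  # Wait for every background job started by this shell.
  failed=0
  i=0
  for cmd in "$@"; do
    status="$(cat "$log_dir/$i.status")"
    if [ "$status" -ne 0 ]; then
      echo "FAILED (exit $status): $cmd" >&2
      cat "$log_dir/$i.log" >&2
      failed=1
    fi
    i=$((i + 1))
  done
  rm -rf "$log_dir"
  [ "$failed" -eq 0 ]
}
```

In this setup, the two `gcloud builds submit` commands and the cluster-creation command would be passed as three separate arguments. Note that the bare `wait` also waits for any other background jobs the calling shell has already started.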

Given all of the above, I think the current working version is good enough. I'd rather not spend several extra hours on the refactor right now.

But I learned about bash's parallelism features this time, and I'll start with this approach the next time I need to build or change something.

Thoughts?

@Ark-kun
Contributor

Ark-kun commented Sep 3, 2019

/lgtm

@Bobgy
Contributor Author

Bobgy commented Sep 3, 2019

Thanks. Can you grant approval too?
I don't think I have permission.
/approve

@Bobgy
Contributor Author

Bobgy commented Sep 3, 2019

@Ark-kun

@Ark-kun
Contributor

Ark-kun commented Sep 3, 2019

/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Ark-kun, Bobgy

@Bobgy
Contributor Author

Bobgy commented Sep 4, 2019

Thanks!

We got a chance to see how the logs are printed when a build fails:

```
ERROR: Analysis of target '//backend/src/apiserver:apiserver' failed; build aborted: no such package '@com_github_spf13_viper//': failed to fetch com_github_spf13_viper: # cd .; git clone https://github.com/spf13/viper /root/.cache/bazel/_bazel_root/ff9dc353908781674f376ac4c88da873/external/com_github_spf13_viper
Cloning into '/root/.cache/bazel/_bazel_root/ff9dc353908781674f376ac4c88da873/external/com_github_spf13_viper'...
error: RPC failed; curl 56 GnuTLS recv error (-54): Error in the pull function.
fatal: The remote end hung up unexpectedly
fatal: early EOF
fatal: index-pack failed
2019/09/03 23:37:50 exit status 128
INFO: Elapsed time: 106.560s
INFO: 0 processes.
FAILED: Build did NOT complete successfully (39 packages loaded, 6149 targets configured)

FAILED: Build did NOT complete successfully (39 packages loaded, 6149 targets configured)
The command '/bin/sh -c bazel build -c opt --action_env=PATH --define=grpc_no_ares=true backend/src/apiserver:apiserver' returned a non-zero code: 1
ERROR
ERROR: build step 0 "gcr.io/cloud-builders/docker" failed: exit status 1
```

/retest

k8s-ci-robot merged commit 3d2b0ae into kubeflow:master Sep 4, 2019
Bobgy deleted the cloud_build_log branch September 4, 2019 01:18
Development

Successfully merging this pull request may close these issues.

Presubmit test CloudBuild failures are hard to debug without logs
3 participants