
various run build -F/--follow fixes around timeout, lack of pods, and data races #56

Merged

Conversation

gabemontero
Member

@gabemontero gabemontero commented Oct 8, 2021

Changes

Fixes #53

  • integrates the --request-timeout and --context options that shp inherits from k8s, so that request/context deadline timeouts abort shp build run --follow
  • adds callbacks for the lack of pod events, cross-referencing the BuildRun when those occur, just as we do when pod events do occur, to exit --follow (there is still no explicit BuildRun watch to go with the pod watch, as multiple watches proved untenable with unit testing, and are overkill since we only need the BuildRun for terminal states)
  • slight refactor of the SubCommand interface to allow more initialization in Complete() vs. Run() for --follow
  • which allowed a vast simplification of the data-race prevention in RunCommand
  • and removal of what proved to be kludgy thread coordination in the --follow unit tests

Submitter Checklist

  • [x] Includes tests if functionality changed/was added
  • [n/a] Includes docs if changes are user-facing
  • [x] Set a kind label on this PR
  • [x] Release notes block has been filled in, or marked NONE

See the contributor guide
for details on coding conventions, GitHub and Prow interactions, and the code review process.

/kind bug

Release Notes

fully integrate options '--context' and '--request-timeout' into 'shp build run --follow' 
allow for exit if the underlying Pod for a BuildRun is never created
concurrency fixes

@adambkaplan
@coreydaley
@SaschaSchwarze0
@otaviof

@openshift-ci openshift-ci bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/bug Categorizes issue or PR as related to a bug. release-note labels Oct 8, 2021
@gabemontero gabemontero force-pushed the fix-follow-no-pod branch 2 times, most recently from b368c5a to 044ab73 Compare October 9, 2021 00:04
@gabemontero gabemontero changed the title WIP: utilize --request-timeout in pod watcher; add new pod watcher timeout callback; print message on context/request timeout various fixes around timeout, lack of pods, and data races Oct 9, 2021
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 9, 2021
@gabemontero
Member Author

@adambkaplan
@coreydaley
@SaschaSchwarze0
@otaviof

this is at a point where I think this fix for @SaschaSchwarze0 is ready for review

it is late-ish Friday ... I'm off Monday, Oct 11, so I won't get to any review comments that come in Monday until Tuesday

fwiw if any review comments come in over the weekend I'll have some cycles to spend some time on them

thanks

@gabemontero
Member Author

manual e2e test ... I'll see about adding an e2e test using our new bats-based framework

gmontero ~/go/src/github.com/shipwright-io/cli-1  (fix-follow-no-pod)$ shp build run buildpack-nodejs-build -F
Pod 'buildpack-nodejs-build-l5hc6-9vllp-pod-thsfd' is in state "Pending"...
Pod 'buildpack-nodejs-build-l5hc6-9vllp-pod-thsfd' is in state "Pending"...
Pod 'buildpack-nodejs-build-l5hc6-9vllp-pod-thsfd' is in state "Pending"...
[place-tools] 2021/10/09 13:25:36 Copied /ko-app/entrypoint to /tekton/tools/entrypoint
[build-and-push] ===> DETECTING
[build-and-push] 5 of 9 buildpacks participating
[build-and-push] paketo-buildpacks/ca-certificates 2.4.1
[build-and-push] paketo-buildpacks/node-engine     0.8.0
[build-and-push] paketo-buildpacks/npm-install     0.4.0
[build-and-push] paketo-buildpacks/node-module-bom 0.1.2
[build-and-push] paketo-buildpacks/npm-start       0.4.0
[build-and-push] ===> ANALYZING
[source-default] 2021/10/09 13:25:38 Info: ssh (/usr/bin/ssh): OpenSSH_8.0p1, OpenSSL 1.1.1g FIPS  21 Apr 2020
[source-default] 2021/10/09 13:25:38 Info: git (/usr/bin/git): git version 2.27.0
[source-default] 2021/10/09 13:25:38 Info: git-lfs (/usr/bin/git-lfs): git-lfs/2.11.0 (GitHub; linux amd64; go 1.14.4)
[source-default] 2021/10/09 13:25:38 /usr/bin/git clone -h
[source-default] 2021/10/09 13:25:38 /usr/bin/git submodule -h
[source-default] 2021/10/09 13:25:38 /usr/bin/git clone --quiet --no-tags --single-branch --depth 1 -- https://github.com/shipwright-io/sample-nodejs /workspace/source
[source-default] 2021/10/09 13:25:38 /usr/bin/git -C /workspace/source submodule update --init --recursive --depth 1
[source-default] 2021/10/09 13:25:38 /usr/bin/git -C /workspace/source rev-parse --abbrev-ref HEAD
[source-default] 2021/10/09 13:25:38 Successfully loaded https://github.com/shipwright-io/sample-nodejs (main) into /workspace/source
[source-default] 2021/10/09 13:25:38 /usr/bin/git -C /workspace/source rev-parse --verify HEAD
[source-default] 2021/10/09 13:25:38 /usr/bin/git -C /workspace/source log -1 --pretty=format:%an
[build-and-push] Previous image with name "" not found
[build-and-push] Restoring metadata for "paketo-buildpacks/ca-certificates:helper" from app image
[build-and-push] ===> RESTORING
[build-and-push] ===> BUILDING
[build-and-push] 
[build-and-push] Paketo CA Certificates Buildpack 2.4.1
[build-and-push]   https://github.com/paketo-buildpacks/ca-certificates
[build-and-push]   Launch Helper: Reusing cached layer
[build-and-push] Paketo Node Engine Buildpack 0.8.0
[build-and-push]   Resolving Node Engine version
[build-and-push]     Candidate version sources (in priority order):
[build-and-push]       package.json -> ">= 0.10.12"
[build-and-push]       <unknown>    -> ""
[build-and-push] 
[build-and-push]     Selected Node Engine version (using package.json): 16.9.1
[build-and-push] 
[build-and-push]   Executing build process
[build-and-push]     Installing Node Engine 16.9.1
[build-and-push]       Completed in 2.796s
[build-and-push] 
[build-and-push]   Configuring build environment
[build-and-push]     NODE_ENV     -> "production"
[build-and-push]     NODE_HOME    -> "/layers/paketo-buildpacks_node-engine/node"
[build-and-push]     NODE_VERBOSE -> "false"
[build-and-push] 
[build-and-push]   Configuring launch environment
[build-and-push]     NODE_ENV     -> "production"
[build-and-push]     NODE_HOME    -> "/layers/paketo-buildpacks_node-engine/node"
[build-and-push]     NODE_VERBOSE -> "false"
[build-and-push] 
[build-and-push]     Writing profile.d/0_memory_available.sh
[build-and-push]       Calculates available memory based on container limits at launch time.
[build-and-push]       Made available in the MEMORY_AVAILABLE environment variable.
[build-and-push] 
[build-and-push] Paketo NPM Install Buildpack 0.4.0
[build-and-push]   Resolving installation process
[build-and-push]     Process inputs:
[build-and-push]       node_modules      -> "Not found"
[build-and-push]       npm-cache         -> "Not found"
[build-and-push]       package-lock.json -> "Not found"
[build-and-push] 
[build-and-push]     Selected NPM build process: 'npm install'
[build-and-push] 
[build-and-push]   Executing build process
[build-and-push]     Running 'npm install --unsafe-perm --cache /layers/paketo-buildpacks_npm-install/npm-cache'
[build-and-push]       Completed in 1.013s
[build-and-push] 
[build-and-push]   Configuring launch environment
[build-and-push]     NPM_CONFIG_LOGLEVEL -> "error"
[build-and-push] 
[build-and-push]   Configuring environment shared by build and launch
[build-and-push]     PATH -> "$PATH:/layers/paketo-buildpacks_npm-install/modules/node_modules/.bin"
[build-and-push] 
[build-and-push] 
[build-and-push] Paketo Node Module Bill of Materials Generator Buildpack 0.1.2
[build-and-push]   Resolving CycloneDX Node.js Module version
[build-and-push]     Selected CycloneDX Node.js Module version: 3.0.7
[build-and-push] 
[build-and-push]   Executing build process
[build-and-push]     Installing CycloneDX Node.js Module 3.0.7
[build-and-push]       Completed in 242ms
[build-and-push] 
[build-and-push]   Configuring environment
[build-and-push]     Appending CycloneDX Node.js Module onto PATH
[build-and-push] 
[build-and-push]   Running CycloneDX Node.js Module
[build-and-push]     Running 'cyclonedx-bom -o bom.json'
[build-and-push]       Completed in 236ms
[build-and-push] 
[build-and-push] Paketo NPM Start Buildpack 0.4.0
[build-and-push]   Assigning launch processes
[build-and-push]     web: node server.js
[build-and-push] 
[build-and-push] ===> EXPORTING
[build-and-push] Reusing layers from image 'index.docker.io/gmontero/sample-nodejs@sha256:1cabccf7ba39c8041e2563f43a1aa878ab6c98f085fcff3c5dfc9c436783564c'
[build-and-push] Reusing layer 'paketo-buildpacks/ca-certificates:helper'
[build-and-push] Reusing layer 'paketo-buildpacks/node-engine:node'
[build-and-push] Reusing layer 'paketo-buildpacks/npm-install:modules'
[build-and-push] Reusing 1/1 app layer(s)
[build-and-push] Reusing layer 'launcher'
[build-and-push] Reusing layer 'config'
[build-and-push] Adding label 'io.buildpacks.lifecycle.metadata'
[build-and-push] Adding label 'io.buildpacks.build.metadata'
[build-and-push] Adding label 'io.buildpacks.project.metadata'
[build-and-push] Saving docker.io/gmontero/sample-nodejs:latest...
[build-and-push] *** Images (sha256:9da0768da7a03aeef6e06d42f833e9772e5d07cae9d8ebccb79e75204584922e):
[build-and-push]       docker.io/gmontero/sample-nodejs:latest
[build-and-push] Adding cache layer 'paketo-buildpacks/node-engine:node'
[build-and-push] Adding cache layer 'paketo-buildpacks/npm-install:modules'
[build-and-push] Adding cache layer 'paketo-buildpacks/node-module-bom:cyclonedx-node-module'
Pod 'buildpack-nodejs-build-l5hc6-9vllp-pod-thsfd' has succeeded!
gmontero ~/go/src/github.com/shipwright-io/cli-1  (fix-follow-no-pod)$ 

add new pod watcher timeout callback
print message on context/request timeout
simplify --follow init
remove follow watch lock
eliminate races
@gabemontero gabemontero changed the title various fixes around timeout, lack of pods, and data races various run build -F/--follow fixes around timeout, lack of pods, and data races Oct 9, 2021
@gabemontero
Member Author

minimal e2e test added

note the TODO item for follow-up, unless a reviewer has a quick/easy solution that I feel is suitable for this PR

Comment on lines +249 to +257
func (r *RunCommand) Log(msg string) {
// concurrent fmt.Fprint(r.ioStreams.Out, ...) calls need locking to avoid data races, as we 'write' to the stream
r.logLock.Lock()
defer r.logLock.Unlock()
fmt.Fprint(r.ioStreams.Out, msg)
}

Contributor

Have you considered using a channel for this? The Log function could write to the channel, and another goroutine could read from it and write the messages to standard out. I don't "think" that the sync.Mutex here will cause any blocking issues, but the channel might be even less likely to. Just a thought I had while looking through this.

Member Author

Yeah, I'm positive the mutex here is not a problem in the runtime code, because this method is the only one that uses it.

So we prevent the concurrent writes that race detection flagged.

It was only concurrent writes in runtime code that we needed to be wary of here.

Also, I say runtime code because I do leverage this in the unit tests in one (and only one) spot, to verify the contents of the out buffer.

But still, given all these qualifiers, and especially since there is no read/write interaction we are coordinating in the non-test code, I think this simpler solution is better.

Good consideration to sort out, though - thanks.

Member

@SaschaSchwarze0 SaschaSchwarze0 left a comment

Will try it out later with my scenario.

}
if giveUp {
r.Log(msg)
r.Log(fmt.Sprintf("exting 'ship build run --follow' for BuildRun %q", br.Name))
Member

@SaschaSchwarze0 SaschaSchwarze0 Oct 11, 2021

Suggested change
r.Log(fmt.Sprintf("exting 'ship build run --follow' for BuildRun %q", br.Name))
r.Log(fmt.Sprintf("exiting 'shp build run --follow' for BuildRun %q", br.Name))

Comment on lines 126 to 129
if err != nil {
r.Log(fmt.Sprintf("error accessing BuildRun %q: %s", r.buildRunName, err.Error()))
} else {
c := br.Status.GetCondition(buildv1alpha1.Succeeded)
giveUp := false
msg := ""
switch {
case c != nil && c.Status == corev1.ConditionTrue:
giveUp = true
msg = fmt.Sprintf("BuildRun '%s' has been marked as successful.\n", br.Name)
case c != nil && c.Status == corev1.ConditionFalse:
giveUp = true
msg = fmt.Sprintf("BuildRun '%s' has been marked as failed.\n", br.Name)
case br.IsCanceled():
giveUp = true
msg = fmt.Sprintf("BuildRun '%s' has been canceled.\n", br.Name)
case br.DeletionTimestamp != nil:
giveUp = true
msg = fmt.Sprintf("BuildRun '%s' has been deleted.\n", br.Name)
case !br.HasStarted():
r.Log(fmt.Sprintf("BuildRun '%s' has been marked as failed.\n", br.Name))
}
if giveUp {
r.Log(msg)
r.Log(fmt.Sprintf("exting 'ship build run --follow' for BuildRun %q", br.Name))
r.stop()
}
}
Contributor

Suggested change
if err != nil {
r.Log(fmt.Sprintf("error accessing BuildRun %q: %s", r.buildRunName, err.Error()))
} else {
c := br.Status.GetCondition(buildv1alpha1.Succeeded)
giveUp := false
msg := ""
switch {
case c != nil && c.Status == corev1.ConditionTrue:
giveUp = true
msg = fmt.Sprintf("BuildRun '%s' has been marked as successful.\n", br.Name)
case c != nil && c.Status == corev1.ConditionFalse:
giveUp = true
msg = fmt.Sprintf("BuildRun '%s' has been marked as failed.\n", br.Name)
case br.IsCanceled():
giveUp = true
msg = fmt.Sprintf("BuildRun '%s' has been canceled.\n", br.Name)
case br.DeletionTimestamp != nil:
giveUp = true
msg = fmt.Sprintf("BuildRun '%s' has been deleted.\n", br.Name)
case !br.HasStarted():
r.Log(fmt.Sprintf("BuildRun '%s' has been marked as failed.\n", br.Name))
}
if giveUp {
r.Log(msg)
r.Log(fmt.Sprintf("exting 'ship build run --follow' for BuildRun %q", br.Name))
r.stop()
}
}
if err != nil {
r.Log(fmt.Sprintf("error accessing BuildRun %q: %s", r.buildRunName, err.Error()))
return
}
c := br.Status.GetCondition(buildv1alpha1.Succeeded)
switch {
case c != nil && c.Status == corev1.ConditionTrue:
r.Log(fmt.Sprintf("BuildRun '%s' has been marked as successful.\n", br.Name))
r.stop()
case c != nil && c.Status == corev1.ConditionFalse:
r.Log(fmt.Sprintf("BuildRun '%s' has been marked as failed.\n", br.Name))
r.stop()
case br.IsCanceled():
r.Log(fmt.Sprintf("BuildRun '%s' has been canceled.\n", br.Name))
r.stop()
case br.DeletionTimestamp != nil:
r.Log(fmt.Sprintf("BuildRun '%s' has been deleted.\n", br.Name))
r.stop()
case !br.HasStarted():
r.Log(fmt.Sprintf("BuildRun '%s' has been marked as failed.\n", br.Name))
}

I think this is cleaner and makes it easier to tell exactly what is going on, instead of having to track to the end of the switch statement to see what happens.

Member Author

I agree

maybe in one of my iterations prior to push I had something after the else block, but there's no telling :-)

@@ -1,21 +1,21 @@
package build

import (
"runtime"
"bytes"
"github.com/shipwright-io/cli/pkg/shp/reactor"
Contributor

this is in the wrong group

@gabemontero
Member Author

/approve

updates based on comments from @SaschaSchwarze0 and @coreydaley pushed, included in the existing commits

ideally, we at least wait for feedback from @SaschaSchwarze0 per #56 (review) that it addresses his scenario before tagging for merge

so I'll put a

/hold

on it now, in case someone wants to lgtm the code changes, but we don't release the hold until @SaschaSchwarze0 says it worked for him, or that he won't be able to get to it

To be honest, I don't have a ready-made "build is valid, but the BuildRun create will result in a Pod never getting created" scenario, but if I get out from under today, I'll spend some time trying to devise one.

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 12, 2021
@openshift-ci
Contributor

openshift-ci bot commented Oct 12, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: gabemontero

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 12, 2021
@coreydaley
Contributor

@gabemontero were you able to resolve the issues with the builds not succeeding?

@gabemontero
Member Author

@gabemontero were you able to resolve the issues with the builds not succeeding?

You mean the new e2e's? No. And depending on the timing of the remaining items, I'm inclined to leave the e2e as is (since it can still verify the log is being followed, which is the key) and open an issue to clean that up when someone has time to look into it.

@coreydaley
Contributor

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Oct 12, 2021
@gabemontero
Member Author

@gabemontero were you able to resolve the issues with the builds not succeeding?

You mean the new e2e's? No. And depending on the timing of the remaining items, I'm inclined to leave the e2e as is (since it can still verify the log is being followed, which is the key) and open an issue to clean that up when someone has time to look into it.

opened #57 for ^^

@gabemontero
Member Author

bump @SaschaSchwarze0

do you think you'll be able to try this PR in your env in the next few days?
or should we merge, let you try whenever you get the chance, and open a new issue / reopen etc. if something is still amiss?

thanks

Member

@SaschaSchwarze0 SaschaSchwarze0 left a comment

/unhold

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 17, 2021
@openshift-merge-robot openshift-merge-robot merged commit c91c9b7 into shipwright-io:main Oct 17, 2021
@gabemontero gabemontero deleted the fix-follow-no-pod branch October 18, 2021 12:37
Successfully merging this pull request may close these issues.

shp build run --follow hangs if BuildRun fails to start