Fix intermittent build failure due to race condition in upstream server deletion #118
Conversation
James, thanks for submitting this and testing it out.
A quick thing on the formatting of co-authored-by:
https://docs.github.com/en/pull-requests/committing-changes-to-your-project/creating-and-editing-commits/creating-a-commit-with-multiple-authors
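For reference, GitHub only credits co-authors when each trailer sits on its own line at the very end of the commit message, in the form "Co-authored-by: NAME <email>". Using the authors on this PR, the end of the message would read:

Co-authored-by: James Osborne <jameso@graphcore.ai>
Co-authored-by: John Garbutt <john.garbutt@stackhpc.com>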
builder/openstack/server.go (outdated diff)

@@ -138,11 +138,6 @@ func DeleteServer(state multistep.StateBag, instance string) error {
	}

	server, err := servers.Get(computeClient, instance).Extract()
I think we should also remove this.
Line 140? That's needed for lines:
stateChange := StateChangeConf{
	Pending: []string{"ACTIVE", "BUILD", "REBUILD", "SUSPENDED", "SHUTOFF", "STOPPED"},
	Refresh: ServerStateRefreshFunc(computeClient, server),
	Target:  []string{"DELETED"},
}
Ah sorry, of course! I remember now: that bit of code correctly treats a 404 as success when checking for a deleted server, which the code you removed did not.
Hi @jjo93sa and @JohnGarbutt - @lbajolet-hashicorp and I had a chance to review this PR to better understand what is happening here. Based on our understanding of the code, we suggest making one of the following two changes to address the issue.
- The preferred change would be to drop the pointer to the server in favor of just the instance ID in ServerStateRefreshFunc, which is used in three places (a call-site sketch follows the function below).
// ServerStateRefreshFunc returns a StateRefreshFunc that is used to watch
// an openstack server.
func ServerStateRefreshFunc(
	client *gophercloud.ServiceClient, instanceID string) StateRefreshFunc {
	return func() (interface{}, string, int, error) {
		serverNew, err := servers.Get(client, instanceID).Extract()
		if err != nil {
			if _, ok := err.(gophercloud.ErrDefault404); ok {
				log.Printf("[INFO] 404 on ServerStateRefresh, returning DELETED")
				return nil, "DELETED", 0, nil
			}
			log.Printf("[ERROR] Error on ServerStateRefresh: %s", err)
			return nil, "", 0, err
		}
		return serverNew, serverNew.Status, serverNew.Progress, nil
	}
}
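With that signature change, each call site passes the instance ID it already holds instead of a server object. A minimal sketch of the call site in DeleteServer, assuming instance carries the server's ID as in the snippets below:

stateChange := StateChangeConf{
	Pending: []string{"ACTIVE", "BUILD", "REBUILD", "SUSPENDED", "SHUTOFF", "STOPPED"},
	// Pass the ID directly; no Get call is needed before the wait,
	// so there is no window in which the server can disappear first.
	Refresh: ServerStateRefreshFunc(computeClient, instance),
	Target:  []string{"DELETED"},
}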
- Alternatively, move lines 140-144 before the deletion loop, so that you have a handle to the valid server before it gets deleted. The result of

	server, err := servers.Get(computeClient, instance).Extract()

is then used by

	Refresh: ServerStateRefreshFunc(computeClient, server),

on line 143. Since server can no longer be nil, any error from the Get call would be legitimate grounds to fail, because it would stem from something other than the race condition.
server, err := servers.Get(computeClient, instance).Extract()
if err != nil {
	err = fmt.Errorf("Error getting server to terminate: %s", err)
	return err
}
ui.Say(fmt.Sprintf("Terminating the source server: %s ...", instance))
for {
	if config.ForceDelete {
		err = servers.ForceDelete(computeClient, instance).ExtractErr()
	} else {
		err = servers.Delete(computeClient, instance).ExtractErr()
	}
	if err == nil {
		break
	}
	if _, ok := err.(gophercloud.ErrDefault500); !ok {
		err = fmt.Errorf("Error terminating server, may still be around: %s", err)
		return err
	}
	if numErrors < maxNumErrors {
		numErrors++
		log.Printf("Error terminating server on (%d) time(s): %s, retrying ...", numErrors, err)
		time.Sleep(2 * time.Second)
		continue
	}
	err = fmt.Errorf("Error terminating server, maximum number (%d) reached: %s", numErrors, err)
	return err
}
stateChange := StateChangeConf{
	Pending: []string{"ACTIVE", "BUILD", "REBUILD", "SUSPENDED", "SHUTOFF", "STOPPED"},
	Refresh: ServerStateRefreshFunc(computeClient, server),
	Target:  []string{"DELETED"},
}
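For completeness, the stateChange above is then driven by the plugin's wait helper. A hedged sketch of the continuation, assuming a WaitForState helper that polls Refresh until a Target state is reached:

// Assumed continuation: poll the refresh function until the server
// reports DELETED, surfacing any genuine API error along the way.
if _, err := WaitForState(&stateChange); err != nil {
	err = fmt.Errorf("Error waiting for server (%s) to delete: %s", instance, err)
	return err
}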
Force-pushed from b497efc to deb8dcf.
Hey folks, are you still open to pushing these changes forward?
I am indeed, just been super busy. It is on my backlog now though, so hopefully I'll get around to it this coming week or next.
@nywilken I've made the preferred change (1) and tested it in our pipeline; it seems to work as expected. I've pushed those changes to my fork, and I think they've been automatically picked up by the PR - commit
Thanks for re-rolling and testing the changes. This looks good to me.
Thanks again for the fix, @jjo93sa - your patch has been released in https://github.com/hashicorp/packer-plugin-openstack/releases/tag/v1.1.2
We have experienced intermittent image build failures since the fix in commit b30391d for Cinder volume detachment. In failure cases, OpenStack sometimes deletes the server quickly enough that the next Get call returns a 404, and the loop exits; other times the deletion is slower, the next Get call returns 202, and the loop continues. Removing lines 141-144 inclusive, i.e. the check of the error code on the Get call, enables correct status checking of the instance being deleted.
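For reference, a sketch of the removed check, reconstructed from the review suggestions above rather than copied verbatim from the diff:

// Reconstructed sketch (assumption based on the discussion above):
// after the deletion loop, this Get turns the 404 of an
// already-deleted server into a fatal error, even though the 404
// actually means the deletion succeeded - the race this PR fixes.
server, err := servers.Get(computeClient, instance).Extract()
if err != nil {
	err = fmt.Errorf("Error getting server to terminate: %s", err)
	return err
}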
Closes #105
Co-authored-by: James Osborne <jameso@graphcore.ai>
Co-authored-by: John Garbutt <john.garbutt@stackhpc.com>