
Fix intermittent build failure due to race condition in upstream server deletion #118

Merged: 3 commits merged into hashicorp:main on Feb 27, 2024

Conversation

@jjo93sa (Contributor) commented Dec 12, 2023

We have experienced intermittent image build failures since the fix in commit b30391d for Cinder volume detachment. In the failure cases, OpenStack deletes the server quickly enough that the subsequent "Get" call returns a 404, so the error check aborts the build. When the deletion is slower, the Get call still finds the server and status checking proceeds as intended.

Removing lines 141-144 inclusive, i.e. the check of the error code on the Get call, enables correct status checking of the instance being deleted.
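For reference, the Get call and the error check in question (lines 140-144 of the current code, quoted again in the review below) look like this; only the check on lines 141-144 is removed, while the Get on line 140 stays:

	server, err := servers.Get(computeClient, instance).Extract()
	if err != nil {
		err = fmt.Errorf("Error getting server to terminate: %s", err)
		return err
	}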

Closes #105

co-authored by:
jameso@graphcore.ai
john.garbutt@stackhpc.com

@jjo93sa requested a review from a team as a code owner December 12, 2023 10:17
@hashicorp-cla commented Dec 12, 2023

CLA assistant check
All committers have signed the CLA.

@JohnGarbutt (Contributor) left a comment


James, thanks for submitting this and testing it out.

A quick thing on the formatting of co-authored-by:
https://docs.github.com/en/pull-requests/committing-changes-to-your-project/creating-and-editing-commits/creating-a-commit-with-multiple-authors
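For reference, the documented format is one trailer line per co-author at the end of the commit message body, for example:

Co-authored-by: NAME <name@example.com>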

@@ -138,11 +138,6 @@ func DeleteServer(state multistep.StateBag, instance string) error {
}

server, err := servers.Get(computeClient, instance).Extract()
@JohnGarbutt (Contributor):

I think we should also remove this.

@jjo93sa (Contributor, Author):

Line 140? That's needed for these lines:
stateChange := StateChangeConf{
Pending: []string{"ACTIVE", "BUILD", "REBUILD", "SUSPENDED", "SHUTOFF", "STOPPED"},
Refresh: ServerStateRefreshFunc(computeClient, server),
Target: []string{"DELETED"},
}

@JohnGarbutt (Contributor):

Ah sorry, of course! I remember now: that bit of code correctly treats 404 as success when checking for a deleted server, which the code you have removed did not.

@nywilken (Contributor):

Hi @jjo93sa and @JohnGarbutt - @lbajolet-hashicorp and I had a chance to review this to better understand what is happening. Based on our understanding of the code, we suggest one of the following two changes to address this issue.

  1. The preferred change, since ServerStateRefreshFunc is used in three places, would be to drop the server pointer in favor of just the instance ID:
// ServerStateRefreshFunc returns a StateRefreshFunc that is used to watch
// an openstack server.
func ServerStateRefreshFunc(
	client *gophercloud.ServiceClient, instanceID string) StateRefreshFunc {
	return func() (interface{}, string, int, error) {
		serverNew, err := servers.Get(client, instanceID).Extract()
		if err != nil {
			if _, ok := err.(gophercloud.ErrDefault404); ok {
				log.Printf("[INFO] 404 on ServerStateRefresh, returning DELETED")
				return nil, "DELETED", 0, nil
			}
			log.Printf("[ERROR] Error on ServerStateRefresh: %s", err)
			return nil, "", 0, err
		}

		return serverNew, serverNew.Status, serverNew.Progress, nil
	}
}
  2. Move lines 140-144 before the deletion loop; this way you have a handle to the valid server before it gets deleted. The result of server, err := servers.Get(computeClient, instance).Extract() is then used by Refresh: ServerStateRefreshFunc(computeClient, server) on line 143. Since server can't be nil at that point, if the Get call does error it is valid to return the error, because it would be caused by something other than the race condition:
	server, err := servers.Get(computeClient, instance).Extract()
	if err != nil {
		err = fmt.Errorf("Error getting server to terminate: %s", err)
		return err
	}

	ui.Say(fmt.Sprintf("Terminating the source server: %s ...", instance))
	for {
		if config.ForceDelete {
			err = servers.ForceDelete(computeClient, instance).ExtractErr()
		} else {
			err = servers.Delete(computeClient, instance).ExtractErr()
		}

		if err == nil {
			break
		}

		if _, ok := err.(gophercloud.ErrDefault500); !ok {
			err = fmt.Errorf("Error terminating server, may still be around: %s", err)
			return err
		}

		if numErrors < maxNumErrors {
			numErrors++
			log.Printf("Error terminating server on (%d) time(s): %s, retrying ...", numErrors, err)
			time.Sleep(2 * time.Second)
			continue
		}
		err = fmt.Errorf("Error terminating server, maximum number (%d) reached: %s", numErrors, err)
		return err
	}

	stateChange := StateChangeConf{
		Pending: []string{"ACTIVE", "BUILD", "REBUILD", "SUSPENDED", "SHUTOFF", "STOPPED"},
		Refresh: ServerStateRefreshFunc(computeClient, server),
		Target:  []string{"DELETED"},
	}
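Under option 1, by contrast, the up-front Get on line 140 would presumably no longer be needed at all, since the refresh function re-resolves the server from its ID on each poll and maps a 404 to DELETED; the Refresh line above would become something like:

	Refresh: ServerStateRefreshFunc(computeClient, instance),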

@nywilken (Contributor):

Hey folks, are you still open to pushing these changes forward?

@jjo93sa (Contributor, Author) commented Feb 23, 2024 via email

@jjo93sa (Contributor, Author) commented Feb 27, 2024

@nywilken I've made the preferred change (1) and tested it in our pipeline; it seems to work as expected. I've pushed those changes to my fork, and I think they've been automatically picked up by the PR - commit [ea36d7c].
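A sketch of how the tail of DeleteServer plausibly looks after that change (assuming the plugin's existing WaitForState helper, which polls Refresh until a Target state or an error is returned; a reconstruction, not the exact committed diff):

	// No up-front Get is needed any more: the refresh function resolves
	// the server from its ID on each poll and treats a 404 as DELETED.
	stateChange := StateChangeConf{
		Pending: []string{"ACTIVE", "BUILD", "REBUILD", "SUSPENDED", "SHUTOFF", "STOPPED"},
		Refresh: ServerStateRefreshFunc(computeClient, instance),
		Target:  []string{"DELETED"},
	}
	if _, err := WaitForState(&stateChange); err != nil {
		return err
	}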

@nywilken (Contributor) left a comment


Thanks for re-rolling and testing the changes. This looks good to me.

@nywilken changed the title from "Error check causes failure to build image" to "Fix intermittent build failure due to race condition with cinder detachment" Feb 27, 2024
@nywilken added the bug label Feb 27, 2024
@nywilken changed the title from "Fix intermittent build failure due to race condition with cinder detachment" to "Fix intermittent build failure due to race condition in upstream server deletion" Feb 27, 2024
@nywilken merged commit cdd05c7 into hashicorp:main Feb 27, 2024
12 checks passed
@nywilken (Contributor):

Thanks again for the fix @jjo93sa, your patch has been released in https://github.com/hashicorp/packer-plugin-openstack/releases/tag/v1.1.2


Successfully merging this pull request may close these issues.

Build fails with error "Resource not found" after Packer terminates the source server
4 participants