
Fix intermittent build failure due to race condition in upstream server deletion #118

Merged: 3 commits merged into hashicorp:main on Feb 27, 2024

Conversation

@jjo93sa (Contributor) commented Dec 12, 2023

We have experienced intermittent image build failures since the fix in commit b30391d for Cinder volume detachment. In the failure cases, OpenStack deletes the server quickly enough that the subsequent "Get" call returns a 404, so the error check aborts the build. When the deletion is slower, the Get call still finds the server and status checking proceeds as intended.

Removing lines 141-144 inclusive, i.e. the check of the error code on the Get call, enables correct status checking of the instance being deleted.
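For reference, the Get call and the error check in question (lines 140-144 of the current code, quoted again in the review below) look like this; only the check on lines 141-144 is removed, while the Get on line 140 stays:

	server, err := servers.Get(computeClient, instance).Extract()
	if err != nil {
		err = fmt.Errorf("Error getting server to terminate: %s", err)
		return err
	}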

Closes #105

co-authored by:
jameso@graphcore.ai
john.garbutt@stackhpc.com

@jjo93sa requested a review from a team as a code owner December 12, 2023 10:17
@hashicorp-cla commented Dec 12, 2023

CLA assistant check
All committers have signed the CLA.

@JohnGarbutt (Contributor) left a comment


James, thanks for submitting this and testing it out.

A quick thing on the formatting of co-authored-by:
https://docs.github.com/en/pull-requests/committing-changes-to-your-project/creating-and-editing-commits/creating-a-commit-with-multiple-authors
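For reference, the documented format is one trailer line per co-author at the end of the commit message body, for example:

Co-authored-by: NAME <name@example.com>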

@@ -138,11 +138,6 @@ func DeleteServer(state multistep.StateBag, instance string) error {
}

server, err := servers.Get(computeClient, instance).Extract()
@JohnGarbutt (Contributor):

I think we should also remove this.

@jjo93sa (Contributor, Author):

Line 140? That's needed for these lines:
stateChange := StateChangeConf{
Pending: []string{"ACTIVE", "BUILD", "REBUILD", "SUSPENDED", "SHUTOFF", "STOPPED"},
Refresh: ServerStateRefreshFunc(computeClient, server),
Target: []string{"DELETED"},
}

@JohnGarbutt (Contributor):

Ah sorry, of course! I remember now: that bit of code correctly treats 404 as success when checking for a deleted server, which the code you have removed did not.

@nywilken (Contributor):

Hi @jjo93sa and @JohnGarbutt - @lbajolet-hashicorp and I had a chance to review this to better understand what is happening. Based on our understanding of the code, we suggest one of the following two changes to address this issue.

  1. The preferred change, since ServerStateRefreshFunc is used in three places, would be to drop the server pointer in favor of just the instance ID:
// ServerStateRefreshFunc returns a StateRefreshFunc that is used to watch
// an openstack server.
func ServerStateRefreshFunc(
	client *gophercloud.ServiceClient, instanceID string) StateRefreshFunc {
	return func() (interface{}, string, int, error) {
		serverNew, err := servers.Get(client, instanceID).Extract()
		if err != nil {
			if _, ok := err.(gophercloud.ErrDefault404); ok {
				log.Printf("[INFO] 404 on ServerStateRefresh, returning DELETED")
				return nil, "DELETED", 0, nil
			}
			log.Printf("[ERROR] Error on ServerStateRefresh: %s", err)
			return nil, "", 0, err
		}

		return serverNew, serverNew.Status, serverNew.Progress, nil
	}
}
  2. Move lines 140-144 before the deletion loop; this way you have a handle to the valid server before it gets deleted. The result of server, err := servers.Get(computeClient, instance).Extract() is then used by Refresh: ServerStateRefreshFunc(computeClient, server) on line 143. Since server can't be nil at that point, if the Get call does error it is valid to return the error, because it would be caused by something other than the race condition:
	server, err := servers.Get(computeClient, instance).Extract()
	if err != nil {
		err = fmt.Errorf("Error getting server to terminate: %s", err)
		return err
	}

	ui.Say(fmt.Sprintf("Terminating the source server: %s ...", instance))
	for {
		if config.ForceDelete {
			err = servers.ForceDelete(computeClient, instance).ExtractErr()
		} else {
			err = servers.Delete(computeClient, instance).ExtractErr()
		}

		if err == nil {
			break
		}

		if _, ok := err.(gophercloud.ErrDefault500); !ok {
			err = fmt.Errorf("Error terminating server, may still be around: %s", err)
			return err
		}

		if numErrors < maxNumErrors {
			numErrors++
			log.Printf("Error terminating server on (%d) time(s): %s, retrying ...", numErrors, err)
			time.Sleep(2 * time.Second)
			continue
		}
		err = fmt.Errorf("Error terminating server, maximum number (%d) reached: %s", numErrors, err)
		return err
	}

	stateChange := StateChangeConf{
		Pending: []string{"ACTIVE", "BUILD", "REBUILD", "SUSPENDED", "SHUTOFF", "STOPPED"},
		Refresh: ServerStateRefreshFunc(computeClient, server),
		Target:  []string{"DELETED"},
	}
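Under option 1, by contrast, the up-front Get on line 140 would presumably no longer be needed at all, since the refresh function re-resolves the server from its ID on each poll and maps a 404 to DELETED; the Refresh line above would become something like:

	Refresh: ServerStateRefreshFunc(computeClient, instance),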

@nywilken (Contributor):

Hey folks, are you still open to pushing these changes forward?

@jjo93sa (Contributor, Author) commented Feb 23, 2024 via email

@jjo93sa (Contributor, Author) commented Feb 27, 2024

@nywilken I've made the preferred change (1) and tested it in our pipeline; it seems to work as expected. I've pushed those changes to my fork, and I think they've been automatically picked up by the PR - commit [ea36d7c].
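A sketch of how the tail of DeleteServer plausibly looks after that change (assuming the plugin's existing WaitForState helper, which polls Refresh until a Target state or an error is returned; a reconstruction, not the exact committed diff):

	// No up-front Get is needed any more: the refresh function resolves
	// the server from its ID on each poll and treats a 404 as DELETED.
	stateChange := StateChangeConf{
		Pending: []string{"ACTIVE", "BUILD", "REBUILD", "SUSPENDED", "SHUTOFF", "STOPPED"},
		Refresh: ServerStateRefreshFunc(computeClient, instance),
		Target:  []string{"DELETED"},
	}
	if _, err := WaitForState(&stateChange); err != nil {
		return err
	}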

@nywilken (Contributor) left a comment


Thanks for re-rolling and testing the changes. This looks good to me.

@nywilken changed the title from "Error check causes failure to build image" to "Fix intermittent build failure due to race condition with cinder detachment" Feb 27, 2024
@nywilken added the bug label Feb 27, 2024
@nywilken changed the title from "Fix intermittent build failure due to race condition with cinder detachment" to "Fix intermittent build failure due to race condition in upstream server deletion" Feb 27, 2024
@nywilken merged commit cdd05c7 into hashicorp:main Feb 27, 2024
12 checks passed
@nywilken (Contributor):

Thanks again for the fix @jjo93sa, your patch has been released in https://github.com/hashicorp/packer-plugin-openstack/releases/tag/v1.1.2


Successfully merging this pull request may close these issues.

Build fails with error "Resource not found" after Packer terminates the source server
4 participants