Wait for SSA Snapshot Success #125
We need to wait for success status from the SSA Snapshot operations for both Managed and non-managed (blob) disks. The Managed snapshot will use the Snapshot Service to wait while the blob snapshot will use the Storage Account Service to wait.
@@ -142,7 +143,8 @@ def vm_create_evm_managed_snapshot(vm)
 snap_svc.get(ssa_snap_name, resource_group)
 rescue ::Azure::Armrest::NotFoundException, ::Azure::Armrest::ResourceNotFoundException => err
 begin
-snap_svc.create(ssa_snap_name, resource_group, snap_options)
+response = snap_svc.create(ssa_snap_name, resource_group, snap_options)
+raise "Maximum snapshot wait time exceeded" unless snap_svc.wait(response.response_headers, SSA_SNAPSHOT_WAIT_TIME) == "Succeeded"
I'd change this to =~ /^succe/i instead of == "Succeeded" because, depending on the operation, Azure returns either "Success" or "Succeeded", and I don't remember where or why.
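To illustrate the difference between the two checks, here is a minimal sketch. The helper name `snapshot_succeeded?` is hypothetical and not part of azure-armrest; "Success" and "Succeeded" come from the comment above, while the other status strings are made up for the comparison.

```ruby
# Hypothetical helper: true when a status string reports success.
# The pattern matches both "Success" and "Succeeded", case-insensitively.
def snapshot_succeeded?(status)
  !!(status.to_s =~ /^succe/i)
end

# "InProgress" and "Failed" are illustrative values, not confirmed Azure statuses.
%w[Succeeded Success succeeded InProgress Failed].each do |status|
  exact = (status == "Succeeded")
  puts "#{status}: exact=#{exact} regex=#{snapshot_succeeded?(status)}"
end
```

The exact comparison accepts only one of the five strings, while the pattern accepts the three success spellings and still rejects the rest.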
@djberg96 cool - thanks for the heads up.
@@ -158,6 +160,7 @@ def vm_create_evm_blob_snapshot(vm)
 _log.debug("vm=[#{vm.name}] creating SSA snapshot for #{vm.blob_uri}")
 begin
 snapshot_info = vm.storage_acct.create_blob_snapshot(vm.container, vm.blob, vm.key)
+raise "Maximum snapshot wait time exceeded" unless vm.storage_acct_service.wait(snapshot_info, SSA_SNAPSHOT_WAIT_TIME) == "Succeeded"
Same, same.
ditto.
Changing the response checking string parsing per @djberg96 comments.
👍
@bronaghs Jerry is going to change the timeout handling.
Rather than timing out after 30 minutes as we did previously, have each wait time out after one minute, but loop until the operation succeeds or aborts. The overall timeout will instead occur at the job level.
After offline discussion with @roliveri we changed the wait processing so that the Job, rather than the wait, controls the timeout.
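The approach described above can be sketched as follows. This is only an illustration: `FakeSnapshotService` is a hypothetical stand-in that mimics the shape of `snap_svc.wait`, and Ruby's `Timeout` module stands in for the job-level timeout, which in the real code is handled by the job framework rather than by this call.

```ruby
require "timeout"

# Hypothetical stand-in for a service whose wait call returns the
# current operation status after one bounded wait.
class FakeSnapshotService
  def initialize(statuses)
    @statuses = statuses
  end

  # Return the next queued status; repeat the last one once the queue runs out.
  def wait(_headers)
    @statuses.length > 1 ? @statuses.shift : @statuses.first
  end
end

# Poll until the status reports success. No deadline is enforced here;
# the caller owns the overall timeout.
def wait_for_snapshot(service, headers)
  loop do
    status = service.wait(headers)
    return status if status =~ /^succe/i
  end
end

# "InProgress" is an illustrative status value, not a confirmed Azure one.
svc = FakeSnapshotService.new(%w[InProgress InProgress Succeeded])
job_timeout = 5 # seconds; assumed value standing in for the job-level timeout
result = Timeout.timeout(job_timeout) { wait_for_snapshot(svc, {}) }
puts result
```

Keeping the deadline outside the polling loop means a long-running snapshot is cut off in one place only, instead of by two competing timers.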
Checked commits jerryk55/manageiq-providers-azure@5c7a6bf~...b4d2c73 with ruby 2.2.6, rubocop 0.47.1, and haml-lint 0.20.0
@miq-bot add_label fine/yes
@roliveri - can you approve?
-snap_svc.create(ssa_snap_name, resource_group, snap_options)
+response = snap_svc.create(ssa_snap_name, resource_group, snap_options)
+# wait a minute at a time, allowing the Job Timeout to handle long-running snapshots here
+next until snap_svc.wait(response.response_headers) =~ /^succe/i
This probably can be merged for now, so we can get the fix out. But this condition is too specific. If it doesn't complete successfully, we'll have to wait for the job to time out. The test should be more along the lines of:
next if snap_svc.wait(response.response_headers) =~ /^In progress/i
Not sure of the specific "in progress" state.
Same comment for the similar check below.
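One way to picture the suggestion is a three-way check that keeps polling only while the operation is still running and stops immediately on anything else. The sketch below is hypothetical: `classify_status` and `wait_until_finished` are illustrative names, and the in-progress and failure status strings are assumptions, since the exact values are not confirmed in this thread.

```ruby
# Hypothetical classifier. Only the success pattern comes from the PR;
# the in-progress pattern is an assumption.
def classify_status(status)
  case status.to_s
  when /^succe/i         then :done
  when /^in\s?progress/i then :pending
  else                        :failed
  end
end

# Walk a sequence of polled statuses: return on success, keep going while
# pending, and raise right away on any other status instead of waiting
# for an outer timeout.
def wait_until_finished(statuses)
  statuses.each do |status|
    case classify_status(status)
    when :done    then return status
    when :pending then next
    else raise "Snapshot ended with status #{status.inspect}"
    end
  end
  raise "Status source exhausted before completion"
end

puts wait_until_finished(["In progress", "InProgress", "Succeeded"])
```

With this shape, an unexpected terminal status surfaces as an error on the next poll rather than after the job times out.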
I discussed that with @djberg96 last Wednesday. His opinion was that an exception would be raised if the operation completed unsuccessfully, so we would never reach the test in that case.
Wait for SSA Snapshot Success
(cherry picked from commit 9217377)
https://bugzilla.redhat.com/show_bug.cgi?id=1488967
https://bugzilla.redhat.com/show_bug.cgi?id=1491310
Fine backport details:
I approve the changes.
This is the story of the continuing saga of the BZs:
https://bugzilla.redhat.com/show_bug.cgi?id=1463780 (for a running non-managed disk VM)
and
https://bugzilla.redhat.com/show_bug.cgi?id=1475540 (for Managed Disk VMs)
This PR needs to be backported to FINE and will be added to a hot fix covering both of these BZs.
Two further PRs are still needed, although they do not need to be backported: one to bump the manageiq-smartstate gem version to 1.4, and one in manageiq to use 1.4.
@roliveri please review for sanity (especially the length of the wait)
@bronaghs or @djberg96 please review and merge so we can put these blocker BZs to bed and push out the hot fix. Thanks all.