existing cluster using non-functional VHD prevents VMSS scale out #2074
Comments
👋 Thanks for opening your first issue here! If you're reporting a 🐞 bug, please make sure you include steps to reproduce it.
I believe this issue should be resolved via #2022 and #2006. If you upgrade to https://github.com/Azure/aks-engine/releases/tag/v0.38.9, you should be fixed. /cc @jackfrancis please correct me if I've misstated something.
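For reference, a minimal sketch of pulling down the v0.38.9 release binary; the artifact name is assumed from the usual aks-engine release naming convention, so adjust the OS/arch and paths for your environment:

```bash
# Fetch and unpack the v0.38.9 release (artifact name assumed from the usual
# aks-engine release naming; adjust for your platform).
curl -LO https://github.com/Azure/aks-engine/releases/download/v0.38.9/aks-engine-v0.38.9-linux-amd64.tar.gz
tar -xzf aks-engine-v0.38.9-linux-amd64.tar.gz
./aks-engine-v0.38.9-linux-amd64/aks-engine version
```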
@PascalVA Thank you for that detailed write-up, your summary of the situation is spot on. There are two ways to "re-deploy bug fixes" via aks-engine:
Step #1 is safer and more tolerant to environmental failure, but more manual. That said, I would argue that if you are validating whether or not a newer version of aks-engine actually has the fix you want, using that newer version and executing a scale out against the cluster you want to fix is always the first thing you should try: the downside is minimal (a period of time with a broken extra node, plus your own troubleshooting time/effort) compared to potentially breaking your cluster even more. (It should be noted that step #1 doesn't address bug fixes you want to apply to your control plane VMs.) You are also correct that the VHD (OS image) reference is not exposed via the api model. We could reconsider that, but so far it's made sense to essentially always pin and obscure the specific VHD image rather than surface it there. Definitely interested in your feedback, thanks!
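As an illustration (not a prescriptive recipe), a scale out against an existing cluster with a newer aks-engine binary looks roughly like the following. The resource group, node pool name, and api model path are placeholders, and the exact flag set can vary between aks-engine versions:

```bash
# Scale the existing pool out by one node using the newer binary.
# All values below are placeholders for your own deployment.
./aks-engine scale \
  --subscription-id "$SUBSCRIPTION_ID" \
  --resource-group my-cluster-rg \
  --location westeurope \
  --client-id "$AZURE_CLIENT_ID" \
  --client-secret "$AZURE_CLIENT_SECRET" \
  --api-model _output/my-cluster/apimodel.json \
  --node-pool agentpool1 \
  --new-node-count 4
```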
@PascalVA I took the liberty of renaming this issue to capture the general problem, so that other folks experiencing it might land here. I think we should aim to produce some troubleshooting documentation as an outcome of this thread to better help folks build workflows in the future.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Bug Description
When trying to add a new node pool to our existing cluster, which was deployed with aks-engine v0.38.8 (aks-ubuntu-1804), the deployment failed. The cause was an expired NVIDIA GPG key in the Ubuntu base image (VMExtensionProvisioningError, status=99). This was a hard problem to solve for a few reasons:
1. There is no way to redeploy after manually fixing the issue on the nodes
When we had a failed deployment and manually added the NVIDIA GPG key and ran the apt updates on the affected machines, redeploying with aks-engine onto the existing machines just kept returning the same error as before we fixed it (VMExtensionProvisioningError, status=99).
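For context, the manual workaround on an affected node looked roughly like this. It is only a sketch and assumes the expired key belongs to the nvidia-docker apt repository, which may not match your exact repository configuration:

```bash
# Re-import the NVIDIA apt repository key and refresh the package index
# (assumes the expired key is the nvidia-docker/libnvidia-container repo key).
curl -fsSL https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
sudo apt-get update
```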
2. There is no way (or at least none that I know of) to inject commands to run before (or inside of) the cloud provisioning scripts when working with an existing cluster
Whenever we tried to add the command to update the key as a hotfix in the cloud provisioning scripts, we received an error that the customData cannot be updated. This is probably because the old nodes were deployed with different init scripts.
3. The Azure VM images you can use are built into the aks-engine binary
The last thing we tried was to use a newer image of aks-ubuntu-1804 to get around the key issue. To our surprise, the images you can use are built into the aks-engine binary. Perhaps we could add a way to use custom image SKUs and versions instead of baking them into the binary?
We managed to build a custom aks-engine binary with an image reference we called aks-engine-1804-patched, which referenced a newer image from September (that image has different issues, related to systemd-resolved), but at least the deployment worked. I have put the diff (relative to the 0.38.8 release) in the additional context.
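For anyone who needs to do something similar, the rough shape of the workaround was to rebuild aks-engine from the v0.38.8 tag with the built-in image reference pointed at a newer VHD. This is only a sketch: the file path, build commands, and image values are illustrative and not the exact diff mentioned above:

```bash
# Rebuild aks-engine with a patched built-in image reference (illustrative only).
git clone https://github.com/Azure/aks-engine.git
cd aks-engine
git checkout v0.38.8
# Edit the built-in OS image definitions (in this release they live in the Go
# source, e.g. pkg/api/azenvtypes.go) and point the aks-ubuntu-1804 config at a
# newer image SKU/version, then rebuild the binary.
make build
./bin/aks-engine version
```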
Steps To Reproduce
I don't think there is an easy way to reproduce the issue, because you would need an existing cluster deployed with aks-engine v0.38.8 using the Ubuntu images from before the key expired.
Expected behavior
We should have a way to keep deploying with older versions of aks-engine even if the base images baked into them break. Alternatively, we should be able to deploy with newer versions of aks-engine onto older clusters and have these compatibility issues handled.
AKS Engine version
v0.38.8
Kubernetes version
v1.13.10
Additional context