Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Azure Keyvault access stops working correctly with Kustomize-controller >=0.22.0 #595

Closed
log2 opened this issue Mar 25, 2022 · 22 comments
Closed
Labels
area/sops SOPS related issues and pull requests bug Something isn't working

Comments

@log2
Copy link

log2 commented Mar 25, 2022

When using Azure KV with sops-stored secrets in kustomize controller >= 0.22.0 a strange behaviour appears: decryption is rejected due to a bad parameter (HTTP Status code 400) error, with no further indication about what's wrong.

In fact, using Azure Monitor Key Vault functionality and enabling detailed logging (even for failures) I was able to pinpoint the error.

REST API address generated by internal SOPS in kustomize controller is:

https://my-keyvault-name.vault.azure.net/keys/sops-key/1d....my-key-version.../decrypt?api-version=7.3-preview

Instead of

https://my-keyvault-name.vault.azure.net/keys/sops-key/1d....my-key-version.../decrypt?api-version=2016-10-01

Which is generated when I use sops in macOS CLI.

api-version is surely strange, because it does not resembles any API version scheme used in Azure (right?), but it is close enough to the latest version number in Azure SDK for Go, so may be it is a bug on a recent version of that library.

So I started searching history of go.mod and testing recent versions of kustomize-controller that were in use before bumping Flux.

0.22.0 and 0.22.1 exhibit the very same failure, some maybe what bricked Azure KV is this version (0.4.0) of azkeys library, introduced in 0.22.0

kustomize-controller 0.21.1 works just fine, instead.

@hiddeco
Copy link
Member

hiddeco commented Mar 25, 2022

Thank you for your extensive report.

The azkeys library should only be actively used if there are Azure Key Vault credentials configured for the Kustomization resource. Otherwise it will fall back to the default server implementation, which should still make use of github.com/Azure/azure-sdk-for-go v31.2.0 (which is also still included in our go.mod).

However, your observations appear logical to me, which means something about the above apparently is not true.

@hiddeco
Copy link
Member

hiddeco commented Mar 25, 2022

The other change between v0.21.x and v0.22.x is an update of SOPS itself (from v3.7.1 to v3.7.2). Can you confirm the v3.7.2 SOPS binary works correctly on your machine?

@hiddeco hiddeco added the area/sops SOPS related issues and pull requests label Mar 25, 2022
@log2
Copy link
Author

log2 commented Mar 25, 2022

Yes @hiddeco , just tested with:

asdf install sops 3.7.2 (I was using sops 3.7.1 pinned in global asdf settings)

and

asdf shell sops 3.7.2

I can confirm that decryption with AzureKV-backed keys works fine with sops 3.7.2, too, at least in my case.

@hiddeco
Copy link
Member

hiddeco commented Mar 25, 2022

Think I found the issue, will prepare a patch.

@hiddeco
Copy link
Member

hiddeco commented Mar 25, 2022

Can you please give the following (AMD64) image a try? hiddeco/kustomize-controller:fix-azkv-fallback-2495ea9

@hiddeco hiddeco added the bug Something isn't working label Mar 25, 2022
@log2
Copy link
Author

log2 commented Mar 25, 2022

Nope, it is still in error :-(

Group 0: FAILED
  https://xyz.vault.azure.net/keys/sops-key/<version>: FAILED
    - | failed to decrypt data: POST
      | https://xyz.vault.azure.net/keys/sops-key/<version>/decrypt
      | --------------------------------------------------------------------------------
      | RESPONSE 400: 400 Bad Request
      | ERROR CODE: BadParameter
      | --------------------------------------------------------------------------------
      | {
      |   "error": {
      |     "code": "BadParameter",
      |     "message": "The parameter is incorrect.\r\n"
      |   }
      | }
      | -------------------------------------------------------------------------------- 

@hiddeco
Copy link
Member

hiddeco commented Mar 25, 2022

Are you sure this information is from after you rolled out the above image? As based on the patch I created, I am quite sure this was the mistake, and the patch must resolve the issue.

Latest image, fully up-to-date with current state of PR: hiddeco/kustomize-controller:fix-azkv-fallback-dc5486f

@hiddeco
Copy link
Member

hiddeco commented Mar 25, 2022

Now available as an official release in https://github.com/fluxcd/flux2/releases/tag/v0.28.3

@log2
Copy link
Author

log2 commented Mar 26, 2022

Hi @hiddeco , thank you for your quick intervention. Unfortunately that did not solve the issue, which I still have both for your image, hosted on hiddeco DockerHub repo, and with the official kustomize-controller image (0.22.2, released together with Flux 0.28.3). Yes, I have waited for the deployment to complete the rollout, before forcing a reconciliation on the problematic kustomizations, and getting their result.

Anyway, while I'm sure that handling in a better manner the fallback to the default server for Azure KV, as you did on this patch, I don't think this can I help in my case. Incidentally, both fallback and secretRef-provided auth information are the same in my case (we're not ready to use different keys for various kustomizations, yet we are aiming for it in the future). So, using fallback or secretRef-provided (kustomization-specific) auth wouldn't make any difference, at least in my case. I'm sorry for not having specified this clearly beforehand.

Moreover, I did already check that appropriate token was retrieved by Azure AD from sops. The problem surfaces later: for some reason, Azure API (azkeys) sends bad api_version in REST call. This appears beyond scope of kustomize-controller, which is merely a user of that SDK by means of embedded sops module.

Can you think about something bad happening there, instead of in kustomize-controller? Thanks!

@log2
Copy link
Author

log2 commented Mar 26, 2022

However, @hiddeco, I can confirm that, if secretRef is removed, fallback works perfectly right now with kustomize-controller 0.22.2, so your patch works in that respect. This is a workaround allowing me to use recent kustomization versions in my case, as long as I do not need to use different secrets/keys/key vaults in different kustomizations. I can afford not using that at this point in development, so thank you!

The problem remains with kustomizations using both secretRef and AzureKV, which, as you stated before, uses a different SDK, where the problems still lies.

@BertelBB
Copy link

BertelBB commented Mar 28, 2022

Can confirm that kustomize-controller crashes as soon as it tries to reconcile a Kustomization that uses .spec.decryption.provider: sops and has encrypted secrets stored under .spec.path.

runtime: goroutine stack exceeds 1000000000-byte limit
runtime: sp=0xc024080368 stack=[0xc024080000, 0xc044080000]
fatal error: stack overflow

runtime stack:
runtime.throw({0x1e93518, 0x3168800})
	/usr/local/go/src/runtime/panic.go:1198 +0x71
runtime.newstack()
	/usr/local/go/src/runtime/stack.go:1088 +0x5ac
runtime.morestack()
	/usr/local/go/src/runtime/asm_amd64.s:461 +0x8b

goroutine 328 [running]:
github.com/fluxcd/kustomize-controller/internal/sops/keyservice.Server.Decrypt({0x0, {0xc00077da10, 0x25}, {0x0, 0x0, 0x0}, {0x0, 0x0}, 0x0, {0x22007a8, ...}}, ...)
	/workspace/internal/sops/keyservice/server.go:216 +0x505 fp=0xc024080378 sp=0xc024080370 pc=0x1a0f4a5
...

I'm also using aad-pod-identity.

@BertelBB
Copy link

Now available as an official release in https://github.com/fluxcd/flux2/releases/tag/v0.28.3

In my case, this release is broken.

@hiddeco
Copy link
Member

hiddeco commented Mar 29, 2022

@BertelBB can you share more details about your SOPS configuration and your Azure setup so I can replicate your precise setup?

@hiddeco
Copy link
Member

hiddeco commented Mar 29, 2022

@log2 for the issues you described in #595 (comment), I expect https://github.com/fluxcd/kustomize-controller/releases/tag/v0.22.3 to now work without issues.

@BertelBB this might solve your issues as well. If not, please provide me with the output of kubectl get kustomization <name> -oyaml --show-managed-fields, plus some details about your specific SOPS setup.

@BertelBB
Copy link

@hiddeco I installed the latest version of gotk on one of my clusters this morning and the kustomize-controller functioned properly while decrypting sops secrets. I guess it is using v0.22.3, I didn’t check the release notes.

I don’t have access to my setup but I can give you a high level overview.

aad-pod-identity using the latest official Helm chart with an identity authorized to use a key in Azure Key Vault.

kustomize-controller deployed with gotk@v0.28.3 and patched to bind with Azure Identity and use MSI authentication.

Kustomization object configured with .decryption.provider: sops and in the path it references there is a Secret that is encrypted using the key mentioned above. The kustomize-controller crashes as soon as it tries to reconcile this Kustomization. It has no issues reconciling Kustomizations that use the sops decryption provider as long as there are no encrypted secrets in the path referenced.

Downgrading gotk to an older version resolves the issue.

@BertelBB
Copy link

BertelBB commented Mar 30, 2022

Ok I looked into it and my dev cluster that is running the latest version of gotk (v0.28.4) and has kustomize-controller@v0.22.2 and there are no issues decrypting secrets, which is strange because that is the same version as in gotk@v0.28.3. There is another difference between this cluster and the one that was failing, it is using aad-pod-identity@4.1.8. I was wrong when I said the failing cluster was using the latest version of aad-pod-identity Helm chart, it is in fact using aad-pod-identity@4.1.7. Not sure if it matters.

I can try replicating the setup that was failing and give you a more detailed report if you want

@log2
Copy link
Author

log2 commented Mar 30, 2022

Confirmed, @hiddeco !

In a staging cluster I've used Flux 0.28.4 + patch to use kustomize-controller 0.22.3 (since in Flux 0.28.4 the default kustomize-controller is 0.22.2), with the following outcome: I can use secretRef.name in spec.decryptor and the decryption process works fine.

Thank you!

@hiddeco
Copy link
Member

hiddeco commented Mar 30, 2022

@BertelBB can you first confirm that the cluster that was misbehaving didn't accidentally have kustomize-controller v0.22.0 or v0.22.1 deployed? As the fallback to the one which should pick up the MSI setting you have deployed got restored in >=v0.22.2.

I am investigating another issue which I am experiencing in a replicated environment like yours, where the MSI authentication flow of SOPS seems to become unresponsive. However, this also seems to happen for anything in the v0.21.0 range, making it not specific to this v0.22.x release.

@hiddeco
Copy link
Member

hiddeco commented Mar 30, 2022

I am investigating another issue which I am experiencing in a replicated environment like yours, where the MSI authentication flow of SOPS seems to become unresponsive.

This appears to just take a very long time in the versions I tried out, eventually ending up with an error like:

 Warning  BuildFailed  7m42s (x3 over 30m) kustomize-controller  decryption failed for 'azkv': DataWithFormat: GetDataKey: Failed to get the data key required to decrypt the SOPS file.

Group 0: FAILED
  https://fluxcd-07dc50e47be34695.vault.azure.net/keys/sops-cluster0/40226a34c0ea4432aa9a3a5026845c4b: FAILED
    - | Error decrypting key:
      | azure.BearerAuthorizer#WithAuthorization: Failed to refresh
      | the Token for request to
      | https://fluxcd-07dc50e47be34695.vault.azure.net/keys/sops-cluster0/40226a34c0ea4432aa9a3a5026845c4b/decrypt?api-version=2016-10-01:
      | StatusCode=404 -- Original Error: adal: Refresh request
      | failed. Status Code = '404'. Response body: getting assigned
      | identities for pod
      | flux-system/kustomize-controller-58595d4b5d-7gzlq in
      | ASSIGNED state failed after 20 attempts, retry duration
      | [5]s, error: <nil>. Check MIC pod logs for identity
      | assignment errors
      |  Endpoint
      | http://169.254.169.254/metadata/identity/oauth2/token?api-version=2018-02-01&resource=https%!!(MISSING)A(MISSING)%!!(MISSING)F(MISSING)%!!(MISSING)F(MISSING)vault.azure.net

@BertelBB
Copy link

It was running kustomize-controller@v0.22.1

kustomize-controller-v0 22 1

@hiddeco
Copy link
Member

hiddeco commented Mar 30, 2022

That means the cause of the issue was the broken fallback, which was restored in v0.22.2 (and fixed at the time of #595 (comment)).

I will be making some structural changes to the code so this can not happen again, while adding some more tests for other KMS solutions to ensure regression bugs (or compatibility issues with upstream) will be detected earlier, as already done for Azure Key Vault in #604.

Given all this, I think this issue can be closed, but do not hesitate to comment (or open a new issue) if new problems and/or questions arise. Thanks all 🙇

@hiddeco hiddeco closed this as completed Mar 30, 2022
@hiddeco
Copy link
Member

hiddeco commented Mar 30, 2022

Manual patches to overwrite the source-controller version can be removed when updating to https://github.com/fluxcd/flux2/releases/tag/v0.28.5. This should solve all reported problems in this issue (and the general v0.22.x range of this controller).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/sops SOPS related issues and pull requests bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants