Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Status updates in ExternalClient.Create are omitted #285

Open
muvaf opened this issue Sep 2, 2021 · 13 comments
Open

Status updates in ExternalClient.Create are omitted #285

muvaf opened this issue Sep 2, 2021 · 13 comments
Assignees
Labels
enhancement New feature or request stale wontfix This will not be worked on

Comments

@muvaf
Copy link
Member

muvaf commented Sep 2, 2021

What problem are you facing?

With #283, we issue a standard update in all cases and that causes the status to be lost since managed resources treat it as subresource. But there are cases where some information about the resource is available only in the response of the creation call to the API.

How could Crossplane help solve your problem?

We could possibly expand resource.Managed interface to have GetStatus() runtime.Object and SetStatus(runtime.Object) and then DeepCopy the status before the update operation and later call SetStatus on the returned object and issue a status update this time.

@muvaf muvaf added the enhancement New feature or request label Sep 2, 2021
@muvaf
Copy link
Member Author

muvaf commented Sep 13, 2021

@negz this issue causes Creating condition to never appear in status, which makes it feel like there isn't any progress.

@negz negz self-assigned this Sep 15, 2021
negz added a commit to negz/crossplane-runtime that referenced this issue Sep 15, 2021
crossplane#285

This approach causes us to repeat ourselves a bit, but prevents issues like the
above, where refactoring caused us to accidentally overwrite a pending status
update that we hadn't committed.

Signed-off-by: Nic Cope <negz@rk0n.org>
@negz
Copy link
Member

negz commented Sep 15, 2021

Whoops - good catch. I always look at events now so I didn't even notice. I think there's a simpler fix for status conditions specifically per #292. That won't cover the case in which a Create implementation itself is mutating status, but I don't think that is a pattern I've seen yet.

negz added a commit to negz/crossplane-runtime that referenced this issue Sep 17, 2021
crossplane#285

This approach causes us to repeat ourselves a bit, but prevents issues like the
above, where refactoring caused us to accidentally overwrite a pending status
update that we hadn't committed.

Signed-off-by: Nic Cope <negz@rk0n.org>
@negz
Copy link
Member

negz commented Sep 17, 2021

I don't think that is a pattern I've seen yet.

We noted in #292 (comment) that this pattern does in fact exist, at least in some AWS and Azure controllers.

i-dont-think-they-really-exist

@negz
Copy link
Member

negz commented Sep 17, 2021

We could possibly expand resource.Managed interface to have GetStatus() runtime.Object and SetStatus(runtime.Object) and then DeepCopy the status before the update operation and later call SetStatus on the returned object and issue a status update this time.

#280 (comment)

Hasan had a similar suggestion during review of a PR that lead to #283.

@negz
Copy link
Member

negz commented Sep 18, 2021

Despite the guidance around the status field's "recreatability" being relaxed per kubernetes/enhancements#2537 I think we should be warning folks to update status (or really any part of the managed resource) during the Create call as only as a last resort, given that we can't just call Create again on the next Reconcile if we fail to persist the result. It's always better to update the managed resource during Observe where possible.

That said, if we know there are edge cases where we have no alternative but to rely on persisting updates made during Create so it does seem like we need a solution here.

I don't love the idea of having to extract and save status so we can later persist it again, especially if it requires modifying the resource.Managed interface. 🤔 Some potential alternatives (that I don't particularly love either):

  1. We could consider Create updating status to be enough of an edge case that folks should plumb a Kubernetes client down into their ExternalClient to do so.
  2. We could call Status().Update() before we called UpdateCriticalAnnotations(). The risk being if the status update failed we'd fail to persist our critical annotation and get stuck waiting for human intervention. We could alleviate that by retrying the status update, and further alleviate it by adding a UpdatedStatus bool result to the ExternalObservation struct to allow us to avoid trying to persist status at all when there was no need to.

I think the alternative I like the best is still just trying to establish a "thou shalt not update anything (except annotations) during Create" rule. If I understand correctly we haven't yet found any AWS resources that make status updates in Create that they couldn't also make in Observe. We do know some Azure controllers rely on saving their operations, but it actually seems like they do that to workaround the fact that the external resources don't show up in the Azure API for some time after they're created, so we could probably just ignore the operation completely and use the new WithCreationGracePeriod option to cover that case.

@muvaf
Copy link
Member Author

muvaf commented Oct 5, 2021

FYI, this causes reconciler to drop any status update made in Observe before the first successful Create call. In long-running creation calls, like AWS DBs, Azure DBs and async Terrajet calls, the error of the creation is returned in the GET call (either in a field of the returned object or as part of the call). So, we have to call status update in Observe to catch those creation errors.

I think the alternative I like the best is still just trying to establish a "thou shalt not update anything (except annotations) during Create" rule. If I understand correctly we haven't yet found any AWS resources that make status updates in Create that they couldn't also make in Observe.

That rule wouldn't apply in this case unfortunately, where creation error needs to be saved and Create needs to be called repeatedly until the problem is gone. But the solution doesn't have to be related to this issue specifically; we can issue a status update before external.Create call to make sure whatever is set in Observe is persisted.

@negz
Copy link
Member

negz commented Dec 2, 2021

FYI, this causes reconciler to drop any status update made in Observe before the first successful Create call.

I'm working through this today, trying to determine how much of a blocker it could be for updating providers to the latest version of crossplane-runtime. First let me make sure I understand the situation. You're pointing out that in some cases a Create call will appear to succeed, but then later during Observe we'll find out that it didn't really. In this case we probably want to record this situation somehow to the managed resource status.

As far as I can tell, we still will update the status in most cases. For example:

  1. If during Observe we update the MR status then return ResourceExists: true (e.g. the resource was created but is in a bad state) we should fall through to the "resource is up-to-date" clause where we'll persist the status update. I think this is what the RDS example you linked does.
  2. If during Observe we return an error the reconciler should persist that error as a status condition (and event), bringing along any other status updates that were made before the error was returned. I think this is what the Azure Postgres example you linked does.

Is my understanding here correct, or am I missing something? Also, can you let me know whether Terrajet providers are still affected by this? I notice that crossplane/terrajet#103 removed the block you referenced where we were previously recording a status condition during Observe.

@muvaf
Copy link
Member Author

muvaf commented Dec 3, 2021

If during Observe we update the MR status then return ResourceExists: true

The case I'm referring to is where we have to return ResourceExists: false because creation actually failed (even though external.Create succeeded in earlier reconcile) and needs to be retried. But in the current implementation if we return ResourceExists: false at any time, the status is lost, which contained the error about the late-failed creation.

Also, can you let me know whether Terrajet providers are still affected by this? I notice that crossplane/terrajet#103 removed the block you referenced where we were previously recording a status condition during Observe.

That was a workaround and now we're doing a similar thing in callback mechanism by issuing a status update right after the operation completes, but that doesn't apply to native async resources since they don't have such callback mechanism from the APIs they consume. But for Terrajet, we should be good.

@negz
Copy link
Member

negz commented Dec 3, 2021

The case I'm referring to is where we have to return ResourceExists: false because creation actually failed (even though external.Create succeeded in earlier reconcile) and needs to be retried.

Is this hypothetical, or do we do this in practice? As far as I could tell none of the examples you provided follow this pattern.

@muvaf
Copy link
Member Author

muvaf commented Dec 3, 2021

@negz We do that in Terrajet in this line to trigger a new creation, here in Azure PostgreSQL.

In fact, I think it can be considered a bug if we don't do that because then it'd mean we just never retry once a long-running creation fails. We don't see that very often since cloud providers like AWS and GCP make their best to do all validations in the creation calls, albeit there are always cases in those, too, where validation passes but creation fails due to external factors like insufficient capacity.

@muvaf
Copy link
Member Author

muvaf commented Dec 4, 2021

@negz Opened #307 to be more specific about the problem I mentioned above. We can keep this issue contained to status updates being omitted after ExternalClient.Create.

@stale
Copy link

stale bot commented Aug 13, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix This will not be worked on label Aug 13, 2022
Copy link

github-actions bot commented Sep 4, 2024

Crossplane does not currently have enough maintainers to address every issue and pull request. This issue has been automatically marked as stale because it has had no activity in the last 90 days. It will be closed in 14 days if no further activity occurs. Leaving a comment starting with /fresh will mark this issue as not stale.

@github-actions github-actions bot added the stale label Sep 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request stale wontfix This will not be worked on
Projects
None yet
Development

No branches or pull requests

2 participants