Conversation


@psafont psafont commented Oct 29, 2025

An xcp-ng user reported a failure when enabling HA, with the only error being a Not_found. With this change we now know the cause: the IP of the coordinator is not present in the local database's ha_peers value.

While doing this:

  • I also removed several uses of find and hd to reduce these kinds of exceptions and make the problematic uses easier to find in the future
  • The localdb interface was made more robust by using options
  • The cluster_stack values were reified and consolidated into constants. They are problematic because they are strings instead of an enum, but we can tame them a bit

@psafont psafont marked this pull request as draft October 30, 2025 08:36

psafont commented Oct 30, 2025

I thought I had make-checked it after rebasing on top of master. Some of the code that came in with the rebase uses the functions I changed, so I need to make further changes.

Comment on lines 1904 to +1925
  List.iter
    (fun sr ->
      let vdi = Xha_statefile.find_or_create ~__context ~sr ~cluster_stack in
      statefile_vdis := vdi :: !statefile_vdis
    )
-   srs ;
+   [sr] ;
Contributor

Could drop this List.iter since the list only has a single element

Comment on lines +2041 to +2064
List.iter
  (fun (_, exn) ->
    (* Perform a disable since the pool HA state isn't consistent *)
    error "Attempting to disable HA pool-wide" ;
    Helpers.log_exn_continue
      "Disabling HA after a failure during enable" disable_internal
      __context ;
    raise exn
  )
  errors ;
Contributor

This change and the one above switch the code from raising only the first exception to raising each of them one after another. I guess it'd be hard for anything to depend on this, so it should be alright (though it might increase noise in the logs).

Member Author

Because we don't have resumable exceptions, only the first one is actually reported, or none at all
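
The point made in this reply can be checked with a small standalone sketch. The `errors` list and the `visited` counter below are hypothetical stand-ins, not code from the pull request: a `raise` inside the callback aborts `List.iter`, so at most the first element is ever visited, and an empty list raises nothing.

```ocaml
(* Hypothetical stand-in for the (host, exception) pairs collected in [errors]. *)
let errors = [("host-a", Failure "first"); ("host-b", Failure "second")]

(* Counts how many elements the callback actually visits. *)
let visited = ref 0

(* The first [raise] unwinds out of List.iter, so later elements are never
   reached; with an empty list the callback never runs and the result is None. *)
let first_reported =
  try
    List.iter
      (fun (_, exn) ->
        incr visited ;
        raise exn
      )
      errors ;
    None
  with Failure msg -> Some msg
```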

The Not_found and hd exceptions keep popping up, and it's difficult to
find them when there are no backtraces logged.
Remove usages of them, even if they are not problematic so the actual
problematic ones can be flushed out over time.

Signed-off-by: Pau Ruiz Safont <pau.safont@vates.tech>
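
As a rough illustration of the kind of rewrite this commit message describes, the sketch below swaps the raising standard-library functions for their option-returning counterparts. The helper names (`first`, `find_host`, `describe`) and the host data are hypothetical and do not come from the pull request.

```ocaml
(* Hypothetical helper: returns the first element as an option instead of
   calling List.hd, which raises Failure "hd" on an empty list. *)
let first = function [] -> None | x :: _ -> Some x

(* Hypothetical lookup: List.find_opt returns None where List.find would
   raise Not_found. *)
let find_host name hosts =
  List.find_opt (fun (n, _) -> String.equal n name) hosts

(* The caller is forced to handle the missing case explicitly. *)
let describe name hosts =
  match find_host name hosts with
  | Some (_, ip) ->
      Printf.sprintf "%s has address %s" name ip
  | None ->
      Printf.sprintf "%s is not a known host" name
```

With this shape, a missing element shows up as a `None` branch at the call site rather than as an anonymous exception far from its cause.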

Until now, the valid string values were not written down anywhere;
change that.

Ideally this would be done using an enum in the idl, but unfortunately
this changes types of existing parameters in API calls, so it's quite
risky. Instead, take a conservative approach: only enumerate the valid
values and make Constants the only source of truth for these values,
including the default ones, which were previously scattered around.

Signed-off-by: Pau Ruiz Safont <pau.safont@vates.tech>
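
A minimal sketch of what a single source of truth for stringly-typed values might look like. The module name, the value names, and the concrete strings are assumptions made for illustration; the actual identifiers in Constants are not shown in this conversation.

```ocaml
(* Hypothetical module; names and string values are illustrative only. *)
module Cluster_stack_names = struct
  let xhad = "xhad"

  let corosync = "corosync"

  (* The default is defined once, next to the values it is chosen from. *)
  let default = xhad

  (* Every valid value is listed here and nowhere else. *)
  let all = [xhad; corosync]

  let is_valid s = List.mem s all
end
```

Callers would then compare against `Cluster_stack_names.xhad` rather than repeating the literal, so a typo becomes a compile error instead of a silent mismatch.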

Previously it was opened if the default stack was selected, but this
could actually be different from XHAd.

Signed-off-by: Pau Ruiz Safont <pau.safont@vates.tech>

Allows callers to avoid exceptions. The previous get is now called
get_exn so it's clear which users have to be mindful of exceptions.

Not all uses of get_exn were converted to get: the previous behaviour
was widespread, and converting without changing behaviour is not
trivial. It is better to convert a use only once a problem with it is
detected.

Signed-off-by: Pau Ruiz Safont <pau.safont@vates.tech>
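
The get / get_exn split described in this commit message can be sketched as follows. This is an in-memory stand-in built on `Hashtbl`, not the real localdb implementation; the exception name, the `get_or` helper, and the key used in the usage note are assumptions.

```ocaml
(* Hypothetical in-memory stand-in for the local database. *)
let db : (string, string) Hashtbl.t = Hashtbl.create 8

(* Hypothetical exception carrying the key that was not found. *)
exception Missing_key of string

(* Option-returning accessor: absence is an ordinary value. *)
let get key = Hashtbl.find_opt db key

(* Raising accessor: the _exn suffix warns callers at the call site. *)
let get_exn key =
  match get key with Some v -> v | None -> raise (Missing_key key)

(* Hypothetical convenience built on [get]: fall back to a default. *)
let get_or ~default key = Option.value (get key) ~default
```

For example, after `Hashtbl.replace db "ha.enabled" "true"`, `get "ha.enabled"` evaluates to `Some "true"`, while `get_exn` on an absent key raises `Missing_key` with that key attached.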

Joining the liveset has been found to raise `Not_found` in some cases.
Transform these exceptions to others that are more readable and show the
exact cause.

Signed-off-by: Pau Ruiz Safont <pau.safont@vates.tech>
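
One way to turn an anonymous `Not_found` into an error that names its cause is sketched below. The exception, the function, and the assumption that peers are stored as (address, uuid) pairs are all hypothetical; the actual error raised by this change and the real format of ha_peers are not shown in this conversation.

```ocaml
(* Hypothetical exception: records what was looked up and what was available. *)
exception Coordinator_not_in_peers of {address: string; known: string list}

(* Hypothetical lookup over (address, uuid) pairs. List.assoc would raise a
   bare Not_found here; List.assoc_opt lets us raise something descriptive. *)
let uuid_of_address ~address peers =
  match List.assoc_opt address peers with
  | Some uuid ->
      uuid
  | None ->
      raise (Coordinator_not_in_peers {address; known= List.map fst peers})
```

A log line built from this exception can state both the missing address and the addresses that were present, which is the information the original report lacked.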
Signed-off-by: Pau Ruiz Safont <pau.safont@vates.tech>
@psafont psafont marked this pull request as ready for review October 30, 2025 10:33

psafont commented Oct 30, 2025

Fixed another bug caused by the cluster stack values being stringy, so I created a variant to model them. I would have liked to use an enum at the API level, but that is very invasive and we know how problematic that can be.
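
A variant that models stringly-typed values while leaving the API-level strings untouched could look like the sketch below. The type name, constructors, and string spellings are assumptions for illustration, not the definitions added in this pull request.

```ocaml
(* Hypothetical variant; constructors and strings are illustrative only. *)
type cluster_stack = Xhad | Corosync

(* Conversion to the string form used at the API boundary. *)
let to_string = function Xhad -> "xhad" | Corosync -> "corosync"

(* Parsing is the only place where an unknown string has to be handled. *)
let of_string = function
  | "xhad" ->
      Some Xhad
  | "corosync" ->
      Some Corosync
  | _ ->
      None
```

Internal code can then pattern match on `cluster_stack` and rely on exhaustiveness checking, while strings are converted only where values enter or leave the API.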

@psafont psafont requested a review from BengangY October 30, 2025 10:37