Fish: Deal with ELECTED Applications on restart #65

Open
sparshev opened this issue May 8, 2024 · 0 comments
sparshev commented May 8, 2024

The OOM in #64 exposed another issue: if something goes wrong during Application allocation, the Application gets stuck in the ELECTED state and is abandoned after a restart. This most probably will not happen in a cluster (multiple nodes watch the election process and will notice when the Application is not ALLOCATED in time), but it should not happen in a single-node configuration either.

Expected Behaviour

On restart, the node should clean up and mark the Application allocation as ERROR, or continue the allocation if that is possible.
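
A minimal sketch of the clean-up variant, assuming the node can list the latest state of every Application at startup. The type `ApplicationState`, the function `recoverElected`, the `electedHere` callback and the description text are hypothetical names for illustration, not the actual Fish API; only the status values ELECTED, ALLOCATED and ERROR come from this issue.

```go
package main

import "fmt"

// Status values named in the issue; the Go constant names are hypothetical.
const (
	StatusElected   = "ELECTED"
	StatusAllocated = "ALLOCATED"
	StatusError     = "ERROR"
)

// ApplicationState is a hypothetical, simplified record of one state change.
type ApplicationState struct {
	ApplicationUID string
	Status         string
	Description    string
}

// recoverElected receives the latest state of each Application and returns
// the ERROR states to append for Applications that this node left in ELECTED.
// electedHere is a hypothetical check that the state was elected by this node.
func recoverElected(latest []ApplicationState, electedHere func(ApplicationState) bool) []ApplicationState {
	var out []ApplicationState
	for _, st := range latest {
		if st.Status != StatusElected || !electedHere(st) {
			continue // Only interrupted allocations of this node are touched.
		}
		out = append(out, ApplicationState{
			ApplicationUID: st.ApplicationUID,
			Status:         StatusError,
			Description:    "Allocation interrupted by node restart",
		})
	}
	return out
}

func main() {
	latest := []ApplicationState{
		{ApplicationUID: "app-1", Status: StatusElected, Description: "Elected node: node-a"},
		{ApplicationUID: "app-2", Status: StatusAllocated},
	}
	isMine := func(st ApplicationState) bool { return st.Description == "Elected node: node-a" }
	for _, st := range recoverElected(latest, isMine) {
		fmt.Printf("%s -> %s (%s)\n", st.ApplicationUID, st.Status, st.Description)
	}
}
```

Appending a new ERROR state rather than rewriting the ELECTED one keeps the state history intact. Continuing the allocation instead would need driver-specific knowledge of how far the allocation got, which this sketch does not attempt.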

Actual Behaviour

The Application is not picked up on node startup and stays in the ELECTED state.

Reproduce Scenario (including but not limited to)

This should be relatively easy to reproduce by killing the node in the middle of an allocation and then restarting it.

Steps to Reproduce

  1. Run the Fish node
  2. Try to Allocate something
  3. Kill the node during allocation (as hard as possible)
  4. Start the node again and see that the Application is not picked up and stays in the ELECTED state forever

Platform and Version

  • Ubuntu 20.04.6 LTS
  • Aquarium Fish v0.7.1 (231111.070935)

Logs taken while reproducing problem

  • See the logs in "AWS: OOM during execution" (#64)
  • How the Application State looks after restart:
    {"UID":"12050150-63b1-4ce6-9a1f-3876c06bb82e","application_UID":"12050150-63b1-463f-a0c0-340be13ab1bc","created_at":"2024-05-03T15:27:34.442634343Z","description":"Elected node: <node_name>","status":"ELECTED"}
    
@sparshev sparshev self-assigned this May 8, 2024
@sparshev sparshev added the bug Something isn't working label May 8, 2024