Skip to content
This repository has been archived by the owner on Sep 8, 2023. It is now read-only.

fix: Session crash stop #350

Merged
merged 6 commits into from
Aug 31, 2023
Merged

fix: Session crash stop #350

merged 6 commits into from
Aug 31, 2023

Conversation

jonas-martinez
Copy link
Contributor

@jonas-martinez jonas-martinez commented Jun 23, 2023

Closes #

Description of the changes

Checklist

  • I didn't over-scope my PR
  • My PR title matches the commit convention
  • I did not include breaking changes
  • I made my own code-review before requesting one

I included unit tests that cover my changes

  • 👍 yes
  • 🙅 no, because they aren't needed
  • 🙋 no, because I need help

I added/updated the documentation about my changes

  • 📜 README.md
  • 📕 docs/*.md
  • 📓 docs.lenra.io
  • 🙅 no documentation needed

Technical highlight/advice

@jonas-martinez jonas-martinez added the bug Something isn't working label Jun 23, 2023
@jonas-martinez jonas-martinez self-assigned this Jun 23, 2023
@jonas-martinez jonas-martinez requested review from pichoemr and removed request for pichoemr July 21, 2023 12:47
pichoemr
pichoemr previously approved these changes Aug 31, 2023
@taorepoara taorepoara merged commit e8ab33c into beta Aug 31, 2023
3 checks passed
@taorepoara taorepoara deleted the fix-session-crash-stop branch August 31, 2023 10:04
@Nesqwik
Copy link
Member

Nesqwik commented Sep 5, 2023

I'm not sure about this...
Why did you needed to do this ?
What was the issue ?
If the log was critical with an error "It should NEVER happen" maybe there is a reason and the fix should be somewhere else ?

Please give more informations about this !

We should probably revert the changes

@jonas-martinez
Copy link
Contributor Author

I just added a check to make sure that we can stop the environments if the session_pid is defined. If it is not defined it means that the session was already closed and no environments should be stopped.

The problem came when the session was stopping, it tried to count the children of a bad session_pid which caused an error and crashed the server.

I don't think that this should be reverted as it is just adding a simple check, the session is still properly stopped and the environments too.

@Nesqwik
Copy link
Member

Nesqwik commented Sep 6, 2023

Ok I understand but this PR could create some issue...

If the session_id is not a pid, it means there is some error way before in the execution flow. This is why the critical error say that there is an issue that should not happen.

I'll try to explain why i think this should be reverted.

  # Here we take the Session.DynamicSupervisor pid.
  # This is the supervisor that handle every session for this environment.
  session_pid = Swarm.whereis_name(Session.DynamicSupervisor.get_name(env_id))

  # You check if the session_pid is actually a pid.
  # Here, if the session_pid is NOT a Pid, we have an other issue because this should NEVER be the case.
  # If the pid is nil for example, this means the dynamic supervisor crashed for whatever reason OR Swarm cannot find the pid in the register. 
  # Since the supervisor does not handle any business logic, it should NEVER crash on its own.
  # And if it IS crashing for whatever reason, the environment supervisor should restart it on the fly anyway.
  # So fixing this is like adding rubber band to fix the leak instead of properly fixing the leak
  if is_pid(session_pid) do
    ...
  end

I can think of 2 possible issues :
First one :
The Session.DynamicSupervisor crash and is not properly restarted if it crash for whatever reason. This should be fixed in the supervisor that manage this Session.DynamicSupervisor. I think it's in the Environment.Supervisor. Also, we should check why this supervisor crash and fix that too since it should not happen.
Second one :
The Session.DynamicSupervisor is not properly registered in Swarm (first start or after a restart) or swarm cannot find him for whatever reason.

This needs some more investigation, but this fix is not the proper one IMO.

@jonas-martinez
Copy link
Contributor Author

jonas-martinez commented Sep 6, 2023

You're right, this still needs some investigation and we still need to add some sort of error handling to stop the server from crashing.

I don't think that this should be reverted because this causes more problems reverted than not.

@pichoemr
Copy link
Contributor

pichoemr commented Sep 6, 2023

I'll try to be as clear as possible, we can discuss it if not :)

The addition of the if is just to avoid a 500 error when we want to stop the Session.DynamicSupervisor after a session crash.
In fact, we have an unexpected behavior with our genserver tree, when a session crashes, the Session.DynamicSupervisor doesn't try to restart the session but crashes and the Environment.Supervisor does the same, it doesn't try to restart and crash.

For the first case:
It's not really a problem, when a session stops we call the function and session_pid = Swarm.whereis_name(Session.DynamicSupervisor.get_name(env_id)) returns null, the is_pid filters it and all environment trees are stopped (crash). I know this is now the desired behavior, but for now it works fine.

For the second:
If the Session.DynamicSupervisor is not registered correctly the tree is stopped with an error (normally I'm not 100% sure), if swarm doesn't find it for some reason we can't do anything, if we move the logic or whatever swarm style doesn't find it...

This needs some more investigation, but this fix is not the proper one IMO.

Sure, but the real solution is to find out why dynamic supervisors crash and don't try to restart the supervisor. it's not worth spending time on a simple if...

@Nesqwik
Copy link
Member

Nesqwik commented Sep 6, 2023

Sure, but the real solution is to find out why dynamic supervisors crash and don't try to restart the supervisor. it's not worth spending time on a simple if...

Ok but it's not about this "if" but more about fixing the real issue instead of hiding the error.
This PR just hide the actual error without fixing it. We should not do that.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bug Something isn't working
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

4 participants