Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create base supervisor script #26

Merged
merged 4 commits into from
Jan 27, 2024
Merged

Conversation

nghi01
Copy link
Collaborator

@nghi01 nghi01 commented Jan 24, 2024

Currently still need to implement recovering nodes, need to talk with Alex about implementing recovering nodes from the available public keys and private keys.

async def ping_node(host, port):
try:
reader, writer = await asyncio.open_connection(host, port)
print(f"Alive node at {host}:{port}")
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use logging

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment on lines 12 to 13
writer.close()
await writer.wait_closed()
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These should maybe be in a finally block.

Does the reader need to be closed?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to the asyncio python doc doesn't seem so. I'd look further into it.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clarification:
Both the reader and finally block were not included in the example from the python docs so I did not include them

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah... the docs don't provide clear guidance on cleanup in the face of exceptions. Would be nice if they did.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cannot find any instance where they try to close the reader, so I assume it is not worth doing so.

async def ping_node(host, port):
try:
reader, writer = await asyncio.open_connection(host, port)
print(f"Alive node at {host}:{port}")
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may be good enough for our MVP, but merely being able to connect on the port at the host is not a guarantee that it's our application listening at that moment in time.

A stronger signal would be to have a "healthcheck" ping.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A little more on this beyond the above and our conversation.

In a deployment environment for something we want to ensure stays up, we'd usually use (in increasing preference and not an exhaustive list) one of:

  1. a monitor script with cron
  2. a supervising tool such as daemontools or supervisord
  3. systemd

Determining "liveness" of a service via a network call is somewhat fragile since the network path from the monitoring app could have a partition, but other apps can reach it just fine. The more degenerate case is that the monitoring app can reach it, but the apps that depend on it can't. A more refined approach is to look at client app calls for signs of health (assuming a reasonable degree of traffic).

writer.close()
await writer.wait_closed()
except (OSError, asyncio.TimeoutError):
print(f"Node at {host}:{port} is not responding")
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

logging

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

while True:
for i in range(len(peer_list)):
await ping_node("127.0.0.1", peer_list[i][1])
await asyncio.sleep(5)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider making this an input parameter to the supervisor script that's passed through to here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

U mean the 127.0.0.1 as an input parameter?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The sleep interval.

devops/deploy/supervisor.py Show resolved Hide resolved
@nghi01
Copy link
Collaborator Author

nghi01 commented Jan 24, 2024

The changes make sense. I'm wondering about where to put this python file, should I keep it in devops/deploy? I'll make the spin-up-script call it to run continually when the spin up is done

@zimmermatt
Copy link
Owner

The changes make sense. I'm wondering about where to put this python file, should I keep it in devops/deploy? I'll make the spin-up-script call it to run continually when the spin up is done

devops/deploy seems a reasonable location

Copy link
Owner

@zimmermatt zimmermatt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll merge this and the one remaining comment can be addressed optionally/opportunistically.

@zimmermatt zimmermatt merged commit 2833d5f into zimmermatt:main Jan 27, 2024
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement a supervisor procedure to monitor the health of the peer list and restore peers that have failed
2 participants