Initial HealthCheck Code #245

AronVietti · 2020-05-18T03:16:12Z

Many services, like AWS, do basic TCP Health Checks by connecting to the servers socket. UDP is not supported, so to make this work for Jamulus I created a HealthCheckSocket that handles connections on the same port as the UDP Service. To make that more meaningful it is on the same thread as the UDP socket. This could be expanded with more service checks in the future, but at the very least it will fail the HealthCheck if something is wrong on the thread.

To get this working I had to make both socket NonBlocking and use a select call to make sure the thread doesn't use CPU when there are no incoming messages or data.

Adds a TCP Socket class called CHealthCHeckSocket. This allows services like AWS to connect and verify the service is still running. Added Health Check to CHighPriorityThread, only used when bClient is false. Created global value to control the max number of Health Check connections. Changed CSocket to be Non Blocking. Updated the Socket loop to handle both sockets, and use select to block until there is either a UDP datagram or a TCP Connection request. Created socket error handling. Added Socket helper functions to the Socket implementation. These keep the implementation code cleaner than having the windows specific ifdefs inline.

corrados · 2020-05-18T06:05:09Z

What is a AWS Health Check used for? I have not heard oft such a service.

AronVietti · 2020-05-18T06:38:03Z

https://docs.aws.amazon.com/elasticloadbalancing/latest/network/target-group-health-checks.html

Basically it let's the AWS infrastructure check that your services are healthy, and if they're not then create new ones and route to them. This is common practice for automating service availability.

AronVietti · 2020-05-18T06:43:52Z

Google Cloud also supports the same thing. https://cloud.google.com/load-balancing/docs/health-checks

corrados · 2020-05-18T06:57:06Z

What is the relation to Jamulus? What benefit has the Jamulus Server by this?

AronVietti · 2020-05-18T07:05:30Z

So that you can run Jamulus Server on cloud infrastructure and get the benefits of high availability and integrated monitoring.

corrados · 2020-05-18T07:12:09Z

What information about the Jamulus Server do you want to query with this interface?

AronVietti · 2020-05-18T07:30:55Z

The Load Balancer needs to be able to make a TCP connection to the Jamulus Server as a way to monitor general health of the software, instance, and network. It does not query for anything, there is no request or response, it just establishes the connection and then closes it and tries again at a configured interval. It cannot do that using the existing UDP socket because UDP is connectionless. Thus adding a TCP socket to Jamulus and putting it in the same loop as the UDP socket. If something goes wrong on the thread and the socket cannot handle connections the Load Balancer will fail to connect and take the appropriate action.

corrados · 2020-05-18T07:38:09Z

Do you run Jamulus in AWS in a normal Linux shell? Could you start other programs or scripts in parallel to the Jamulus Server?

AronVietti · 2020-05-18T17:05:46Z

Yes, it is running under a normal Linux shell, so there is nothing preventing another program/script. This is often referred to as a "sidecar" and has some downsides. The largest being that you can't know that the health of the sidecar actually reflects the health of the service.

corrados · 2020-05-18T17:07:15Z

The largest being that you can't know that the health of the sidecar actually reflects the health of the service.

What do you mean by that? What is the "service" in that case?

AronVietti · 2020-05-18T17:08:53Z

Jamulus is the service.

AronVietti · 2020-05-18T17:10:34Z

You seem reluctant about this, is this also the reason that the Html status is written to a file instead of responding to an Http request? I totally understand being hesitant about opening extra ports. That was part of the reason for making it not accept requests, only TCP connections.

corrados · 2020-05-18T17:11:08Z

Ok, so is your health check then to check if Jamulus has crashed?

AronVietti · 2020-05-18T17:13:46Z

Crashed, or if the control thread for the service is in a bad state. If a deadlock or some other bad state happened on that thread the program may not exit, but it will no longer be able to accept connections and the health check will fail.

AronVietti · 2020-05-18T17:14:37Z

This is where future improvements could be made, where there is more monitoring code into the state of the service and if a bad state is entered the Health Check socket would get closed.

corrados · 2020-05-18T17:16:41Z

From my experience, the Jamulus server runs very stable. E.g. the Central Server which is occupied by a lot of people, ran for months without a restart. So I do not expect it to crash or having a bad state. If it is just about making the AWS health checking server happy, a simple python script could do the job. Do you agree?

AronVietti · 2020-05-18T17:25:19Z

A script could do the job, but with reduced reliability. It's good that it's stable, but again the problem with a script or separate program is that they may fail independent of Jamulus. Checking that the process is still running is a poor solution. So then we move to polling Jamulus using one of the protocol calls, like the Ping. Now the script has to be synchronized to the rate of the health check, so it can be sure it's reporting the correct health. This leads to the realization that to be accurate you need to poll at a higher rate than the Health Check requests. And now we're adding unnecessary load to a real time service, by making constant requests to Jamulus. Though I admit this only becomes a problem when you have many connections and are reaching the limits of the hardware it is being run on.

corrados · 2020-05-18T17:46:20Z

Checking that the process is still running is a poor solution.

First, my assumption is that Jamulus does not crash :-).

Checking that the process is still running is a poor solution.

Why is that supposed to be a poor solution?

And now we're adding unnecessary load to a real time service, by making constant requests to Jamulus.

Well, if you really do not trust the stability of the Jamulus server and you think checking the process is poor, sending ping times does not influence the stability of the Jamulus audio streams. Each client constantly sends ping messages. The Central Server even receives a lot of ping messages from all around the world if anyone opens the server list. I do not see an issue with that.

AronVietti · 2020-05-18T18:44:37Z

So, just checking for the process is considered a poor solution because of things like deadlocks, high cpu and memory pressure that can all cause programs to be unresponsive without crashing. Just checking those things is also considered bad, because there is no way to know if 100% cpu means a service is designed to run that way, or if for example it's in an infinite loop.

Sending pings from another process, running on the same or even a different server, brings back the problem of not knowing if there is a problem with that process or with Jamulus.

Under heavy load, or running in a constrained environment, the processing of ping times would influence the stability and performance of the service, because the system recourses are finite. But I agree that this is the least important, and is an extreme edge case.

If this is a feature you don't see a use for, I'll just maintain it on my fork. Is there a developer forum for contributing, so i can ask in the future or get assigned bug fixes/feature request? I'd like to help out now that I've familiarized myself with the code base.

corrados · 2020-05-18T18:57:11Z

If this is a feature you don't see a use for, I'll just maintain it on my fork.

Actually, yes. If that is ok for you it would be good to maintain it on your fork for now.

Is there a developer forum for contributing, so i can ask in the future or get assigned bug fixes/feature request?

Sure. The bug and feature requests are here: https://github.com/corrados/jamulus/issues
Discussion forums are here: https://sourceforge.net/p/llcon/discussion

If you plan to implement something for Jamulus, it is always advantageous to inform me prior to starting coding and first discuss the specifications of the new feature. Preferable by creating a feature request Issue.

Here is a list of possible Issues you could work on if you like:
#169
#129
#116
#83
#73
#49
#40
#23

AronVietti · 2020-05-18T19:06:16Z

I had looked on the forums, but didn't see a developer forum, do you just use General? To work on an issue should I just volunteer in the issue conversation and wait to be assigned?

Thanks for the list, I'll go through these and choose one. Going to close this PR now.

corrados · 2020-05-18T19:32:54Z

I had looked on the forums, but didn't see a developer forum, do you just use General?

Good idea. I just created one: https://sourceforge.net/p/llcon/discussion/developerforum

To work on an issue should I just volunteer in the issue conversation and wait to be assigned?

Just write in the Issue that you are interested in taking care of it. Then we can discuss the details in that Issue.

AronVietti closed this May 18, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Initial HealthCheck Code #245

Initial HealthCheck Code #245

AronVietti commented May 18, 2020

corrados commented May 18, 2020

AronVietti commented May 18, 2020

AronVietti commented May 18, 2020

corrados commented May 18, 2020

AronVietti commented May 18, 2020

corrados commented May 18, 2020

AronVietti commented May 18, 2020

corrados commented May 18, 2020

AronVietti commented May 18, 2020

corrados commented May 18, 2020

AronVietti commented May 18, 2020

AronVietti commented May 18, 2020

corrados commented May 18, 2020

AronVietti commented May 18, 2020

AronVietti commented May 18, 2020

corrados commented May 18, 2020

AronVietti commented May 18, 2020

corrados commented May 18, 2020

AronVietti commented May 18, 2020

corrados commented May 18, 2020

AronVietti commented May 18, 2020

corrados commented May 18, 2020

Initial HealthCheck Code #245

Initial HealthCheck Code #245

Conversation

AronVietti commented May 18, 2020

corrados commented May 18, 2020

AronVietti commented May 18, 2020

AronVietti commented May 18, 2020

corrados commented May 18, 2020

AronVietti commented May 18, 2020

corrados commented May 18, 2020

AronVietti commented May 18, 2020

corrados commented May 18, 2020

AronVietti commented May 18, 2020

corrados commented May 18, 2020

AronVietti commented May 18, 2020

AronVietti commented May 18, 2020

corrados commented May 18, 2020

AronVietti commented May 18, 2020

AronVietti commented May 18, 2020

corrados commented May 18, 2020

AronVietti commented May 18, 2020

corrados commented May 18, 2020

AronVietti commented May 18, 2020

corrados commented May 18, 2020

AronVietti commented May 18, 2020

corrados commented May 18, 2020