
[DON'T MERGE] Bare bones support for dynamic VM management #19

Open
wants to merge 1 commit into base: master

Conversation

davefinster commented Jul 27, 2019

Figured I would post up this code to get some discussion going around whether there are any better ways to do this.

The short version is that these changes bypass the server learning behavior that exists in UnicornLB in favor of spinning up/starting (initially) pre-baked VMs in (initially) GCP. Initially I tried intercepting the server selection logic, but the lead time in cold starting a VM caused more headaches than anything else.

So these changes work by observing the Plex tokens that pass through UnicornLB. Every token that is spotted is put in a queue that is consumed every 5 seconds. The queue worker then resolves the token to a Plex username (by calling out to the Plex server) and, if that username appears on a whitelist, attempts to match it to a VM.

If no running VM is available but there are stopped VMs, it will start one and wait until an IP address has been assigned (which returns immediately if a reserved IP is used, though that costs extra). With that IP, it updates the "magic DNS" hostname via Google Cloud DNS, since the address may have changed. If a machine is already running, it is used directly.

On each loop pass the manager also looks for tokens that haven't been seen within a configurable timeout period (default 15 minutes). When a token expires, its machine is released back into the pool and, if no other sessions are using that VM, it is stopped.
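Roughly, the loop looks like this (a simplified TypeScript sketch; names like `Pool`, `observeToken` and `resolveUser` are placeholders that mirror the description above, not the actual code in this commit):

```ts
// Simplified sketch of the manager loop described above. Names (Pool, VM,
// observeToken, resolveUser, ...) are placeholders, not the code in this commit.

type Token = string;

interface VM { name: string; running: boolean; sessions: Set<Token>; }

interface Pool {
  vms: VM[];
  start(vm: VM): Promise<string>;                 // resolves once an IP is assigned
  stop(vm: VM): Promise<void>;
  updateDns(hostname: string, ip: string): Promise<void>;
}

const WHITELIST = new Set(['alice', 'bob']);      // assumption: configured usernames
const TOKEN_TIMEOUT_MS = 15 * 60 * 1000;          // default 15 minutes
const queue: Token[] = [];
const lastSeen = new Map<Token, number>();

// Called by the balancer whenever a Plex token is spotted on a request.
export function observeToken(token: Token): void {
  lastSeen.set(token, Date.now());
  queue.push(token);
}

// One pass of the loop; `resolveUser` stands in for the call to the Plex server.
export async function tick(
  pool: Pool,
  hostname: string,
  resolveUser: (token: Token) => Promise<string>,
): Promise<void> {
  // Drain the queue: map whitelisted tokens onto a VM, starting one if needed.
  for (const token of queue.splice(0)) {
    if (!WHITELIST.has(await resolveUser(token))) continue;
    let vm = pool.vms.find((v) => v.running);
    if (!vm) {
      vm = pool.vms.find((v) => !v.running);
      if (!vm) continue;                          // no stopped VM available either
      const ip = await pool.start(vm);            // wait for the IP
      await pool.updateDns(hostname, ip);         // refresh the "magic DNS" record
      vm.running = true;
    }
    vm.sessions.add(token);
  }

  // Expire tokens not seen within the timeout; stop VMs with no sessions left.
  const now = Date.now();
  for (const [token, seen] of lastSeen) {
    if (now - seen < TOKEN_TIMEOUT_MS) continue;
    lastSeen.delete(token);
    for (const vm of pool.vms) {
      if (vm.sessions.delete(token) && vm.running && vm.sessions.size === 0) {
        await pool.stop(vm);                      // released back into the pool
        vm.running = false;
      }
    }
  }
}

// setInterval(() => tick(pool, 'transcode.example.com', resolveUser), 5000);
```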

What works:

  • Basically the whole flow listed above. At the moment there is only one DNS hostname and really one VM in use.

More work to do:

  • Generate arbitrary DNS hostnames and leverage a wildcard certificate to make HTTPS work properly. This helps avoid DNS caching problems and makes TTL less of an issue.
  • Better exercising of the timeout/VM stopping logic.
  • Add some location awareness to the backend manager so that it can look for VMs in particular locations to service particular tokens and thus users.
  • Better unify the server pool options to avoid the if statements and generally make it more pluggable.

Open to suggestions/comments/concerns!

Maxou44 (Member) commented Jul 27, 2019

So each user will have a VM? Why not share the same servers between different users? The cleanest way could be to support different provider classes and trigger events to upscale/downscale the server pool.
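Something along these lines, maybe (just a rough illustration; the names are made up, not taken from this PR):

```ts
// Hypothetical shape of such a provider class (illustrative only): each cloud
// backend implements the same interface, so the balancer can emit
// upscale/downscale events without caring whether it talks to GCP, Kubernetes, etc.

export interface TranscoderInstance {
  id: string;
  address: string;                     // host:port the balancer redirects to
}

export interface PoolProvider {
  name: string;                        // e.g. 'gcp', 'kubernetes'
  list(): Promise<TranscoderInstance[]>;
  upscale(count: number): Promise<TranscoderInstance[]>;
  downscale(instance: TranscoderInstance): Promise<void>;  // only once drained
}
```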

cruex-de commented

Would Kubernetes make more sense? It has a very good API that could realize the whole thing as pods spread over several nodes, and the DNS hostnames could be handled via the ingress load balancer.

Just thinking aloud

Maxou44 (Member) commented Jul 27, 2019

The biggest challenge with upscale/downscale is the state of each transcoder: you can't remove a server while streams are running on it.

I don't know if Kubernetes can scale a server pool easily? Maybe the first step is to create a Dockerfile?

cruex-de commented Jul 27, 2019

I don't think upscaling/downscaling should be a problem. Kubernetes is specifically designed to make scaling up and down easy, especially in production.

The CPU, memory and disk specifications are set in the deployment.yaml file, for example:

resources:
  limits:
    cpu: "2"
    memory: "2Gi"

And the configuration can actually be patched via API.

Or you can automatically create another Pod and delete the old one when it's not in use?
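For reference, scaling that way doesn't need much: the Deployment can be patched directly through the Kubernetes REST API. A rough sketch, assuming an in-cluster caller with the standard service account token mount (nothing here is from this PR):

```ts
// Illustrative only (not part of this PR): scale the transcoder Deployment by
// patching it through the Kubernetes REST API. Assumes an in-cluster caller
// with the standard service account token mount; TLS against the cluster CA
// is left to the environment (e.g. NODE_EXTRA_CA_CERTS).

import { readFileSync } from 'node:fs';

const API = 'https://kubernetes.default.svc';
const TOKEN = readFileSync(
  '/var/run/secrets/kubernetes.io/serviceaccount/token',
  'utf8',
);

export async function setTranscoderReplicas(
  namespace: string,
  deployment: string,
  replicas: number,
): Promise<void> {
  const res = await fetch(
    `${API}/apis/apps/v1/namespaces/${namespace}/deployments/${deployment}`,
    {
      method: 'PATCH',
      headers: {
        Authorization: `Bearer ${TOKEN}`,
        'Content-Type': 'application/strategic-merge-patch+json',
      },
      body: JSON.stringify({ spec: { replicas } }),
    },
  );
  if (!res.ok) throw new Error(`scale failed: ${res.status}`);
}

// e.g. await setTranscoderReplicas('plex', 'unicorn-transcoder', 3);
```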

Maxou44 (Member) commented Jul 27, 2019

But how can you determine whether a pod is in use or not?

What happens if the load balancer redirects to a dead server?

cruex-de commented Jul 27, 2019

The transcoders are listed in the load balancer. Maybe you could query the load balancer to ask whether a pod/transcoder is in use, matching it by hostname or a pod token? Depending on the answer, the load balancer would then send a delete/patch/create request to the Kubernetes API.

And with Prometheus, Kubernetes can check the services to see whether the transcoder's port or the service is still alive.

Maxou44 (Member) commented Jul 27, 2019

If we bundle hooks to add/remove a transcoder, do you think it would be easy to support Kubernetes?

cruex-de commented

It'd be a start we could play with to test it out.

davefinster (Author) commented

Users don't necessarily need their own VM - I think there is still room in there for the scoring system. It is just that (in the future) there would be another option, which is to spin up another VM somewhere if load gets too painful.

Kubernetes works from the perspective of taking a pool of resources and scheduling work on them. Many managed Kube platforms support scaling clusters up and down on demand so it might be worth exploring.

My original goal here was to architect a system whereby resources are kept cold, which for most cloud providers means the user isn't paying for them, until they are needed. Keep the index server running on a rather underpowered VM and only start the transcoder VMs as needed. Leverage the per-[second|minute|hour] billing to its full potential.

There is still more work to do. I think the first step to cleaning this up and making it something mergeable (one day) is to make the server listings more pluggable.

The comment about needing to ensure a machine isn't being used for a stream before it's killed is important. This would probably involve the machine manager querying the transcoder to determine what work it's doing.

Another thought I had was to completely extract the machine management functionality from the load balancer. It would then be this process's responsibility to inspect sessions (on the LB and transcoders) looking for resources to release. The load balancer would need to kick events over to this service, which at the moment would consist of:

  • User showing up - use the existing "I've seen this token" method. Gives an opportunity for the manager to warm up a server
  • I need a server - details please

One missing piece, I suspect, is the ability to stall the user without risking failures. While warming up a server helps prevent timeouts, is there a safe way to keep redirecting a user while we get a VM ready?
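In code, the two events might look something like this (a hypothetical interface, nothing implemented in this PR yet):

```ts
// Hypothetical interface for the two events described above (not implemented here).
export interface MachineManager {
  // "User showing up": the balancer has seen this token; gives the manager a
  // chance to warm a server before any transcode request actually arrives.
  tokenSeen(token: string): Promise<void>;

  // "I need a server - details please": the balancer is about to redirect a
  // transcode request and needs the address of a ready transcoder.
  acquireServer(token: string): Promise<{ address: string }>;
}
```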

Maxou44 (Member) commented Jul 31, 2019

For me it's not possible to have a zero-transcoder pool and scale it only when a transcode request arrives: creating a new VM takes at least 30 seconds, maybe a few minutes, and the stream can't start before the transcoder does, which is really long.
Some Plex apps also time out when a stream doesn't respond within a few seconds (if our 302 redirect takes too long, Plex will abort it).

An N+1 solution is better suited in this case: we always need to have a free transcoder slot available in case a user wants to start a stream.
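As a rough illustration, the N+1 rule could be as simple as this (names are illustrative, not from the balancer):

```ts
// Rough illustration of the N+1 rule: enough servers for the current streams,
// plus one spare so a new stream never has to wait for a VM to boot.
export function desiredPoolSize(activeStreams: number, slotsPerServer: number): number {
  return Math.ceil(activeStreams / slotsPerServer) + 1;
}
```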

davefinster (Author) commented

I think this would be a configuration detail more than anything else. But you're right about the creation time, hence my addition of the Plex token watcher. When a request with a token mapping back to a whitelisted user is seen by the balancer, it boots a transcoder. With the VM already created but stopped, it typically takes about 30 seconds before it is ready to serve.

But the boot-up starts when the user first loads the Plex index (based on seeing the token for the first time), so the wait should be brief.

Maxou44 (Member) commented Aug 1, 2019

The load balancer has access to the Plex Media Server, so it can know whether a user token is valid or not; a whitelist isn't necessary. If a user has access to your Plex, they can stream.
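For example, something like this could replace the whitelist check (a sketch; it assumes that an authenticated PMS endpoint such as `/library/sections` returns an error status for tokens without access):

```ts
// Sketch of a whitelist-free check: ask the Plex Media Server itself whether
// the token has access. Assumption: an authenticated PMS endpoint (here
// /library/sections) returns a non-2xx status for tokens without access.
export async function tokenHasAccess(pmsUrl: string, token: string): Promise<boolean> {
  const res = await fetch(`${pmsUrl}/library/sections`, {
    headers: { 'X-Plex-Token': token, Accept: 'application/json' },
  });
  return res.ok;   // 200 => the token belongs to a user with access to this server
}

// e.g. await tokenHasAccess('http://127.0.0.1:32400', someToken);
```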
