Job Submission Process
When the job interface is invoked to start the job process, it is given the user’s request information from the POST /job route, as well as the current state of the waiting list. A request’s state always starts in the waiting state, and its final state is always the claimed state. The states between waiting and claimed are endlessly customizable by the job interface module. This system gives the job interface the ability to chain multiple tasks that depend on each other’s existence, such as the HMI needing to know the location of SDL Core in order to connect to it. It does this by repeatedly updating the job file submitted to Nomad until the full job is constructed for the final stage.
In Manticore’s example, waiting moves to the pending-1 state after the core job is submitted and operational. The store is updated with this state change, the waiting list is invoked again, and so the job interface module runs again. Finding the request in pending-1, it generates random addresses for SDL Core’s services and advances the state to pending-2. pending-2 is where the HMI job is submitted. pending-3 is where addresses are generated for the HMI services. Finally, once stage pending-4 is reached, the state is changed to claimed and there is no further work to do.
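That progression can be pictured as a simple mapping from the current state to the next one. The sketch below is illustrative only: the nextState function name is an assumption, and the real work done at each stage is summarized in comments rather than implemented.

```js
// Illustrative sketch of Manticore's stage progression; the function name is
// hypothetical, and each stage's work is summarized in a comment.
function nextState (currentState) {
    switch (currentState) {
        case 'waiting':
            // the core job has been submitted and is operational
            return 'pending-1';
        case 'pending-1':
            // random addresses were generated for SDL Core's services
            return 'pending-2';
        case 'pending-2':
            // the HMI job has been submitted
            return 'pending-3';
        case 'pending-3':
            // addresses were generated for the HMI services
            return 'pending-4';
        case 'pending-4':
            // nothing left to do
            return 'claimed';
        default:
            return currentState;
    }
}
```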
Manticore exposes a lot of useful functionality to make writing new job interface modules easy. The main complication left is creating the job file to submit to the Nomad agents. The utils.js file hides a lot of the complexity, so this section goes into more detail about what happens.
First, the job is submitted to Nomad. The API server then pings Nomad’s API for details on the job status, and the job’s task list is parsed to verify that the job was allocated successfully. The watch stays up until a defined time limit is reached or until the tasks are in a running state. Either way, the allocations are returned after some time and a function inspects the allocations object for the current status of the tasks. If all of the specified tasks are healthy, it is considered a pass. If it is not a pass, then the allocations and the evaluation of the job are passed to an error handling function to determine what action to take. An allocation represents the placement of a task, while an evaluation represents the status of the entire job submission, so it has higher-level information useful for debugging.
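As a rough sketch of that watch, assuming Nomad’s standard HTTP API on its default port and a Node runtime with a global fetch, the loop might look like the following. The retry limit, helper name, and return shape are made up for the example; Manticore’s utils.js wraps this differently.

```js
// Poll Nomad for a job's allocations until every task is running or a retry
// limit is reached (illustrative sketch, not Manticore's implementation).
const NOMAD_ADDR = 'http://localhost:4646';

async function watchAllocations (jobId, maxAttempts = 10, delayMs = 1000) {
    let allocations = [];
    for (let attempt = 0; attempt < maxAttempts; attempt++) {
        const res = await fetch(`${NOMAD_ADDR}/v1/job/${jobId}/allocations`);
        allocations = await res.json();
        const allRunning = allocations.length > 0 && allocations.every(alloc =>
            Object.values(alloc.TaskStates || {}).every(task => task.State === 'running'));
        if (allRunning) return { pass: true, allocations };
        await new Promise(resolve => setTimeout(resolve, delayMs));
    }
    // not a pass: the caller hands the allocations and the job's evaluation
    // to the error handling function
    return { pass: false, allocations };
}
```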
If the job has been allocated, the next step is to ensure that the tasks are healthy. Nomad’s API was used to check that the tasks were placed successfully, so Consul’s API is used for post-placement health checks. Next, the services are parsed from the job file for reference. The service checks operate similarly to the allocation checks: an endpoint is polled for changes until either all services are running or a certain amount of time passes, and all of the services are returned in the end. Those services are then checked for a passing state, and if they are not passing they are handed to an error handling function to determine what action to take.
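The service watch can be sketched the same way against Consul’s standard health endpoint; the port, timing values, and function name are again assumptions for illustration.

```js
// Poll Consul until every check for a service is passing, or time runs out
// (illustrative sketch, not Manticore's actual implementation).
const CONSUL_ADDR = 'http://localhost:8500';

async function watchServiceHealth (serviceName, maxAttempts = 10, delayMs = 1000) {
    let checks = [];
    for (let attempt = 0; attempt < maxAttempts; attempt++) {
        const res = await fetch(`${CONSUL_ADDR}/v1/health/checks/${serviceName}`);
        checks = await res.json();
        if (checks.length > 0 && checks.every(check => check.Status === 'passing')) {
            return { passing: true, checks };
        }
        await new Promise(resolve => setTimeout(resolve, delayMs));
    }
    return { passing: false, checks }; // the caller sends these to error handling
}
```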
If the services’ checks pass, the final step happens. The locations of those healthy services are queried through Consul and then stored in the waiting state so they can be saved in the KV store. This completes a single stage. Future stages may undergo a similar process, but with modified job files that build on the previous stages.
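The final lookup can be as simple as a query against Consul’s catalog endpoint. The function name and the returned property names here are hypothetical.

```js
// Look up where a healthy service ended up, so its address and port can be
// recorded with the request's state (illustrative sketch).
async function getServiceLocation (serviceName) {
    const res = await fetch(`http://localhost:8500/v1/catalog/service/${serviceName}`);
    const entries = await res.json();
    if (entries.length === 0) return null;              // service not registered
    const entry = entries[0];
    return {
        address: entry.ServiceAddress || entry.Address, // fall back to the node's address
        port: entry.ServicePort
    };
}
```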
There are three possible failure types when a submission step fails, and each failure type corresponds to a different action taken by Manticore to help ensure its continued operation.
The default action is a permanent failure, which removes the user’s request from the store so they will not receive their job. It is a harsh action to take when even one thing fails, but it is the only safe option for Manticore when facing an unexpected outcome.
Another failure type is a pending failure, which essentially pauses the job submission process until a new update arrives from an external cause. This is useful when there aren’t enough resources available for job placement: the waiting list will not update until something else invokes a change, such as a user being removed from the request list. If a failure is misdiagnosed as a pending failure, Manticore can enter a deadlocked state, unable to take any further action on the waiting list because a request is stuck forever in a non-waiting, non-claimed state.
Another type is a restart failure, which places the user’s request back at the front of the waiting list to start the job submission process over. There is no implemented use for it yet, but if Manticore knows that a failed job is a known anomaly that will succeed simply by running the job submission process again, this failure type can be used so that the user doesn’t have to get booted. If a failure is misdiagnosed as a restart failure, Manticore can enter an infinite loop of submitting a job for a single request, having it fail, then trying again for that same user.
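Conceptually, the error handling function boils down to classifying the failure and reacting to it. The FAILURE constants and the waitingList methods below are hypothetical and only illustrate the three behaviors described above.

```js
// Illustrative failure handling; the constants and the waitingList methods
// are hypothetical, not Manticore's real names.
const FAILURE = { PERMANENT: 'permanent', PENDING: 'pending', RESTART: 'restart' };

function handleFailure (type, request, waitingList) {
    switch (type) {
        case FAILURE.PENDING:
            // pause: leave the request where it is until an external update
            // (e.g. another user leaving) invokes the waiting list again
            return;
        case FAILURE.RESTART:
            // start over: put the request back at the front of the waiting list
            request.state = 'waiting';
            waitingList.moveToFront(request);
            return;
        case FAILURE.PERMANENT:
        default:
            // the safe default: boot the request so Manticore keeps operating
            waitingList.remove(request);
            return;
    }
}
```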
Additional failure types can be considered in the future, for example a special restart failure which keeps the user in the waiting list but puts them at the back of the list. This failure type could have a similar effect to the infinite loop problem if the user’s request is the only one in the waiting list.