A Distributed Task Driven Framework for Generic Task Execution on a Cluster
- Python - Standard Library
- Python bindings for 0MQ (pyzmq) - Interprocess communication between different components e.g. master, controller, etc.
- ClusterShell (clustershell) - For RangeSet notation specifying Lustre OST indexes
- LfsUtils (lfsutils) - Utility library for accessing the Lustre filesystem
- MySQL Connector Python (mysql) - For storing task results into a MySQL database
- HTTP library (requests) - For communication with the Prometheus Pushgateway
Cyclone | Python | pyzmq |
---|---|---|
2.0.5 | 3.9.12 | 22.3.0 |
2.0.2 | 3.8.5 | 20.0.0 |
2.0.0 | 3.6.0 | 19.0.0 |
The master component is the central unit of Cyclone, where controllers can register.
It is responsible for managing tasks and scheduling them to controllers.
The task generator is an interface which is attached to the master.
Before new tasks can be dispatched to controllers, a specific task generator must be implemented
that provides tasks to the master.
A controller communicates with the master to receive new tasks to be executed.
The controller itself delegates tasks to its attached workers.
It also manages how many workers are available for task execution.
A worker executes a task that it receives from its controller instance.
The MySQL database proxy acts as a client to a MySQL DBMS.
It buffers values, e.g. monitoring metrics, that it receives from task results and saves them in bulk to the MySQL database.
Otherwise each worker would have to send its task result to the MySQL instance each time.
For an example of how the MySQL database proxy is used, refer to the Lustre OST monitoring use case.
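The buffering idea described above can be sketched as follows. This is only an illustration, not Cyclone's database proxy: the class name `BufferedWriter`, the `flush_rows` callback, and the sample rows are hypothetical. With MySQL Connector Python, `flush_rows` could wrap a call such as `cursor.executemany(...)`.

```python
class BufferedWriter:
    """Collects rows and hands them to `flush_rows` in batches (hypothetical helper)."""

    def __init__(self, flush_rows, batch_size=100):
        self._flush_rows = flush_rows
        self._batch_size = batch_size
        self._rows = []

    def add(self, row):
        """Buffer one row and flush once the batch is full."""
        self._rows.append(row)
        if len(self._rows) >= self._batch_size:
            self.flush()

    def flush(self):
        """Write all buffered rows in one call, if there are any."""
        if self._rows:
            self._flush_rows(self._rows)
            self._rows = []


# Usage: collect the batches in a list instead of writing to a database.
batches = []
writer = BufferedWriter(batches.append, batch_size=3)
for index in range(7):
    writer.add(("ost-%04d" % index, index * 1.5))
writer.flush()  # write the remaining partial batch
```

Seven rows with a batch size of three result in three write calls instead of seven.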
The Prometheus Pushgateway client is used to send monitoring metrics to the Prometheus monitoring system.
See the monitoring use case for the Lustre file creation check for an example.
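As a rough sketch of what such a client does: the Pushgateway accepts metrics in the Prometheus text exposition format via HTTP on the path `/metrics/job/<job>`. The function names, the job name, and the metric names below are hypothetical and are not taken from Cyclone's client.

```python
from urllib.parse import quote


def build_push_request(gateway_url, job, metrics):
    """Return the URL and request body for pushing simple, label-free metrics."""
    url = "%s/metrics/job/%s" % (gateway_url.rstrip("/"), quote(job, safe=""))
    # One "name value" line per metric; the body must end with a newline.
    body = "".join("%s %s\n" % (name, value)
                   for name, value in sorted(metrics.items()))
    return url, body


def push_metrics(gateway_url, job, metrics, timeout=5):
    """Send the metrics to the Pushgateway and raise on an HTTP error."""
    # Imported here so that building the payload works without the dependency.
    import requests

    url, body = build_push_request(gateway_url, job, metrics)
    response = requests.post(url, data=body.encode("utf-8"), timeout=timeout)
    response.raise_for_status()


# Building the request does not need a running Pushgateway.
url, body = build_push_request(
    "http://localhost:9091/", "lustre_check", {"b_metric": 2, "a_metric": 1.5})
```

Calling `push_metrics` with the same arguments would send the request; it needs a reachable Pushgateway.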
Tasks are distributed on a first-come, first-served (FCFS) basis. A controller that processes its assigned tasks quickly will get more tasks assigned to its workers than a controller that takes more time.
New controller instances can be started at runtime, even after the master is already running. Thus, not all instances have to be started at the beginning, and the number of instances does not have to stay constant.
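The effect of first-come, first-served distribution can be illustrated with a small simulation. This is a sketch and not Cyclone's scheduler: `simulate_fcfs` and the controller names are hypothetical, and each controller is modeled as requesting a new task as soon as it has finished the previous one.

```python
import heapq


def simulate_fcfs(task_count, seconds_per_task):
    """Simulate pull-based FCFS dispatch.

    `seconds_per_task` maps a controller name to the time it needs for one
    task. Returns how many tasks each controller processed.
    """
    # Each heap entry is (time at which the controller is idle, name).
    idle_at = [(0.0, name) for name in sorted(seconds_per_task)]
    heapq.heapify(idle_at)
    assigned = {name: 0 for name in seconds_per_task}
    for _ in range(task_count):
        # The controller that asks first gets the next task.
        now, name = heapq.heappop(idle_at)
        assigned[name] += 1
        heapq.heappush(idle_at, (now + seconds_per_task[name], name))
    return assigned


counts = simulate_fcfs(30, {"fast": 1.0, "slow": 2.0})
```

In this model the controller that needs half the time per task ends up with roughly twice as many tasks.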
A specific task to be executed must implement the BaseTask interface.
Tasks are executed by so-called workers within a controller instance.
If a task execution runs into a timeout, the task is redispatched.
This feature is currently not available and needs to be updated (see issue).
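The idea behind timeout-based redispatching can be sketched as follows. Since the feature is marked as unavailable, this is only an illustration of the concept under assumed names (`PendingTasks`, `resend_timeout`, the task IDs), not Cyclone's implementation.

```python
import time


class PendingTasks:
    """Tracks dispatched tasks and reports those that exceeded a timeout."""

    def __init__(self, resend_timeout, clock=time.monotonic):
        self._resend_timeout = resend_timeout
        self._clock = clock
        self._sent_at = {}

    def dispatched(self, task_id):
        """Remember when the task was sent to a controller."""
        self._sent_at[task_id] = self._clock()

    def completed(self, task_id):
        """Forget a task once its result has arrived."""
        self._sent_at.pop(task_id, None)

    def overdue(self):
        """Return the IDs of tasks that should be dispatched again."""
        now = self._clock()
        return sorted(task_id for task_id, sent in self._sent_at.items()
                      if now - sent >= self._resend_timeout)


# Usage with a fake clock so that the example does not have to wait.
fake_now = [0.0]
pending = PendingTasks(resend_timeout=10, clock=lambda: fake_now[0])
pending.dispatched("task-1")
fake_now[0] = 4.0
pending.dispatched("task-2")
pending.completed("task-2")
fake_now[0] = 12.0
to_resend = pending.overdue()
```

Only `task-1` is reported here, because `task-2` completed before the timeout elapsed.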
Name | Type | Value | Description |
---|---|---|---|
pid_file | String | Path | Path to pid file for running just one master process |
controller_timeout | Number | n>=0 | Timeout in seconds waiting for an expected controller response |
controller_wait_duration | Number | n>=0 | Wait time in seconds for controller if no tasks are available |
task_resend_timeout | Number | n>=0 | Time duration before resending a task |
Name | Type | Value | Description |
---|---|---|---|
target | String | * | Network target from which to accept messages; '*' means all |
port | Number | 1024 - 65535 | TCP port for network communication with controllers |
poll_timeout | Number | n>=0 | Polling timeout in seconds waiting for controller messages |
Name | Type | Value | Description |
---|---|---|---|
filename | String | Path | Filepath for log file of master including specific task generator |
This section describes the parameters that specify which task generator to use.
Name | Type | Value | Description |
---|---|---|---|
module | String | * | Python module path to task generator |
class | String | * | Class name of task generator |
config_file | String | Path | Filepath to config file of the specific task generator |
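A `module` and `class` pair like the one in the table above is typically resolved in Python with `importlib`. The following sketch shows the general technique and is not necessarily how Cyclone loads its task generator; `load_class` is a hypothetical helper, and a standard library class stands in for a real task generator.

```python
import importlib


def load_class(module_path, class_name):
    """Import `module_path` and return the attribute named `class_name`."""
    module = importlib.import_module(module_path)
    return getattr(module, class_name)


# A standard library class is used here in place of a task generator.
ordered_dict_class = load_class("collections", "OrderedDict")
instance = ordered_dict_class(a=1)
```

A wrong module path raises `ModuleNotFoundError` and a wrong class name raises `AttributeError`, so configuration mistakes surface at startup.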
# Starts the master component with the attached task generator:
./cyclone-master.py -f Configuration/master.conf
It is recommended to start the master component first and then the attaching controllers.
The master can also be started after the controllers, but this might lead to timeouts if the controllers do not find the master in time.
Currently there is a PID control check for the executable name that avoids running multiple processes of the same program name on a host. To run multiple instances the executable names must be different.
The master can be stopped by sending a signal to its PID with kill <PID>.
The PID can be retrieved in multiple ways, e.g. from the log file or from the PID file:
# Get PID from log file
grep "Master PID" Runtime/master.log
# Get PID from PID file
cat Runtime/master.pid
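The same stop procedure can be scripted. The sketch below assumes that the PID file contains nothing but the process ID as text; `read_pid` and `stop_process` are hypothetical helpers. `signal.SIGTERM` is the signal that `kill <PID>` sends by default.

```python
import os
import signal


def read_pid(pid_file):
    """Read the process ID from a PID file (assumed to contain only the PID)."""
    with open(pid_file) as handle:
        return int(handle.read().strip())


def stop_process(pid_file):
    """Send SIGTERM to the process whose ID is stored in `pid_file`."""
    os.kill(read_pid(pid_file), signal.SIGTERM)
```

For example, `stop_process("Runtime/master.pid")` would stop a master started with the PID file path used above.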
Example controller config file
Name | Type | Value | Description |
---|---|---|---|
pid_file | String | Path | Path to pid file for running just one controller process |
request_retry_wait_duration | Number | n>=0 | Seconds to wait until trying next request to master |
max_num_request_retries | Number | n>=0 | Max number of request attempts before quitting |
Name | Type | Value | Description |
---|---|---|---|
target | String | IP-Addr | IP address of master process |
port | Number | 1024 - 65535 | TCP port for network communication with master |
poll_timeout | Number | n>0 | Polling timeout for new messages |
Name | Type | Value | Description |
---|---|---|---|
filename | String | Path | Filepath of log file for controller |
Name | Type | Value | Description |
---|---|---|---|
worker_count | Number | n>0 | Number of worker processes available for task processing |
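The value ranges from the controller tables above can be checked before a controller is started. The sketch below assumes that the settings have already been read into a flat dictionary of numbers; that layout and the function `validate_controller_settings` are hypothetical and say nothing about the actual config file format.

```python
def validate_controller_settings(settings):
    """Return a list of problems found in a flat dict of controller settings."""
    # Each rule is (setting name, check, allowed values as documented).
    rules = [
        ("request_retry_wait_duration", lambda n: n >= 0, "n>=0"),
        ("max_num_request_retries", lambda n: n >= 0, "n>=0"),
        ("port", lambda n: 1024 <= n <= 65535, "1024 - 65535"),
        ("poll_timeout", lambda n: n > 0, "n>0"),
        ("worker_count", lambda n: n > 0, "n>0"),
    ]
    problems = []
    for name, is_valid, allowed in rules:
        if name not in settings:
            problems.append("%s is missing" % name)
        elif not is_valid(settings[name]):
            problems.append("%s must satisfy %s" % (name, allowed))
    return problems


problems = validate_controller_settings({
    "request_retry_wait_duration": 5,
    "max_num_request_retries": 3,
    "port": 80,
    "poll_timeout": 10,
    "worker_count": 0,
})
```

Here `problems` reports the port and the worker count, since both lie outside the documented ranges.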
# Starts the controller:
./cyclone-controller.py -f Configuration/controller.conf
It is recommended to start the controller after the master.
Otherwise the controller might run into a timeout if the master is not reachable.
Currently there is a PID control check for the executable name that avoids running multiple processes of the same program name on a host. To run multiple instances the executable names must be different.
The controller shuts itself down when it receives a stop signal from the master or when the master is not reachable.
A controller can also be sent a stop signal locally on its host, using its PID.
The PID can be found e.g. in the controller's log or PID file.
In any case, if a controller is killed or crashes, Cyclone is left in an inconsistent state (see issue).
- Create a specific task class that inherits from BaseTask and implements the execute method.
- The constructor of the new task class must contain each property that should be serialized to the controller instances.
- An XML task file can be used to preinitialize the class properties.
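The steps above can be sketched as follows. The `BaseTask` class shown here is only a stand-in so that the example runs on its own; the real interface in Cyclone may differ, and `EchoTask` with its properties is hypothetical.

```python
class BaseTask:
    """Stand-in for Cyclone's BaseTask; the real interface may differ."""

    def execute(self):
        raise NotImplementedError


class EchoTask(BaseTask):
    """Hypothetical task that repeats a message."""

    def __init__(self, message, repeat):
        # Every property that should reach the controller is a
        # constructor argument.
        self.message = message
        self.repeat = repeat

    def execute(self):
        # Returning a value is an assumption made for this example.
        return " ".join([self.message] * self.repeat)


result = EchoTask("ping", 2).execute()
```

Keeping all state in constructor arguments means a task can be rebuilt on the controller side from its property values alone.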