Standalone PoSter node

In earlier versions, although damocles-manager supported enabling the corresponding modules through the --poster and --miner parameters of the daemon run command, the PoSt proving process was still strongly coupled to sector location information, which made it limited and hard to scale.

From v0.2.0 onwards, we provide a set of features that make easy-to-use, scalable standalone PoSter nodes an option for SPs with large-scale operations and multiple miner IDs.

Below, we introduce these new features and walk through a deployment of standalone PoSter nodes that uses them. Subsequent documents use a node with --poster enabled as the example; a standalone --miner node operates in a similar manner and will not be described separately.


From v0.8.0 onwards, damocles supports three ways to run PoSter nodes independently: worker-prover mode, proxy node mode, and ext-prover mode (external executor mode).

worker-prover mode

The worker-prover mode is a new feature in v0.8.0. It is simple to set up and easily supports multi-machine window PoSt with coordination and redundancy.

Fundamentals

In worker-prover mode, damocles-worker computes the window PoSt proofs: it obtains window PoSt tasks from damocles-manager over RPC and returns the computation results.

damocles-worker adds a wdpost planner for executing window PoSt tasks.

Architecture

                +-----------------------------------+
                |     damocles-manager daemon       |
                |     with --worker-prover flag     |
                |                                   |
                |        +-----------------+        |
                |        |damocles-manager |        |
                |        |  poster module  |        |
                |        +-------+-^-------+        |
                |           send | |recv            |
                |                | |                |
                |        +-------v-+-------+        |
                |        |  worker-prover  |        |
       +--------+-------->      module     <--------+--------+
       |        |        +--------^--------+        |        |
       |        |                 |                 |        |
       |        +-----------------+-----------------+        |
       |                          |                          |
-------+--------------------------+--------------------------+------------
       |                          |                          |
       | pull job                 | pull job                 | pull job
       | push res                 | push res                 | push res
       | by rpc                   | by rpc                   | by rpc
       |                          |                          |
+------+--------+         +-------+-------+           +------+--------+
|damocles-worker|         |damocles-worker|           |damocles-worker|
|wdpost planner |         |wdpost planner |  ...      |wdpost planner |
+---------------+         +---------------+           +---------------+

damocles-manager configuration and startup

New configuration:

# ~/.damocles-manager/sector-manager.cfg

#...

[Common.Proving.WorkerProver]
# The maximum number of attempts of the WindowPoSt task, optional, number type
# Default value is 2
# A WindowPoSt task whose number of attempts exceeds JobMaxTry can only be re-executed by resetting it manually
JobMaxTry = 2
# WindowPoSt task heartbeat timeout, optional, time string type
# Default value is 15s
# Tasks that have not sent a heartbeat for more than this amount of time will be set as failed and retried
HeartbeatTimeout = "15s"
# WindowPoSt task timeout, optional, time string type
# Default value is 25h
# WindowPoSt tasks whose creation time exceeds this time will be deleted
JobLifetime = "25h0m0s"

#...

Start the damocles-manager process:

# The --miner flag is optional; it starts the miner module, which executes WinningPoSt and produces blocks
# The --poster flag is required; it starts the WindowPoSt module
# The --worker-prover flag is required; it makes the WorkerProver module execute WindowPoSt
./damocles-manager daemon run --miner --poster --worker-prover

damocles-worker configuration

Configuration walkthrough:

[[sealing_thread]]
# Configure to use wdpost plan
plan = "wdpost"
# Limit task execution to the specified miner IDs; if left empty, there is no limit
# sealing.allowed_miners = [6666, 7777]
# Only allow sectors of the specified sizes to run
# allowed_sizes = ["32GiB", "64GiB"]

[[attached]]
# Configure the permanent storage that this worker will use during the execution of the window post task
name = "miner-6666-store"
location = "/mnt/miner-6666-store"


# Control window_post task concurrency (optional), no limit if not configured
[processors.limitation.concurrent]
window_post = 2

[[processors.window_post]]
# Use a custom wdpost proving binary (optional); if bin is not configured, the built-in prover is used by default
bin="~/my_algorithm"
args = ["window_post"]
# Limit the cpuset of this subprocess
# cgroup.cpuset = "10-19"
# Configure environment variables for the custom proving binary (optional)
envs = { BELLMAN_GPU_INDEXS="0", CUDA_VISIBLE_DEVICES="0", TMPDIR = "/tmp/worker-prover1/", ... }
# Configure the maximum concurrency of this processor (optional); no limit if not configured
concurrent = 1

A simple example configuration that starts just one wdpost sealing_thread is as follows:
# /path/to/your-damocles-worker-config.toml

[worker]
name = "damocles-worker-USA-01"

[sector_manager]
rpc_client.addr = "/ip4/your-damocles-manager-address-here/tcp/1789"

[[sealing_thread]]
plan = "wdpost"
# The interval between attempts to claim tasks; the default is 60s.
# For the wdpost plan, this value can be reduced to pick up new wdpost tasks faster
sealing.recover_interval = "15s"
# sealing.allowed_miners = [6666]
# sealing.allowed_sizes = ["32GiB"]
#...

[[attached]]
name = "miner-6666-store"
location = "/mnt/miner-6666-store"

Example configuration for two wdpost jobs sharing one GPU:
# /path/to/your-damocles-worker-config.toml

[worker]
name = "damocles-worker-USA-01"

[sector_manager]
rpc_client.addr = "/ip4/your-damocles-manager-address-here/tcp/1789"

[[sealing_thread]]
plan = "wdpost"
sealing.recover_interval = "15s"
sealing.allowed_miners = [6666]
# ...

[[sealing_thread]]
plan = "wdpost"
sealing.recover_interval = "15s"
sealing.allowed_miners = [7777]
# ...

[[attached]]
name = "miner-6666-store"
location = "/mnt/miner-6666-store"

[[attached]]
name = "miner-7777-store"
location = "/mnt/miner-7777-store"

# -------------------------

[[processors.window_post]]
bin="~/my_algorithm"
# args = ["window_post", ...]
cgroup.cpuset = "10-19"
envs = { CUDA_VISIBLE_DEVICES="0", TMPDIR = "/tmp/worker-prover1/", ... }
concurrent = 1

[[processors.window_post]]
bin="~/my_algorithm"
# args = ["window_post", ...]
cgroup.cpuset = "20-29"
envs = { CUDA_VISIBLE_DEVICES="0", TMPDIR = "/tmp/worker-prover1/", ... }
concurrent = 1

Running two wdpost jobs on one GPU can greatly improve utilization. Each wdpost job consists of vanilla_proofs (CPU computation) and snark_proof (GPU computation); GPU utilization improves because the vanilla_proofs can be computed in parallel while the snark_proof computations are serialized by a GPU lock file in TMPDIR.

An example configuration for a wdpost machine equipped with two graphics cards:
# /path/to/your-damocles-worker-config.toml

[worker]
name = "damocles-worker-USA-01"

[sector_manager]
rpc_client.addr = "/ip4/your-damocles-manager-address-here/tcp/1789"

[[sealing_thread]]
plan = "wdpost"
sealing.recover_interval = "15s"
sealing.allowed_miners = [6666]
#...

[[sealing_thread]]
plan = "wdpost"
sealing.recover_interval = "15s"
sealing.allowed_miners = [7777]
#...

[[attached]]
name = "miner-6666-store"
location = "/mnt/miner-6666-store"

[[attached]]
name = "miner-7777-store"
location = "/mnt/miner-7777-store"

# -------------------------

[[processors.window_post]]
# bin="~/my_algorithm"
# args = ["window_post", ...]
# cgroup.cpuset = "10-19"
envs = { ... }
concurrent = 2

# ----------- or ---------

#[[processors.window_post]]
# bin="~/my_algorithm"
# args = ["window_post", ...]
# cgroup.cpuset = "10-19"
# envs = { CUDA_VISIBLE_DEVICES="0", TMPDIR = "/tmp/worker-prover1/", ... }
# concurrent = 1

# [[processors.window_post]]
# bin="~/my_algorithm"
# args = ["window_post"]
# cgroup.cpuset = "20-29"
# envs = { CUDA_VISIBLE_DEVICES="1", TMPDIR = "/tmp/worker-prover2/", ... }
# concurrent = 1

When damocles-worker is running the wdpost plan, it is not necessary to use the damocles-worker store sealing-init -l command to initialize the local data storage directories.

Manage window post tasks

  • Show window post task list

# By default, unfinished and failed tasks are displayed. The DDL field is the deadline index of the task, and the Try field is the number of attempts
./damocles-manager util worker wdpost list

JobID MinerID DDL Partitions Worker State Try CreateAt Elapsed Heartbeat Error
3FgfEnvrub1 1037 3 1,2 10.122.63.30 ReadyToRun 1 07-27 16:37:31 - -
gbCVH4TUgEf 1037 2 1,2 ReadyToRun 0 07-27 16:35:56 - -
CrotWCLaXLa 1037 1 1,2 10.122.63.30 Succeed 1 07-27 17:19:04 6m38s(done) -

# show all tasks
./damocles-manager util worker wdpost list --all
#...

# show window post task details
./damocles-manager util worker wdpost list --detail
#...
  • Reset a task

When a window PoSt task fails and the number of automatic retries reaches the limit, its status can be manually reset so that it can be picked up and executed by damocles-worker again.

./damocles-manager util worker wdpost reset gbCVH4TUgEf 3FgfEnvrub1
  • Delete a task

Deleting a task is similar to resetting one. When the delete command is executed, the retry mechanism of damocles-manager checks whether a window PoSt task for the current deadline exists in the database; if it does not, the task is re-issued and recorded in the database.

In addition, the worker-prover automatically deletes tasks that were created more than a certain amount of time ago (25 hours by default; configurable via JobLifetime).

# Delete the specific task
./damocles-manager util worker wdpost remove gbCVH4TUgEf 3FgfEnvrub1

# delete all tasks
./damocles-manager util worker wdpost remove-all --really-do-it
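
The 25-hour window mentioned above corresponds to the JobLifetime field shown earlier; for example, a sketch that extends it to 48 hours, reusing only that field:

# ~/.damocles-manager/sector-manager.cfg

[Common.Proving.WorkerProver]
JobLifetime = "48h0m0s"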

Disable GPU usage in damocles-manager

With the worker-prover feature enabled, winning_post is still executed by damocles-manager. If you do not want winning_post to use the GPU, you can disable it by compiling damocles-manager with the following commands:

make dist-clean
FFI_BUILD_FROM_SOURCE=1 FFI_USE_GPU=0 make build-manager

Proxy node mode

For PoSter nodes, the most important capability is obtaining accurate sector location information in real time. The current damocles-manager version only provides metadata management based on a local embedded KV database (more options will be supported).

This only allows data to be managed by one process, and direct data sharing across processes is not possible.

Therefore, we designed the proxy node mode to provide some metadata to other consumer nodes through network interfaces, thus realizing data sharing.

How to use the proxy node

We have added the --proxy parameter to the daemon run command. Its format is {ip}:{port}. When the startup command contains a valid --proxy parameter, the node at {ip}:{port} will be used as the data source for the current damocles-manager node, and the necessary (read-only) metadata management modules will be constructed.

In addition to --proxy, we also provide switches that control whether proxy mode is enabled for specific data management modules.

For now, only the --proxy-sector-indexer-off switch is provided. When --proxy-sector-indexer-off is enabled, the node uses the SectorIndexer database in its own data directory.
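
For illustration only, a sketch of passing this switch when starting a proxy node (the other flags match the startup example further below); with the switch enabled, the proxy keeps a local SectorIndexer instead of reading it from the source node:

damocles-manager --home ~/.damocles-manager2 daemon run --proxy="127.0.0.1:1789" --listen=":2789" --proxy-sector-indexer-off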

For example, starting with the damocles-manager daemon run --miner command launches a damocles-manager instance that listens on port 1789, uses ~/.damocles-manager as the data directory, and has the mining module enabled.
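
For reference, a minimal sketch of the command that starts such a source node with the default data directory and listen address:

damocles-manager daemon run --miner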

We can then use the following commands to initialize and start, on the same machine, a proxy node that uses the above instance as its data source. This proxy node will use ~/.damocles-manager2 as its data directory and listen on port 2789.

damocles-manager --home ~/.damocles-manager2 daemon init
# maintain the configuration files
damocles-manager --home ~/.damocles-manager2 daemon run --proxy="127.0.0.1:1789" --listen=":2789" --poster

The proxy node can provide the exact same and real-time sector location information as the source node.

The proxy node uses the existing configuration files

Following the method described in the previous section, we can already start a proxy node, but there is still a problem with this startup method: the configuration files of the proxy node need to be written again or copied from the data directory of the source node. This introduces extra maintenance work, especially when the configuration files may change frequently.

For this, we also provide a --conf-dir parameter, which takes a directory path. When the startup command includes a valid --conf-dir parameter, the node uses the configuration files that already exist in the specified directory as its own.

This saves the work of writing and maintaining separate configuration files when source and proxy nodes run on the same machine and serve the same cluster.

Based on this function, the proxy node startup command from the previous section becomes:

damocles-manager --home ~/.damocles-manager2 daemon run --proxy="127.0.0.1:1789" --listen=":2789" --conf-dir="~/.damocles-manager" --poster

At this point, the source node and the proxy node will use the same set of configuration files.

ext-prover mode (external executor)

In addition to sharing sector information, another challenge faced by standalone PoSter nodes is the utilization of hardware resources.

Limited by the underlying algorithm libraries, a computing node can only utilize GPUs at the granularity of a process. This makes it difficult for PoSter nodes to effectively use the computing power of multiple GPUs, and also difficult to safely avoid proof timeouts when multiple SPs have conflicting WindowPoSt proving windows.

For this, we provide an ext-prover mechanism similar to the ext processor in damocles-worker.

The ext-prover mechanism consists of two components:

  1. The --ext-prover parameter of the daemon run command
  2. The ext-prover.cfg configuration file in the node data directory

A default ext-prover.cfg file looks like:

# Default config:
#[[WdPost]]
#Bin = "/path/to/custom/bin"
#Args = ["args1", "args2", "args3"]
#Concurrent = 1
#Weight = 1
#ReadyTimeoutSecs = 5
#[WdPost.Envs]
#ENV_KEY = "ENV_VAL"
#
#[[WinPost]]
#Bin = "/path/to/custom/bin"
#Args = ["args1", "args2", "args3"]
#Concurrent = 1
#Weight = 1
#ReadyTimeoutSecs = 5
#[WinPost.Envs]
#ENV_KEY = "ENV_VAL"
#

In recent versions, daemon init initializes the ext-prover.cfg file.

Users can write their own, or copy the corresponding files from a data directory initialized by the latest version to an existing data directory.

The configuration items in ext-prover.cfg work very much like the corresponding configuration blocks in damocles-worker; users can refer to the relevant documentation for details.
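
For illustration, a minimal sketch of an ext-prover.cfg that starts one external WindowPoSt executor bound to a single GPU; the Bin path is a placeholder, and every field name comes from the default template above:

[[WdPost]]
Bin = "/path/to/custom/bin"
Concurrent = 1
[WdPost.Envs]
CUDA_VISIBLE_DEVICES = "0"
TMPDIR = "/tmp/ext-prover0/"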

When the --ext-prover parameter is included in the startup command of damocles-manager, the node uses the ext-prover.cfg file in its configuration directory as the basis for starting child processes. The --conf-dir parameter also takes effect for this configuration file.

If you see logs like the following, ext-prover is ready.

2022-04-27T19:15:00.441+0800 INFO porver-ext ext/prover.go:122 response loop start {"pid": 24764, "ppid": 24732, "loop": "resp"}
2022-04-27T19:15:00.441+0800 INFO porver-ext ext/prover.go:155 request loop start {"pid": 24764, "ppid": 24732, "loop": "req"}
2022-04-27T19:15:00.468+0800 INFO processor-cmd processor/processor.go:35 ready {"pid": 24764, "ppid": 24732, "proc": "wdpost"}

Deployment Practice

Suppose we have a node machine with 8 GPUs. We can then provide stronger PoSt processing capabilities with the following configuration.

  1. Configure and start the source node

    damocles-manager daemon run --miner
    

     At this time, the source node provides only sealing- and mining-related capabilities and does not handle WindowPoSt;

  2. Configure the ext-prover.cfg file:

    [[WdPost]]
    [WdPost.Envs]
    CUDA_VISIBLE_DEVICES = "0"
    TMPDIR = "/tmp/ext-prover0/"
    
    [[WdPost]]
    [WdPost.Envs]
    CUDA_VISIBLE_DEVICES = "1"
    TMPDIR = "/tmp/ext-prover1/"
    
    [[WdPost]]
    [WdPost.Envs]
    CUDA_VISIBLE_DEVICES = "2"
    TMPDIR = "/tmp/ext-prover2/"
    
    [[WdPost]]
    [WdPost.Envs]
    CUDA_VISIBLE_DEVICES = "3"
    TMPDIR = "/tmp/ext-prover3/"
    
    [[WdPost]]
    [WdPost.Envs]
    CUDA_VISIBLE_DEVICES = "4"
    TMPDIR = "/tmp/ext-prover4/"
    
    [[WdPost]]
    [WdPost.Envs]
    CUDA_VISIBLE_DEVICES = "5"
    TMPDIR = "/tmp/ext-prover5/"
    
    [[WdPost]]
    [WdPost.Envs]
    CUDA_VISIBLE_DEVICES = "6"
    TMPDIR = "/tmp/ext-prover6/"
    
    [[WdPost]]
    [WdPost.Envs]
    CUDA_VISIBLE_DEVICES = "7"
    TMPDIR = "/tmp/ext-prover7/"
    
    
  3. Initialize and start a standalone PoSter node

    damocles-manager --home=~/.damocles-individual-poster daemon init
    damocles-manager --home=~/.damocles-individual-poster daemon run --proxy="127.0.0.1:1789" --poster --listen=":2789" --conf-dir="~/.damocles-manager" --ext-prover
    

With this deployment:

  • The source node provides both sealing and mining support
  • The proxy node provides WindowPoSt support
    • The proxy node enables ext-prover, and each child process independently uses one GPU and one computation lock directory
  • There is no conflict between WinningPoSt and WindowPoSt due to device usage

Limitations

So far, we have described the features that standalone PoSter nodes rely on, their principles, and simple usage examples.

However, this mode still has some limitations for very large SP clusters, which may manifest in:

  • Scheduling PoSt tasks, and handling serious conflicts between PoSt proving windows, still relies on manual operations to a certain extent;