fix(rpc): Improve input validation and error handling #13069

thevilledev · 2025-04-22T14:58:47Z

The rpc-server was vulnerable to Denial of Service attacks via several RPC commands (SET_TENSOR, GRAPH_COMPUTE, etc.). Malformed messages could trigger failed assertions (e.g., invalid ggml_type) or out-of-bounds reads/writes leading to GGML_ABORT calls, crashing the server process.

This PR introduces robust input validation and replaces abort() calls with graceful error handling:

- Type Validation: deserialize_tensor now checks if the tensor->type is within the valid GGML_TYPE_COUNT range before calling ggml_new_tensor_4d. Returns nullptr on invalid type.
- Bounds Checks: Replaced GGML_ABORT in set_tensor, set_tensor_hash, and get_tensor handlers with error logging and returning false when data/offset parameters are out of buffer bounds.
- Error Propagation:
  - create_node now checks for nullptr return values from deserialize_tensor and its recursive calls, propagating nullptr upwards on failure. Uses find instead of at for safer map access.
  - copy_tensor now checks for nullptr from deserialize_tensor and sets the response status to failure if deserialization or bounds checks fail.
  - graph_compute now checks for nullptr return from create_node and returns failure status correctly. The final return value now reflects the actual computation status.
  - RPC_CMD_GET_ALLOC_SIZE now checks the return value of server.get_alloc_size in the RPC server loop. If the call fails, return early to close the connection.
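
The type and bounds checks listed above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the PR's code: `demo_type`, `is_valid_type`, and `range_in_bounds` are hypothetical names, and the real server validates against ggml's own type enum and buffer base pointers rather than plain sizes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for a type enum with a trailing COUNT sentinel.
enum demo_type : uint32_t { DEMO_TYPE_F32 = 0, DEMO_TYPE_F16 = 1, DEMO_TYPE_COUNT = 2 };

// Type validation: accept only wire values in [0, DEMO_TYPE_COUNT).
bool is_valid_type(uint32_t wire_type) {
    return wire_type < DEMO_TYPE_COUNT;
}

// Bounds check: does [offset, offset + size) fit in a buffer of buf_size bytes?
// offset + size is never computed, so a huge size cannot wrap around.
bool range_in_bounds(uint64_t buf_size, uint64_t offset, uint64_t size) {
    return offset <= buf_size && size <= buf_size - offset;
}
```

A handler built on these helpers would log the problem and return false (or nullptr) instead of aborting, which is the behavior change the description refers to.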

lexasub · 2025-04-22T21:18:39Z

In my opinion, this may affect performance. Maybe put it behind a feature flag (via CMake)?

ggml/src/ggml-rpc/ggml-rpc.cpp

thevilledev · 2025-04-23T18:07:59Z

> In my opinion, this may affect performance. Maybe put it behind a feature flag (via CMake)?

I believe it would be interesting to see what the performance impact of this change is. I'm new to the project so pointers welcome if there's a test suite available which would show that.

Slightly off-topic but related: I think there are plenty of opportunities for similar improvements in the RPC server, from invalid tensor operations to crashes via deep recursion in create_node, which I would also like to fix. I'd like to work on those one change at a time, though.

I think putting multiple critical fixes behind a feature flag would be counterintuitive. I'd rather build benchmark tooling (if needed) and iterate on the fixes so that the performance hit is minimal.

rgerganov · 2025-04-24T08:45:39Z

> I believe it would be interesting to see what the performance impact of this change is. I'm new to the project so pointers welcome if there's a test suite available which would show that.

We use llama-bench to test performance.

> Slightly off-topic but related: I think there are plenty of opportunities for similar improvements in the RPC server.

The best investment of effort in this direction would be creating a script/job for coverage-guided fuzzing. This way we can automatically test for security issues when we make RPC changes and even integrate it into the CI.
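
As a rough idea of what such a fuzzing job could look like, here is a minimal libFuzzer harness sketch. `LLVMFuzzerTestOneInput` is libFuzzer's standard entry point. `demo_parse` and its message layout (one command byte followed by a 64-bit payload size) are hypothetical stand-ins for this illustration, not the actual RPC wire format or server code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical parser under test: must return false, never crash, on bad input.
// Assumed layout for this sketch: [cmd: 1 byte][payload_size: 8 bytes][payload].
bool demo_parse(const uint8_t * data, size_t size) {
    const size_t header_size = sizeof(uint8_t) + sizeof(uint64_t);
    if (size < header_size) {
        return false;  // too short to contain a header
    }
    uint64_t payload_size = 0;
    std::memcpy(&payload_size, data + 1, sizeof(payload_size));
    // The declared payload size must match the bytes actually present.
    return payload_size == size - header_size;
}

// libFuzzer entry point.
// Build with: clang++ -g -fsanitize=fuzzer,address harness.cpp
extern "C" int LLVMFuzzerTestOneInput(const uint8_t * data, size_t size) {
    demo_parse(data, size);
    return 0;  // crashes and sanitizer reports are the findings, not the return value
}
```

In a real job the harness body would call into the RPC message handling code instead of `demo_parse`, and the resulting binary could be run for a bounded time in CI.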

slaren · 2025-04-24T10:48:56Z

RPC_CMD_GET_ALLOC_SIZE does not check for errors, and if the call to get_alloc_size fails it will leave the client connected in a bad state:

llama.cpp/ggml/src/ggml-rpc/ggml-rpc.cpp

Line 1470 in 604f0a0

server.get_alloc_size(request, response);
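
A minimal sketch of the kind of check being asked for, assuming the handler reports failure through a bool return value. The `demo_*` types and functions are hypothetical and only mirror the shape of the call above. They are not the server's actual request and response structs.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical message types, for illustration only.
struct demo_request  { uint64_t tensor_bytes; };
struct demo_response { uint64_t alloc_size; };

// Hypothetical handler: returns false when the request cannot be served.
bool demo_get_alloc_size(const demo_request & req, demo_response & rsp) {
    if (req.tensor_bytes == 0) {
        return false;
    }
    rsp.alloc_size = req.tensor_bytes;
    return true;
}

// One step of a command loop. Returning false tells the caller to stop
// serving this client and close the connection.
bool demo_handle_get_alloc_size(const demo_request & req, demo_response & rsp) {
    if (!demo_get_alloc_size(req, rsp)) {
        return false;  // never send a response built from a failed call
    }
    // ... the response would be sent to the client here ...
    return true;
}
```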

thevilledev · 2025-04-24T17:51:56Z

Thanks @slaren, added it to this same PR since it falls under the same scope. e6dd976

thevilledev · 2025-04-24T17:53:10Z

> The best investment of effort in this direction would be creating a script/job for coverage-guided fuzzing.

Sounds good, I can look into that after this PR 👍

The `rpc-server` was vulnerable to Denial of Service attacks via several RPC commands (`SET_TENSOR`, `GRAPH_COMPUTE`, etc.). Malformed messages could trigger failed assertions (e.g., invalid `ggml_type`) or out-of-bounds reads/writes leading to `GGML_ABORT` calls, crashing the server process.

This PR introduces robust input validation and replaces `abort()` calls with graceful error handling:

- **Type Validation:** `deserialize_tensor` now checks if the `tensor->type` is within the valid `GGML_TYPE_COUNT` range *before* calling `ggml_new_tensor_4d`. Returns `nullptr` on invalid type.
- **Bounds Checks:** Replaced `GGML_ABORT` in `set_tensor`, `set_tensor_hash`, and `get_tensor` handlers with error logging and returning `false` when data/offset parameters are out of buffer bounds.
- **Size Checks:** Added safe arithmetic checks (for overflow) in `graph_compute` when calculating required message sizes based on client-provided `n_nodes` and `n_tensors`. Returns early if the reported sizes conflict with the actual message size or would lead to overflow.
- **Error Propagation:**
  - `create_node` now checks for `nullptr` return values from `deserialize_tensor` and its recursive calls, propagating `nullptr` upwards on failure. Uses `find` instead of `at` for safer map access.
  - `copy_tensor` now checks for `nullptr` from `deserialize_tensor` and sets the response status to failure if deserialization or bounds checks fail.
  - `graph_compute` now checks for `nullptr` return from `create_node` and returns failure status correctly. The final return value now reflects the actual computation status.

These changes improve the RPC server's resilience against malformed client requests, preventing crashes and ensuring errors are handled more gracefully.

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>
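
The overflow-aware size check mentioned under Size Checks can be illustrated with the sketch below. It assumes a message section laid out as a 32-bit count followed by that many fixed-size elements. The helper names `checked_mul_add` and `count_fits` are hypothetical and do not come from this commit.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>

// Computes a * b + c into out. Returns false if the result would overflow uint64_t.
bool checked_mul_add(uint64_t a, uint64_t b, uint64_t c, uint64_t & out) {
    const uint64_t max = std::numeric_limits<uint64_t>::max();
    if (b != 0 && a > max / b) {
        return false;  // a * b would overflow
    }
    const uint64_t prod = a * b;
    if (prod > max - c) {
        return false;  // prod + c would overflow
    }
    out = prod + c;
    return true;
}

// Assumed layout: [count: u32][count elements of elem_size bytes each].
// Returns true only if a section with that count fits in input_size bytes.
bool count_fits(uint64_t input_size, uint32_t count, uint64_t elem_size) {
    uint64_t needed = 0;
    if (!checked_mul_add(count, elem_size, sizeof(uint32_t), needed)) {
        return false;
    }
    return needed <= input_size;
}
```

Rejecting the message when `count_fits` returns false means a client-provided count is never used to index past the end of the received buffer.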

removed comments and unnecessary returns

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>

rpc_server::create_node could previously return nullptr if the input ID was 0 (valid) or if an internal error (deserialization, recursion failure) occurred (invalid). This ambiguity made error handling difficult for the caller (`graph_compute`).

This commit clarifies the meaning of nullptr:

- `graph_compute` now checks if the input 'id' was non-zero when `create_node` returns nullptr, correctly identifying failures versus intentional null links.
- `create_node` avoids recursive calls for zero IDs and propagates nullptr unambiguously on failure during recursion.

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>
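
The caller-side disambiguation described in this commit can be sketched as follows. This is a simplified, non-recursive illustration: `demo_tensor`, `demo_create_node`, and `demo_resolve` are hypothetical names, and the sketch assumes that ID 0 is never stored in the map.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

struct demo_tensor { uint64_t id; };  // hypothetical stand-in for a deserialized tensor

// Returns nullptr both for id == 0 (an intentional null link) and for an unknown id.
// Uses find() rather than at(), so a missing key does not throw.
demo_tensor * demo_create_node(uint64_t id, std::unordered_map<uint64_t, demo_tensor> & tensors) {
    auto it = tensors.find(id);
    if (it == tensors.end()) {
        return nullptr;
    }
    return &it->second;
}

// Caller-side check: a nullptr result is an error only when the id was non-zero.
bool demo_resolve(uint64_t id, std::unordered_map<uint64_t, demo_tensor> & tensors, demo_tensor *& out) {
    out = demo_create_node(id, tensors);
    return !(out == nullptr && id != 0);
}
```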

The caller (`graph_compute`) already checks `id != 0` when handling a `nullptr` return from `create_node`, correctly distinguishing intentional null links from actual errors. This makes the initial `if (id == 0)` check redundant.

Also removes the log message when a tensor ID is not found in the provided map which was added in this branch.

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>

Check the return value of `server.get_alloc_size` in the RPC server loop. If the call fails, return early to close the connection.

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>

Removes detailed, step-by-step size calculations and overflow checks in favor of simpler direct comparisons, assuming 64-bit overflow is unlikely.

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>

github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on Apr 22, 2025

rgerganov reviewed Apr 23, 2025

thevilledev force-pushed the fix/tensor-ggml-type branch from bef194d to 604f0a0 on April 23, 2025 18:00

thevilledev force-pushed the fix/tensor-ggml-type branch from e6dd976 to 359e38e on April 26, 2025 06:01

thevilledev added 5 commits April 26, 2025 09:03

refactor(rpc): address pr comments

cd054aa

fix(rpc): Handle get_alloc_size failure in server

e38c4d7

refactor(rpc): input size validation in graph_compute

72c447a

thevilledev force-pushed the fix/tensor-ggml-type branch from 359e38e to 72c447a on April 26, 2025 06:04
