Skip to content

fix(rpc): Improve input validation and error handling #13069

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

thevilledev
Copy link

@thevilledev thevilledev commented Apr 22, 2025

Fixes #13067

The rpc-server was vulnerable to Denial of Service attacks via several RPC commands (SET_TENSOR, GRAPH_COMPUTE, etc.). Malformed messages could trigger failed assertions (e.g., invalid ggml_type) or out-of-bounds reads/writes leading to GGML_ABORT calls, crashing the server process.

This PR introduces robust input validation and replaces abort() calls with graceful error handling:

  • Type Validation: deserialize_tensor now checks if the tensor->type is within the valid GGML_TYPE_COUNT range before calling ggml_new_tensor_4d. Returns nullptr on invalid type.
  • Bounds Checks: Replaced GGML_ABORT in set_tensor, set_tensor_hash, and get_tensor handlers with error logging and returning false when data/offset parameters are out of buffer bounds.
  • Error Propagation:
    • create_node now checks for nullptr return values from deserialize_tensor and its recursive calls, propagating nullptr upwards on failure. Uses find instead of at for safer map access.
    • copy_tensor now checks for nullptr from deserialize_tensor and sets the response status to failure if deserialization or bounds checks fail.
    • graph_compute now checks for nullptr return from create_node and returns failure status correctly. The final return value now reflects the actual computation status.
    • RPC_CMD_GET_ALLOC_SIZE now checks the return value of server.get_alloc_size in the RPC server
      loop. If the call fails, return early to close the connection.

@github-actions github-actions bot added the ggml changes relating to the ggml tensor library for machine learning label Apr 22, 2025
@lexasub
Copy link
Contributor

lexasub commented Apr 22, 2025

on my opinion it may affects to perfomance, may be use feature flag (via cmake)?

@thevilledev thevilledev force-pushed the fix/tensor-ggml-type branch from bef194d to 604f0a0 Compare April 23, 2025 18:00
@thevilledev
Copy link
Author

thevilledev commented Apr 23, 2025

on my opinion it may affects to perfomance, may be use feature flag (via cmake)?

I believe it would be interesting to see what the performance impact of this change is. I'm new to the project so pointers welcome if there's a test suite available which would show that.

Slightly off-topic but related: I think there's plenty of opportunities for similar improvements in the RPC server. From invalid tensor operations to crashing via deep recursion in create_node which I would like to also fix. I'd like to work on those one change at a time though.

I think multiple critical fixes behind a feature flag would be counterintuitive. Rather build bench tooling (if needed) and iterate on the fixes so there's minimal performance hit.

@rgerganov
Copy link
Collaborator

I believe it would be interesting to see what the performance impact of this change is. I'm new to the project so pointers welcome if there's a test suite available which would show that.

We use llama-bench to test performance

Slightly off-topic but related: I think there's plenty of opportunities for similar improvements in the RPC server.

The best investment of efforts in this direction would be creating a script/job for coverage guided fuzzing. This way we can automatically test for security issues when we make RPC changes and even integrate it into the CI.

@slaren
Copy link
Member

slaren commented Apr 24, 2025

RPC_CMD_GET_ALLOC_SIZE does not check for errors, and if the call to get_alloc_size fails it will leave the client connected in a bad state:

server.get_alloc_size(request, response);

@thevilledev
Copy link
Author

Thanks @slaren, added it to this same PR since it falls under the same scope. e6dd976

@thevilledev
Copy link
Author

The best investment of efforts in this direction would be creating a script/job for coverage guided fuzzing.

Sounds good, I can look into that after this PR 👍

The `rpc-server` was vulnerable to Denial of Service attacks via
several RPC commands (`SET_TENSOR`, `GRAPH_COMPUTE`, etc.). Malformed
messages could trigger failed assertions (e.g., invalid `ggml_type`)
or out-of-bounds reads/writes leading to `GGML_ABORT` calls,
crashing the server process.

This PR introduces robust input validation and replaces `abort()`
calls with graceful error handling:

- **Type Validation:** `deserialize_tensor` now checks if the
  `tensor->type` is within the valid `GGML_TYPE_COUNT` range
  *before* calling `ggml_new_tensor_4d`. Returns `nullptr` on
  invalid type.
- **Bounds Checks:** Replaced `GGML_ABORT` in `set_tensor`,
  `set_tensor_hash`, and `get_tensor` handlers with error
  logging and returning `false` when data/offset parameters
  are out of buffer bounds.
- **Size Checks:** Added safe arithmetic checks (for overflow) in
  `graph_compute` when calculating required message sizes based
  on client-provided `n_nodes` and `n_tensors`. Returns early
  if the reported sizes conflict with the actual message size or
  would lead to overflow.
- **Error Propagation:**
    - `create_node` now checks for `nullptr` return values from
      `deserialize_tensor` and its recursive calls, propagating
      `nullptr` upwards on failure. Uses `find` instead of `at`
      for safer map access.
    - `copy_tensor` now checks for `nullptr` from `deserialize_tensor`
      and sets the response status to failure if deserialization
      or bounds checks fail.
    - `graph_compute` now checks for `nullptr` return from
      `create_node` and returns failure status correctly. The final
      return value now reflects the actual computation status.

These changes improve the RPC server's resilience
against malformed client requests, preventing crashes and ensuring
errors are handled more gracefully.

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>
@thevilledev thevilledev force-pushed the fix/tensor-ggml-type branch from e6dd976 to 359e38e Compare April 26, 2025 06:01
removed comments and unnecessary returns

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>
rpc_server::create_node could previously return nullptr if the input ID
was 0 (valid) or if an internal error (deserialization, recursion
failure) occurred (invalid). This ambiguity made error handling
difficult for the caller (`graph_compute`).

This commit clarifies the meaning of nullptr:
- `graph_compute` now checks if the input 'id' was non-zero when
  `create_node` returns nullptr, correctly identifying failures
  versus intentional null links.
- `create_node` avoids recursive calls for zero IDs and propagates
  nullptr unambiguously on failure during recursion.

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>
The caller (`graph_compute`) already checks `id != 0` when handling
a `nullptr` return from `create_node`, correctly distinguishing
intentional null links from actual errors. This makes the initial
`if (id == 0)` check redundant.

Also removes the log message when a tensor ID is not found in the
provided map which was added in this branch.

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>
Check the return value of `server.get_alloc_size` in the RPC server
loop. If the call fails, return early to close the connection.

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>
Removes detailed, step-by-step size calculations and overflow
checks in favor of simpler direct comparisons, assuming 64-bit
overflow is unlikely.

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>
@thevilledev thevilledev force-pushed the fix/tensor-ggml-type branch from 359e38e to 72c447a Compare April 26, 2025 06:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ggml changes relating to the ggml tensor library for machine learning
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Misc. bug: RPC server crash on SET_TENSOR with invalid ggml_type
4 participants