Improve handling of gRPC failures #471

erichulburd · 2024-05-14T19:42:05Z

Over the past few months, I've seen a variety of gRPC status failures that should be retryable on the client side. The most recent example:

QpuApiError                               Traceback (most recent call last)
...
    212 """Execute a job and return the shots."""
    213 job_id = submit(
    214     program=executable.program,
    215     patch_values=patch,
   (...)
    218     execution_options=self.execution_options,
    219 )
--> 220 return retrieve_results(
    221     job_id=job_id,
    222     quantum_processor_id=self.device_name,
    223     client=self.qcs_client,
    224     execution_options=self.execution_options,
    225 )

QpuApiError: Call failed during gRPC request: status: Unavailable, message: "error trying to connect: Unsuccessful reply: TtlExpired", details: [], metadata: MetadataMap { headers: {} }
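
Today the only way to branch on the status from Python appears to be parsing it out of the message text. A minimal sketch of that workaround follows, assuming the message keeps the `status: <Code>, message: "..."` shape shown above. The helper name `parse_grpc_status` is hypothetical and not part of the QCS SDK, and the text layout is an observation from this one error, not a documented format.

```python
import re
from typing import Optional, Tuple

# Assumption: the layout below mirrors the single error message quoted in this
# issue. It is not a documented or stable format.
_STATUS_RE = re.compile(
    r'status: (?P<code>\w+), message: "(?P<message>.*?)", details:'
)


def parse_grpc_status(error_text: str) -> Optional[Tuple[str, str]]:
    """Return (status_code, message) if the text looks like a gRPC failure.

    Hypothetical helper: returns None when the pattern is not found.
    """
    match = _STATUS_RE.search(error_text)
    if match is None:
        return None
    return match.group("code"), match.group("message")


example = (
    "Call failed during gRPC request: status: Unavailable, "
    'message: "error trying to connect: Unsuccessful reply: TtlExpired", '
    "details: [], metadata: MetadataMap { headers: {} }"
)
print(parse_grpc_status(example))
```

String matching like this is brittle, which is part of the motivation for the structured exceptions requested below.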

Errors of this nature are difficult to diagnose and handle in Python as the QCS SDK is currently structured. I propose considering the following:

Supporting retry configuration on all gRPC calls: translation, execution (i.e. submit), and result retrieval (retrieve_results). This should support retry based on gRPC status code as well as a backoff strategy: linear, exponential, max retries, etc.
Surfacing gRPC exceptions to Python in a structured way. At a minimum, this should include the status code. Request ID and timing data would also be nice.
Configurable gRPC logging. The gRPC C API uses environment variables in a well-structured and documented way: https://github.com/grpc/grpc/blob/15850972ddba9c1262a9d51341da03bc607bd934/doc/environment_variables.md
A persistent handle to the gRPC channel. The way the client is currently structured, each call to translate, execute, and retrieve results instantiates a new channel (see for instance

qcs-sdk-rust/crates/lib/src/qpu/api.rs

Line 292 in e73f83d

let mut controller_client = execution_options

and then https://github.com/rigetti/qcs-sdk-rust/blob/main/crates/lib/src/qpu/api.rs#L525). This both adds latency and makes connections more failure-prone, which runs contrary to the design of gRPC, where channels are meant to be long-lived and reused. If necessary, a persistent channel should be achievable with some once_cell utilities: https://docs.rs/once_cell/latest/once_cell/sync/struct.Lazy.html.
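
To make the retry request concrete, here is a rough client-side sketch of the kind of configuration meant above: a set of retryable statuses, a cap on attempts, and exponential backoff. Everything in it (`RetryPolicy`, `call_with_retry`, the default values) is hypothetical and not an existing QCS SDK API. Because the status code is not exposed in a structured way today, the sketch falls back to matching the status name in the error text.

```python
import time
from dataclasses import dataclass
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class RetryPolicy:
    """Hypothetical retry configuration; names and defaults are illustrative."""

    max_attempts: int = 4
    initial_delay: float = 0.5  # seconds before the first retry
    multiplier: float = 2.0  # exponential growth factor
    max_delay: float = 8.0  # upper bound on any single delay
    retryable_statuses: Tuple[str, ...] = ("Unavailable",)

    def delay_for(self, attempt: int) -> float:
        """Delay after failed attempt number `attempt` (1-based), capped."""
        return min(
            self.initial_delay * self.multiplier ** (attempt - 1),
            self.max_delay,
        )

    def is_retryable(self, error: Exception) -> bool:
        # Assumption: the status appears in the message as "status: <Code>,".
        text = str(error)
        return any(f"status: {s}," in text for s in self.retryable_statuses)


def call_with_retry(
    fn: Callable[[], T],
    policy: RetryPolicy = RetryPolicy(),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `fn`, retrying retryable failures with exponential backoff."""
    if policy.max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, policy.max_attempts + 1):
        try:
            return fn()
        except Exception as error:
            # Re-raise on the last attempt or on a non-retryable failure.
            if attempt == policy.max_attempts or not policy.is_retryable(error):
                raise
            sleep(policy.delay_for(attempt))
    raise AssertionError("unreachable")
```

A caller could wrap a call such as the `retrieve_results(...)` invocation from the traceback in a zero-argument lambda and pass it to `call_with_retry`. Whether every call (submit in particular) is safe to repeat is not established here, which is one reason per-call configuration inside the SDK would be preferable to a wrapper like this.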

If these options present inordinate technical challenges, I wonder if an alternative approach would be to interface with existing Python gRPC tooling, that is, to expose functions that convert Python-based gRPC message objects to QCS SDK structs.

erichulburd changed the title ~~Improve handling of retryable error~~ Improve handling of gRPC failures May 14, 2024

MarquessV mentioned this issue Aug 13, 2024

Expose error model to both Rust and Python users #491

Open
