Choose between backpressure and load shedding for each service #1618

Closed
teor2345 opened this issue Jan 22, 2021 · 1 comment
Labels: A-rust (Area: Updates to Rust code), C-bug (Category: This is a bug), C-cleanup (Category: This is a cleanup), C-design (Category: Software design work)

Comments


teor2345 commented Jan 22, 2021

Motivation

In #1593, we reviewed the poll_ready implementations in Zebra, but kept the backpressure implementations mostly the same.

Incorrect backpressure implementations can impact resource usage and cause hangs, so we should make sure that Zebra transmits backpressure correctly.

Constraints

The constraints imposed by the tower::Buffer and tower::Batch implementations are:

  1. poll_ready must be called at least once before each call to the service.
  2. Once we've reserved a buffer slot, we always get Poll::Ready from a buffer, regardless of the current readiness of the buffer or its underlying service.
  3. The Buffer/Batch capacity limits the number of concurrently waiting tasks. Once this limit is reached, further tasks will block, awaiting a free reservation.
  4. Some tasks depend on other tasks completing before they can resolve. (For example: block validation.) If there are task dependencies, the Buffer/Batch capacity must be larger than the maximum number of concurrently waiting tasks, or Zebra could deadlock (hang).
  5. To avoid deadlocks, we should acquire slots in a consistent order, using the ready! macro, rather than polling them all simultaneously.

For example, see #1735:

        // We acquire checkpoint readiness before block readiness, to avoid an unlikely
        // hang during the checkpoint to block verifier transition. If the checkpoint and
        // block verifiers are contending for the same buffer/batch, we want the checkpoint
        // verifier to win, so that checkpoint verification completes, and block verification
        // can start. (Buffers and batches have multiple slots, so this contention is unlikely.)
        use futures::ready;
        // The chain verifier holds one slot in each verifier, for each concurrent task.
        // Therefore, any shared buffers or batches polled by these verifiers should double
        // their bounds. (For example, the state service buffer.)
        ready!(self
            .checkpoint
            .poll_ready(cx)
            .map_err(VerifyChainError::Checkpoint))?;
        ready!(self.block.poll_ready(cx).map_err(VerifyChainError::Block))?;
        Poll::Ready(Ok(()))
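
To make constraints 3 and 4 concrete: if two verifiers each hold one slot in a shared buffer for every concurrent task, that buffer's bound must be at least twice the maximum number of concurrent tasks. Here is a minimal sketch, assuming a hypothetical MAX_CONCURRENT_VERIFY_TASKS constant and using tower::service_fn as a stand-in for the real state service:

    use tower::{buffer::Buffer, service_fn};

    // Hypothetical bound: the maximum number of concurrent verification tasks.
    const MAX_CONCURRENT_VERIFY_TASKS: usize = 20;

    #[tokio::main]
    async fn main() {
        // Stand-in for the real state service.
        let state_service =
            service_fn(|req: String| async move { Ok::<_, tower::BoxError>(req) });

        // The checkpoint and block verifiers can each hold a slot per concurrent
        // task (constraint 4), so the shared buffer's bound is doubled, or tasks
        // could deadlock waiting for a free reservation (constraint 3).
        let _state = Buffer::new(state_service, 2 * MAX_CONCURRENT_VERIFY_TASKS);
    }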

Design Tasks

Develop consistent design patterns for Zebra backpressure.

These patterns could include:

  • transmit backpressure
  • load shed directly in the service (or the code that calls the service)
  • ignore backpressure

The design should include advice on when to use each design pattern.
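
For instance, the "load shed directly" pattern can be sketched with tower's LoadShed middleware, which reports itself as always ready and fails a call immediately if the wrapped service is not ready, instead of making the caller wait. This is a minimal sketch with a stand-in inner service, not the exact wrapper Zebra uses; the same effect can also be coded by hand inside a service's call method.

    use tower::{load_shed::LoadShed, service_fn, ServiceExt};

    #[tokio::main]
    async fn main() -> Result<(), tower::BoxError> {
        // Stand-in inner service that just echoes its request.
        let inner =
            service_fn(|req: &'static str| async move { Ok::<_, tower::BoxError>(req) });

        // LoadShed never transmits backpressure: if `inner` is not ready, the
        // call fails immediately with an "overloaded" error instead of waiting.
        let response = LoadShed::new(inner).oneshot("inv request").await?;
        println!("{}", response);
        Ok(())
    }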

Complex Design Constraints: Inbound Service

The Inbound service currently doesn't transmit backpressure.

Ideally, we'd like to transmit backpressure when:

  • the state is busy, but only for requests that use the state
  • the address book is locked, but only for peer requests

When the download queue is full, we want to drop old requests, because the latest requests are the most likely to be new blocks we haven't seen yet. So we don't want to buffer download requests. Since we want recent gossiped blocks, we might want to:

  • avoid all backpressure in the Inbound service, or
  • have a separate queue for gossiped blocks
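
One way to sketch the "drop old requests" behaviour described above is a bounded queue that evicts its oldest entry when new gossip arrives while it is full. The GossipQueue type below is hypothetical, not Zebra's actual download queue:

    use std::collections::VecDeque;

    /// Hypothetical bounded queue for gossiped block hashes: when it is full,
    /// the oldest entry is dropped so the newest gossip is always kept.
    struct GossipQueue<T> {
        queue: VecDeque<T>,
        capacity: usize,
    }

    impl<T> GossipQueue<T> {
        fn new(capacity: usize) -> Self {
            Self { queue: VecDeque::with_capacity(capacity), capacity }
        }

        fn push(&mut self, item: T) {
            if self.queue.len() == self.capacity {
                // Drop the oldest request: newer gossip is more likely to be
                // a block we haven't seen yet.
                self.queue.pop_front();
            }
            self.queue.push_back(item);
        }
    }

    fn main() {
        let mut queue = GossipQueue::new(2);
        queue.push("hash-a");
        queue.push("hash-b");
        queue.push("hash-c"); // evicts "hash-a"
        assert_eq!(
            queue.queue.iter().copied().collect::<Vec<_>>(),
            ["hash-b", "hash-c"]
        );
    }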

Rejected Designs

We don't want to propagate backpressure in Inbound::poll_ready, because we'd be holding a state buffer slot for every concurrent request, even if the request didn't use the state.

We don't want to generate a Poll::Pending if the downloader is full, because most requests don't use the downloader. (And we'd have to wake the pending task when the downloader emptied.)

Here's a backpressure design for the address book:

  • lock and copy the address book data in Inbound::call, so the Inbound buffer fills up if the address book is locked or heavily contended
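
A minimal sketch of that address book design, using std::sync::Mutex and hypothetical type and field names as stand-ins for Zebra's real types: the lock is taken and the data copied synchronously, so a locked or heavily contended address book slows down Inbound::call, and the Inbound buffer fills up.

    use std::sync::{Arc, Mutex};

    /// Hypothetical stand-in for the shared address book.
    struct AddressBook {
        peers: Vec<String>,
    }

    struct Inbound {
        address_book: Arc<Mutex<AddressBook>>,
    }

    impl Inbound {
        /// Sketch of handling a peer request inside `call`: lock and copy the
        /// address book data synchronously, so a locked or heavily contended
        /// address book makes `call` slower, filling up the Inbound buffer.
        fn peers_response(&self) -> Vec<String> {
            self.address_book
                .lock()
                .expect("address book lock should not be poisoned")
                .peers
                .clone()
        }
    }

    fn main() {
        let inbound = Inbound {
            address_book: Arc::new(Mutex::new(AddressBook {
                peers: vec!["127.0.0.1:8233".into()],
            })),
        };
        println!("{:?}", inbound.peers_response());
    }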

But I'm not sure if there's an API that does the same thing for the state service:

  • acquire the state buffer slot in Inbound::call, so the Inbound buffer fills up if the state is busy

Potential Design

We might want to turn Inbound requests into a Stream, so each request holds its slot until its future resolves to a result. (Rather than a Buffer, which only holds the slot until the response future is created and returned.) That way, we can drop new requests when the Inbound stream is full, transmitting backpressure.
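
As a rough sketch of this idea (the names and capacity are hypothetical, and this is not Zebra's implementation), a bounded tokio::sync::mpsc channel can play the role of the Inbound request stream: the worker pulls and fully handles one request at a time, so the queue only drains as fast as requests complete, and try_send drops new requests as soon as the queue is full.

    use tokio::sync::mpsc;

    #[tokio::main]
    async fn main() {
        // Hypothetical bounded stream of Inbound requests (capacity 1 to make
        // the "stream is full" case easy to show).
        let (tx, mut rx) = mpsc::channel::<String>(1);

        // The first request takes the only slot; the second is dropped
        // immediately because the stream is full, instead of being buffered.
        assert!(tx.try_send("GetBlocks".to_string()).is_ok());
        assert!(tx.try_send("GetData".to_string()).is_err());
        drop(tx);

        // Worker side: each request is handled to completion before the next
        // one is pulled, so slots free up only as fast as requests are handled.
        while let Some(request) = rx.recv().await {
            println!("handled {}", request);
        }
    }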

Implementation Tasks

Review:

  • every Zebra Service and Stream
  • all the code that calls Zebra Services and Streams

and make sure the code uses one of the Zebra backpressure design patterns.

In particular, we should search for the following strings:

  • poll_ready for Services
  • ready_and for Services
  • poll_next for Streams
  • any other Service or ServiceExt functions that call poll_ready
  • any other Stream or StreamExt functions that call poll_next
teor2345 added the C-bug, C-design, A-rust, and C-cleanup labels on Jan 22, 2021
teor2345 added a commit to teor2345/zebra that referenced this issue Jan 22, 2021
Uses the "load shed directly" design pattern from ZcashFoundation#1618.
teor2345 added a commit that referenced this issue Jan 29, 2021
Uses the "load shed directly" design pattern from #1618.
teor2345 commented:

We should fix issues with individual services as they come up.

mpguerra closed this as not planned on May 18, 2023