Choose between backpressure and load shedding for each service #1618

Closed
teor2345 opened this issue Jan 22, 2021 · 1 comment
Labels: A-rust (Area: Updates to Rust code), C-bug (Category: This is a bug), C-cleanup (Category: This is a cleanup), C-design (Category: Software design work)

Comments


teor2345 commented Jan 22, 2021

Motivation

In #1593, we reviewed the poll_ready implementations in Zebra, but kept the backpressure implementations mostly the same.

Incorrect backpressure implementations can impact resource usage and cause hangs, so we should make sure that Zebra transmits backpressure correctly.

Constraints

The constraints imposed by the tower::Buffer and tower::Batch implementations are:

  1. poll_ready must be called at least once before each call to the service.
  2. Once we've reserved a buffer slot, we always get Poll::Ready from a buffer, regardless of the current readiness of the buffer or its underlying service.
  3. The Buffer/Batch capacity limits the number of concurrently waiting tasks. Once this limit is reached, further tasks will block, awaiting a free reservation.
  4. Some tasks depend on other tasks completing before they can resolve. (For example: block validation.) If there are task dependencies, the Buffer/Batch capacity must be larger than the maximum number of concurrently waiting tasks, or Zebra could deadlock (hang).
  5. To avoid deadlocks, we should acquire slots in a consistent order, using the ready! macro, rather than polling them all simultaneously.

For example, see #1735:

        // We acquire checkpoint readiness before block readiness, to avoid an unlikely
        // hang during the checkpoint to block verifier transition. If the checkpoint and
        // block verifiers are contending for the same buffer/batch, we want the checkpoint
        // verifier to win, so that checkpoint verification completes, and block verification
        // can start. (Buffers and batches have multiple slots, so this contention is unlikely.)
        use futures::ready;
        // The chain verifier holds one slot in each verifier, for each concurrent task.
        // Therefore, any shared buffers or batches polled by these verifiers should double
        // their bounds. (For example, the state service buffer.)
        ready!(self
            .checkpoint
            .poll_ready(cx)
            .map_err(VerifyChainError::Checkpoint))?;
        ready!(self.block.poll_ready(cx).map_err(VerifyChainError::Block))?;
        Poll::Ready(Ok(()))
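
To make constraints 3 and 4 concrete: if two verifiers each hold one slot in a shared buffer for every concurrent task, that buffer's bound must be at least twice the maximum number of concurrent tasks. Here is a minimal sketch, assuming a hypothetical MAX_CONCURRENT_VERIFY_TASKS constant and using tower::service_fn as a stand-in for the real state service:

    use tower::{buffer::Buffer, service_fn};

    // Hypothetical bound: the maximum number of concurrent verification tasks.
    const MAX_CONCURRENT_VERIFY_TASKS: usize = 20;

    #[tokio::main]
    async fn main() {
        // Stand-in for the real state service.
        let state_service =
            service_fn(|req: String| async move { Ok::<_, tower::BoxError>(req) });

        // The checkpoint and block verifiers can each hold a slot per concurrent
        // task (constraint 4), so the shared buffer's bound is doubled, or tasks
        // could deadlock waiting for a free reservation (constraint 3).
        let _state = Buffer::new(state_service, 2 * MAX_CONCURRENT_VERIFY_TASKS);
    }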

Design Tasks

Develop consistent design patterns for Zebra backpressure.

These patterns could include:

  • transmit backpressure
  • load shed directly in the service (or the code that calls the service)
  • ignore backpressure

The design should include advice on when to use each design pattern.
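
For instance, the "load shed directly" pattern can be sketched with tower's LoadShed middleware, which reports itself as always ready and fails a call immediately if the wrapped service is not ready, instead of making the caller wait. This is a minimal sketch with a stand-in inner service, not the exact wrapper Zebra uses; the same effect can also be coded by hand inside a service's call method.

    use tower::{load_shed::LoadShed, service_fn, ServiceExt};

    #[tokio::main]
    async fn main() -> Result<(), tower::BoxError> {
        // Stand-in inner service that just echoes its request.
        let inner =
            service_fn(|req: &'static str| async move { Ok::<_, tower::BoxError>(req) });

        // LoadShed never transmits backpressure: if `inner` is not ready, the
        // call fails immediately with an "overloaded" error instead of waiting.
        let response = LoadShed::new(inner).oneshot("inv request").await?;
        println!("{}", response);
        Ok(())
    }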

Complex Design Constraints: Inbound Service

The Inbound service currently doesn't transmit backpressure.

Ideally, we'd like to transmit backpressure when:

  • the state is busy, but only for requests that use the state
  • the address book is locked, but only for peer requests

When the download queue is full, we want to drop old requests, because the latest requests are the most likely to be new blocks we haven't seen yet. So we don't want to buffer download requests. Since we want recent gossiped blocks, we might want to:

  • avoid all backpressure in the Inbound service, or
  • have a separate queue for gossiped blocks
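
One way to sketch the "drop old requests" behaviour described above is a bounded queue that evicts its oldest entry when new gossip arrives while it is full. The GossipQueue type below is hypothetical, not Zebra's actual download queue:

    use std::collections::VecDeque;

    /// Hypothetical bounded queue for gossiped block hashes: when it is full,
    /// the oldest entry is dropped so the newest gossip is always kept.
    struct GossipQueue<T> {
        queue: VecDeque<T>,
        capacity: usize,
    }

    impl<T> GossipQueue<T> {
        fn new(capacity: usize) -> Self {
            Self { queue: VecDeque::with_capacity(capacity), capacity }
        }

        fn push(&mut self, item: T) {
            if self.queue.len() == self.capacity {
                // Drop the oldest request: newer gossip is more likely to be
                // a block we haven't seen yet.
                self.queue.pop_front();
            }
            self.queue.push_back(item);
        }
    }

    fn main() {
        let mut queue = GossipQueue::new(2);
        queue.push("hash-a");
        queue.push("hash-b");
        queue.push("hash-c"); // evicts "hash-a"
        assert_eq!(
            queue.queue.iter().copied().collect::<Vec<_>>(),
            ["hash-b", "hash-c"]
        );
    }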

Rejected Designs

We don't want to propagate backpressure in Inbound::poll_ready, because we'd be holding a state buffer slot for every concurrent request, even if the request didn't use the state.

We don't want to generate a Poll::Pending if the downloader is full, because most requests don't use the downloader. (And we'd have to wake the pending task when the downloader emptied.)

Here's a backpressure design for the address book:

  • lock and copy the address book data in Inbound::call, so the Inbound buffer fills up if the address book is locked or heavily contended
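
A minimal sketch of that address book design, using std::sync::Mutex and hypothetical type and field names as stand-ins for Zebra's real types: the lock is taken and the data copied synchronously, so a locked or heavily contended address book slows down Inbound::call, and the Inbound buffer fills up.

    use std::sync::{Arc, Mutex};

    /// Hypothetical stand-in for the shared address book.
    struct AddressBook {
        peers: Vec<String>,
    }

    struct Inbound {
        address_book: Arc<Mutex<AddressBook>>,
    }

    impl Inbound {
        /// Sketch of handling a peer request inside `call`: lock and copy the
        /// address book data synchronously, so a locked or heavily contended
        /// address book makes `call` slower, filling up the Inbound buffer.
        fn peers_response(&self) -> Vec<String> {
            self.address_book
                .lock()
                .expect("address book lock should not be poisoned")
                .peers
                .clone()
        }
    }

    fn main() {
        let inbound = Inbound {
            address_book: Arc::new(Mutex::new(AddressBook {
                peers: vec!["127.0.0.1:8233".into()],
            })),
        };
        println!("{:?}", inbound.peers_response());
    }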

But I'm not sure if there's an API that does the same thing for the state service:

  • acquire the state buffer slot in Inbound::call, so the Inbound buffer fills up if the state is busy

Potential Design

We might want to turn Inbound requests into a Stream, so each request holds its slot until its future resolves to a result. (Rather than a Buffer, which only holds the slot until the response future is created and returned.) That way, we can drop new requests when the Inbound stream is full, transmitting backpressure.
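
As a rough sketch of this idea (the names and capacity are hypothetical, and this is not Zebra's implementation), a bounded tokio::sync::mpsc channel can play the role of the Inbound request stream: the worker pulls and fully handles one request at a time, so the queue only drains as fast as requests complete, and try_send drops new requests as soon as the queue is full.

    use tokio::sync::mpsc;

    #[tokio::main]
    async fn main() {
        // Hypothetical bounded stream of Inbound requests (capacity 1 to make
        // the "stream is full" case easy to show).
        let (tx, mut rx) = mpsc::channel::<String>(1);

        // The first request takes the only slot; the second is dropped
        // immediately because the stream is full, instead of being buffered.
        assert!(tx.try_send("GetBlocks".to_string()).is_ok());
        assert!(tx.try_send("GetData".to_string()).is_err());
        drop(tx);

        // Worker side: each request is handled to completion before the next
        // one is pulled, so slots free up only as fast as requests are handled.
        while let Some(request) = rx.recv().await {
            println!("handled {}", request);
        }
    }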

Implementation Tasks

Review:

  • every Zebra Service and Stream
  • all the code that calls Zebra Services and Streams

and make sure the code uses one of the Zebra backpressure design patterns.

In particular, we should search for the following strings:

  • poll_ready for Services
  • ready_and for Services
  • poll_next for Streams
  • any other Service or ServiceExt functions that call poll_ready
  • any other Stream or StreamExt functions that call poll_next
teor2345 added the C-bug, C-design, A-rust, and C-cleanup labels on Jan 22, 2021
teor2345 added a commit to teor2345/zebra that referenced this issue Jan 22, 2021
Uses the "load shed directly" design pattern from ZcashFoundation#1618.
teor2345 added a commit that referenced this issue Jan 29, 2021
Uses the "load shed directly" design pattern from #1618.
teor2345 commented:

We should fix issues with individual services as they come up.

mpguerra closed this as not planned on May 18, 2023