Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement additional QoS metadata #1231

Merged
merged 9 commits into from
Oct 21, 2024
Merged

Conversation

carterkozak
Copy link
Contributor

@carterkozak carterkozak commented Oct 4, 2024

This adds two new fields to QosException, which may produce following response headers:

  • Qos-Due-To: custom: Describes this QoS failure as the result of an aspect of the request that doesn't necessarily apply to all requests, and shouldn't impact routing behavior. For example, per-user rate limits are specific to a single user, not all users.
  • Qos-Retry-Hint: do-not-retry: Hints that callers should not retry the request which resulted in this failure.

Before this PR

No way to provide additional context with QoS exceptions.

After this PR

==COMMIT_MSG==
Implement additional QoS metadata
==COMMIT_MSG==

This PR provides two pieces:

  1. QosReason provides additional fields: RetryHint and DueTo
  2. We provide a QosReasons utility which handles recording this metadata to a response, and recording from a response. This allows us to make changes/additions over time without updating projects that use these utilities.

Possible downsides?

This adds complexity to our QoS exceptions, which are already somewhat complicated to reason about. We need additional control over how QoS responses impact the system as a whole, so this should be a reasonable trade-off.

@changelog-app
Copy link

changelog-app bot commented Oct 4, 2024

Generate changelog in changelog/@unreleased

What do the change types mean?
  • feature: A new feature of the service.
  • improvement: An incremental improvement in the functionality or operation of the service.
  • fix: Remedies the incorrect behaviour of a component of the service in a backwards-compatible way.
  • break: Has the potential to break consumers of this service's API, inclusive of both Palantir services
    and external consumers of the service's API (e.g. customer-written software or integrations).
  • deprecation: Advertises the intention to remove service functionality without any change to the
    operation of the service itself.
  • manualTask: Requires the possibility of manual intervention (running a script, eyeballing configuration,
    performing database surgery, ...) at the time of upgrade for it to succeed.
  • migration: A fully automatic upgrade migration task with no engineer input required.

Note: only one type should be chosen.

How are new versions calculated?
  • ❗The break and manual task changelog types will result in a major release!
  • 🐛 The fix changelog type will result in a minor release in most cases, and a patch release version for patch branches. This behaviour is configurable in autorelease.
  • ✨ All others will result in a minor version release.

Type

  • Feature
  • Improvement
  • Fix
  • Break
  • Deprecation
  • Manual task
  • Migration

Description

Implement additional QoS metadata to convey server intent to calling clients

Check the box to generate changelog(s)

  • Generate changelog entry

return new QosReason(
Preconditions.checkNotNull(reason, "reason"),
Optional.ofNullable(retryHint),
Optional.ofNullable(dueTo));
Copy link
Contributor

@bjlaub bjlaub Oct 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it seems the presence of retryHint and dueTo imply some semantic meaning for the caller such that both must be set, and in fact QosReasons#parseFromResponse requires both to be set in order to parse either - but it's not clear from this API that you must set both. Should make that a requirement in the builder?

EDIT: ah whoops, parseFromResponse only requires one to be set, but I guess my question is the same - are there scenarios where one would be set, but not the other?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, I have a table in the design document where the combinations are described with internal examples


private static final String CLIENT_REASON = "client-qos-response";
private static final QosReason DEFAULT_CLIENT_REASON = QosReason.of(CLIENT_REASON);
private static final String DUE_TO_HEADER = "Qos-Due-To";
Copy link
Contributor

@bjlaub bjlaub Oct 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you want these to be all lowercase?

EDIT: disregard, just realized these are the header names, not the values

private static final String RETRY_HINT_HEADER = "Qos-Retry-Hint";

private static final String DUE_TO_CUSTOM_HEADER_VALUE = "custom";
private static final String RETRY_HINT_DO_NOT_RETRY_HEADER_VALUE = "do-not-retry";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style nit: i'm just curious why these are not encoded as e.g. a toString on the enum?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the separation of concerns with enum definitions in QosReason.java and the encoding pieces in the utility helper. Either way would be reasonable, but now that we have enum switch expressions, we can do a typesafe mapping with compile-time checks that we've associated a string with each enum value.

bjlaub
bjlaub previously approved these changes Oct 8, 2024
Copy link
Contributor

@bjlaub bjlaub left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

Comment on lines 153 to 179
/**
* Conveys the servers opinion on whether a QoS failure should be retried, or propagate
* back to the caller and result in a failure. There is no guarantee that these values
* will be respected by all clients, and should be considered best-effort.
*/
public enum RetryHint {
/**
* Clients should not attempt to retry this failure,
* providing the failure as context back to the initial caller.
*/
DO_NOT_RETRY;
}

/**
* Describes the cause of a QoS failure when known to be non-default. By default, we assume that
* a 503 Unavailable is the result of a node-wide limit being reached, and that a 429 is specific
* to an individual endpoint on a node. These assumptions do not hold true in all cases, so
* {@link DueTo} informs relays this intent.
*/
public enum DueTo {
/**
* A cause that the RPC system isn't directly aware of, for example a user or user-agent specific limit, or
* based on a specific resource that's being accessed, as opposed to the target node as a whole, or endpoint.
* QosReasons with this cause shouldn't impact things like the dialogue concurrency limiter.
*/
CUSTOM;
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should use conjure-style enums here, with an UNKNOWN variant that can pass through new unknown values. This way, future additions will be passed through the system even before all services have taken the latest version of cjr-api. Additionally, it allows folks to potentially define more descriptive DueTo reasons (though it may require some care to make that safe, as it may impact dialogue and friends)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think i'm struggling to understand the difference between UNKNOWN and CUSTOM. Is the concern that a client may receive a DueTo value from a newer version of c-j-r that they don't know how to deserialize?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think i'm struggling to understand the difference between UNKNOWN and CUSTOM.

I'm not entirely sure that it makes sense to declare both, in practice I think they could be considered equivalent by most consumers. I do prefer having well-defined values like CUSTOM for the common case, rather than using arbitrary string wrappers.
An open question I have is whether it would make sense to allow applications/libraries to define their own specific DueTo values using this mechanism -- such values would have to be treated the same as CUSTOM by Dialogue, but could provide better observability info.

Is the concern that a client may receive a DueTo value from a newer version of c-j-r that they don't know how to deserialize?

Yep, the current implementation will ignore the value as though it doesn't exist.

Comment on lines +69 to +71
public Optional<DueTo> dueTo() {
return dueTo;
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should become a Set<DueTo> dueTo() to support multiple reasons some day, keeps the door open.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, a downside of allowing multiple DueTo values is that adding it to metric tags would become unwieldy.

@carterkozak carterkozak changed the title [WIP] Implement additional QoS metadata Implement additional QoS metadata Oct 21, 2024
@carterkozak carterkozak marked this pull request as ready for review October 21, 2024 15:29
@bjlaub
Copy link
Contributor

bjlaub commented Oct 21, 2024

👍 lgtm

@bulldozer-bot bulldozer-bot bot merged commit 8011546 into develop Oct 21, 2024
5 checks passed
@bulldozer-bot bulldozer-bot bot deleted the ckozak/qos-metadata branch October 21, 2024 15:29
@autorelease3
Copy link

autorelease3 bot commented Oct 21, 2024

Released 2.54.0

carterkozak added a commit to palantir/conjure-java-runtime that referenced this pull request Oct 21, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants