Instrument ConcurrencyLimitedChannel #443

iamdanfox · 2020-02-26T20:57:18Z

Before this PR

'Concurrency limiters' is a scary term to most people, and people fear what they don't understand. I'd like to have graphs to show when concurrency limiters are actually affecting people.

After this PR

==COMMIT_MSG==
Two new gauges, dialogue.concurrencylimiter.max and dialogue.concurrencylimiter.utilization, describe the state of each host's concurrency limiter.
==COMMIT_MSG==

~~do not merge until we've decided how to plumb in the service-name (otherwise if a service creates multiple dialogue clients each with 3 hosts, these gauges will flicker horribly).~~

Possible downsides?

more metrics = more DD $
Witchcraft's metric sampling is roughly every 30 seconds, so even if a DataDog graph shows low utilization, there's a possibility that it was spiking in between samples, which is kinda misleading.
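
To make the sampling concern concrete, here is a minimal sketch in plain Java. It is not Witchcraft or Dialogue code: the `SamplingSketch` class, the made-up utilization series and the 30-tick interval are all assumptions for illustration. It shows a gauge that is read every 30 ticks reporting low utilization even though the limiter was saturated in between.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only: not part of Dialogue or Witchcraft. */
final class SamplingSketch {
    private SamplingSketch() {}

    /** Reads every interval-th value, the way a periodic gauge scrape would. */
    static List<Double> sample(double[] series, int interval) {
        List<Double> samples = new ArrayList<>();
        for (int i = 0; i < series.length; i += interval) {
            samples.add(series[i]);
        }
        return samples;
    }

    /** The true peak of the series, which the samples may never observe. */
    static double max(double[] series) {
        double max = Double.NEGATIVE_INFINITY;
        for (double value : series) {
            max = Math.max(max, value);
        }
        return max;
    }

    /** Made-up utilization: 0.1 everywhere except a burst to 1.0 during ticks 10 to 19. */
    static double[] exampleSeries() {
        double[] series = new double[60];
        for (int i = 0; i < series.length; i++) {
            series[i] = (i >= 10 && i < 20) ? 1.0 : 0.1;
        }
        return series;
    }

    public static void main(String[] args) {
        double[] series = exampleSeries();
        System.out.println("sampled every 30 ticks: " + sample(series, 30));
        System.out.println("true max: " + max(series));
    }
}
```

In this made-up series both samples land outside the burst, so a graph built from them would stay flat at 0.1.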

changelog-app · 2020-02-26T20:57:23Z

Generate changelog in `changelog/@unreleased`

Type

Description

Two new gauges, dialogue.concurrencylimiter.max and dialogue.concurrencylimiter.utilization, describe the state of each host's concurrency limiter.

Check the box to generate changelog(s)

Generate changelog entry

carterkozak · 2020-02-26T21:09:32Z

dialogue-core/src/main/java/com/palantir/dialogue/core/Channels.java

-                .map(channel -> new FixedLimitedChannel(channel, MAX_REQUESTS_PER_CHANNEL, clientMetrics))
-                .collect(ImmutableList.toImmutableList());
+        ImmutableList.Builder<LimitedChannel> limitedChannels = ImmutableList.builder();
+        for (int hostIndex = 0; hostIndex < channels.size(); hostIndex++) {


Can we sort the input uris and index based on sorted order? That way if the input list changes order we don't shuffle metrics, and I think it's cleaner to consume a collection/iterable here.

At the moment we just get 'Channel' as the input type - it's not comparable and we can't correlate between any particular instances.

Thoughts on asking people to provide a ChannelWithId so we can do that sorting?

Ah right. Not sure channel IDs are helpful outside of this use case. We can sort by identity hash code and index from that list ¯_(ツ)_/¯

I think they'd also be useful for staying pinned on the same host after uris live-reload. Let me try something with identityHashCode.
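
For reference, a rough sketch of the "sort by identity hash code and index from that list" idea. This is not what the PR ended up doing, and `IdentityIndexSketch` and `indexByIdentity` are hypothetical names. Identity hash codes are not guaranteed to be unique and change between JVM runs, so the resulting indices only stay put for the same instances within one process.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: assigns each instance an index that ignores input order. */
final class IdentityIndexSketch {
    private IdentityIndexSketch() {}

    static <T> Map<T, Integer> indexByIdentity(List<T> channels) {
        // Copy first so the caller's list is not reordered.
        List<T> sorted = new ArrayList<>(channels);
        // Order by identity hash code instead of position in the input list.
        sorted.sort(Comparator.comparingInt(System::identityHashCode));

        // Keys are compared by reference, since channels are not assumed to define equals.
        Map<T, Integer> indices = new IdentityHashMap<>();
        for (int i = 0; i < sorted.size(); i++) {
            indices.put(sorted.get(i), i);
        }
        return indices;
    }
}
```

Passing the same instances in a different order yields the same indices (barring a hash collision), but a freshly created channel for an unchanged uri would get a new hash and could therefore land at a different index.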

carterkozak · 2020-02-26T21:13:56Z

dialogue-core/src/main/java/com/palantir/dialogue/core/ConcurrencyLimitedChannel.java

+        hostIndex.ifPresent(index -> {
+            DialogueConcurrencylimiterMetrics metrics = DialogueConcurrencylimiterMetrics.of(taggedMetrics);
+            // TODO(dfox): hook up the service-name somehow? also when nodes reshuffle these metrics will look odd.
+            metrics.utilization().hostIndex(Integer.toString(index)).build(this::getUtilization);


I'm not confident this works when we have multiple discovered services: each combined channel (alta both gatekeeper, for example) will have separate concurrency limited channels using the same gauge for index zero, so we don't know which one we're measuring.

Yep that's why I added the do not merge. I think the thing we're really interested in here is some stable identifier per host.

I don't think we can actually use hashcode because then a ton of uri reloads will result in an unbounded number of tags, even for the uris that don't change.

iamdanfox · 2020-02-27T23:31:08Z

@ellisjoe I think you were keen to avoid plumbing uris/channel ids deep into the internals of dialogue, so I'm trying to think of a way to produce these graphs (especially for the initial rollout), so that we can really clearly explain exactly what dialogue is doing at any point in time.

What if at Channels#create time, in addition to a List, we asked users to pass in an opaque List or something, and then just called methods on that interface to report stuff about a specific channel?

(Also I think it would be really helpful to articulate the dangers you see in plumbing uris / channel identifiers in some kind of architecture decision record).

stale · 2020-03-13T00:24:15Z

This PR has been automatically marked as stale because it has not been touched in the last 14 days. If you'd like to keep it open, please leave a comment or add the 'long-lived' label, otherwise it'll be closed in 7 days.

iamdanfox · 2020-03-16T15:07:16Z

Two open questions:

if people create clients with a high number of uris then this will create high DD tag cardinality.
if the same URI appears in multiple service-discovery-blocks e.g. one for multipass and one for gatekeeper, then the metrics will likely be confusing.

ferozco · 2020-03-16T15:13:07Z

dialogue-core/src/main/java/com/palantir/dialogue/core/ConcurrencyLimitedChannel.java

-    static ConcurrencyLimitedChannel create(LimitedChannel delegate, DialogueClientMetrics metrics) {
-        return new ConcurrencyLimitedChannel(
-                delegate, ConcurrencyLimitedChannel.createLimiter(SYSTEM_NANOTIME), metrics);
+    interface Instrumentation {


why do we need this extra interface? can we just pass a DialogueConcurrencyLimiterMetrics?

I had this originally, but it required passing in the string uri which was only used for instrumentation purposes (here's the commit where you can see this: 8a8139d).

This felt a bit gross, as the string uri could in theory be re-purposed for other stuff. Also felt odd to always be passing in the only possible tag value rather than just pre-combining them.

ferozco · 2020-03-16T17:07:13Z

dialogue-core/src/main/java/com/palantir/dialogue/core/ConcurrencyLimitedChannel.java

+        weakGauge(
+                taggedMetrics,
+                MetricName.builder()
+                        .safeName("dialogue.concurrencylimiter.utilization")


kinda sad that we can't use metric schema :(

Agreed, but I think it's ok to experiment with new APIs by just manually writing them. If we find this weak reference pattern is actually widespread and worth recommending, then we could bake it into tritium/metric-schema.

I think we could almost recommend everyone express gauges in terms of a T and a Function<T,Number> (and make this a first-class tritium feature) because in the current world I bet the majority of gauges are actually preventing things being GC'd!
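
A minimal sketch of that "T plus Function<T,Number>" shape, assuming nothing about Tritium's real API. `WeakGaugeSketch` and this `weakGauge` signature are hypothetical, and a plain `Supplier<Number>` stands in for a metrics-library gauge. The gauge holds the instance only through a `WeakReference`, so registering it does not keep the instance alive.

```java
import java.lang.ref.WeakReference;
import java.util.function.Function;
import java.util.function.Supplier;

/** Hypothetical sketch: not the helper added in this PR and not a Tritium API. */
final class WeakGaugeSketch {
    private WeakGaugeSketch() {}

    /**
     * Returns a gauge-like supplier that reads a value from the instance while it is
     * still reachable, and reports 0 once the instance has been garbage collected.
     * The reader must not capture the instance itself, or the weak reference is pointless.
     */
    static <T> Supplier<Number> weakGauge(T instance, Function<T, Number> reader) {
        WeakReference<T> ref = new WeakReference<>(instance);
        return () -> {
            T current = ref.get();
            if (current == null) {
                // Assumption: 0 is an acceptable "gone" value; a real registry might deregister instead.
                return Integer.valueOf(0);
            }
            return reader.apply(current);
        };
    }
}
```

Usage would look something like `weakGauge(channel, ConcurrencyLimitedChannel::getUtilization)`, where a method reference keeps the reader from closing over the channel.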

ferozco · 2020-03-16T17:14:38Z

dialogue-core/src/main/java/com/palantir/dialogue/core/DialogueChannel.java

@@ -97,13 +98,17 @@ public void updateUris(Collection<String> uris) {
        Sets.SetView<String> newUris = Sets.difference(uniqueUris, limitedChannelByUri.keySet());

        staleUris.forEach(limitedChannelByUri::remove);
-        newUris.forEach(uri -> limitedChannelByUri.put(uri, createLimitedChannel(uri)));
+        ImmutableList<String> allUris = ImmutableList.<String>builder()


These indices aren't stable across updates so it will probably be pretty confusing to look at individual metrics. Not sure of a better solution, but worth flagging

Yep. Given that it's not recommended to just hash things, I think this is the best we can do. In practise, I hope it'll be reasonably easy to see n coloured lines go into a live-reload event and then either n+1 or n-1 lines come out. People should be able to visually connect em up.
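
A small worked example of the index shift being discussed. It assumes, purely for illustration, that the index is a uri's position in the sorted, de-duplicated uri list; the diff above does not confirm the actual ordering. `UriIndexSketch`, `indexUris` and the host names are hypothetical.

```java
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

/** Hypothetical sketch: position-based tags are bounded in number but move on reload. */
final class UriIndexSketch {
    private UriIndexSketch() {}

    /** Maps each distinct uri to its position in sorted order (an assumed ordering). */
    static Map<String, Integer> indexUris(Collection<String> uris) {
        Map<String, Integer> indices = new LinkedHashMap<>();
        int index = 0;
        for (String uri : new TreeSet<>(uris)) {
            indices.put(uri, index++);
        }
        return indices;
    }

    public static void main(String[] args) {
        // Before the live-reload: two hosts.
        System.out.println(indexUris(List.of("https://host-a", "https://host-c")));
        // After the live-reload: host-b appears and host-c's index moves from 1 to 2.
        System.out.println(indexUris(List.of("https://host-a", "https://host-b", "https://host-c")));
    }
}
```

The number of distinct tag values is capped at the number of uris, which is the advantage over hashing, at the cost of host-c's line changing colour across the reload.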

ferozco · 2020-03-16T17:16:40Z

👍

svc-autorelease · 2020-03-16T17:17:36Z

Released 0.19.12

iamdanfox added 2 commits February 26, 2020 20:43

Instrument max & utilization

6238d90

Test the gauges

5e48615

iamdanfox added the do not merge label Feb 26, 2020

probot-autolabeler bot added the autorelease label Feb 26, 2020

./gradlew revapiAcceptAllBreaks

639c412

iamdanfox requested a review from carterkozak February 26, 2020 20:59

carterkozak reviewed Feb 26, 2020

stale bot added the stale label Mar 13, 2020

Merge remote-tracking branch 'origin/develop' into dfox/instrument-limiter

d6a197c

stale bot removed the stale label Mar 16, 2020

iamdanfox added 2 commits March 16, 2020 14:42

Plumb in actual hostname

f64d641

Add generated changelog entries

4ccf1bc

iamdanfox removed the do not merge label Mar 16, 2020

iamdanfox added 2 commits March 16, 2020 15:01

Separate interface emphasises uri is only for instrumentation

8a8139d

Less indirection

47663d6

ferozco reviewed Mar 16, 2020

ferozco reviewed Mar 16, 2020

iamdanfox added 3 commits March 16, 2020 16:12

use uri index instead of hashing

83c7f8d

Ditch the extra interface

c895f20

Weak reference

7543636

ferozco reviewed Mar 16, 2020

ferozco reviewed Mar 16, 2020

iamdanfox added the merge when ready label Mar 16, 2020

bulldozer-bot bot merged commit 19c392c into develop Mar 16, 2020

bulldozer-bot bot deleted the dfox/instrument-limiter branch March 16, 2020 17:17
