[FLINK-12284][Network,Metrics]Fix the incorrect inputBufferUsage metric in credit-based network mode #8455

Aitozi · 2019-05-15T16:14:27Z

What is the purpose of the change

This PR is to fix the bug in the calculation of inputBufferUsage. It does't now take the exclusive buffer which is assigned from global buffer pool into account.

Brief change log

(for example:)

The TaskInfo is stored in the blob store on job creation time as a persistent artifact
Deployments RPC transmits only the blob storage reference
TaskManagers retrieve the TaskInfo from the blob cache

Verifying this change

(Please pick either of the following options)

This change is a trivial rework / code cleanup without any test coverage.

(or)

This change is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(example:)

Added integration tests for end-to-end deployment with large payloads (100MB)
Extended integration test for recovery after master (JobManager) failure
Added test that validates that TaskInfo is transferred only once across recoveries
Manually verified the change by running a 4 node cluser with 2 JobManagers and 4 TaskManagers, a stateful streaming program, and killing one JobManager and two TaskManagers during the execution, verifying that recovery happens correctly.

Does this pull request potentially affect one of the following parts:

Dependencies (does it add or upgrade a dependency): (yes / no)
The public API, i.e., is any changed class annotated with @Public(Evolving): (yes / no)
The serializers: (yes / no / don't know)
The runtime per-record code paths (performance sensitive): (yes / no / don't know)
Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: (yes / no / don't know)
The S3 file system connector: (yes / no / don't know)

Documentation

Does this pull request introduce a new feature? (yes / no)
If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented)

Aitozi · 2019-05-15T16:15:48Z

Hi @zhijiangW could you help take a look on this PR when you are free, thanks.

flinkbot · 2019-05-15T16:16:25Z

Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
to review your pull request. We will use this comment to track the progress of the review.

Review Progress

❓ 1. The [description] looks good.
❓ 2. There is [consensus] that the contribution should go into to Flink.
❓ 3. Needs [attention] from.
❓ 4. The change fits into the overall [architecture].
❓ 5. Overall code [quality] is good.

Please see the Pull Request Review Guide for a full explanation of the review process.

Details

The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer of PMC member is required

Bot commands

The @flinkbot bot supports the following commands:

@flinkbot approve description to approve one or more aspects (aspects: description, consensus, architecture and quality)
@flinkbot approve all to approve all aspects
@flinkbot approve-until architecture to approve everything until architecture
@flinkbot attention @username1 [@username2 ..] to require somebody's attention
@flinkbot disapprove architecture to remove an approval you gave earlier

zhijiangW

Thanks for concerning this issue and opening the PR @Aitozi !

Actually I also considered this issue before, but have not confirmed the proper way to fix this. I could share some my thoughts:

The current inputBufferUsage is only reflecting the usage of floating buffers, and it could give an indicator whether the setting of floating-buffers-per-gate is enough for performance tuning.
We could always assume if the exclusive buffers are enough for one channel, it will not request floating buffers. In other words, if we see the floating buffers are used, that means the exclusive buffers should also be used.
The semantics of inputBufferUsage is also changed after credit-based mode. The previous usage indicates the buffer is actually filled with network data. Now the usage is only indicating the floating buffer is removed from LocalBufferPool to RemoteInputChannel, but the buffer might still be available, not filled with data.
The metric purpose should be guidable for performance tuning I think. If we mixed the usages of exclusive and floating buffers together, it might not bring any help for performance tuning. E.g. if the usage is 100% after merging, we could easy to estimate exclusive/floating are both used. But now we could also think so based on my second point. If the usage is 50% or other, we still need to further calculate how many floating are used in all.

It is indeed loss of some metrics to help performance tuning better in credit-based mode. Such as

how to measure the exclusive usages
how to measure the floating distribution among channels which could give feedback whether the current mechanism of floating buffers is proper.
how to measure whether the sender in network is blocked by insufficient credit feedback which could help further improve the setting and internal rule.

I plan to focus on these issues future, maybe not covered in next release. For the issue of this PR I would confirm with other guys and feedback to you later.

FYI, the commit title should cover [network, metrics] for better recognization, and exists netowork typo. :)

Aitozi · 2019-05-16T15:13:48Z

Thanks @zhijiangW for sharing your thoughts on this issue.
Although the semantics of inputBufferUsage means the buffer occupied by RemoteChannel which may fill with data or not as you mentioned in credit-based mode in the moment, But I think this PR taking the exclusive buffer into account, which will make the semantics of this more close to the original intention.
I'am also ok to stay the same until the time of introducing new network related metrics. I think we are short of more detail metric about credit-based network mode too.
Thanks for remind of the PR title format, I will pay more attention to this.:)

zhijiangW · 2019-05-24T09:41:47Z

@pnowojski What do you think of this issue?

pnowojski · 2019-05-24T14:47:08Z

The current inputBufferUsage is only reflecting the usage of floating buffers, and it could give an indicator whether the setting of floating-buffers-per-gate is enough for performance tuning.

I agree, there is a value of the current semantic.

I would even build upon that. If you want to detect bottleneck, current semantic of tracking just floating buffers is perfect for that. If current operator (reading from this input) is causing a bottleneck it well can be that only a single one input channel is back-pressured. At the moment we could detect this situation with "all floating buffers are exhausted/used".

We could always assume if the exclusive buffers are enough for one channel, it will not request floating buffers. In other words, if we see the floating buffers are used, that means the exclusive buffers should also be used.

This is not true for even slight data skew.

I think that there is a value of tracking separately floating and exclusive buffers.

Maybe we should make inPoolUsage track floating + exclusive buffers (it would make it consistent with non credit based mode), while we could add inFloatingPoolUsage metric with current semantic? What do you think?

Aitozi · 2019-05-24T15:55:35Z

Hi @pnowojski , what is the reason for the second point ?

This is not true for even slight data skew

And +1 for adding the inFloatingPoolUsage specialized for detecting bottleneck.

zhijiangW · 2019-05-27T03:58:46Z

Thanks for the confirmation and good suggestions. @pnowojski

In other words, if we see the floating buffers are used, that means the exclusive buffers should also be used.
This is not true for even slight data skew

Maybe my previous expression is not very clear. If floating buffers are used, I only mean the exclusive buffers for one/some RemoteInputChannels are also expected to be used eventually, but not indicate the exclusive buffers for all the channels are used.

For example, if the producer's backlog is 1, we would always request another 1 floating buffer even though the 2 exclusive buffers for this channel are available atm. Because we want to feedback some overhead credits beforehand in order not to block the network transport after producing more backlog soon. So it is not strong consistent for exclusive buffers used in time, might be eventual consistent within our expectation. If the backlog is becoming 0 from 1, the previous requested floating buffer would also be released by this channel if the 2 exclusive buffers are still available. So from the aspect of one input channel, it would not occupy extra floating buffers if its available exclusive buffers are enough.

I also agree with the above suggestions for distinguishing the total inPoolUsage and floatingInPoolUsage, or we could only retain exclusiveInPoolUsage and floatingInPoolUsage for credit-based mode.

Aitozi · 2019-05-27T05:01:42Z

Hi, @zhijiangW thanks for your detail explanation, I make a conclusion below
Non-credit-based mode:

InPoolUsage

Credit-based mode:

exclusiveInPoolUsage exclusiveBufferUsed/exclusiveBufferTotal(all initial credit)
floatingInPoolUsage floatingBufferUsed/floatingBufferTotal
InPoolUsage combine with exclusiveInPoolUsage and floatingInPoolUsage

I think this looks better to distinguish the buffer usage between floating and exclusive in credit-based mode and also give a view of the overall perspective with inPoolUsage metric. WDYT @pnowojski @zhijiangW

pnowojski · 2019-05-27T09:01:26Z

@Aitozi :

I think your proposal makes sense. Having 3 metrics for credit based flow control sounds as the best, the most explicit solution for me.

Hi @pnowojski , what is the reason for the second point ?

Which second point did you mean @Aitozi ? Or did @zhijiangW answer your question?

@zhijiangW :

Maybe my previous expression is not very clear. If floating buffers are used, I only mean the exclusive buffers for one/some RemoteInputChannels are also expected to be used eventually, but not indicate the exclusive buffers for all the channels are used.

👍

I meant that from the perspective of such data skew, just knowledge of floating buffers usage is not enough for performance tuning.

If you see that floating buffers usage is constantly jumping, between empty & full, it means that you should probably increase the either floating buffer pool or exclusive buffer pool. Now depending if you have a data skew and mostly unused exclusive buffers, you would be better off increasing floating buffer pool and potentially shrinking exclusive. While if you do not have data skew (and mostly used exclusive buffer pool % a couple of idling channels), you would be better off increasing exclusive buffer pool.

Aitozi · 2019-05-27T09:46:29Z

OK, I will open another issue additionally to track and will update this PR accordingly, thanks.

Aitozi · 2019-05-27T16:28:38Z

Have added the three metric for the credit based mode, please take a look when you are free @zhijiangW @pnowojski

pnowojski

Thanks a lot @Aitozi for the update :) Generally speaking code looks conceptually OK. Left couple of comments.

pnowojski · 2019-05-29T10:09:51Z

.../main/java/org/apache/flink/runtime/io/network/metrics/CreditBasedInputBufferPoolMetric.java

+import org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate;
+
+/**
+ * Gauge metric measuring the input buffer pool usage gauge for {@link SingleInputGate}s under credit based mode.


typo: metrics?

pnowojski · 2019-05-29T10:18:20Z

flink-runtime/src/main/java/org/apache/flink/runtime/io/network/NetworkEnvironment.java

 	private static final String METRIC_OUTPUT_POOL_USAGE = "outPoolUsage";
 	private static final String METRIC_INPUT_QUEUE_LENGTH = "inputQueueLength";
 	private static final String METRIC_INPUT_POOL_USAGE = "inPoolUsage";
+	private static final String METRIC_CREDIT_BASED_FLOATING_BUFFER_POOL_USAGE = "floatingBufferUsage";


Maybe a nit:

I'm not a native english speaker so I might be wrong, but shouldn't this be floatingBuffersUsage? And ditto in other places? It seems to me like the correct spelling are the following forms/contexts:

buffer pool usage (singular since its a single "buffer pool")

red buffers usage (multiple buffers)

pnowojski · 2019-05-29T11:07:57Z

.../main/java/org/apache/flink/runtime/io/network/metrics/CreditBasedInputBufferPoolMetric.java

+	/**
+	 * Gauge metric measuring the floating buffer usage for {@link SingleInputGate}.
+	 */
+	public static class FloatingBufferPoolUsage implements Gauge<Float> {


Could we deduplicate the code here a bit? For example we could introduce InputPoolUsageGauge that would contain the following code/loop:

protected abstract int calculateUsedBuffers(InputGate); protected abstract int calculateBufferPoolSize(InputGate); @Override public Float getValue() { int usedBuffers = 0; int bufferPoolSize = 0; for (SingleInputGate inputGate : inputGates) { usedBuffers += calculateUsedBuffers(inputGate) bufferPoolSize += calculateBufferPoolSize(inputGate); } if (bufferPoolSize != 0) { return ((float) usedBuffers) / bufferPoolSize; } else { return 0.0f; } }

That could be extended/implemented by InputBufferPoolUsageGauge and the classes that you introduced.

Also it would be nice to express somehow the total usage as FloatingBufferPoolUsage + ExclusiveBufferPoolUsage instead of duplicating the logic.

It's a good suggestion, I will follow it.

pnowojski · 2019-05-29T11:09:58Z

...src/main/java/org/apache/flink/runtime/io/network/partition/consumer/RemoteInputChannel.java

 		return initialCredit;
 	}

+	public int unsafeGetExclusiveBufferUsed() {


unsynchronized instead of unsafe to match existing unsynchronizedGetNumberOfQueuedBuffers?

ExclusiveBuffers

pnowojski · 2019-05-29T11:10:05Z

.../main/java/org/apache/flink/runtime/io/network/metrics/CreditBasedInputBufferPoolMetric.java

+/**
+ * Gauge metric measuring the input buffer pool usage gauge for {@link SingleInputGate}s under credit based mode.
+ */
+public class CreditBasedInputBufferPoolMetric {


Can you add tests for those classes?

zhijiangW · 2019-05-30T03:54:41Z

...src/main/java/org/apache/flink/runtime/io/network/partition/consumer/RemoteInputChannel.java

+		return Math.max(0, initialCredit - bufferQueue.exclusiveBuffers.size());
+	}
+
+	public int unsafeGetFloatingBufferLeft() {


FloatingBuffers
Left->Available?

zhijiangW · 2019-05-30T04:11:40Z

.../main/java/org/apache/flink/runtime/io/network/metrics/CreditBasedInputBufferPoolMetric.java

+			if (requestedFloatingBuffer == 0) {
+				return 0.0f;
+			} else {
+				return 1 - ((float) leftFloatingBuffer) / requestedFloatingBuffer;


I think it is not very accurate here.
It should be (requestedFloatingBuffer - leftFloatingBuffer) / LocalBufferPool.getNumBuffers

zhijiangW · 2019-05-30T04:16:39Z

.../main/java/org/apache/flink/runtime/io/network/metrics/CreditBasedInputBufferPoolMetric.java

+		}
+
+		@Override
+		public Float getValue() {


We might extract the logics of above exclusiveBuffersUsage and floatingBuffersUsage into separate methods, then these methods could be reused here to get final results directly, to avoid below detail logics again.

zhijiangW · 2019-05-30T04:20:22Z

flink-runtime/src/main/java/org/apache/flink/runtime/io/network/NetworkEnvironment.java

 		buffersGroup.gauge(METRIC_INPUT_QUEUE_LENGTH, new InputBuffersGauge(inputGates));
-		buffersGroup.gauge(METRIC_INPUT_POOL_USAGE, new InputBufferPoolUsageGauge(inputGates));
+
+		if (config.isCreditBased()) {


For exclusive case there is no buffer pool actually. So maybe we could break down the new class names into ExclusiveBuffersUsageGauge, FloatingBuffersUsageGauge, InputBuffersUsageGauge instead?

But considering the metric name compatibility with before, we might still use METRIC_INPUT_POOL_USAGE in credit-based mode to correspond with InputBuffersUsageGauge.

I don't have a strong feelings, but maybe keeping the name of the gauge in sync with the metric name is a better idea? (I'm fine either way)

zhijiangW · 2019-05-30T04:28:13Z

flink-runtime/src/main/java/org/apache/flink/runtime/io/network/NetworkEnvironment.java

 	private static final String METRIC_OUTPUT_POOL_USAGE = "outPoolUsage";
 	private static final String METRIC_INPUT_QUEUE_LENGTH = "inputQueueLength";
 	private static final String METRIC_INPUT_POOL_USAGE = "inPoolUsage";
+	private static final String METRIC_CREDIT_BASED_FLOATING_BUFFER_POOL_USAGE = "floatingBufferUsage";


I think it is better to not add the keywords CREDIT-BASED here, because after we remove the deprecated network codes finally, it is no need to emphasize this term any more. The terms of exclusive/floating already indicate that.
Also it should be consistent between variable name and value, like METRIC_FLOATING_BUFFERS_USAGE = "floatingBuffersUsage". ditto for the below case of exclusive.

zhijiangW · 2019-05-30T04:34:19Z

Thanks for the updates @Aitozi ! I left some inline comments.

Aitozi · 2019-06-02T10:15:12Z

Hi @zhijiangW @pnowojski I have addressed your comments. About the inPoolUsage name, I use the name for both two mode to be consist with previous style. And i squash the commit history solving the conflicts, please help review when you are free, I will continue to solve another PR after this one has been approved to avoid conflicts.

pnowojski

Couple of more comments from my side :)

pnowojski · 2019-06-04T13:37:54Z

flink-runtime/src/main/java/org/apache/flink/runtime/io/network/NetworkEnvironment.java

 		buffersGroup.gauge(METRIC_INPUT_QUEUE_LENGTH, new InputBuffersGauge(inputGates));
-		buffersGroup.gauge(METRIC_INPUT_POOL_USAGE, new InputBufferPoolUsageGauge(inputGates));
+
+		if (config.isCreditBased()) {


I don't have a strong feelings, but maybe keeping the name of the gauge in sync with the metric name is a better idea? (I'm fine either way)

...est/java/org/apache/flink/runtime/io/network/partition/consumer/InputBuffersMetricsTest.java

Aitozi · 2019-06-14T10:17:16Z

Sorry for the late response @pnowojski , I will solve the comments you mentioned.

Aitozi · 2019-06-15T17:39:52Z

Hi @pnowojski , I have adjust the test case according to your suggestion, please take a look when you have time, Thanks.

Aitozi · 2019-07-01T13:50:29Z

Ping @zhijiangW

...est/java/org/apache/flink/runtime/io/network/partition/consumer/InputBuffersMetricsTest.java

zhijiangW · 2019-07-02T03:56:59Z

Thanks for the updates @Aitozi ! I left some other comments for tests.

Aitozi · 2019-07-02T12:15:17Z

Very thanks for @zhijiangW 's carefully review, please take a look again, hoping this is the last look :)

zhijiangW · 2019-07-03T02:54:56Z

...est/java/org/apache/flink/runtime/io/network/partition/consumer/InputBuffersMetricsTest.java

+		}
+	}
+
+	private Tuple3<SingleInputGate, List<RemoteInputChannel>, List<LocalInputChannel>> buildInputGate(


we could only return Tuple2<SingleInputGate, List<RemoteInputChannel>> instead because the list of local channels would not be used any more.

zhijiangW · 2019-07-03T03:00:48Z

...est/java/org/apache/flink/runtime/io/network/partition/consumer/InputBuffersMetricsTest.java

+			.buildRemoteAndSetToGate(inputGate);
+	}
+
+	private LocalInputChannel buildLocalChannel(


make it void method after adjusting tuple2 above.

zhijiangW · 2019-07-03T03:03:43Z

Thanks for the patience and addressing my comments @Aitozi !

I think you might miss one previous comment for using tuple2 instead of tuple3. After fixing this I have no other concerns. 👍 LGTM!

…ic in credit-based network mode

Aitozi · 2019-07-03T05:54:15Z

Ping @pnowojski

pnowojski · 2019-07-03T09:46:51Z

Thanks for the fix @Aitozi and your patience with our reviews (especially that the scope of this ticket has grown significantly from your initial version). Merging as I can see that on your private travis this is already green.

rmetzger added review=description? component=Runtime/Network labels May 15, 2019

zhijiangW reviewed May 16, 2019

View reviewed changes

Aitozi changed the title ~~[FLINK-12284]Fix the incorrect inputBufferUsage metric in credit-based netowork mode~~ [FLINK-12284][Network,Metrics]Fix the incorrect inputBufferUsage metric in credit-based network mode May 16, 2019

pnowojski requested changes May 29, 2019

View reviewed changes

zhijiangW reviewed May 30, 2019

View reviewed changes

Aitozi force-pushed the inpoolusage_fix branch from 2b78438 to 7b39e0b Compare June 2, 2019 10:07

Aitozi changed the title ~~[FLINK-12284][Network,Metrics]Fix the incorrect inputBufferUsage metric in credit-based network mode~~ [FLINK-12284,FLINK-12637][Network,Metrics]Fix the incorrect inputBufferUsage metric in credit-based network mode Jun 2, 2019

rmetzger added the component=Runtime/Metrics label Jun 2, 2019

pnowojski requested changes Jun 4, 2019

View reviewed changes

Aitozi force-pushed the inpoolusage_fix branch from 971d6a7 to 65465bc Compare June 15, 2019 17:36

Aitozi force-pushed the inpoolusage_fix branch from f1d2f8c to 5b846b3 Compare July 1, 2019 13:41

Aitozi changed the title ~~[FLINK-12284,FLINK-12637][Network,Metrics]Fix the incorrect inputBufferUsage metric in credit-based network mode~~ [FLINK-12284][Network,Metrics]Fix the incorrect inputBufferUsage metric in credit-based network mode Jul 1, 2019

Aitozi force-pushed the inpoolusage_fix branch 2 times, most recently from 220d01e to a43de92 Compare July 1, 2019 13:50