[FLINK-12615][coordination] Track partitions on JM #8778
Conversation
Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community track the review progress. Please see the Pull Request Review Guide for a full explanation of the review process. The bot tracks the review progress through labels, which are applied according to the order of the review items. For consensus, approval by a Flink committer or PMC member is required. The @flinkbot bot also supports a number of commands.
```java
IntermediateResultPartition[] allPartitions = finishedPartition
	.getIntermediateResult().getPartitions();

PartitionTable<ResourceID> partitionTable = vertex.getExecutionGraph().getPartitionTable();
```
We could consider introducing a separate component, something like a PartitionTracker or manager, which would also encapsulate the ShuffleMaster (like JobAwareShuffleEnvironmentImpl). It could be responsible for doing what we have in this PR, and maybe even register with the ShuffleMaster and track producedPartitions by execution id as well.
Yes, we could do that. I am not sure whether we should do this now or defer it to a later point; currently we'd just end up wrapping the partition table and mirroring its API.
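For illustration, such a wrapper might initially be little more than delegation. This is a hypothetical sketch, not code from this PR; the PartitionTable method names follow the ones appearing elsewhere in this discussion:

```java
import java.util.Collection;

import org.apache.flink.runtime.clusterframework.types.ResourceID;
import org.apache.flink.runtime.io.network.partition.ResultPartitionID;
import org.apache.flink.runtime.taskexecutor.partition.PartitionTable;

// Hypothetical PartitionTracker that wraps the PartitionTable and mirrors its
// API, giving us a place to later fold in ShuffleMaster registration and
// per-execution tracking.
public class PartitionTracker {

	private final PartitionTable<ResourceID> partitionTable = new PartitionTable<>();

	public void startTrackingPartitions(ResourceID taskExecutorId, Collection<ResultPartitionID> partitionIds) {
		partitionTable.startTrackingPartitions(taskExecutorId, partitionIds);
	}

	public Collection<ResultPartitionID> stopTrackingPartitions(ResourceID taskExecutorId) {
		return partitionTable.stopTrackingPartitions(taskExecutorId);
	}

	public boolean hasTrackedPartitions(ResourceID taskExecutorId) {
		return partitionTable.hasTrackedPartitions(taskExecutorId);
	}
}
```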
Resolved review threads (outdated):
- flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/Execution.java
- flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/JobMaster.java
zhijiangW left a comment:
Thanks for opening this PR @zentol! I have finished reviewing the core code and left some comments. I will check the tests later.
@zhijiangW I don't see any comment of yours, did something go wrong when submitting the review?

I am not sure why they are not shown outside, but inside the code view we can see these comments. I will double-check and submit them again if necessary.

I can only see Andrey's comments :(
Resolved review threads (outdated):
- flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/ExecutionGraph.java
- flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/LegacyScheduler.java
- flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/Execution.java (2 threads)
```diff
 final Optional<ResourceID> resourceIdOptional = slotPool.failAllocation(allocationId, cause);
-resourceIdOptional.ifPresent(this::releaseEmptyTaskManager);
+resourceIdOptional.ifPresent(taskManagerId -> {
+	if (!partitionTable.hasTrackedPartitions(taskManagerId)) {
```
nit: if we put the if (!partitionTable.hasTrackedPartitions(taskManagerId)) check inside releaseEmptyTaskManager, we could reuse it should the release be needed in other places in the future (see the sketch below).
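A minimal sketch of that refactoring, assuming JobMaster fields and a FlinkException message that are not shown in this PR:

```java
// Move the partition check into releaseEmptyTaskManager itself, so any future
// caller gets the partition-aware behaviour for free. Surrounding fields
// (partitionTable, disconnectTaskManager) are assumed from context.
private void releaseEmptyTaskManager(ResourceID taskManagerId) {
	// only disconnect once no partitions are hosted on this task executor anymore
	if (!partitionTable.hasTrackedPartitions(taskManagerId)) {
		disconnectTaskManager(
			taskManagerId,
			new FlinkException(String.format("No more slots or partitions registered at TaskExecutor %s.", taskManagerId)));
	}
}
```

The call site would then shrink back to `resourceIdOptional.ifPresent(this::releaseEmptyTaskManager);`.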
I entered the commit ids in the brief changelog above and left my inline comments on those commits, so the comments were only visible inside them. I have now re-submitted them in the usual way.
Rebased on #8789.
azagrebin left a comment:
Thanks for the PR @zentol, I only had minor comments atm.
Resolved review threads (outdated):
- flink-runtime/src/main/java/org/apache/flink/runtime/taskexecutor/partition/PartitionTable.java (2 threads)
- flink-runtime/src/test/java/org/apache/flink/runtime/executiongraph/ExecutionTest.java
- flink-runtime/src/test/java/org/apache/flink/runtime/jobmaster/JobMasterTest.java
- flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/Execution.java (2 threads)
Force-pushed ab52653 to 5b237fe.
Resolved review threads (outdated):
- flink-runtime/src/main/java/org/apache/flink/runtime/taskexecutor/partition/PartitionTable.java
- flink-runtime/src/test/java/org/apache/flink/runtime/jobmaster/JobMasterTest.java
- flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/JobMaster.java
```java
if (newJobStatus.isGloballyTerminalState()) {
	// other terminal job states are handled by the executions
	if (newJobStatus == JobStatus.FINISHED) {
```
From my understanding, we can only consider FINISHED because when transitioning to CANCELED or FAILED, all execution vertices will be canceled and thus the result partitions will be released via sendReleaseIntermediateResultPartitionsRpcCall.
However, with region failover, an execution may need to re-run after it has turned FINISHED. This happens in cases like A -> C, B -> C: A and B have both finished, but C fails to read data from A, so B also needs to re-execute. When the execution vertex is resetForNewExecution, the result partition of the previous execution is not released. Then, when we cancel or failGlobal on the ExecutionGraph, only the latest execution is considered, so the result partitions of the previous executions may not get released explicitly.
Failover scenarios are not covered by this PR; it only adds a bare-bones implementation for cleaning up after the job terminates.
You are correct that we have to release partitions for reset vertices; this will be added in a follow-up.
Force-pushed b53dad9 to 8494e4f.
I think that forcePartitionReleaseOnConsumption should actually go into ResultPartitionDeploymentDescriptor, not ResultPartitionFactory:

```java
this.releasedOnConsumption = partitionDescriptor.getPartitionType() != ResultPartitionType.BLOCKING || forcePartitionReleaseOnConsumption;
```

Otherwise we always activate releasedOnConsumption for blocking partitions. What is the point of forcePartitionReleaseOnConsumption in ResultPartitionFactory then? We would basically track/release them for nothing in the JM by default after FLINK-12615.
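A sketch of what that suggestion could look like; the constructor shape and the accessor are assumptions, only the releasedOnConsumption expression comes from the comment above:

```java
import org.apache.flink.runtime.io.network.partition.ResultPartitionType;
import org.apache.flink.runtime.shuffle.PartitionDescriptor;

// Compute releasedOnConsumption when the deployment descriptor is built:
// pipelined partitions are always released on consumption, blocking partitions
// only when release is explicitly forced.
public class ResultPartitionDeploymentDescriptor {

	private final PartitionDescriptor partitionDescriptor;
	private final boolean releasedOnConsumption;

	public ResultPartitionDeploymentDescriptor(
			PartitionDescriptor partitionDescriptor,
			boolean forcePartitionReleaseOnConsumption) {
		this.partitionDescriptor = partitionDescriptor;
		this.releasedOnConsumption =
			partitionDescriptor.getPartitionType() != ResultPartitionType.BLOCKING
				|| forcePartitionReleaseOnConsumption;
	}

	public boolean isReleasedOnConsumption() {
		return releasedOnConsumption;
	}
}
```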
```java
}

private void startTrackingUnreleasedPartitions(final Collection<IntermediateResultPartition> partitions) {
	PartitionTable<ResourceID> partitionTable = vertex.getExecutionGraph().getPartitionTable();
```
inconsistent use of `final`
```java
for (Map.Entry<ResourceID, Tuple2<TaskManagerLocation, TaskExecutorGateway>> entry : registeredTaskManagers.entrySet()) {
	Collection<ResultPartitionID> storedPartitions = partitionTable.stopTrackingPartitions(entry.getKey());
	// if this call fails TaskExecutors will cleanup partitions regardless once we close the connections
	entry.getValue().f1.releasePartitions(jobGraph.getJobID(), storedPartitions);
```
What about the ShuffleMaster.releasePartitionExternally plans here? As discussed, we will need it in general.
Ah yes, I got stuck here since the ShuffleMaster requires a ShuffleDescriptor, but so far we've only been tracking ResultPartitionIDs.
I have updated the PR to include a PartitionTracker which keeps track of shuffle descriptors, result partition ids etc., and acts as a central point for issuing release calls.
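As a rough sketch, the central release call could then cover both sides. The method name and surrounding fields are assumptions; the gateway call matches the releasePartitions(JobID, Collection<ResultPartitionID>) usage shown earlier in this thread:

```java
// Release a set of tracked partitions both on the task executor that hosts
// them and externally via the shuffle master. jobId, shuffleMaster and
// taskExecutorGatewayLookup are assumed tracker fields.
private void internalReleasePartitions(
		ResourceID taskExecutorId,
		Collection<ResultPartitionDeploymentDescriptor> partitions) {

	Collection<ResultPartitionID> partitionIds = new ArrayList<>();
	for (ResultPartitionDeploymentDescriptor partition : partitions) {
		ShuffleDescriptor shuffleDescriptor = partition.getShuffleDescriptor();
		partitionIds.add(shuffleDescriptor.getResultPartitionID());
		// release resources the shuffle implementation holds outside the TE
		shuffleMaster.releasePartitionExternally(shuffleDescriptor);
	}

	// release local resources (e.g. files) on the task executor, if still reachable
	taskExecutorGatewayLookup.lookup(taskExecutorId)
		.ifPresent(gateway -> gateway.releasePartitions(jobId, partitionIds));
}
```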
Force-pushed 0be4d1d to 67417d3.
```java
/**
 * Utility for tracking partitions and issuing release calls to task executors and shuffle masters.
 */
public class PartitionTracker {
```
We should introduce an interface for this (see the sketch below).
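A rough sketch of that extraction; the exact method set is an assumption based on the calls visible in this PR:

```java
import org.apache.flink.runtime.clusterframework.types.ResourceID;
import org.apache.flink.runtime.deployment.ResultPartitionDeploymentDescriptor;
import org.apache.flink.runtime.executiongraph.ExecutionAttemptID;

// Interface so that JobMaster and Execution depend on the abstraction rather
// than the concrete tracker implementation.
public interface PartitionTracker {

	/** Starts tracking the given partition, produced on the given task executor. */
	void startTrackingPartition(ResourceID producingTaskExecutorId, ResultPartitionDeploymentDescriptor partition);

	/** Stops tracking and releases all partitions of the given producer. */
	void stopTrackingAndReleasePartitions(ExecutionAttemptID producerId);

	/** Stops tracking and releases all partitions on the given task executor. */
	void stopTrackingAndReleaseAllPartitions(ResourceID taskExecutorId);

	/** Returns whether any partitions on the given task executor are still tracked. */
	boolean isTrackingPartitionsFor(ResourceID taskExecutorId);
}
```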
Force-pushed 10416a0 to cf7c5a9.
```diff
 		IntermediateResultPartitionID id) {
-	return Optional.ofNullable(producedPartitions.get(id));
+	Collection<ResultPartitionDeploymentDescriptor> trackedDeploymentDescriptors = vertex.getExecutionGraph().getPartitionTracker().getTrackedDeploymentDescriptors(getAttemptId());
+	for (ResultPartitionDeploymentDescriptor resultPartitionDeploymentDescriptor : trackedDeploymentDescriptors) {
```
I hate this bit, but I really don't want to introduce an additional map just to support this lookup. Any ideas?
I do not really see a problem atm with having it in InternalPartitionInfo, with a similar add-partition method and a Collection -> Map change.
Anyway, I think the method rather belongs on the tracker, with the exec id added to the signature (see the sketch below).
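Something along these lines; a hypothetical signature, not code from this PR:

```java
// Addition to the PartitionTracker: the lookup moves out of Execution, with
// the producing attempt id made explicit in the signature.
Optional<ResultPartitionDeploymentDescriptor> getTrackedDeploymentDescriptor(
	ExecutionAttemptID producerId,
	IntermediateResultPartitionID partitionId);
```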
azagrebin left a comment:
Thanks @zentol! The tracker looks good. I left some comments to resolve before merging.
```java
/**
 * Lookup function for {@link TaskExecutorGateway}.
 */
interface TaskExecutorGatewayLookup {
```
`@FunctionalInterface`
```java
/**
 * Factory for {@link PartitionTracker}.
 */
public interface PartitionTrackerFactory {
```
```java
private final JobID jobId;

private final Map<ExecutionAttemptID, InternalPartitionInfo> partitionInfoByProducer = new HashMap<>();
```
I would prefer all initialisations in the constructor
```java
private final ShuffleMaster<?> shuffleMaster;

private PartitionTrackerFactory.TaskExecutorGatewayLookup taskExecutorGatewayLookup;
```
This field should be `final`. Also, is there any value in the explicit prefixing over a static import of PartitionTrackerFactory?
```java
	final ProducerDescriptor producerDescriptor,
	final ResultPartitionDeploymentDescriptorFactory resultPartitionDeploymentDescriptorFactory) {

CompletableFuture<? extends ShuffleDescriptor> shuffleDescriptorFuture = shuffleMaster.registerPartitionWithProducer(partitionDescriptor, producerDescriptor);
```
nit: any value in this local var?
```java
 *
 * @param producerId producer id to determine relevant partitions
 */
void stopTrackingAndReleasePartitions(final ExecutionAttemptID producerId);
```
Atm it stops tracking everything for the execution id and releases what needs releasing. I think the comment needs adjustment and the function could be called stopTrackingAndReleasePartitionsForExecution.
Same for the next method, which stops tracking everything for the task executor and releases what needs releasing: stopTrackingAndReleasePartitionsOnTaskExecutor ("release all" is a bit confusing).
```java
 * @param taskExecutorId id of the task executor to look up.
 * @return optional task executor gateway
 */
Optional<TaskExecutorGateway> lookup(ResourceID taskExecutorId);
```
I would prefer a partition releaser:

```java
interface TaskExecutorPartitionReleaser {
	void releasePartitions(ResourceID taskExecutorId, JobID jobId, Collection<ResultPartitionID> partitionIds);
}
```

It could also return a boolean for whether the TE was found, although that is not used anyway. Do we need the full TaskExecutorGateway for anything?
```diff
 		Logger log,
-		ShuffleMaster<?> shuffleMaster) throws JobExecutionException, JobException {
+		PartitionTracker partitionTracker)
+			throws JobExecutionException, JobException {
```
keep formatting
```java
	}
}
final PartitionTracker partitionTracker = getVertex().getExecutionGraph().getPartitionTracker();
partitionTracker.stopTrackingAndReleasePartitions(getAttemptId());
```
Looks like we are still doing it in cancel and suspend.
```java
// other terminal job states are handled by the executions
if (newJobStatus == JobStatus.FINISHED) {
	runAsync(() -> {
		for (Map.Entry<ResourceID, Tuple2<TaskManagerLocation, TaskExecutorGateway>> entry : registeredTaskManagers.entrySet()) {
```
```java
registeredTaskManagers.keySet().forEach(partitionTracker::stopTrackingAndReleaseAllPartitions);
```
Based on #8687. When reviewing this PR, please ignore all commits preceding "[FLINK-12615][coordination] Support generic key in PartitionTable".
What is the purpose of the change
With this PR we track on the JobMaster side which partitions are still on task executors, maintain the connection to a task executor until all partitions on it have been released, and issue a release call for all partitions when a job terminates.
Attention: The release logic in this PR only serves as an MVP. It ignores the shuffle master (and thus will not work with other shuffle service implementations) and is not viable for high-volume jobs (because it requires all blocking partitions to be persisted on disk until the job has finished).
Brief changelog
6009f00 modifies the PartitionTable to support arbitrary keys; so far it was hard-wired to use JobIDs as keys for use on the TE side, but now we want to organize partitions by ResourceID (== TE ID) instead (see the sketch after this list).
c7193dc introduces a PartitionTable into the JM, Scheduler and EG. This commit only modifies constructor-related methods and does not contain any actual logic; it only exists to make reviewing easier.
ea418da contains the tracking logic. Partitions are added to the table, grouped by task executor, by an Execution when it reaches the FINISHED state, and removed again by the JM when disassociating from a TE.
3087792 introduces a call to TE#releasePartitions for when a job is finished.
d5a9f3c finally modifies the check for empty task managers to take partitions into account.
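For reference, a minimal sketch of the generified PartitionTable from 6009f00; the method names follow those appearing in this PR, while the exact implementation is an assumption:

```java
import java.util.Collection;
import java.util.Collections;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

import org.apache.flink.runtime.io.network.partition.ResultPartitionID;

// PartitionTable keyed by an arbitrary type K: the TE keys it by JobID, the
// JM keys it by the task executor's ResourceID.
public class PartitionTable<K> {

	private final Map<K, Set<ResultPartitionID>> trackedPartitions = new ConcurrentHashMap<>(8);

	/** Returns whether any partition is tracked for the given key. */
	public boolean hasTrackedPartitions(K key) {
		return trackedPartitions.containsKey(key);
	}

	/** Starts tracking the given partitions under the given key. */
	public void startTrackingPartitions(K key, Collection<ResultPartitionID> newPartitionIds) {
		trackedPartitions.compute(key, (ignored, partitionIds) -> {
			if (partitionIds == null) {
				partitionIds = new HashSet<>(8);
			}
			partitionIds.addAll(newPartitionIds);
			return partitionIds;
		});
	}

	/** Stops tracking all partitions for the given key and returns them. */
	public Collection<ResultPartitionID> stopTrackingPartitions(K key) {
		Set<ResultPartitionID> storedPartitions = trackedPartitions.remove(key);
		return storedPartitions == null ? Collections.emptyList() : storedPartitions;
	}
}
```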