@tillrohrmann
Contributor

What is the purpose of the change

Port TaskManagerFailsITCase to new code base.

Brief change log

  • "detect a failing task manager" --> JobMaster#testHeartbeatTimeoutWithTaskManager

  • "handle gracefully failing task manager" --> JobMasterTest#testJobFailureWhenGracefulTaskExecutorTermination

  • "handle hard failing task manager" --> JobMasterTest#testJobFailureWhenTaskExecutorHeartbeatTimeout

  • "go into a clean state in case of a TaskManager failure" --> TaskExecutorITCase#testNewTaskExecutorJoinsCluster

Verifying this change

  • Run ported tests

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (no)
  • The public API, i.e., is any changed class annotated with @Public(Evolving): (no)
  • The serializers: (no)
  • The runtime per-record code paths (performance sensitive): (no)
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: (no)
  • The S3 file system connector: (no)

Documentation

  • Does this pull request introduce a new feature? (no)
  • If yes, how is the feature documented? (not applicable)

@flinkbot
Collaborator

flinkbot commented Feb 11, 2019

Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
to review your pull request. We will use this comment to track the progress of the review.

Review Progress

  • ✅ 1. The [description] looks good.
    • Approved by @GJL [committer]
  • ✅ 2. There is [consensus] that the contribution should go into Flink.
    • Approved by @GJL [committer]
  • ❔ 3. Needs [attention] from.
  • ✅ 4. The change fits into the overall [architecture].
    • Approved by @GJL [committer]
  • ✅ 5. Overall code [quality] is good.
    • Approved by @GJL [committer]

Please see the Pull Request Review Guide for a full explanation of the review process.

Details

Bot commands The @flinkbot bot supports the following commands:
  • @flinkbot approve description to approve the 1st aspect (similarly, it also supports the consensus, architecture and quality keywords)
  • @flinkbot approve all to approve all aspects
  • @flinkbot attention @username1 [@username2 ..] to require somebody's attention
  • @flinkbot disapprove architecture to remove an approval

@GJL GJL self-requested a review February 11, 2019 16:13
@GJL GJL self-assigned this Feb 11, 2019
@GJL
Member

GJL commented Feb 11, 2019

@flinkbot approve description
@flinkbot approve consensus

}

@Override
public void startTaskExecutor(boolean localCommunication) throws Exception {
Member

This doesn't look like a user-friendly API. Shouldn't MiniCluster set this flag depending on the configuration? Wouldn't it be enough to expose a signature such as public void startTaskExecutor()?

Contributor Author

Well, MiniCluster cannot know at start time whether to use local communication, because TestingMiniCluster allows starting new TaskExecutors later. Thus, the only option would be to always set localCommunication to false in the case of the MiniCluster.

Contributor Author

I introduced a new method useLocalCommunication which can be overridden by the TestingMiniCluster to always set local communication to false.
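
The hook described in this comment can be sketched roughly as follows. This is a simplified illustration, not the actual Flink code: SketchMiniCluster and SketchTestingMiniCluster are hypothetical stand-ins, the "single TaskExecutor" default rule is an assumption made for the sketch, and only the method name useLocalCommunication comes from the comment above.

```java
// Hypothetical stand-in for MiniCluster; not the real Flink class.
class SketchMiniCluster {
    private final int numTaskExecutors;

    SketchMiniCluster(int numTaskExecutors) {
        this.numTaskExecutors = numTaskExecutors;
    }

    // Assumed default rule for this sketch: use local communication
    // only when the cluster was configured with a single TaskExecutor.
    protected boolean useLocalCommunication() {
        return numTaskExecutors == 1;
    }

    // Reports which communication mode a newly started TaskExecutor would use.
    String startTaskExecutor() {
        return useLocalCommunication() ? "local" : "remote";
    }
}

// Hypothetical stand-in for TestingMiniCluster.
class SketchTestingMiniCluster extends SketchMiniCluster {
    SketchTestingMiniCluster(int numTaskExecutors) {
        super(numTaskExecutors);
    }

    // More TaskExecutors may be started later, so never assume local communication.
    @Override
    protected boolean useLocalCommunication() {
        return false;
    }
}
```

With this shape, callers only ever invoke startTaskExecutor(), and the decision about the flag stays inside the cluster class hierarchy.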

@tillrohrmann
Contributor Author

Thanks for the review @GJL. I've addressed your comments and pushed a fixup.

}

private void runJobFailureWhenTaskExecutorTerminatesTest(
Supplier<HeartbeatServices> heartbeatSupplier,
Member

Does HeartbeatServices have to be lazily supplied?

Suggested change
Supplier<HeartbeatServices> heartbeatSupplier,
HeartbeatServices heartbeatServices,

Contributor Author

No, this is not correct. Will change.
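
To illustrate why the supplier is unnecessary here, the following minimal sketch compares the two parameter styles. The Services class is a hypothetical placeholder for HeartbeatServices, and both helper methods are invented for this example. When the supplier is invoked exactly once and immediately, passing the value directly behaves the same and is simpler to call.

```java
import java.util.function.Supplier;

class HeartbeatParamSketch {
    // Hypothetical placeholder for HeartbeatServices; only carries a timeout.
    static final class Services {
        final long timeoutMs;

        Services(long timeoutMs) {
            this.timeoutMs = timeoutMs;
        }
    }

    // Before: the supplier is resolved right away, so laziness buys nothing.
    static long timeoutViaSupplier(Supplier<Services> supplier) {
        return supplier.get().timeoutMs;
    }

    // After: the already-constructed value is passed directly.
    static long timeoutDirect(Services services) {
        return services.timeoutMs;
    }
}
```

A supplier would only pay off if the value were expensive to create or had to be created after some setup step inside the helper, which is not the case in the reviewed test.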

jobGraph,
haServices,
new TestingJobManagerSharedServicesBuilder().build(),
heartbeatSupplier.get(),
Member

Suggested change
heartbeatSupplier.get(),
heartbeatServices,

public void testJobFailureWhenTaskExecutorHeartbeatTimeout() throws Exception {
final AtomicBoolean respondToHeartbeats = new AtomicBoolean(true);
runJobFailureWhenTaskExecutorTerminatesTest(
() -> fastHeartbeatServices,
Member

Suggested change
() -> fastHeartbeatServices,
fastHeartbeatServices,

@Test
public void testJobFailureWhenGracefulTaskExecutorTermination() throws Exception {
runJobFailureWhenTaskExecutorTerminatesTest(
() -> heartbeatServices,
Member

Suggested change
() -> heartbeatServices,
heartbeatServices,

private static final class TestingOnCompletionActions implements OnCompletionActions {

private final CompletableFuture<ArchivedExecutionGraph> jobReachedGloballyTerminalStateFuture = new CompletableFuture<>();
private final CompletableFuture<Void> jobFinishedByOtherFuture = new CompletableFuture<>();
Member

Is this unused field intended for future extensions?

Contributor Author

yes.

* Create a new {@link TerminatingFatalErrorHandler} for the {@link TaskExecutor} with
* the given index.
*
* @param index into the {{@link #taskManagers}} collection to identify the correct {@link TaskExecutor}.
Member

Are the double curly braces {{}} intended?

Contributor Author

Unintended. Will change it.
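
For reference, a Javadoc inline tag takes a single pair of braces. The sketch below shows the corrected form on a hypothetical class; JavadocLinkSketch and its members are invented for illustration, and only the taskManagers field name is taken from the reviewed comment.

```java
import java.util.ArrayList;
import java.util.List;

class JavadocLinkSketch {
    /** Hypothetical collection standing in for the taskManagers field. */
    private final List<String> taskManagers = new ArrayList<>();

    /** Registers a name so that it can be looked up by index later. */
    void add(String name) {
        taskManagers.add(name);
    }

    /**
     * Returns the entry at the given position.
     *
     * @param index into the {@link #taskManagers} collection to identify the correct entry
     */
    String get(int index) {
        return taskManagers.get(index);
    }
}
```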

}

@Override
public void startTaskExecutor(boolean localCommunication) throws Exception {
Member

@GJL GJL Feb 13, 2019

Is it crucial for testing to be able to set the right localCommunication flag? If yes, a method overload that sets localCommunication to false would have been enough but I am not insisting on it.

edit: alternatively always use false with no option to override (if possible)

Contributor Author

Are you suggesting to always start the TaskExecutors with localCommunication = false? Or only for the TestingMiniCluster? The latter should now be the case.

Contributor Author

Actually, this functionality was needed for #7690.
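
The overload mentioned earlier in this thread could look roughly like the sketch below. OverloadSketch is a hypothetical class, and the recorded-flag field exists only so the example can be checked; the real startTaskExecutor would start a TaskExecutor rather than store a value.

```java
class OverloadSketch {
    // Records the flag of the most recent start call, for illustration only.
    private Boolean lastLocalCommunication;

    // Explicit variant: the caller chooses the communication mode.
    void startTaskExecutor(boolean localCommunication) {
        lastLocalCommunication = localCommunication;
    }

    // Convenience overload: defaults to non-local communication.
    void startTaskExecutor() {
        startTaskExecutor(false);
    }

    // Returns null until a TaskExecutor has been started.
    Boolean lastLocalCommunication() {
        return lastLocalCommunication;
    }
}
```

Keeping the boolean variant alongside the no-argument overload leaves room for callers that do need to pick the mode explicitly.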

@tillrohrmann
Contributor Author

Thanks for the second round of review @GJL. I addressed your comments except for the last one for which I didn't understand your proposal yet.

@GJL
Member

GJL commented Feb 13, 2019

@flinkbot approve all

TaskExecutorITCase is actually covered by TaskExecutorTest#testOfferSlotToJobMasterAfterTimeout.
- "detect a failing task manager" --> JobMaster#testHeartbeatTimeoutWithTaskManager

- "handle gracefully failing task manager" --> JobMasterTest#testJobFailureWhenGracefulTaskExecutorTermination

- "handle hard failing task manager" --> JobMasterTest#testJobFailureWhenTaskExecutorHeartbeatTimeout

- "go into a clean state in case of a TaskManager failure" --> TaskExecutorITCase#testNewTaskExecutorJoinsCluster

This closes apache#7676.
@tillrohrmann
Contributor Author

Thanks for the review @GJL. Merging once Travis gives green light.

@tillrohrmann tillrohrmann deleted the FLINK-11364 branch February 13, 2019 14:58
asfgit pushed a commit that referenced this pull request Feb 13, 2019
This closes #7676.