
Conversation

@tillrohrmann
Contributor

What is the purpose of the change

Port CoLocationConstraintITCase to new code base.

This PR is based on #7689.

Brief change log

  • "support colocation constraints and slot sharing" --> JobExecutionITCase#testCoLocationConstraintJobExecution

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (no)
  • The public API, i.e., is any changed class annotated with @Public(Evolving): (no)
  • The serializers: (no)
  • The runtime per-record code paths (performance sensitive): (no)
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: (no)
  • The S3 file system connector: (no)

Documentation

  • Does this pull request introduce a new feature? (no)
  • If yes, how is the feature documented? (not applicable)

@flinkbot
Collaborator

flinkbot commented Feb 12, 2019

Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community to review your pull request. We will use this comment to track the progress of the review.

Review Progress

  • ✅ 1. The [description] looks good.
  • ✅ 2. There is [consensus] that the contribution should go into Flink.
  • ❔ 3. Needs [attention] from.
  • ✅ 4. The change fits into the overall [architecture].
  • ❌ 5. Overall code [quality] is good.

Please see the Pull Request Review Guide for a full explanation of the review process.

Bot commands

The @flinkbot bot supports the following commands:
  • @flinkbot approve description to approve the 1st aspect (similarly, it also supports the consensus, architecture and quality keywords)
  • @flinkbot approve all to approve all aspects
  • @flinkbot attention @username1 [@username2 ..] to require somebody's attention
  • @flinkbot disapprove architecture to remove an approval

receiver.setSlotSharingGroup(slotSharingGroup);
sender.setSlotSharingGroup(slotSharingGroup);

receiver.setStrictlyCoLocatedWith(sender);
Member

Even without L82-86 the test can pass. The reason is that some Sender/Receiver subtasks start and finish quickly. We could make sure that all Senders don't exit until all Receivers are running, maybe by using a CountDownLatch as in #6883.
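
A rough sketch of that idea (illustrative only; the latch wiring in #6883 and the invokable/constant names here are assumptions):

// Hypothetical latch shared by all subtasks inside the single MiniCluster JVM:
// every Receiver counts down once it is running, and every Sender waits on the
// latch before emitting, so no Sender can finish (and free its slot) before
// all Receivers are up.
static final CountDownLatch RECEIVERS_RUNNING = new CountDownLatch(PARALLELISM);

// In the Receiver invokable, right after it starts running:
RECEIVERS_RUNNING.countDown();

// In the Sender invokable, before emitting any records:
RECEIVERS_RUNNING.await();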

Contributor Author

This is true. We don't have real assertions making sure that the tasks are being co-located. The CountDownLatch would enforce that both tasks are online at the same time; I think this is not what we want to guarantee here. Instead we should test that the tasks are deployed in the same slot and, thus, use local channels for communication. Maybe a non-serializable record could do the trick here. I'll try it out.

Contributor Author

Hmm, this doesn't work because we always serialize records into a buffer, independent of the channel type. The only difference is whether the buffer goes through Netty or not, I think.

Contributor Author

I think this should be the solution: we can start the MiniCluster with only local communication enabled. That way we won't start Netty and the communication has to happen strictly locally :-).
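
Roughly what that setup could look like (the builder calls are the standard MiniClusterConfiguration ones; this thread only refers to the switch as localCommunication, so how exactly the cluster gets restricted to local data exchange is an assumption here):

final MiniClusterConfiguration miniClusterConfiguration = new MiniClusterConfiguration.Builder()
    .setNumTaskManagers(2)
    .setNumSlotsPerTaskManager(1)
    .build();

// Assumption: the cluster is started so that only local (in-process) data exchange
// is allowed, i.e. Netty is never started. Under that restriction the job can only
// succeed if the co-located Sender/Receiver tasks end up on the same TaskExecutor.
final MiniCluster miniCluster = new MiniCluster(miniClusterConfiguration);
miniCluster.start();
try {
    miniCluster.executeJobBlocking(jobGraph);
} finally {
    miniCluster.close();
}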

Member

makes sense :-)

Contributor

@tillrohrmann The test still succeeds even if local communication is set to false.

Contributor Author

@tillrohrmann Feb 13, 2019

It's expected that the test succeeds if localCommunication is set to false, because that is the less restrictive case. If localCommunication is true, the TaskManagers cannot talk to each other.

What you should try is to comment out the co-location constraint to see that the test fails, because that's what we are testing here.

Contributor

Of course, that makes more sense 🤦‍♂️. Unfortunately, the test still passes if the co-location constraint is removed. Based on the logs, the sender tasks finish before the receivers are even started, so we never run out of slots, which, as I understand it, is the failure condition here.

Contributor Author

As @zentol and I discussed offline, the test actually exercises not only the co-location constraints but also the input preferences of normal scheduling. Thus, one needs to remove the slot sharing as well in order to make this test fail.
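
For reference, a sketch of the topology the test builds (based on the diff hunk above; the invokable classes and the exact wiring are illustrative assumptions, not the verbatim test code):

final JobVertex sender = new JobVertex("Sender");
sender.setParallelism(parallelism);
sender.setInvokableClass(TestingAbstractInvokables.Sender.class);

final JobVertex receiver = new JobVertex("Receiver");
receiver.setParallelism(parallelism);
receiver.setInvokableClass(TestingAbstractInvokables.Receiver.class);

// Pointwise, pipelined connection: each Receiver reads from exactly one Sender.
receiver.connectNewDataSetAsInput(sender, DistributionPattern.POINTWISE, ResultPartitionType.PIPELINED);

// Slot sharing plus the input-locality preference of normal scheduling already
// place each Sender/Receiver pair together, so dropping only the co-location
// constraint is not enough to make the test fail; the shared slot sharing group
// has to be removed as well.
final SlotSharingGroup slotSharingGroup = new SlotSharingGroup();
sender.setSlotSharingGroup(slotSharingGroup);
receiver.setSlotSharingGroup(slotSharingGroup);
receiver.setStrictlyCoLocatedWith(sender);

final JobGraph jobGraph = new JobGraph("Co-location constraint test", sender, receiver);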

@zentol
Contributor

zentol commented Feb 13, 2019

@flinkbot approve description
@flinkbot approve consensus
@flinkbot approve architecture


Contributor

@zentol left a comment

@flinkbot approve quality

@tillrohrmann
Contributor Author

Thanks for the review @zentol. Merging once #7683 has been merged.

…ception, Deadline, long)

Properly pass the retryIntervalMillis to the sleep call.
- "recover a task manager failure" --> TaskExecutorITCase#testJobRecoveryWithFailingTaskExecutor
- "recover once failing forward job" --> JobRecoveryITCase#testTaskFailureRecovery
- "recover once failing forward job with slot sharing" --> JobRecoveryITCase#testTaskFailureWithSlotSharingRecovery

This closes apache#7683.
Increase the retry interval for TaskExecutorITCase and ZooKeeperLeaderElectionITCase
since they take so long that the low retry interval won't have a big effect apart
from a higher CPU load.
- "support colocation constraints and slot sharing" --> JobExecutionITCase#testCoLocationConstraintJobExecution

This closes apache#7690.