[FLINK-11587][tests] Port CoLocationConstraintITCase to new code base #7690
Conversation
Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community track the review progress. Please see the Pull Request Review Guide for a full explanation of the review process.
```java
receiver.setSlotSharingGroup(slotSharingGroup);
sender.setSlotSharingGroup(slotSharingGroup);

receiver.setStrictlyCoLocatedWith(sender);
```
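For readers without the full diff context, the surrounding test wiring looks roughly like this (a sketch only: the vertex names, the `parallelism` variable, and the `Sender`/`Receiver` invokable classes are illustrative, not the exact test code):

```java
import org.apache.flink.runtime.io.network.partition.ResultPartitionType;
import org.apache.flink.runtime.jobgraph.DistributionPattern;
import org.apache.flink.runtime.jobgraph.JobGraph;
import org.apache.flink.runtime.jobgraph.JobVertex;
import org.apache.flink.runtime.jobmanager.scheduler.SlotSharingGroup;

JobVertex sender = new JobVertex("Sender");
sender.setInvokableClass(Sender.class);      // illustrative test invokable
sender.setParallelism(parallelism);

JobVertex receiver = new JobVertex("Receiver");
receiver.setInvokableClass(Receiver.class);  // illustrative test invokable
receiver.setParallelism(parallelism);

// both vertices share one slot sharing group, and the receiver is
// additionally pinned to the slots of the sender
SlotSharingGroup slotSharingGroup = new SlotSharingGroup();
receiver.setSlotSharingGroup(slotSharingGroup);
sender.setSlotSharingGroup(slotSharingGroup);
receiver.setStrictlyCoLocatedWith(sender);

receiver.connectNewDataSetAsInput(sender, DistributionPattern.POINTWISE, ResultPartitionType.PIPELINED);

JobGraph jobGraph = new JobGraph("Co-location test job", sender, receiver);
```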
Even without L82-86 the test can pass. The reason is that some Sender/Receiver subtasks start and finish quickly. We could make sure that all Senders don't exit until all Receivers are running, for example by using a CountDownLatch like in #6883.
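A rough sketch of that latch idea (the class and constant names are made up; it relies on all tasks running in the same JVM, as they do in a MiniCluster):

```java
import java.util.concurrent.CountDownLatch;

import org.apache.flink.runtime.execution.Environment;
import org.apache.flink.runtime.jobgraph.tasks.AbstractInvokable;

public class CoLocationTestTasks {

    // hypothetical: must match the receiver parallelism configured by the test
    static final int PARALLELISM = 4;

    // counted down once per receiver subtask that has started running
    static final CountDownLatch ALL_RECEIVERS_RUNNING = new CountDownLatch(PARALLELISM);

    /** Receiver signals that it is up, then consumes its input. */
    public static class Receiver extends AbstractInvokable {
        public Receiver(Environment environment) {
            super(environment);
        }

        @Override
        public void invoke() throws Exception {
            ALL_RECEIVERS_RUNNING.countDown();
            // ... consume the records produced by the sender ...
        }
    }

    /** Sender keeps its slot occupied until every receiver is running. */
    public static class Sender extends AbstractInvokable {
        public Sender(Environment environment) {
            super(environment);
        }

        @Override
        public void invoke() throws Exception {
            // ... emit records ...
            ALL_RECEIVERS_RUNNING.await();
        }
    }
}
```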
This is true. We don't have real assertions making sure that the tasks are being co-located. The CountDownLatch would enforce that both tasks are online at the same time, but I think this is not what we want to guarantee here. Instead, we should test that the tasks are deployed into the same slot and, thus, use local channels for communication. Maybe a non-serializable record could do the trick here. I'll try it out.
Hmm, this doesn't work because we always serialize records into a buffer, independent of the channel type. The only difference is whether the buffer goes through Netty or not, I think.
I think this should be the solution: we can start the MiniCluster with only local communication enabled. That way we won't start Netty, and the communication needs to happen strictly locally :-).
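For context, a minimal sketch of what that setup could look like (the `parallelism` value and `jobGraph` are placeholders; the assumption is that a MiniCluster restricted to a single TaskManager can get away without starting Netty):

```java
import org.apache.flink.runtime.jobgraph.JobGraph;
import org.apache.flink.runtime.minicluster.MiniCluster;
import org.apache.flink.runtime.minicluster.MiniClusterConfiguration;

MiniClusterConfiguration cfg = new MiniClusterConfiguration.Builder()
    .setNumTaskManagers(1)                   // single TM => strictly local communication
    .setNumSlotsPerTaskManager(parallelism)  // just enough slots for the co-located pairs
    .build();

try (MiniCluster miniCluster = new MiniCluster(cfg)) {
    miniCluster.start();
    // would fail if the co-located tasks could not be scheduled locally
    miniCluster.executeJobBlocking(jobGraph);
}
```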
Makes sense :-)
@tillrohrmann The test still succeeds even if local communication is set to false.
It's expected that the test succeeds if localCommunication is set to false, because that is the less restrictive case. If localCommunication is true, the TMs cannot talk to each other.
What you should try instead is to comment out the co-location constraint and verify that the test fails, because that's what we are testing here.
Of course, that makes more sense 🤦. Unfortunately, the test still passes if the co-location constraint is removed. Based on the logs, the sender tasks finish before the receivers are even started, so we never run out of slots, which, as I understand it, is the failure condition here.
As @zentol and I discussed offline, the test actually exercises not only the co-location constraints but also the input preferences of normal scheduling. Thus, one needs to remove the slot sharing as well in order to make this test fail.
zentol left a comment
Force-pushed 002b80e to 5520ed9
zentol left a comment
@flinkbot approve quality
Force-pushed 97c19e1 to d6fa986
…ception, Deadline, long) Properly pass the retryIntervalMillis to the sleep call.
- "recover a task manager failure" --> TaskExecutorITCase#testJobRecoveryWithFailingTaskExecutor - "recover once failing forward job" --> JobRecoveryITCase#testTaskFailureRecovery - "recover once failing forward job with slot sharing" --> JobRecoveryITCase#testTaskFailureWithSlotSharingRecovery This closes apache#7683.
Increase the retry interval for TaskExecutorITCase and ZooKeeperLeaderElectionITCase since they take so long that the low retry interval won't have a big effect apart from a higher CPU load.
- "support colocation constraints and slot sharing" --> JobExecutionITCase#testCoLocationConstraintJobExecution This closes apache#7690.
Force-pushed d6fa986 to 3c34bfc
What is the purpose of the change
Port CoLocationConstraintITCase to the new code base. This PR is based on #7689.
Brief change log
Does this pull request potentially affect one of the following parts:
- @Public(Evolving): (no)

Documentation