[FLINK-11587][tests] Port CoLocationConstraintITCase to new code base #7690
Closed
4 commits
b364e3d [hotfix][tests] Fix CommonTestUtils#waitUntilCondition(SupplierWithEx… (tillrohrmann)
b59e532 [FLINK-11486][tests] Port RecoveryITCase to new code base (tillrohrmann)
04ba9b7 [hotfix][tests] Harden tests using CommonTestUtils.waitUntilCondition (tillrohrmann)
3c34bfc [FLINK-11587][tests] Port CoLocationConstraintITCase to new code base (tillrohrmann)
151 changes: 151 additions & 0 deletions
flink-runtime/src/test/java/org/apache/flink/runtime/jobmaster/JobExecutionITCase.java
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,151 @@ | ||
| /* | ||
| * Licensed to the Apache Software Foundation (ASF) under one | ||
| * or more contributor license agreements. See the NOTICE file | ||
| * distributed with this work for additional information | ||
| * regarding copyright ownership. The ASF licenses this file | ||
| * to you under the Apache License, Version 2.0 (the | ||
| * "License"); you may not use this file except in compliance | ||
| * with the License. You may obtain a copy of the License at | ||
| * | ||
| * http://www.apache.org/licenses/LICENSE-2.0 | ||
| * | ||
| * Unless required by applicable law or agreed to in writing, software | ||
| * distributed under the License is distributed on an "AS IS" BASIS, | ||
| * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
| * See the License for the specific language governing permissions and | ||
| * limitations under the License. | ||
| */ | ||
| | ||
| package org.apache.flink.runtime.jobmaster; | ||
| | ||
| import org.apache.flink.runtime.execution.Environment; | ||
| import org.apache.flink.runtime.io.network.api.reader.RecordReader; | ||
| import org.apache.flink.runtime.io.network.api.writer.RecordWriter; | ||
| import org.apache.flink.runtime.io.network.partition.ResultPartitionType; | ||
| import org.apache.flink.runtime.jobgraph.DistributionPattern; | ||
| import org.apache.flink.runtime.jobgraph.JobGraph; | ||
| import org.apache.flink.runtime.jobgraph.JobVertex; | ||
| import org.apache.flink.runtime.jobgraph.tasks.AbstractInvokable; | ||
| import org.apache.flink.runtime.jobmanager.scheduler.SlotSharingGroup; | ||
| import org.apache.flink.runtime.minicluster.TestingMiniCluster; | ||
| import org.apache.flink.runtime.minicluster.TestingMiniClusterConfiguration; | ||
| import org.apache.flink.types.IntValue; | ||
| import org.apache.flink.util.TestLogger; | ||
| | ||
| import org.junit.Test; | ||
| | ||
| import java.util.concurrent.CompletableFuture; | ||
| | ||
| import static org.hamcrest.Matchers.is; | ||
| import static org.junit.Assert.assertThat; | ||
| | ||
| /** | ||
| * Integration tests for job scheduling. | ||
| */ | ||
| public class JobExecutionITCase extends TestLogger { | ||
| | ||
| /** | ||
| * Tests that tasks with a co-location constraint are scheduled in the same | ||
| * slots. In fact it also tests that consumers are scheduled wrt their input | ||
| * location if the co-location constraint is deactivated. | ||
| */ | ||
| @Test | ||
| public void testCoLocationConstraintJobExecution() throws Exception { | ||
| final int numSlotsPerTaskExecutor = 1; | ||
| final int numTaskExecutors = 3; | ||
| final int parallelism = numTaskExecutors * numSlotsPerTaskExecutor; | ||
| final JobGraph jobGraph = createJobGraph(parallelism); | ||
| | ||
| final TestingMiniClusterConfiguration miniClusterConfiguration = new TestingMiniClusterConfiguration.Builder() | ||
| .setNumSlotsPerTaskManager(numSlotsPerTaskExecutor) | ||
| .setNumTaskManagers(numTaskExecutors) | ||
| .setLocalCommunication(true) | ||
| .build(); | ||
| | ||
| try (TestingMiniCluster miniCluster = new TestingMiniCluster(miniClusterConfiguration)) { | ||
| miniCluster.start(); | ||
| | ||
| miniCluster.submitJob(jobGraph).get(); | ||
| | ||
| final CompletableFuture<JobResult> jobResultFuture = miniCluster.requestJobResult(jobGraph.getJobID()); | ||
| | ||
| assertThat(jobResultFuture.get().isSuccess(), is(true)); | ||
| } | ||
| } | ||
| | ||
| private JobGraph createJobGraph(int parallelism) { | ||
| final JobVertex sender = new JobVertex("Sender"); | ||
| sender.setParallelism(parallelism); | ||
| sender.setInvokableClass(Sender.class); | ||
| | ||
| final JobVertex receiver = new JobVertex("Receiver"); | ||
| receiver.setParallelism(parallelism); | ||
| receiver.setInvokableClass(Receiver.class); | ||
| | ||
| // In order to make testCoLocationConstraintJobExecution fail, one needs to | ||
| // remove the co-location constraint and the slot sharing groups, because then | ||
| // the receivers will have to wait for the senders to finish and the slot | ||
| // assignment order to the receivers is non-deterministic (depending on the | ||
| // order in which the senders finish). | ||
| final SlotSharingGroup slotSharingGroup = new SlotSharingGroup(); | ||
| receiver.setSlotSharingGroup(slotSharingGroup); | ||
| sender.setSlotSharingGroup(slotSharingGroup); | ||
| receiver.setStrictlyCoLocatedWith(sender); | ||
| | ||
| receiver.connectNewDataSetAsInput(sender, DistributionPattern.POINTWISE, ResultPartitionType.PIPELINED); | ||
| | ||
| final JobGraph jobGraph = new JobGraph(getClass().getSimpleName(), sender, receiver); | ||
| | ||
| return jobGraph; | ||
| } | ||
| | ||
| /** | ||
| * Basic sender {@link AbstractInvokable} which sends 42 and 1337 down stream. | ||
| */ | ||
| public static class Sender extends AbstractInvokable { | ||
| | ||
| public Sender(Environment environment) { | ||
| super(environment); | ||
| } | ||
| | ||
| @Override | ||
| public void invoke() throws Exception { | ||
| final RecordWriter<IntValue> writer = new RecordWriter<>(getEnvironment().getWriter(0)); | ||
| | ||
| try { | ||
| writer.emit(new IntValue(42)); | ||
| writer.emit(new IntValue(1337)); | ||
| writer.flushAll(); | ||
| } finally { | ||
| writer.clearBuffers(); | ||
| } | ||
| } | ||
| } | ||
| | ||
| /** | ||
| * Basic receiver {@link AbstractInvokable} which verifies the sent elements | ||
| * from the {@link Sender}. | ||
| */ | ||
| public static class Receiver extends AbstractInvokable { | ||
| | ||
| public Receiver(Environment environment) { | ||
| super(environment); | ||
| } | ||
| | ||
| @Override | ||
| public void invoke() throws Exception { | ||
| final RecordReader<IntValue> reader = new RecordReader<>( | ||
| getEnvironment().getInputGate(0), | ||
| IntValue.class, | ||
| getEnvironment().getTaskManagerInfo().getTmpDirectories()); | ||
| | ||
| final IntValue i1 = reader.next(); | ||
| final IntValue i2 = reader.next(); | ||
| final IntValue i3 = reader.next(); | ||
| | ||
| if (i1.getValue() != 42 || i2.getValue() != 1337 || i3 != null) { | ||
| throw new Exception("Wrong data received."); | ||
| } | ||
| } | ||
| } | ||
| } | ||
133 changes: 133 additions & 0 deletions
flink-runtime/src/test/java/org/apache/flink/runtime/jobmaster/JobRecoveryITCase.java
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,133 @@ | ||
| /* | ||
| * Licensed to the Apache Software Foundation (ASF) under one | ||
| * or more contributor license agreements. See the NOTICE file | ||
| * distributed with this work for additional information | ||
| * regarding copyright ownership. The ASF licenses this file | ||
| * to you under the Apache License, Version 2.0 (the | ||
| * "License"); you may not use this file except in compliance | ||
| * with the License. You may obtain a copy of the License at | ||
| * | ||
| * http://www.apache.org/licenses/LICENSE-2.0 | ||
| * | ||
| * Unless required by applicable law or agreed to in writing, software | ||
| * distributed under the License is distributed on an "AS IS" BASIS, | ||
| * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
| * See the License for the specific language governing permissions and | ||
| * limitations under the License. | ||
| */ | ||
| | ||
| package org.apache.flink.runtime.jobmaster; | ||
| | ||
| import org.apache.flink.api.common.ExecutionConfig; | ||
| import org.apache.flink.api.common.restartstrategy.RestartStrategies; | ||
| import org.apache.flink.runtime.execution.Environment; | ||
| import org.apache.flink.runtime.io.network.partition.ResultPartitionType; | ||
| import org.apache.flink.runtime.jobgraph.DistributionPattern; | ||
| import org.apache.flink.runtime.jobgraph.JobGraph; | ||
| import org.apache.flink.runtime.jobgraph.JobVertex; | ||
| import org.apache.flink.runtime.jobmanager.Tasks; | ||
| import org.apache.flink.runtime.jobmanager.scheduler.SlotSharingGroup; | ||
| import org.apache.flink.runtime.minicluster.MiniCluster; | ||
| import org.apache.flink.runtime.testutils.MiniClusterResource; | ||
| import org.apache.flink.runtime.testutils.MiniClusterResourceConfiguration; | ||
| import org.apache.flink.util.FlinkRuntimeException; | ||
| import org.apache.flink.util.TestLogger; | ||
| | ||
| import org.junit.ClassRule; | ||
| import org.junit.Test; | ||
| | ||
| import java.io.IOException; | ||
| import java.util.concurrent.CompletableFuture; | ||
| | ||
| import static org.hamcrest.Matchers.is; | ||
| import static org.junit.Assert.assertThat; | ||
| | ||
| /** | ||
| * Tests for the recovery of task failures. | ||
| */ | ||
| public class JobRecoveryITCase extends TestLogger { | ||
| | ||
| private static final int NUM_TMS = 1; | ||
| private static final int SLOTS_PER_TM = 11; | ||
| private static final int PARALLELISM = NUM_TMS * SLOTS_PER_TM; | ||
| | ||
| @ClassRule | ||
| public static final MiniClusterResource MINI_CLUSTER_RESOURCE = new MiniClusterResource( | ||
| new MiniClusterResourceConfiguration.Builder() | ||
| .setNumberTaskManagers(NUM_TMS) | ||
| .setNumberSlotsPerTaskManager(SLOTS_PER_TM) | ||
| .build()); | ||
| | ||
| @Test | ||
| public void testTaskFailureRecovery() throws Exception { | ||
| runTaskFailureRecoveryTest(createjobGraph(false)); | ||
| } | ||
| | ||
| @Test | ||
| public void testTaskFailureWithSlotSharingRecovery() throws Exception { | ||
| runTaskFailureRecoveryTest(createjobGraph(true)); | ||
| } | ||
| | ||
| private void runTaskFailureRecoveryTest(final JobGraph jobGraph) throws Exception { | ||
| final MiniCluster miniCluster = MINI_CLUSTER_RESOURCE.getMiniCluster(); | ||
| | ||
| miniCluster.submitJob(jobGraph).get(); | ||
| | ||
| final CompletableFuture<JobResult> jobResultFuture = miniCluster.requestJobResult(jobGraph.getJobID()); | ||
| | ||
| assertThat(jobResultFuture.get().isSuccess(), is(true)); | ||
| } | ||
| | ||
| private JobGraph createjobGraph(boolean slotSharingEnabled) throws IOException { | ||
| final JobVertex sender = new JobVertex("Sender"); | ||
| sender.setParallelism(PARALLELISM); | ||
| sender.setInvokableClass(Tasks.Sender.class); | ||
| | ||
| final JobVertex receiver = new JobVertex("Receiver"); | ||
| receiver.setParallelism(PARALLELISM); | ||
| receiver.setInvokableClass(FailingOnceReceiver.class); | ||
| FailingOnceReceiver.reset(); | ||
| | ||
| if (slotSharingEnabled) { | ||
| final SlotSharingGroup slotSharingGroup = new SlotSharingGroup(); | ||
| receiver.setSlotSharingGroup(slotSharingGroup); | ||
| sender.setSlotSharingGroup(slotSharingGroup); | ||
| } | ||
| | ||
| receiver.connectNewDataSetAsInput(sender, DistributionPattern.POINTWISE, ResultPartitionType.PIPELINED); | ||
| | ||
| final ExecutionConfig executionConfig = new ExecutionConfig(); | ||
| executionConfig.setRestartStrategy(RestartStrategies.fixedDelayRestart(1, 0L)); | ||
| | ||
| final JobGraph jobGraph = new JobGraph(getClass().getSimpleName(), sender, receiver); | ||
| jobGraph.setExecutionConfig(executionConfig); | ||
| | ||
| return jobGraph; | ||
| } | ||
| | ||
| /** | ||
| * Receiver which fails once before successfully completing. | ||
| */ | ||
| public static final class FailingOnceReceiver extends JobExecutionITCase.Receiver { | ||
| | ||
| private static volatile boolean failed = false; | ||
| | ||
| public FailingOnceReceiver(Environment environment) { | ||
| super(environment); | ||
| } | ||
| | ||
| @Override | ||
| public void invoke() throws Exception { | ||
| if (!failed && getEnvironment().getTaskInfo().getIndexOfThisSubtask() == 0) { | ||
| failed = true; | ||
| throw new FlinkRuntimeException(getClass().getSimpleName()); | ||
| } else { | ||
| super.invoke(); | ||
| } | ||
| } | ||
| | ||
| private static void reset() { | ||
| failed = false; | ||
| } | ||
| } | ||
| } |
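
The fail-once pattern used by FailingOnceReceiver, combined with a restart strategy that allows exactly one restart with no delay, can be sketched in plain Java. This is only an illustration under stated assumptions: `FailOnceSketch`, `invokeTask`, and `runWithRestarts` are hypothetical names, the retry loop is a stand-in for Flink's restart handling rather than how Flink implements it, and an `AtomicBoolean` replaces the `volatile boolean` of the diff so that the check-and-set is atomic even if several threads call it.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch, not code from the PR.
final class FailOnceSketch {

    // Shared across attempts, like the static flag in FailingOnceReceiver.
    private static final AtomicBoolean FAILED = new AtomicBoolean(false);
    private static final AtomicInteger ATTEMPTS = new AtomicInteger();

    // Must be called before each run, mirroring FailingOnceReceiver.reset().
    static void reset() {
        FAILED.set(false);
        ATTEMPTS.set(0);
    }

    // Throws on the first call after reset(), succeeds on every later call.
    static void invokeTask() {
        ATTEMPTS.incrementAndGet();
        if (FAILED.compareAndSet(false, true)) {
            throw new IllegalStateException("injected failure");
        }
    }

    // Runs the task and restarts it at most maxRestarts times without delay.
    // Returns true if some attempt succeeded.
    static boolean runWithRestarts(int maxRestarts) {
        reset();
        for (int attempt = 0; attempt <= maxRestarts; attempt++) {
            try {
                invokeTask();
                return true;
            } catch (IllegalStateException e) {
                // Swallow the injected failure and try again if restarts remain.
            }
        }
        return false;
    }

    static int attempts() {
        return ATTEMPTS.get();
    }

    public static void main(String[] args) {
        System.out.println("one restart allowed: " + runWithRestarts(1));
        System.out.println("no restart allowed: " + runWithRestarts(0));
    }
}
```

With one restart allowed the second attempt succeeds; with none, the injected failure is final. Forgetting the reset between runs would let the first attempt succeed and the recovery path would never be exercised, which is presumably why the diff calls `FailingOnceReceiver.reset()` when building the job graph.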
Even without L82-86 the test can pass. The reason is that some Sender/Receiver subtasks start and finish quickly. We could make sure that no Sender exits until all Receivers are running, maybe by using a CountDownLatch as in #6883.
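
The latch idea from this comment could be sketched as follows. It is only an illustration with plain threads standing in for Flink tasks; `LatchSketch`, `run`, and `receiversRunning` are hypothetical names and not code from this PR or from #6883.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: threads play the role of Sender and Receiver subtasks.
final class LatchSketch {

    // Starts `parallelism` senders and receivers (parallelism must be positive)
    // and returns how many senders were released after all receivers had started.
    static int run(int parallelism) {
        final CountDownLatch receiversRunning = new CountDownLatch(parallelism);
        final AtomicInteger sendersReleased = new AtomicInteger();
        // Enough threads for every sender and receiver to run at the same time.
        final ExecutorService pool = Executors.newFixedThreadPool(2 * parallelism);

        for (int i = 0; i < parallelism; i++) {
            pool.execute(() -> {
                try {
                    // A sender does not exit before every receiver has counted down.
                    receiversRunning.await();
                    sendersReleased.incrementAndGet();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        for (int i = 0; i < parallelism; i++) {
            // A receiver signals that it is running.
            pool.execute(receiversRunning::countDown);
        }

        pool.shutdown();
        try {
            if (!pool.awaitTermination(10, TimeUnit.SECONDS)) {
                pool.shutdownNow();
            }
        } catch (InterruptedException e) {
            pool.shutdownNow();
            Thread.currentThread().interrupt();
        }
        return sendersReleased.get();
    }

    public static void main(String[] args) {
        System.out.println("senders released: " + run(3));
    }
}
```

The pool is sized so that blocked senders cannot starve the receivers. A latch like this only works if senders and receivers can run concurrently, which is exactly the guarantee it would impose on the test.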
This is true. We don't have real assertions making sure that the tasks are being co-located. The CountDownLatch would enforce that both tasks are online at the same time. I think this is not what we want to guarantee here. Instead we should test that the tasks are deployed in the same slot and, thus, use local channels for communication. Maybe a non-serializable record could do the trick here. I'll try it out.
Hmm this doesn't work because we always serialize into a buffer independent of the channel type. The only difference is whether it goes through Netty or not I think.
I think this should be the solution. What we can do is start the MiniCluster with only local communication enabled. That way we won't start Netty and the communication needs to happen strictly locally :-).
Makes sense :-)
@tillrohrmann The test still succeeds even if local communication is set to false.
It's expected that the test succeeds if localCommunication is set to false, because that is the less restricted case. If localCommunication is true, TMs cannot speak with each other.
What you should try is to comment out the co-location constraint to see that the test fails, because that's what we are testing here.
Of course, that makes more sense 🤦‍♂️. Unfortunately the test still runs successfully if the co-location constraint is removed. Based on the logs, the sender tasks finish before the receivers are even started, so we never run out of slots, which as I understand it is the failure condition here.
As @zentol and I discussed offline, the test actually tests not only the co-location constraints but also the input preferences of normal scheduling. Thus, one needs to remove the slot sharing as well in order to make this test fail.
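
Why removing both the co-location constraint and the slot sharing can make the test fail may be easier to see in a toy model. The sketch below is not Flink's scheduler. It assumes one slot per TaskManager, sender i running on TaskManager i, a pointwise connection in which receiver i reads only from sender i, and waiting receivers being served in index order as slots are freed. `SlotPlacementSketch` and its methods are hypothetical names.

```java
// Hypothetical toy model of receiver placement, not Flink's scheduling logic.
final class SlotPlacementSketch {

    // With co-location, receiver i is placed in the slot of sender i,
    // so it ends up on TaskManager i.
    static int[] coLocated(int parallelism) {
        final int[] receiverTm = new int[parallelism];
        for (int i = 0; i < parallelism; i++) {
            receiverTm[i] = i;
        }
        return receiverTm;
    }

    // Without co-location and slot sharing, receivers wait for free slots.
    // senderFinishOrder[k] is the index of the sender that finishes k-th;
    // receiver k is assumed to take the slot freed by that sender.
    static int[] firstFreeSlot(int[] senderFinishOrder) {
        final int[] receiverTm = new int[senderFinishOrder.length];
        for (int k = 0; k < senderFinishOrder.length; k++) {
            receiverTm[k] = senderFinishOrder[k];
        }
        return receiverTm;
    }

    // With local communication only, the job can succeed only if every
    // receiver i sits on the same TaskManager as sender i.
    static boolean allLocal(int[] receiverTm) {
        for (int i = 0; i < receiverTm.length; i++) {
            if (receiverTm[i] != i) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(allLocal(coLocated(3)));
        System.out.println(allLocal(firstFreeSlot(new int[] {0, 1, 2})));
        System.out.println(allLocal(firstFreeSlot(new int[] {2, 0, 1})));
    }
}
```

In this model the co-located placement is always local, while the unconstrained placement is local only when the senders happen to finish in index order. That matches the code comment in the diff that the slot assignment order depends on the order in which the senders finish.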