@tisonkun
Member

@tisonkun tisonkun commented Oct 19, 2018

What is the purpose of the change

Port CoLocationConstraintITCase and SlotSharingITCase to new codebase.

Brief change log

  1. Introduce CountDownLatchedReceiver, CountDownLatchedSender and CountDownLatchedAgnosticBinaryReceiver, with which the sender finishes only once all receivers are running. This prevents the sender from finishing early and returning its slot, which would defeat the purpose of the test.

  2. Port the tests and replace TestingCluster with MiniCluster.
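
The latch idea from item 1 can be sketched roughly as follows. This is a minimal illustration only, not the PR's code: plain threads stand in for the Flink sender and receiver tasks, and the class name `LatchedSenderSketch` and the method `runOnce` are hypothetical.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: plain threads stand in for the Flink sender/receiver tasks.
public class LatchedSenderSketch {

    /** Returns how many receivers were running when the sender was released. */
    static int runOnce(int parallelism) {
        CountDownLatch allReceiversRunning = new CountDownLatch(parallelism);
        AtomicInteger runningReceivers = new AtomicInteger();
        AtomicInteger seenBySender = new AtomicInteger(-1);
        // One extra thread so the blocked sender cannot starve the receivers.
        ExecutorService pool = Executors.newFixedThreadPool(parallelism + 1);

        // Sender: blocks until every receiver has counted down, then records what it saw.
        pool.submit(() -> {
            allReceiversRunning.await();
            seenBySender.set(runningReceivers.get());
            return null;
        });

        // Receivers: mark themselves as running, then count down the shared latch.
        for (int i = 0; i < parallelism; i++) {
            pool.submit(() -> {
                runningReceivers.incrementAndGet();
                allReceiversRunning.countDown();
            });
        }

        pool.shutdown();
        try {
            if (!pool.awaitTermination(10, TimeUnit.SECONDS)) {
                pool.shutdownNow();
                throw new IllegalStateException("tasks did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return seenBySender.get();
    }

    public static void main(String[] args) {
        System.out.println(runOnce(4));
    }
}
```

Because each receiver increments the counter before counting down, the sender cannot be released until all receivers have started, which is the property the test relies on to keep the sender's slot occupied.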

Verifying this change

This change is a trivial rework and is itself a test.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (no)
  • The public API, i.e., is any changed class annotated with (Evolving): (no)
  • The serializers: (no)
  • The runtime per-record code paths (performance sensitive): (no)
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: (no)
  • The S3 file system connector: (no)

Documentation

  • Does this pull request introduce a new feature? (no)
  • If yes, how is the feature documented? (not applicable)

cc @tillrohrmann

@tisonkun tisonkun force-pushed the FLINK-10610 branch 2 times, most recently from ef791d8 to 57b761d Compare October 19, 2018 11:03
@tisonkun tisonkun changed the title [FLINK-10610] [tests] Port CoLocationConstraintITCase to new codebase [FLINK-10610] [tests] Port slot sharing cases to new codebase Oct 19, 2018
@tisonkun
Member Author

ping @tillrohrmann

@zentol zentol self-assigned this Feb 6, 2019
@zentol
Contributor

zentol commented Feb 7, 2019

What about the TaskManagerFailsWithSlotSharingITCase? At a glance it looks similar and related to these tests.

@tisonkun
Member Author

tisonkun commented Feb 8, 2019

Addressed the comments. TaskManagerFailsWithSlotSharingITCase can also be covered in this scope; I ported it as testSlotSharingForForwardJobWithFailedTaskManager. Specifically, only the test that calls shutdown, which sent a PoisonPill message, is ported. We don't use the Kill message any more, IIRC.

@zentol
Contributor

zentol commented Feb 8, 2019

checkstyle:

01:08:26.782 [ERROR] src/test/java/org/apache/flink/runtime/minicluster/MiniClusterITCase.java:[355] (whitespace) EmptyLineSeparator: 'METHOD_DEF' has more than 1 empty lines before.

@tisonkun
Member Author

tisonkun commented Feb 8, 2019

Thanks very much for your review, @zentol! Addressed the comments.

ResultPartitionType.PIPELINED);

final CountDownLatch countDownLatch = new CountDownLatch(parallelism);
CountDownLatchedSender.setLatch(countDownLatch);
Contributor

This test must also not be run in parallel with testSlotSharingForForwardJobWithCoLocationConstraint.

Member Author

Thanks! I added a ReentrantLock to guard the setting of these CountDownLatches.

Contributor

actually, the synchronization can be removed since multiple tests are not run in parallel.

Contributor

FYI: if the test had failed another test could've been blocked indefinitely since you aren't calling unlock in a finally block.
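
For illustration, the usual pattern looks roughly like the following. This is a minimal sketch under assumptions: `UnlockInFinallySketch`, `LATCH_LOCK` and `runGuarded` are hypothetical names, not code from this PR.

```java
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch of the lock/try/finally pattern; not the PR's code.
public class UnlockInFinallySketch {

    static final ReentrantLock LATCH_LOCK = new ReentrantLock();

    /** Runs the test body while holding the lock and always releases it. */
    static void runGuarded(Runnable testBody) {
        LATCH_LOCK.lock();
        try {
            testBody.run();
        } finally {
            // Executed even if testBody throws, so other tests are never blocked.
            LATCH_LOCK.unlock();
        }
    }

    public static void main(String[] args) {
        try {
            runGuarded(() -> {
                throw new IllegalStateException("simulated test failure");
            });
        } catch (IllegalStateException expected) {
            // The failure propagates, but the lock has already been released.
        }
        System.out.println("lock still held: " + LATCH_LOCK.isLocked());
    }
}
```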

Member Author

Hmm, thanks for the explanation, and sorry for being so inexperienced. I will remove the synchronization.

Member Author

Would it make sense to move unlock into a finally block in case the tests are run in parallel later? Or should we defer the synchronization until we actually parallelize them?

@tisonkun tisonkun force-pushed the FLINK-10610 branch 3 times, most recently from 31f2461 to cc6b1b7 Compare February 12, 2019 13:32
@tillrohrmann
Contributor

Sorry for joining so late to the party. I actually missed this PR and accidentally opened #7689. While porting the tests, I think that we can completely remove the SlotSharingITCase, since the test is no longer valid. If you remove the slot sharing groups the test would pass with the new code because of the queued scheduling. The only test which I would port is the CoLocationConstraintITCase (#7690).

Moreover, with #7676 I added some functionality for terminating and starting new TaskExecutors for the MiniCluster.

I also think that TaskManagerFailsWithSlotSharingITCase should simply be removed, since it is the same as #7676, just with slot sharing. One could argue that slot sharing is the more common case and therefore should be used. But then we should simply adapt #7676.

Contributor

@tillrohrmann tillrohrmann left a comment

Sorry again for not seeing this PR. I didn't properly check JIRA. I think we should only keep the CoLocationConstraintITCase and remove the other tests.

checkState(running, "MiniCluster is not yet running.");
return taskManagers;
}
}
Contributor

I would like to do this differently. Please take a look at #7676.

.setConfiguration(configuration)
.build();

try (final MiniCluster miniCluster = new MiniCluster(cfg)) {
Contributor

I think it would be a good idea to separate tests which modify the MiniCluster from those which don't. For the latter, the MiniCluster could be started once for the test class instead of for every test. This will speed up the execution.

@tisonkun
Member Author

@tillrohrmann thanks for your guidance. Now I see why slot sharing isn't really tested, and I agree that we should only port CoLocationConstraintITCase.

I've moved to #7690 and found that the test passes even without the code below. Maybe we could set a CountDownLatch like this PR does?

		final SlotSharingGroup slotSharingGroup = new SlotSharingGroup();
		receiver.setSlotSharingGroup(slotSharingGroup);
		sender.setSlotSharingGroup(slotSharingGroup);

		receiver.setStrictlyCoLocatedWith(sender);

Basically, it's OK with me to close this thread and move the discussion to #7689, #7690 and #7676.

@tillrohrmann
Contributor

You're right that #7690 does not contain any assertions that receiver and sender run in the same slot. The count down latch does not enforce this (at least not the way we used it in this PR). Maybe we could send non-serializable records, which would need to go through a local channel without serialization, to test the functionality.

@tisonkun tisonkun closed this Feb 13, 2019
@tisonkun tisonkun deleted the FLINK-10610 branch February 13, 2019 11:33