-
Notifications
You must be signed in to change notification settings - Fork 13.8k
[FLINK-12706] Introduce ShuffleServiceFactory interface and its configuration #8680
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
The PR is based on #8608 at the moment, only last two commits are to review. |
|
Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community Review Progress
Please see the Pull Request Review Guide for a full explanation of the review process. DetailsThe Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer of PMC member is required Bot commandsThe @flinkbot bot supports the following commands:
|
|
@flinkbot attention @tillrohrmann @zhijiangW |
| .key("shuffle.service.class.name") | ||
| .defaultValue("org.apache.flink.runtime.io.network.NettyShuffleService") | ||
| .withDescription("The full name of shuffle service implementation class. " + | ||
| "The default implentation uses netty network communication, local memory and file system on task executor"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
implentation -> implementation
| /** | ||
| * The shuffle service implementation class full name. | ||
| */ | ||
| public static final ConfigOption<String> SHUFFLE_SERVICE = ConfigOptions |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not put this option into runtime module? Are there any considerations
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On the one hand, it would not hurt to put it into the flink-runtime module because it should be a cluster only option and thus not needed in an API module. On the other hand we also put the NettyShuffleEnvironmentOptions and HistoryServerOptions into flink-core. Thus, one could argue that the current placement is in line with the other options.
| TaskEventPublisher taskEventPublisher, | ||
| MetricGroup metricGroup, | ||
| IOManager ioManager) { | ||
| checkNotNull(ioManager); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
better to obey the parameters sequence during checkNotNull.
also check for the left taskExecutorLocation and metricGroup?
| @VisibleForTesting | ||
| static NettyShuffleEnvironment createNettyShuffleEnvironment( | ||
| NettyShuffleEnvironmentConfiguration config, | ||
| ResourceID taskExecutorLocation, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
taskExecutorLocation -> resourceID, because we have the TaskManagerLocation class which might be confused
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could also be called taskExecutorResourceId
| } | ||
|
|
||
| private static void registerNetworkMetrics(MetricGroup metricGroup, NetworkBufferPool networkBufferPool) { | ||
| checkNotNull(metricGroup); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
might remove this if this check could be done before in createNettyShuffleEnvironment
| /** | ||
| * Task executor local context of shuffle service used to create {@link ShuffleEnvironment}. | ||
| */ | ||
| class ShuffleLocalContext { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This class seems not be used atm. What is the motivation for bringing this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
true, some rebase leftover, it is now ShuffleEnvironmentContext
| * Factory method to create a specific {@link ShuffleMaster} implementation in job master. | ||
| * | ||
| * @param configuration Flink configuration | ||
| * @return shuffle manager implementation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shuffle manager -> shuffle master
| /** | ||
| * Utility to load the pluggable {@link ShuffleService} implementations. | ||
| */ | ||
| public class ShuffleServiceLoader { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might need a unit test for covering this class
zhijiangW
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for opening this PR @azagrebin !
It looks overall good to me and I left some nit inline comments.
tillrohrmann
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for opening this PR @azagrebin. I had some comments mainly concerning the configuration options which we should resolve before moving forward.
| // ------------------------------------------------------------------------ | ||
|
|
||
| /** | ||
| * The shuffle service implementation class full name. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Class name of the shuffle service implementation to be used by the cluster.
| /** | ||
| * The shuffle service implementation class full name. | ||
| */ | ||
| public static final ConfigOption<String> SHUFFLE_SERVICE = ConfigOptions |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe better to call this option SHUFFLE_SERVICE_CLASS
| * The shuffle service implementation class full name. | ||
| */ | ||
| public static final ConfigOption<String> SHUFFLE_SERVICE = ConfigOptions | ||
| .key("shuffle.service.class.name") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would suggest to use the following key: shuffle-service.class
| public static final ConfigOption<String> SHUFFLE_SERVICE = ConfigOptions | ||
| .key("shuffle.service.class.name") | ||
| .defaultValue("org.apache.flink.runtime.io.network.NettyShuffleService") | ||
| .withDescription("The full name of shuffle service implementation class. " + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The full class name of the shuffle service implementation to be used by the cluster.
| .key("shuffle.service.class.name") | ||
| .defaultValue("org.apache.flink.runtime.io.network.NettyShuffleService") | ||
| .withDescription("The full name of shuffle service implementation class. " + | ||
| "The default implentation uses netty network communication, local memory and file system on task executor"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The default implementation uses Netty for network communication and local memory as well disk space to store results on a TaskExecutor.
| ShuffleMaster<SD> createShuffleMaster(Configuration configuration); | ||
|
|
||
| /** | ||
| * Factory method to create a specific local {@link ShuffleEnvironment} implementation in task executor. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove in task executor
| return shuffleService.orElseThrow(() -> new FlinkException( | ||
| String.format( | ||
| "Could not instantiate the ShuffleService '%s'. " + | ||
| "Please make sure that this class is on your class path.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indentation seems wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is still the same argument, the string is broken by concatenation, I suggest we leave it this way to distinguish it from the next argument.
| /** | ||
| * Utility to load the pluggable {@link ShuffleService} implementations. | ||
| */ | ||
| public class ShuffleServiceLoader { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could simply be an enum since it only contains static methods.
| ShuffleService.class, | ||
| classLoader)); | ||
| } catch (ClassNotFoundException e) { | ||
| return Optional.empty(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we swallowing this exception here? It looks as if we recreate it at the call site of lookupEmbeddedShuffleServices if the result is empty. I think it would be better to simply return ShuffleService<?, ?, ?> and to allow this method to throw a ClassNotFoundException.
| shuffleServiceClassName))); | ||
| } | ||
|
|
||
| private static Optional<ShuffleService<?, ?, ?>> lookupEmbeddedShuffleServices(String shuffleServiceClassName) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method has some code duplication with HighAvailabilityServiceUtils#createCustomHAServices. Maybe we could factor a common method out which calls InstantiationUtil.instantiate and creates a FlinkException in case of a ClassNotFoundException.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will move wrapping of ClassNotFoundException by FlinkException into InstantiationUtil.instantiate.
9281b24 to
5e0465f
Compare
|
Thanks for the review @zhijiangW @tillrohrmann, I have addressed comments |
| */ | ||
| public class SingleInputGateFactory { | ||
| @SuppressWarnings("LoggerInitializedWithForeignClass") | ||
| private static final Logger LOG = LoggerFactory.getLogger(SingleInputGate.class); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not use SingleInputGateFactory.class here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My idea was to preserve the structure of the previous logging messages because the factory was a result of refactoring. I can change this if it is not important.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
got it, thanks for the explanation!
docs/ops/config.md
Outdated
|
|
||
| {% include generated/resource_manager_configuration.html %} | ||
|
|
||
| ### Shuffle service |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Upper case for Service for consistent with others.
docs/ops/config.zh.md
Outdated
|
|
||
| {% include generated/resource_manager_configuration.html %} | ||
|
|
||
| ### Shuffle service |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto: Service
| import org.apache.flink.runtime.io.network.api.writer.ResultPartitionWriter; | ||
| import org.apache.flink.runtime.io.network.partition.consumer.InputGate; | ||
| import org.apache.flink.util.FlinkException; | ||
| import org.junit.Test; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should add an empty line before this import.
| configuration.setString(SHUFFLE_SERVICE_FACTORY_CLASS, "org.apache.flink.runtime.shuffle.ShuffleServiceLoaderTest$CustomShuffleServiceFactory"); | ||
| ShuffleServiceFactory<?, ?, ?> shuffleServiceFactory = ShuffleServiceLoader.loadShuffleServiceFactory(configuration); | ||
| assertThat( | ||
| "Loaded shuffle service factory is not the default netty implementation", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should adjust the message
|
@azagrebin thanks for the updates and I only left several minor comments. BTW, the commit of |
|
@zhijiangW |
|
Sorry I thought the renaming issue was caused by this PR in previous commit before. |
…Id and small fixes
… implementing class is not found in InstantiationUtil.instantiate
…ctory method and its default Netty implementation
…and its default Netty implementation
tillrohrmann
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Really good work @azagrebin. LGTM. Addressing my last two comments while merging this PR.
| LOG.info("Shutting down the network environment and its components."); | ||
|
|
||
| // terminate all network connections | ||
| //noinspection OverlyBroadCatchBlock |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's revert this.
| /** | ||
| * Test suite for {@link ShuffleServiceLoader} utility. | ||
| */ | ||
| public class ShuffleServiceLoaderTest { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
extends TestLogger is missing
…and its default Netty implementation This closes apache#8680.
What is the purpose of the change
Introduce
ShuffleServiceinterface with two factory methods forShuffleMasterandShuffleEnvironmentand its configuration. Implement and configure by defaultNettyShuffleService.Brief change log
ShuffleServiceinterface with two factory methods forShuffleMasterandShuffleEnvironmentShuffleServiceLoaderutility class to load the configuredShuffleServiceimplementation from the class pathNettyShuffleService.Verifying this change
Refactoring, existing unit tests
Does this pull request potentially affect one of the following parts:
@Public(Evolving): (yes)Documentation