-
Notifications
You must be signed in to change notification settings - Fork 51
Initial Kubernetes cluster manager implementation. #50
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Includes the following initial feature set: - Cluster mode with only Scala/Java jobs - Spark-submit support - Dynamic allocation Does not include, most notably: - Client mode support - Proper testing on both the unit and integration level; integration tests are flaky
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: invert if/else
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Name threads?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can there be more than 1 status?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't think so.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually we have to look up the container status for the container that is hosting the Driver container... will need to think about this. Alternatively we can just wait for all container statuses to be ready.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should probably check if it's a file (which implicitly checks existence) to ensure it's not a directory
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Name threads?
| Executors.newCachedThreadPool( | ||
| new ThreadFactoryBuilder() | ||
| .setDaemon(true) | ||
| .setNameFormat("kubernetes-executor-requests") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we want name format of "kubernetes-executor-requests-%d"
| WORKDIR /opt/spark | ||
|
|
||
| # TODO support spark.executor.extraClassPath | ||
| CMD ${JAVA_HOME}/bin/java -Dspark.executor.port=$SPARK_EXECUTOR_PORT -Xmx$SPARK_EXECUTOR_MEMORY -cp ${SPARK_HOME}/jars/\* org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url $SPARK_DRIVER_URL --executor-id $(dbus-uuidgen) --cores $SPARK_EXECUTOR_CORES --app-id $SPARK_APPLICATION_ID --hostname $HOSTNAME |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should set min heap as well: -Xms$SPARK_EXECUTOR_MEMORY to avoid a bunch of Full GC due to ergonomics as heap size dynamically adjusts
|
|
||
| WORKDIR /opt/spark | ||
|
|
||
| CMD ${JAVA_HOME}/bin/java -Dspark.shuffle.service.port=$SPARK_SHUFFLE_SERVICE_PORT -Xmx1g -cp ${SPARK_HOME}/jars/\* org.apache.spark.deploy.ExternalShuffleService |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Configurable heap? Also set min heap = max?
ash211
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Geez Maven is verbose with xml
|
|
||
| case KUBERNETES_EXPOSE_DRIVER_PORT => | ||
| value.split("=", 2).toSeq match { | ||
| case Seq(k, v) => exposeDriverPorts(k) = v.toInt |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will throw if v isn't toIntable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. Probably should have a try-catch for a more favorable message than what I presume the default will be though.
| protected final String KUBERNETES_APP_NAMESPACE = "--kubernetes-app-namespace"; | ||
| protected final String KUBERNETES_CLIENT_CERT_FILE = "--kubernetes-client-cert-file"; | ||
| protected final String KUBERNETES_CLIENT_KEY_FILE = "--kubernetes-client-key-file"; | ||
| protected final String KUBERNETES_CA_CERT_FILE = "--kubernetes-ca-cert-file"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
alphabetize within this k8s group
| <artifactId>netty</artifactId> | ||
| <version>3.8.0.Final</version> | ||
| </dependency> | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: space
| * the locally-generated ID from the superclass. | ||
| * @return The application ID | ||
| * | ||
| * @return The application ID |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: spacing changed unnecessarily
- Run spark shuffle service independently of a Spark job - Run executor pods instead of replication controllers
|
@ash211 @mccheah Shall we move these changes to a branch in https://github.com/foxish/spark/ and continue the discussions there on the various points we discussed last week? |
|
@foxish yep! Just got legal signoff this morning so we'll be sending it over shortly |
|
Thanks! |
HostPath volume will be used both on the shuffle service and the executors that connect to it. The shuffle service picks up the files written by the executors via the shared host volume.
|
@foxish can I get permission to push to your fork? |
|
Code transferred to foxish#7 with discussion now happening there instead |
* Create README to better describe project purpose * Add links to usage guide and dev docs * Minor changes
* Create README to better describe project purpose * Add links to usage guide and dev docs * Minor changes
Includes the following initial feature set:
Does not include, most notably: