[SPARK-23153][K8s] Support client dependencies with a Hadoop Compatible File System #23546
@@ -208,8 +208,31 @@ If your application's dependencies are all hosted in remote locations like HDFS
by their appropriate remote URIs. Also, application dependencies can be pre-mounted into custom-built Docker images.
Those dependencies can be added to the classpath by referencing them with `local://` URIs and/or setting the
`SPARK_EXTRA_CLASSPATH` environment variable in your Dockerfiles. The `local://` scheme is also required when referring to
dependencies in custom-built Docker images in `spark-submit`. Note that using application dependencies from the submission
client's local file system is currently not yet supported.
dependencies in custom-built Docker images in `spark-submit`. We support dependencies from the submission
client's local file system using the `file://` scheme or without a scheme (using a full path); in that case the destination should be a Hadoop-compatible file system.
A typical example of this, using S3, is passing the following options:
```
...
--packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.6
--conf spark.kubernetes.file.upload.path=s3a://<s3-bucket>/path
--conf spark.hadoop.fs.s3a.access.key=...
--conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
--conf spark.hadoop.fs.s3a.fast.upload=true
--conf spark.hadoop.fs.s3a.secret.key=....
--conf spark.driver.extraJavaOptions=-Divy.cache.dir=/tmp -Divy.home=/tmp
file:///full/path/to/app.jar
```
Review discussion:
* How does the submission client know that the user's intention is to upload to S3 instead of, say, an HDFS cluster? I don't think this can be determined 100% for sure based only on the presence of those
* I saw you have
* The code is agnostic of the protocol. I am just using S3 as an example in the docs. If they don't set the properties, submit will fail.
* ok
The app jar file will be uploaded to S3 and then, when the driver is launched, it will be downloaded
to the driver pod and added to its classpath. Spark will generate a subdirectory with a random name under the upload path
to avoid conflicts between Spark applications running in parallel. Users can manage the subdirectories created according to their needs.
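For illustration (the bucket name is a placeholder and the UUID-based directory name is generated at submit time), the jar from the example above would be staged at a location of the form:

```
s3a://<s3-bucket>/path/spark-upload-<uuid>/app.jar
```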
The client scheme is supported for the application jar and for dependencies specified by the properties `spark.jars` and `spark.files`, as in the example below.
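A hypothetical invocation mixing a client-local application jar with client-local `--jars` and `--files` entries (the bucket and file names below are placeholders) could look like:

```
...
--conf spark.kubernetes.file.upload.path=s3a://<s3-bucket>/path
--jars file:///full/path/to/dep.jar
--files /full/path/to/some.properties
file:///full/path/to/app.jar
```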
Important: all client-side dependencies will be uploaded to the given path with a flat directory structure, so
file names must be unique, otherwise files will be overwritten. Also make sure that the default Ivy directory in the derived
Kubernetes image has the required access rights, or modify the settings as above. The latter is also important if you use `--packages` in
cluster mode.
## Secret Management
Kubernetes [Secrets](https://kubernetes.io/docs/concepts/configuration/secret/) can be used to provide credentials for a
@@ -455,7 +478,6 @@ There are several Spark on Kubernetes features that are currently being worked on
Some of these include:

* Dynamic Resource Allocation and External Shuffle Service
* Local File Dependency Management
Review discussion:
* Does the fact that this is a Hadoop (compatible) FS based solution imply there are use cases for local deps that aren't served by this PR?
* Yes, there might be cases like the RSS server where users want to upload to a file server within the cluster. I am covering the cases mentioned in the design document, which provide an API to use out of the box. The RSS implementation AFAIK needs improvements, so it's open for now, but we can work on it next. I could add a note there saying "partially done" instead of removing that part of the doc, but since it is a working solution, I thought I could remove that from future work.
* Out of curiosity, does any S3 object store fit the HCFS category?
* I'd say that's broadly applicable enough to call it "done"
* We have environments where there is nothing remotely HDFS-like available, and the systems are typically air-gapped, so using external services like S3 isn't an option either. Primary storage is usually a high-performance parallel file system (Lustre or IBM Spectrum Scale), which is just a POSIX-compliant file system mounted on all nodes over the system interconnect. Using
* In my opinion we need both options: a) upload to some DFS/object store service, b) a file server.
* Job Queues and Resource Management

# Configuration
@@ -1069,6 +1091,15 @@ See the [configuration page](configuration.html) for information on Spark config
  Specify the grace period in seconds when deleting a Spark application using spark-submit.
  </td>
</tr>
<tr>
  <td><code>spark.kubernetes.file.upload.path</code></td>
  <td>(none)</td>
  <td>
    Path to store files on the spark-submit side in cluster mode. For example:
    <code>spark.kubernetes.file.upload.path=s3a://<s3-bucket>/path</code>
    Files should be specified as <code>file://path/to/file</code> or as an absolute path.
  </td>
</tr>
</table>

#### Pod template properties
@@ -16,18 +16,25 @@
 */
package org.apache.spark.deploy.k8s

import java.io.File
import java.io.{File, IOException}
import java.net.URI
import java.security.SecureRandom
import java.util.UUID

import scala.collection.JavaConverters._

import io.fabric8.kubernetes.api.model.{Container, ContainerBuilder, ContainerStateRunning, ContainerStateTerminated, ContainerStateWaiting, ContainerStatus, Pod, PodBuilder}
import io.fabric8.kubernetes.client.KubernetesClient
import org.apache.commons.codec.binary.Hex
import org.apache.hadoop.fs.{FileSystem, Path}

import org.apache.spark.{SparkConf, SparkException}
import org.apache.spark.deploy.SparkHadoopUtil
import org.apache.spark.deploy.k8s.Config.KUBERNETES_FILE_UPLOAD_PATH
import org.apache.spark.internal.Logging
import org.apache.spark.launcher.SparkLauncher
import org.apache.spark.util.{Clock, SystemClock, Utils}
import org.apache.spark.util.Utils.getHadoopFileSystem

private[spark] object KubernetesUtils extends Logging {
Review discussion:
* I wonder if this can be private[k8s], but no big deal
* I just followed all the existing convention. All classes in this module use
@@ -209,4 +216,77 @@ private[spark] object KubernetesUtils extends Logging {
    Hex.encodeHexString(random) + time
  }
  /**
   * Upload files to the configured upload path and return their URIs, rewritten to point at
   * the uploaded copies.
   */
  def uploadAndTransformFileUris(fileUris: Iterable[String], conf: Option[SparkConf] = None)
    : Iterable[String] = {
    fileUris.map { uri =>
      uploadFileUri(uri, conf)
    }
  }
  // A dependency is considered local when it has no scheme or uses the file:// scheme.
  private def isLocalDependency(uri: URI): Boolean = {
    uri.getScheme match {
      case null | "file" => true
      case _ => false
    }
  }

  // True when the resource is a real, client-local dependency (not the NO_RESOURCE placeholder).
  def isLocalAndResolvable(resource: String): Boolean = {
    resource != SparkLauncher.NO_RESOURCE &&
      isLocalDependency(Utils.resolveURI(resource))
  }
  // Replaces a resolvable, client-local main application resource with the internal
  // NO_RESOURCE placeholder; any other resource is returned unchanged.
  def renameMainAppResource(resource: String, conf: SparkConf): String = {
    if (isLocalAndResolvable(resource)) {
      SparkLauncher.NO_RESOURCE
    } else {
      resource
    }
  }
  /**
   * Upload a client-local file to the path given by spark.kubernetes.file.upload.path and
   * return the URI of the uploaded copy.
   */
  def uploadFileUri(uri: String, conf: Option[SparkConf] = None): String = {
    conf match {
      case Some(sConf) =>
        if (sConf.get(KUBERNETES_FILE_UPLOAD_PATH).isDefined) {
          val fileUri = Utils.resolveURI(uri)
          try {
            val hadoopConf = SparkHadoopUtil.get.newConfiguration(sConf)
            val uploadPath = sConf.get(KUBERNETES_FILE_UPLOAD_PATH).get
            val fs = getHadoopFileSystem(Utils.resolveURI(uploadPath), hadoopConf)
            // Stage the file under a randomly named subdirectory to avoid collisions between
            // applications submitted in parallel.
            val randomDirName = s"spark-upload-${UUID.randomUUID()}"
            fs.mkdirs(new Path(s"${uploadPath}/${randomDirName}"))
            val targetUri = s"${uploadPath}/${randomDirName}/${fileUri.getPath.split("/").last}"
            log.info(s"Uploading file: ${fileUri.getPath} to dest: $targetUri...")
            uploadFileToHadoopCompatibleFS(new Path(fileUri.getPath), new Path(targetUri), fs)
            targetUri
          } catch {
            case e: Exception =>
              throw new SparkException(s"Uploading file ${fileUri.getPath} failed...", e)
          }
        } else {
          throw new SparkException("Please specify " +
            "spark.kubernetes.file.upload.path property.")
        }
      case _ => throw new SparkException("Spark configuration is missing...")
    }
  }
  /**
   * Upload a file to a Hadoop-compatible filesystem.
   */
  private def uploadFileToHadoopCompatibleFS(
      src: Path,
      dest: Path,
      fs: FileSystem,
      delSrc: Boolean = false,
      overwrite: Boolean = true): Unit = {
    try {
      // Pass the delSrc and overwrite flags through instead of hardcoding them.
      fs.copyFromLocalFile(delSrc, overwrite, src, dest)
    } catch {
      case e: IOException =>
        throw new SparkException(s"Error uploading file ${src.getName}", e)
    }
  }
}
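As an aside, here is a minimal sketch of how the helpers above could be exercised from submit-side code; the object name, bucket, and file paths are placeholders, and this is not part of the change itself:

```
// Placed in the same package only because KubernetesUtils is private[spark].
package org.apache.spark.deploy.k8s

import org.apache.spark.SparkConf

object UploadSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical configuration: the upload path must point at a Hadoop-compatible filesystem.
    val conf = new SparkConf()
      .set("spark.kubernetes.file.upload.path", "s3a://some-bucket/path")

    // Uploads the local jar under <upload path>/spark-upload-<uuid>/ and returns the remote URI.
    val remoteJar = KubernetesUtils.uploadFileUri("file:///full/path/to/app.jar", Some(conf))

    // The same helper applied to entries such as those coming from spark.jars / spark.files.
    val remoteDeps = KubernetesUtils.uploadAndTransformFileUris(
      Seq("file:///full/path/to/dep1.jar", "/full/path/to/dep2.jar"), Some(conf))

    println(s"Uploaded: $remoteJar, ${remoteDeps.mkString(", ")}")
  }
}
```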
Review discussion:
* Note: resolving packages in containers may be slow if your network is slow, since the Ivy cache will be empty. In practice, users should build their dependencies into the image or use a pre-populated cache.
* This point might be a good addition to the docs
* I will add it.