
Conversation

@xy720 xy720 (Member) commented Jul 23, 2020

Proposed changes

Please see the main description in issue #4101

Summary

Currently, when users run a spark load job, they have to upload the dependent jars to HDFS every time.
This CL adds a self-generated repository under the working_dir folder in HDFS for storing the dependencies of the Spark DPP program and the Spark platform.
Note that the dependencies we upload to the repository include:
1. spark-dpp.jar
2. spark2x.zip
1 is the DPP library, built from the spark-dpp submodule. See PR #4146 for details about the spark-dpp submodule.
2 is the Spark 2.x platform library, which contains all the jars in $SPARK_HOME/jars.

The repository structure will look like this:

__spark_repository__/
    |-__archive_1_0_0/
    |        |-__lib_990325d2c0d1d5e45bf675e54e44fb16_spark-dpp.jar
    |        |-__lib_7670c29daf535efe3c9b923f778f61fc_spark-2x.zip
    |-__archive_2_2_0/
    |        |-__lib_64d5696f99c379af2bee28c1c84271d5_spark-dpp.jar
    |        |-__lib_1bbb74bb6b264a270bc7fca3e964160f_spark-2x.zip
    |-__archive_3_2_0/
    |        |-...

The following conditions will force the FE to upload dependencies:
1. The FE finds that its dppVersion is absent from the repository.
2. The MD5 value of the remote file does not match that of the local file.
Before the FE uploads the dependencies, it creates an archive directory named __archive_{dppVersion} under the repository.
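
For illustration, a minimal sketch of this check. The names needUpload, dppVersionInRepo, and remoteMd5 are hypothetical and not the actual SparkRepository API; DigestUtils comes from Apache commons-codec:

    // Sketch only, not the actual SparkRepository implementation.
    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import org.apache.commons.codec.digest.DigestUtils;

    class UploadCheck {
        // Returns true if a dependency must be (re-)uploaded to the repository.
        static boolean needUpload(Path localFile, String remoteMd5,
                                  boolean dppVersionInRepo) throws Exception {
            if (!dppVersionInRepo) {
                return true; // condition 1: no __archive_{dppVersion} in the repository
            }
            try (InputStream in = Files.newInputStream(localFile)) {
                // condition 2: MD5 of the remote file differs from the local file's MD5
                return !DigestUtils.md5Hex(in).equalsIgnoreCase(remoteMd5);
            }
        }
    }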

Types of changes

  • New feature (non-breaking change which adds functionality)

Checklist

  • I have created an issue on Doris's issues and have described the bug/feature there in detail
  • Commit messages in my PR start with the related issue's ID, like "#4071 Add pull request template to doris project"
  • Compiling and unit tests pass locally with my changes
  • I have added tests that prove my fix is effective or that my feature works

.setAppResource(appResourceHdfsPath)
.setMainClass(SparkEtlJob.class.getCanonicalName())
.setAppName(String.format(ETL_JOB_NAME, loadLabel))
.setSparkHome(spark_home)

Contributor:

What is this for?

Member Author:

It is the sparkHome argument of SparkLauncher.

}

private void initRepository() throws LoadException {
    LOG.info("start to init remote repository");

Contributor:

The whole initRepository needs to be lock-protected.

Member Author (xy720, Jul 24, 2020):

Added a synchronized lock in SparkEtlHandler, keyed by cluster id.
initRepository operations are now protected by the lock when they belong to the same cluster.
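
For illustration, a minimal sketch of what per-cluster locking can look like. The lock-map approach and all names here are hypothetical, not the actual SparkEtlHandler code:

    import java.util.concurrent.ConcurrentHashMap;

    class RepositoryInitLocks {
        // One lock object per cluster id, created lazily.
        private static final ConcurrentHashMap<Long, Object> LOCKS = new ConcurrentHashMap<>();

        static void runInitRepository(long clusterId, Runnable initRepository) {
            Object lock = LOCKS.computeIfAbsent(clusterId, id -> new Object());
            synchronized (lock) {
                // At most one initRepository runs at a time per cluster;
                // different clusters may still initialize concurrently.
                initRepository.run();
            }
        }
    }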

    sparkConfigs.put("spark.yarn.archive", jobArchiveHdfsPath);
}
if (Strings.isNullOrEmpty(sparkConfigs.get("spark.yarn.stage.dir"))) {
    sparkConfigs.put("spark.yarn.stage.dir", jobStageHdfsPath);

Contributor:

What does this spark.yarn.stage.dir mean?

Member Author:

It is a path in HDFS, generated by Spark itself, where temporary configuration for the Spark application is saved.

@morningman morningman self-assigned this Jul 23, 2020
@morningman morningman added area/spark-load Issues or PRs related to the spark load kind/feature Categorizes issue or PR as related to a new feature. labels Jul 23, 2020

// update archive and stage configs here
Map<String, String> sparkConfigs = resource.getSparkConfigs();
if (Strings.isNullOrEmpty(sparkConfigs.get("spark.yarn.archive"))) {

Contributor:

In what situation will the spark.yarn.archive config NOT be empty?

Member Author:

If the user has set spark.yarn.archive in the resource, we prefer the archive set by the user; otherwise we use the archive generated by SparkRepository.

.setAppResource(appResourceHdfsPath)
.setMainClass(SparkEtlJob.class.getCanonicalName())
.setAppName(String.format(ETL_JOB_NAME, loadLabel))
.setSparkHome(sparkHome)

Contributor:

Why do we need to set the Spark home here?
Is it compatible with an open source Spark environment?

Member Author:

This Spark home is configurable. Users in an open source environment need to set this parameter in fe.conf.
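
For example, a hedged fe.conf entry (the key name spark_home_default_dir is assumed here; verify it against the FE configuration reference for your version):

    # assumed key name; points the FE at a local Spark installation
    spark_home_default_dir = /opt/spark-2.4.6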

@morningman morningman (Contributor) left a comment:

LGTM

@morningman morningman added the approved Indicates a PR has been approved by one committer. label Jul 26, 2020
@morningman morningman merged commit f2c9e1e into apache:master Jul 27, 2020