
Conversation

@xy720 xy720 (Member) commented Jul 23, 2020

Proposed changes

Please see the main description in issue #4101

Summary

Currently, when users run a spark load job, they have to upload the dependent jars to HDFS every time.
This CL adds a self-generated repository under the working_dir folder in HDFS for storing the dependencies of the Spark DPP program and the Spark platform.
Note that the dependencies we upload to the repository include:
1. spark-dpp.jar
2. spark2x.zip
1 is the DPP library, built from the spark-dpp submodule. See PR #4146 for details about the spark-dpp submodule.
2 is the Spark 2.x platform library, which contains all the jars in $SPARK_HOME/jars.

The repository structure will look like this:

__spark_repository__/
    |-__archive_1_0_0/
    |        |-__lib_990325d2c0d1d5e45bf675e54e44fb16_spark-dpp.jar
    |        |-__lib_7670c29daf535efe3c9b923f778f61fc_spark-2x.zip
    |-__archive_2_2_0/
    |        |-__lib_64d5696f99c379af2bee28c1c84271d5_spark-dpp.jar
    |        |-__lib_1bbb74bb6b264a270bc7fca3e964160f_spark-2x.zip
    |-__archive_3_2_0/
    |        |-...

The following conditions will force the FE to upload dependencies:
1. The FE finds that its dppVersion is absent from the repository.
2. The MD5 value of the remote file does not match that of the local file.
Before the FE uploads the dependencies, it creates an archive directory named __archive_{dppVersion} under the repository.
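
For illustration, a minimal sketch of this check. The names needUpload, dppVersionInRepo, and remoteMd5 are hypothetical and not the actual SparkRepository API; DigestUtils comes from Apache commons-codec:

    // Sketch only, not the actual SparkRepository implementation.
    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import org.apache.commons.codec.digest.DigestUtils;

    class UploadCheck {
        // Returns true if a dependency must be (re-)uploaded to the repository.
        static boolean needUpload(Path localFile, String remoteMd5,
                                  boolean dppVersionInRepo) throws Exception {
            if (!dppVersionInRepo) {
                return true; // condition 1: no __archive_{dppVersion} in the repository
            }
            try (InputStream in = Files.newInputStream(localFile)) {
                // condition 2: MD5 of the remote file differs from the local file's MD5
                return !DigestUtils.md5Hex(in).equalsIgnoreCase(remoteMd5);
            }
        }
    }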

Types of changes

  • New feature (non-breaking change which adds functionality)

Checklist

  • I have created an issue on Doris's issues and have described the bug/feature there in detail
  • Commit messages in my PR start with the related issue's ID, like "#4071 Add pull request template to doris project"
  • Compiling and unit tests pass locally with my changes
  • I have added tests that prove my fix is effective or that my feature works

.setAppResource(appResourceHdfsPath)
.setMainClass(SparkEtlJob.class.getCanonicalName())
.setAppName(String.format(ETL_JOB_NAME, loadLabel))
.setSparkHome(spark_home)

Contributor:

What is this for?

Member Author:

It is the sparkHome argument of SparkLauncher.

}

private void initRepository() throws LoadException {
    LOG.info("start to init remote repository");

Contributor:

The whole initRepository needs to be lock-protected.

Member Author (xy720, Jul 24, 2020):

Added a synchronized lock in SparkEtlHandler, keyed by cluster id.
initRepository operations are now protected by the lock when they belong to the same cluster.
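
For illustration, a minimal sketch of what per-cluster locking can look like. The lock-map approach and all names here are hypothetical, not the actual SparkEtlHandler code:

    import java.util.concurrent.ConcurrentHashMap;

    class RepositoryInitLocks {
        // One lock object per cluster id, created lazily.
        private static final ConcurrentHashMap<Long, Object> LOCKS = new ConcurrentHashMap<>();

        static void runInitRepository(long clusterId, Runnable initRepository) {
            Object lock = LOCKS.computeIfAbsent(clusterId, id -> new Object());
            synchronized (lock) {
                // At most one initRepository runs at a time per cluster;
                // different clusters may still initialize concurrently.
                initRepository.run();
            }
        }
    }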

    sparkConfigs.put("spark.yarn.archive", jobArchiveHdfsPath);
}
if (Strings.isNullOrEmpty(sparkConfigs.get("spark.yarn.stage.dir"))) {
    sparkConfigs.put("spark.yarn.stage.dir", jobStageHdfsPath);

Contributor:

What does this spark.yarn.stage.dir mean?

Member Author:

It is a path in HDFS, generated by Spark itself, where temporary configuration for the Spark application is saved.

@morningman morningman self-assigned this Jul 23, 2020
@morningman morningman added area/spark-load Issues or PRs related to the spark load kind/feature Categorizes issue or PR as related to a new feature. labels Jul 23, 2020

// update archive and stage configs here
Map<String, String> sparkConfigs = resource.getSparkConfigs();
if (Strings.isNullOrEmpty(sparkConfigs.get("spark.yarn.archive"))) {

Contributor:

In what situation will the spark.yarn.archive config NOT be empty?

Member Author:

If the user has set spark.yarn.archive in the resource, we prefer the archive set by the user; otherwise we use the archive generated by SparkRepository.

.setAppResource(appResourceHdfsPath)
.setMainClass(SparkEtlJob.class.getCanonicalName())
.setAppName(String.format(ETL_JOB_NAME, loadLabel))
.setSparkHome(sparkHome)

Contributor:

Why do we need to set the Spark home here?
Is it compatible with an open source Spark environment?

Member Author:

This Spark home is configurable. Users in an open source environment need to set this parameter in fe.conf.
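
For example, a hedged fe.conf entry (the key name spark_home_default_dir is assumed here; verify it against the FE configuration reference for your version):

    # assumed key name; points the FE at a local Spark installation
    spark_home_default_dir = /opt/spark-2.4.6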

@morningman morningman (Contributor) left a comment:

LGTM

@morningman morningman added the approved Indicates a PR has been approved by one committer. label Jul 26, 2020
@morningman morningman merged commit f2c9e1e into apache:master Jul 27, 2020