Conversation

@wyb wyb commented Apr 28, 2020

  1. User interface:

1.1 Spark resource management

Spark is used as an external computing resource in Doris to do the ETL work. In the future, other external resources may be used in Doris as well, for example, MapReduce for ETL, Spark/GPU for queries, HDFS/S3 for external storage. We introduce resource management to manage these external resources used by Doris.

-- create spark resource
CREATE EXTERNAL RESOURCE resource_name [FOR user_name]
PROPERTIES 
(                 
  type = spark,
  spark_conf_key = spark_conf_value,
  working_dir = path,
  broker = broker_name,
  broker.property_key = property_value
)

-- drop spark resource
DROP RESOURCE resource_name

-- show resources
SHOW RESOURCES
SHOW PROC "/resources"

-- privileges
GRANT USAGE_PRIV ON RESOURCE resource_name TO user_identity
GRANT USAGE_PRIV ON RESOURCE resource_name TO ROLE role_name

REVOKE USAGE_PRIV ON RESOURCE resource_name FROM user_identity
REVOKE USAGE_PRIV ON RESOURCE resource_name FROM ROLE role_name
  • CREATE EXTERNAL RESOURCE:

The FOR user_name clause is optional. If it is present, the external resource belongs to that user. If not, the external resource belongs to the system and is available to all users.

PROPERTIES:

  1. type: resource type. Only spark is supported for now.
  2. spark configurations: follow the standard Spark configuration property names, refer to: https://spark.apache.org/docs/latest/configuration.html.
  3. working_dir: optional, used to store intermediate results of Spark ETL.
  4. broker: optional, used in Spark ETL. The broker is needed to read the ETL intermediate results when they are pushed into BE.

Example:

CREATE EXTERNAL RESOURCE "spark0"
PROPERTIES 
(                                                                             
  "type" = "spark",                   
  "spark.master" = "yarn",
  "spark.submit.deployMode" = "cluster",
  "spark.jars" = "xxx.jar,yyy.jar",
  "spark.files" = "/tmp/aaa,/tmp/bbb",
  "spark.yarn.queue" = "queue0",
  "spark.executor.memory" = "1g",
  "spark.hadoop.yarn.resourcemanager.address" = "127.0.0.1:9999",
  "spark.hadoop.fs.defaultFS" = "hdfs://127.0.0.1:10000",
  "working_dir" = "hdfs://127.0.0.1:10000/tmp/doris",
  "broker" = "broker0",
  "broker.username" = "user0",
  "broker.password" = "password0"
)
  • SHOW RESOURCES:
    General users can only see their own resources.
    Admin and root users can see all resources.
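
For example, a minimal sketch of the GRANT/REVOKE statements above in use (the user identity 'user0'@'%' is hypothetical and not part of this proposal):

-- let user0 reference spark0 in WITH RESOURCE when submitting spark load jobs
GRANT USAGE_PRIV ON RESOURCE "spark0" TO 'user0'@'%'

-- withdraw that ability again
REVOKE USAGE_PRIV ON RESOURCE "spark0" FROM 'user0'@'%'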

1.2 Create spark load job

LOAD LABEL db_name.label_name 
(
  DATA INFILE ("/tmp/file1") INTO TABLE table_name, ...
)
WITH RESOURCE resource_name
[(key1 = value1, ...)]
[PROPERTIES (key2 = value2, ... )]

Example:

LOAD LABEL example_db.test_label 
(
  DATA INFILE ("hdfs://127.0.0.1:10000/tmp/file1") INTO TABLE example_table
)
WITH RESOURCE "spark0"
(
  "spark.executor.memory" = "1g",
  "spark.files" = "/tmp/aaa,/tmp/bbb"
)
PROPERTIES ("timeout" = "3600")

The Spark configurations in the load statement can temporarily override the corresponding configurations defined in the resource.
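
As a sketch of this override behavior (the label, file path, and the 2g value below are hypothetical; spark0 is the resource from the example above, which sets spark.executor.memory to 1g):

LOAD LABEL example_db.test_label2
(
  DATA INFILE ("hdfs://127.0.0.1:10000/tmp/file2") INTO TABLE example_table
)
WITH RESOURCE "spark0"
(
  -- for this load only, the ETL job runs with 2g executor memory;
  -- the definition of resource spark0 still keeps 1g
  "spark.executor.memory" = "2g"
)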

#3010

@wyb wyb changed the title from "Add spark etl cluster and cluster manager" to "[Spark load] Add spark etl cluster and cluster manager" on Apr 28, 2020
@wyb wyb force-pushed the spark_etl_cluster branch from 166cbf7 to 3c5b2d2 on April 30, 2020 03:18
@kangkaisen

@wyb Hi, why is the update load cluster code commented out?

wyb commented Apr 30, 2020

@wyb Hi, why is the update load cluster code commented out?

Because the EtlClusterDesc class is used in the load job process, which is not included in this PR.
I will remove the comment in the load job process PR.

@imay imay self-assigned this Apr 30, 2020

imay commented Apr 30, 2020

SHOW PROC "/load_etl_clusters"

why load_etl_clusters? seems load_clusters is OK.

@imay imay left a comment

Should add documents for resource operations.

@kangpinghuang kangpinghuang added the area/load label May 22, 2020
@kangpinghuang kangpinghuang added this to the 0.13.0 milestone May 22, 2020
morningman previously approved these changes May 25, 2020

@morningman morningman left a comment

LGTM

imay previously approved these changes May 26, 2020

@imay imay left a comment

LGTM

@imay imay added the release-note and approved labels May 26, 2020

@imay imay left a comment

LGTM

@morningman morningman merged commit 4978bd6 into apache:master May 26, 2020
@EmmyMiao87 EmmyMiao87 mentioned this pull request Aug 17, 2020