Skip to content

[Spark load] Doris support Spark load #3433

@wyb

Description

@wyb

For many users who want to load data into doris for the first time, they have large amount of data, about 10G+,it is hard to support to load so large data into doris at one time using Broker load or Stream load. To resolve this problem, We proposal a new solution to load data by using spark cluster.

Spark clusters are used to preprocess data (bitmap global dict build, partition, sort, aggregation) in spark load, which can improve Doris load performance of large data volume and save the computing resources of Doris.

Spark load is mainly used for the initial migration from other systems or loading large amounts of data into Doris.

                 +
                 | 0. User create spark load job
            +----v----+
            |   FE    |---------------------------------+
            +----+----+                                 |
                 | 3. FE send push tasks                |
                 | 5. FE publish version                |
    +------------+------------+                         |
    |            |            |                         |
+---v---+    +---v---+    +---v---+                     |
|  BE   |    |  BE   |    |  BE   |                     |1. FE submit Spark ETL job
+---^---+    +---^---+    +---^---+                     |
    |4. BE push with broker   |                         |
+---+---+    +---+---+    +---+---+                     |
|Broker |    |Broker |    |Broker |                     |
+---^---+    +---^---+    +---^---+                     |
    |            |            |                         |
+---+------------+------------+---+ 2.ETL +-------------v---------------+
|               HDFS              +------->       Spark cluster         |
|                                 <-------+                             |
+---------------------------------+       +-----------------------------+

Metadata

Metadata

Assignees

Labels

area/loadIssues or PRs related to all kinds of loadkind/designCategorizes issue or PR as related to design.

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions