[DSIP-11][Feature][python] Use the pydolphinscheduler to dynamically generate workflows from the configuration file #10995

Closed · 3 tasks done · Tracked by #14102
jieguangzhou opened this issue Jul 15, 2022 · 11 comments · Fixed by #11611

@jieguangzhou
Member

jieguangzhou commented Jul 15, 2022

Search before asking

  • I had searched in the issues and found no similar feature requirement.

Description

Dynamically generate workflows from YAML configuration files, aka workflows-as-code.

Today we can already use pydolphinscheduler to build workflows easily in Python. Workflows-as-code would be even easier to adopt if we also supported defining workflows in configuration files. This would make workflows clearer, shareable, and reviewable.
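
For context, here is roughly what building a workflow looks like with the current Python API (a minimal sketch; the module paths follow pydolphinscheduler 3.x, and the tenant value and shell commands are placeholders for illustration):

# A minimal sketch using the existing pydolphinscheduler Python API.
from pydolphinscheduler.core.process_definition import ProcessDefinition
from pydolphinscheduler.tasks.shell import Shell

with ProcessDefinition(name="prepare_datas", tenant="tenant_exists") as pd:
    download = Shell(name="download_data", command="echo download")
    calc = Shell(name="calc_signals", command="echo calc")
    # Declare download_data as the upstream task of calc_signals.
    download >> calc
    # Submit the workflow definition to the DolphinScheduler server.
    pd.submit()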

We could then keep the project in a Git repository, which enables richer operations such as code review and CI/CD.

We can create a workflow by defining the following fields; all the field definitions are derived from pydolphinscheduler.

A YAML file defines a workflow:

# Define the process
Process:
  # Process (workflow) name
  name: prepare_datas

  # Parameters for process creation
  param:
      project: /data/project

  # Whether to run the workflow after the creation is complete
  run: True

# Define the tasks under the process
Tasks:
  -
    # Task type: Shell, Python, SubProcess, Spark, etc.
    TaskType: Shell

    # Upstream Task List
    dependencies: [xxxx]

    # Parameters for task creation
    params:
      name: task1
      other parameters: ...

  -  
    TaskType: Python
    params:
      name: task2
  - ...
  - ... 
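
To make the mapping concrete, here is a minimal sketch of how such a file could be translated into pydolphinscheduler objects (the create_workflow helper, the schema keys, and the handling of the Process param and run fields are assumptions of this proposal, not an existing API):

import yaml

from pydolphinscheduler.core.process_definition import ProcessDefinition
from pydolphinscheduler.tasks.python import Python
from pydolphinscheduler.tasks.shell import Shell

# Map the TaskType field onto pydolphinscheduler task classes.
TASK_TYPES = {"Shell": Shell, "Python": Python}

def create_workflow(path):
    """Build and submit a workflow from a YAML definition like the one above."""
    with open(path) as f:
        config = yaml.safe_load(f)

    process = config["Process"]
    with ProcessDefinition(name=process["name"]) as pd:
        tasks = {}
        # First pass: instantiate every task from its declared type and params.
        for conf in config["Tasks"]:
            task = TASK_TYPES[conf["TaskType"]](**conf["params"])
            tasks[task.name] = task
        # Second pass: wire upstream dependencies once all tasks exist.
        for conf in config["Tasks"]:
            for upstream in conf.get("dependencies", []):
                tasks[upstream] >> tasks[conf["params"]["name"]]
        # Submit the definition, and optionally trigger a run immediately.
        if process.get("run"):
            pd.run()
        else:
            pd.submit()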

Here is a simple example that shows how to use YAML to manage workflows (one YAML file defines one workflow):

# Define the process
Process:
  name: prepare_datas
  param:
      project: /data/project

# Define the tasks under the process
Tasks:
  - 
    TaskType: Shell
    params:
      name: download_data
      command: |
                export PYTHONPATH=${project}
                source ${project}/env/bin/activate
                data_path=${project}/data/daily
                python -m dmsa.data.download ${data_path}
      local_params: 
        - { "prop": "project", "direct": "IN", "type": "VARCHAR", "value": "${project}"}

  - 
    TaskType: Shell
    dependencies: [download_data]
    params:
      name: calc_signals 
      command: |
                export PYTHONPATH=${project}
                source ${project}/env/bin/activate
                data_path=${project}/data/daily
                python -m dmsa.data_processing.calc_signals \
                      --data_path ${data_path} \
                      --name_file ${project}/feature_signal.txt 
      local_params: 
        - { "prop": "project", "direct": "IN", "type": "VARCHAR", "value": "${project}"}


  - 
    TaskType: Shell
    dependencies: [download_data]
    params:
      name: calc_features 
      command: |
                export PYTHONPATH=${project}
                source ${project}/env/bin/activate
                data_path=${project}/data/daily
                python -m dmsa.data_processing.calc_features \
                    --data_path ${data_path} \
                    --name_file ${project}/feature_signal.txt
      local_params:
        - { "prop": "project", "direct": "IN", "type": "VARCHAR", "value": "${project}"}

Alternatively, we can use YAML's native features, such as anchors (&) and aliases (*), to make definitions easier:

# User-defined parameters. It is suggested that all parameters be defined here
Params:
  process_name: &process_name prepare_datas
  project: &project "/data/project"

# YAML-level variables defined here are reused in the configuration below
Variable:
  local_params: &local_params { "prop": "project", "direct": "IN", "type": "VARCHAR", "value": "${project}"}

# Define the process
Process:
  name: *process_name
  param:
      project: *project

# Define the tasks under the process
Tasks:
  - 
    TaskType: Shell
    params:
      name: download_data
      command: |
                export PYTHONPATH=${project}
                source ${project}/env/bin/activate
                data_path=${project}/data/daily
                python -m dmsa.data.download ${data_path}
      local_params: 
        - *local_params 

  - 
    TaskType: Shell
    dependencies: [download_data]
    params:
      name: calc_signals 
      command: |
                export PYTHONPATH=${project}
                source ${project}/env/bin/activate
                data_path=${project}/data/daily
                python -m dmsa.data_processing.calc_signals \
                      --data_path ${data_path} \
                      --name_file ${project}/feature_signal.txt 
      local_params: 
        - *local_params 


  - 
    TaskType: Shell
    dependencies: [download_data]
    params:
      name: calc_features 
      command: |
                export PYTHONPATH=${project}
                source ${project}/env/bin/activate
                data_path=${project}/data/daily
                python -m dmsa.data_processing.calc_features \
                    --data_path ${data_path} \
                    --name_file ${project}/feature_signal.txt
      local_params:
        - *local_params 

A richer approach is to combine this with DolphinScheduler features and add some magic syntax to make it even easier to use. For example, we could read environment variables with $Env{xxxx} and read the contents of files with $File{xxxx} (a possible expansion mechanism is sketched after the example below):

# User-defined parameters. It is suggested that all parameters be defined here
Params:
  process_name: &process_name prepare_datas
  project: &project $Env{STOCK_PROJECT}

# YAML-level variables defined here are reused in the configuration below
Variable:
  local_params: &local_params { "prop": "project", "direct": "IN", "type": "VARCHAR", "value": "${project}"}

# Define the process
Process:
  name: *process_name
  param:
      project: *project

# Define the tasks under the process
Tasks:
  - 
    TaskType: Shell
    params:
      name: download_data
      command: $File{download_data.sh}
      local_params: 
        - *local_params 

  - 
    TaskType: Shell
    dependencies: [download_data]
    params:
      name: calc_signals 
      command: $File{calc_signals.sh}
      local_params: 
        - *local_params 


  - 
    TaskType: Shell
    dependencies: [download_data]
    params:
      name: calc_features 
      command: $File{calc_features.sh}
      local_params:
        - *local_params 
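
As a sketch of how this magic syntax could be implemented, the loader could expand $Env{...} and $File{...} tokens in each string value after the YAML is parsed (the token grammar follows this proposal; the expand_magic helper is hypothetical):

import os
import re

ENV_PATTERN = re.compile(r"\$Env\{([^}]+)\}")
FILE_PATTERN = re.compile(r"\$File\{([^}]+)\}")

def expand_magic(value):
    """Expand $Env{NAME} and $File{path} tokens in one string value
    taken from the parsed YAML configuration."""
    # $Env{NAME} -> the value of the environment variable NAME.
    value = ENV_PATTERN.sub(lambda m: os.environ[m.group(1)], value)
    # $File{path} -> the contents of the referenced file.
    value = FILE_PATTERN.sub(lambda m: open(m.group(1)).read(), value)
    return value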

Once the configuration file is defined, we can use the pydolphinscheduler CLI to load the workflow.
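
A possible invocation might look like the following (the yaml subcommand name is an assumption here; the exact CLI shape depends on the final implementation):

# Hypothetical CLI usage; the subcommand name may differ once implemented.
pydolphinscheduler yaml -f prepare_datas.yaml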

Use case

No response

Related issues

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

Code of Conduct

  • I agree to follow this project's Code of Conduct

@jieguangzhou added the feature and Waiting for reply labels Jul 15, 2022
@github-actions

Thank you for your feedback; we have received your issue. Please wait patiently for a reply.

  • To help us understand your request as soon as possible, please provide detailed information, versions, or pictures.
  • If you haven't received a reply for a long time, you can join our Slack and send your question to the #troubleshooting channel.

@SbloodyS added the backend label and removed the Waiting for reply label Jul 15, 2022
@SbloodyS added the discussion label Jul 15, 2022
@EricGao888
Member

Hi @jieguangzhou, this looks good to me. For some users who are not willing to spend time picking up Python but do have tons of workflows to generate, YAML is a better approach than dragging and connecting tasks through the UI. Just curious, why do you want this feature? Are you a YAML lover? 😄

@jieguangzhou
Member Author

> Hi @jieguangzhou, this looks good to me. For some users who are not willing to spend time picking up Python but do have tons of workflows to generate, YAML is a better approach than dragging and connecting tasks through the UI. Just curious, why do you want this feature? Are you a YAML lover? 😄

If we can use YAML to manage workflows, it might make it easier for engineers to manage workflows and handle version control; Git can then be used to manage and share them.
In addition, after communicating with some overseas engineers, I found that some engineers prefer to use YAML to manage programs, which also suits me.
I think YAML is better than TOML or JSON for this scenario.

@caishunfeng
Contributor

Good idea; it would be better to add some design details.

@ruanwenjun
Member

It's OK for me to define a DSL to create a workflow; some geeks may like this approach, but it may not work well for big workflows.

@jieguangzhou
Member Author

Can we parse the data directly and store it in the database, or do we have to do it some other way?

@jieguangzhou
Member Author

> Good idea; it would be better to add some design details.

OK, I will add the details here in the future.

@jieguangzhou
Member Author

> It's OK for me to define a DSL to create a workflow; some geeks may like this approach, but it may not work well for big workflows.

Thank you for your suggestions. I will think again about how to handle the large-workflow scenario; if it cannot be handled well, users can keep using the original approach for large workflows for the time being.

@zhongjiajie
Member

Hi @jieguangzhou, I added DSIP to this issue because it adds a new mechanism to the Python API. Please follow https://dolphinscheduler.apache.org/en-us/community/DSIP.html to create a new DSIP.

@zhongjiajie
Member

BTW, I think it is a good addition to the Python API.

@zhongjiajie changed the title [Feature][pydolphinscheduler] Dynamically generate workflows from YAML configuration files [Feature][python] Dynamically generate workflows from YAML configuration files Jul 26, 2022
@zhongjiajie changed the title [Feature][python] Dynamically generate workflows from YAML configuration files [DSIP-10][Feature][python] Dynamically generate workflows from YAML configuration files Jul 28, 2022
@zhongjiajie changed the title [DSIP-10][Feature][python] Dynamically generate workflows from YAML configuration files [DSIP-11][Feature][python] Dynamically generate workflows from YAML configuration files Jul 28, 2022
@jieguangzhou changed the title [DSIP-11][Feature][python] Dynamically generate workflows from YAML configuration files [DSIP-11][Feature][python] Use the PydolphinScheduler to dynamically generate workflows from the configuration file Jul 29, 2022
@jieguangzhou changed the title [DSIP-11][Feature][python] Use the PydolphinScheduler to dynamically generate workflows from the configuration file [DSIP-11][Feature][python] Use the pydolphinscheduler to dynamically generate workflows from the configuration file Jul 29, 2022
jieguangzhou added a commit to jieguangzhou/dolphinscheduler that referenced this issue Aug 23, 2022
jieguangzhou added a commit to jieguangzhou/dolphinscheduler that referenced this issue Aug 26, 2022
@EricGao888 added this to the 3.1.0 milestone Aug 30, 2022
jieguangzhou added a commit to jieguangzhou/dolphinscheduler that referenced this issue Sep 3, 2022