[FEATURE] DAB: Support for YAML Dictionary variables #1298

Closed · stevenayers-bge opened this issue Mar 20, 2024 · 15 comments · Fixed by #1467
Labels: DABs (DABs related issues), Enhancement (New feature or request)

@stevenayers-bge

When writing Databricks Asset Bundles, it would be great if we could define things like job cluster configuration as one config block:

```
targets:
  dev:
    variables:
      spark_version: 14.3.x-scala2.12
      pool_id: 12334
      cluster_config:
        spark_version: ${var.spark_version}
        instance_pool_id: ${var.pool_id}
        data_security_mode: SINGLE_USER
        runtime_engine: PHOTON
        spark_conf:
          spark.master: local[*, 4]
          spark.databricks.cluster.profile: singleNode
        custom_tags:
          ResourceClass: SingleNode
        spark_env_vars:
          PYSPARK_PYTHON: /databricks/python3/bin/python3
        num_workers: 0
  prod:
    variables:
      spark_version: 14.3.x-scala2.12
      pool_id: aaabbbbcccc
      cluster_config:
        spark_version: ${var.spark_version}
        instance_pool_id: ${var.pool_id}
        data_security_mode: SINGLE_USER
        runtime_engine: PHOTON
        autoscale:
          min_workers: 10
          max_workers: 999999
        spark_env_vars:
          PYSPARK_PYTHON: /databricks/python3/bin/python3

resources:
  jobs:
    job1:
      name: job1
      git_source:
        git_url: ${var.git_url}
        git_provider: ${var.git_provider}
        git_branch: ${var.git_branch}
      tasks:
        - task_key: task1
          spark_python_task:
            python_file: task1.py
            source: GIT
            parameters:
              - --environment
              - ${bundle.target}
          job_cluster_key: cluster
      job_clusters:
        - job_cluster_key: cluster
          new_cluster: ${var.cluster_config}
```

It would also be nice to be able to group variables using this method:

```
variables:
  instance_pools:
    description: The pools to use for jobs
    default:
      driver:
        large: id here
        small: id here
      worker:
        large: id here
        small: id here

  cluster_config:
    default:
      spark_version: ${var.spark_version}
      driver_instance_pool_id: ${var.instance_pools.driver.large}
      instance_pool_id: ${var.instance_pools.worker.large}
      data_security_mode: SINGLE_USER
      runtime_engine: PHOTON
      autoscale:
        min_workers: 10
        max_workers: 999999
      spark_env_vars:
        PYSPARK_PYTHON: /databricks/python3/bin/python3
```

Please let me know if this is in progress, and either way I'll see if I can help contribute :)

stevenayers-bge added the DABs label on Mar 20, 2024
andrewnester added the Enhancement label on Mar 21, 2024
@andrewnester (Contributor)

@stevenayers-bge are YAML anchors helpful in your case?

@stevenayers-bge (Author)

> @stevenayers-bge are YAML anchors helpful in your case?

Thanks for the reply 🙂 They do help, but to be honest it also feels messy: maintaining your basic-type variables in one way and your complex types in another.

My understanding is that you can't access nested properties of a YAML anchor either.
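
For illustration, a minimal sketch of the anchor approach (the `x-cluster` key and its values are made up): an anchored mapping can only be aliased as a whole, and YAML has no syntax for referencing a single nested key of an anchor.

```
# reusing a whole mapping via an anchor works:
x-cluster: &cluster
  spark_version: 14.3.x-scala2.12
  instance_pool_id: 12334

resources:
  jobs:
    job1:
      job_clusters:
        - job_cluster_key: main
          new_cluster: *cluster          # OK: aliases the entire mapping
          # new_cluster: *cluster.spark_version   # invalid: aliases cannot reach nested keys
```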

@stevenayers-bge (Author)

@andrewnester it's also worth mentioning that YAML anchors are only valid in the file they are defined in, so you can't define them where you define your variables, and then re-use them elsewhere.
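
A concrete sketch of that limitation (file names are illustrative): anchors are resolved per YAML document, so an alias in an included file never sees an anchor defined in `databricks.yml`.

```
# databricks.yml
include:
  - resources/*.yml

x-cluster: &cluster
  spark_version: 14.3.x-scala2.12

# resources/job1.yml -- fails to load: the alias below has no matching
# anchor in this file
resources:
  jobs:
    job1:
      job_clusters:
        - job_cluster_key: main
          new_cluster: *cluster
```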

stevenayers-bge changed the title from "DAB: Support for YAML Dictionary variables" to "[FEATURE] DAB: Support for YAML Dictionary variables" on Mar 24, 2024
@andrewnester (Contributor)

Indeed, you're correct on the points above. YAML anchors can be a solution in certain cases. Other than that, we've received multiple similar requests and are planning to introduce functionality that should improve the experience.

@stevenayers-bge (Author)

> Indeed, you're correct on the points above. YAML anchors can be a solution in certain cases. Other than that, we've received multiple similar requests and are planning to introduce functionality that should improve the experience.

@andrewnester ok, thanks, that would be great. Is there a PR for this work already? If not, I can look at kicking it off.

@andrewnester (Contributor)

There's no PR for this yet; we still need to figure out certain details about how to surface this functionality in DABs configuration.

@kenmyers-8451

> @stevenayers-bge are YAML anchors helpful in your case?

> @andrewnester it's also worth mentioning that YAML anchors are only valid in the file they are defined in, so you can't define them where you define your variables, and then re-use them elsewhere.

@andrewnester I saw that you said you've gotten similar requests, but I just wanted to add one more data point. I was actually about to open a similar feature request about job cluster definitions when I found this thread. My team is running into the exact problem @stevenayers-bge mentions with anchors. We have been using a single bundle file with anchors to define the job cluster definitions, which we then reference in each of the jobs. However, the file is getting really long (5 jobs so far, each with many tasks, and more to come), so we wanted to try breaking the jobs out into their own files. That throws an error right away, because the alias references in the sub-files no longer have their anchor definitions, and copying the anchor definitions into each sub-file doesn't make sense either: we would then have to update all of those files any time we change the cluster.

So the first idea that popped into my head was: what if the job clusters were defined as a resource or a similar top-level item? I also like the dictionary idea suggested here.

@stevenayers-bge (Author) commented Apr 6, 2024

@kenmyers-8451 👍 we have 9 jobs at the moment. We're defining the YAML anchors in each of our environment yml files and then applying them at target level (target > resources > jobs > job_name), but it's already starting to look really messy.

@andrewnester I'm keen to help contribute this feature if I get time. Do you know which submodules would need changes? Is it all in bundle/config/variable and bundle/config/mutator?

@andrewnester (Contributor) commented Apr 10, 2024

@stevenayers-bge @kenmyers-8451 Thanks everyone for the interest in the feature and the additional context!

To be honest, the feature and its implementation are a bit tricky, as we need to take care of a few things: how we do dynamic value normalisation, how we validate that a variable follows the correct structure (like a cluster definition), how we keep it backward compatible with current variable definitions, and so on. These are not trivial questions, but we as a team will pick up designing and implementing this feature in the near future.

Thanks for your patience :)

@oschistad

I would like to tack onto this issue the need for a dynamic way to switch between existing clusters and job clusters, so you can re-use the same resource / job manifests both in development and in production.

This is not actually possible today without some sort of templating: the YAML is different for each cluster type, which prevents you from simply overriding at the target level.

As mentioned elsewhere, YAML anchors do not help here, as the resource manifests live separately from the targets and you cannot anchor between files.
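
For a concrete picture (a hypothetical task; the field names are the standard Jobs API ones): a task attaches to either an `existing_cluster_id` or a `job_cluster_key` backed by a `new_cluster` definition, so the two variants are structurally different subtrees and a target-level override cannot simply swap one for the other.

```
# development: attach tasks to an existing all-purpose cluster
tasks:
  - task_key: etl
    existing_cluster_id: 1234-567890-abcdefgh

# production: use a per-job cluster -- a different YAML shape entirely
tasks:
  - task_key: etl
    job_cluster_key: main
job_clusters:
  - job_cluster_key: main
    new_cluster:
      spark_version: 14.3.x-scala2.12
      num_workers: 4
```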

@stevenayers-bge (Author)

@pietern @andrewnester does #1398 add this feature?

@andrewnester (Contributor)

@stevenayers-bge no, the PR you mentioned addresses a schema issue when primitive non-string variables are used.
We're picking up the work on complex variables this month, so we will keep this thread updated.

@andrewnester (Contributor)

Hey everyone! We have a PR in progress adding support for complex variables; feel free to follow it for any updates:
#1467

@wradstok commented Jun 20, 2024

Nice work @andrewnester, can't wait to be able to use this.

Does the PR also support variables that are not at the top level (for instance, inside a target)? An example that would be useful for us would be having some default cluster settings per environment.

```
bundle:
  name: example

resources:
  jobs:
    SOME_JOB:
      name: some_job
      job_clusters:
        - job_cluster_key: some_job_cluster
          new_cluster: ${var.cluster}
      tasks:
        - task_key: test
          job_cluster_key: some_job_cluster

targets:
  dev:
    mode: development
    workspace:
      host: https://some-dev-host
    variables:
      cluster:
        description: "Common cluster definition for all jobs"
        type: complex
        default:
          spark_version: 15.1.x-scala2.12
          spark_env_vars:
            ENVIRONMENT: DEV

  prod:
    mode: production
    workspace:
      host: https://some-prod-host
    variables:
      cluster:
        description: "Common cluster definition for all jobs"
        type: complex
        default:
          spark_version: 15.1.x-scala2.12
          spark_env_vars:
            ENVIRONMENT: PROD
```

@andrewnester (Contributor)

@wradstok thank you! Yes, with this change it will be possible to use variables in target overrides as well.
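
For readers following along, a sketch of what a target override looks like with the shortcut form mentioned in the PR's checklist (values here are illustrative): the target assigns the value directly, without repeating `default:` or `type:`.

```
variables:
  cluster:
    description: "A cluster definition"
    type: complex
    default:
      spark_version: 13.2.x-scala2.11
      num_workers: 2

targets:
  dev:
    variables:
      cluster:            # shortcut: override the value directly
        spark_version: 14.3.x-scala2.12
        num_workers: 1
```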

shreyas-goenka self-assigned this on Jun 20, 2024
github-merge-queue bot pushed a commit that referenced this issue on Jun 26, 2024
## Changes
Added support for complex variables

Now it's possible to add and use complex variables as shown below

```
bundle:
  name: complex-variables

resources:
  jobs:
    my_job:
      job_clusters:
        - job_cluster_key: key
          new_cluster: ${var.cluster}
      tasks:
      - task_key: test
        job_cluster_key: key

variables:
  cluster:
    description: "A cluster definition"
    type: complex
    default:
      spark_version: "13.2.x-scala2.11"
      node_type_id: "Standard_DS3_v2"
      num_workers: 2
      spark_conf:
        spark.speculation: true
        spark.databricks.delta.retentionDurationCheck.enabled: false
```

Fixes #1298

- [x] Support for complex variables
- [x] Allow variable overrides (with a shortcut) in targets
- [x] Don't allow complex variables to be provided via flag or environment variable
- [x] Fail validation if a complex value is used but `type: complex` is not provided
- [x] Support using variables inside complex variables

## Tests
Added unit tests

---------

Co-authored-by: shreyas-goenka <88374338+shreyas-goenka@users.noreply.github.com>