
Applying queue with SparkSubmitOperator is ignored from 2.6.0 onwards #38461

Closed
1 of 2 tasks
djuarezg opened this issue Mar 25, 2024 · 4 comments · Fixed by #38852

Comments

@djuarezg

djuarezg commented Mar 25, 2024

Apache Airflow version

Other Airflow 2 version (please specify below)

If "Other Airflow 2 version" selected, which one?

2.8.1

What happened?

We extend SparkSubmitOperator and specify a `queue` to select which workers run our tasks.

```python
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

(...)

class CustomSparkSubmitOperator(SparkSubmitOperator):
    """Extended spark submit operator."""

    template_fields = list(SparkSubmitOperator.template_fields)
    template_fields.append("_java_class")


def spark(
    self,
    application,
    java_class,
    task_id,
    priority_weight=1,
    alert_cc=None,
    auto_tags=None,
    **task_kwargs,
):
    """Create a task that runs a spark job."""
    _LOGGER.debug(f"Creating task: {task_id}")

    env = get_env()
    default_bp_config_path = (
        f"/data/business-parameters/business-parameters-master/data/all.{env}.conf"
    )
    extra_java_options = nlstrip(
        f"""
        -DApplication={application}
        (...)
        """
    )

    logstore_options = (
        f"{{{{ params.get('s3a_logstore_class', '{DEFAULT_LOGSTORE}') }}}}"
    )

    jar_path = f"{JAR_FOLDER}{{{{ params.jar_name }}}}"

    conf = {
        "spark.driver.memory": "{{ params.driver_memory }}",
        (...)
    }

    task = CustomSparkSubmitOperator(
        task_id=task_id,
        java_class=java_class,
        conn_id=f"spark_{env}_client",
        priority_weight=priority_weight,
        queue="airflow_spark",
        application=jar_path,
        verbose=True,
        spark_binary=SPARK_BINARY_PATH,
        conf=conf,
        env_vars=get_task_env(None),
        **get_task_kwargs(
            task_kwargs,
            alert_cc=alert_cc,
            dag=self.default_dag,
        ),
    )
    task.auto_tags = ["spark"]
    if auto_tags:
        task.auto_tags.extend(auto_tags)
    return task
```

When this operator is used, at least the `queue` value is reset to the default queue shared by all DAGs instead of `queue="airflow_spark"`.

Before provider 4.6.0, this `queue` parameter overrode the BaseOperator value, where `queue` is an Airflow (executor) queue.
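A minimal, self-contained illustration (not the provider's actual code) of what this failure mode looks like: a subclass consuming the `queue` keyword for its own purpose instead of forwarding it to the base class, so the Airflow executor queue silently falls back to the default. All class names here are stand-ins, not Airflow code:

```python
# Illustration (hypothetical classes, not the provider's real code) of how a
# subclass can shadow a base-class parameter: the subclass consumes "queue"
# for its own purpose and never forwards it, so the base class falls back to
# its default.

DEFAULT_QUEUE = "default"  # stand-in for Airflow's [operators] default_queue

class BaseOperatorSketch:
    def __init__(self, *, queue=DEFAULT_QUEUE, **kwargs):
        self.queue = queue  # the Airflow (executor) queue

class SparkSubmitSketch(BaseOperatorSketch):
    def __init__(self, *, queue=None, **kwargs):
        # "queue" is consumed here (e.g. as a Spark/YARN queue) instead of
        # being forwarded to the base class, so the executor queue resets.
        super().__init__(**kwargs)
        self._yarn_queue = queue

task = SparkSubmitSketch(queue="airflow_spark")
print(task.queue)        # -> "default": the user's executor queue was lost
print(task._yarn_queue)  # -> "airflow_spark": it went somewhere else entirely
```

This matches the observed symptom: the user-supplied value is accepted without error but never reaches the attribute the scheduler reads.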

What do you think should happen instead?

`queue` should be kept and used to determine which worker the task runs on.

How to reproduce

Set a `queue` on SparkSubmitOperator with provider 4.5.0 or earlier vs. 4.6.0 or later and compare which queue the task is assigned to.
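A quick check for whether the installed provider falls in the affected range (a sketch: it only encodes the 4.6.0 lower bound reported here; the fixed upper bound depends on which release ships the fix from #38852):

```python
# Sketch: report whether the installed apache-airflow-providers-apache-spark
# version is >= 4.6.0, the first version reported to ignore the queue.
from importlib.metadata import PackageNotFoundError, version

def is_affected(ver: str) -> bool:
    # Compare only (major, minor); the fix's upper bound is not encoded here.
    major, minor = (int(p) for p in ver.split(".")[:2])
    return (major, minor) >= (4, 6)

try:
    installed = version("apache-airflow-providers-apache-spark")
    status = "possibly affected" if is_affected(installed) else "not affected"
    print(f"{installed}: {status}")
except PackageNotFoundError:
    print("apache-airflow-providers-apache-spark is not installed")
```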

Operating System

```
$ cat /etc/os-release
NAME="Rocky Linux"
VERSION="8.8 (Green Obsidian)"
```

Versions of Apache Airflow Providers

```
$ /data/airflow/env/bin/pip list | grep providers
apache-airflow-providers-amazon          8.19.0
apache-airflow-providers-apache-spark    4.5.0
apache-airflow-providers-celery          3.6.1
apache-airflow-providers-common-io       1.3.0
apache-airflow-providers-common-sql      1.11.1
apache-airflow-providers-elasticsearch   5.3.3
apache-airflow-providers-ftp             3.7.0
apache-airflow-providers-http            4.10.0
apache-airflow-providers-imap            3.5.0
apache-airflow-providers-postgres        5.10.2
apache-airflow-providers-redis           3.6.0
apache-airflow-providers-slack           8.6.1
apache-airflow-providers-smtp            1.6.1
apache-airflow-providers-sqlite          3.7.1
apache-airflow-providers-ssh             3.10.1
```

Deployment

Virtualenv installation

Deployment details

No response

Anything else?

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@djuarezg added the area:core, kind:bug, and needs-triage labels Mar 25, 2024

boring-cyborg bot commented Mar 25, 2024

Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise PR to address this issue please do so, no need to wait for approval.

@djuarezg
Author

Probably caused by #35911

@RNHTTR removed the needs-triage label Apr 5, 2024
@pateash
Contributor

pateash commented Apr 8, 2024

Checking this; can anyone please assign it to me?

@pateash
Contributor

pateash commented Apr 9, 2024

Thanks @Lee-W @RNHTTR @djuarezg for flagging this.
I have raised a PR with the fix; please have a look:
#38852
