sparkoperator.k8s.io/SparkApplication health check does not support dynamicAllocation #7557

Closed
3 tasks done
czchen opened this issue Oct 27, 2021 · 4 comments · Fixed by #11522
Labels
bug Something isn't working

Comments

@czchen
Contributor

czchen commented Oct 27, 2021

Checklist:

  • I've searched in the docs and FAQ for my answer: https://bit.ly/argocd-faq.
  • I've included steps to reproduce the bug.
  • I've pasted the output of argocd version.

Describe the bug

sparkoperator.k8s.io/SparkApplication has a dynamic allocation mode (spec.dynamicAllocation.enabled: true), in which the key spec.executor.instances is not set. However, the current health check script for sparkoperator.k8s.io/SparkApplication compares the number of running executors against spec.executor.instances. So in dynamic allocation mode, a sparkoperator.k8s.io/SparkApplication never reaches Healthy status.
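
To make the failure mode concrete, here is a minimal standalone Lua sketch of the comparison described above. The `obj` table, the executor names, and the `running_executors` helper are hypothetical stand-ins for illustration; only the field names come from the report. With `spec.executor.instances` absent, the value is `nil`, and `nil` never equals a number, so the "all executors running" condition cannot become true.

```lua
-- Hypothetical SparkApplication with dynamic allocation enabled:
-- spec.executor.instances is deliberately absent.
local obj = {
  spec = { dynamicAllocation = { enabled = true }, executor = {} },
  status = {
    applicationState = { state = "RUNNING" },
    executorState = { ["exec-1"] = "RUNNING", ["exec-2"] = "RUNNING" },
  },
}

-- Count executors reported as RUNNING (illustrative helper).
local function running_executors(o)
  local count = 0
  for _, state in pairs(o.status.executorState or {}) do
    if state == "RUNNING" then
      count = count + 1
    end
  end
  return count
end

local count = running_executors(obj)
local executor_instances = obj.spec.executor.instances -- nil in this mode

-- nil == 2 is false, so the Healthy branch is never taken.
print(executor_instances == count)
```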

To Reproduce

  • Set up a sparkoperator.k8s.io/SparkApplication with spec.dynamicAllocation.enabled: true via ArgoCD.
  • sparkoperator.k8s.io/SparkApplication will never reach Healthy status.

Expected behavior

  • sparkoperator.k8s.io/SparkApplication should reach Healthy status when the Spark application is running.

Version

argocd: v2.1.5+a8a6fc8
  BuildDate: 2021-10-20T15:16:40Z
  GitCommit: a8a6fc8dda0e26bb1e0b893e270c1128038f5b0f
  GitTreeState: clean
  GoVersion: go1.16.5
  Compiler: gc
  Platform: linux/amd64
czchen added the bug label Oct 27, 2021
@NivStav-RecoLabs

We are also facing this issue: SparkApplications are stuck in the Pending state when using dynamicAllocation.
argocd version: v2.4.7+81630e6

@cbl315
Contributor

cbl315 commented Oct 20, 2022

We have a workaround for this case: override the health check script as shown below:

    resource.customizations.health.sparkoperator.k8s.io_SparkApplication: |
      health_status = {}
      if obj.status ~= nil then
        if obj.status.applicationState.state ~= nil then
          if obj.status.applicationState.state == "" then
            health_status.status = "Progressing"
            health_status.message = "SparkApplication was added, enqueuing it for submission"
            return health_status
          end
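          -- Guard against specs without a dynamicAllocation block; indexing nil raises an error in Lua.
          if obj.spec.dynamicAllocation ~= nil and obj.spec.dynamicAllocation.enabled == true then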
            if obj.status.applicationState.state == "RUNNING" then
              health_status.status = "Healthy"
              health_status.message = "SparkApplication is Running"
              return health_status
            end
          end
          if obj.status.applicationState.state == "RUNNING" then
            if obj.status.executorState ~= nil then
              count=0
              executor_instances = obj.spec.executor.instances
              for i, executorState in pairs(obj.status.executorState) do
                if executorState == "RUNNING" then
                  count=count+1
                end
              end
              if executor_instances == count then
                health_status.status = "Healthy"
                health_status.message = "SparkApplication is Running"
                return health_status
              end
            end
          end
          if obj.status.applicationState.state == "SUBMITTED" then
            health_status.status = "Progressing"
            health_status.message = "SparkApplication was submitted successfully"
            return health_status
          end
          if obj.status.applicationState.state == "COMPLETED" then
            health_status.status = "Healthy"
            health_status.message = "SparkApplication was Completed"
            return health_status
          end
          if obj.status.applicationState.state == "FAILED" then
            health_status.status = "Degraded"
            health_status.message = obj.status.applicationState.errorMessage
            return health_status
          end
          if obj.status.applicationState.state == "SUBMISSION_FAILED" then
            health_status.status = "Degraded"
            health_status.message = obj.status.applicationState.errorMessage
            return health_status
          end
          if obj.status.applicationState.state == "PENDING_RERUN" then
            health_status.status = "Progressing"
            health_status.message = "SparkApplication is Pending Rerun"
            return health_status
          end
          if obj.status.applicationState.state == "INVALIDATING" then
            health_status.status = "Missing"
            health_status.message = "SparkApplication is in InvalidatingState"
            return health_status
          end
          if obj.status.applicationState.state == "SUCCEEDING" then
            health_status.status = "Progressing"
            health_status.message = [[The driver pod has completed successfully. The executor pods terminate and are cleaned up.
                                      Under these circumstances, we assume the executor pods are completed.]]
            return health_status
          end
          if obj.status.applicationState.state == "FAILING" then
            health_status.status = "Degraded"
            health_status.message = obj.status.applicationState.errorMessage
            return health_status
          end
          if obj.status.applicationState.state == "UNKNOWN" then
            health_status.status = "Progressing"
            health_status.message = "SparkApplication is in UnknownState because either the driver pod or one or all executor pods are in an unknown state"
            return health_status
          end
        end
      end
      health_status.status = "Progressing"
      health_status.message = "Waiting for Executor pods"
      return health_status

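The only change to health.lua is a check for whether dynamicAllocation is enabled; if it is, the application is reported as Healthy as soon as it is RUNNING, without counting executors: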

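          -- Guard against specs without a dynamicAllocation block; indexing nil raises an error in Lua.
          if obj.spec.dynamicAllocation ~= nil and obj.spec.dynamicAllocation.enabled == true then
            if obj.status.applicationState.state == "RUNNING" then
              health_status.status = "Healthy"
              health_status.message = "SparkApplication is Running"
              return health_status
            end
          end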

And it works for us.

@eugen-fried
Contributor

Hi there, I've attached a PR that checks both ways of configuring dynamic allocation (the Spark Operator API and plain Spark properties). It also takes DStreams applications into account and includes unit tests for all cases.
@pdrastil could you please help review it?
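
As a rough illustration of the two configuration styles mentioned in this comment, the sketch below detects dynamic allocation from either the Spark Operator API field or plain Spark properties. It is not the code from the PR: the helper name `dynamic_allocation_enabled` is hypothetical, and it assumes the Spark properties live in `spec.sparkConf` as string values under `spark.dynamicAllocation.enabled` (and `spark.streaming.dynamicAllocation.enabled` for DStreams).

```lua
-- Hypothetical helper, not taken from the PR.
local function dynamic_allocation_enabled(spec)
  if spec == nil then
    return false
  end
  -- Spark Operator API style: spec.dynamicAllocation.enabled is a boolean.
  if spec.dynamicAllocation ~= nil and spec.dynamicAllocation.enabled == true then
    return true
  end
  -- Plain Spark properties style: sparkConf values are strings.
  local conf = spec.sparkConf
  if conf ~= nil then
    if conf["spark.dynamicAllocation.enabled"] == "true"
        or conf["spark.streaming.dynamicAllocation.enabled"] == "true" then
      return true
    end
  end
  return false
end

-- Example inputs (all hypothetical).
local api_style = { dynamicAllocation = { enabled = true } }
local conf_style = { sparkConf = { ["spark.dynamicAllocation.enabled"] = "true" } }
local static_style = { executor = { instances = 2 } }

print(dynamic_allocation_enabled(api_style))
print(dynamic_allocation_enabled(conf_style))
print(dynamic_allocation_enabled(static_style))
```

Checking each intermediate table for `nil` before indexing keeps the helper safe for applications that use a fixed executor count and omit both settings.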

@crenshaw-dev
Member

I plan to merge and release the fix sometime next week.

@pdrastil has tested and verified this on their installation. If anyone would like to test the new health check for their use case (or propose additional tests for the PR), it would be appreciated!

crenshaw-dev added a commit that referenced this issue Jan 27, 2023
…llocation is enabled (#7557) (#11522)

Signed-off-by: Yevgeniy Fridland <yevg.mord@gmail.com>
Co-authored-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>
emirot pushed a commit to emirot/argo-cd that referenced this issue Jan 27, 2023
…llocation is enabled (argoproj#7557) (argoproj#11522)

Signed-off-by: Yevgeniy Fridland <yevg.mord@gmail.com>
Co-authored-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>
Signed-off-by: emirot <emirot.nolan@gmail.com>
schakrad pushed a commit to schakrad/argo-cd that referenced this issue Mar 14, 2023
…llocation is enabled (argoproj#7557) (argoproj#11522)

Signed-off-by: Yevgeniy Fridland <yevg.mord@gmail.com>
Co-authored-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>
Signed-off-by: schakrad <chakradari.sindhu@gmail.com>