Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Fixed parent level memoization broken. Fixes #11612 #11623

Merged
merged 1 commit into from
Aug 23, 2023

Conversation

shmruin
Copy link
Contributor

@shmruin shmruin commented Aug 19, 2023

Fixes #11612

Motivation

Previous commit of #11456 PR makes some error as described in #11612.
When there is a memoization in template level, it seems memoization is not applied to child.
When running workflow like in Verification,

<< Correct - Before apply commit of #11456 >>

  • First Run (without cache)
    스크린샷 2023-08-19 오후 1 55 56
  • Second Run (with cache)
    스크린샷 2023-08-19 오후 1 56 05

<< Current >>

  • First Run (without cache) - Same
    스크린샷 2023-08-19 오후 1 55 56
  • Second Run (with cache) - Broken
    스크린샷 2023-08-19 오후 1 56 31

It seems memoized template not escape the pending status.
So it is required for this to work as it works originally.

Modifications

The reason why : I omitted the fulfilled condition of memoization in executeTemplate

BEFORE commiting #11456 , executeTemplate function In operator.go works like below.
(In aspect of working synchronization & memoization)

// Original code before  #11456 applied
func (woc *wfOperationCtx) executeTemplate( .... ) {
	// ...
	
	// DO something with memoization
	if node == nil && processedTmpl.Memoize != nil {

		// No returns if not error
	}
	
	// Check fulfilled node phrase
	// 1. release synchronization
	// 2. do metrics process if exist
	// 3. RETURN if this node is fulfilled
	// 3. set 'startedAt' for memoization
	if node != nil {
		if node.Fulfilled() {
			woc.controller.syncManager.Release(woc.wf, node.ID, processedTmpl.Synchronization)

			woc.log.Debugf("Node %s already completed", nodeName)
			if processedTmpl.Metrics != nil {
				// Check if this node completed between executions. If it did, emit metrics. If a node completes within
				// the same execution, its metrics are emitted below.
				// We can infer that this node completed during the current operation, emit metrics
				if prevNodeStatus, ok := woc.preExecutionNodePhases[node.ID]; ok && !prevNodeStatus.Fulfilled() {
					localScope, realTimeScope := woc.prepareMetricScope(node)
					woc.computeMetrics(processedTmpl.Metrics.Prometheus, localScope, realTimeScope, false)
				}
			}
			return node, nil
		}
		woc.log.Debugf("Executing node %s of %s is %s", nodeName, node.Type, node.Phase)
		// Memoized nodes don't have StartedAt.
		if node.StartedAt.IsZero() {
			node.StartedAt = metav1.Time{Time: time.Now().UTC()}
			node.EstimatedDuration = woc.estimateNodeDuration(node.Name)
			woc.wf.Status.Nodes.Set(node.ID, *node)
			woc.updated = true
		}
	}
	
	// ----- Check 3 things with timeout -----
	
	// Check if we took too long operating on this workflow and immediately return if we did
	if time.Now().UTC().After(woc.deadline) {
		woc.log.Warnf("Deadline exceeded")
		woc.requeue()
		return node, ErrDeadlineExceeded
	}

	// Check the template deadline for Pending nodes
	// This check will cover the resource forbidden, synchronization scenario,
	// In above scenario, only Node will be created in pending state
	_, err = woc.checkTemplateTimeout(processedTmpl, node)
	if err != nil {
		woc.log.Warnf("Template %s exceeded its deadline", processedTmpl.Name)
		return woc.markNodePhase(nodeName, wfv1.NodeFailed, err.Error()), err
	}

	// Check if we exceeded template or workflow parallelism and immediately return if we did
	if err := woc.checkParallelism(processedTmpl, node, opts.boundaryID); err != nil {
		return node, err
	}
	
	// -----------------------------------
	
	
	// DO something with synchronization
	if processedTmpl.Synchronization != nil {
		
		// Can be returned to acquire lock
	}
	
	// ...
}

And as I just move the if node == nil && processedTmpl.Memoize != nil ... part below the if processedTmpl.Synchronization != nil ..., memoization template node never have a chance to checked fulfilled.

So the node goes down to executeTemplate logic, until switch processedTmpl.GetType() exist.
It makes the child step (or dag or so on) created even memoization hit.

To solve this problem, I need to reuse the Check fulfilled node phrase in original place and after the memoization checked.

  • Keep original place with little refactoring (At the very first place of executeTemplate)
    • Because don't want to mess fulfilled logic too much.
    • Synchronization fulfilled should be checked before synchronization phase.
  • Additional place after the memoization
    • As memoization cannot use Original place, needed to check it is fulfilled seperately.
    • Set 'startetdAt' field correctly, which is only for memoization node.

If this works correctly, I think the other logics will not be affected as Original place code still exist, and only memoization will use the other.

Verification

Test with UI for both STEP and DAG case.
Tests are also applied in operator_test.go

STEP

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: memoized-entrypoint-
spec:
  entrypoint: entrypoint
  templates:
  - name: entrypoint
    memoize:
      key: "entrypoint-key-1"
      cache:
        configMap:
          name: cache-top-entrypoint
    outputs:
        parameters:
          - name: url
            valueFrom:
              expression: |
                'https://argo-workflows.company.com/workflows/namepace/' + '{{workflow.name}}' + '?tab=workflow'
    steps:
      - - name: whalesay
          template: whalesay

  - name: whalesay
    container:
      image: docker/whalesay:latest
      command: [sh, -c]
      args: ["cowsay hello_world `date` > /tmp/hello_world.txt"]
    outputs:
      parameters:
      - name: hello
        valueFrom:
          path: /tmp/hello_world.txt

DAG

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: memoized-entrypoint-
spec:
  entrypoint: entrypoint
  templates:
  - name: entrypoint
    dag:
      tasks:
      - name: whalesay-task
        template: whalesay
    memoize:
      key: "entrypoint-key-1"
      cache:
        configMap:
          name: cache-top-entrypoint
    outputs:
      parameters:
      - name: url
        valueFrom:
          expression: |
            'https://argo-workflows.company.com/workflows/namepace/' + '{{workflow.name}}' + '?tab=workflow'
            
  - name: whalesay
    container:
      image: docker/whalesay:latest
      command: [sh, -c]
      args: ["cowsay hello_world $(date) > /tmp/hello_world.txt"]
    outputs:
      parameters:
      - name: hello
        valueFrom:
          path: /tmp/hello_world.txt

Also test the case with synchronization with memoization, which deals with #11456

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: example-steps-simple
spec:
  entrypoint: main

  templates:
    - name: main
      steps:
        - - name: job-1
            template: sleep
            arguments:
              parameters:
                - name: sleep_duration
                  value: 30
          - name: job-2
            template: sleep
            arguments:
              parameters:
                - name: sleep_duration
                  value: 15

    - name: sleep
      synchronization:
        mutex:
          name: mutex-example-steps-simple
      inputs:
        parameters:
          - name: sleep_duration
      script:
        image: alpine:latest
        command: [/bin/sh]
        source: |
          echo "Sleeping for {{ inputs.parameters.sleep_duration }}"
          sleep {{ inputs.parameters.sleep_duration }}
      memoize:
        key: "memo-key-1"
        cache:
          configMap:
            name: cache-example-steps-simple

For Reviewers

Please tell me if you can expect any side effect by this change. Thanks.

Signed-off-by: shmruin <meme_hm@naver.com>
@terrytangyuan terrytangyuan merged commit 9693c02 into argoproj:master Aug 23, 2023
24 checks passed
shmruin added a commit to shmruin/argo-workflows that referenced this pull request Aug 23, 2023
terrytangyuan pushed a commit that referenced this pull request Aug 23, 2023
dpadhiar pushed a commit to dpadhiar/argo-workflows that referenced this pull request May 9, 2024
…oproj#11623) (argoproj#11660)

Signed-off-by: shmruin <meme_hm@naver.com>
Signed-off-by: Dillen Padhiar <dillen_padhiar@intuit.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

parent level memoization is broken
3 participants