Local mode fails when referencing FinalMetricDataList #5201

@drosin

Describe the bug
In local mode, a pipeline whose ConditionStep references FinalMetricDataList fails: the referenced property is reported as undefined.

To reproduce

To create and run the pipeline:

import os
import boto3
import sagemaker
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.fail_step import FailStep
from sagemaker.workflow.conditions import ConditionGreaterThanOrEqualTo
from sagemaker.workflow.condition_step import ConditionStep
from sagemaker.workflow.pipeline_context import LocalPipelineSession
from sagemaker.workflow.steps import TrainingStep
from sagemaker.pytorch.estimator import PyTorch

execution_role = "arn:aws:iam::123456789012:role/MyPlaceholderRole"
boto_session = boto3.session.Session(region_name="eu-central-1")

localstack_hostname = os.getenv("LOCALSTACK_HOSTNAME")
localstack_edge_port = os.getenv("LOCALSTACK_EDGE_PORT")

sagemaker_session = LocalPipelineSession(
    boto_session=boto_session,
    default_bucket="some-bucket-name",
    s3_endpoint_url=f"http://{localstack_hostname}:{localstack_edge_port}",
)

train_model_step = TrainingStep(
    name="training_step",
    estimator=PyTorch(
        sagemaker_session=sagemaker_session,
        py_version="py310",
        framework_version="2.2",
        role=execution_role,
        instance_count=1,
        entrypoint=["python3", "/opt/ml/code/main.py"],
        instance_type="ml.c5.xlarge",
        entry_point="train.py",
        source_dir="src/steps/train_model_step",
        metric_definitions=[
            {
                "Name": "ganloss",
                "Regex": "GAN_loss=(0.138318);",
            },
        ],
    ),
)
fail_step = FailStep(
    name="FailPipeline",
    error_message="Model score did not meet the required threshold.",
)
condition_step = ConditionStep(
    name="CheckMetrics",
    conditions=[
        ConditionGreaterThanOrEqualTo(
            left=train_model_step.properties.FinalMetricDataList[
                "ganloss"
            ].Value,
            right=0,
        )
    ],
    depends_on=[train_model_step],
    if_steps=[],
    else_steps=[fail_step],
)
pipeline = Pipeline(
    name="my_pipeline",
    steps=[train_model_step, condition_step],
    sagemaker_session=sagemaker_session,
)
pipeline.upsert(
    role_arn=execution_role,
)
execution = pipeline.start()

This is the file that is executed by model training:
src/steps/train_model_step/train.py

    print(
        "GAN_loss=0.138318;  Scaled_reg=2.654134; disc:[-0.017371,0.102429] real 93.3% gen 0.0% disc-combined=0.000000; disc_train_loss=1.374587;  Loss = 16.020744;  Iteration 0 took 0.704s;  Elapsed=0s"
    )
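As a sanity check independent of SageMaker, the regex from `metric_definitions` can be applied directly to the line that `train.py` prints. The sketch below is only an illustration: the helper name `extract_metric` is hypothetical, and it uses plain `re.search`, which is an assumption about how a metric regex would be applied, not a description of what the SDK does internally.

```python
import re
from typing import Optional

# Regex copied verbatim from metric_definitions in the pipeline script.
METRIC_REGEX = r"GAN_loss=(0.138318);"

# Beginning of the log line printed by train.py (truncated for brevity).
LOG_LINE = "GAN_loss=0.138318;  Scaled_reg=2.654134; disc:[-0.017371,0.102429]"


def extract_metric(regex: str, line: str) -> Optional[float]:
    """Hypothetical helper: return the first capture group as a float, or None."""
    match = re.search(regex, line)
    return float(match.group(1)) if match else None


ganloss = extract_metric(METRIC_REGEX, LOG_LINE)
print(ganloss)
```

The pattern does capture the printed value, so the regex itself does not appear to be the reason the metric is missing. Note that the unescaped `.` matches any character, which makes the pattern slightly looser than a literal match.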

Furthermore, a localstack container needs to be running:

docker run -it -e SERVICES=s3,sts,kms -p4566:4566 -p 4571:4571 localstack/localstack:latest
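The pipeline script reads `LOCALSTACK_HOSTNAME` and `LOCALSTACK_EDGE_PORT` from the environment; if either is unset, the f-string produces `http://None:None`. A minimal sketch of a fallback is shown here, assuming the container is reachable on `localhost` at port 4566 as published by the `docker run` command above. The function name `localstack_endpoint` and the default values are assumptions for illustration, not part of the original script.

```python
import os
from typing import Mapping


def localstack_endpoint(env: Mapping[str, str] = os.environ) -> str:
    """Build the S3 endpoint URL, falling back to assumed local defaults."""
    # Defaults are assumptions matching the port mapping -p4566:4566.
    host = env.get("LOCALSTACK_HOSTNAME", "localhost")
    port = env.get("LOCALSTACK_EDGE_PORT", "4566")
    return f"http://{host}:{port}"


print(localstack_endpoint())
```

The result could then be passed as `s3_endpoint_url` when constructing the `LocalPipelineSession`.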

Expected behavior
I expect the pipeline to run successfully in local mode without errors. Currently, it only succeeds when run on SageMaker without local mode.

Screenshots or logs

[06/10/25 11:02:32] INFO     ===== Job Complete =====  image.py:325
                    INFO     Pipeline step 'training_step' SUCCEEDED.  entities.py:798
                    INFO     Starting pipeline step: 'CheckMetrics'  entities.py:807
                    INFO     Pipeline step 'CheckMetrics' FAILED. Failure message is: {'Get': "Steps.training_step.FinalMetricDataList['ganloss'].Value"} is undefined.  entities.py:802
                    INFO     Pipeline execution 24084683-6ab9-4da1-a4b2-842e818630bd FAILED because step 'CheckMetrics' failed.
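One plausible reading of the failure message is that the local training job's description contains no `FinalMetricDataList` entry named `ganloss`; this is an assumption and is not confirmed by the logs. The sketch below illustrates that situation with a hypothetical lookup helper, `final_metric_value`, over dictionaries shaped like a `DescribeTrainingJob` response. Both sample dictionaries are made up for illustration, and the helper is not the SDK's resolution logic.

```python
from typing import Any, Optional


def final_metric_value(description: dict[str, Any], metric_name: str) -> Optional[float]:
    """Hypothetical helper: find a metric by name, or return None if absent."""
    for entry in description.get("FinalMetricDataList", []):
        if entry.get("MetricName") == metric_name:
            return entry.get("Value")
    return None


# Illustrative description that includes the metric.
with_metric = {
    "TrainingJobStatus": "Completed",
    "FinalMetricDataList": [{"MetricName": "ganloss", "Value": 0.138318}],
}

# Illustrative description with no metrics at all (assumed local-mode case).
without_metric = {"TrainingJobStatus": "Completed"}

print(final_metric_value(with_metric, "ganloss"))
print(final_metric_value(without_metric, "ganloss"))
```

If the second case is what happens in local mode, the condition's left-hand side has nothing to resolve to, which would be consistent with the "is undefined" message.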

System information
  • SageMaker Python SDK version: 2.243.3
  • Python version: 3.12
