Fix training artifacts for 2GB+ models and `MSELoss` #22414

jkbeavers · 2024-10-11T23:41:23Z

Description

`generate_artifacts` fails when creating training artifacts for a model that uses external data and `MSELoss`.

New training `Block`s are built against a single global base model, and `onnx.save` destroys any external data on the model it saves. As a result, any loss block (e.g. `MSELoss`) that builds more than one sub-`Block` fails validation because the external data is missing, and an exception is raised.

Fix

Saving using a deep copy of the global model circumvents this at the cost of holding 2x the model size in memory.
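
A minimal sketch of why saving a deep copy helps. It does not use onnx: `ToyModel` and `destructive_save` are hypothetical stand-ins for the global `ModelProto` and for a save that strips tensor data from the object it writes, so this shows only the pattern, not the actual change in this PR.

```python
import copy


class ToyModel:
    """Hypothetical stand-in for the shared base model and its large tensors."""

    def __init__(self, weights):
        self.weights = dict(weights)  # tensor name -> raw bytes held in memory


def destructive_save(model, store):
    """Toy save (assumption): moves each payload to `store` and clears it in memory."""
    for name in list(model.weights):
        store[name] = model.weights[name]
        model.weights[name] = None


def save_copy(model, store):
    """Save a deep copy so the shared base model keeps its data (costs ~2x memory)."""
    destructive_save(copy.deepcopy(model), store)


def is_valid(model):
    """Toy validation: every tensor must still have its payload."""
    return all(data is not None for data in model.weights.values())


# Two consecutive saves, as a loss block building two sub-blocks would trigger.
base = ToyModel({"w": b"\x00" * 8})
save_copy(base, {})
first_ok = is_valid(base)
save_copy(base, {})
second_ok = is_valid(base)

# Without the copy, the first save already leaves the shared model unusable.
broken = ToyModel({"w": b"\x00" * 8})
destructive_save(broken, {})
broken_ok = is_valid(broken)
```

The copy is what doubles the peak memory use: the original and the copy are both alive until the save returns.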

Other Implementations

An alternative approach that uses less memory would load the on-disk external data before it is deleted in `Block.__del__` and insert the appropriate fields back into the global `ModelProto`.
This seems brittle because it is coupled to the specific way `onnx.save` destructively accesses external data. If a non-modifying save exists in the onnx repo, it would be ideal to use that in `Block.__call__` instead.
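
The lower-memory alternative can be sketched the same way. This is again a toy model rather than onnx code: `ToyModel`, `destructive_save`, and `save_then_restore` are hypothetical names, and the dictionary `disk` stands in for the external data files.

```python
class ToyModel:
    """Hypothetical stand-in for the shared base model and its large tensors."""

    def __init__(self, weights):
        self.weights = dict(weights)  # tensor name -> raw bytes held in memory


def destructive_save(model, store):
    """Toy save (assumption): moves each payload to `store` and clears it in memory."""
    for name in list(model.weights):
        store[name] = model.weights[name]
        model.weights[name] = None


def save_then_restore(model, store):
    """Save in place, then read the payloads back into the shared model."""
    destructive_save(model, store)
    for name in model.weights:
        model.weights[name] = store[name]


base = ToyModel({"w": b"\x01\x02"})
disk = {}
save_then_restore(base, disk)
restored = base.weights["w"]
```

No second in-memory copy is needed, but the restore step has to know exactly which fields the save cleared, which is the coupling described above.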

Motivation and Context

Fixes generate_artifacts bug reported in #22411

byt3n33dl3

blocks kinda (@microsoft-github-policy-service agree company="Microsoft")

jkbeavers · 2024-10-14T18:07:52Z

@microsoft-github-policy-service agree company="RWS"

snnn · 2024-10-15T17:21:41Z

/azp run Big Models, Linux Android Emulator QNN CI Pipeline, Linux CPU CI Pipeline, Linux CPU Minimal Build E2E CI Pipeline, Linux GPU CI Pipeline, Linux GPU TensorRT CI Pipeline

snnn · 2024-10-15T17:21:46Z

/azp run Linux OpenVINO CI Pipeline, Linux QNN CI Pipeline, MacOS CI Pipeline, ONNX Runtime Web CI Pipeline, Windows ARM64 QNN CI Pipeline,

snnn · 2024-10-15T17:21:56Z

/azp run Windows CPU CI Pipeline, Windows GPU CUDA CI Pipeline, Windows GPU DML CI Pipeline, Windows GPU Doc Gen CI Pipeline, Windows GPU TensorRT CI Pipeline, Windows x64 QNN CI Pipeline, onnxruntime-binary-size-checks-ci-pipeline, orttraining-linux-ci-pipeline, orttraining-linux-gpu-ci-pipeline

azure-pipelines · 2024-10-15T17:22:08Z

Azure Pipelines successfully started running 6 pipeline(s).

azure-pipelines · 2024-10-15T17:22:11Z

Azure Pipelines successfully started running 5 pipeline(s).

azure-pipelines · 2024-10-15T17:22:30Z

Azure Pipelines successfully started running 9 pipeline(s).

WilliamTambellini · 2024-10-15T22:16:27Z

+1

WilliamTambellini · 2024-10-16T17:56:47Z

Thanks @snnn and @baijumeswani.
Any way for you to do a patch release ASAP?

baijumeswani · 2024-10-17T22:00:46Z

I think this will be included in the upcoming 1.20 release.

WilliamTambellini · 2024-10-23T17:33:47Z

Thanks @baijumeswani.
@snnn could you confirm it'll be in the next version at the end of the month?

byt3n33dl3 reviewed Oct 14, 2024

snnn approved these changes Oct 15, 2024

snnn added the training label Oct 15, 2024

baijumeswani approved these changes Oct 15, 2024

baijumeswani merged commit a5e85a9 into microsoft:main Oct 15, 2024
72 checks passed

guschmue pushed a commit that referenced this pull request Oct 18, 2024

Fix training artifacts for 2GB+ models and MSELoss (#22414)

555d750

jkbeavers mentioned this pull request Oct 23, 2024

[Training] Error building gradient graph for bert models for on-device training #22465

Open
