-
Notifications
You must be signed in to change notification settings - Fork 629
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Checkpoint refactoring - recognize checkpoints by operator instance name. #5503
Conversation
CI MESSAGE: [15594081]: BUILD STARTED |
557bc05
to
1188880
Compare
CI MESSAGE: [15594721]: BUILD STARTED |
CI MESSAGE: [15594721]: BUILD FAILED |
OpCheckpoint &Checkpoint::GetOpCheckpoint(int idx) { | ||
DALI_ENFORCE_VALID_INDEX(idx, cpts_.size()); | ||
return cpts_[idx]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like you still rely on the order of the operators, right? The exectuor accesses the checkpoints by opnode's ordinal (opnode.id
), not operator instance name. It feels like it becomes more brittle now.
Previously, when deserializing checkpoint we checked, one by one, that the op's name in the serialized checkpoint matches the name in the deserialized checkpoint (and thus in the graph, because the deserialized checkpoints were built first to match the graph ordering). Now, if the order somehow changes in the restored pipeline, we may succeed when deserializing (because we fetch the ops by name from the executor ignoring the order), but later we access the op checkpoints by the ordinal of the operator.
It seems like there should be one step further, either hiding the order whatsoever and accessing the op checkpoints by name (another map) or at least validate when deserializing that the order of op checkpoints matches the operators' ordinals.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, here the executor is expected to remember the order in which the operators were added to the checkpoint. Still, there was a bug elsewhere and I've already posted a fix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a question of practicality - current executor uses indices. I could add a name->index map and a lookup by name function, but I can still do it later, if it's necessary. Note that when restoring the checkpoint, I reversed the old logic and now restore the operators' state in the order in which they appear in the checkpoint, getting operators from the executor by name.
1188880
to
5cbffaf
Compare
CI MESSAGE: [15740861]: BUILD STARTED |
CI MESSAGE: [15740861]: BUILD PASSED |
Use operator names rather than incidental order in checkpoints. Get operator instances from executor (by name) rather than from lowered graph (by index). Signed-off-by: Michal Zientkiewicz <michalz@nvidia.com>
…stateless. Signed-off-by: Michal Zientkiewicz <michalz@nvidia.com>
Signed-off-by: Michal Zientkiewicz <michalz@nvidia.com>
Signed-off-by: Michal Zientkiewicz <michalz@nvidia.com>
Signed-off-by: Michal Zientkiewicz <michalz@nvidia.com>
c82b7e9
to
b7d21b0
Compare
CI MESSAGE: [15766148]: BUILD STARTED |
Signed-off-by: Michal Zientkiewicz <michalz@nvidia.com>
Signed-off-by: Michal Zientkiewicz <michalz@nvidia.com>
CI MESSAGE: [15768135]: BUILD STARTED |
CI MESSAGE: [15768135]: BUILD PASSED |
Category:
Refactoring (Redesign of existing code that doesn't affect functionality)
Description:
Use operator names rather than incidental order in checkpoints.
Get operator instances from executor (by name) rather than from lowered graph (by index).
Rationale:
Additionally, OpCheckpoint is now only forward-declared in operator.h, because most operators are stateless and don't need to depend on OpCheckpoint definition.
Additional information:
Affected modules and functionalities:
Key points relevant for the review:
Tests:
Tests adjusted (not added)
Checklist
Documentation
DALI team only
Requirements
REQ IDs: N/A
JIRA TASK: DALI-3979