Checkpoint state invalid for both Zarr and OCDBT formats #1484

Open
plra opened this issue Jan 13, 2025 · 1 comment
Labels: checkpoint, type:bug (Something isn't working)

plra commented Jan 13, 2025

I'm using jax 0.4.34, flax 0.9.0 and orbax 0.7.0. Until recently I was using orbax 0.4.1. Certain checkpoints created with v0.4.1 have the following directory structure:

path/to/old_ckpt/
  0/
    commit_success.txt
    default/
      _sharding
      checkpoint
      commit_success.txt
      d/
        <hash>
        ...
      manifest.ocdbt
      <myparam>/
        kernel/
          d/
            <hash>
          manifest.ocdbt
      ...

For at least some of these checkpoints, when I try to restore with a PyTreeCheckpointHandler, I get:

ValueError: NOT_FOUND: Error opening "cast" driver: Error opening "zarr" driver:
Metadata at "<myparam>/kernel/scale/.zarray" in OCDBT database at
gs://<checkpoints>/<model>/<run>/<step>/default/ does not exist

Downgrading orbax back to 0.4.1 results in the same error. Did I corrupt my checkpoint state somehow? How can I rehabilitate these checkpoints?

For reference, my modern checkpoint dirs look like this:

path/to/new_ckpt/
  0/
    _CHECKPOINT_METADATA
    commit_success.txt
    default/
      _METADATA
      _sharding
      commit_success.txt
      d/
        <hash>
      manifest.ocdbt
      ocdbt.process_0/
        d/
          <hash>
          ...
        manifest.ocdbt

and I can load them just fine.
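
One way to compare the two layouts above is to list every directory that holds a `manifest.ocdbt` file. The following is a minimal sketch, assuming a local copy of a single step directory (it does not handle `gs://` paths); `find_ocdbt_roots` is a hypothetical helper name, not part of Orbax.

```python
from pathlib import Path


def find_ocdbt_roots(step_dir):
    """Return directories under step_dir that contain a manifest.ocdbt file.

    Paths are returned relative to step_dir, in sorted order.
    """
    root = Path(step_dir)
    return sorted(
        manifest.parent.relative_to(root).as_posix()
        for manifest in root.rglob("manifest.ocdbt")
        if manifest.is_file()
    )
```

For the old layout shown earlier, this would be expected to report both `default` and the per-parameter directories such as `default/<myparam>/kernel`, whereas the new layout keeps its manifests under `default` and `default/ocdbt.process_0`.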

cpgaffney1 (Collaborator) commented

Once written, the checkpoints are not really corruptible unless you manually modify some files.

With 0.4.1, inspect the checkpoint file using either the structure or metadata methods.

You can use TensorStore APIs to verify the existence of the parameter name in the kvstore: https://orbax.readthedocs.io/en/latest/guides/checkpoint/debug_guide.html#array-value.

You can directly verify the keys TensorStore knows about using this: https://orbax.readthedocs.io/en/latest/guides/checkpoint/debug_guide.html#tree-metadata
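
As a rough illustration of the key check described above, the sketch below opens a checkpoint directory as an OCDBT key-value store and lists its keys. It assumes the `tensorstore` Python package is installed and that the directory passed in is an OCDBT database root; `ocdbt_kvstore_spec` and `list_ocdbt_keys` are hypothetical helper names, not Orbax or TensorStore APIs.

```python
def ocdbt_kvstore_spec(base_url: str) -> dict:
    """Build a TensorStore kvstore spec for an OCDBT database directory."""
    # The base is treated as a directory prefix, so ensure a trailing slash.
    if not base_url.endswith("/"):
        base_url += "/"
    return {"driver": "ocdbt", "base": base_url}


def list_ocdbt_keys(base_url: str) -> list[str]:
    """List all keys stored in the OCDBT database at base_url."""
    # Imported lazily so the spec helper works without tensorstore installed.
    import tensorstore as ts

    kvstore = ts.KvStore.open(ocdbt_kvstore_spec(base_url)).result()
    return sorted(key.decode("utf-8") for key in kvstore.list().result())
```

Calling `list_ocdbt_keys` on the `default/` directory named in the error message and searching the result for `<myparam>/kernel/scale/` should show whether any metadata key exists for that parameter.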

If the parameter does exist, try checking the use_zarr3 parameter, since the metadata key is named either "/kernel/scale/.zarray" (Zarr v2) or "/kernel/scale/zarr.json" (Zarr v3), depending on whether it is enabled.
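
Given a list of keys from the store, the two naming schemes can be told apart with a small check such as the sketch below. `detect_zarr_metadata` is a hypothetical helper, and the keys are assumed to be plain strings relative to the database root.

```python
ZARR2_METADATA = ".zarray"    # array metadata file name in Zarr v2
ZARR3_METADATA = "zarr.json"  # metadata file name in Zarr v3


def detect_zarr_metadata(keys, param_path):
    """Report which Zarr metadata keys exist for param_path.

    Returns a list containing "zarr2" and/or "zarr3"; an empty list means
    neither metadata key is present for that parameter.
    """
    prefix = param_path.strip("/")
    present = set(keys)
    found = []
    if f"{prefix}/{ZARR2_METADATA}" in present:
        found.append("zarr2")
    if f"{prefix}/{ZARR3_METADATA}" in present:
        found.append("zarr3")
    return found
```

An empty result for `<myparam>/kernel/scale` would suggest that the parameter is missing from this database under either naming, rather than that the wrong Zarr version was selected.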

selamw1 added the type:bug (Something isn't working) and checkpoint labels on Feb 11, 2025