Training checkpoints (of large models) are unreadable due to metadata #57
I dug a little into this; I think it is related to torch using its own ZIP implementation, PyTorchFileWriter. A ZIP file contains a metadata "Central Directory", which is a collection of headers describing the files included in the ZIP and the offsets at which they are stored in the byte stream. These headers (one for each file in the zip) are located at the end of the file, with the offset to the start stored at the very end, so that a ZIP client can show you the contents of an archive without unzipping the whole thing. An example header (the first one) looks like this, taken from an original PTL training checkpoint larger than 2GB (2.6GB):
In the corrupted checkpoint (the one we added our metadata to), this first header is exactly the same. So far so good. The central directory is just a long list of these headers, one for each file, in the order in which they were added to the zip. Note that the header tells you the size of the file (Compressed Length and Uncompressed Length). Going through the headers, they are the same for the original and the corrupted checkpoint at the beginning. But at some point the cumulative size of all files becomes greater than 2GB. This is where the original and the corrupted headers start to differ. Original:
Corrupted:
Note the addition of extra ZIP64 fields at the end, marked with the arrow. So the Python ZipFile implementation is rewriting the headers of the PyTorch data in the central directory. I believe this is what causes the corruption. For whatever reason, the PyTorch ZIP implementation does not add or expect these fields in the central directory. The PyTorch ZIP implementation does seem to follow the ZIP64 spec, because it has ZIP64 headers at the end of the central directory; it just does not also add that extra field to each individual header:
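For reference, these central-directory records can be inspected from Python with the standard zipfile and struct modules. The sketch below is only illustrative (the checkpoint path is a placeholder): it prints each entry's offset and sizes, and flags entries whose central-directory record carries a ZIP64 extra field (id 0x0001), which is the field the corrupted checkpoints gain.
import struct
import zipfile

def has_zip64_extra(extra: bytes) -> bool:
    # Return True if the raw central-directory extra data contains a ZIP64 field (id 0x0001).
    while len(extra) >= 4:
        field_id, size = struct.unpack("<HH", extra[:4])
        if field_id == 0x0001:
            return True
        extra = extra[4 + size:]
    return False

# Placeholder path: point this at an original and a corrupted checkpoint to compare.
with zipfile.ZipFile("model.ckpt") as zf:
    for info in zf.infolist():
        zip64 = " [ZIP64 extra field]" if has_zip64_extra(info.extra) else ""
        print(f"{info.filename}: offset={info.header_offset} "
              f"compressed={info.compress_size} uncompressed={info.file_size}{zip64}")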
Are there any updates on this? I am also experiencing the same issue.
As a temporary solution you can 'patch' your training checkpoint to remove the metadata by doing:
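A minimal sketch of that kind of patch, using Python to call Info-ZIP's zip -d (this assumes the metadata entry is the ai-models.json file referenced in the next comment, and is not necessarily the exact snippet originally posted):
import subprocess
import sys
import zipfile

def strip_metadata(checkpoint: str, metadata_name: str = "ai-models.json") -> None:
    # Find the metadata entry inside the checkpoint archive (Python's zipfile can still read it).
    with zipfile.ZipFile(checkpoint) as zf:
        targets = [name for name in zf.namelist() if name.endswith(metadata_name)]
    if not targets:
        print(f"No {metadata_name} entry found in {checkpoint}")
        return
    # zip -d rewrites the archive without the listed entries (requires Info-ZIP's zip on PATH).
    subprocess.run(["zip", "-d", checkpoint, *targets], check=True)

if __name__ == "__main__":
    strip_metadata(sys.argv[1])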
Re-commenting because of a bug in the old steps. Hi, I had this issue trying to run inference on a 9km model (the checkpoint is 3.3GB). With @gmertes' help, the following steps resolved the issue. This is a script you can run which will fix your checkpoints; you just pass the checkpoint you want fixed as an input.
#!/bin/bash
# This script fixes checkpoints larger than 2GB by removing and adding back in the metadata
if [[ $1 == "" ]]; then
echo "error! expected usage './bin/fix_ckpt </path/to/checkpoint>' exiting..."
exit 1
fi
set -xe
checkpoint=$1
echo "Cleaning $checkpoint..."
file_name="ai-models.json"
file_zip_path=$(unzip -l "$checkpoint" | grep "$file_name" | awk '{print $NF}')
parent_dirs=$(dirname "$file_zip_path")
unzip -j "$checkpoint" "$file_zip_path" > /dev/null 2>&1 # extract the json from the zip
mkdir -p "$parent_dirs"
mv "$file_name" "$parent_dirs"
zip -d "$checkpoint" "$file_zip_path" > /dev/null 2>&1 # delete the json inside the zip
zip "$checkpoint" "$file_zip_path" > /dev/null 2>&1 # add the json back at its original path
unzip -l "$checkpoint" | grep "$file_name" # check it worked by printing the copied path within the zip
rm -rf "$file_zip_path"
Can we deactivate saving of metadata for these checkpoints by default? Otherwise training of somewhat larger models becomes quite difficult.
Patching this in #166 for transfer learning.
@icedoom888 can you link to where this is patched in #166 please? |
@mchantry I had to remove the patch from #166. Now the checkpoint is expected to be patched beforehand using the aforementioned script.
* Introduced resume flag and checkpoint loading for transfer learning, removed metadata saving in checkpoints due to corruption error on big models, fixed logging to work in the transfer learning setting
* Added len of dataset computed dynamically
* debugging validation
* Small changes
* Removed prints
* Not working
* small changes
* Imputer changes
* Added sanification of checkpoint, effective batch size, git pre commit
* gpc
* gpc
* New implementation: do not store modified checkpoint, load it directly after changing it
* Added logging
* Transfer learning working: implemented checkpoint cleaning with large models
* Reverted some changes concerning imputer issues
* Reverted some changes concerning imputer issues
* Cleaned code for final review
* Changed changelog and assigned TODO correctly
* Changed changelog and assigned TODO correctly
* Addressed review: copy checkpoint before removing metadata file
* gpc passed
* Removed logger in debugging mode
* removed dataset length due to checkpointing issues
* Reintroduced correct config on graphtransformer
* gpc passed
* Removed patch for issue #57, code expects patched checkpoint already
* Removed new path name for patched checkpoint (ignoring fully issue #57) + removed fix for missing config
* Adapted changelog
* Switched logging to info from debug
* remove saving of unused metadata for training ckpt, fixing #57
Fixed by removing the unused metadata file in #190.
What happened?
Training checkpoints for large models (num_channels greater than or equal to 912) become unreadable by PyTorch and hence can't be used to resume or fork runs.
Note: for now this issue can be overcome by:
These are temporary solutions; if our inference checkpoints also reach that size, we would need a fix to be able to run inference as well.
What are the steps to reproduce the bug?
run_id: previous_run_id
or
fork_run_id: previous_run_id
RuntimeError: PytorchStreamReader failed reading zip archive: invalid header or archive is corrupted
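For reference, the failure mode reported here can also be triggered at the ZIP level, following the analysis in the comments above: append a small metadata entry to a checkpoint larger than 2GB with Python's ZipFile and then load it with torch. This is an illustrative sketch only; the paths and the metadata payload are placeholders, and it assumes such a large checkpoint is already on disk.
import json
import zipfile

import torch

ckpt = "large-model.ckpt"  # placeholder: any torch checkpoint larger than 2GB

# Appending a small entry makes Python's ZipFile rewrite the central directory,
# adding ZIP64 extra fields to entries past the 2GB offset (see the analysis above).
with zipfile.ZipFile(ckpt, "a") as zf:
    zf.writestr("ai-models.json", json.dumps({"example": "metadata"}))

# PyTorch's reader is then expected to fail with:
# RuntimeError: PytorchStreamReader failed reading zip archive: invalid header or archive is corrupted
torch.load(ckpt, map_location="cpu")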
Version
0.1.0
Platform (OS and architecture)
ATOS
Relevant log output
Accompanying data
No response
Organisation
No response