Snapshot model weights/solver state to HDF5 files #2836

Merged 5 commits on Aug 7, 2015

erictzeng
Contributor

This pull request enables Caffe to snapshot model weights and solver states to HDF5 files and makes this format the default. This format provides a number of advantages:

  • It obeys weight-sharing, only snapshotting one copy of each of the parameters in the network. The old snapshotting method would save redundant copies of weight-shared parameters.
  • This should enable snapshotting of networks that are arbitrarily large, whereas protobuf imposes a hard limit.
    • Note that, while snapshots themselves can be arbitrarily large, parameters themselves receive new constraints: the maximum number of dimensions an HDF5 dataset can have is 32, and each dimension is capped at 2^64. If anyone is using 33+ dimensional parameters, we can discuss further...
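
The weight-sharing point can be illustrated with a minimal sketch. It assumes each parameter records which entry owns its storage, with -1 meaning it owns itself. `ParamRef` and `ParamsToSnapshot` are hypothetical names for this illustration and are not taken from the Caffe sources.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for a network parameter in this sketch.
struct ParamRef {
  std::string layer;  // layer that uses the parameter
  int index;          // position of the parameter within that layer
  int owner;          // index of the owning entry in the list, or -1 if self-owned
};

// Returns the entries that would be written to a snapshot: only parameters
// that own their storage. Shared parameters are skipped, since the owner's
// copy is already written once.
std::vector<ParamRef> ParamsToSnapshot(const std::vector<ParamRef>& params) {
  std::vector<ParamRef> owned;
  for (std::size_t i = 0; i < params.size(); ++i) {
    if (params[i].owner < 0) {
      owned.push_back(params[i]);
    }
  }
  return owned;
}
```

With two layers that share both weights and biases, four parameter entries reduce to two stored copies.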

To avoid confusion with the old snapshotting methods, snapshotting to HDF5 files adopts new file extensions, namely .caffemodel.h5 and .solverstate.h5. When restoring either weights or solver history from a file, the extension of the file is checked. If the extension is .h5, it is loaded as an HDF5 file. All other extensions are treated as a binary protobuf file and loaded as before.
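
The extension rule can be sketched as a small suffix check. `SnapshotKind`, `EndsWith`, and `DetectSnapshotKind` are hypothetical names used only for illustration; the actual code in this PR may be structured differently.

```cpp
#include <string>

// Hypothetical enum for this sketch; not taken from the Caffe sources.
enum class SnapshotKind { kHDF5, kBinaryProto };

// Returns true when `filename` ends with `suffix`.
bool EndsWith(const std::string& filename, const std::string& suffix) {
  return filename.size() >= suffix.size() &&
         filename.compare(filename.size() - suffix.size(), suffix.size(),
                          suffix) == 0;
}

// Files ending in ".h5" are treated as HDF5; every other extension falls
// back to binary protobuf, as described above.
SnapshotKind DetectSnapshotKind(const std::string& filename) {
  return EndsWith(filename, ".h5") ? SnapshotKind::kHDF5
                                   : SnapshotKind::kBinaryProto;
}
```

Under this rule both .caffemodel.h5 and .solverstate.h5 are routed to the HDF5 loader, while a plain .caffemodel file keeps loading as binary protobuf.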

The default snapshot format is switched to HDF5 in this PR. If you prefer the old method, you can add snapshot_format: BINARYPROTO to your solver prototxt to restore binary protobuf snapshotting.
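
As a sketch, a solver prototxt that opts back into the old format might look like the fragment below. Only the snapshot_format line comes from this PR; the net path, snapshot interval, and prefix are illustrative values.

```
# solver.prototxt (fragment; paths and interval are illustrative)
net: "examples/mnist/lenet_train_test.prototxt"
snapshot: 5000
snapshot_prefix: "examples/mnist/lenet"
snapshot_format: BINARYPROTO
```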

A few miscellaneous details:

  • This PR is rebased off of one of @jeffdonahue's commits, primarily for its TestSnapshot test for gradient-based solvers.
  • The few HDF5 helper functions that previously resided in util/io.cpp have been moved out to their own file, util/hdf5.cpp, and additional helper functions have been added.
  • There were some nasty interface changes for both the Net and the Solver, since we now have methods for both BinaryProto and HDF5. Everything in Caffe checks out, but downstream users who have implemented their own non-SGD solvers/solvers with nonstandard snapshotting may have a bad time.

Potential caveats

  • Commit d896647 changes the behavior of the function hdf5_save_nd_dataset. Previously, said function always saved 4-D blobs. It has since been changed to save N-D blobs instead. This could potentially break people's workflows if they were relying on HDF5OutputLayers to output 4-D blobs.
  • Testing could be a bit more thorough. These are next on my list, but I wanted to throw this PR out there in the meanwhile. Off the top of my head:
    • There aren't any tests that compare the loaded solver history.
    • There aren't any tests that verify that weight-shared networks are correctly snapshotted/restored.

Possible extensions

These extensions won't end up in this PR, but possible things to do after this wraps up:

  • It's probably worth looking into how to enable HDF5 compression, as that could potentially drastically reduce the size of serialized models.
  • I like the idea of being able to include the network structure in the .h5 file itself, something that the binary protobuf snapshotting partially does, though it only captures the network topology and discards the layer-specific parameters. One possible way to do this is to write the prototxt as a string to an additional dataset in the .caffemodel.h5 file, though currently there's no complete way to turn a Net into a NetParameter, so this would require extra engineering effort.

@erictzeng erictzeng force-pushed the hdf5_snapshot branch 2 times, most recently from 0241c9f to 86cdba4 Compare July 30, 2015 05:42
@shelhamer
Member

This satisfies part of #1211.

@Yeongtae

Yeongtae commented Aug 1, 2015

Can I access the network's blobs using HDF5? If so, please show an example.

@jeffdonahue jeffdonahue mentioned this pull request Aug 6, 2015
@jeffdonahue
Contributor

I've skimmed through this and it mostly looks good, thanks @erictzeng. My one piece of feedback right now is that kMaxBlobAxes should be changed from INT_MAX to 32 unless/until it becomes possible to serialize blobs with more dimensions than that. (One possibility is that all blobs could be stored as 1D HDF5 datasets, with shapes themselves also separately stored as 1D HDF5 datasets, but I don't think anything like that needs to be done here; supporting >2GB nets should probably be considered higher priority than supporting blobs with more than 32 axes.)
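
The proposed limit amounts to a rank check on blob shapes. The sketch below assumes the value 32 from the comment above; `ShapeFitsHDF5Rank` is a hypothetical helper name and is not part of Caffe.

```cpp
#include <cstddef>
#include <vector>

// Value taken from the discussion above (HDF5's maximum dataset rank).
const int kMaxBlobAxes = 32;

// Returns true if a blob with this shape could be written as a single
// HDF5 dataset, i.e. it has at most kMaxBlobAxes axes.
bool ShapeFitsHDF5Rank(const std::vector<int>& shape) {
  return shape.size() <= static_cast<std::size_t>(kMaxBlobAxes);
}
```

A 4-axis convolution weight blob passes this check, while a 33-axis shape would be rejected before any attempt to serialize it.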

<< "Error reading weights from " << trained_filename;
// Check that source layer doesn't have more params than target layer
int num_source_params = hdf5_get_num_links(layer_hid);
CHECK_LE(num_source_params, target_blobs.size())
Member

Should this check equality? You might want to know for instance that the source layer has a bias but the target does not. Sorry, the check in 799-808 covers the rest.

@shelhamer
Member

This will be a good switch, and the backward compatibility saves a lot of heartache, but we might consider bringing the documentation and examples along with us as there are references to the current extensions here and there.

This looks good to me code-wise (once Jeff's comment is addressed) but you could squash related changes and fixes when you're done.

Since the weight sharing tests don't cover save and restore (TestSharedWeightsResume does not use an actual Solver) you could add a snapshot test with a simple weight shared net for completeness.

Thanks @erictzeng!

@jeffdonahue
Contributor

> Since the weight sharing tests don't cover save and restore (TestSharedWeightsResume does not use an actual Solver) you could add a snapshot test with a simple weight shared net for completeness.

The tests I added in #2866 do cover this (though they're less unit tests and more integration tests than what you propose, as they also rely on the solver snapshot/restore correctness).

@shelhamer
Member

@jeffdonahue oh sweet, TestSnapshot and company take care of it then. I think this should be merged once kMaxBlobAxes is switched and the history squashed.

@shelhamer shelhamer mentioned this pull request Aug 6, 2015
8 tasks
@bhack
Contributor

bhack commented Aug 6, 2015

Why not https://google.github.io/flatbuffers/?

@shelhamer
Member

@bhack this lets us keep the same dependencies and interface for defining models. Migrating away from protobuf to a new format needs a good argument and its own issue, since model definitions would change.

@bhack
Contributor

bhack commented Aug 6, 2015

@shelhamer FlatBuffers supports .proto parsing for easier migration from Protocol Buffers.

@erictzeng
Contributor Author

That should be all comments addressed! The constant has been lowered to 32 as requested, and history has been squashed. Let me know if anything else seems off.

@Yeongtae I'm not sure I fully understand what you're asking, but this PR allows you to access network parameters via HDF5, if that's what you want. The parameters are stored in a fairly simple structure. Here's how you'd peek at the conv1 parameters in lenet:

$ h5ls examples/mnist/lenet_iter_10000.caffemodel.h5/data/conv1
0                        Dataset {20, 1, 5, 5}
1                        Dataset {20}

The datasets 0 and 1 correspond to the weights and biases of the layer, respectively.

H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
CHECK_GE(layer_data_hid, 0)
<< "Error saving weights to " << filename << ".";
hid_t layer_diff_hid = H5Gcreate2(diff_hid, layer_name.c_str(),
Contributor

Shouldn't the diff dataset only be created if write_diff is set?

Summary of changes:
- HDF5 helper functions were moved into a separate file util/hdf5.cpp
- hdf5_save_nd_dataset now saves n-d blobs, can save diffs instead of
  data
- Minor fix for memory leak in HDF5 functions (delete instead of
  delete[])
- Extra methods have been added to both Net/Solver enabling
  snapshotting and restoring from HDF5 files
- snapshot_format was added to SolverParameters, with possible values
  HDF5 or BINARYPROTO (default HDF5)
- kMaxBlobAxes was reduced to 32 to match the limitations of HDF5
@jeffdonahue
Contributor

Everything looks good, thanks Eric!

jeffdonahue added a commit that referenced this pull request Aug 7, 2015
Snapshot model weights/solver state to HDF5 files
@jeffdonahue jeffdonahue merged commit fc77ef3 into BVLC:master Aug 7, 2015
@bhack
Contributor

bhack commented Aug 7, 2015

My vote still goes to FlatBuffers as the natural Google successor to protobuf. But with this merge, HDF5 is now the de facto standard for Caffe models, and nobody replied about evaluating a protobuf substitute.

This was referenced Aug 9, 2015
ronghanghu added a commit to ronghanghu/caffe that referenced this pull request Aug 9, 2015
Adapt HDF5DataLayer Prefetch to BVLC#2836
ronghanghu added a commit to ronghanghu/caffe that referenced this pull request Aug 10, 2015
Adapt HDF5DataLayer Prefetch to BVLC#2836
@shaibagon
Member

What about a python interface for saving a net to HDF5? This can be useful for "net surgery".
I tried to hack it myself, adding

void Net_SaveToHDF5(const Net<Dtype>& net, string filename, bool write_diff) {
  net.ToHDF5(filename.c_str(), write_diff);
}

To python/caffe/_caffe.cpp, and .def("save_to_hdf5", &Net_SaveToHDF5) to the Net class definition.

However, I got this error:

HDF5-DIAG: Error detected in HDF5 (1.8.11) thread 139737896589120:
#000: ../../../src/H5G.c line 310 in H5Gcreate2(): unable to create group
major: Symbol table
minor: Unable to initialize object
#001: ../../../src/H5Gint.c line 194 in H5G__create_named(): unable to create and link to group
major: Symbol table
minor: Unable to initialize object
#002: ../../../src/H5L.c line 1638 in H5L_link_object(): unable to create new link to object
major: Links
minor: Unable to initialize object
#003: ../../../src/H5L.c line 1882 in H5L_create_real(): can't insert link
major: Symbol table
minor: Unable to insert object
#004: ../../../src/H5Gtraverse.c line 861 in H5G_traverse(): internal path traversal failed
major: Symbol table
minor: Object not found
#005: ../../../src/H5Gtraverse.c line 755 in H5G_traverse_real(): component not found
major: Symbol table
minor: Object not found
F0111 13:48:01.398217 28230 net.cpp:948] Check failed: layer_data_hid >= 0 (-1 vs. 0) Error saving weights to /path/to/model.h5.
*** Check failure stack trace: ***
Aborted (core dumped)

@shaibagon
Member

@shelhamer There seem to be some "hiccups" with snapshotting to HDF5 format.
See, for example, this SO question. I had similar issues myself.
Are you aware of these "hiccups"? Are you working to solve them?

@shelhamer
Member

@shaibagon I'm not aware of any issue, so could you post an issue with details to reproduce the problem with Caffe master? I don't know anything about the OpenCV DNN package mentioned at that SO link.

Please mention @erictzeng in the issue as the author of this PR.

myfavouritekk added a commit to myfavouritekk/caffe that referenced this pull request Sep 12, 2016
Snapshot model weights/solver state to HDF5 files

* erictzeng/hdf5_snapshot: (29 commits)
  Update example bash scripts to expect .h5, new extensions in .gitignore
  TestSnapshot expects .h5 snapshots, explicitly checks history.
  Snapshot model weights/solver state to HDF5 files.
  TestGradientBasedSolver: add TestSnapshot to verify behavior when restoring net/solver from snapshot
  add double_data, double_diff to BlobProto for weights/snapshots saved when using Dtype == double
  Fix typo
  PythonLayer takes parameters by string
  [pytest] open exception file with mode for python3
  [pycaffe,build] include Python first in caffe tool
  ImageData layer default batch size of 1, and check for zero batch size
  Change log levels in upgrade_proto
  [docs] add CONTRIBUTING.md which will appear on GitHub new Issue/PR pages
  [docs] fix contrastive loss eq
  [docs] fix lmdb fetch url and path
  [docs] clear up PYTHONPATH confusion
  Fix path to mnist_autoencoder.prototxt
  [docs] set lmdb url to github mirror
  [docs] matlab 2015a compatible
  Travis scripts for python3 and pytest for cmake. Also fixes CUDA CMake build issue BVLC#2722.
  [examples] fix link to point to new tutorial notebook
  ...

Conflicts:
	.travis.yml
	include/caffe/python_layer.hpp
	scripts/travis/travis_build_and_test.sh
	scripts/travis/travis_install.sh
	src/caffe/proto/caffe.proto
	src/caffe/solver.cpp
	src/caffe/test/test_gradient_based_solver.cpp
	tools/caffe.cpp