Python layer for rapidly writing nets in Python #1020

Closed
longjon wants to merge 5 commits

Conversation

@longjon (Contributor) commented Sep 1, 2014

This PR depends on #1014, and shouldn't be merged before that one.

Caffe is fast, but adding new layers is a multistep process (see #684), and is subject to all of the pitfalls of development in a low-level, compiled language. For quickly trying new ideas, speed of development may be more important than runtime speed.

This PR addresses this gap by allowing layers to be written in Python. One adds layers to the net prototxt that look like the following:

layers {
  name: "python"
  type: PYTHON
  bottom: "data"
  top: "output"
  python_param {
    module: "my_module"
    layer: "MyLayer"
  }
}

Then, one implements layers in Python with more or less the same interface as in C++, as below.

class MyLayer(object):
    """Simple layer that multiplies input by ten."""

    def setup(self, bottom, top):
        pass

    def reshape(self, bottom, top):
        top[0].reshape(bottom[0].num, bottom[0].channels,
                bottom[0].height, bottom[0].width)

    def forward(self, bottom, top):
        top[0].data[...] = 10 * bottom[0].data

    def backward(self, top, propagate_down, bottom):
        if propagate_down[0]:
            bottom[0].diff[...] = 10 * top[0].diff
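
For completeness, running a net like this from pycaffe looks roughly as follows. This is only a sketch: it assumes the prototxt above is saved as net.prototxt and also declares data as an input blob (e.g. via input / input_dim fields), that the single-argument Net constructor is available, and that my_module.py is importable when the net is constructed.

import caffe

# my_module.py (defining MyLayer above) must be on the PYTHONPATH, since the
# PYTHON layer imports it by name when the net is constructed.
net = caffe.Net('net.prototxt')        # assumed single-argument constructor
net.blobs['data'].data[...] = 1.0      # fill the declared input blob
net.forward()
print(net.blobs['output'].data)        # should be all tens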

To make this work, the main caffe binaries need to be linked against the Python libraries. This is handled by a new build option, WITH_PYTHON_LAYER, plus a couple of ifdefs. These changes are probably not quite ready for merge; in particular, they will likely break the CMake build, and the build option is not yet tested in Travis. (@akosiorek or others, if you're eager to make this work with CMake, PRs against this branch are welcome.)

Layers with learnable parameters are not supported yet.

In theory, this means you ought to be able to write layers in Theano (making use of Theano's symbolic differentiation), embed them in caffe nets, and solve using caffe. I haven't tried that yet, but it might be worth adding some Python-level helper code for that later on.
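
As a very rough sketch of what that might look like (untested and purely illustrative: it assumes Theano is installed, uses float32 throughout, and relies on theano.gradient.Lop for the vector-Jacobian product; none of this is part of this PR):

import theano
import theano.tensor as T
from theano.gradient import Lop

class TheanoScaleLayer(object):
    """Layer whose forward and backward passes are compiled Theano functions."""

    def setup(self, bottom, top):
        x = T.tensor4('x', dtype='float32')
        y = 10 * x                               # any differentiable expression
        dy = T.tensor4('dy', dtype='float32')    # gradient arriving from the top
        dx = Lop(y, x, dy)                       # symbolic backward pass
        self.fwd = theano.function([x], y)
        self.bwd = theano.function([x, dy], dx)

    def reshape(self, bottom, top):
        top[0].reshape(*bottom[0].data.shape)

    def forward(self, bottom, top):
        top[0].data[...] = self.fwd(bottom[0].data)

    def backward(self, top, propagate_down, bottom):
        if propagate_down[0]:
            bottom[0].diff[...] = self.bwd(bottom[0].data, top[0].diff)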

This hasn't really been tested at all (but does build and run).

@akosiorek (Contributor)

Looks like a helpful addition. Once it's ready I'll be eager to help with CMake.

@jeffdonahue (Contributor)

This will be awesome whenever it's ready.

I was just thinking this layer could be even more useful if a prefetch thread could call it as a data provider (in fact this is how all external data is provided in Alex Krizhevsky's cuda-convnet, which honestly makes it more flexible/extensible than our current set of data layers) -- it could be very useful if someone wanted to extend @longjon's python layer to work that way.

Or maybe that would be better implemented in some other way? (It's getting a bit meta with a Python wrapper calling a C++ library which itself sometimes calls Python code...) I guess you can technically already write your data provider in Python, editing the input blobs, but for training you'd have to reimplement the solver logic, and it would be done serially between passes rather than in a separate thread...so maybe we're back to the idea of having a solver callback? Anyway, now I'm just rambling; hopefully someone else thinks this would be useful and has some more coherent thoughts on a good design.
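
To make the data provider idea concrete, here is a rough sketch under the layer interface above (the shapes and the random source are made up; the real benefit would come from running this in a prefetch thread, which the interface does not offer yet):

import numpy as np

class RandomDataLayer(object):
    """Toy data provider: fills its top blob with fresh random values each iteration."""

    def setup(self, bottom, top):
        # made-up, fixed dimensions; a real provider would read them from a config
        self.shape = (4, 3, 32, 32)   # num, channels, height, width

    def reshape(self, bottom, top):
        top[0].reshape(*self.shape)

    def forward(self, bottom, top):
        # a real provider would load and preprocess the next batch here
        top[0].data[...] = np.random.rand(*self.shape)

    def backward(self, top, propagate_down, bottom):
        pass   # nothing to propagate into a data source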

@shelhamer (Member)

Rebasing and fitting this into the build aside, I think this is ready to go, if I remember my last conversation with @longjon right.

@jeffdonahue having a prefetch hook into Python / MATLAB / anything not C++ is a fine idea, but I'm not sure what it should look like either. I do entirely agree it would be useful. While we have the solver callback now, the data processing and solving are done in alternation instead of simultaneously by prefetching. Perhaps a PythonLayer / PythonData split is worthwhile, since the layer interface is setup / forward / backward whereas for data it's really about setup / prefetch.

I'm not certain a layer interface for prefetch is entirely right although that is part of the data / transformation layer conversation and more broadly a question of what to do about phases and stages. Should there be privileged stages like PREFETCH and DEPLOY that the caffe actions, solver, and wrappers know about?

Now I've done my rambling too. Should we meet up for a brew session on this? The cold brew returns tomorrow so there's occasion for Caffe conversation.

@bhack (Contributor) commented Sep 22, 2014

I agree with @shelhamer that this will also require thinking about transformation and augmentation (and synthetic data generation more generally). Finding a good design here could also be useful further on for exploring the transformation sampling space dynamically by monitoring loss or accuracy.
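
As one concrete shape this could take, an on-the-fly transformation could itself be a Python layer in the sense of this PR; the sketch below is illustrative only, with a made-up mirroring probability that a dynamic scheme could later adjust from monitored loss or accuracy.

import numpy as np

class MirrorAugmentLayer(object):
    """Randomly mirrors each batch horizontally; meant to sit just after a data layer."""

    def setup(self, bottom, top):
        self.mirror_prob = 0.5   # made-up parameter

    def reshape(self, bottom, top):
        top[0].reshape(*bottom[0].data.shape)

    def forward(self, bottom, top):
        data = bottom[0].data
        if np.random.rand() < self.mirror_prob:
            data = data[:, :, :, ::-1]   # flip along the width axis
        top[0].data[...] = data

    def backward(self, top, propagate_down, bottom):
        pass   # augmentation sits on the data path, so no gradient is needed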

Notes from the commits on this branch:

- This is necessary to allow Python to access Blobs from the layer interface, which takes raw pointers. (It does, however, mean that Python layers must not hold onto Blobs beyond their layer calls.)
- This is needed for passing propagate_down to Python layers.
- This option links the standard caffe binaries, not just pycaffe, against the Python libraries, making it possible to embed Python in caffe.

@longjon force-pushed the python-layer branch 3 times, most recently from 3205563 to b46ede2 on September 29, 2014.
@jeffdonahue (Contributor)

Hey Jon, I was thinking of trying this out. I didn't actually test this, but is the backward example below correct?

    def backward(self, top, propagate_down, bottom):
        if propagate_down[0]:
            bottom[0].data[...] = 10

Should it not be something like:

    def backward(self, top, propagate_down, bottom):
        if propagate_down[0]:
            bottom[0].diff = 10 * top[0].diff

@longjon (Contributor, Author) commented Oct 26, 2014

Oh dear, @jeffdonahue, of course you're correct. I've fixed the example.

@jeffdonahue (Contributor)

I rebased and tried building this. I get a bunch of multiple definition errors when trying to compile pycaffe. Everything compiles correctly if I check out the second to last commit (before the PythonLayer is added) and comment out the Python object being added to OBJS in the Makefile:

[/home/jdonahue/caffe-bvlc 3]$ gd
diff --git a/Makefile b/Makefile
index c8be847..a57cbd4 100644
--- a/Makefile
+++ b/Makefile
@@ -102,9 +102,9 @@ OBJ_BUILD_DIR := $(BUILD_DIR)/src/$(PROJECT)
 LAYER_BUILD_DIR := $(OBJ_BUILD_DIR)/layers
 UTIL_BUILD_DIR := $(OBJ_BUILD_DIR)/util
 OBJS := $(PROTO_OBJS) $(CXX_OBJS) $(CU_OBJS)
-ifeq ($(USE_PYTHON_LAYER), 1)
-       OBJS += python/$(PROJECT)/_$(PROJECT).o
-endif
+# ifeq ($(USE_PYTHON_LAYER), 1)
+#      OBJS += python/$(PROJECT)/_$(PROJECT).o
+# endif
 # tool, example, and test objects
 TOOL_OBJS := $(addprefix $(BUILD_DIR)/, ${TOOL_SRCS:.cpp=.o})
 TOOL_BUILD_DIR := $(BUILD_DIR)/tools

So I think the multiple definition errors come from the fact that Caffe builds with the PyCaffe object and PyCaffe builds with the Caffe object, leading to multiple definitions of things in the PyCaffe object? Not sure though...

@jeffdonahue (Contributor)

(Sorry, never mind -- it works when I compile from your non-rebased branch so I must have introduced a problem while rebasing.)

@longjon (Contributor, Author) commented Oct 28, 2014

@jeffdonahue, I suspect this is due to the addition of the --whole-archive linker option with the registration stuff.

Python layer doesn't really need to statically link against pycaffe; the problem is that dynamically loading _caffe.so when launching caffe causes protobuf's registration functions to be called twice. This actually isn't (wasn't?) a problem if caffe is only used through pycaffe, because then _caffe.so is loaded once, and statically links against the rest of caffe. (Not sure if caffe's new registration will affect that...) In fact, if a python layer net is invoked from python, using the statically linked pycaffe causes problems, because then there are two pycaffe modules around, and this breaks the dynamic rewriting of classes like Net.

In my local code, I link statically against _caffe.o, but load the _caffe module from _caffe.so iff the Python interpreter is running, which works in most cases (unless one tries to use the caffe_pb2 module from a Python layer, which also tries to re-register the protobuf classes).

So I think you've gathered by now that the linking situation is a confusing mess, and I don't have a straightforward answer right now. Actually my local version is very much in flux at the moment, so stay tuned, use the older branch if it suits your needs, and let me know if you run into more problems... or solutions.

@shelhamer (Member)

@Yangqing it'd be great to have your thoughts on the linking. The registry does keep the layer code neat, but the Python layer makes layer prototyping a breeze and gives a lot of flexibility. Since I lack the C++ toolchain knowledge, it could take me a lot of cups of coffee to figure this out.

@Yangqing (Member)

Hmm, let me take a look tonight and see if I can find the problem... it seems that we may have mixed a few things in the source code (compiled and linked the same cc file twice, maybe?) and it's most likely not caused by the registry code.

@Yangqing (Member)

So I think I've found the problem. See my comments in the code prefixed with "[Compilation]" for details.

Overall, my feeling is that we should really put all the PythonLayer-related things into python_layer.hpp/cpp, and avoid the cycle of caffe depending on _caffe and _caffe in turn depending on caffe - I think that cycle causes the double-definition problem.

@@ -103,6 +103,9 @@ OBJ_BUILD_DIR := $(BUILD_DIR)/src/$(PROJECT)
LAYER_BUILD_DIR := $(OBJ_BUILD_DIR)/layers
UTIL_BUILD_DIR := $(OBJ_BUILD_DIR)/util
OBJS := $(PROTO_OBJS) $(CXX_OBJS) $(CU_OBJS)
ifeq ($(USE_PYTHON_LAYER), 1)
OBJS += python/$(PROJECT)/_$(PROJECT).o
@Yangqing (Member) commented on the diff above:

[Compilation] I think this causes the multiple definition problem: python/caffe/_caffe.o is going to be linked into libcaffe.a because of this line, and then when we make pycaffe, python/caffe/_caffe.cpp (which _caffe.o comes from) gets linked again, causing multiple definitions.

We should remove this line (together with other changes, see below).

@Yangqing (Member)

OK I've finished my pass. @shelhamer @longjon please take another look.

It does not seem to be the registration problem - mostly because we indeed linked the same cpp file twice. Should be relatively easy to fix. Happy to chip in if any further problems emerge.

@shelhamer (Member)

Great, thanks for the investigation @Yangqing. Note this isn't the latest rebase, which is why the old factory code is there. I will try your suggestion in my rebase on dev and report back.


@longjon (Contributor, Author) commented Nov 1, 2014

It's a little more subtle than that. Before I go into the agonizing details though, do note that this PR is now quite out-of-date, and was not intended as a final linking solution.

The reason for linking against _caffe.o was not really to include PyBlob, which is gone in the latest version anyway. Rather, the boost converters need to be set up before the Python layer can do anything, and that (normally) happens in the module initialization, which means that the caffe module needs to be initialized by the Python layer.

This could be done just by having Python layer import caffe. However, this interferes with protobuf's registration. (Caffe links against the protobuf shared library. When you invoke the caffe binary with a network containing a Python layer, libprotobuf.so gets loaded twice: once by the caffe binary, and once when caffe gets imported, and this causes double-registration which crashes protobuf (https://code.google.com/p/protobuf/issues/detail?id=128).)

The hack that I used to work around this was: link statically against _caffe.o, giving access to the module initialization code. (The further hack, not in this PR right now, was to actually load from the shared library module when being invoked from within pycaffe, to fix a bug whose details are no longer important.) This worked fine (despite the double inclusion) until --whole-archive was added to the linking.

Actually, now I think the double protobuf loading was probably due to the fact that the caffe module imports caffe_pb2. If this can be avoided (at least when using the Python layer), it may fix the issue. (Another way is to link against the static libprotobuf.a, but then one has to recompile protobuf with -fPIC, which we can't expect everyone to do.)

In any case I've already made a lot of changes which include not needing to link against _caffe.o (although importing caffe_pb2 from Python layer code will still cause a problem). I'm going to update this with a proposed mergeable version, but not for a couple weeks :)

@longjon (Contributor, Author) commented Jan 10, 2015

Upgraded and replaced by #1703.
