Batch Normalization Layer #3161
Conversation
@cdoersch that's nice!
It is done in #1965, you can take it from there. There is one more implementation here (just in case): #1965 (comment) and the next comment.
@bhack I actually wondered about this a bit myself. Turns out I was wrong to say this was ~5 months old; I have an email conversation from Feb 17 about the original version of this layer, meaning that this implementation has actually been around longer than #1965. From there, I think once Berkeley/CMU folks knew @longjon and I had a version, they tended to ask us for ours rather than use the #1965 version. Really we should have PRed it earlier, but all of us are busy, and PRing is a fair bit of work.

So why didn't we abandon our version to use the PR? Well, a couple of possible reasons. First off, ND blobs #1970 was merged literally a week and a half after #1965 was created, but #1965 was never updated to remove deprecated code. #1965 also has some other issues: it has 3 commits where only 1 is needed, it still has some dead code, and the data layer reshuffling probably should have been a separate PR. It has no mechanism for computing the global statistics. There's also no good reason not to have the scaling/shifting implemented as a separate layer, since merging them doesn't really save computation, and the scaling/shifting can be used in other contexts. Finally, there are no comments on the very confusing backward computation that it does. I know MVNLayer doesn't have comments either, but the idea is that the Caffe codebase should improve with time.

Overall, I suspect the Caffe devs looked at that PR, decided it would be a fair bit of work to understand and review, and so they put it off. So yes, it's unfortunate that work got duplicated here, but let's focus for now on actually getting some version of this code merged.
@cdoersch Surely, #1965 has some unfinished work. The reason for this is that I have seen other PRs which have been rebased to stay mergeable lots of times - even after review by the core team. Since no core member has shown any interest in the BN PR, why should I waste my time on permanent rebasing? However, I agree with you about "actually getting some version of this code merged" and will be happy to see BN in the master branch. @bhack's and my point is simple - with such a policy, the community will just fork the repo and rarely PR back, because PRs will never be merged or even commented on.
Sorry, I'm not concentrating on this specific feature. I think it is only an example of a general problem of transparency toward the community. It is hard to understand what @BVLC is brewing and what its pipeline is for contributors who don't have telephone, physical, or personal contact with @BVLC members. See also #2313
@ducha-aiki Exactly!!!
See the fork vs. contributed-back PR ratio at http://ghtorrent.org/pullreq-perf/BVLC-caffe/
This is a lot of conversation that's not really relevant to this PR... For the sake of providing some closure, remember that the 'caffe devs' are grad students whose real job is to do their own research. The system is not perfect. If the developers know your work, your experience level, and your coding style, that makes it easier to review your code, and means they can get more useful features in the limited time they spend on this. In my view, Caffe tries to mirror the vision community overall. If you want your code merged, you need to make a convincing case that your feature is what everyone wants. I don't think I can give you a recipe for this--just like I can't give you a recipe for doing good vision research, or for making a vision product that ships to customers.
This doesn't justify handling communications out of band (as was done for this PR).
Compared with OpenCV, which is managed by a foundation, PRs from non-core devs here are sometimes not reviewed in a timely manner, nor merged as soon as they are reviewed. To track how many months they have been open, there are a couple of tools that provide statistics. Maybe it's time to donate this project to a similar foundation if the current core devs no longer have enough time to manage the growing community.
@futurely, @cdoersch I would agree with donation, even from the community side with small amounts, perhaps sometimes even symbolic ones. I believe it would not hurt academic ethics. Perhaps one day NVIDIA/AMD or other interested parties could fund this project with a full-time permanent core team.
@cdoersch merging #2966 caused an unfortunate conflict, but I've rebased this at shelhamer/caffe:batchnorm if you want to take that and push. Off topic: adapting the Caffe dev workflow as forks, branches, and PRs proliferate is an endless task, but one we try to improve on within the bounds of finite coffee and daylight (or starlight as it might be in CS). ☕

@ducha-aiki @bhack it is too bad that effort was duplicated, and thank you @ducha-aiki for sending your PR! Even without a merge, the PR makes the branch visible for others to merge as they see fit in their fork, which can help. @cdoersch did make the earliest branch we knew of, and there is a lot to handle in our day-to-day work, but it was still good to have your PR too, and let's all try to have batch norm merged soon.

@futurely foundations are not magic, and the linked project has 10 contributors and <100 PRs. It takes hard work and resolve to develop, review, and merge. An official stamp is no use in itself.
@shelhamer The open source foundation is not simply a change in name. The most important change is who participates in the project governance when the project has thousands of community members. A widely adopted distributed machine learning framework, Spark, has a diverse group of committers with write access to the project even though its creators founded a company, Databricks. The standard big data framework Hadoop has both a large PMC and many committers, while at least two companies, Cloudera and Hortonworks, built their business models on it. The current core developers could also found their own companies if they wanted to, without losing control or credit. Embracing external committers and PMC members brings more human resources to maintain the project and more financial support, such as hardware donations and committer salaries. Many organizations have already invested a lot in adopting Caffe internally. Some of them will certainly be willing to become committers or join the PMC if you decide to open the door. A more open governance mechanism would accelerate development and better handle emerging competition. Please give this idea a second thought.
@shelhamer If we want to be realistic, the issue is that Caffe is one of the most forked C++ projects, but your governance model doesn't scale, and probably many interesting things are fragmented across forks whose developers don't try to contribute back with a PR anymore. See Caffe Slow Pull Request lifelines.
@ducha-aiki for the sake of proper history and credit, this branch will be revised to include your initial commit, then a commit for the changes by @cdoersch and @jeffdonahue, so that the merge reflects everyone's effort.
Hi @ducha-aiki, I think the mean and variance of batch normalization in your code are different from what the original paper meant. For example, given Num x Channels x Height x Width inputs, the batch mean and variance should be 1 x Channels x Height x Width, rather than 1 x Channels x 1 x 1. Anyway, thanks for sharing.
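As an aside, the two statistics shapes being compared above can be illustrated with a minimal numpy sketch (an editorial example, not code from this PR):

```python
import numpy as np

# Input blob of shape (Num, Channels, Height, Width).
x = np.random.randn(8, 3, 4, 5)

# Per-channel statistics, as this layer computes them: shape (1, C, 1, 1).
mean_per_channel = x.mean(axis=(0, 2, 3), keepdims=True)
var_per_channel = x.var(axis=(0, 2, 3), keepdims=True)

# Per-location statistics, as the comment above suggests: shape (1, C, H, W).
mean_per_location = x.mean(axis=0, keepdims=True)
var_per_location = x.var(axis=0, keepdims=True)

print(mean_per_channel.shape)   # (1, 3, 1, 1)
print(mean_per_location.shape)  # (1, 3, 4, 5)
```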
Closing as #3229 implements a superset of the functionality here and has been merged. |
This is a cleaned-up version of the BatchNorm layer that's been floating around Berkeley and CMU for the past ~8 months, which I think nobody PRed because nobody wanted to figure out how to document the gradient computation. It's a joint effort between myself and @longjon, with contributions/testing from @jeffdonahue and small modifications from @philkr. It's a nearly complete implementation of the original paper. The only missing features are that (1) it uses the biased variance estimate at test time, rather than the unbiased one, (2) it does not currently help you compute batch mean and variance over the full dataset (it uses a moving average), and (3) it doesn't have a simple way to compute the final bias and scaling factor that is applied after batch normalization in the original paper (though you can build it yourself out of other layers).
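To make the described behavior concrete, here is a minimal numpy sketch (an editorial illustration, not the layer's actual C++/CUDA code; the momentum value is only an assumption) of per-channel normalization with a moving average standing in for full-dataset statistics, and the biased variance estimate reused at test time:

```python
import numpy as np

def batch_norm_forward(x, running_mean, running_var, training,
                       momentum=0.999, eps=1e-5):
    """Normalize an (N, C, H, W) blob per channel; no scale/shift applied."""
    if training:
        mean = x.mean(axis=(0, 2, 3), keepdims=True)
        var = x.var(axis=(0, 2, 3), keepdims=True)   # biased variance estimate
        # Moving averages stand in for statistics over the full dataset.
        running_mean *= momentum
        running_mean += (1 - momentum) * mean
        running_var *= momentum
        running_var += (1 - momentum) * var
    else:
        mean, var = running_mean, running_var        # still the biased estimate
    return (x - mean) / np.sqrt(var + eps)

# Usage: running statistics are broadcastable per-channel arrays of shape (1, C, 1, 1).
x = np.random.randn(16, 3, 8, 8)
running_mean = np.zeros((1, 3, 1, 1))
running_var = np.ones((1, 3, 1, 1))
y = batch_norm_forward(x, running_mean, running_var, training=True)
```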
Documentation is not quite complete yet, but it is already better than MVNLayer's since this layer actually explains how it computes the gradient. Given the code comments, it should be straightforward for someone else to finish off the documentation.
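For reference, the gradient in question is the standard batch-norm backward formula for the normalization-only case; a small numpy sketch (editorial, not the PR's code):

```python
import numpy as np

def batch_norm_backward(x, dy, eps=1e-5):
    """Gradient of y = (x - mean) / sqrt(var + eps) w.r.t. x, with statistics
    taken over the N, H, W axes of an (N, C, H, W) blob (no scale/shift)."""
    axes = (0, 2, 3)
    mean = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)
    # dy minus the mean of dy, minus x_hat times the mean of (dy * x_hat),
    # all rescaled by 1/sqrt(var + eps); the subtracted terms account for the
    # dependence of the batch statistics on every element of x.
    return (dy
            - dy.mean(axis=axes, keepdims=True)
            - x_hat * (dy * x_hat).mean(axis=axes, keepdims=True)
            ) / np.sqrt(var + eps)
```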