Parallel / distributed training #1148
Conversation
Hi, thank you very much for this great PR!
The high memory usage is probably due to lmdb. Each solver creates a data_layer, which maps the data file and reads from a different location, so a lot of memory shows up as used. It's just a cache for the data file, so it should not be a problem; the OS can discard it if needed. For reference, we should modify data_layer to map the file only once per process, just to avoid running out of virtual memory. I once tried to start dozens of solvers with an 8 TB map size and ran into the x64 48-bit address limit! For ImageNet we also don't have very convincing results here; increasing the GPU bandwidth helps but crashes our machines. I'm not sure why yet, as we are still well below the PCIe limit. I also have a UDP-based synchronization prototype that's a bit slower than raw sockets but does not require a sudo install step. I could look at it again if there is demand for it.
@cypof, thank you very much for the comments and for sharing your valuable experience!
@cypof Have you ever evaluated ZeroMQ's multi-transport support, or does it create too much overhead?
@zszhong Data parallelism is also an interesting topic. Spark uses Berkeley's Resilient Distributed Datasets (RDD) for distributed SGD updates.
@bhack, thank you for the reply!
Let's first focus on finishing the parallel implementation as begun here, fixing crashes, and validating the approach for ImageNet training. All possible file systems and communication protocols can't be covered here, and future extension is always an option. The most effective help is to address the TODO list and help check ILSVRC distributed training. I will give multi-GPU training a try and report back.
In a distributed setting it's preferable to store the data files locally, possibly sharding them if they don't fit on each machine. The synchronization is likely to use all available bandwidth, or at least it should. There are probably a lot of optimizations we could do on top of it, but as an API, user-space networking is supposed to be as fast as it can be, similar to writing kernel code.
I've drafted a list of TODOs for merge and soon-to-come follow-ups that include @cypof's initial TODOs -- see the PR description.
For such a highly complex system mixing distributed systems, single-node parallelism, and machine learning algorithms, simply reading the code can hardly convey the whole big picture. From the existing feedback, there are still a lot of aspects to be researched. It would be good to have a detailed design document hosted on the wiki so everyone can easily understand and review the internals piece by piece, and different teams can attack the problems collaboratively.
Let's work on this in pieces. Let's get the GPU stuff cleaned up and stable first. Is there a simple reproduction for the instability? It sounds like just cranking up the CHUNK size in parallel.hpp causes things to break. Is MNIST still the right place to start?
Yes, just increasing the bandwidth using CHUNK should be enough. MNIST seems stable, but ImageNet makes the box unreachable after an hour of running gpus.bin on two GPUs.
Yes, I can't reproduce with MNIST; rebuilding ImageNet now. Looking quickly at MNIST, there might be a memory leak, as I watched resident memory cruise up in funny pulses during use, but it could also just be lmdb doing its thing. I will also note there is a segfault when gpu.bin exits that traces back to the shutdown of cuBLAS; it looks like we have killed a buffer it is also trying to free.

Larger question: looking at GPUSync, it looks like you are running a separate thread that hits the GPU and pulls data as fast as it can. It seems that should be interlocked with the solver state on the same GPU, but perhaps I'm missing something in what you are trying to do. What prevents you from issuing copies back to back without anything changing on the GPU? It seems you wouldn't want to bother with a copy unless the solver has progressed. In theory, if you are racing on the sync, then as you get to larger transfers you are just going to beat up the PCIe bus and interrupt bus traffic, which could easily cause instability. Smaller chunks would give more scheduling slots for other work to slide in. One "simple" check you can do would be to sleep the thread for a bit at the bottom of the loop (say 100 ms) to see if things stabilize.

One more nitpick: in that infinite loop in GPUSync::run, should we check for a clean-up and exit criterion, since in theory that current code is not performance critical (famous last words)? Also, it seems unclear whether we always need to transfer the maximum chunk size. It seems it should only need to be the active parameters, but again, perhaps I'm misunderstanding where you are trying to go.
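A minimal sketch of the sleep/exit suggestion above, assuming a simplified stand-in for the sync loop (the class and flag names are illustrative, not the actual GPUSync code):

```cpp
// Illustrative only -- not the real GPUSync::run(). It shows the two ideas
// above: an exit flag so the loop can be shut down cleanly, and an optional
// sleep to throttle PCIe traffic while checking whether instability goes away.
#include <atomic>
#include <chrono>
#include <thread>

class GpuSyncLoopSketch {
 public:
  void run() {
    while (!stop_.load()) {
      // ... copy the next chunk of weights/gradients over PCIe here ...
      std::this_thread::sleep_for(std::chrono::milliseconds(100));  // throttle
    }
  }
  void stop() { stop_.store(true); }  // clean exit criterion for shutdown

 private:
  std::atomic<bool> stop_{false};
};
```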
OK, I hope you can reproduce it on ImageNet. I agree that the separate thread might cause other tasks to be delayed, but I'm not sure why that should be too much of a problem, or why the whole machine seems to hang. I think asynchrony is the right approach because it separates the training and sync code, and because the GPU processes batches faster than we can get the gradients across the bus. Several batches will go through while the sync thread does one pass over the buffer. If the two activities are not ridiculously unbalanced, SGD should still converge, and it gives us a lot of freedom to do more sophisticated processing on the sync thread in the long run: we could skip sparse gradients, encode on fewer bits, etc. I think for now one way to simplify things and monitor what is going on on the bus would be to replace the data pre-fetch thread with a queue on the GPU, e.g. 100 batches long. The sync thread can continuously fill this queue with new batches, and only sync the gradients if the queue is already full. That would also get us better utilization of the GPUs in general by buffering against latency spikes, particularly if data is served from a network. I can't start on it right now, but if you agree I could put that in the todo.
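A rough sketch of the proposed queue, assuming a simple mutex-protected bounded container; `Batch`, `BatchQueue`, and the capacity are hypothetical names, not existing Caffe types:

```cpp
// Hypothetical bounded queue for the scheme described above: the sync thread
// keeps the queue of GPU-resident batches full, and only spends bus bandwidth
// on gradient synchronization when there is nothing left to prefetch.
#include <cstddef>
#include <deque>
#include <mutex>

struct Batch {};  // placeholder for a batch already resident in GPU memory

class BatchQueue {
 public:
  explicit BatchQueue(std::size_t capacity) : capacity_(capacity) {}

  // Sync thread: returns false when the queue is full, i.e. when the caller
  // should do a gradient sync pass instead of another prefetch.
  bool try_push(const Batch& b) {
    std::lock_guard<std::mutex> lock(mutex_);
    if (queue_.size() >= capacity_) return false;
    queue_.push_back(b);
    return true;
  }

  // Solver thread: take the next batch if one is available.
  bool try_pop(Batch* out) {
    std::lock_guard<std::mutex> lock(mutex_);
    if (queue_.empty()) return false;
    *out = queue_.front();
    queue_.pop_front();
    return true;
  }

 private:
  const std::size_t capacity_;  // e.g. 100 batches
  std::deque<Batch> queue_;
  std::mutex mutex_;
};
```

The sync thread would loop on `try_push`; whenever it returns false, the thread spends that time on a gradient pass instead.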
Here is what I'm getting at. With MNIST and the default setup, I get this behavior, where "step" is printed in the flow after each call to ForwardBackward inside Solve and "pull" in the GPUSync loop: I0925 16:58:41.672260 6082 solver.cpp:195] Step. The data pull is outpacing the stepping. I understand your desire to be asynchronous, but I'm seeing data getting pulled faster than we process the batch, and GPU call tracing confirms this. From what you write above, I would assume that >= 1 step for each pull is the behavior you want.
Ah I see, but do you log the pull for each chunk, or at the end of the whole buffer? Each pull should be for a different location over the weights. For ImageNet, at the maximum rate I could get the thing to go, I can still only synchronize the whole weight buffer once every couple of seconds.
@cypof Some of Boost's lock-free data structures could probably be explored, since Caffe already depends on some Boost modules.
Ah, I see your point. You do multiple transfers to get all the data, so the logging should actually be inside the if check that resets the chunk offset pointer.
I still see multiple pulls per step fairly often. My gut says an interlock is still needed there. It also looks like you are sending everything on the default CUDA stream, so everything will end up synchronizing access to the GPU, i.e. all submissions to the device will go in order. (ImageNet is still building, likely complete in the morning.)
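For reference, a small sketch of what moving the sync copies off the default stream could look like; the buffer names and sizes are placeholders, not the PR's code:

```cpp
// Sketch: issue the sync copies on a dedicated stream so they do not
// serialize against the solver's kernels on the default stream.
#include <cstddef>
#include <cuda_runtime.h>

void async_chunk_copy(float* dst_host, const float* src_device, size_t count) {
  static cudaStream_t sync_stream = nullptr;
  if (sync_stream == nullptr) {
    cudaStreamCreate(&sync_stream);  // created once, reused for every chunk
  }
  // dst_host must be pinned (cudaMallocHost) for the copy to be truly async.
  cudaMemcpyAsync(dst_host, src_device, count * sizeof(float),
                  cudaMemcpyDeviceToHost, sync_stream);
  cudaStreamSynchronize(sync_stream);  // wait only for this stream's work
}
```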
And of course the database build died overnight with some issue in the EXIF data of one of the images. Super lovely. Patching the convert_imageset.cpp file to hack around the issue.
Ha! That makes me wonder how many transfer errors we would get on weeks-long runs over networks with such large bandwidths. Maybe we should add CRCs, at least in RawSync. Thanks for your feedback; I still hope we can get away with no interlock, more below. I am looking at CUDA streams; it looks like that should help. You said LMDB creates thread-local caches. That definitely might be unhappy with pre-fetch threads that, afaik, get created and destroyed on each batch.

I started an architecture document, but I don't have much yet. Basically the same principle is used for all synchronization segments. The model weights are replicated on all nodes. For each segment, one of the ends is designated master. The master sends a set of weights to the slave. The slave compares them to a copy of its own weights; this copy has not changed since the last message from the master. The difference between this copy and the slave's weights is the gradient. The slave updates its weights using the master position plus the gradient, and sends the gradient to the master. On the GPU the copy is kept on the host to save memory, which was a bad idea; I need to look at moving it to the GPU to optimize bandwidth instead.

This mechanism allows slaves to apply other solvers' gradients and return their own in one pass over the memory. It also does not override progress on either side. In most cases it adds each side's gradient while keeping slaves within a bounded distance of the master. Memory is scanned in order on all nodes, which helps locality. The amount of work lost due to races during additions on each node seems negligible. On the host, for a single-socket CPU, I have seen it be zero for hours-long runs with all cores running. I'm not sure how; maybe the cores always manage to own the cache line during addition for large weight arrays.

In terms of locking, there is no way for each node to know whether the other side has made progress on a chunk since the last exchange. I assume it's extremely likely that at least one end will have made progress, particularly for hubs like host memory that receive updates from several other nodes and GPUs. So the best strategy is probably to let go of all synchronization and just max out all the links. As an engineering problem, there has to be a way to keep things stable once we have fixed the memory pressure and queuing problems. SGD runs on weights that can change at any time. The hardware has to be able to avoid torn reads on float-length words, which seems to be the case also for GPUs. I have played with ways to synchronize things since last year, and it just lowers the compute utilization. SGD doesn't seem to be affected by partial writes, but it depends heavily on solvers' weights being as close as possible to each other. Any latency kills performance, and locking requires either stopping SGD while buffers are copied, or having separate buffers for transfer, which means SGD is running on stale weights and is likely to produce wasted work.
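To make the exchange concrete, here is a sketch of one chunk of the master/slave update described above; variable names are illustrative, and the real code streams chunks asynchronously rather than looping over the whole buffer at once:

```cpp
// For one chunk: the slave's local gradient is the difference between its live
// weights and the copy of the weights it last received from the master. It
// sends that gradient back and rebases itself on the master's position.
void slave_exchange_chunk(const float* master_pos,   // weights just received
                          float* slave_weights,      // slave's live weights
                          float* last_master_copy,   // master weights at last exchange
                          float* gradient_out,       // returned to the master
                          int n) {
  for (int i = 0; i < n; ++i) {
    const float grad = slave_weights[i] - last_master_copy[i];  // local progress
    gradient_out[i] = grad;
    slave_weights[i] = master_pos[i] + grad;  // master position plus gradient
    last_master_copy[i] = master_pos[i];      // reference for the next round
  }
}
```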
I don't know whether this is good news or bad news for Caffe and the larger deep learning community. A startup company, Skymind, has open-sourced their Deeplearning4j project to run deep learning on the omnipotent big data framework YARN, also known as Hadoop 2.x or modern Hadoop.
Two more players:
Open source distributed machine learning frameworks with deep neural network support:
Among all of the alternatives, deepdist is the simplest and most intuitive implementation of Downpour SGD [1].

[1] J. Dean, G. S. Corrado, R. Monga, K. Chen, M. Devin, Q. V. Le, M. Z. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Y. Ng. Large Scale Distributed Deep Networks. NIPS 2012: Neural Information Processing Systems, Lake Tahoe, Nevada, 2012.
I was able to run 2 GPUs okay, but I did notice a little swapping after a few hours. With 4 GPUs things start swapping hard after ~10 minutes and effectively swap-lock the machine; 8 GPUs goes south quite quickly. You can see this happen by polling vmstat at a given interval. The current code initializes an mdb_env per thread, and the resident size grows quickly. I tried to quickly hack around that by instantiating the database once, but then I get errors on the first transactions about invalid rlocks. Still working through that. I haven't been able to find examples of the right way to handle threaded readers. Anyone have ideas here, or is anyone more familiar with lmdb/leveldb and threading? This looks like the primary issue, as the footprint appears near linear in the number of GPUs. However, there could also be a slow memory leak hiding somewhere, as the memory utilization does slowly cruise up.

Secondly, we should be preallocating pinned memory and doing async transfers. This grabs memory up front instead of dynamically pinning, which can be quite expensive under the heavy VM load that the database interaction currently causes. Need to check other data transfers through the stack for the same issue.
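One possible direction for the lmdb question, assuming the plain lmdb C API; the flags and map size here are illustrative, and I have not verified that this resolves the rlock errors:

```cpp
// Sketch: open a single read-only environment per process and share it across
// reader threads. MDB_NOTLS stops read transactions from being bound to
// per-thread slots, which is one common cause of reader-lock trouble when
// prefetch threads are created and destroyed repeatedly.
#include <lmdb.h>

MDB_env* open_shared_env(const char* path) {
  MDB_env* env = nullptr;
  mdb_env_create(&env);
  mdb_env_set_mapsize(env, 1099511627776ULL);  // large map size (illustrative)
  mdb_env_open(env, path, MDB_RDONLY | MDB_NOTLS, 0664);
  return env;  // each reader thread then begins its own MDB_RDONLY transaction
}
```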
Okay, changing CaffeMallocHost to call cudaMallocHost seems to have massively improved stability. It appears that without this, the VM struggles to move things around to dynamically pin memory. I understand the concern expressed in syncedmem.hpp; we can work around it by attempting to catch the error and defaulting to regular malloc. The problem with assuming that unpinned memory is always safe is exactly the issue being hit here, namely that if something else in the system is really putting the VM under pressure, finding memory to pin can be difficult. That also aligns with things seeming worse as the chunk size is increased, since it gets even more difficult to find memory to pin and can cause the VM to move things around. I'll continue my 4-GPU run, but I'm long past where things went bad before. I'm still worried we are leaking memory somewhere.
I can confirm that, up to 8 GPUs, changing the allocation in syncedmem.hpp to use cudaMallocHost and the free to cudaFreeHost seems to fix the stability issues with multiple GPUs, even with much larger chunk sizes than the default. I'm waiting for clearance to submit a patch, but it's a simple change someone else can make while I get through the paperwork... The main issue is figuring out a clean way to only allocate pinned memory if GPUs are actually being used. Falling through when no GPUs are in the system is straightforward: check the return code and switch to regular malloc/free. However, if you are only using CPUs and GPUs are in the system, you don't want to be pinning memory. I don't see a great way around that without passing through a mode flag or some other mechanism to track CPU vs. GPU use.
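A minimal sketch of the allocation change being discussed; the real code lives in syncedmem.hpp, and the extra `pinned` flag here is my addition to show one way to pair the right free call with the right allocator:

```cpp
// Try pinned allocation first and fall back to regular malloc if it fails
// (e.g. no GPU, or the VM cannot find memory to pin). The caller remembers
// which allocator succeeded so the matching free can be used.
#include <cstdlib>
#include <cuda_runtime.h>

inline void CaffeMallocHostSketch(void** ptr, size_t size, bool* pinned) {
  if (cudaMallocHost(ptr, size) == cudaSuccess) {
    *pinned = true;   // pinned host memory, enables fast async transfers
    return;
  }
  *ptr = malloc(size);  // pageable fallback
  *pinned = false;
}

inline void CaffeFreeHostSketch(void* ptr, bool pinned) {
  if (pinned) {
    cudaFreeHost(ptr);
  } else {
    free(ptr);
  }
}
```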
New version, finally! I had to rebranch from dev. Some of the todos and @futurely's suggestions are fixed. There is still a lot to do; in particular, momentum is still not working well, and ImageNet is unstable for 8+ GPUs. But performance is much better, and there are two new big components that are pretty promising. The main one is an RDMA-based networking layer that allows direct communication between GPU memory and the network adapter, over either InfiniBand or RoCE. It supports multicast, which turned out to help a lot: GPUs are faster at receiving than sending, so multicast allows a small output stream from each GPU to expand into a lot of bandwidth on the receiving ends. The other big piece is a modification to data_layer that allows multiple sources, with one data-loading thread per source, one data-transform thread per GPU, and queues that prefetch asynchronously on a separate CUDA stream. We have not benchmarked it much yet, but it seems to help a lot. In particular, it opens only one database environment per file and prevents threads from being continuously created and destroyed. Latest graphs: ImageNet, CIFAR-10.
@cypof and others: has anybody already checked the ParameterServer Caffe effort at CMU?
See also the last doc link of this ticket.
compilation failure?
Another implementation: https://github.com/sailorsb/caffe-parallel (data parallelism over MPI).
@cypof How was this graph generated? Is it on training data or testing data?
It was on testing data, the validation set.
Hi, I got a segmentation fault when running the CIFAR example on two nodes.
Environment:
Command on two machines:
Error:
More details from gdb:
Instead of attacking all axes of parallelism at once, see the multi-GPU data parallelism of #2114 for a start. Closing this, since sync SGD with data parallelism covers a lot of cases. However, it clearly does not cover all of them, so this PR can serve as the historical record for the good pieces we might pick up later.
To add something post-modern to the historical record, I reference https://github.com/Yangqing/caffe2/issues/11.
Parallel and distributed training of Caffe models by streaming gradients among solvers. Parallelizes training without redefining models. This is the integration PR for @cypof's original contribution in #1140.
TODO
Follow-up
Please review and collaborate here. The original details of #1140 by @cypof are:
A set of classes to synchronize SGD between multiple solvers. Based on the Hogwild paper, and our work at Flickr to extend the model to GPUs and distributed configurations by streaming gradients between solvers.
Features
Limitations
Tests
Early results on MNIST seem to show linear scaling. We tested on up to 6 machines with 4 solvers each for CPU, and 2 machines with 2 GPUs each. GPUs do not perform well on this small network but still seem to scale linearly.
In the weeks to come we plan to start testing on larger networks and clusters. Currently our GPU machines are connected through 1G Ethernet; please contact us if you are interested in helping benchmark on better hardware.
Architecture
We made the Caffe singleton thread-local, to allow multiple solvers to run in parallel on their own threads. Synchronization works by sharing the weight buffers between solvers in the same address space, and by asynchronously measuring and exchanging gradients between address spaces.
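A stripped-down illustration of the thread-local singleton idea; the actual Caffe class holds per-thread mode, device id, and cuBLAS/cuRAND handles, and its thread-local mechanism differs from this sketch:

```cpp
// Each solver thread gets its own context, so per-thread state such as the
// mode, device id, and library handles never needs locking.
class CaffeSketch {
 public:
  static CaffeSketch& Get() {
    static thread_local CaffeSketch instance;  // one instance per thread
    return instance;
  }
  // ... per-thread mode, device id, cuBLAS/cuRAND handles, RNG state ...
 private:
  CaffeSketch() {}
};
```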
Bugs / Todos