Multi-threaded CPU kernels #1749
This would be cool, I agree. Can you elaborate on your …
re: "If people think threading would be nice to have, I have good working knowledge of existing designs and could prepare a proposal for how this could be implemented in Glow."
@nhynes Do you have a specific use-case in mind? I agree with your analogy to OpenCL. OpenCL views the CPU, with all of its cores, as a single compute device and splits the work across multiple cores. This comparison makes sense. I agree that having a multithreaded implementation of some operators is the best way to implement a low-latency optimized network. However, I think that this design would clash with a design that maximizes throughput. Most production systems care about batches of inputs. If you care about the total number of inputs processed by the system, then a better design would be to split the graph across multiple cores (a data-flow architecture) in a way that minimizes the transfer of memory in and out of the cache. @bertmaher started working on such partitioning. This is something that needs to be designed. It's really easy to wrap the outermost loop in Convolution with an OpenMP pragma, but I don't think that this is the right long-term direction. Also, it would be really difficult to manage the complexity of the system as we implement production-level inference that takes multiple batches into account. Let's not do the easy thing; let's plan ahead and solve the big picture.
Thanks for your reply @nadavrot. The use case I had in mind is actually very specific (so much so, actually, that not even OMP would suffice). Basically the idea is to link a training model into a trusted hardware enclave. Enclaves have a restricted programming model, so porting PyTorch or even OpenBLAS isn't possible. Moreover, enclaves can't spawn threads and need to request them from the untrusted OS. Also, context switching between enclaves has very high overhead, so splitting the model across cores might hurt performance. I haven't actually considered cache locality, though, so it might be worth testing out! Is there an easy way to do this in Glow? I'm currently using TVM because it supports training and threads, but the architecture of Glow is (imho) much cleaner and has better integration with a more popular ML framework. Here's a tech report describing the approach, if you're interested.
@nhynes Ah, I understand. Thank you. That's an interesting use-case that I never would have guessed. I am not familiar with secure enclaves, but it sounds interesting. Using OpenMP (or something similar) to parallelize the outermost loop of the convolution will ensure that when you bring in the convolution weights, all of the cores can share that memory, or at least not compete and evict each other's cache lines. Of course, this is only relevant for the last-level cache, which is large enough to hold the convolution weights.
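To make the suggestion concrete, here is a minimal sketch (my own illustration, not Glow's libjit kernel) of a naive NHWC direct convolution in which only the outermost loop is parallelized with OpenMP, so every thread reuses the same read-only filter weights from the shared last-level cache:

```cpp
// Illustrative sketch only: naive NHWC direct convolution (stride 1, no padding).
// Parallelizing just the outer batch loop keeps the filter read-only and shared
// across cores, so threads do not evict each other's copies of the weights.
#include <cstddef>

void conv2d_nhwc(const float *in, const float *filter, float *out,
                 std::size_t N, std::size_t H, std::size_t W, std::size_t Cin,
                 std::size_t K, std::size_t Cout) {
  const std::size_t OH = H - K + 1, OW = W - K + 1;
#pragma omp parallel for
  for (long long n = 0; n < (long long)N; ++n) {  // each thread owns whole images
    for (std::size_t oh = 0; oh < OH; ++oh)
      for (std::size_t ow = 0; ow < OW; ++ow)
        for (std::size_t co = 0; co < Cout; ++co) {
          float acc = 0.f;
          for (std::size_t kh = 0; kh < K; ++kh)
            for (std::size_t kw = 0; kw < K; ++kw)
              for (std::size_t ci = 0; ci < Cin; ++ci)
                acc += in[((n * H + oh + kh) * W + ow + kw) * Cin + ci] *
                       filter[((co * K + kh) * K + kw) * Cin + ci];
          out[((n * OH + oh) * OW + ow) * Cout + co] = acc;
        }
  }
}
```

In a real kernel the loop order and tiling matter far more, but the weight-sharing argument above is already visible at this level.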
And here's the poster for Devcon which might help describe and motivate the use case :)
I'd be interested in a design proposal for threading, especially since you have deep knowledge of other frameworks. I can't promise whether it would mesh with our higher-level goals (as @nadavrot mentioned, we're most interested in high-throughput server workloads), or where it would fall in our roadmap, but at least it could be a useful point in the design space.
Actually, we want to map a computation graph to multi-core hardware. From the previous discussion between @nhynes and @nadavrot, there are two kinds of mapping methods:

1. Parallelize a single operator across multiple cores (operator-level parallelism).
2. Partition the whole graph across multiple cores (data-flow parallelism).
There are some tradeoffs between latency and throughput in these two methods. What I am wondering is how to represent a multi-core program in our low-level IR. A natural way would be to generate low-level IR for each core. Following is an example of mapping a convolution (followed by a relu) onto two cores.

The original low-level IR is as follows:

```
declare {
%input = weight float<8 x 28 x 28 x 1>, broadcast, 0.0
%filter = weight float<16 x 5 x 5 x 1>, xavier, 25.0
}
program {
%allo = alloc float<8 x 28 x 28 x 16>
%conv = convolution [5 1 2 16] @out %allo, @in %input, @in %filter3, @in %bias0
%allo0 = alloc float<8 x 28 x 28 x 16>
%relu = relu @out %allo0, @in %allo
}
```

The transformed low-level IR after an output-channel-stationary division is as follows (please note that the default layout in Glow is NHWC):

```
declare {
%input = weight float<8 x 28 x 28 x 1>, broadcast, 0.0
%filter = weight float<16 x 5 x 5 x 1>, xavier, 25.0
}
program-core0 {
%allo-core0 = alloc float<8 x 28 x 28 x 8>
%conv-core0 = convolution [5 1 2 8] @out %allo-core0, @in %input, @in %filter3, @in %bias0
sync_device_api
%allo0-core0 = alloc float<8 x 28 x 28 x 16>
%relu-core0 = relu @out %allo0-core0, @in %allo-core0
%relu-core0 + 8 x 28 x 28 x 8 = relu @out (%allo0-core0 + 8 x 28 x 28 x 8) , @in %allo-core1
}
program-core1 {
%allo-core1 = alloc float<8 x 28 x 28 x 8>
%conv-core1 = convolution [5 1 2 8] @out %allo-core1, @in %input, @in %filter3, @in %bias0
sync_device_api
}
```

This is just a trial version, and it may contain some mistakes. I am not sure whether the design breaks the whole roadmap. Any comments are welcome. Thanks!
@QiJune that's an interesting proposal. Partitioning operators per core would certainly help when the workload is consistent over the entire graph (e.g., ResNet, stacked RNNs) or when using a NUMA machine. Reasoning about what should go where (i.e., load balancing the flops) might be a bit challenging, though. Parallelizing ops is a bit more straightforward. For comparison, TVM parallelizes individual ops: the TVM compiler packs the computation into a lambda function and sends it to the runtime for execution. I'm still piecing together how this should be implemented in the Glow compiler, but the approach of packing up microtasks and shelling out to the runtime is very flexible and, indeed, is what OpenMP does. If you'd like, I can start a shared doc where we can collaborate on drafting a multi-threading proposal. I'm quite interested in such a feature and would be glad to help make it happen!
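As a rough illustration of the "pack a microtask and hand it to the runtime" idea (my own sketch; `parallel_launch` and its kernel closure are hypothetical names, not TVM's or Glow's API):

```cpp
// Hypothetical sketch: the "runtime" is just a helper that splits [0, total)
// into chunks and runs the packed kernel closure on a few std::threads.
#include <algorithm>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

void parallel_launch(std::size_t total, std::size_t num_workers,
                     const std::function<void(std::size_t, std::size_t)> &kernel) {
  std::vector<std::thread> workers;
  const std::size_t chunk = (total + num_workers - 1) / num_workers;
  for (std::size_t w = 0; w < num_workers; ++w) {
    const std::size_t begin = w * chunk;
    const std::size_t end = std::min(total, begin + chunk);
    if (begin >= end) break;
    workers.emplace_back([&kernel, begin, end] { kernel(begin, end); });
  }
  for (auto &t : workers) t.join();  // the op is done when all chunks finish
}

int main() {
  std::vector<float> v(1024, 1.0f);
  // The "compiled op" is packed as a lambda over a sub-range of the data.
  parallel_launch(v.size(), 4, [&](std::size_t begin, std::size_t end) {
    for (std::size_t i = begin; i < end; ++i) v[i] *= 2.0f;
  });
  return 0;
}
```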
@nhynes Sure, we can work together to make a design doc first.
Cool, here's a link to the doc which I just created. I'll add to it over the course of this week. From a high level, it looks like the right place to add the parallelism transformation is during …
@nhynes Here is also a doc on adding a NewBackendSpecificNode, for reference.
There are L1, L2, and L3 cache considerations for parallelism, and also false sharing to solve, especially for operations that involve a reduction, like matrix multiplication and convolution. You cannot just parallelize the outer loop and hope to get a good result. If we take the matmul implemented in libjit and apply the analysis from Anatomy of High-Performance Many-Threaded Matrix Multiplication, we have the following parallelisation possibilities:

glow/lib/Backends/CPU/libjit/libjit_matmul.cpp Lines 225 to 244 in d0fa695
Now, other options include parallelising in the inner GEBP kernel:

glow/lib/Backends/CPU/libjit/libjit_matmul.cpp Lines 150 to 158 in d0fa695
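For readers without the paper at hand, the loop structure in question is roughly the following (a hedged sketch with placeholder tile sizes, not libjit's actual constants, and with packing omitted); the comments summarize which loops the BLIS paper considers candidates for parallelization:

```cpp
#include <algorithm>

// Placeholder tile sizes for illustration only.
constexpr int NC = 256, KC = 128, MC = 64, NR = 8, MR = 6;

void gemm_loops(int M, int N, int K) {
  for (int jc = 0; jc < N; jc += NC)        // 5th loop (jc): coarse split of B/C columns, useful across sockets
    for (int pc = 0; pc < K; pc += KC)      // 4th loop (pc): reduction over k, not parallelized (needs sync)
      for (int ic = 0; ic < M; ic += MC)    // 3rd loop (ic): each core owns an MC x KC block of A (L2-resident)
        for (int jr = 0; jr < std::min(N - jc, NC); jr += NR)    // 2nd loop (jr): cores share the packed A block
          for (int ir = 0; ir < std::min(M - ic, MC); ir += MR)  // 1st loop (ir): usually too fine-grained
            ;  // the MR x NR micro-kernel on packed panels of A and B would go here
}
```

Choosing a threading option then amounts to deciding which of the jc/ic/jr loops to annotate, which is exactly the cache-sharing trade-off discussed above.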
### My experience

I tried options 3 and 4 on my dual-core i5-5257U (mobile Broadwell, 2.7 GHz, turbo 3.1 GHz).
@mratsim Thank you for this amazing analysis. Do you have performance numbers for the Glow implementation? Also, what matrix sizes were tested? At the moment @gcatron and @beicy are working on implementing a different dimension of parallelism. They are partitioning the whole neural network across multiple cores and implementing data-flow parallelism. This has the advantage that the weights of some part of the network stay pinned to a specific core. I think that distributing the neural network across cores should be more efficient than distributing a specific operator, when considering the problem of pipelining lots of data. What do you think?
Fantastic benchmarks, for sure! @mratsim would you mind running one additional benchmark using TVM? The code and lowered schedule can be found here: https://docs.tvm.ai/tutorials/optimize/opt_gemm.html#parallel
Ah, sorry, I didn't mention the size: it was M=N=K=1920, probably a bit big for NN matrices, but I wanted the general case done before tackling small matrices. I will try to run the bench from Nim, the language I'm using, if I manage to wrap them (Nim can compile and call C++).

Edit: Regarding dataflow parallelism, I think it's needed for distributed learning on clusters, a bit like OpenMP + MPI; both are complementary, see this OpenMP+MPI presentation. Abstracting that would be great, and this is something Tensorflow has been actively pursuing recently with their …
I've done the benchmark for libjit; it reaches 85% of OpenBLAS single-threaded, not bad! Changes for review: mratsim/laser@2c9c8f1
@nhynes For the TVM bench (I also would like to properly integrate Halide), I need some more time to see how to call them from Nim.
@Laurae2 ran the benchmark on a dual Xeon Gold 6154, and Glow also reaches 66 GFlop/s even though the theoretical single-threaded peak is 118 GFlop/s, on M=N=K=2304 (2304 = 32*72, with 72 being the number of cores) (mratsim/laser#9). I think Glow is memory-bandwidth starved.
I have a new workstation with an overclocked Skylake-X i9-9980XE.

In the process of adding AVX512 (mratsim/laser#14), I reran the benchmark (PyTorch Glow compiled with …). My previous benchmark was done on a dual-channel i5-5257U 2.7 GHz (3.1 GHz all-core turbo), a mobile Broadwell from a MacBook Pro 2015, so memory was single or dual channel. The following benchmark is done on the i9 (quad memory channels).

This confirms @Laurae2's benchmarks on a dual Xeon Gold 6154 (hexa-channel memory).
### Side-note on many-core parallelism

When parallelizing on the i9 (18 cores) I could only reach 1.8 TFLOP/s with huge 3840*3840 matrices, even though single-threaded performance was 166 GFLOP/s. Parallelizing a single loop is probably not enough with so many cores, and BLIS has further advice in their multithreading README. I did not check yet how they implement nested parallelism without oversubscribing (thread explosion: say 18 threads on the first loop, then 18*18 on the second one). Intel suggests using the recent OpenMP task construct in their article on OpenMP nested parallelism. I guess Intel TBB would also be a good match when using a task-based model.
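As a rough illustration (my own example, not BLIS or Glow code) of how OpenMP tasks let two loop levels share one thread team instead of oversubscribing with nested parallel regions:

```cpp
// One parallel region creates the thread team once; both loop levels only
// generate tasks, so the same workers execute everything and no 18*18 thread
// explosion occurs. The printf stands in for a GEBP micro-kernel call.
#include <cstdio>

void nested_tiles(int mTiles, int nTiles) {
#pragma omp parallel   // single team of threads
#pragma omp single     // one thread generates the work
  for (int i = 0; i < mTiles; ++i) {
#pragma omp taskloop   // inner level becomes tasks, not extra threads (OpenMP 4.5+)
    for (int j = 0; j < nTiles; ++j) {
      std::printf("tile (%d,%d)\n", i, j);
    }
  }
}
```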
Some updates on parallelization in my own library. Matrix multiplication and convolutions are probably the hardest to parallelize, and I found a clean way, using OpenMP tasks, to parallelize across the M and N dimensions while staying easy to maintain and understand: https://github.com/numforge/laser/blob/56df4643530ada0a01d29116feb626d93ac11379/laser/primitives/matrix_multiplication/gemm.nim#L74-L78

```nim
# #####################################
# 4. for jr = 0,...,nc−1 in steps of nr
for jr in countup(0, tiles.nc-1, NR):
  omp_task("firstprivate(`jr`)"):
    let nr = min(nc - jr, NR) # C[ic:ic+mc, jc+jr:jc+jr+nr]
```

```nim
omp_parallel_if(parallelize):
  # ####################################
  # 3. for ic = 0,...,m−1 in steps of mc
  omp_for(ict, tiles.ic_num_tasks, use_simd=false, nowait=true):
```

In Glow, that boils down to replacing this:

glow/lib/Backends/CPU/libjit/libjit_matmul.cpp Lines 150 to 158 in f4ab7a9
and this:

glow/lib/Backends/CPU/libjit/libjit_matmul.cpp Lines 226 to 244 in f4ab7a9
by something along the lines of:

```cpp
void libjit_matmul_outer(size_t m, size_t n, size_t k, const float *a, size_t lda,
                         const float *b, size_t ldb, float *c, size_t ldc) {
  float packedB[kc * nc] __attribute__((aligned(64)));
  for (size_t p = 0; p < k; p += kc) {
    size_t pb = MIN(k - p, kc);
    for (size_t j = 0; j < n; j += nc) {
      size_t jb = MIN(n - j, nc);
      if (pack) {
        pack_matrix_b<regsB>(jb, pb, &B(p, j), ldb, packedB);
      }
#pragma omp parallel for nowait if(<size condition for parallelization>)
      for (size_t i = 0; i < m; i += mc) {
        size_t ib = MIN(m - i, mc);
        libjit_matmul_inner<pack>(ib, jb, pb, &A(i, p), lda, &B(p, j), ldb,
                                  &C(i, j), ldc, packedB);
      }
    }
  }
}

void libjit_matmul_inner_packed(int m, int n, int k, const float *packedA,
                                const float *packedB, float *c, int ldc) {
  for (int j = 0; j < n - nr + 1; j += nr) {
#pragma omp task firstprivate(j)
    for (int i = 0; i < m - mr + 1; i += mr) {
      libjit_matmul_zdot<regsA, regsB>(k, &packedA[i * k], mr, &packedB[j * k],
                                       k, &C(i, j), ldc);
    }
  }
}
```

### The bigger parallel picture

Reminder: by convention, C = A * B, with C of shape MxN, A of shape MxK, and B of shape KxN.
### Some benchmarks

I've integrated MKL-DNN in my benchmarks.

### Serial 1920x1920

### Parallel 1920x1920

### Serial 224x224
### Parallel 224x224
@nhynes I've been looking into how to integrate Halide and TVM into those benchmarks, but due to the code generation it's quite tricky.
@nadavrot @gcatron @beicy I've made some advances on my own compiler and research. Partitioning the graph would indeed provide higher-level parallelism that would be more efficient than multithreading at the loop level, especially on NUMA systems, which often suffer from naive parallel for loops or naive memory allocation. I did not check your latest research/implementation, but the main issue I foresee is not exposing enough parallelism. Some NN architectures have a straightforward mapping to multiple cores, like GANs or dual-headed NNs with a CNN head for images and another CNN/RNN head for speech or text. However, for many feed-forward architectures, parallelism at the computation-graph level is only available during backpropagation, when you backpropagate through a function with 2 or more inputs and can distribute 1 gradient per socket/core. Regarding this, Cpp-Taskflow by @tsung-wei-huang is quite interesting, and according to his paper he is exploring parallelizing Tensorflow computation graphs. There are feedforward benchmarks (MNIST and parallel-dnn) in the Cpp-Taskflow examples folder, but as shown on my 18-core machine, I can only get a 2x to 3x speedup (taskflow/taskflow#97). So you would need both task parallelism and data parallelism to best use CPU cores. On GPU, though, data parallelism is already implicit/handled by CUDA/cuDNN, and you can use CUDA async to launch kernels with no dependencies between each other; it might be easily mappable to Cpp-Taskflow's task model, and unlike on CPU, there is no risk of runtimes interfering with each other or of oversubscription. For multi-GPU systems, it's probably interesting to explore CUDA Unified Memory: while it used to have a lot of overhead on the Kepler generation, I expect it has improved a lot, and this would avoid having to manage memory migration in the tasking system. Furthermore, Unified Memory is the easiest way to benefit from NVLink, and we wouldn't need NCCL on a single machine to still have efficient multi-GPU computation (and NCCL doesn't work on Windows).
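As a toy illustration of the graph-level parallelism available in such dual-headed architectures (my own sketch with hypothetical function names, not Glow or Cpp-Taskflow code), plain C++ futures are enough to run the two heads concurrently:

```cpp
// The two independent heads run concurrently; the fusion layer is the join
// point of the graph. Real heads would be CNN/RNN forward passes.
#include <future>
#include <numeric>
#include <vector>

std::vector<float> runImageHead() { return std::vector<float>(128, 0.5f); }   // stand-in CNN branch
std::vector<float> runTextHead()  { return std::vector<float>(128, 0.25f); }  // stand-in RNN branch

float fuse(const std::vector<float> &a, const std::vector<float> &b) {
  // stand-in for a fusion layer: just sum everything
  return std::accumulate(a.begin(), a.end(), 0.0f) +
         std::accumulate(b.begin(), b.end(), 0.0f);
}

float forward() {
  auto img = std::async(std::launch::async, runImageHead);  // graph node 1
  auto txt = std::async(std::launch::async, runTextHead);   // graph node 2
  return fuse(img.get(), txt.get());                        // node 3 depends on 1 and 2
}
```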
In the past 6 months I've been doing a deep dive into multithreading runtimes for deep learning compilers. I now have a much deeper understanding of parallelism and how it can best be harnessed to maximize the throughput of deep learning workloads. I have implemented most of the following in my Weave runtime. I will also be giving a talk on the subject of parallel runtimes at FOSDEM in Brussels in a month (normally recorded and livestreamed).

### Parallel paradigms on CPU

There are, as far as I'm aware, 3 dominant parallel paradigms on CPU. A multithreading runtime for deep learning applications requires all 3.

#### Data parallelism

This is the very well known parallel for loop supported by OpenMP and most runtimes. The way it is implemented, however, has tremendous implications on the runtime's flexibility to cope with hardware and workload sizes. This is covered in a later paragraph with PyTorch-specific examples.

#### Task parallelism

The spawn/sync from Cilk, or async/await in futures-based control flow. This facilitates implementing recursive divide-and-conquer algorithms for linear algebra kernels, and also parallelizes tree algorithms like Monte Carlo tree search for ELF, beam search for NLP, or random forests / gradient-boosted trees. This is less critical for regular tensor processing (feed-forward NNs like ResNet).

#### Dataflow parallelism

Also called graph parallelism, pipeline parallelism, stream parallelism, or data-driven task parallelism. The idea is that you schedule delayed tasks that depend on other inputs (the OpenMP depend clause, Cpp-Taskflow after/precede, TBB FlowGraph) and may also produce outputs required by an input further down the pipeline/stream/graph, i.e. what was mentioned in #1749 (comment) and what is being worked on for Glow, AFAIK.

### Scheduler challenges

#### Load balancing

##### Dataflow parallelism

As mentioned in my previous post, using a pure dataflow approach misses a lot of parallelism opportunities at the intra-tensor level.

##### Data parallelism

On another dimension, using a plain OpenMP parallel for will create load-balancing/grain-size issues for generic algorithms. This is heavily detailed in https://github.com/zy97140/omp-benchmark-for-pytorch. PyTorch uses the parallel for algorithm for both tensor addition and tensor exponentiation, but with OpenMP, one is worth parallelizing on most machines while the other is only worth parallelizing on slow machines.

##### Task parallelism

Some algorithms may create lots of tasks on one branch of the tree and none on the other; with a scheduler that has no load-balancing scheme for tasks, some cores will be pegged at 100% while others are left unused. The usual approach is work-stealing or another greedy scheduler like the Parallel Depth-First scheduler. A greedy scheduler respects the busy-leaves property (as long as there is pending work, all cores are busy). The Cilk paper proves that a greedy scheduler can be at most 2x slower than the optimal schedule. Also, in particular for recursive tree algorithms, the leaf tasks may be very small compared to the scheduling overhead; to reduce that overhead the runtime should be able to dynamically package tasks together when they are too small (a steal-half strategy). Note that Intel TBB is depth-restricted to limit unbounded memory growth, and there are proven cases where it doesn't scale because it doesn't maintain the busy-leaves property.
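Before getting to data dependencies, here is a rough OpenMP-flavoured C++ sketch of the three paradigms above (my own illustration, unrelated to Weave's or Glow's actual APIs):

```cpp
#include <cstdio>

void demo(float *a, float *b, int n) {
  // 1. Data parallelism: one loop split across cores.
#pragma omp parallel for
  for (int i = 0; i < n; ++i) a[i] += b[i];

  // 2. Task parallelism: spawn/sync independent (possibly recursive) work.
#pragma omp parallel
#pragma omp single
  {
#pragma omp task
    std::printf("left subtree\n");
#pragma omp task
    std::printf("right subtree\n");
#pragma omp taskwait
  }

  // 3. Dataflow parallelism: tasks ordered by data dependencies, not barriers.
#pragma omp parallel
#pragma omp single
  {
#pragma omp task depend(out: a[0])
    a[0] = 1.0f;            // producer
#pragma omp task depend(in: a[0])
    b[0] = a[0] * 2.0f;     // consumer runs only after the producer
  }
}
```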
#### Data dependencies and barriers

A compiler approach can be of tremendous help to guarantee producer-consumer relationships (see Halide/TVM) and can make do with a simpler threadpool implementation; however, I think a runtime with first-class dataflow would provide lots of benefits as well. Assuming you solve the grain-size issue of parallel for with something like lazy task splitting, so that you only split a task when there are thieves, you cannot use barriers at all anymore, because there is no guarantee that more than 1 thread will actually enter the barrier, and you will deadlock. So expressing dependencies based on control flow (barriers, locks, condition variables) wouldn't work. This is problematic for many algorithms, for example optimized matrix multiplication, which requires packing to be done before computing the tile-size matrix multiplications. Another impacted workload would be recurrent neural networks, which require a specific ordering.

#### Composition

OpenMP for loops are not nestable, which is a huge limiting factor. A batched matrix multiplication would require parallelizing at the outer level if we have 256x32x32 matrices, but at the inner level if we have 3x224x224 matrices. This is very hard to do with an OpenMP-based GEMM.

#### NUMA & distributed computing

I'm leaving these aside as I haven't implemented them yet. However, my runtime, unlike most, was designed assuming a message-passing model instead of a shared-memory model, so synchronization is done through channels instead of atomics.
Hi @mratsim. Thanks for your data above and for your updates. At this point the performance of Glow's CPU backend on multithreaded systems is not our highest priority, since PyTorch is already good at that. We're concentrating on making Glow a good framework for implementing backends for AI accelerator devices, such as the Intel NNPI and Habana backends already integrated into Glow. What's your goal here? Are you interested in improving the multithreaded performance of the Glow CPU backend?
I'm also building a deep learning compiler, but I'm focusing on the runtime first to ensure that it can meet the demands of complex deep learning kernels that require data, task, and graph parallelism to maximize throughput. So I'm sharing what I learned while building this runtime, and the various challenges and use/corner cases that I encountered.
Is there any way to integrate Glow with PyTorch so that we can get an additional performance gain in inference throughput? Also, if Glow's image-classifier were implemented as multi-threaded, it would be really helpful to compare the performance of PyTorch vs. Glow and evaluate the percentage performance gain we can obtain from Glow.
@shrutiramesh1988 We have torch_glow, which is our intended long-term path from PyTorch to Glow. We can also execute PyTorch models if they are converted to ONNX or Caffe2 -- see this link.
Is there any plan to add support for running data-parallel operations in several CPU threads? Threading gives frameworks like TVM almost linear speedup per core, and it'd be nice to see in Glow!
As for implementation, I imagine something like OpenCL's `enqueueKernel` would be the right abstraction.