Currently, the documentation and the code tell me that the last block of a parallel loop "may be a bit longer". In practice, this can be quite punishing and is less than optimal.
Simple example: a loop with 100 iterations, split across 15 threads, will result in blocks of size 6, except for the last one, which will have 16 elements, more than double that of the other blocks. In addition, since it is submitted last, it is likely to also start last, adding to the wait time.
This seems to be the result of the computation rounding down (not explicitly in the code, but it is the consequence of how the block sizes are computed). I'd suggest rounding up instead, so that the above example results in blocks of size 7, with the last one of size 2. That would be a more even distribution of work.
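For illustration, here is a minimal sketch (not the library's actual code) of the two block-size computations, using the 100-iterations / 15-threads example:

```cpp
#include <cstddef>
#include <cstdio>

int main()
{
    const std::size_t iterations = 100, threads = 15;

    // Current behavior: block size rounded down; the last block absorbs the rest.
    const std::size_t down      = iterations / threads;               // 6
    const std::size_t last_down = iterations - down * (threads - 1);  // 16

    // Suggested behavior: block size rounded up; the last block gets whatever remains.
    const std::size_t up      = (iterations + threads - 1) / threads; // 7
    const std::size_t last_up = iterations - up * (threads - 1);      // 2

    std::printf("round down: %zu blocks of %zu, last block of %zu\n", threads - 1, down, last_down);
    std::printf("round up:   %zu blocks of %zu, last block of %zu\n", threads - 1, up, last_up);
}
```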
Hi @lczech, thanks for opening this issue! You raise a good point. But I don't think rounding up works every time. Consider for example 100 iterations parallelized across 14 threads. 100/14 = 7.14, so with rounding down, this will result in 13 threads of 7 iterations plus one thread of 9 iterations. But if I round up, the block size becomes 8, and 13 threads of 8 iterations would be 104; since there are only 100, thread 13 will actually only get 4 iterations and thread 14 will get no iterations at all, and that is definitely not optimal.
I think the best way to do this would be not to round down or up universally, but rather to round to the nearest integer. That way 100/15 = 6.67 would round up to 7 and 100/14 = 7.14 would round down to 7, and both cases give an even distribution.
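To make the arithmetic concrete, a nearest-integer block size could be computed roughly like this (just a sketch, not the eventual implementation):

```cpp
#include <cstddef>
#include <cstdio>

// Block size rounded to the nearest integer, i.e. floor(iterations / threads + 0.5),
// computed purely in integer arithmetic.
std::size_t block_size_nearest(const std::size_t iterations, const std::size_t threads)
{
    return (iterations + threads / 2) / threads;
}

int main()
{
    std::printf("100 iterations, 15 threads -> block size %zu\n", block_size_nearest(100, 15)); // 7
    std::printf("100 iterations, 14 threads -> block size %zu\n", block_size_nearest(100, 14)); // 7
}
```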
Closing this issue since I will implement this in the next release. (But first I'll need to convince myself mathematically that this actually works no matter the number of iterations and threads; if you want to formulate a rigorous proof, that would be extremely helpful!)
A better solution would in fact be to round down and distribute the remaining elements individually to the threads, ideally to the first ones, so that the blocks that are one element larger than the others are run first, minimizing overall wait time.
Example: 22 elements, 4 threads. Threads 1 and 2 get 6 elements each, threads 3 and 4 get 5. That seems optimal. I've implemented that in my threading code now.
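Here is a rough sketch of that scheme (partition_blocks is just an illustrative name, not part of this library):

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

// Round down, then give the remaining elements to the first threads, one each,
// so that the slightly larger blocks are submitted (and thus started) first.
std::vector<std::size_t> partition_blocks(const std::size_t iterations, const std::size_t threads)
{
    const std::size_t base      = iterations / threads;
    const std::size_t remainder = iterations % threads;
    std::vector<std::size_t> sizes(threads, base);
    for (std::size_t i = 0; i < remainder; ++i)
        ++sizes[i];
    return sizes;
}

int main()
{
    // 22 elements across 4 threads -> 6 6 5 5
    for (const std::size_t size : partition_blocks(22, 4))
        std::printf("%zu ", size);
    std::printf("\n");
}
```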