Reduce allocations during readNativeFrames, leading to ~15-20% performance improvement #157

suyashkumar · 2020-12-18T04:17:19Z

This change reduces the number of allocations in readNativeFrames. Instead of allocating a slice for each pixel's samples, a single flat buffer slice is allocated upfront of the size pixelsPerFrame*samplesPerPixel. Later, ranges in that slice are referred to in the larger 2D slice. This leads to there only being two calls to make, leading to significant performance gains.

On my machine running make bench-diff:

name           old time/op  new time/op  delta
Parse/1.dcm-8  2.29ms ± 1%  1.95ms ± 3%  -15.00%  (p=0.008 n=5+5)
Parse/2.dcm-8  2.27ms ± 1%  1.94ms ± 0%  -14.41%  (p=0.008 n=5+5)
Parse/3.dcm-8  2.10ms ± 0%  1.81ms ± 2%  -13.77%  (p=0.008 n=5+5)
Parse/4.dcm-8   240ms ± 1%   196ms ± 4%  -18.27%  (p=0.008 n=5+5)
Parse/5.dcm-8  33.6ms ± 1%  27.9ms ± 0%  -17.00%  (p=0.008 n=5+5)

We see similar results in the GitHub action benchmark.

Presumably the percentage performance gains would be even higher for DICOMs with more Native PixelData (e.g. more frames, pixels per frame, samples per pixel, etc).

This helps address #161.

kaxap · 2020-12-19T19:27:20Z

@suyashkumar could you please take a look on this fork? We're seeing 5-10x speed improvements due to reduced hot path allocations.

suyashkumar · 2020-12-19T22:06:03Z

hi @kaxap, thanks for reaching out, looks like we are both pulling at the same thread here!

Note that there will be a larger apparent percent latency reduction for files with more frames or larger frames (don't have such examples in testdata/ yet, so it's not as apparent in the benchmarks above).

Similar to your implementation, this change removes an unnecessary slice allocation in the loop. In this implementation, two slices are allocated per frame in advance of parsing the frame. In your implementation, I see you read the data into a single dimension slice until the user wants to unpack it (where the rest of the work is done). This is something I considered myself, however I was able to achieve similar results without changing the API surface (in the struct) in this change so I prefer that until the next minor release with api breaking changes. The reason for this is because slicing a golang slice doesn't lead to copying (though there is still some minor overhead).

One of the things I see in your implementation is replacing the calls to binary.Read, which seems very interesting and useful--I didn't know it allocated a slice under the hood!

Would you be able to run the benchmarks on your fork so we can see how they look? And/or, would you be willing to open a PR here with the uint read changes to move away from binary.Read (but we can keep the slice structure the same for now)? If you open a PR, it will automatically run the benchmarks as a GitHub action and compare to baseline. Alternatively, I can make a change to that effect and add you as a co-author.

I will probably go ahead and merge this change to separate it from the others (there are some others I'd like to make too).

kaxap · 2020-12-20T00:30:05Z

@suyashkumar Thanks for the feedback. I was thinking about creating PR but I was not sure about the struct change I made, glad we cleared that up.
I'll create PR with binary<order>.Uintxx asap.

suyashkumar force-pushed the s/performance branch from 440b321 to 2822a7c Compare December 18, 2020 04:44

suyashkumar added the performance label Dec 18, 2020

suyashkumar self-assigned this Dec 18, 2020

suyashkumar force-pushed the s/performance branch from 2822a7c to ae94f35 Compare December 18, 2020 05:44

suyashkumar changed the title ~~[Experimental] Reduce allocations during readNativeFrames~~ Reduce allocations during readNativeFrames, leading to ~15% latency improvement Dec 18, 2020

suyashkumar changed the title ~~Reduce allocations during readNativeFrames, leading to ~15% latency improvement~~ Reduce allocations during readNativeFrames, leading to ~15% performance improvement Dec 18, 2020

suyashkumar force-pushed the s/performance branch from ae94f35 to 26d22cd Compare December 18, 2020 17:51

Reduce allocations during readNativeFrames

28d3c0c

suyashkumar force-pushed the s/performance branch from 65b5e25 to 28d3c0c Compare December 18, 2020 18:23

suyashkumar changed the title ~~Reduce allocations during readNativeFrames, leading to ~15% performance improvement~~ Reduce allocations during readNativeFrames, leading to 15-20% performance improvement Dec 19, 2020

suyashkumar changed the title ~~Reduce allocations during readNativeFrames, leading to 15-20% performance improvement~~ Reduce allocations during readNativeFrames, leading to ~15-20% performance improvement Dec 19, 2020

suyashkumar linked an issue Dec 19, 2020 that may be closed by this pull request

readNativeFrames Performance Optimizations #161

Open

suyashkumar removed a link to an issue Dec 19, 2020

readNativeFrames Performance Optimizations #161

Open

suyashkumar merged commit 33d864f into main Dec 19, 2020

suyashkumar mentioned this pull request Dec 19, 2020

readNativeFrames Performance Optimizations #161

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Reduce allocations during readNativeFrames, leading to ~15-20% performance improvement #157

Reduce allocations during readNativeFrames, leading to ~15-20% performance improvement #157

suyashkumar commented Dec 18, 2020 •

edited

Loading

kaxap commented Dec 19, 2020

suyashkumar commented Dec 19, 2020 •

edited

Loading

kaxap commented Dec 20, 2020

Reduce allocations during readNativeFrames, leading to ~15-20% performance improvement #157

Reduce allocations during readNativeFrames, leading to ~15-20% performance improvement #157

Conversation

suyashkumar commented Dec 18, 2020 • edited Loading

kaxap commented Dec 19, 2020

suyashkumar commented Dec 19, 2020 • edited Loading

kaxap commented Dec 20, 2020

suyashkumar commented Dec 18, 2020 •

edited

Loading

suyashkumar commented Dec 19, 2020 •

edited

Loading