IBA::Unsharp_mask() speed and memory optimization #4513
Conversation
Instead of using 3x IBA calls + helper functions, use a single unsharp_mask_impl() function that uses parallel_image().

Signed-off-by: Vlad (Kuzmin) Erium <libalias@gmail.com>
For kernels 5x5 or larger this can give a speedup of 40%, and higher for bigger kernels.

Signed-off-by: Vlad (Kuzmin) Erium <libalias@gmail.com>
@lgritz what should we do? The difference is 0.004 for one of the channels. I guess the issue happens due to float rounding errors when convolution_() estimates a scale for normalization and uses it later. As an alternative, we could use double for the kernels and normalization, and cast to float before applying to the real pixels.
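The double-precision alternative mentioned above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code; the helper name `make_gaussian_1d` and its shape are hypothetical.

```cpp
#include <cmath>
#include <vector>

// Hypothetical sketch: build a normalized 1D Gaussian kernel, accumulating
// the weight sum in double so the normalization factor is not perturbed by
// float rounding, then cast back to float for per-pixel work.
std::vector<float> make_gaussian_1d(int radius, float sigma)
{
    std::vector<float> kernel(2 * radius + 1);
    double sum = 0.0;  // double accumulator: rounding drift stays tiny
    for (int i = -radius; i <= radius; ++i) {
        double w = std::exp(-0.5 * double(i) * i / (double(sigma) * sigma));
        kernel[i + radius] = float(w);
        sum += w;
    }
    double inv = 1.0 / sum;  // normalization computed entirely in double
    for (float& w : kernel)
        w = float(w * inv);
    return kernel;
}
```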
Signed-off-by: Larry Gritz <lg@larrygritz.com>
LGTM
The changes resulted in 3 pixels having a single LSB difference versus the old code, which I think is totally understandable because you changed the code in a way that should be mathematically equivalent, but is a different sequence of floating point operations. I think this is acceptable, especially given the performance gain.
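The 1-LSB effect described above follows from float addition not being associative. A tiny self-contained illustration (not related to the PR's code; the two function names are made up for the example):

```cpp
// Illustration: IEEE single-precision addition is not associative, so a
// mathematically equivalent reordering of operations can change the result
// by an LSB (or more). Here 'small' is absorbed entirely in one ordering.
inline float absorb_then_subtract(float big, float small)
{
    return (big + small) - big;  // 'small' is lost: big + small rounds to big
}

inline float cancel_then_add(float big, float small)
{
    return small + (big - big);  // exact cancellation preserves 'small'
}
```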
I took the liberty of updating the reference output and pushing that change to your tree, that should clear things up for the tests.
There are still some failed CI tests -- unrelated to your code -- in the heif test, which is the result of the GitHub runners upgrading the libheif version they use. We've already fixed this in OIIO main; you branched off a little earlier than that fix, so as soon as we merge this PR, all will be ok.
I'll double check that CI passes in all cases except for the heif failures, and if that's the case, I will merge.
Sorry to take a while with this review, it's been a really busy last several days for me. But this is good work, thanks.
…areFoundation#4513)

Replaces the 3x IBA calls + helper function, which generated 4 full-size image buffers, with a single unsharp_mask_impl() that uses parallel_image() to compute the unsharp result:

src + contr * (((src - blur) < threshold) ? 0.0 : (src - blur))

Also adds a two-pass 1D convolution for kernels larger than 3x3.

## Tests

```
ImageBuf sharped(input.spec());
const int repeats = 50;
std::cout << "Start sharpening\n";
auto start = std::chrono::high_resolution_clock::now();
for (int i = 0; i < repeats; i++) {
    //ok = ImageBufAlgo::unsharp_mask(sharped, input, "gaussian", 15.0f, 10.0f, 0.01f);
    ok = ImageBufAlgo::unsharp_mask(sharped, input, "gaussian", 5.0f, 2.0f, 0.05f);
    std::cout << ".";
}
std::cout << "\n";
auto part1 = std::chrono::high_resolution_clock::now();
std::chrono::duration<double> elapsed_part1 = part1 - start;
std::cout << "Elapsed time: " << elapsed_part1.count() << " s\n";
```

Both single-threaded (one ImageBuf at a time) and multithreaded (multiple ImageBufs at a time) runs show a pretty good speedup: ~30-40%, with less memory use. For 5x5 gaussian kernels, the two-pass mode should add at least another 20% speedup. (Independent benchmarks would be great, since my own results differed considerably between real and synthetic use.)

---------

Signed-off-by: Vlad (Kuzmin) Erium <libalias@gmail.com>
Description
Replaces the 3x IBA calls + helper function, which generated 4 full-size image buffers, with a single unsharp_mask_impl() that uses parallel_image() to compute the unsharp result:

src + contr * (((src - blur) < threshold) ? 0.0 : (src - blur))

Also adds a two-pass 1D convolution for kernels larger than 3x3.
Tests
Both single-threaded (one ImageBuf at a time) and multithreaded (multiple ImageBufs at a time) runs show a pretty good speedup: ~30-40%, with less memory use.
For 5x5 gaussian kernels, the two-pass mode should add at least another 20% speedup.
(Independent benchmarks would be great, since my own results differed considerably between real and synthetic use.)
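The two-pass speedup comes from the Gaussian kernel being separable: the 2D kernel is the outer product of two 1D kernels, so the blur can run as a horizontal pass plus a vertical pass (2k taps per pixel) instead of a single 2D pass (k*k taps). A small numeric check of that factorization, under hypothetical helper naming:

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Hypothetical sketch (not OIIO API): measure how far the 2D Gaussian
// weight at (x, y) deviates from the product of the corresponding 1D
// weights. Separability means this error is essentially zero, which is
// what justifies replacing the 2D convolution with two 1D passes.
inline double max_factorization_error(int radius, double sigma)
{
    std::vector<double> k1(2 * radius + 1);
    for (int i = -radius; i <= radius; ++i)
        k1[i + radius] = std::exp(-0.5 * i * i / (sigma * sigma));
    double maxerr = 0.0;
    for (int y = -radius; y <= radius; ++y)
        for (int x = -radius; x <= radius; ++x) {
            double k2 = std::exp(-0.5 * (x * x + y * y) / (sigma * sigma));
            maxerr = std::max(maxerr,
                              std::fabs(k2 - k1[x + radius] * k1[y + radius]));
        }
    return maxerr;
}
```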
Checklist:
- I have updated the documentation, if applicable. (It may not be necessary to update the documentation, for example if this is a bug fix that doesn't change the API.)
- I have ensured that the change is tested (adding new test cases if necessary).
- If I added or modified a C++ API call, I have also amended the corresponding Python bindings (and if altering ImageBufAlgo functions, also exposed the new functionality as oiiotool options).
- If I haven't already run clang-format before submitting, I definitely will look at the CI test that runs clang-format and fix anything that it highlights as being nonconforming.