
opj_compress: 40% of encode time is spent freeing data #445

Closed

gcode-importer opened this issue Nov 28, 2014 · 16 comments

Comments

@gcode-importer

Originally reported on Google Code with ID 445

I am investigating compression perf on a 4096x2048 RGB image.

On my Core i7 under Windows, it takes 4.389 seconds for lossy compression using the default compression settings.

I made a little experiment and allowed the encoder to run with null data, i.e. no MCT, DWT, T1 or rate-distortion was performed.

Timing for this was a whopping 1.667 seconds.

Next, I ran the Very Sleepy profiler (http://www.codersnotes.com/sleepy) (sorry, Windows only). The profiler showed that 89% of the time was being spent in the "opj_tcd_code_block_enc_deallocate" method, freeing memory.

We need to investigate ways of reducing this time. One seemingly simple idea is to recycle the code blocks for use on the next frame, if one is encoding a group of frames (DCP workflow, for example).
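For illustration, a recycling scheme along those lines might look roughly like the sketch below. None of these names exist in OpenJPEG; it only assumes the internal opj_malloc/opj_free helpers and that successive frames have identical code-block geometry:

```c
/* Hypothetical sketch of a code-block buffer pool: instead of freeing every
 * buffer in opj_tcd_code_block_enc_deallocate(), park it for the next frame. */
typedef struct cblk_buffer_pool {
    OPJ_BYTE **buffers;   /* parked code-block data buffers */
    OPJ_UINT32 count;     /* how many buffers are currently parked */
    OPJ_UINT32 capacity;  /* size of the buffers array */
} cblk_buffer_pool_t;

/* Called where the encoder would otherwise free a code-block buffer. */
static void pool_release(cblk_buffer_pool_t *pool, OPJ_BYTE *data)
{
    if (pool->count < pool->capacity) {
        pool->buffers[pool->count++] = data;   /* keep it for the next frame */
    } else {
        opj_free(data);                        /* pool full: really free it */
    }
}

/* Called where the encoder would otherwise allocate a code-block buffer.
 * Assumes every frame has the same geometry, so every buffer has the same size. */
static OPJ_BYTE *pool_acquire(cblk_buffer_pool_t *pool, OPJ_SIZE_T size)
{
    if (pool->count > 0) {
        return pool->buffers[--pool->count];
    }
    return (OPJ_BYTE *)opj_malloc(size);
}
```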

I am going to look into this.   



Reported by boxerab on 2014-11-28 20:14:45

@gcode-importer

I have solved this issue: please see my exp branch on github

https://github.com/OpenJPEG/openjpeg/tree/exp

Because there are several commits in the branch, it is easier to see what I did on GitHub than by looking at a single patch or multiple patches.

With these fixes, for the use case where a user is compressing multiple files with the same image characteristics (same height, width, bpp, number of components), encoding is almost twice as fast. For the DCP use case, for example, this will help a lot.
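For context, here is a sketch of that per-frame workflow with the stock public 2.x API, where every iteration rebuilds and tears down the codec (the frames/frame_out arrays are placeholders and error checking is omitted); the idea of the branch is to let the expensive internal objects survive across these iterations:

```c
#include <openjpeg.h>

/* Per-frame encode loop with the standard 2.x API: each iteration creates and
 * destroys a codec, so the code-block teardown cost is paid on every frame. */
static void encode_sequence(opj_image_t **frames, const char **frame_out,
                            int num_frames, opj_cparameters_t *parameters)
{
    for (int i = 0; i < num_frames; i++) {
        opj_codec_t *codec = opj_create_compress(OPJ_CODEC_J2K);
        opj_setup_encoder(codec, parameters, frames[i]);

        opj_stream_t *stream =
            opj_stream_create_default_file_stream(frame_out[i], OPJ_FALSE);

        opj_start_compress(codec, frames[i], stream);
        opj_encode(codec, stream);
        opj_end_compress(codec, stream);   /* internally tears down the tcd
                                              and its code blocks */

        opj_stream_destroy(stream);
        opj_destroy_codec(codec);          /* frees the remaining codec state */
    }
}
```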

Note: there may be a few memory leaks, because the trunk code assumes that one is not reusing the tcd object, for example, so it may not free memory correctly. But if there is a leak, it must be minor, because I watched the memory usage over a long period of time and it did not grow.

Reported by boxerab on 2014-11-29 04:13:24

@gcode-importer

I have combined all of my changes into a single commit: attached is the diff.

Note: I have not run any regression tests, and my own testing involved default settings
for irreversible compression.

Reported by boxerab on 2014-11-30 01:39:58


- _Attachment: [double_encode_speed.diff](https://storage.googleapis.com/google-code-attachments/openjpeg/issue-445/comment-2/double_encode_speed.diff)_

@gcode-importer

@Aaron,

Thanks for the patch. However, there are multiple lines that are removed/added because of space/tab alignment issues, which makes the diff hard to read. Do you think you can provide a "cleaner" version?

Reported by mayeut on 2014-12-15 19:21:20

@gcode-importer

Hi Matthieu,
I've cleaned it up a bit. Let me know if this works for you.
Thanks,
Aaron

Reported by boxerab on 2014-12-16 00:00:56


- _Attachment: [double_encode.diff](https://storage.googleapis.com/google-code-attachments/openjpeg/issue-445/comment-4/double_encode.diff)_

@gcode-importer

By the way, you can find this patch on a branch of my github repo:


https://github.com/OpenJPEG/openjpeg/tree/double_encode_speed


Reported by boxerab on 2014-12-16 00:02:33

@gcode-importer

Aaron,

This diff is easier to read.
A few comments though.

It seems that NULL data is still allowed in this patch for all functions (mct, dwt, t1, ...). I guess it shouldn't be (& this is what's causing most of the remaining "hard to read" issues in the patch).
The ABI and API are broken with this patch.
I do not feel comfortable applying this patch as is; maybe others can share their opinions?

Nevertheless, I think something should be done.
I guess the "easier" thing to do without breaking the API/ABI is to allow recycling/reuse of some objects as you propose (like tcd/code-blocks).

I did some profiling on my end and focused on problems you raised. For this I used
opj_compress -i OpenJpeg/data/input/nonregression/ElephantDream_4K.tif -o ElephantDream_4K.j2k
-cinema4K
  Running Time  Symbol Name
8644.0ms 99.9%  main
8174.0ms 94.5%    opj_j2k_encode
7915.0ms 91.5%      opj_j2k_post_write_tile
 259.0ms  2.9%      opj_tcd_init_tile
 227.0ms  2.6%        calloc
   6.0ms  0.0%        malloc
   4.0ms  0.0%        _platform_bzero$VARIANT$Merom
   1.0ms  0.0%        opj_tgt_create
   1.0ms  0.0%          calloc
 359.0ms  4.1%    tiftoimage
  92.0ms  1.0%    opj_j2k_end_compress
  91.0ms  1.0%      opj_j2k_end_encoding
  90.0ms  1.0%        opj_tcd_destroy
  75.0ms  0.8%          opj_tcd_code_block_enc_deallocate
  56.0ms  0.6%            szone_free_definite_size
  13.0ms  0.1%            deallocate_pages
   5.0ms  0.0%            free
   1.0ms  0.0%            _os_lock_spin_unlock
  13.0ms  0.1%          deallocate_pages
   1.0ms  0.0%          szone_free_definite_size
   1.0ms  0.0%          opj_tgt_destroy
   1.0ms  0.0%        free_large
   1.0ms  0.0%      opj_j2k_write_eoc
  18.0ms  0.2%    opj_destroy_codec
  18.0ms  0.2%      opj_j2k_destroy
  18.0ms  0.2%        opj_image_destroy
  18.0ms  0.2%          free_large
   1.0ms  0.0%    opj_stream_destroy

The calloc calls in opj_tcd_init_tile (in fact in opj_tcd_code_block_enc_allocate, but inlined or "lost" by the profiler) are taking way too much time. malloc seems the way to go here if we don't need a "memset" to 0 (I tried it & malloc is way faster on my platform, but I did not check non-regression).
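In other words, the zero-fill is what costs. A sketch of that change (cblk and l_data_size are stand-ins; the real allocation sits in opj_tcd_code_block_enc_allocate and may differ):

```c
/* Sketch only: swap the zero-filling allocation of a code-block buffer for a
 * plain allocation, on the assumption that the T1 coder fully overwrites the
 * buffer before reading it (this is the non-regression question raised above). */

/* before: cblk->data = (OPJ_BYTE*) opj_calloc(1, l_data_size); */
cblk->data = (OPJ_BYTE *) opj_malloc(l_data_size);
if (!cblk->data) {
    return OPJ_FALSE;
}
```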

IMHO, a first patch should deal with malloc instead of calloc if possible + reuse of
tcd/code-blocks. This will allow for API/ABI compatibility with a first level of optimization.

I guess the amount of time spent in those allocation functions is related to the number of times they're called (inner loop of opj_tcd_init_tile) rather than the overall allocated size, so another option might be to allocate a bigger block that will be split by OpenJPEG (it might not be trivial though).
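A rough sketch of that last idea, kept deliberately generic (none of these names exist in OpenJPEG): grab one large block per tile and carve the code-block buffers out of it, so the many small malloc/free pairs collapse into a single allocation and a single free:

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical slab allocator: one big allocation per tile, handed out in
 * aligned chunks, so tearing down the tile is a single free() call. */
typedef struct slab {
    unsigned char *base;  /* start of the big block */
    size_t size;          /* total bytes available */
    size_t used;          /* bytes handed out so far */
} slab_t;

static int slab_init(slab_t *s, size_t total)
{
    s->base = (unsigned char *)malloc(total);
    s->size = total;
    s->used = 0;
    return s->base != NULL;
}

/* Hand out the next chunk, 16-byte aligned; the caller must size the slab
 * up front (the total code-block data size follows from the tile geometry). */
static unsigned char *slab_alloc(slab_t *s, size_t n)
{
    size_t offset = (s->used + 15u) & ~(size_t)15u;
    if (offset + n > s->size) {
        return NULL;  /* exhausted; a real version would grow or chain slabs */
    }
    s->used = offset + n;
    return s->base + offset;
}

static void slab_destroy(slab_t *s)
{
    free(s->base);  /* replaces thousands of per-code-block free() calls */
    s->base = NULL;
    s->size = s->used = 0;
}
```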

Reported by mayeut on 2014-12-16 20:23:18

@gcode-importer

Matthieu,

Thanks for reviewing.  Did you see a significant speed up on your system
with this patch?

I think I could simplify this further: the code I added to support encoding without any data could be removed; I was using it only for debugging, to isolate the encoding slow-down.

What is left would allow the user to recycle the compress objects, rather than destroying and re-creating them for each frame. Do you see an issue with allowing recycling of compress objects?

Regarding ABI/API changes, I am sure DCP encoder users would be more than happy to change a little code to get a performance improvement of this magnitude.

Reported by boxerab on 2014-12-16 21:46:14

@gcode-importer

Here is an even simpler patch.

Reported by boxerab on 2014-12-16 22:04:58


- _Attachment: [double_encode.diff](https://storage.googleapis.com/google-code-attachments/openjpeg/issue-445/comment-8/double_encode.diff)_

@gcode-importer

Aaron,

I previously did not test your patch but instead simply profiled what you were pointing out. This is now done.
I'm in no way against recycling objects and I think it should work.

Regarding the ABI/API changes, I think that OpenJPEG needs one clean/radical change rather than many small ones (2.0 => 2.1 is a small change that requires maintaining two branches for a minimal feature gain; what you propose could be seen the same way, requiring 3 branches for few added features). See issue 439 for my opinion, which is, in a few words, that a 3.0 is needed, designed in a way that will allow easier integration of enhancements without the API/ABI being broken.

That being said, all enhancements that can be done in 2.1 without breaking the API/ABI should be merged in quickly (still in my opinion). BTW, as you mentioned, more than 80% of the time that this patch saves comes from tcd reuse, which can be done without API/ABI changes.
There are parts of your patch (those that don't rely on "owns_data" in opj_image_comp_t) that could probably be merged in.
The tricky part is that it has to work when passing different parameters, different image sizes & so on. The test framework currently expects 1 image in, 1 compressed image out. It would probably be easy to test with different image sizes given how opj_compress works. Different parameters are another matter, so more work is probably required on opj_compress to allow this (it could take a list of commands in an input file, for example).

I won't merge a patch in if I can't test it. I know there are some parts of OpenJPEG that aren't tested, but I won't take the responsibility of doing such a thing. Maybe you can ask Matthieu or Antonin if limitations could apply to such an enhancement (like, recycling is OK as long as all parameters are unchanged; otherwise a crash could happen). The limitation could be enforced by checks (in opj_setup_encoder, for example).
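As an illustration of such a check (the helper below is invented here, not part of the patch or the OpenJPEG API; it only reads public opj_image_t fields), recycling would be permitted only when the new image matches the previous one exactly:

```c
/* Hypothetical guard: allow reuse of the previously built encoder objects only
 * when the next image has exactly the same geometry as the previous one. */
static OPJ_BOOL can_recycle_encoder(const opj_image_t *prev, const opj_image_t *next)
{
    OPJ_UINT32 i;

    if (!prev || !next || prev->numcomps != next->numcomps) {
        return OPJ_FALSE;
    }
    if (prev->x0 != next->x0 || prev->y0 != next->y0 ||
        prev->x1 != next->x1 || prev->y1 != next->y1) {
        return OPJ_FALSE;
    }
    for (i = 0; i < prev->numcomps; i++) {
        if (prev->comps[i].w    != next->comps[i].w    ||
            prev->comps[i].h    != next->comps[i].h    ||
            prev->comps[i].prec != next->comps[i].prec ||
            prev->comps[i].dx   != next->comps[i].dx   ||
            prev->comps[i].dy   != next->comps[i].dy   ||
            prev->comps[i].sgnd != next->comps[i].sgnd) {
            return OPJ_FALSE;
        }
    }
    return OPJ_TRUE;
}
```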

Reported by mayeut on 2014-12-17 19:23:07

@gcode-importer

Hello Again,

First of all, let me say I do appreciate your testing these patches, and your feedback
on this issue.

We can leave it up to the project chefs to decide how to proceed on this one.

Best Regards,
Aaron

Reported by boxerab on 2014-12-17 19:41:48

@gcode-importer

Hello Aaron, Matthieu,

Thanks to both of you for having investigated this. 
I agree with Matthieu concerning the way we should implement this speed enhancement.

A first patch should be done that does not break the API/ABI, with appropriate checks in order to know whether recycling is possible or not. This should be part of version 2.1.1, which I would like to release during February.

A second patch would allow an API/ABI break but would be part of OpenJPEG 3.0 (which will follow the suggestions in issue 439).

A question I have about this enhancement: is recycling done at the image level or at the tile level? In the former case, is there a way to recycle at the tile level, which would allow a speed optimization for single-image encoding with multiple tiles?

Thanks again for your work.

Reported by detonin on 2015-01-20 11:18:37

@gcode-importer

Hi Antonin,

The current patch will handle multi-tile encoding. To get the speed boost, though, you need a workflow that encodes multiple images with the same image characteristics: same width, height, bit depth, etc. That is why I mentioned the DCP workflow. Single tile or multi-tile doesn't affect things.

Cheers,

Aaron 

Reported by boxerab on 2015-01-20 14:08:41

@gcode-importer

@Aaron: thanks for the clarification

@Aaron and Matthieu: do you think one of you could prepare a patch that does not break the API/ABI but recycles objects in the case of multiple-image encoding?

Reported by detonin on 2015-01-20 17:11:06

@gcode-importer

I'm afraid I don't have time to work on this at the moment.

Reported by boxerab on 2015-01-20 19:33:31

@gcode-importer

Well, I re-tested this patch on top of the latest trunk commit: unfortunately (or fortunately), I am only seeing an 11% speedup now. Still, nice to have, but not as dramatic as before.

Reported by boxerab on 2015-01-29 19:33:27

@boxerab

boxerab commented Jul 3, 2015

With the latest Visual Studio compiler, not freeing the codec makes a negligible difference to timing. So there is no longer any reason to do this. Please close this issue.
