opj_compress fails to compress lossless on gcc/x86 (-m32) #571

Closed
malaterre opened this issue Sep 1, 2015 · 23 comments

Comments

@malaterre
Collaborator

I am having an issue with OPJ 1.5 and the following file:

http://gdcm.sourceforge.net/thingies/lossless_issue.pgm

It looks as if I cannot do a lossless compression / decompression round trip within a 32-bit schroot. Everything is fine within a 64-bit schroot (Debian Jessie, if that matters).

@malaterre
Collaborator Author

Steps from debian/jessie i386:

$ image_to_j2k -i lossless_issue.pgm -o lossless_issue.j2k
$ j2k_to_image -i lossless_issue.j2k -o lossless_issue.opj.pgm
$ vim -b lossless_issue.opj.pgm
-> remove comment
$ vbindiff lossless_issue.pgm lossless_issue.opj.pgm

@malaterre
Collaborator Author

I can reproduce the exact same behavior using openjpeg 2.1.0 (as packaged in Debian).
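
For reference, the equivalent round trip with the 2.x command-line tools looks like this (a sketch assuming the Debian-packaged binaries; output file names are illustrative):

$ opj_compress -i lossless_issue.pgm -o lossless_issue.j2k
$ opj_decompress -i lossless_issue.j2k -o lossless_issue.opj.pgm
$ # strip the decoder's comment line from the PGM header, then compare as above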

@malaterre
Collaborator Author

It seems the compression step is not working as expected:

$ crc32 *
49d5c244    lossless_issue.pgm
c611132c    lossless_issue.sid32.j2k
a913d7c3    lossless_issue.sid32.kdu.pgm
a913d7c3    lossless_issue.sid32.pgm
9922c6b1    lossless_issue.sid.j2k
49d5c244    lossless_issue.sid.kdu.pgm
49d5c244    lossless_issue.sid.pgm

@malaterre
Collaborator Author

The good news is that if I run:

$ valgrind  opj_compress -i lossless_issue.pgm -o lossless_issue.sid32.j2k

Everything works as expected. So Valgrind is doing some kind of initialisation (rounding?) step that is missing in normal code execution.

@malaterre
Collaborator Author

If I now use clang-3.5 from my sid 32-bit chroot, everything works nicely. So the issue is not in the libc or similar, but rather in the code generated by gcc.

@malaterre
Collaborator Author

gcc-4.8, gcc-4.9 and gcc-5.2 all show the same symptoms.

@malaterre malaterre changed the title OPJ 1.5 is not lossless for 16bits grayscale input openjp2/t1.c:1517:28: runtime error: left shift of negative value -128 Sep 8, 2015
@malaterre
Collaborator Author

If I recompile openjpeg with -fsanitize=undefined I get:

/home/mathieu/tmp/opj-bug/openjpeg/src/lib/openjp2/t1.c:1517:28:
runtime error: left shift of negative value -128

ref:

https://github.com/uclouvain/openjpeg/blob/master/src/lib/openjp2/t1.c#L1517
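
For context, a standalone sketch (not OpenJPEG code) of the pattern UBSan is flagging; FRACBITS is a stand-in for T1_NMSEDEC_FRACBITS and its value here is only illustrative:

/* Left-shifting a negative signed value is undefined behaviour in
 * C99 (6.5.7p4), which is what -fsanitize=undefined reports at t1.c:1517. */
#include <stdint.h>
#include <stdio.h>

#define FRACBITS 6   /* placeholder for T1_NMSEDEC_FRACBITS */

int main(void)
{
    int32_t v = -128;

    /* v <<= FRACBITS;       -- undefined behaviour when v < 0 */
    v *= (1 << FRACBITS);  /* well-defined multiplication instead */

    printf("%d\n", v);     /* -8192 with FRACBITS == 6 */
    return 0;
}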

@malaterre malaterre modified the milestones: OPJ v2.1.1, OPJ v1.5.3 Sep 8, 2015
@malaterre malaterre changed the title openjp2/t1.c:1517:28: runtime error: left shift of negative value -128 opj_compress fails to compress lossless on gcc/x86 (-m32) Sep 8, 2015
@malaterre
Collaborator Author

Even if I fix the code using:

tiledp[tileIndex] *= (1 << T1_NMSEDEC_FRACBITS);

it still fails.

@malaterre
Collaborator Author

Now if I compile the whole code using gcc and then do:

(cd /home/mathieu/tmp/opj-bug/openjpeg/bin/src/lib/openjp2 && /usr/bin/clang  -Dopenjp2_EXPORTS -O0   -g -fPIC -I/home/mathieu/tmp/opj-bug/openjpeg/bin/src/lib/openjp2    -o CMakeFiles/openjp2.dir/tcd.c.o   -c /home/mathieu/tmp/opj-bug/openjpeg/src/lib/openjp2/tcd.c)

the code works as expected (i.e. recompiling only tcd.c with clang is enough).

@malaterre
Collaborator Author

It appears that the issue is within the function opj_tcd_makelayer:

https://github.com/uclouvain/openjpeg/blob/master/src/lib/openjp2/tcd.c#L218

@malaterre
Collaborator Author

To fix the symptoms one can avoid x87 excess precision by compiling with -msse2 -mfpmath=sse.
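
One possible way to rebuild with those flags inside the i386 chroot (a sketch; the source path is illustrative):

$ cmake -DCMAKE_C_FLAGS="-msse2 -mfpmath=sse" /path/to/openjpeg
$ make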

@malaterre
Collaborator Author

It is still not clear why such a minor change leads to a lossy compressed stream; this must be only the tip of the iceberg.

@malaterre
Collaborator Author

So the patch is simple: replace

if (dd / dr >= thresh)

with:

OPJ_FLOAT64 div;
div = dd / dr;
if (div >= thresh)

Then compile the code with -ffloat-store to remove excess precision.
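
To make the mechanism concrete, here is a minimal standalone sketch (not the actual OpenJPEG patch) of the two comparison forms; the function names and values are illustrative, and whether the raw form actually misbehaves depends on the data, the gcc version and the optimisation level when targeting 32-bit x87:

#include <stdio.h>

static int compare_raw(double dd, double dr, double thresh)
{
    /* the quotient may be held in an 80-bit x87 register and compared
     * against 'thresh', which has already been rounded to 64 bits */
    return (dd / dr) >= thresh;
}

static int compare_patched(double dd, double dr, double thresh)
{
    /* the assignment forces a rounding to 64-bit double first
     * (guaranteed with -fexcess-precision=standard / -std=c99,
     * or with -ffloat-store) */
    double div = dd / dr;
    return div >= thresh;
}

int main(void)
{
    /* arbitrary illustrative values; in OpenJPEG dd, dr and thresh come
     * from the rate-distortion data in opj_tcd_makelayer */
    double dd = 1.0, dr = 3.0;
    double thresh = dd / dr;

    printf("raw:     %d\n", compare_raw(dd, dr, thresh));
    printf("patched: %d\n", compare_patched(dd, dr, thresh));
    return 0;
}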

@malaterre
Collaborator Author

The gcc documentation states that -fexcess-precision=standard should be enough, but not in this case. We are still required to use -ffloat-store, which feels like a hack.

@malaterre
Collaborator Author

Confirmed by an upstream gcc developer here. We really need to understand why the code is so sensitive to excess precision.

@mayeut
Collaborator

mayeut commented Sep 9, 2015

@malaterre,

I had a look at the piece of code mentioned here.
In this specific case, for one of the computed values (where dd/dr "==" thresh), the comparison is true with -ffloat-store and false without it.

This can be shown easily with this piece of code (with a bit of context):

  if (!dr) {
    if (dd != 0)
      n = passno + 1;
    continue;
  }
  if ((dd / dr) < 0.45) {
    printf("%.32f - %.32f\n", (dd / dr), thresh);
  }

  if ((dd / dr) >= thresh) {
    n = passno + 1;
    if ((dd / dr) < 0.45) {
      printf("True compare\n");
    }
  }
}

layer->numpasses = n - cblk->numpassesinlayers;

We should not rely on "==" comparison for floats.
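
For illustration, a tolerance-based comparison in the style the eventual fix reportedly uses (see the DBL_EPSILON discussion below); this is a sketch, not the committed OpenJPEG code:

#include <float.h>
#include <stdio.h>

/* treat dd/dr as having reached thresh when it falls within DBL_EPSILON
 * below it, instead of relying on an exact ">=" comparison */
static int reaches_thresh(double dd, double dr, double thresh)
{
    return thresh - (dd / dr) < DBL_EPSILON;
}

int main(void)
{
    /* illustrative values only */
    printf("%d\n", reaches_thresh(1.0, 3.0, 1.0 / 3.0));
    return 0;
}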

mayeut added a commit to mayeut/openjpeg that referenced this issue Sep 9, 2015
@mayeut
Collaborator

mayeut commented Sep 9, 2015

I'll add this specific image to the test suite if needed

mayeut added a commit to mayeut/openjpeg-data that referenced this issue Sep 9, 2015
mayeut added a commit to mayeut/openjpeg that referenced this issue Sep 9, 2015
@malaterre
Collaborator Author

Just in case my series of comments was not clear:

  • Using -ffloat-store is not the correct solution. I'd rather see -fexcess-precision=standard (implied by -std=c99).
  • -fexcess-precision=standard only has an effect on casts and assignments, so the code needs to be changed (either store the result in a variable or add an explicit cast with a comment).
  • Get rid of -ffast-math; this option is simply harmful here. I should have made my commit 6d7f5cc much clearer and removed it completely. If some Linux distro decides to use it, it is their right to shoot themselves in the foot, but OpenJPEG should not make it a default option when compiling in Release; that's just wrong.

And finally, as discussed with Antonin, I fail to understand this piece of code, so removing the excess precision with a compiler-specific option is just plain wrong. One would need to understand why this piece of code is so sensitive to excess precision.

Thanks for adding the dataset to the test suite !

mayeut added a commit to uclouvain/openjpeg-data that referenced this issue Sep 11, 2015
@mayeut mayeut closed this as completed in 5d95355 Sep 11, 2015
@malaterre
Collaborator Author

Should I re-open this issue for the opj 1.5 branch?

@mayeut mayeut modified the milestones: OPJ v1.5.3, OPJ v2.1.1 Sep 11, 2015
@mayeut
Collaborator

mayeut commented Sep 11, 2015

The issue can only be assigned to one milestone...
I think a new one should be created, or maybe just list this one in #574.

@vinc17fr

Note that even if you remove the excess precision, your code might still be affected by double rounding (which occurs with a probability of 1/2048 on random data). So, you really need to understand the code and how it can be affected by floating-point rounding errors.
The bug was closed by doing "thresh - (dd / dr) < DBL_EPSILON", but one question is whether DBL_EPSILON is sufficient, i.e. what is the maximum error on thresh - (dd / dr) that one can expect?

@malaterre
Collaborator Author

Salut Vincent! Indeed, I had seen this at least once through a funky example. Do you believe that simply adjusting the DBL_EPSILON bound would make the issue go away?

@vinc17fr

Depending on the context, a bound like DBL_EPSILON or something larger may or may not be correct. First, the code should be commented so that the reader knows what it tries to do, for instance via a mathematical specification.
About the floating-point implementation: the variable dr is an integer, so I suppose it is exact. The variables dd and thresh are floating-point numbers. There are two questions about them: what is the range of thresh and dd / dr, and what is the floating-point error bound on the variables dd and thresh?
Note that the current code may be regarded as suspicious, because the error bound DBL_EPSILON is an absolute one, while a relative one may be needed. For instance, if thresh and dd / dr can be large (or small) compared to 1, then the bound DBL_EPSILON does not make much sense. Hence my question about the range.
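
A small sketch of the distinction being raised here, with illustrative values only (compile with -lm); which bound is appropriate depends on the actual range of thresh and dd / dr:

#include <float.h>
#include <math.h>
#include <stdio.h>

/* absolute bound: only meaningful when the compared values are of order 1 */
static int ge_abs(double x, double thresh)
{
    return thresh - x < DBL_EPSILON;
}

/* relative bound: scales with the magnitude of the operands */
static int ge_rel(double x, double thresh)
{
    return thresh - x < DBL_EPSILON * fmax(fabs(x), fabs(thresh));
}

int main(void)
{
    double thresh = 1.0e6;
    double x = nextafter(thresh, 0.0); /* one ulp below thresh */

    /* with a large thresh, the absolute DBL_EPSILON test does not treat a
     * one-ulp difference as equal, while the relative test does */
    printf("absolute: %d  relative: %d\n", ge_abs(x, thresh), ge_rel(x, thresh));
    return 0;
}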
