libyuv colourspace conversion #973

Closed
totaam opened this issue Sep 2, 2015 · 14 comments

Comments

@totaam
Collaborator

totaam commented Sep 2, 2015

Issue migrated from trac ticket #973

component: encodings | priority: blocker | resolution: fixed

2015-09-02 12:30:02: antoine created the issue


libyuv is optimized for SSE2/SSSE3/AVX2 on x86/x64, and it is packaged in Fedora: libyuv pkgdb entry

It could compete with ffmpeg on speed; sadly, scaling requires a separate step.


totaam commented Jan 11, 2016

Added preliminary support and packaging (Fedora only) in r11646, only for BGRX to YUV420P.
That thing is seriously fast. It beats the current record holder (ffmpeg's swscale) by a huge margin on 256x256 and up (this sort of size is pretty common for video).
On a somewhat old AMD FX 8150:

CSC      16x16  128x128  256x256  512x512  1920x1080  2560x1600
swscale     12      137      168      188        183        184
libyuv      18      811     1582     2101       1039       1087
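
To put a number on "a huge margin", the ratio between the two rows of the table above can be computed directly. This is only a minimal Python sketch: the figures are copied from the table, the variable names are illustrative, and since the ticket does not state a unit only the ratio is meaningful here.

```python
# Figures copied from the AMD FX 8150 table above (unit not stated in the ticket).
sizes = ["16x16", "128x128", "256x256", "512x512", "1920x1080", "2560x1600"]
swscale = [12, 137, 168, 188, 183, 184]
libyuv = [18, 811, 1582, 2101, 1039, 1087]

# Ratio of the libyuv figure to the swscale figure for each frame size,
# rounded to one decimal place.
speedup = {
    size: round(fast / slow, 1)
    for size, slow, fast in zip(sizes, swscale, libyuv)
}

for size, ratio in speedup.items():
    print(f"{size:>10}: libyuv / swscale = {ratio}")
```

On this data the gap is smallest at 16x16 and largest at 512x512.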

Still TODO:


totaam commented Jan 11, 2016

2016-01-11 13:37:55: antoine uploaded file csc-libyuv-yuv420p-to-bgrx.patch (3.4 KiB)

work in progress patch


totaam commented Jan 11, 2016

r11647 adds support for scaling and improves the tests to measure the performance.
Scaling costs us about 10 to 20%, and swscale is a lot more competitive when scaling is involved.
I have also tested the size limits of this csc mode, and it can scale past 32kx32k which will be helpful for #969.


@smo:

  • please record CSC performance data on a reference system (Xeon?), also for #926 (csc opencl performance has regressed), and maybe someone can generate some pretty graphs to visually verify the differences. As it is, libyuv should win the video pipeline scoring system because it has a lower "setup cost" than all the other csc modules (set to zero because we don't need any setup at all apart from instantiating our Cython adapter class!)
  • try the osx and win32 builds: you can probably ask for help on the "unusual" chromium build system (I do plan to use libyuv client side, see patch, which will be useful for opengl-challenged systems like OSX + Intel cards)
  • then re-assign to afarr for testing / some testing can be done in parallel already but the ticketing system does not have this option

For testing:

  • verify that it is installed and gets loaded, using the codec loader (xpra/codecs/loader.py / Encoding_info.exe) and video helper (xpra/codecs/video_helper.py) test classes
  • start the server with --csc-modules=libyuv (we only use libyuv for RGB to yuv, so it only gets used on the server) and verify that it gets used: xpra info | grep csc
  • maybe force downscaling to exercise that code: XPRA_SCALING_HARDCODED=1:2 xpra start ... , or just make the scaling more aggressive: xpra start --video-scaling=100 ... - the downscaling should be visible from xpra info | grep csc.
  • try to break it, look for memory leaks, visual corruption, etc
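
The `xpra info | grep csc` check from the list above can also be scripted, which helps when repeating it across several test runs. The following is a rough Python sketch, not part of xpra: the helper names `csc_lines` and `xpra_csc_info` are made up for this example, and it assumes only that `xpra info` prints one setting per line, as the grep usage implies.

```python
import subprocess


def csc_lines(info_text):
    """Return the lines of `xpra info` output that mention "csc"."""
    return [line for line in info_text.splitlines() if "csc" in line]


def xpra_csc_info(command=("xpra", "info")):
    """Run `xpra info` and keep only the csc-related lines.

    Raises subprocess.CalledProcessError if the command fails,
    for example when no server is running.
    """
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return csc_lines(result.stdout)


# Example use, against a running server:
#     for line in xpra_csc_info():
#         print(line)
```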


totaam commented Feb 7, 2016

2016-02-07 01:35:36: antoine uploaded file libyuv.pc (0.2 KiB)

example pkgconfig file for osx


totaam commented Feb 7, 2016

I have figured out the win32 build: as of r11874 you just do:

cd libyuv-r1446
mkdir out
cd out
cmake .. -G "Visual Studio 9 2008"

Then build the solution with the Visual Studio 9 GUI (or directly with nmake? should work too).
Note: I changed from the "Debug" to the "Release" build before building.
If building for Python 3, we should probably use a newer Visual Studio version... but the "Xpra-Build-Libs" directory structure is not version specific. Maybe it should be.

Speaking of which, you have to place the files where our build system will find them, i.e. for me:

E:\Xpra-Build-Libs\
                   libyuv\
                          bin\
                              convert.exe
                          lib\
                              yuv.lib
                          include\
                                  libyuv.h
                                  libyuv\
                                         *.h

The OSX build is more problematic: first you have to install cmake (that's easy), then it fails with:

libyuv-r1446/source/row_gcc.cc: In function 'void libyuv::ARGBToUVRow_SSSE3(const uint8*, int, uint8*, uint8*, int)':
libyuv-r1446/source/row_gcc.cc:886: error: can't find a register in class 'GENERAL_REGS' while reloading 'asm'
libyuv-r1446/source/row_gcc.cc:886: error: 'asm' operand has impossible constraints

This is because of this gcc bug: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=11203

The workaround:

LDFLAGS="-read_only_relocs suppress" \
CXXFLAGS="-march=i686 -O3 -fno-pic -fomit-frame-pointer -frename-registers -pipe" \
    cmake -DCMAKE_INSTALL_PREFIX:PATH=$PREFIX ..

It's ugly and uses -fno-pic... but at least it builds. (This is not a problem with 64-bit builds.)

Then you need to generate and install the pkgconfig file attached to this ticket since libyuv does not provide a template.
But it still fails because of the text relocs, and adding the flag to the compiler options does not help:

creating build/temp.macosx-10.5-i386-2.7/xpra/codecs/csc_libyuv
/usr/bin/gcc-4.2 -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -arch i386 -I/Developer/SDKs/MacOSX10.5.sdk/usr/include \
    -isysroot /Developer/SDKs/MacOSX10.5.sdk -mmacosx-version-min=10.5 -I/Users/osx/gtk/inst/include -arch i386 \
    -I/Developer/SDKs/MacOSX10.5.sdk/usr/include -isysroot /Developer/SDKs/MacOSX10.5.sdk -I/Users/osx/gtk/inst/include/python2.7 \
    -c xpra/codecs/csc_libyuv/colorspace_converter.cpp -o build/temp.macosx-10.5-i386-2.7/xpra/codecs/csc_libyuv/colorspace_converter.o \
    -Wall -fPIC -read_only_relocs suppress
cc1plus: warning: command line option "-Wstrict-prototypes" is valid for C/ObjC but not for C++
/Developer/usr/bin/g++-4.2 -bundle -undefined dynamic_lookup \
    -L/Users/osx/gtk/inst/lib -L/Users/osx/gtk/inst/lib -arch i386 -L/Developer/SDKs/MacOSX10.5.sdk/usr/lib \
    -isysroot /Developer/SDKs/MacOSX10.5.sdk -mmacosx-version-min=10.5 -Wl,-headerpad_max_install_names \
    -L/Users/osx/gtk/inst/lib -L/Users/osx/gtk/inst/lib -arch i386 -L/Developer/SDKs/MacOSX10.5.sdk/usr/lib \
    -isysroot /Developer/SDKs/MacOSX10.5.sdk -mmacosx-version-min=10.5 -Wl,-headerpad_max_install_names -arch i386 \
    -I/Developer/SDKs/MacOSX10.5.sdk/usr/include -isysroot /Developer/SDKs/MacOSX10.5.sdk -mmacosx-version-min=10.5 \
    -I/Users/osx/gtk/inst/include -arch i386 -I/Developer/SDKs/MacOSX10.5.sdk/usr/include -isysroot /Developer/SDKs/MacOSX10.5.sdk \
    build/temp.macosx-10.5-i386-2.7/xpra/codecs/csc_libyuv/colorspace_converter.o -L/Users/osx/gtk/inst/lib \
    -lyuv -o build/lib.macosx-10.5-i386-2.7/xpra/codecs/csc_libyuv/colorspace_converter.so -Wall -read_only_relocs suppress
ld: absolute addressing (perhaps -mdynamic-no-pic) used in _GetARGBBlend from /Users/osx/gtk/inst/lib/libyuv.a(planar_functions.cc.o) \
   not allowed in slidable image. Use '-read_only_relocs suppress' to enable text relocs

And at this point I give up. We'll enable libyuv when we get 64-bit builds.


totaam commented Mar 13, 2016

Not sure why I am only seeing this now, but I get reliable crashes with a different user over tcp with vp8 and d-feet as client app:

2016-03-13 11:13:46,179 libyuv.ColorspaceConverter.init_context(499, 316, 'BGRX', 499, 316, 'YUV420P', 48)
2016-03-13 11:13:46,180 buffer size=243712, scaling=0, filtermode=None
2016-03-13 11:13:46,183 libyuv.ARGBToI420 took 0.2ms
2016-03-13 11:13:46,188 YUVImageWrapper.free() cython_buffer=0x7fbc883ee220
2016-03-13 11:13:46,188 YUVImageWrapper.free() cython_buffer=0x0
2016-03-13 11:13:49,606 libyuv.ColorspaceConverter.init_context(499, 311, 'BGRX', 499, 311, 'YUV420P', 38)
2016-03-13 11:13:49,606 buffer size=239616, scaling=0, filtermode=None
2016-03-13 11:13:49,609 libyuv.ARGBToI420 took 0.3ms
2016-03-13 11:13:49,616 YUVImageWrapper.free() cython_buffer=0x7fbc887e6050
*** Error in `/bin/python': double free or corruption (!prev): 0x00007fbc887e6050 ***
======= Backtrace: =========
/lib64/libc.so.6(+0x77da5)[0x7fbcc256dda5]
/lib64/libc.so.6(+0x804fa)[0x7fbcc25764fa]
/lib64/libc.so.6(cfree+0x4c)[0x7fbcc2579cac]
/usr/lib64/python2.7/site-packages/xpra/codecs/csc_libyuv/colorspace_converter.so(+0xdbaf)[0x7fbc92983baf]


totaam commented Mar 13, 2016

The crash only occurs with the vpx codec (it happens with both vp8 and vp9)... h264 and mpeg4 are not affected when using the exact same csc (using XPRA_FORCE_CSC_MODE=YUV420P for x264).

But the libyuv code looks fine, and temporarily removing all calls to free() the memory does not help!?
Raising. This may require using valgrind (oh noes).


totaam commented Mar 31, 2016

I cannot reproduce the problem on an Intel system; maybe it was a bad build somehow, or maybe it only crashes on a different setup (the crashes were on an AMD CPU system).
valgrind didn't show anything even remotely suspicious.

@smo: back over to you, can you break it? if not, just record some performance stats.


totaam commented Apr 1, 2016

Hit it again, on Intel this time :(


totaam commented Apr 2, 2016

Turns out it's a trivial rounding error, fixed in r12303. I never hit it because the tests used even dimensions: the bug only occurred with an odd input height.
(not sure why this fired more with vpx than x264! The memory corruption would affect any buffer allocated after this one)
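
This kind of rounding error is easy to reproduce in isolation. In YUV420P the two chroma planes are subsampled by two in each direction, so an odd height needs (height + 1) // 2 chroma rows; truncating to height // 2 leaves each chroma plane one row short. The sketch below only illustrates that arithmetic and is not the code changed in r12303: the function names, the 16-byte stride alignment and the `ceil_chroma` switch are assumptions made for the example. The two frame sizes are the ones from the crash log above.

```python
def roundup(n, m):
    """Round n up to the next multiple of m."""
    return (n + m - 1) // m * m


def yuv420p_buffer_size(width, height, ceil_chroma=True, align=16):
    """Bytes needed for the Y, U and V planes of a YUV420P frame.

    ceil_chroma=False mimics the truncating division that under-allocates
    the chroma planes when the height is odd.
    """
    y_stride = roundup(width, align)
    uv_stride = roundup((width + 1) // 2, align)
    uv_rows = (height + 1) // 2 if ceil_chroma else height // 2
    return y_stride * height + 2 * uv_stride * uv_rows


for height in (316, 311):
    needed = yuv420p_buffer_size(499, height)
    truncated = yuv420p_buffer_size(499, height, ceil_chroma=False)
    print(f"499x{height}: needed={needed} truncated={truncated} "
          f"short by {needed - truncated} bytes")
```

With an even height the two results agree, which is consistent with tests using even dimensions never triggering the problem; with an odd height the truncated size is one chroma row per plane too small, so writing the last chroma row runs past the end of the buffer.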


totaam commented Apr 7, 2016

2016-04-07 21:15:34: smo commented


I ran some of these tests on my machine to confirm that libyuv is much faster. Here are my results:

Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz
BGRX to YUV420P

CSC      16x16  128x128  256x256  512x512  1920x1080  2560x1600
cython      11       60       64       64         61         62
swscale     10      126      165      182        188        199
libyuv      16      427     1039     1432       1743       1444

I will do more testing as mentioned in comment:2 to make sure there are no crashes, and update this ticket if there are any.

It is quite obvious that libyuv is much faster than the alternatives.


totaam commented Apr 14, 2016

2016-04-14 19:14:09: smo commented


I haven't run across any issues with this on the server end on Linux, but I haven't got to compiling and bundling libyuv on win32 yet. I will get to this shortly and update the ticket if I find any issues.


totaam commented Apr 21, 2016

2016-04-21 22:29:06: smo commented


No issues building this on win32 following your instructions.

The command line way of building this is:

msbuild Project.sln /p:Configuration=Release

totaam closed this as completed Apr 21, 2016

totaam commented Oct 11, 2018

See also #1280, #1883, #2004
