General optimization idea's and findings #14

Open
NickHu opened this issue Apr 6, 2020 · 29 comments
Labels
discussion Brainstorming

Comments

@NickHu

NickHu commented Apr 6, 2020

Given that ffmpeg is in entware, has anyone tried to use a real video codec to grab from /dev/fb0 instead of using lz4 on the raw bytes? I think this should in principle implement @rien's bsdiff idea (changes between frames are small, so this will reduce IO throttle) that I saw on the reddit post. I was able to get a stream to show by doing

ssh root@192.168.0.25 -- /opt/bin/ffmpeg -f fbdev -framerate 1 -i /dev/fb0 -c:v libx264 -preset ultrafast -pix_fmt yuv420p -f rawvideo - | ffplay -i -

but it lags heavily. It encodes at just over 1 frame per second, so there's clearly a long way to go. It also definitely seems like ffplay is waiting for a buffer to accumulate before playing anything. I'm really curious whether a more ingenious choice of codecs/ffmpeg flags would help.

Here's some sample output of ffmpeg if I set the loglevel of ffplay to quiet:

ffmpeg version 3.4.7 Copyright (c) 2000-2019 the FFmpeg developers
  built with gcc 8.3.0 (OpenWrt GCC 8.3.0 r1148-dbf3d603)
  configuration: --enable-cross-compile --cross-prefix=arm-openwrt-linux-gnueabi- --arch=arm --cpu=cortex-a9 --target-os=linux --prefix=/opt --pkg-config=pkg-config --enable-shared --enable-static --enable-pthreads --enable-zlib --disable-doc --disable-debug --disable-lzma --disable-vaapi --disable-vdpau --disable-outdevs --disable-altivec --disable-vsx --disable-power8 --disable-armv5te --disable-armv6 --disable-armv6t2 --disable-inline-asm --disable-mipsdsp --disable-mipsdspr2 --disable-mipsfpu --disable-msa --disable-mmi --disable-fast-unaligned --disable-runtime-cpudetect --enable-lto --disable-vfp --disable-neon --enable-avresample --enable-libopus --enable-small --enable-libshine --enable-gpl --enable-libx264
  libavutil      55. 78.100 / 55. 78.100
  libavcodec     57.107.100 / 57.107.100
  libavformat    57. 83.100 / 57. 83.100
  libavdevice    57. 10.100 / 57. 10.100
  libavfilter     6.107.100 /  6.107.100
  libavresample   3.  7.  0 /  3.  7.  0
  libswscale      4.  8.100 /  4.  8.100
  libswresample   2.  9.100 /  2.  9.100
  libpostproc    54.  7.100 / 54.  7.100
[fbdev @ 0xb80400] w:1404 h:1872 bpp:16 pixfmt:rgb565le fps:1/1 bit_rate:42052608
[fbdev @ 0xb80400] Stream #0: not enough frames to estimate rate; consider increasing probesize
Input #0, fbdev, from '/dev/fb0':
  Duration: N/A, start: 1586221544.651449, bitrate: 42052 kb/s
    Stream #0:0: Video: rawvideo (RGB[16] / 0x10424752), rgb565le, 1404x1872, 42052 kb/s, 1 fps, 1000k tbr, 1000k tbn, 1000k tbc
Stream mapping:
  Stream #0:0 -> #0:0 (rawvideo (native) -> h264 (libx264))
Press [q] to stop, [?] for help
[libx264 @ 0xb84590] using cpu capabilities: none!
[libx264 @ 0xb84590] profile Constrained Baseline, level 5.0, 4:2:0, 8-bit
Output #0, rawvideo, to 'pipe:':
  Metadata:
    encoder         : Lavf57.83.100
    Stream #0:0: Video: h264 (libx264), yuv420p, 1404x1872, q=-1--1, 1 fps, 1 tbn, 1 tbc
    Metadata:
      encoder         : Lavc57.107.100 libx264
    Side data:
      cpb: bitrate max/min/avg: 0/0/0 buffer size: 0 vbv_delay: -1
frame=   10 fps=1.0 q=2.0 size=     216kB time=00:00:09.00 bitrate= 196.8kbits/s speed=0.919x

I think one of the big slowdowns here is it's taking the input stream as [fbdev @ 0xb80400] w:1404 h:1872 bpp:16 pixfmt:rgb565le fps:1/1 bit_rate:42052608, rather than the 2 bytes-per-pixel gray16le stream that reStream is using - I can't seem to configure this though.

Also, is lz4 really faster than zstd?

@levincoolxyz

levincoolxyz commented Apr 7, 2020

I just (enjoyably) wasted an evening testing different options. My findings (all thru usb):

  1. tried zstd from entware with different --fast=# parameters (ranging from 1 to 200); lz4 still seems faster (and it takes only a single-core process at ~30% cpu on the rM, compared to 15-60% on two cores for zstd). [But maybe zstd can improve overall speed on a bad internet connection, since it can give a higher compression ratio?]
  2. using ffmpeg to simply cut frame size in half (704x936) (I/O both -f rawvideo) before compression with lz4 still significantly slowed the pipeline down.
  3. lossy image compression (per frame) without invoking ffmpeg also seems to do a worse job compared to lz4 (both in compression ratio and cpu time). [played with gm convert from entware; for reference, gm convert to .jpg compresses to about 5% of the raw frame size, while lz4 gives a bit over 6% (zstd with default options gives a bit below 4%), and gm convert is much, much slower.]

So my conclusion is that unless a custom video encoding with streaming in mind is written in a single compiled program, there's probably not much more chance for improvement speed-wise.

---edit---
forgot to mention that without any knowledge in these areas I dared to naively use bsdiff to get a delta encoding... now that I know what bsdiff is used for I realized how dumb that idea was. probably would have fared better compiling a simple for loop in c for this usage.

@rien
Owner

rien commented Apr 7, 2020

Hi guys! Great you've been experimenting with different encodings. It definitely seems possible to achieve near zero lag with the right tools.

In writing this script I did some of the same experimentation. But the reMarkable processor is just too weak to handle something as heavy as video encoding.

The key to making this as fast as possible is getting the framebuffer data out of the reMarkable as quickly as possible. I've tried using no compression, but then the kernel (read: TCP or USB IO) seems to be the bottleneck.

Writing something in C (or Rust 🎉) will probably be the long-term solution. But in the meanwhile I think experimenting with bsdiff could give some nice results.

@levincoolxyz what exactly did you try with bsdiff? What I had in mind was keeping a temporary framebuffer to store the previous frame, using bsdiff to compute the difference between that frame and the current one, sending the diff, and then reconstructing the image at the receiving end in the same manner.

@levincoolxyz

levincoolxyz commented Apr 7, 2020

that is what I attempted to code with bsdiff. but I quit before fully debugging as I saw the cpu usage shoot up like crazy on reMarkable. from the man page etc., i think it is written to generate a small patch for very large files (e.g. software updates) without real time applications in mind.

For reference, this took 1.6s on my laptop (doing virtually nothing): time ( bsdiff fb_old fb_old fb_patch )

@NickHu
Author

NickHu commented Apr 7, 2020

I have a couple of observations to make. I think lots of the potential of h264 lies in the colourspace, and also in having specialised instructions to get hardware acceleration. Seeing as the video stream coming out of the reMarkable should be entirely gray-scale, this suggests that it's not the right codec to use. I don't know if any codecs which are better suited to gray-scale exist, but to me it's really surprising that a general purpose compressor (lz4) is so much faster than a specialised video codec. It might be worth trying to write a naïve video codec, but I don't really know anything about graphics.

Secondly, according to https://github.com/thoughtpolice/minibsdiff, the bsdiff binary is incorporating gzip. My guess is that's where it's spending most of its time, and replacing that with, say, lz4 ought to give you something faster than what we have right now. I still feel like the 'morally correct' solution is some sort of video compressor though.

@rien
Owner

rien commented Apr 7, 2020

What a normal video encoder does is throw away information (e.g. colours, image complexity, ...) in order to create a smaller video.

The reason why a general purpose compressor works is because there is a lot of repeated information (a bunch of white FF bytes in the background and some darker bytes which will be more to the darker end).

There are probably codecs which support gray-scale images, but I doubt they will be effective because of the performance constraints we have. Our 'codec' should be as simple as possible.

Maybe bsdiff has an option to not compress its result? If you want to do more research into what is slowing bsdiff down, you could try to use a profiler like perf.

As an alternative to bsdiff we could xor two images byte-by-byte. The xor-ed image will contain only the changes and thus be even more compressible than it is now.
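A minimal Python sketch of the XOR idea, for illustration only. It assumes the 1408x1872 framebuffer at 2 bytes per pixel reported in this thread, builds two synthetic frames (the "ink" regions are made up), and uses zlib from the standard library as a stand-in for lz4, so the sizes it prints are not comparable to real lz4 numbers.

```python
import random
import zlib

# Framebuffer geometry as reported in this thread: 1408x1872, 2 bytes per pixel.
WIDTH, HEIGHT, BYTES_PER_PIXEL = 1408, 1872, 2
FRAME_SIZE = WIDTH * HEIGHT * BYTES_PER_PIXEL  # 5271552 bytes


def xor_frames(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length buffers (done via big integers)."""
    if len(a) != len(b):
        raise ValueError("frames must have the same length")
    x = int.from_bytes(a, "little") ^ int.from_bytes(b, "little")
    return x.to_bytes(len(a), "little")


def make_demo_frames():
    """Build two synthetic frames: white page, old noisy ink, one new stroke."""
    rng = random.Random(0)
    previous = bytearray(b"\xff" * FRAME_SIZE)
    noise = bytes(rng.getrandbits(8) for _ in range(20000))
    previous[100000:100000 + len(noise)] = noise  # hypothetical existing drawing
    current = bytearray(previous)
    current[3000000:3000400] = bytes(400)         # hypothetical new black stroke
    return bytes(previous), bytes(current)


previous, current = make_demo_frames()
delta = xor_frames(previous, current)    # zero wherever nothing changed
restored = xor_frames(previous, delta)   # the receiver rebuilds the frame

full_size = len(zlib.compress(current, 1))
delta_size = len(zlib.compress(delta, 1))
print(f"full frame: {full_size} bytes, XOR delta: {delta_size} bytes")
```

Because XOR is its own inverse, the receiver only needs the previous frame and the delta to reconstruct the current frame, and both buffers always have the same size.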

@NickHu
Author

NickHu commented Apr 7, 2020

If you link against https://github.com/thoughtpolice/minibsdiff, you can change the compressor that bsdiff uses.

XOR is an interesting idea too - I would guess it also has less complexity, seeing as the byte buffer is the same size all the time - bsdiff probably has some overhead as it has to account for file size increases/decreases
(accidentally hit close, whoops)

@NickHu NickHu closed this as completed Apr 7, 2020
@NickHu NickHu reopened this Apr 7, 2020
@rien
Owner

rien commented Apr 7, 2020

I currently don't have the time to dive deeper into this, so feel free to experiment with it! I'm open to PR's.

@levincoolxyz

levincoolxyz commented Apr 8, 2020

I tested the xor-then-compress-with-lz4 idea. To my surprise, even with my rusty c programming skills I did not actually slow the pipeline down, which should mean that someone better at coding this low-level stuff should be able to bring down streaming latency (especially by choosing a better buffer size and merging the xor operation into the lz4 binary).
Moreover, if the xor buffer is kept for a longer period of time instead of being replaced at every read, I think the performance will increase because fewer fread/fwrite calls are needed (this part may be possible with some shell script also?). But then we are really approaching the point of writing an efficient embedded device video encoder from scratch ...

(I just put some code I used for testing in my forked project (https://github.com/levincoolxyz/reStream) for reference. If I get more time I might fiddle with it more...)

@rien
Owner

rien commented Apr 8, 2020

That looks great. It will indeed be faster to make your xorstream really streaming by constantly reading stdin and writing the XOR-ed output to stdout.

@levincoolxyz

It is already doing that, currently the file read and write are for (updating) the reference file. Maybe I should rewrite it to keep the reference in memory since the supplied stdin is already continuously giving out the new buffer... will try that next.
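A rough Python sketch of what "keep the reference in memory" could look like. This is not the xorstream code from the fork; the function names `xor_encode` and `xor_decode` are hypothetical, and the streams are assumed to be buffered binary file objects (such as `sys.stdin.buffer`) whose `read(n)` blocks until `n` bytes or end of stream.

```python
import io


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length buffers."""
    x = int.from_bytes(a, "little") ^ int.from_bytes(b, "little")
    return x.to_bytes(len(a), "little")


def xor_encode(src, dst, frame_size: int) -> int:
    """Read frames from src and write each one XOR-ed with the previous frame."""
    reference = bytes(frame_size)  # all zeros, so the first delta is the frame itself
    frames = 0
    while True:
        frame = src.read(frame_size)
        if len(frame) < frame_size:
            break  # end of stream (a truncated trailing frame is dropped)
        dst.write(xor_bytes(frame, reference))
        reference = frame  # the reference never touches the filesystem
        frames += 1
    return frames


def xor_decode(src, dst, frame_size: int) -> int:
    """Inverse of xor_encode: rebuild each frame from the delta and the last frame."""
    reference = bytes(frame_size)
    frames = 0
    while True:
        delta = src.read(frame_size)
        if len(delta) < frame_size:
            break
        frame = xor_bytes(delta, reference)
        dst.write(frame)
        reference = frame
        frames += 1
    return frames


# Tiny in-memory demonstration with 8-byte "frames".
frames_in = b"\xff" * 8 + b"\xff" * 6 + b"\x00\x00" + b"\xff" * 6 + b"\x00\x00"
encoded = io.BytesIO()
xor_encode(io.BytesIO(frames_in), encoded, 8)
decoded = io.BytesIO()
xor_decode(io.BytesIO(encoded.getvalue()), decoded, 8)
```

In a pipeline the encoder would sit between the framebuffer reader and the compressor, and the decoder between the decompressor and the player.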

@jwittke

jwittke commented Apr 8, 2020

Maybe the JBIG1 data compression standard is something to look at...

https://www.cl.cam.ac.uk/~mgk25/jbigkit/

@NickHu
Author

NickHu commented Apr 8, 2020

@levincoolxyz I coded up a similar experiment, using a double buffer for input, xor'ing it together, and then compressing the result with lz4 - unfortunately, it doesn't seem any faster. I think there might be too much memory latency in doing two memcpys, but I didn't profile it.
https://gist.github.com/NickHu/8eb7ead78a5489d6a95ad5c7473994f5

I also tried to code up a minimal example using lz4's streaming compression, in the hopes that this would be faster (as it uses a dictionary-based approach), but again it was slightly slower.
https://gist.github.com/NickHu/95e8e5e1b8b326d2cb46ce461d3ec701

I'm not sure how the bash script outperformed me on this one, but I guess the naive C implementation I did is no good!

@levincoolxyz

levincoolxyz commented Apr 8, 2020

I just fixed a decoding bug and moved my fread out of the while loop. Now it (with xorstream) does perform better than the master branch over my wifi connection, and is on par through usb. I also found that since the data transmission would be slightly laggy anyway, it sometimes improves performance (esp. via wifi) to add something like `sleep 0.03` to the read loop. This obviously reduces load on the reMarkable as well.

@NickHu
Author

NickHu commented Apr 8, 2020

Another thing I wanted to mention is that the CPU inside the remarkable is the Freescale i.MX 6 SoloLite, which has NEON, so in principle the XOR loop can utilise SIMD which may be faster. I'm not sure where the bottleneck is at this point.

@NickHu
Author

NickHu commented Apr 8, 2020

> @levincoolxyz I coded up a similar experiment, using a double buffer for input, xor'ing it together, and then compressing the result with lz4 - unfortunately, it doesn't seem any faster. I think there might be too much memory latency in doing two memcpys, but I didn't profile it.
> https://gist.github.com/NickHu/8eb7ead78a5489d6a95ad5c7473994f5

Actually I had forgotten to turn on compiler optimisations! If you compile with -O3 then basically the whole program runs in no time (except for compressing) (this time I profiled with gprof; turns out the memcpy is free and the xor gets pretty optimised). It doesn't seem to make much difference to do compression inside or outside the C program. However, for me it isn't any faster than reStream.sh (at least over wifi). You can get a slight improvement by using nc instead of ssh too (makes sense, as ssh is doing encryption too).

@fmagin

fmagin commented Apr 19, 2020

I am treating this issue as the general place for optimization and compression discussions:

Has anyone played with some lz4 options?

Currently the script uses none, but with some tweaking it might be possible to find some easy improvements. I am not sure how to properly benchmark this though, especially the latency.

I think one approach would be to see if --fast with various levels is a relevant improvement, and if a combination of setting the block size with -B# to the size of one frame and then using -BD to allow blocks to depend on their predecessors (i.e. the previous frame) is a noteworthy improvement.

@rien
Owner

rien commented Apr 19, 2020

@fmagin I haven't played with any options yet, so please go ahead and report your findings!

We need a way to objectively evaluate optimizations. I've been using pv to look at the decompressed data throughput (higher = more frames and thus a more fluent stream). I have added the -t --throughput option which does just that.

@rien rien changed the title from "Performance vs a real video codec" to "General optimization idea's and findings" Apr 19, 2020
@rien rien added the enhancement (New feature or request) and discussion (Brainstorming) labels and removed the enhancement (New feature or request) label Apr 19, 2020
@fmagin

fmagin commented Apr 19, 2020

Some raw numbers for reference:

Synthetic Benchmark

--fast is a straight up 30% throughput improvement for a barely noticeable decrease in compression on a synthetic benchmark on the remarkable with the binary from this repo (does this version use SIMD? might be worthwhile to compile it with optimizations for this specific CPU):

remarkable: ~/ ./lz4 -b
 1#Synthetic 50%     :  10000000 ->   5960950 (1.678),  47.1 MB/s , 241.2 MB/s 
remarkable: ~/ ./lz4 -b --fast
-1#Synthetic 50%     :  10000000 ->   6092262 (1.641),  61.5 MB/s , 260.7 MB/s 

Benchmarking with the actual framebuffer:

This heavily depends on the content of the framebuffer, for testing I am using an empty "Grid Medium" sheet, with the toolbox open.

remarkable: ~/ dd if=/dev/fb0 count=1 bs=5271552 of=fb.bench
1+0 records in
1+0 records out
remarkable: ~/ ls -lh fb.bench
-rw-r--r--    1 root     root        5.0M Apr 19 14:54 fb.bench

Compress to PNG

~/P/remarkable $ convert -depth 16 -size 1408x1872+0 gray:fb.bench fb.png
~/P/remarkable $ ls -lh fb.png
-rw-r--r-- 1 fmagin fmagin 20K Apr 19 17:22 fb.png

[Image: fb]

So PNG compression gets this down to 20 KB, which we can assume is close to the best achievable result in this case.

LZ4

Baseline

remarkable: ~/ ./lz4 -b fb.bench
 1#fb.bench          :   5271552 ->     36423 (144.731), 535.7 MB/s , 732.3 MB/s

--fast

remarkable: ~/ ./lz4 --fast -b fb.bench
-1#fb.bench          :   5271552 ->     36551 (144.225), 539.8 MB/s , 733.3 MB/s

Extras:

Pixel Frequency:
$ od -t x1 -w2 -v -An fb.bench | sort | uniq -c | sort -n
    1  16 b6
    1  53 9d
    1  73 9d
    1  92 94
    1  a9 4a
    1  b5 ad
    1  d3 9c
    1  f2 94
    2  04 21
    2  14 a5
    2  55 ad
    2  5b df
    2  86 31
    2  b9 ce
    3  0b 5b
    3  37 be
    3  b1 8c
    3  fa d6
    4  07 3a
    4  8a 52
    4  9c e7
    4  ad 6b
    4  c7 39
    4  d6 b5
    5  08 42
    5  48 42
    5  eb 5a
    6  0c 63
    6  7c e7
    7  49 4a
    7  8e 73
    7  b2 94
    8  ca 52
    9  34 a5
    9  4d 6b
   10  70 84
   10  cb 5a
   11  45 29
   11  8d 6b
   11  f6 b5
   12  24 21
   12  3b df
   12  f3 9c
   13  2f 7c
   13  4c 63
   13  c3 18
   13  c6 31
   14  71 8c
   14  85 29
   15  28 42
   15  2c 63
   16  03 19
   16  74 a5
   16  b6 b5
   18  41 08
   18  a6 31
   19  13 9d
   19  bd ef
   20  33 9d
   20  38 c6
   21  65 29
   21  95 ad
   22  69 4a
   22  99 ce
   23  79 ce
   23  aa 52
   23  f7 bd
   24  30 84
   24  da d6
   25  6d 6b
   26  82 10
   26  ce 73
   27  44 21
   27  91 8c
   27  ae 73
   28  ba d6
   28  d2 94
   28  e3 18
   29  89 4a
   30  17 be
   34  c2 10
   35  58 c6
   37  40 00
   37  75 ad
   39  61 08
   39  81 08
   44  5c e7
   44  fb de
   46  50 84
   46  cf 7b
   48  3c e7
   53  9d ef
   57  1b df
   57  7d ef
   66  a2 10
   81  54 a5
   87  0f 7c
  109  be f7
  182  10 84
  279  20 00
  307  de f7
  417  e7 39
25499  00 00
63030  ef 7b
2544108  ff ff
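The same pixel-frequency count can be sketched in Python. This is only an equivalent of the `od | sort | uniq -c | sort -n` pipeline above; the function names are hypothetical, and pixels are keyed by their two bytes in file order so the output lines up with the `od` listing.

```python
from collections import Counter


def pixel_histogram(raw: bytes, bytes_per_pixel: int = 2) -> Counter:
    """Count how often each pixel value (as raw bytes in file order) occurs."""
    usable = len(raw) - len(raw) % bytes_per_pixel  # ignore a trailing partial pixel
    return Counter(
        raw[i:i + bytes_per_pixel] for i in range(0, usable, bytes_per_pixel)
    )


def print_histogram(hist: Counter) -> None:
    """Print counts in ascending order, like `sort -n` on the uniq -c output."""
    for pixel, count in sorted(hist.items(), key=lambda item: item[1]):
        print(f"{count:8d}  {pixel.hex(' ')}")


# Tiny demonstration buffer: three white pixels and one black pixel.
demo = pixel_histogram(b"\xff\xff" * 3 + b"\x00\x00")
print_histogram(demo)
```

On the device dump one would call `pixel_histogram(open("fb.bench", "rb").read())` instead of using the demonstration buffer.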

@rien
Owner

rien commented Apr 19, 2020

Thanks for the very detailed report of your findings! Have you had the chance to experiment with the block size and dependency? Maybe that will be able to reduce the latency even more because it then knows when to 'forward' the next byte.

I think we can conclude (as you did, but removed) that lz4 is doing a pretty decent job compressing the data while using as few precious CPU cycles as possible.

Compiling lz4 with SIMD enabled is indeed something worthwhile to look at.

@fmagin

fmagin commented Apr 19, 2020

As seen above lz4 already has a great throughput and compression ratio. The more I think about it, the more it seems that we don't care about throughput but about latency.

Frame Latency

Theoretically this is just size / throughput which is ~10ms at 5MB framebuffer and 500MB/s throughput.
Assuming we target 24hz[0] as the framerate, then we have ~40ms to process one frame. The framebuffer is ~5MB so we just need 120MB/s throughput:

> 1/24hz
41.66666 millisecond (time)
> 5MB/(1/24hz) to MB/s
120 megabyte / second

lz4 is far above that, at least for the above image, so we might actually want to focus on decreasing CPU usage here instead. Sleeping in the loop probably solves this.

[0] I don't actually know what a reasonable framerate to target is, the reMarkable can't render objects in motion smoothly anyway
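The arithmetic above, written out as a small Python sketch. The input numbers (frame size, lz4 throughput, 24 Hz target) are the ones quoted in this thread; the function names are hypothetical.

```python
FRAME_BYTES = 1408 * 1872 * 2  # 5271552 bytes, the size of the dd capture


def frame_latency_ms(frame_bytes: int, throughput_mb_s: float) -> float:
    """Time to push one frame through a stage with the given throughput."""
    return frame_bytes / (throughput_mb_s * 1e6) * 1e3


def required_throughput_mb_s(frame_bytes: int, fps: float) -> float:
    """Throughput a stage must sustain to keep up with `fps` frames per second."""
    return frame_bytes * fps / 1e6


# lz4 measured at 535.7 MB/s on the sample frame: about 10 ms per frame.
print(f"{frame_latency_ms(FRAME_BYTES, 535.7):.2f} ms per frame")
# The rounded 5 MB frame at 24 Hz gives the 120 MB/s quoted above.
print(f"{required_throughput_mb_s(5_000_000, 24):.1f} MB/s")
# With the exact frame size the requirement is slightly higher.
print(f"{required_throughput_mb_s(FRAME_BYTES, 24):.1f} MB/s")
```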

The Bash Loop

On the topic of loops:

remarkable: ~/ time for i in {0..1000}; do dd if=/dev/null of=/dev/null count=1 bs=1 2>/dev/null; done

real	0m4.899s
user	0m0.150s
sys	0m0.690s

Simply calling dd with arguments so it basically does nothing already has 5ms latency, which is half of the lz4 compression latency per frame on the above benchmark. Maybe dd has some continuous mode that doesn't require a command invocation each time we want to read the next block, i.e. the framebuffer again? Or maybe there is some other linux utility that is better suited for this.
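One way to avoid a process launch per frame is to open the device once and re-read it from offset 0 inside a single process. The Python sketch below only illustrates that structure; the names `read_exact` and `read_frames` are hypothetical, it assumes the framebuffer device supports positioned reads (`pread`), and nothing here has been measured on the reMarkable. A C or Rust program could use the same loop.

```python
import os
import time


def read_exact(fd: int, size: int, offset: int = 0) -> bytes:
    """Read `size` bytes starting at `offset`, retrying on short reads."""
    chunks = []
    while size > 0:
        chunk = os.pread(fd, size, offset)
        if not chunk:
            break  # end of file reached before `size` bytes were available
        chunks.append(chunk)
        offset += len(chunk)
        size -= len(chunk)
    return b"".join(chunks)


def read_frames(path: str, frame_size: int, count: int, delay: float = 0.0):
    """Yield `count` snapshots of the first `frame_size` bytes of `path`.

    The file is opened once; no new process is spawned per frame.
    `delay` optionally sleeps between frames to reduce CPU load.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        for _ in range(count):
            yield read_exact(fd, frame_size)
            if delay:
                time.sleep(delay)
    finally:
        os.close(fd)
```

On the device this would be called as `read_frames("/dev/fb0", 5271552, n)`; the optional `delay` corresponds to the `sleep` in the read loop mentioned earlier in the thread.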

Network

USB

remarkable: ~/ iperf -c 10.11.99.2
------------------------------------------------------------
Client connecting to 10.11.99.2, TCP port 5001
TCP window size: 43.8 KByte (default)
------------------------------------------------------------
[  3] local 10.11.99.1 port 39860 connected with 10.11.99.2 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec   330 MBytes   277 Mbits/sec

Wifi

remarkable: ~/ iperf -c 192.168.0.45
------------------------------------------------------------
Client connecting to 192.168.0.45, TCP port 5001
TCP window size: 70.0 KByte (default)
------------------------------------------------------------
[  3] local 192.168.0.164 port 44018 connected with 192.168.0.45 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  74.4 MBytes  62.3 Mbits/sec

This obviously depends on the wifi network the device is in. An interesting thing to note is that the ping from my host to the reMarkable is an absolutely atrocious 150ms on average, while pinging the other way around is ~5ms. No idea what is going on here.

Conclusion

I am actually not even that sure what metric I want to optimize, for any real use this already works really well and while the latency is definitely noticeable, I don't see much of a reason to care about it if it is in the range of less than a second. I am drawing on the device anyway, so I don't need low latency feedback, and for video conferencing I can't think of a reason why it would be problematic either.

I will probably be using this over the next few weeks for participating in online lectures/study sessions for uni, so maybe I will run into some bottlenecks.

Future Work

I think for proper benchmarking we would need some way to measure the average latency per frame, from the very first read of the framebuffer until the data reaches the host computer. The host will most likely have so much more CPU speed, a GPU, etc. that anything beyond that point shouldn't make a difference anymore.

Unsorted Ideas

  • get an ffmpeg version with NEON support on the remarkable and try various codecs. The one in entware has NEON disabled which probably makes it useless
  • don't use SSH (though having encryption by default is great)
  • anything that actually uses UDP or even RTP. Probably requires ffmpeg

@fmagin

fmagin commented Apr 19, 2020

As everyone knows, mpv is the videoplayer to use if you want to ~~waste~~ invest your Sunday afternoon optimizing video software to entirely pointless levels as close to the theoretical limit as possible. https://mpv.io/manual/stable/#low-latency-playback and mpv-player/mpv#4213 discuss various low latency options.

So instead of piping into ffplay one can pipe into

mpv - --demuxer=rawvideo --demuxer-rawvideo-w=1408 --demuxer-rawvideo-h=1872 --demuxer-rawvideo-mp-format=rgb565 --profile=low-latency

with possibly some added options like

  • --no-cache
  • --untimed

there is probably some way to benchmark the latency between a frame going into mpv and rendering, and comparing it to ffplay, at least the discussion in the issue sounds like people are measuring it somehow. This latency is probably also entirely irrelevant compared to other latency anyway.

@rien
Owner

rien commented Apr 19, 2020

I really appreciate the time and effort you've put into this. If mpv is noticeably faster, we could use mpv by default and gracefully degrade to ffplay. But as you mentioned, this is probably not the case?

Related to your other findings:

  • I'm aware of the slow bash-loop, but I haven't found a good alternative.
  • I don't think the network speed is the problem here. I have tried using UDP once, but a better compression method seemed far more important than network throughput then. But by using netcat (with TCP) instead of ssh we could probably save a few milliseconds of encryption overhead.
  • Again, optimizing the block throughput (by playing with lz4 -B) sounds promising in order to reduce latency.

@fmagin

fmagin commented Apr 19, 2020

lz4 doesn't seem to accept a blocksize above 4MB

remarkable: ~/ ./lz4 -B5271552   
using blocks of size 4096 KB 
refusing to read from a console

I tried piping into | lz4 -d | tee >(mpv - --profile=low-latency --demuxer=rawvideo --demuxer-rawvideo-w=1408 --demuxer-rawvideo-h=1872 --demuxer-rawvideo-mp-format=rgb565 --no-cache --untimed --framedrop=no) earlier, output looked identical. Maybe something could be optimized there, but maybe there isn't much that can be done when piping in raw data anyway. Might be interesting if any real codec is ever used.

There is definitely some noticeable delay but I don't really know where it would come from. Every part of the pipeline looks fairly good so far. I have unsettling visions of a future where we find out that the actual framebuffer device has latency because of their partial refresh magic

@fmagin

fmagin commented Apr 19, 2020

SSH Benchmarking:

remarkable: ~/ openssl speed -evp aes-128-ctr
Doing aes-128-ctr for 3s on 16 size blocks: 5136879 aes-128-ctr's in 3.00s
Doing aes-128-ctr for 3s on 64 size blocks: 1630590 aes-128-ctr's in 2.99s
Doing aes-128-ctr for 3s on 256 size blocks: 455894 aes-128-ctr's in 2.98s
Doing aes-128-ctr for 3s on 1024 size blocks: 130422 aes-128-ctr's in 2.98s
Doing aes-128-ctr for 3s on 8192 size blocks: 17083 aes-128-ctr's in 2.99s
OpenSSL 1.0.2o  27 Mar 2018
built on: reproducible build, date unspecified
options:bn(64,32) rc4(ptr,char) des(idx,cisc,16,long) aes(partial) idea(int) blowfish(ptr) 
compiler: arm-oe-linux-gnueabi-gcc  -march=armv7-a -mfpu=neon -mfloat-abi=hard -mcpu=cortex-a9  -DL_ENDIAN 	-DTERMIO  -O2 -pipe -g -feliminate-unused-debug-types  -Wall -Wa,--noexecstack -DHAVE_CRYPTODEV -DUSE_CRYPTODEV_DIGESTS
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-128-ctr      27396.69k    34902.26k    39164.05k    44816.15k    46803.99k

Did you ever try OpenSSH compression?
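To put the benchmark in per-frame terms, here is a back-of-the-envelope Python sketch. It assumes SSH actually uses aes-128-ctr at the 8192-byte block rate measured above and ignores MAC and protocol overhead, so treat the results as rough estimates rather than measurements.

```python
# 8192-byte block result from the openssl run above ("1000s of bytes per second").
AES_CTR_BYTES_PER_S = 46803.99 * 1000

RAW_FRAME_BYTES = 5271552  # uncompressed framebuffer capture
LZ4_FRAME_BYTES = 36423    # the same frame after lz4, from the earlier benchmark


def encrypt_ms(num_bytes: int) -> float:
    """Estimated time to encrypt `num_bytes` at the measured AES-CTR rate."""
    return num_bytes / AES_CTR_BYTES_PER_S * 1000


print(f"raw frame:        {encrypt_ms(RAW_FRAME_BYTES):.1f} ms")
print(f"lz4-compressed:   {encrypt_ms(LZ4_FRAME_BYTES):.2f} ms")
```

Under these assumptions, encrypting an uncompressed frame would cost over 100 ms, while the lz4-compressed sample frame costs well under a millisecond.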

@davidaspden

Heads up.

So I've installed a vnc server and, looking at the connection, it uses ZRLE compression, which hints that we are on the right tack.

Also when uncompressed at the other end is there any need to change pixel format, is it not just advisable to play the gray16le which is known to ffmpeg? Are we not introducing a step of transcoding otherwise that would require buffering?

Just thoughts.

@pl-semiotics

You might be interested in some hacking I've done recently on streaming the reMarkable screen. I've come to the conclusion that VNC/RFB is a very nice protocol for this, since it has good support for sending updates only when the screen changes, and standard encoding methods like ZRLE are quite good for the use-cases we have (where most of the screen is a single color).

The only difficulty in a VNC based solution is that, even though we have very precise damage tracking information (since the epdc needs it to refresh the changed regions of the display efficiently), that information isn't exposed to userspace. I've just published a few days' hacking on this to https://github.com/peter-sa/mxc_epdc_fb_damage and https://github.com/peter-sa/rM-vnc-server. I don't have a nice installation/runner script yet, but with those projects I can get a VNC server serving up the framebuffer in gray16le with very nice bandwidth/latency---I've only seen noticeable stuttering when drawing when using SSH tunneling over a WiFi connection; without encryption or on the USB networking, the performance is quite usable. I don't have quantitative performance observations, or comparisons with reStream's fixed-framerate approach, but I expect resource usage should be lower, since the VNC server only sends actually changed pixels.

RFB is of course a bit more of a pain to get into other applications than "anything ffmpeg can output", but I've managed to get the frames into a GStreamer pipeline via https://github.com/peter-sa/gst-libvncclient-rfbsrc, and been using gstreamer sinks to shove them into other applications.
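To illustrate why sending only changed regions helps, here is a small Python sketch that finds the band of rows that differ between two frames. Note that this is a different technique from the one described above: it compares whole frames in userspace, whereas the projects linked above obtain damage information from the kernel driver. The function name `changed_rows` is hypothetical.

```python
def changed_rows(previous: bytes, current: bytes, stride: int):
    """Return (first_row, last_row_exclusive) of the differing band, or None.

    `stride` is the number of bytes per row (width * bytes per pixel).
    """
    if len(previous) != len(current):
        raise ValueError("frames must have the same length")
    first = last = None
    for row in range(len(current) // stride):
        start = row * stride
        if previous[start:start + stride] != current[start:start + stride]:
            if first is None:
                first = row
            last = row
    return None if first is None else (first, last + 1)


# Demonstration on a 5-row frame with 4 bytes per row: rows 2 and 3 change.
old = bytes(20)
new = bytearray(old)
new[9] = 0xFF   # byte 9 lies in row 2
new[12] = 0xFF  # byte 12 lies in row 3
band = changed_rows(old, bytes(new), 4)
```

A sender would then transmit only `current[first * stride:last * stride]` together with the row range, or nothing at all when the result is `None`.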

@jncraton

jncraton commented Aug 28, 2020

After reading through this thread, I'm curious if anyone has tried writing a C or Rust program to decrease the bit depth before sending the fb0 stream to lz4. It seems like this could cut down on the work that lz4 has to do. Theoretically, this could go all the way down to mono graphics and leave lz4 with 1/16 of the data to process.
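A minimal Python sketch of the mono variant of this idea, for illustration only. It assumes 16-bit little-endian pixels whose most significant byte tracks brightness (true for gray16le, and approximately true for gray content stored as rgb565le); the function name `to_mono` and the threshold are hypothetical choices, and a pure-Python loop like this is far too slow for real streaming.

```python
def to_mono(raw: bytes, threshold: int = 0x80) -> bytes:
    """Pack 16-bit little-endian pixels into 1 bit per pixel (1 = light).

    Eight pixels end up in each output byte, most significant bit first,
    so the output is 1/16 of the input size.
    """
    high_bytes = raw[1::2]  # most significant byte of each pixel
    out = bytearray((len(high_bytes) + 7) // 8)
    for index, value in enumerate(high_bytes):
        if value >= threshold:
            out[index >> 3] |= 0x80 >> (index & 7)
    return bytes(out)


# Eight white pixels followed by eight black pixels become two bytes.
packed = to_mono(b"\xff\xff" * 8 + b"\x00\x00" * 8)
```

The receiving side would need to expand the bits back to a pixel format the player understands; that step is not shown here.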

@rien
Owner

rien commented Aug 28, 2020

Writing a C or Rust native binary will probably be the best improvement currently. I would definitely try to use the differences between two subsequent frames, because these differences will be very small.

Mono graphics is something we could support, but I wouldn't be a big fan because I like to use different intensities of grey in my presentations. Unless we would use dithering, but that's maybe overkill?

@jncraton

That makes sense. I'm not necessarily advocating for monochrome, but I was curious if reducing color depth had been tried. I agree that it would be better to use proper interframe prediction, but that seems much more complicated, unless someone can figure out ffmpeg encoding settings that work quickly enough.

I'm not planning to work more on this at the moment, as I discovered that the VNC-based solutions work well for what I'm trying to do.
