General optimization ideas and findings #14
I just (enjoyably) wasted an evening testing different options. My findings (all through USB):
So my conclusion is that unless a custom video encoder, written with streaming in mind as a single compiled program, is used, there's probably not much more room for speed improvement. ---edit--- |
Hi guys! Great that you've been experimenting with different encodings. It definitely seems possible to achieve near-zero lag with the right tools. In writing this script I did some of the same experimentation, but the reMarkable processor is just too weak to handle something as heavy as video encoding. The key to making this as fast as possible is getting the framebuffer data out of the reMarkable as quickly as possible. I've tried using no compression, but then the kernel (read: TCP or USB IO) seems to be the bottleneck. Writing something in C (or Rust 🎉) will probably be the long-term solution, but in the meanwhile I think experimenting with bsdiff could give some nice results. @levincoolxyz what exactly did you try with |
That is what I attempted to code with bsdiff, but I quit before fully debugging as I saw the CPU usage shoot up like crazy on the reMarkable. From the man page etc., I think it is written to generate a small patch between very large files (e.g. software updates), without real-time applications in mind. For reference, this took 1.6 s on my laptop (doing virtually nothing): time ( bsdiff fb_old fb_old fb_patch ) |
I have a couple of observations to make. I think a lot of the potential of h264 lies in the colourspace, and in having specialised instructions for hardware acceleration. Seeing as the video stream coming out of the reMarkable should be entirely gray-scale, this suggests that it's not the right codec to use. I don't know if any codecs better suited to gray-scale exist, but to me it's really surprising that a general-purpose compressor (lz4) is so much faster than a specialised video codec. It might be worth trying to write a naïve video codec, but I don't really know anything about graphics. Secondly, according to https://github.com/thoughtpolice/minibsdiff, the |
What a normal video encoder would do is throw away information (e.g. colours, image complexity, ...) in order to create a smaller video. The reason a general-purpose compressor works is that there is a lot of repeated information (a bunch of white pixels).

There are probably codecs which support gray-scale images, but I doubt they will be effective given the performance constraints we have. Our 'codec' should be as simple as possible. Maybe

As an alternative to |
If you link against https://github.com/thoughtpolice/minibsdiff, you can change the compressor that bsdiff uses. XOR is an interesting idea too. I would guess it also has less complexity, seeing as the byte buffer is the same size every time; bsdiff probably has some overhead as it has to account for file size increases/decreases. |
I currently don't have the time to dive deeper into this, so feel free to experiment with it! I'm open to PRs. |
I tested the xor-and-then-compress-with-lz4 idea. To my surprise, even with my rusty C programming skills I did not slow the pipeline down, which should mean that someone better at coding this low-level stuff could bring streaming latency down further (especially by choosing a better buffer size and merging the xor operation into the lz4 binary). (I put the code I used for testing in my forked project (https://github.com/levincoolxyz/reStream) for reference. If I get more time I might fiddle with it more...) |
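The xor-then-compress idea can be sketched as follows. This is an illustration in Python (the thread's actual implementations are in C), with stdlib zlib standing in for lz4; the buffer geometry matches the reMarkable's gray16le framebuffer, and the "pen stroke" is simulated.

```python
import zlib

W, H, BPP = 1408, 1872, 2            # reMarkable framebuffer: 1408x1872, gray16le
FRAME = W * H * BPP                  # 5_271_552 bytes per frame

prev = bytes(FRAME)                  # previous frame (all zeros for illustration)
curr = bytearray(prev)
curr[4096:4196] = b"\xff" * 100      # a small pen stroke: only ~100 bytes change

# XOR the frames word-at-a-time via big integers: unchanged bytes become zero,
# which a general-purpose compressor then squeezes almost entirely away.
delta = (int.from_bytes(prev, "little")
         ^ int.from_bytes(curr, "little")).to_bytes(FRAME, "little")

packed = zlib.compress(delta, level=1)   # fastest setting; lz4 is faster still
print(len(delta), len(packed))           # the delta compresses to a tiny fraction
```

The point is that the delta stream is almost all zeros between pen strokes, so even a fast, low-effort compression level collapses a 5 MB frame to a few kilobytes.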
That looks great. It will indeed be faster to make your |
It is already doing that; currently the file read and write are for (updating) the reference file. Maybe I should rewrite it to keep the reference in memory, since the supplied stdin is already continuously giving out the new buffer... will try that next. |
Maybe the JBIG1 data compression standard is something to look at... |
@levincoolxyz I coded up a similar experiment, using a double buffer for input, xor'ing it together, and then compressing the result with lz4 - unfortunately, it doesn't seem any faster. I think there might be too much memory latency in doing two memcpys, but I didn't profile it. I also tried to code up a minimal example using lz4's streaming compression, in the hopes that this would be faster (as it uses a dictionary-based approach), but again it was slightly slower. I'm not sure how the bash script outperformed me on this one, but I guess the naive C implementation I did is no good! |
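One way to sidestep the memcpy cost mentioned above is to never copy the payload at all: keep two buffers and swap which one is "current" each iteration. A minimal sketch in Python (a C version would swap pointers the same way); `io.BytesIO` stands in for reads from `/dev/fb0`, and the tiny frame size is purely illustrative.

```python
import io

FRAME = 16                                 # tiny frame size for illustration
stream = io.BytesIO(bytes(range(16)) * 3)  # stands in for /dev/fb0 reads

prev, curr = bytearray(FRAME), bytearray(FRAME)
frames = 0
while stream.readinto(curr) == FRAME:
    # ... xor prev with curr and compress the delta here ...
    prev, curr = curr, prev                # swap references: no memcpy of the payload
    frames += 1
print(frames)
```

After the swap, the buffer that held the old frame is simply reused as the read target for the next one, so each frame is written to memory exactly once.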
I just fixed a decoding bug and moved my fread out of the while loop. Now it (with xorstream) does perform better than the master branch over my wifi connection, and is on par through USB. I also found that, since the data transmission would be slightly laggy anyway, it sometimes improves performance (esp. via wifi) to add something like `sleep 0.03` into the read loop. This obviously reduces load on the reMarkable as well. |
Another thing I wanted to mention is that the CPU inside the remarkable is the Freescale i.MX 6 SoloLite, which has NEON, so in principle the XOR loop can utilise SIMD which may be faster. I'm not sure where the bottleneck is at this point. |
Actually I had forgotten to turn on compiler optimisations! If you compile with |
I am treating this issue as the general place for optimization and compression discussions: Has anyone played with some lz4 options? Currently the script uses none, but with some tweaking it might be possible to find some easy improvements. I am not sure how to properly benchmark this though, especially the latency. I think one approach would be to see if |
@fmagin I haven't played with any options yet, so please go ahead and report your findings! We need a way to objectively evaluate optimizations. I've been using |
Some raw numbers for reference:

### Synthetic Benchmark

```
remarkable: ~/ ./lz4 -b
 1#Synthetic 50%     :  10000000 ->   5960950 (1.678),   47.1 MB/s ,  241.2 MB/s
remarkable: ~/ ./lz4 -b --fast
-1#Synthetic 50%     :  10000000 ->   6092262 (1.641),   61.5 MB/s ,  260.7 MB/s
```

### Benchmarking with the actual framebuffer

This heavily depends on the content of the framebuffer; for testing I am using an empty "Grid Medium" sheet, with the toolbox open.

```
remarkable: ~/ dd if=/dev/fb0 count=1 bs=5271552 of=fb.bench
1+0 records in
1+0 records out
remarkable: ~/ ls -lh fb.bench
-rw-r--r-- 1 root root 5.0M Apr 19 14:54 fb.bench
```

### Compress to PNG

```
~/P/remarkable $ convert -depth 16 -size 1408x1872+0 gray:fb.bench fb.png
~/P/remarkable $ ls -lh fb.png
-rw-r--r-- 1 fmagin fmagin 20K Apr 19 17:22 fb.png
```

So PNG compression gets this down to 20 kB, which we can assume is the best possible result in this case.

### LZ4

Baseline:

```
remarkable: ~/ ./lz4 -b fb.bench
 1#fb.bench          :   5271552 ->     36423 (144.731),  535.7 MB/s ,  732.3 MB/s
```

With `--fast`:

```
remarkable: ~/ ./lz4 --fast -b fb.bench
-1#fb.bench          :   5271552 ->     36551 (144.225),  539.8 MB/s ,  733.3 MB/s
```

### Extras: Pixel Frequency
|
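The ~145x ratio above is plausible because the framebuffer is overwhelmingly a single value. A quick way to reproduce the effect locally, sketched in Python with stdlib zlib standing in for lz4 (the exact ratio will differ, and the "toolbox area" is a made-up stand-in for real screen content):

```python
import zlib

FRAME = 1408 * 1872 * 2                  # 5_271_552 bytes, as benchmarked above
fb = bytearray(b"\xff" * FRAME)          # a mostly-white page
fb[: 64 * 1024] = bytes(64 * 1024)       # pretend the toolbox area is dark

ratio = FRAME / len(zlib.compress(bytes(fb), level=1))
print(round(ratio))                      # hundreds-to-one on near-uniform data
```

Any byte-oriented compressor does well here; lz4's advantage is doing it at hundreds of MB/s on the device's weak CPU.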
Thanks for the very detailed report of your findings! Have you had the chance to experiment with the block size and block dependency? Maybe that will reduce the latency even more, because lz4 then knows when to 'forward' the next bytes. I think we can conclude (as you did, but removed) that lz4 is doing a pretty decent job compressing the data while using as few precious CPU cycles as possible. Compiling lz4 with SIMD enabled is indeed worthwhile to look at. |
As seen above, lz4 already has great throughput and compression ratio. The more I think about it, the more it seems that we don't care about throughput but about latency.

### Frame Latency

Theoretically this is just the frame size divided by the compression throughput. lz4 is far above what we need [0], at least for the above image, so we might actually want to focus on decreased CPU usage instead. Sleeping in the loop probably solves this.

[0] I don't actually know what a reasonable framerate to target is; the reMarkable can't render objects in motion smoothly anyway.

### The Bash Loop

On the topic of loops:

```
remarkable: ~/ time for i in {0..1000}; do dd if=/dev/null of=/dev/null count=1 bs=1 2>/dev/null; done
real    0m4.899s
user    0m0.150s
sys     0m0.690s
```

Simply calling `dd` in a loop already costs roughly 5 ms per iteration, so spawning a process per frame is not free.

### Network

#### USB

#### Wifi

This obviously depends on the wifi network the device is in. An interesting thing to note is that the ping from my host to the reMarkable is absolutely atrocious: 150 ms on average, while pinging the other way around is ~5 ms. No idea what is going on here.

### Conclusion

I am actually not even sure what metric I want to optimize. For any real use this already works really well, and while the latency is definitely noticeable, I don't see much reason to care about it if it stays in the range of less than a second. I am drawing on the device anyway, so I don't need low-latency feedback, and for video conferencing I can't think of a reason why it would be problematic either. I will probably be using this over the next few weeks for participating in online lectures/study sessions for uni, so maybe I will run into some bottlenecks.

### Future Work

I think for proper benchmarking we would need some way to measure the average latency per frame, from the very first framebuffer read until the data reaches the host computer. The host will most likely have so much more CPU power, a GPU, etc. that anything beyond that shouldn't make a difference anymore.

### Unsorted Ideas
|
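Plugging the benchmark numbers above into the frame-latency formula gives a back-of-the-envelope sketch. The compression speed and compressed size come from the lz4 benchmark earlier in the thread; the USB throughput figure is an assumption for illustration, not a measurement. MB is taken as 10^6 bytes, matching lz4's benchmark output.

```python
FRAME = 5271552            # raw bytes per frame (1408 * 1872 * 2, gray16le)
COMPRESS_MBPS = 535.7      # lz4 on-device compression speed from the benchmark
COMPRESSED = 36423         # bytes per frame after lz4 (empty Grid Medium sheet)
USB_MBPS = 35.0            # ASSUMED usable USB-network throughput, not measured

t_compress = FRAME / (COMPRESS_MBPS * 1e6)   # time to compress one frame
t_transfer = COMPRESSED / (USB_MBPS * 1e6)   # time to ship the compressed result
print(f"{t_compress * 1e3:.1f} ms + {t_transfer * 1e3:.1f} ms per frame")
```

Roughly 10 ms of compression plus about a millisecond on the wire per frame: well under any plausible frame budget, which supports the point that CPU load and loop overhead, not throughput, are the things to optimize.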
As everyone knows, mpv is the video player to use if you want to So instead of piping into |
There is probably some way to benchmark the latency between a frame going into mpv and it being rendered, and to compare that to ffplay; at least the discussion in the issue sounds like people are measuring it somehow. This latency is probably also entirely irrelevant compared to the other latency anyway. |
I really appreciate the time and effort you've put into this. If mpv is noticeably faster, we could use mpv by default and gracefully degrade to ffplay. But as you mentioned, this is probably not the case? Related to your other findings:
|
lz4 doesn't seem to accept a block size above 4 MB:

```
remarkable: ~/ ./lz4 -B5271552
using blocks of size 4096 KB
refusing to read from a console
```

I tried piping into

There is definitely some noticeable delay, but I don't really know where it would come from. Every part of the pipeline looks fairly good so far. I have unsettling visions of a future where we find out that the actual framebuffer device has latency because of their partial-refresh magic |
SSH Benchmarking:
Did you ever try OpenSSH compression? |
Heads up: I've installed a VNC server, and when looking at the connection, they use ZRLE compression. This hints that we are on the right track. Also, when the data is uncompressed at the other end, is there any need to change the pixel format? Is it not advisable to just play the gray16le, which is known to ffmpeg? Are we not otherwise introducing a transcoding step that would require buffering? Just thoughts. |
You might be interested in some hacking I've done recently on streaming the reMarkable screen. I've come to the conclusion that VNC/RFB is a very nice protocol for this, since it has good support for sending updates only when the screen changes, and standard encoding methods like ZRLE are quite good for the use cases we have (where most of the screen is a single colour). The only difficulty in a VNC-based solution is that, even though we have very precise damage-tracking information (since the epdc needs it to refresh the changed regions of the display efficiently), that information isn't exposed to userspace.

I've just published a few days' hacking on this to https://github.com/peter-sa/mxc_epdc_fb_damage and https://github.com/peter-sa/rM-vnc-server. I don't have a nice installation/runner script yet, but with those projects I can get a VNC server serving up the framebuffer in gray16le with very nice bandwidth/latency. I've only seen noticeable stuttering when drawing while using SSH tunneling over a WiFi connection; without encryption, or on the USB networking, the performance is quite usable.

I don't have quantitative performance observations or comparisons with reStream's fixed-framerate approach, but I expect resource usage should be lower, since the VNC server only sends actually-changed pixels. RFB is of course a bit more of a pain to get into other applications than "anything ffmpeg can output", but I've managed to get the frames into a GStreamer pipeline via https://github.com/peter-sa/gst-libvncclient-rfbsrc, and have been using GStreamer sinks to shove them into other applications. |
After reading through this thread, I'm curious if anyone has tried writing a C or Rust program to decrease the bit depth before sending the fb0 stream to lz4. It seems like this could cut down on the work that lz4 has to do. Theoretically, this could go all the way down to mono graphics and leave lz4 with 1/16 of the data to process. |
Writing a native C or Rust binary will probably be the best improvement currently. I would definitely try to use the differences between two subsequent frames, because these differences will be very small. Mono graphics is something we could support, but I wouldn't be a big fan, because I like to use different intensities of grey in my presentations. Unless we were to use dithering, but that's maybe overkill? |
That makes sense. I'm not necessarily advocating for monochrome, but I was curious if reducing color depth had been tried. I agree that it would be better to use proper interframe prediction, but that seems much more complicated, unless someone can figure out ffmpeg encoding settings that work quickly enough. I'm not planning to work more on this at the moment, as I discovered that the VNC-based solutions work well for what I'm trying to do. |
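The bit-depth reduction being discussed can be sketched like this (Python for illustration; the thread's real candidates are C or Rust). It assumes gray16le is little-endian with the high byte of each pixel at odd offsets, and the mid-gray threshold for mono is an arbitrary choice; the all-zero stand-in frame only demonstrates the size arithmetic.

```python
FRAME = 1408 * 1872 * 2
raw = bytes(FRAME)                      # stand-in for one gray16le frame

# 8-bit gray: keep the high byte of each little-endian pixel -> 1/2 the data
gray8 = raw[1::2]

# 1-bit mono: threshold each pixel, pack 8 pixels per byte -> 1/16 of the data
bits = bytearray((len(raw) // 2 + 7) // 8)
for i in range(0, len(raw), 2):
    pixel = raw[i] | (raw[i + 1] << 8)  # assemble the little-endian 16-bit value
    if pixel >= 0x8000:                 # arbitrary mid-gray threshold
        bits[(i // 2) >> 3] |= 1 << ((i // 2) & 7)

print(len(raw), len(gray8), len(bits))
```

Whether this actually helps is an open question: lz4 already removes most of the redundancy, so halving or sixteenth-ing the input mainly saves compressor CPU time rather than bandwidth.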
Given that ffmpeg is in entware, has anyone tried to use a real video codec to grab from /dev/fb0 instead of using lz4 on the raw bytes? I think this should in principle implement @rien's bsdiff idea (changes between frames are small, so this will reduce IO throttle) that I saw in the reddit post. I was able to get a stream to show by doing

but it seems heavily laggy. It seems to encode at a framerate of just over one frame per second, so there's clearly a long way to go. It also definitely seems like ffplay waits for a buffer to accumulate before playing anything. I'm really curious whether a more ingenious choice of codecs/ffmpeg flags would help.

Here's some sample output of ffmpeg if I set the loglevel of ffplay to quiet:

I think one of the big slowdowns here is that it's taking the input stream as `[fbdev @ 0xb80400] w:1404 h:1872 bpp:16 pixfmt:rgb565le fps:1/1 bit_rate:42052608`, rather than the 2-bytes-per-pixel gray16le stream that reStream is using; I can't seem to configure this though.

Also, is lz4 really faster than zstd?