sixel support #27
Yes, that's useful. Thanks. Another interesting project is libsixel: https://github.com/saitoha/libsixel Sixels aren't that hard to do -- the hard part is the design/API decisions that need to be made to make them fit as seamlessly as possible alongside character-cell-based graphics. There is some research to be done too, like how feasible it would be to mix sixels and character cells, e.g. what do the various terminals do with existing sixel graphics when you change the font size or overdraw them with text? I'll need to look at it more carefully, but I think the best way to implement sixel support in Chafa may be to do it directly in the frontend and leave libchafa to deal with character-cell graphics only, since it's already built around a character-cell API. Maybe. |
Hi, I am currently implementing sixel support for a TUI gizak/termui#233. I first check the terminal's response to \033[0c for a 4, which means the terminal supports sixel. The terminal dimensions in character cells and pixels have to be figured out to know how large a cell is in pixels and scale the image correctly. Unix terminals can be queried with the ioctl TIOCGWINSZ - I think this works only locally. The other way is to query with the escape codes \033[18t for dimensions in character cells and \033[14t for dimensions in pixels. Escape codes can be passed through the terminal multiplexers tmux and screen - please check my PR on termui for that. Handling scrolling output for those is probably difficult. |
Thanks, @srlehn. Chafa currently uses TIOCGWINSZ to get the size in character cells, and it works over ssh, in screen and in tmux, but not over a simple serial device. I don't think we can reliably get pixel dimensions that way, though (winsize.ws_xpixel and winsize.ws_ypixel come back as zero), so we'd have to use escape codes. |
Hi @hpjansson, thanks for letting me know that it works over ssh. Yes, you probably need to send the wrapped escape codes \033[14t and \033[18t to the terminal to figure out the character box size and not wrapped \033[18t or TIOCGWINSZ to tmux/screen to get the correct cell size for the current pane/window of the multiplexer. I don't remember what works for tmux/screen. |
I spoke too quickly. .ws_xpixel and .ws_ypixel properly reflect the terminal's pixel dimensions for mlterm, which supports sixels, but not for VTE (gnome-terminal), which does not. Experimentally this works over ssh and in screen too, modulo potential issues with resizing. So maybe we can just rely on that for simplicity. |
Another reference implementation for sixel support: https://github.com/jart/hiptext (this tool is quite similar to chafa) |
Sixel support has been committed to master. I wrote it from scratch, and it's the fastest implementation I'm aware of by far. There is still some refactoring, optimization and general cleanup to do, but it's working now, so I'll close this issue. |
Good job! |
FYI more terminals are supporting sixel recently:
More terminals and widgets with sixel support are reported at: https://gitlab.com/klamonte/jexer/wikis/terminals .
I haven't tried it yet, but peeking at the C it sure looks like it would be fast. I had to do quite a bit to get my Java encoder fast enough for the real world. Some of the tricks I resorted to are outlined at: https://jexer.sourceforge.io/evolution.html#sixel . |
@klamonte I see you did some work on speeding up palette mapping. That's where I focused my efforts too. After some experimentation I settled on an approach using principal component analysis and a hash table. When palette colors are sorted by their first principal component, you can do a binary search followed by a neighborhood probe for the closest match. This yields colors that are identical to a linear 3-component euclidean search, but instead of 256*3=768 multiplications (for a 256-color palette), it gets by with <16 multiplications typically. The mapping is stored in a flat 16384-entry hash table, so if the exact same color is seen in the near future, it will be satisfied by an inlined O(1) lookup. The palette itself is just generated with sparse median cut. Nothing interesting there :) It could probably be sped up with parallel merge sort or maybe a fancier quantization algorithm. The quantization/palette mapping code is spread across chafa-palette.c, chafa-color-table.c, chafa-color-hash.c, chafa-pca.c and the respective .h files. Since the encoder has to do many passes per six-height row, it first converts the row into an efficient internal format, so the data needed for each column can be fetched from memory in one load instead of six. I also took pains to make the column-to-character converter branch-free and divided the row into multiple "banks" with a bloom filter so if a palette color does not appear in a run of say, 64 columns, those can be skipped over quickly. See: chafa/chafa/internal/chafa-sixel-canvas.c Lines 106 to 191 in f4305b7
If you're benchmarking, I recommend doing so with GIF animations, as it bypasses ImageMagick's slow but high-quality loading. Feel free to e-mail me if you have any questions. |
I disappeared for a while, but obviously I am swinging back around. I got my encoder a bit better, inspired by notcurses' new octree encoder. I wanted to point out @dankamongmen 's work here, as you both have very interesting optimizations for speed. When I find time to do better than my mix of directly mapping colors (when the palette is smaller than the number of registers) and boring median-cut, I want to play with both of your approaches. :-) |
Nice! I tried using octree at first, but I didn't like it (turned out slower than median cut for comparable quality) and discarded that attempt. The method I use is two-pass, so in theory I can use any algorithm for generating the palette (if you sample the image, this doesn't need to be that fast) and keep the fast color mapping in the second pass. I'd like to do a side-by-side comparison of the existing sixel encoders at some point when I'm less swamped with work :) |
to be precise, i use an octree-inspired algorithm. my first attempt at using octrees directly gave less-than-awesome results. when wondering why, my key insight was that octree's advantages come from a power-of-two-based spectrum (0..255 per channel). sixel, of course, is not a 2^n space, but rather 101 values. i then moved to a "decatree" configuration of my own, and got far better results. this is only viable if you're converting colors from 256->101 at the beginning rather than at the end (which is what i do, since i'm not doing any of the averaging commonly employed by mincut algorithms, and thus needn't preserve original precision through the calculation).
dankamongmen/notcurses#1857 shows notcurses vs timg from last June. i've become much faster since then. you might want to take a look at
|
whoa whoa whoa, isn't this kinda picking your benchmarks? if you go with a slow decoder, that ought be represented in your timings. i use FFmpeg as my sole decoding backend, in part because imagemagick was slower with no discernible improvement in image quality (indeed, in my testing, with no difference in image quality whatsoever). unless you're strictly comparing sixel implementations, of course, but an end user only cares about time-to-display from start time. |
btw, this is why |
the ioctl works over remote ssh connections just fine. there's a specific SSH protocol data unit for resolution change notification. |
It's in the protocol but not all ssh implementations honor it or expose it to client code. I had to manually patch cryptlib for it once upon a time, and it sucked because there were no good SSH alternatives for Windows that worked with Borland C++ or Visual Studio 6. Today I would use mbedtls. |
shots fired! =] so far as i can tell, ....drum roll please...
given that you have several user/sys times that exceed your real time, i'm guessing you've got a multithreaded solution? i do not yet employ any threading in my decodes, but am considering adding some. in any case, we've got chafa: .044, .047, .049, .067, .051 and ncplayer: .047, .049, .051, .053, .055 your min beats mine by .003, my max beats yours by .012, and my average beats yours by .0006, whew photo finish! i'd ascribe all these deltas to noise, but "significantly faster than all other implementations" no longer seems a valid claim IMHO =]. |
if you're doing a blind thread-per-core implementation, know that i tested this on an AMD 3970X (32 physical/64 logical cores), and you'd probably have less overhead on a smaller machine. questions like this are why i haven't yet moved to a threaded solution; were i to do so, i'd likely have a thread-per-some-unit-area-that-is-a-multiple-of-page-size, capped by the number of logical cores, using a fractal spawnout to parallelize initial spawning overheads. |
We have two incredible winners here! I feel like I'm watching one of the great US Open matches with Andre Agassi vs Pete Sampras. This is awesome. :) |
two comments here:
|
@dankamongmen The reason I specified GIF is that the decoder is simple and bypasses ImageMagick, so you get closer to benchmarking the sixel encoder and not some random image loader (e.g. libpng). ImageMagick also adds tons of overhead. I bet your comparison above is mostly benchmarking the loader. Of course, that doesn't mean Chafa is faster, just that the benchmark is inconclusive :) Ideally you'd pass in raw memory buffers, but you'd have to write a small wrapper that loads the image once and measures around a loop of encoder calls. Re. threading: I've pipelined the threads so a single row of image data goes through the various passes from start to finish where possible, but there's still some low-hanging fruit there. It typically gets working sets smaller than L1 cache; though I've only checked the scaling code thoroughly for memory bottlenecks (it's the
That shouldn't be happening. Which Chafa version and terminal are you using?
When invoked like that, Chafa should be stretching the image also, as far as it can while preserving aspect. Though if you're running an older distro version it might not when stdout is being redirected. |
Hey guys, I just had a terrible idea. ;-) I just found out that sequences exist to query the sixel palette, in both HSL and RGB space. This isn't useful for single images, but for videos: what might it look like, and how fast could it be (average FPS, after the initial setup is ready) if one grabbed the palette that was there already and blindly mapped to it? (For reference, the default VT340 16-color palette is this: https://github.com/hackerb9/vt340test/blob/main/colormap/showcolortable.png) No initial color setup, just using the best-fit of 16-256 colors. I wonder if that would lead to a cool fuzzy-analog-TV type effect? 🤔 |
NB: As I pointed out in wez/wezterm#217, technically |
Thank you for the clarification! |
i'm using chafa 1.8.0-1 as packaged in Debian Unstable, and i ran these tests in an XTerm (patch 370). i should correct my claim: chafa does not leave the cursor in the middle of the output, necessarily, but rather always prints the sixel at the top of the terminal, and places the cursor one line below where chafa was launched. it definitely is not stretching in the sense that ncplayer does by default. |
i welcome your counterexample, good sir. |
i definitely refresh the palette for each frame of a multiframe visual (this is something that's trivially amenable to multithreading, and i do exactly that in e.g. the in any case, if you've already handled one frame, you already have the palette available to you without needing request it from the terminal, no? assuming this is all happening in one context. |
The idea (which won't work anyway except on a VT340 it seems) is to never generate a per-image palette. So it will definitely look different/bad, but might be a bit faster if the mapping/dithering can still work quickly. As to why: maybe faster multihead. :-) But my performance awfulness there could be more from thread contention. But then reading through you two brought to mind an embarrassingly parallel part of Jexer that is not actually parallel. |
@hpjansson i have found a class of images on which you outperform me for sure, so be pleased with that. on the other hand, i have found a class of images on which i outperform you for sure, so don't expect that to remain the case too long =D |
Hey, I paid good money for access to those papers, it'd better be better at something :) On big images, a lot of it will come down to multithreading. With many colors it'll likely be the PCA color mapping pulling ahead. With sparse or repeating colors, the bloom filters and branch-free/SWAR code helps. Guessing you already have the image scaling figured out, but if you're interested, smolscale is written to be easily copyable and usable verbatim. It's as fast as I could make it with separable convolution. If you need license concessions we can look at that. I'm still interested in getting a performance delta without the libpng overhead. Since it adds a fixed overhead to both implementations, the difference in the remaining slice is likely greater than 2x. Do you have a small code snippet that shows how to use notcurses API to convert an RGB buffer to sixels in direct mode? |
@dankamongmen You're right that Chafa's sixel output behaves oddly in XTerm, by the way. It used to work in older versions of XTerm, but when I updated it, it started acting weird. XTerm's decoder also seems awfully slow. However, it works fine in mlterm and VTE (from master with sixels enabled). Those are also quite fast. I'm thinking there's either a bug in XTerm and/or a disagreement on standards interpretation. If I do e.g. |
@hpjansson I'm looking at an xterm sixel bug too: https://gitlab.com/klamonte/jexer/-/issues/89 My root cause might be related to the slowish sixel decoder. 🤷♀️ |
@dankamongmen Chafa seems to work correctly with XTerm 370 here. Images appear inline and scroll as you'd expect. Since I'm not emitting DECSDM (I think it's good practice for a command-line tool to mess as little as possible with terminal state; going by the recent discussions, this seems to be analogous to how |
Just looking at it from afar, so maybe I'm wrong, but it feels like a typical event loop problem where the highest-priority event source is preventing other sources from getting any time slices. Confirmed it here with big sixel animations -- stuck parsing, never gets around to drawing anything. Edit: This seems unique to XTerm. For instance, mlterm also bottlenecks on 4k video, but it seems to split its time ok between parsing and updating. |
completely possible. this can also be controlled by your XTerm configuration using the |
By the way, you can make the loader take up less of the benchmarking time by just generating really big output images. After maximizing the terminal on a 4k screen, I get this:
So Chafa's sixel encoder is slightly north of 6x (six times) faster than Notcurses' when the actual sixel code dominates the benchmark. Single-thread performance is about 2x. Intel Haswell 4/8 core. The PNG loader is still adding overhead, so if I set up a test harness with raw image buffers, it should increase the delta a bit more. |
yep, the AllRGB images i'm testing with are 4096x4096 with each RGB color used exactly once. i'm adding some threading now, and expect to do significantly better soon on such large images. btw, how many cores on the machine you ran those tests on? |
4 cores physical, 8 logical. Note that the AllRGB images are huge, so you're still mostly benchmarking the loader. To get a good idea of the encoder's performance, the output image must be (much) bigger than the input image. Since AllRGB images must be 16MP to contain all the colors, I think the only way to get a true benchmark would be to load the image first and put a timer around the encoder calls. |
Note the output image must be big. When using AllRGB 4096x4096 images as input you're doing the inverse and spending even more time in libpng/ImageMagick. |
I have gotten a lot of gains, primarily in quality but also just in general structure and performance, from your outline of chafa's approach. Thank you so much for detailing it here!
This was such a fun trip to do! :-) Putting in the eigenvalue solver (I used this one, which is just a C repackaging of JAMA's Java repackaging of EISPACK, all of which are public domain) really warmed the cockles of my black computational chemist heart. Pop in the covariance matrix calc, get the eigenvalues/vectors, and go to town with the binary search. Thank you also for the pointer on the neighborhood search. I was close-but-not-quite there, and when I put just the two other minor checks in it was like "Woah! That is SO pretty!" :)
I cut about 40%-ish color-matching time by hanging onto the previous binary search answer, and checking the difference in pca1 on the next try: if it was within roughly 8 indices just start from there. Not quite O(1), but still a boost. Then I added a front-end 8192-entry hash table, and got another boost.
Palette generation is still about 5% of the total time, but I'm not unhappy. I just sample num_colors in uniform 16-pixel runs, and was really surprised how effective that is.
This is the next frontier for the single-thread case. Having cleaned up a lot of crap around it, those 6 memory loads stick out so much and uuugggh lol.
Funny enough, that's exactly how I got here. :-) I want gaming and videos to work, so gotta get a lot faster for it. Performance gains in the large so far have come from the following:
Thanks to your pointers, after a week or so I think I'm about 4x better on image quality, and 2-3x faster now. Still really far behind both of you here: on my puny i3-7100U @ 2.40GHz it's about 20 seconds for the AllRGB balloon.png :
That appears to be after the JIT has done what it can. On smaller images, you can see it stepping in and the speedup:
So I think that's like 1% of your speed? 🤣🤣 |
You mentioned sixel in TODO. However, I didn't realize how well sixel performs in the terminal until I saw this: https://github.com/hackerb9/lsix . It's a simple bash script, and it might provide some inspiration?