wayland
This page covers the ongoing subproject for adding support for Wayland-compatible clients, letting them connect and render using Arcan. It serves primarily as a notepad for how features are supposed to match up. Actual feature state and progress is tracked in the corresponding README.md
There are three ways the server functionality could be added (realistically):
- Allow arcan to be built as a 'normal' compositor, using the provided libwayland-server directly.
- A libwayland-client reimplementation.
- A separate translation tool.
Approach 1 was dismissed due to the API design of libwayland-server and the implications that design has on engine architecture and on future security hardening and debugging efforts. In short, it's a "squeeze-OO-into-C" dynamic hierarchy of pseudo-vtables, mapped via libFFI from an XML spec compiler. Resource allocation in this API is really, really awkward, and as soon as you screw it up - and you will screw it up; every compositor project to this day has cleaned up such issues in its commit history - there's a UAF and/or type-confusion related crash to be had, with a "nested callback"-style nasty backtrace. Most/all compositors and clients can be crashed easily by just finding a resource someone assumed couldn't be destroyed at one of the many state transitions, and trying to destroy it.
To linger on the callback-vtable approach some more: a lot of work has gone into keeping callbacks and callback-driven APIs away from the engine (still not 100%), since they trivially lead to R/W mapped structures with pointers-to-functions lying around in easy-to-predict and easy-to-modify locations, increasing the risk of reliable exploitation. They also make the control-flow graph of the engine dynamic, barring some luck and a great link-time optimiser, reducing the paths that can be locked down and verified at compile time. The other option would be to reimplement the line-protocol handling itself, which is an unrealistic tradeoff of effort versus reward.
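For reference, and to make the "pseudo-vtable" point concrete, this is roughly what the server-side idiom looks like when using libwayland-server directly - a minimal, hedged sketch; real compositors carry far more per-resource bookkeeping:

```c
/* Sketch of the libwayland-server idiom referred to above: each protocol
 * object becomes a wl_resource with a function-pointer table
 * ("implementation") attached, plus a destructor callback. Getting any of
 * these lifetimes wrong (user_data outliving the resource, or vice versa)
 * is where the UAF / type-confusion class of bugs comes from. */
#include <stdint.h>
#include <stdlib.h>
#include <wayland-server.h>

struct my_surface {
	struct wl_resource *res;
	/* ... compositor-side state ... */
};

static void surface_destroy(struct wl_client *cl, struct wl_resource *res)
{
	wl_resource_destroy(res);
}

/* one of these tables per protocol interface, dispatched via libffi */
static const struct wl_surface_interface surface_impl = {
	.destroy = surface_destroy,
	/* .attach, .damage, .commit, ... */
};

static void surface_res_destroy(struct wl_resource *res)
{
	/* if anything else still points at this, that's the UAF */
	free(wl_resource_get_user_data(res));
}

static void compositor_create_surface(struct wl_client *cl,
	struct wl_resource *comp_res, uint32_t id)
{
	struct my_surface *surf = calloc(1, sizeof(*surf));
	surf->res = wl_resource_create(cl, &wl_surface_interface,
		wl_resource_get_version(comp_res), id);
	wl_resource_set_implementation(surf->res, &surface_impl,
		surf, surface_res_destroy);
}

/* and another vtable one level up, referencing the one below */
static const struct wl_compositor_interface compositor_impl = {
	.create_surface = compositor_create_surface,
	/* .create_region, ... */
};
```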
Approach 2 was discarded due to its inherently 'hacky' nature and the subsequent upkeep/maintenance burden, although it is likely the path with the least amount of overhead and resistance.
Approach 3 was ultimately chosen after some experimentation. The primary advantages of this approach are:
- Ability to dynamically opt in to / out of the increase in attack surface.
- Reuse the same code-paths as a normal external/untrusted connection.
- Gain some resilience (e.g. server crash recovery)
- Indirection makes live migration possible (though costly)
- Compartmentation: you can spin up one translation process per Wayland client with minimal overhead (see the sketch below), and for high-load shm producers gain some performance, since the task of uploading textures to the GPU is performed inside the bridge process rather than inside the main server.
Along with my favourite reason, no-one else is doing it this way.
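A minimal sketch of the compartmentation idea from the list above, assuming a hypothetical run_bridge() translator (the real waybridge setup is considerably more involved):

```c
/* Sketch (not the actual waybridge code) of the compartmentation idea: the
 * listening process only accepts on the wayland socket, then forks one
 * bridge per client so that a crash or exploit stays in that process. */
#include <sys/socket.h>
#include <sys/wait.h>
#include <unistd.h>

extern void run_bridge(int client_fd); /* hypothetical per-client translator */

static void serve(int listen_fd)
{
	for (;;){
		int cfd = accept(listen_fd, NULL, NULL);
		if (cfd < 0)
			continue;

		pid_t pid = fork();
		if (pid == 0){
			close(listen_fd);     /* child: only sees its own client */
			run_bridge(cfd);      /* translate wayland <-> shmif here */
			_exit(0);
		}
		close(cfd);               /* parent keeps listening */
		while (waitpid(-1, NULL, WNOHANG) > 0); /* reap finished bridges */
	}
}
```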
The model used in Wayland counteracts many of the decisions made in shmif concerning division of labor and assignment of responsibility.
In the Wayland model, the client has a lot more to say about some things and more responsibility for implementing certain features, but at the same time quite a lot fewer actual features, due to the, in my opinion, overly aggressive 'not our job anymore' stance that 'the community' seems to be taking. The argument is that any supposedly missing features should be fixed with protocol extensions, or with the compositor opening up side channels through even more overcomplicated/overdesigned/underengineered IPC like D-Bus - which is just shifting the blame and not fixing the systemic problem. It's not like the set of requirements has moved much in this area.
The design of shmif attempts to cover the gamut of features needed to encompass current and future desktop environments, with a bias towards virtual-machine style interaction with legacy code. The basic assumption is that all clients are filled with bugs, bloated, incompetent and hostile, yet still provide something the user desires (data!). The ideal is that a client should need very little system interaction outside the display server IPC, and everything expressly 'not needed' should be sealed off by default. A client is provided with a very limited set of basics and needs to negotiate for more, with the default response being 'go away' or 'sure' accompanied by the sound of a gun being cocked.
Wayland has a design which arguably favours UI toolkit backends (which indirectly makes them part of the actual protocol, as the client APIs themselves are worse than xlib to actually work with), and, relatedly, the indirect choice of letting clients provide interactive decorations that affect window state, e.g. drag, resize, maximize, minimize, ...
This means that both parties need to agree on what these operations mean in the context of the window management scheme in use. This is supposed to be mitigated through the shell- abstraction (wl_shell, xdg_shell, zxdg_shell, fullscreen_shell, ivi_shell, ...) with an unpleasant two dimensional state explosion (version, shell-subprotocol) and the tedium of updating every client to actually use your shell and/or your set of protocol extensions. This is nice if you control and want to keep control of the entire stack and terrible if you do not.
Worse, the client may (and in practice already does) reject a connection where the version or subprotocol doesn't match expectations. This problem is already evident with GTK builds where wl_shell is not supported at all, despite being 'core', versus the unstable xdg_shell v5/v6 and zxdg_shell. The lock-in possibilities of obscure in-house extensions and clients refusing to connect to servers without them are also quite worrisome.
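To illustrate the (version, shell-subprotocol) matrix from a client's point of view, here is a hedged client-side sketch that probes the registry for the shell variants it knows about and gives up if none is offered; the wl_registry_bind against the matching generated interface is left out:

```c
/* A client only understands the shell sub-protocols it was generated and
 * compiled against; if the server offers none of them, the connection is
 * effectively rejected. */
#include <stdio.h>
#include <string.h>
#include <wayland-client.h>

static const char *shell_found; /* which shell variant the server offers */

static void on_global(void *data, struct wl_registry *reg,
	uint32_t name, const char *interface, uint32_t version)
{
	/* each entry is its own sub-protocol with its own version axis */
	if (!shell_found &&
		(strcmp(interface, "xdg_wm_base")   == 0 ||
		 strcmp(interface, "zxdg_shell_v6") == 0 ||
		 strcmp(interface, "wl_shell")      == 0))
		shell_found = strdup(interface);
}

static void on_global_remove(void *data, struct wl_registry *reg, uint32_t name){}

static const struct wl_registry_listener reg_listener = {
	.global = on_global,
	.global_remove = on_global_remove
};

int main(void)
{
	struct wl_display *dpy = wl_display_connect(NULL);
	if (!dpy)
		return 1;
	struct wl_registry *reg = wl_display_get_registry(dpy);
	wl_registry_add_listener(reg, &reg_listener, NULL);
	wl_display_roundtrip(dpy);

	if (!shell_found){
		fprintf(stderr, "no supported shell protocol, giving up\n");
		return 1; /* the 'client rejects the connection' case from the text */
	}
	/* ... wl_registry_bind() against the matching *_interface here ... */
	return 0;
}
```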
The design behind shmif in regards to external clients was to hopefully find a middle ground that makes it easier to write user-friendly applications that don't fit the advanced pattern of UI toolkits, yet don't suffer the awkwardness of terminal protocols. The idea being that the toolkits will eventually be eaten alive by - or actually turn into - 'the browser', and we end up with this huge ball of unknown-until-accidentally-activated features and a smorgasbord of viable attack surface.
Arcan assumes server-side decorations by default, but clients can negotiate for more under certain conditions. A client may request a subsegment of a type that fills the role of titlebar, mouse cursor, icon and so on - but that request may always be rejected if it does not fit the window management model in use. A segment request is necessarily typed, and accept/reject acts as allocation and feature negotiation in one. The default policy, unless explicitly allowed by the running scripts, is to reject such requests.
As a compromise, there is a mechanism to specify the active viewport of a segment, so that getting rid of shadows, borders and morbidly obese titlebars is a matter of tweaking texture coordinates and pointer device coordinates. This might also have been solvable with a protocol extension to get those values, or - since the running Arcan scripts know that a segment is bound to a Wayland client - by simply not drawing the decorations that the window management model dictates, while mourning the loss of consistency.
Drawing window decorations on the server side is trivial, because the graphics code is needed anyhow to fill the vacuum between clients in anything but the most trivial of window managers. On the client side, it means using a side-channel to synchronise style and behaviour between stakeholders of different origin, or accepting inconsistency in the most basic of UI features. In addition, it introduces performance penalties in situations where the decorations push the frequently updated contents (video in a window or a VM guest console) to poorly aligned offsets in memory and force streaming sub-texture uploads (non-trivial performance considerations there: thresholds for PBO- or synchronous glTexSubImage2D-based paths depending on the dirty/clean pixel ratio). Worse, it also introduces ugly bleed, blur or aliasing in scaling and mipmapping operations, which degrades quality in preview windows, scaled recording, sharing, VR/3D UIs etc.
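As an illustration of the upload trade-off mentioned above (not Arcan's actual code), a dirty-pixel-ratio threshold between a synchronous glTexSubImage2D of just the damaged rectangle and a streaming PBO upload of the whole buffer might look like this, assuming an RGBA8 source buffer and an extension loader such as epoxy; the 0.5 threshold is purely illustrative:

```c
/* For small dirty regions, a direct sub-image upload of the damaged rect is
 * cheaper; for mostly-dirty frames, streaming the full buffer through a
 * pixel buffer object tends to win. */
#include <stdint.h>
#include <epoxy/gl.h>

void upload_dirty(GLuint tex, GLuint pbo,
	const uint8_t *src, int src_w, int src_h,
	int x, int y, int w, int h)
{
	float dirty_ratio = (float)(w * h) / (float)(src_w * src_h);
	glBindTexture(GL_TEXTURE_2D, tex);

	if (dirty_ratio < 0.5f){
		/* small update: synchronous upload of just the dirty rectangle */
		glPixelStorei(GL_UNPACK_ROW_LENGTH, src_w);
		glTexSubImage2D(GL_TEXTURE_2D, 0, x, y, w, h,
			GL_RGBA, GL_UNSIGNED_BYTE, src + 4 * (y * src_w + x));
		glPixelStorei(GL_UNPACK_ROW_LENGTH, 0);
	}
	else {
		/* mostly dirty: stream the whole buffer through a PBO */
		glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);
		glBufferData(GL_PIXEL_UNPACK_BUFFER, 4 * src_w * src_h,
			src, GL_STREAM_DRAW);
		glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, src_w, src_h,
			GL_RGBA, GL_UNSIGNED_BYTE, 0);
		glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);
	}
}
```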
Outside the quality and performance aspect, we have protocol complexity, window management style collisions and security issues. The security part means that since the server side contributes nothing to the visual identity of the client, you can't use decorations to indicate trust domain or origin (e.g. border color in Qubes); any client can be proxied and used as a 'mimic' in order to target and steal data.
To combat the problems with mixed content attributes/content origins on the same surface, Wayland introduces the notion of subsurfaces, surface and subsurface synchronisation and ordering - though this solution wastes notable memory and memory bandwidth instead, as the decoration surface itself will be rather empty (or be comprised of 4+ subsurfaces that make synchronisation during drag-resize and similar operations complicated or race-condition prone), lends itself to overdraw or unnecessarily evaluated fragments, likely with blending enabled due to shadows with alpha, and is necessarily larger than the relevant contents. In addition, each surface can be annotated with a number of regions that indicate if they are opaque, accept input etc. so that the compositor can choose to partially disable blending operations. Contrary to their name, subsurfaces are not clipped, meaning that they are free to extend beyond the confines of their parent surface, and only by considering the extents of all subsurfaces do you know the size of the window. In principle, this can be worked around with the VIEWPORT hint events, but it makes window management modes like tiling a lot more complicated - and it can very likely be exploited to steal input.
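Since subsurfaces are not clipped, the effective window geometry can only be recovered as the bounding box over the parent and every subsurface; a small sketch of that computation:

```c
/* The effective window extent is the union of the parent rectangle and all
 * subsurface rectangles (offsets are relative to the parent and may be
 * negative, e.g. drop shadows extending past the edges). */
#include <stddef.h>

struct rect { int x, y, w, h; };

struct rect window_extent(struct rect parent,
	const struct rect *subs, size_t n_subs)
{
	int x1 = parent.x, y1 = parent.y;
	int x2 = parent.x + parent.w, y2 = parent.y + parent.h;

	for (size_t i = 0; i < n_subs; i++){
		int sx1 = parent.x + subs[i].x;
		int sy1 = parent.y + subs[i].y;
		int sx2 = sx1 + subs[i].w;
		int sy2 = sy1 + subs[i].h;
		if (sx1 < x1) x1 = sx1;
		if (sy1 < y1) y1 = sy1;
		if (sx2 > x2) x2 = sx2;
		if (sy2 > y2) y2 = sy2;
	}

	return (struct rect){x1, y1, x2 - x1, y2 - y1};
}
```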
Then come protocol and window management complications. Now the window manager has to communicate more state to the client; there need to be definitions of what 'maximized, minimized, fullscreen' state means so the client can waste a buffer update drawing the right icons in the titlebar, while also updating the mouse cursor to reflect the possible drag action. As an effect, all window-management actions now become latency-sensitive as they require at least a full roundtrip, hurting the option for network transparency - and since you have no guarantee that a client won't stop responding, you also need to provide alternative server-side window management modes for move and resize.
This one is really painful. Shmif supports analog, digital, touch and translated devices that may be entirely synthetic, unicast or broadcast - with a controlled additional path for more complicated special-purpose devices. A client should always be ready to receive and manage input, as there's no notion of 'input focus', only visibility states. These events are packed and delivered memory-mapped, or out-of-band together with resize and a/v synch operations, so that processing can live in its own thread and not starve graphics operations.
System-wise, keyboards ("translated" devices) result in a hierarchy of translation tables: scancode to keycode to keysym to some locale-specific encoding, or worse. On top of that, there is an assumed state machine covering modifier keys, diacritic/latched (dead key) sequences, num-/caps-lock, initial and ongoing repeat rates, and so on.
Wayland dictates a capability table as part of the 'seat' protocol. This table has one slot each for the mouse, keyboard, touch and tablet types. It also tracks which surface has access to which parts of the seat (mouse/keyboard/etc. focus is a separate state). We cannot know in advance which of these slots we can fill, as that is dynamic, so we have to lie and just say that we support all of them - because that is the truth, if you stretch it a bit. The input-focus issues have to be dealt with via a separate state-tracking stage in the translation bridge.
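Concretely, the seat 'capability table' boils down to a single bitmask sent on the bound seat resource; since device presence on the Arcan side is dynamic, the bridge has little choice but to advertise everything (server-side sketch):

```c
/* Advertise pointer, keyboard and touch regardless of what is actually
 * present right now; actual focus and device state is tracked separately
 * in the bridge. */
#include <stdint.h>
#include <wayland-server.h>

static void announce_seat(struct wl_resource *seat_res)
{
	uint32_t caps = WL_SEAT_CAPABILITY_POINTER |
	                WL_SEAT_CAPABILITY_KEYBOARD |
	                WL_SEAT_CAPABILITY_TOUCH;
	wl_seat_send_capabilities(seat_res, caps);
}
```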
The keyboard slot in Wayland only supports Linux kernel keycodes, and the compositor is then expected to provide the translation table itself (not symbols or any other kind of abstraction) in a very peculiar format (a simplified version of old X 'XKB' layouts) and transmit that as a descriptor (because descriptors are now the new Win32 HANDLE type, just with even worse semantics and higher risks). It is then up to the client to maintain the state machine - so now it's a huge job to figure out what a synthesised input will mean in the context of a specific client. In addition, the server side needs to maintain the same state machine in order to provide the correct modifiers, forcing both sides to handle the arguably most overcomplicated keyboard layout format in history. Most(?) compositors also seem to implement this by writing the keymap to a temp-then-unlinked file once and then sharing the descriptor - letting all clients modify a shared keymap. F- for security there, proc is a thing on Linux (security researchers: free CVEs up for grabs). Apparently some have realised this doesn't hold water and there are unstable protocol extensions floating around that support providing actual unicode text, with the effect that input behaviour will be about as inconsistent as everything else.
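One way to close the 'shared writable keymap' hole described above would be to hand out a write-sealed memfd instead of a shared unlinked tempfile, so that no client can rewrite the map behind the others' backs (sketch; error handling trimmed):

```c
/* Place the xkb keymap string in a sealed memfd before handing the
 * descriptor out; after sealing, recipients can only map it read-only and
 * cannot grow, shrink or modify the contents through any mapping. */
#define _GNU_SOURCE
#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>
#include <string.h>

int sealed_keymap_fd(const char *keymap, size_t len)
{
	int fd = memfd_create("xkb-keymap", MFD_CLOEXEC | MFD_ALLOW_SEALING);
	if (fd < 0)
		return -1;

	if (ftruncate(fd, len) < 0 ||
		pwrite(fd, keymap, len, 0) != (ssize_t)len){
		close(fd);
		return -1;
	}

	/* no further writes, growth or shrinkage possible, even via new fds */
	fcntl(fd, F_ADD_SEALS,
		F_SEAL_SHRINK | F_SEAL_GROW | F_SEAL_WRITE | F_SEAL_SEAL);
	return fd;
}
```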
The philosophy used in Arcan is to aggregate as much input data as the platform can provide, kill the num-/caps-lock states in favour of explicitly switching translation tables, allow the running scripts to synthesise, modify and tag these aggregates with a label and unicode code point, and explicitly address each client. This means that we don't really have a valid way of decomposing the end result back into a keycode, so any script transformations essentially become lost. It seems like keyboard management will have to remain its own little world, with the waybridge specifying which keyboard map should be sent to a client.
Game devices, and the subsequent translation and calibration done inside the scripts, will either have to be 'faked' and converted to mouse/keyboard input, or we accept that games with backends that connect over Wayland will keep trying to use raw USB/evdev (until they come up with a game-protocol extension for seat, which will likely just send the descriptor to the device rather than actually provide an abstraction), with the mess that entails. For those situations, the least painful solution might be converting the filtered game device output back over the evdev protocol and relying on sandboxing to prevent access to other devices.
Shmif is built around one guaranteed segment, plus additional "possible, but optional" segments that can be negotiated, each covering a number of audio and video sub-buffers. This design allows more memory-mapped resources, minimises the number of allocations and system calls used to transfer data and the number of file descriptors that need to be held, reduces the set of system calls that can't be sandboxed away after initial setup, and hopefully reaches zero-extra-copy GPU texture synchs (pinned memory etc).
There is a concept of a compile-time defined native buffer format and a packing macro for clients to use, but principally no way to switch this for the shared memory buffers. Synchronisation is performed by setting bitfields in a round-robin ready state, video synch picks the set bit with the longest distance to the last synched bit, implementing single, double and triple buffering all in one.
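A sketch of one reading of the 'longest distance from the last synched bit' selection rule (not the literal shmif implementation): with a ready-bitfield and a forward round-robin distance, the same function covers single, double and triple buffering depending on how many sub-buffers the producer keeps setting:

```c
/* Pick the ready buffer slot whose round-robin distance from the last
 * synched slot is largest; returns -1 if nothing is ready. */
#include <stdint.h>

int pick_buffer(uint8_t ready_bits, int last_synched, int n_buffers)
{
	int best = -1, best_dist = -1;

	for (int i = 0; i < n_buffers; i++){
		if (!(ready_bits & (1 << i)))
			continue;
		/* distance forward from the last synched slot, wrapping around */
		int dist = (i - last_synched + n_buffers) % n_buffers;
		if (dist > best_dist){
			best_dist = dist;
			best = i;
		}
	}
	return best;
}
```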
Wayland sets up a buffer pool of shared memory mappings with the faux-shared mmap/unlink trick, and then hands off descriptors as requested. Although these can be mapped at an offset, that doesn't give us the granularity to correctly map them to the negotiated shmif buffers - though I'm not entirely sure here; it may be possible with some complications in the resize stage.
This incompatibility leaves us with two choices: repack, or convert the shm buffers to textures and pass handles onwards. One is very costly and the other requires the bridge tool to have GPU access, which isn't always desired, as a GPU connection is, today, pretty much an infinite attack surface. The only saving grace is that it's so hard to even get GPUs to behave consistently for regular purposes, much less so for malicious ones.
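Spelling out the 'repack' option: a stride-aware row copy from the client's pool-offset buffer into the negotiated shmif buffer, assuming 32-bit pixels on both sides and ignoring format swizzling:

```c
/* Copy the client's wl_shm buffer (which lives at some offset in its pool
 * and has its own stride) row by row into the fixed-format shmif video
 * buffer. */
#include <stdint.h>
#include <string.h>

void repack_shm_to_shmif(uint8_t *dst, size_t dst_stride,
	const uint8_t *pool, size_t pool_offset,
	size_t src_stride, int width, int height)
{
	const uint8_t *src = pool + pool_offset;
	size_t row_bytes = (size_t)width * 4;

	for (int y = 0; y < height; y++)
		memcpy(dst + y * dst_stride, src + y * src_stride, row_bytes);
}
```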
Here is another big conceptual difference. If I understand this correctly - and the documentation is incredibly lacking in this area as well - a buffer needs to be bound to a surface that is promoted to something in the management style of the active shell protocol. This means that at buffer allocation time, we don't actually know what it is or will be used for, and can't say if we accept or reject it.
Shmif makes a segment request with a specific SEGID (type+use) and some other metadata for authentication, and everything has a visual representation that is immediately accessible. The idea was to make it very difficult to starve the server by pre-allocating hidden resources that keep accumulating until it runs out of descriptors. The catch is that we don't know if we'll get a new segment or not at the time we need to allocate a Wayland buffer or tie it to a surface. It is probably workable by just accepting normal surfaces, deferring the segment request until an actual shell-specific surface is created, and swallowing the latency cost that comes with it.
The shmif model for clipboard is the same as for screen sharing, desktop sharing, whatever. A client may make a subsegment request with an output segid type (clipboard is one such type), though the server can also explicitly push one in advance, without any negotiation, as a means of announcing capability - calling something like define_rendertarget (offscreen copy+composition), define_feedtarget (direct synch between two clients) or define_nulltarget (when the recipient doesn't need audio/video).
Output segments are just like normal ones, but with restrictions on synchronisation and resize operations. Server-side, it is the scripts that determine the contents, meaning we can paste raw images, streaming video, audio samples, UTF-8 text and binary streams. Type conventions like MIME types are deliberately avoided, with the argument that Postel's principle is a no-go when binary data is passed across privilege boundaries. The gains are few and the opportunities for an attacker are many. Clipboards are easy malware transport mechanisms, and a refined type model makes it easier for an attacker to pick and choose the targeted parser. Wayland uses MIME types for both drag-and-drop and cut-and-paste, where we can only say application/octet-stream (or probe/derive from the file extension).
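For comparison, this is roughly what the Wayland side of that negotiation looks like from a client or bridge that has no real type information to offer (hedged sketch; the wl_data_source listener that handles the actual 'send' request is omitted):

```c
/* The data source lists the MIME types it can produce up front and the
 * receiver picks one; a bridge with an untyped byte stream can only honestly
 * claim application/octet-stream (or guess from a file extension). */
#include <stdint.h>
#include <wayland-client.h>

static struct wl_data_source *make_opaque_source(
	struct wl_data_device_manager *mgr,
	struct wl_data_device *dev, uint32_t serial)
{
	struct wl_data_source *src = wl_data_device_manager_create_data_source(mgr);

	/* the only honest claim when the payload is an untyped byte stream */
	wl_data_source_offer(src, "application/octet-stream");

	/* make it the current selection ("copy"); a listener for the 'send'
	 * event would then write the data into the descriptor it is handed */
	wl_data_device_set_selection(dev, src, serial);
	return src;
}
```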
This is a list of features exposed by the engine and current shmif- API that do not translate to the current set of Wayland protocols and subprotocols at all, and will not be available.
Shmif specifies GEOHINT events to indicate switches in presentation language, input language, timekeeping and global positioning. There is also a special subsegment that is pushed to a client which should, if possible, be populated with a simplified (high contrast, screen-reader friendly, text representations etc.) update.
Some SEGIDs specify explicit format-string protocols for providing alternative representations by queueing LABELHINT and MESSAGE events, primarily TITLEBAR and POPUP, in order for the scripts to take an alternate route to presenting the information (text to speech, global menu interface etc.)
Shmif specifies FONTHINT events that both provide a descriptor to the font to be used for a specific role, and its sub-properties, like hinting and size. Either we discard this feature for Wayland clients or start adding them over side-channel protocols.
We have no way of pairing a Wayland client to any audio source, so accessing that data will likely require some sandboxing control and interception that way, or falling back on the default Linux idiocy of trying to guess from YOLO information sources like scraping proc. A compromise would be to make a pulse-audio backend and sneak servers into clients on a per-process basis, or something equally awful.
There doesn't seem to be any reliable approach for being event-triggered when a client wants to 'open' a file, for announcing the capability to save a file, or anything in between. Again, introduce side-channels or scrap the feature.
Buffer sharing is single-direction (client to server) in Wayland, while it has a negotiated per-segment direction in shmif. A client can request an output segment and the scripts can explicitly push one. At the same time, this also means that no Wayland-compatible client will expect, or can handle, data coming in, so we either fake it by converting to a drag/drop or cut/paste operation with a suitable MIME type, or let the scripts block any define_recordtarget calls against a wayland-client bound VID. Some suggestions have been made towards using the nested-compositor approach and having a client also act as a compositor, which translates to 'do not fix the protocol, just make things more Rube Goldberg-y'.
This has already been elaborated on in the input model problem, but deserves repeating. All the input that is synthesised or transformed in the running scripts will degrade into a linux evdev key-code that will be reinterpreted with the state machine of the keymap running in each client. Though there is a remote possibility of dynamically generating a valid keymap and sending that, it is rather unrealistic. I would love to be proven wrong here.
Shmif does not require any patches to EGL for a client or server to be able to share buffers, though we do rely on some extensions (surfaceless and displayless contexts, dma-buf or EGLStreams). It also forces the client to either render to a texture, or provide a default FBO that should be bound - something similar to what iOS does. There is a recent initiative by nvidia for an external platform EGL extension that decouples the protocol from the EGL setup, so as not to repeat the GLX mistake, though I have my doubts that MESA will follow. The color attachment is then shared with whatever texture-streaming mechanism is available. Anyhow: there is an invalidation event that can be pushed to indicate that the currently used context is dead and the client must switch to another. Though we can probably respond by binding and unbinding Wayland-bound displays, there seems to be no way for the invalidation to occur, and with clients written with the assumption that the allocated context lives for as long as the client does - we're screwed.
There is currently no Wayland extension that deals with defining related assets like an icon representing the client, though there is core protocol for turning a surface into a client-pointer attachment, so just introducing more surface types like that might become yet another extension in the future - but again, that means writing the extension, getting "the community" to accept it, updating the clients and waiting for the update to propagate.
Arcan has an explicit subsegment type for saying that this is an icon ("miniaturized") type, where the first frame delivered specifies the initial state (so it can be cached and re-used in the desktop environment), followed by dynamic updates so it can work in a statusbar or notification system. This subsegment can also be promoted (current WIP) to a vectorised buffer interpretation, so tessellated vector data + normal buffers as textures can be used for a dynamic/3D/VR representation.
Arcan has a rendering API prepared for the 0.6 planned networking release. This rendering API is the same as the underpinnings of the Lua scripting interface, with a few dangerous functions masked out and the destination being forced to a "per-connection defined rendertarget". A separate tool that acts as a shmif- proxy will provide the line-format for serialising, and subsegment- type specific compression. For wayland- clients, this will be reduced to the same I/O event limitations as expressed elsewhere, and likely a forced H264- encoding for buffer contents (the cost of real-time content analysis to determine encoding scheme is probably too high - we'll see).
To get away from the problem of clients trying to provide bindings like "CTRL+W" to "perform action" themselves, a disease most commonly (but not exclusively) found in games, a client can send INPUTHINTs where it indicates a label, its datatype, integer encoding and a localized description. This allows the scripts running in Arcan to tag output events so the client can treat them right, and provide universal input <-> device binding without the client providing a UI for it every single time. These will just be missing for all Wayland clients.
Similar to Client defined Input Labels, a client can provide COREOPTs to indicate simple key/value storable configuration properties that the user may want retained between launches. There is also the option to do a STATEHINT to indicate the current estimated size of a runtime state snapshot. This feature is used for game savestates, VM snapshots etc. in order to allow migration between machines, synchronisation between devices and so on. For Wayland, these will just be empty / disabled.