Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DRAFT] add til::clump<T>, some weird data structure #7220

Closed
wants to merge 4 commits into from

Conversation

DHowett
Copy link
Member

@DHowett DHowett commented Aug 7, 2020

Summary of the Pull Request

clump is a vector intended to be consumed in chunks.
It is stored as a vector of T with an optional vector
of lengths as a sidecar.
If the length vector is missing, it is assumed that
each component is of length 1.

         +-----------------------------+---------+----+ 
Sizes    | 6                           | 2       | 1  | 
         +--------Region 1-------------+--Rgn 2--+-R3-+ 
Contents | 38 |  2 |  0 | 12 | 34 | 56 |  4 |  2 |  2 | 
         +----+----+----+----+----+----+----+----+----+ 

During iteration, this clump will produce three spans:

{38, 2, 0, 12, 34, 56}                                  
{4, 2}                                                  
{2}                                                     
Sizes    [ UNSPECIFIED       ]                          
         +----+----+----+----+                          
Contents | 38 |  5 | 68 |  8 |                          
         +----+----+----+----+                          

During iteration, this clump will produce four spans:

{38}                                                    
{5}                                                     
{68}                                                    
{8}                                                     

It's designed like this to be approximately as performant as a vector in the normal case (where there are no sizes specified).

TODO:

  • Tests
  • Fill out this PR body
  • Maybe find a name for it, who knows
  • Try it out in the state machine

References

PR Checklist

  • Closes #xxx
  • CLA signed. If not, go over here and sign the CLA
  • Tests added/passed
  • Documentation updated. If checked, please file a pull request on our docs repo and link it here: #xxx
  • Schema updated.
  • I've discussed this with core contributors already. If not checked, I'm ready to accept this work might be rejected in favor of a different grand plan. Issue number where discussion took place: #xxx

Detailed Description of the Pull Request / Additional comments

Validation Steps Performed

@j4james
Copy link
Collaborator

j4james commented Aug 9, 2020

I have some comments on this, based on on some refactoring work I was experimenting with for the parameter handling in general. I had to put that refactoring on hold to get the SGR refactoring out the way first (and that led my down the rabbit hole of the conpty colors), but at some point I would like to get back to it.

To give you a bit of context, my goal was to try and simplify the way parameters were accessed in the OutputStateMachineEngine class. So instead of having a call like _GetXYPosition to extract the parameters in one switch statement, and then a second switch statement dispatching those parameters to CursorPosition, we'd just have a single switch with something like CursorPosition(parms.get(0), parms.get(1)).

The idea was that the parms class would automatically handle any missing or default values, and eliminate the need for literally hundreds of lines of those boilerplate parameter extraction methods. The plan was also for this class to support looping over sequences of selective parameters, so we could more easily address issues like #2101.

I hadn't got to the point of dealing with the colon subparameters, but I was thinking about how they might eventually be incorporated, and I came to the following conclusions:

  1. 99% of the time they aren't going to be used, so we'd just need a method in the parameter class like HasAnySubparameters, and if that returned true, the dispatch could be aborted right away.
  2. With that established, the rest of the code could be guaranteed it was accessing a simple parameter list without worrying about whether subparameters might be returned, so index lookups don't need to be complicated.
  3. For the 1% of time where you are dealing with subparameters, it's in controls that take a parameter sequence, so you don't really need index lookups, just the ability to iterate. That's the only time the returned parameter might need to be a subparameter group.

Given those constraints, I think this clump class may be more complicated than it needs to be. I don't think it necessarily conflicts with anything I was planning, but I suspect much of the interface it provides would ultimately not be used, and it may be introducing unnecessary overhead. Not that that matters a great deal - just FYI.

I should also mention some thoughts I had on the underlying structure. One of my ideas was that we could reserve a bit in each parameter value to signal whether it was the start of a sub sequence or not, and that way we could avoid the need for two lists. I don't know whether that was a good idea - just throwing it out there.

And one other thing. I was thinking at some point that we should make the parameter list a fixed length. The DEC standard recommends supporting up to 16, and most terminal emulators only support around 32. Allowing unlimited parameters sounds nice in theory, but it's a potential attack vector for overflowing memory. Not hugely important, but something to consider while we're evaluating data structures for this.

@DHowett
Copy link
Member Author

DHowett commented Aug 9, 2020

That's an interesting take, and yeah -- I think it would be great to have something that supports specifically VT parsing with all its optional parameters. We're definitely overdue for a refactor there, and you're right that we could evict a good amount of boilerplate with a better type.

The guiding principles behind clump (clustered_vector is perhaps a better name?) were that it...

  • ... is reasonably well-suited for VT parsing, even if short of ideal
  • ... is almost a drop-in replacement for a vector/span view over a vector; therefore has:
    • approximate zero additional cost unless subparameters are used
    • a decay to vector/span at runtime for most operations
    • (with operator[] or at,) reasonable indexing behavior
      • (I do not necessarily think that returning a span is reasonable indexing behavior ;P)
  • ... fits into our existing span-passing API. Since everything boils down to a pointer+length, the incremental cost is "just run a couple functions a few more times" for clusters where len=1
  • ... has broad applicability for any clustering operations, like glyph cluster mapping in DWrite and font feature tables (pretty much anything that's a length-coded vector)
    • (I can imagine a std::pair<FONT_FEATURE*, SIZE_T*> clump<FONT_FEATURE, SIZE_T>::win32_pointer_view(handwave, handwave) as an adapter that makes directwrite integration less obnoxious)

I don't necessarily love that this is a too-general solution applied to a too-specific problem, but I do appreciate how minimal the delta is in StateMachine/the engines 😄
That's probably because we're moving downwards in the generalization gradient from vector/span.

Incidentally, because of the decay of a list of length=1 units into a set of spans of length=1 I'd updated the SGR dispatcher to operate on a single SGR parameter pack at a time. It seems like the right thing to do, but it does rather involve (I apparently left while writing this and came back and never finished it. I haven't the foggiest idea what I was going for.)

Perhaps there's a staged approach?

  1. Introduce type aliases VTParameterPack, VTParameterPackView to vector and span respectively
  2. (Optionally) switch them to a clustered vector, since we're "close" on that one and it defers a refactor
  3. (Finally) switch them to a smarter more-specific type that fills in default values and land the refactor

@j4james
Copy link
Collaborator

j4james commented Aug 9, 2020

  • approximate zero additional cost unless subparameters are used

  • a decay to vector/span at runtime for most operations

  • (with operator[] or at,) reasonable indexing behavior

    • (I do not necessarily think that returning a span is reasonable indexing behavior ;P)

Maybe I need to wait to see how it's going to be used, but I got the impression that the index would be returning a span. And that would mean that everywhere we're currently doing something like til::at(parameters, 0), we'd now require an additional dereference to get back the actual value.

We'd also need to check the size of the returned span and fail for values greater than one. And in the case of parameter sequences it's even more complicated (although perhaps that's an edge case that could initially be ignored). The bottom line is that it doesn't sound like zero additional cost, unless I've misunderstood something.

  • ... has broad applicability for any clustering operations, like glyph cluster mapping in DWrite and font feature tables (pretty much anything that's a length-coded vector)

Now that's something I hadn't considered. I think that's a good enough argument to convince me it's worth going ahead with this regardless of any other concerns I might have.

I don't necessarily love that this is a too-general solution applied to a too-specific problem, but I do appreciate how minimal the delta is in StateMachine/the engines 😄

This is the bit that I have my doubts about, but I'm happy to wait and see how it turns out.

Perhaps there's a staged approach?

Yeah, there's no need to hold up anything for the refactoring - I just wanted to let you know what I had planned and make sure we were on the same page. And I wouldn't worry about type aliases either. I'd just go right ahead with integrating the clustered_vector into the StateMachine and see what it takes to make the extended SGR colors work (assuming you haven't already done that).

@DHowett
Copy link
Member Author

DHowett commented Aug 9, 2020

Maybe I need to wait to see how it's going to be used, but I got the impression that the index would be returning a span.

Oh, yeah, that was a tongue-in-cheek comment at the expense of my own code. I don't think that the thing I did was reasonable (edit: but it was expedient for the purposes of prototyping).

Playing with xterm a bit, it looks like sequences that do not expect subparameters are rejected when they have subparameters. That does complicate things a bit, since (as you rightly assert) we'd need to introduce length checks.

Have to do a bit more investigation to see whether I'm comfortable committing to this. It might be better-suited for DWrite than VT parsing, and I'm treating it as a hammer for any nail-shaped object.

@DHowett
Copy link
Member Author

DHowett commented Sep 23, 2020

Clump isn't currently the right solution.

@DHowett DHowett closed this Sep 23, 2020
@DHowett DHowett deleted the dev/duhowett/uptown_clump_you_up branch October 16, 2020 01:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants