This repository has been archived by the owner on Mar 20, 2024. It is now read-only.

The Return of Cast Operations #503

Closed
billhuffman opened this issue Jun 26, 2020 · 15 comments
Labels
Resolve for v1.0 To be resolved for v1.0 draft

Comments

@billhuffman

We've decided to require (the appearance of) SLEN=VLEN. Early in the discussion of the issue, we considered having cast operations that would rearrange for different element size because it was important in a small number of codes. We may still want to have those cast operations, though now only for performance on wide machines. The cast instruction will perform better on a wide, in-order implementation than the auto-inserted micro-op. As before, the cast is a nop on narrow machines.

I think all the issues about fragmentation are gone here. Code with and without the cast works on all implementations. But for performance optimization, casts will want to be used on wide, in-order implementations and possibly also on wide, OoO implementations.
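To make the "nop on narrow machines, rearrangement on wide machines" point concrete, here is a minimal sketch in Python. It assumes a simple round-robin striping of elements across SLEN-wide sections, which is a simplification chosen for illustration and is not taken from this thread; the function names and the byte-based parameters are hypothetical.

```python
def to_register(mem: bytes, sew: int, vlen: int, slen: int) -> bytes:
    """Place memory-order bytes into a register image (all sizes in bytes).

    Assumed model: element i goes to section i % (vlen // slen).
    """
    stripes = vlen // slen
    reg = bytearray(vlen)
    for i in range(vlen // sew):
        pos = (i % stripes) * slen + (i // stripes) * sew
        reg[pos:pos + sew] = mem[i * sew:(i + 1) * sew]
    return bytes(reg)


def to_memory(reg: bytes, sew: int, vlen: int, slen: int) -> bytes:
    """Inverse of to_register: recover memory-order bytes from the image."""
    stripes = vlen // slen
    mem = bytearray(vlen)
    for i in range(vlen // sew):
        pos = (i % stripes) * slen + (i // stripes) * sew
        mem[i * sew:(i + 1) * sew] = reg[pos:pos + sew]
    return bytes(mem)


def cast(reg: bytes, old_sew: int, new_sew: int, vlen: int, slen: int) -> bytes:
    """Rearrange a register image laid out for old_sew into one for new_sew."""
    return to_register(to_memory(reg, old_sew, vlen, slen), new_sew, vlen, slen)


mem = bytes(range(32))

# Narrow machine (SLEN == VLEN): the cast changes nothing.
narrow = to_register(mem, 4, 32, 32)
narrow_cast = cast(narrow, 4, 1, 32, 32)

# Wide machine (SLEN < VLEN): the physical image is rearranged,
# while the memory-order contents stay the same.
wide = to_register(mem, 4, 32, 16)
wide_cast = cast(wide, 4, 1, 32, 16)
```

In this model the cast never changes what software observes in memory order; it only changes the physical image, and only when SLEN is smaller than VLEN.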

@David-Horner
Contributor

David-Horner commented Jun 26, 2020 via email

@billhuffman
Author

billhuffman commented Jun 26, 2020 via email

@billhuffman
Author

billhuffman commented Jun 26, 2020 via email

@billhuffman
Author

I'd say let's put a non-normative comment that vmv.v.v with dest=source is expected to be used as a hint, for implementations that need it, that element width is changing from an old element width to the current SEW. If that's added, we can close this issue.
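As an illustration of the hint idea, the following Python sketch models an implementation that tags each register with the element width it is currently laid out for. Everything here is hypothetical (the class, the default tag, and the assumption that a mismatched use costs exactly one auto-inserted fix-up micro-op); it is meant only to show why vmv.v.v with dest=source can serve as a hint without changing architectural state.

```python
class TaggedRegFile:
    """Hypothetical model of a register file that tracks an internal EW per register."""

    def __init__(self, nregs: int = 32, default_ew: int = 8):
        self.value = [0] * nregs           # architectural contents (layout-independent)
        self.tag = [default_ew] * nregs    # EW (in bits) each register is laid out for
        self.fixups = 0                    # auto-inserted rearrangement micro-ops so far

    def use(self, reg: int, sew: int) -> int:
        """Read reg as a source at width sew; a tag mismatch costs one fix-up."""
        if self.tag[reg] != sew:
            self.fixups += 1
            self.tag[reg] = sew            # assumed: the fix-up also retags the register
        return self.value[reg]

    def vmv_v_v(self, vd: int, vs: int, sew: int) -> None:
        """Model of vmv.v.v: copy vs to vd and lay vd out for the current SEW."""
        self.value[vd] = self.value[vs]
        self.tag[vd] = sew


# Without the hint: v1 was written at EW=32 and is then read at SEW=8.
plain = TaggedRegFile()
plain.value[1], plain.tag[1] = 42, 32
plain.use(1, 8)

# With the hint: vmv.v.v v1, v1 executed under SEW=8 before the read.
hinted = TaggedRegFile()
hinted.value[1], hinted.tag[1] = 42, 32
hinted.vmv_v_v(1, 1, 8)
hinted.use(1, 8)
```

In this model the hinted sequence leaves the register value untouched and avoids the auto-inserted micro-op; an implementation that does not track an internal EW could ignore the instruction entirely.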

@David-Horner
Contributor

The beauty is that it (vmv.v.v with source=destination) does the right thing as a hint or as a physical move when internal EW is tracked.

Although it only supports internal reformatting to a "SEW friendly" internal state,
preemptive formatting to an explicitly hinted EW can be added post v1.0 if more is needed.

I agree, however, that v1.0 could benefit from the inclusion of vmv.v.v with dest=source described as a hint.

It helps by

  • introducing hints to RVV,
  • highlighting unique aspects of "nop"s in RVV, and
  • disambiguating this obvious special case.

Do you have any proposed wording?

@billhuffman
Author

billhuffman commented Jun 26, 2020 via email

@David-Horner
Contributor

David-Horner commented Jun 26, 2020 via email

@billhuffman
Author

billhuffman commented Jun 26, 2020 via email

@kasanovic kasanovic added the Resolve for v1.0 To be resolved for v1.0 draft label Jun 28, 2020
@kasanovic
Collaborator

This looks like a good resolution.

@kasanovic
Collaborator

Might want both vmv.v.v form for hint that "vl" elements need to be rearranged, and vmv?r.v hints for whole vector register groups.

kasanovic added a commit that referenced this issue Jul 3, 2020
@David-Horner
Contributor

I disagree with providing multiple formulations of whole register (WR) loads as hints to the microarchitecture about the requested internal structure, because (details below):

1. these hints potentially benefit only machines whose in-register order differs from in-memory order
2. better mechanisms should be possible to optimize behaviour for such machines
3. the designation has diminished value for a multiple register load with a single width hint
4. the typical WR use case is for autonomous/independent spill and fill that are unaware of subsequent use
5. it complicates both the hardware and the software, causing potential confusion to each
(software needs to use a size-specific load in a context where such info is not immediately available, if at all)
6. hints can lead to lower performance for machines not needing them (a consequence of 5).

I believe it is premature to provide these whole register (WR) load hints.
I propose we only use “e1024 encoding” (mop-width all ones) for WR loads and for WR stores.
The implementation will decide what internal format to use; software can assume the current SEW is used.
Conventions can be used to optimize the re-establishment of internal width information.

Further, I recommend we emphasize the in-memory format of the model. Specifically, that software can consider unit-strided elements as equivalent to unit-strided elements of narrower widths (and wider elements if alignment requirements are met) with an appropriately adjusted count.
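A small Python sketch of that in-memory equivalence, assuming little-endian byte order; the helper name `adjusted_count` is made up for illustration. The same bytes are viewed as 32-bit, 16-bit, and 8-bit unit-strided elements, and only the element count changes.

```python
import struct


def adjusted_count(count: int, old_bytes: int, new_bytes: int) -> int:
    """Element count after reinterpreting `count` elements at a new width."""
    total = count * old_bytes
    if total % new_bytes:
        raise ValueError("total size is not a multiple of the new element width")
    return total // new_bytes


# Two 32-bit elements stored with unit stride (little-endian assumed).
mem = struct.pack("<2I", 0x03020100, 0x07060504)

# The same storage viewed as narrower unit-strided elements.
as_halves = list(struct.unpack("<4H", mem))   # four 16-bit elements
as_bytes = list(mem)                          # eight 8-bit elements
```

Going the other way (narrow to wide) works the same, provided the address meets the alignment requirement of the wider element.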

from RVI:

We considered but did not include static branch hints in the instruction encoding. These
can reduce the pressure on dynamic predictors, but require more instruction encoding space and
software profiling for best results, and can result in poor performance if production runs do not
match profiling runs.

  1. Only physical implementations that have differing in-register order formats will potentially benefit from these hints. It thus makes sense to me that such hints be proposed as an extension rather than in the base. It would then get review for the matters considered below.

  2. better mechanisms should be possible to optimize behaviour for such machines
    a) implementations may stack and pop internal width information (just as JALR uses a convention to stack and pop return addresses). This would be based on a convention of pairing stores and loads.
    A possible convention is to stack the width on a whole register store whose address is based on r2, the stack pointer.
    b) Another approach implementations could use is to store the width together with the address of the store, so that a load from that address will be associated with the corresponding internal width format.
    c) an implementation could always load with the internal format that is least disruptive for its ELEN range, or dynamically change the default given the recently executed whole register stores.
    d) similar tracking that a microarch performs over a loop: either
    it decodes ahead, determines the future usage width for the register, and stores according to that format, or
    it profiles the use during the last pass through the loop and applies the appropriate internal format from that.
    This latter is an optimization that potentially benefits machines with multiple in-register orders. It is questionable whether this tracking will provide value, as subsequent use of a register with a different EEW than written is relatively rare (though frequent enough and, as we decided, impactful enough to matter for an explicit SLEN<VLEN).

As with JALR, the conventions and the recommended hardware response can be refined over time.

  3. the designation has diminished value for a multiple register load with a single width hint.
    When a group of registers is loaded, the hint can only provide the width for one of them. Forcing all registers into a single width may optimize the subsequent use of one register at the expense of the (up to 7) other registers.

  4. The typical WR use case is for autonomous/independent spill and fill that are unaware of subsequent use.
    In the envisioned typical use, called routines or interrupt routines will spill and fill registers as needed to perform the requested function. The internal format information is lost over such a process, potentially causing subsequent performance degradation. Although there is the potential for improved code performance, these width hints may actually yield poorer performance in dynamic use situations.

  5. As a hint, hardware need only mask out all width designations and perform the same action regardless of the width hint. However, providing the hint itself will be a cause of confusion, as even minimal vector implementations will need to understand the SLEN<VLEN issues to ascertain that they have no need to consider the hint. Similarly, software exception handlers and even vector routines will need substantial analysis to determine an optimal hint for a given target class (which may not be relevant to specific machines in that class, as described above). Exception handlers especially will be challenged on the optimal use, as no mechanism exists to access the current internal format or even the last written EEW for any set of registers.

  6. hints can lead to lower performance for machines not needing them (a consequence of 5).
    A machine design may (erroneously) use the designated hint width for a WR load, potentially increasing memory-side activity (for example, performing byte transfers when word or cache line transfers would be appropriate and more performant).
    I am sure there will be other examples.
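Options 2(a) and 2(b) above can be sketched in Python as two small bookkeeping structures. This is only an illustration under assumed names and policies (`WidthStack`, `WidthByAddress`, a fixed default width, no capacity limit); nothing here is specified by the draft or proposed in this thread as an actual design.

```python
class WidthStack:
    """Option (a): push the width on a WR store, pop it on the paired WR load."""

    def __init__(self, default_ew: int = 8):
        self.default_ew = default_ew
        self._stack = []

    def on_store(self, ew: int) -> None:
        self._stack.append(ew)

    def on_load(self) -> int:
        # Fall back to the default when no paired store is remembered.
        return self._stack.pop() if self._stack else self.default_ew


class WidthByAddress:
    """Option (b): remember the width used by the store to each address."""

    def __init__(self, default_ew: int = 8):
        self.default_ew = default_ew
        self._table = {}

    def on_store(self, addr: int, ew: int) -> None:
        self._table[addr] = ew

    def on_load(self, addr: int) -> int:
        return self._table.get(addr, self.default_ew)


# Spill two registers (EW=32, then EW=8) and fill them in reverse order.
stack = WidthStack()
stack.on_store(32)
stack.on_store(8)
filled = [stack.on_load(), stack.on_load(), stack.on_load()]

# Address-keyed variant: only the stored address has a remembered width.
table = WidthByAddress()
table.on_store(0x1000, 32)
hit, miss = table.on_load(0x1000), table.on_load(0x2000)
```

Both structures are predictors only: a wrong or missing entry costs performance, not correctness, because the in-memory format does not depend on the internal width.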

@billhuffman
Author

billhuffman commented Jul 8, 2020 via email

@David-Horner
Contributor

Your comments introduced renumbering to the expanded points.

As far as I can tell, there's absolutely no loss of performance. Ever. Can you suggest a case where there's a loss?

From item 6 detail:

hints can lead to lower performance for machines not needing them (a consequence of 5).
A machine design may (erroneously) use the designated hint width for a WR load, potentially increasing memory-side activity (for example, performing byte transfers when word or cache line transfers would be appropriate and more performant).

Your compelling argument is

The performance sensitive cases are for spill/fill in the same function and the compiler knows how the register is being used. In complex functions, spill/fill performance matters in a number of codes that are currently heavily used for us. I expect a factor greater than 2x loss for some functions without this.

This is immediate application with significant anticipated benefit.

As for "better" alternatives, tracking branch behaviour was once "too complex" and did not give a significant return, such that branch hints were the preferred method. Times change, and the RV architecture in particular is designed to be "forever".

I expect the method to apply these hints will evolve as calling conventions and register allocation conventions "improve".

So, although I would now agree with providing the hints, I believe at least one should be reserved for when the compiler cannot make an informed choice, such as in interrupt routines.
My choice would be as previously suggested, e1024.

@billhuffman
Author

billhuffman commented Jul 8, 2020 via email

@kasanovic
Collaborator

kasanovic commented Jul 10, 2020 via email

3 participants