Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

newBE: value_labels_ranges is very slow #2681

Closed
bjorn3 opened this issue Feb 23, 2021 · 5 comments
Closed

newBE: value_labels_ranges is very slow #2681

bjorn3 opened this issue Feb 23, 2021 · 5 comments
Labels
cranelift:goal:compile-time Focus area: how fast Cranelift can compile or how much memory it uses. cranelift Issues related to the Cranelift code generator

Comments

@bjorn3
Copy link
Contributor

bjorn3 commented Feb 23, 2021

It literally took more time than the actual compilation on one profile.

image

@cfallin
Copy link
Member

cfallin commented Feb 23, 2021

It would be interesting to know more about this workload: why was the label-location dataflow analysis particularly slow in this case? Was there a higher than usual density of labels? Many basic blocks?

Considering alternative approaches (against the baseline "regalloc tells us everything directly" approach which regalloc.rs does not currently support):

  • The current approach, adding label ghost-instructions and doing a post-hoc analysis, is general but potentially involves slow analysis. It imposes no additional overhead on generated code, however (i.e. it can describe any generated code). This might be called a "passive" debug-info design.
  • There is an alternative where we explicitly reserve a stackslot for every label and store to this slot whenever the labeled value changes. Then the DWARF info is simple and constant (value label N always lives at BP-offset) but the runtime overhead is potentially considerable. However maybe this isn't so bad if we're already executing unoptimized debug-mode code. Let's call this an "active" debug-info design: rather than describing locations, we set a location and generate code to move data we care about observing.

I actually kind of favor the latter, all other things being equal. As mentioned here (and in the same spirit as the "simpler GC without stackmaps" proposal #2459), I'd like to bias toward better factorization of complexity and less reliance on complex analyses and maintenance of metadata; the "post-hoc analysis" is partway there (the core compiler pipeline only sees blackbox value-label instructions) but this would be further so. Thoughts?

@bjorn3
Copy link
Contributor Author

bjorn3 commented Feb 23, 2021

This particular workload is compiling simple-raytracer with all of of its dependencies.

I would prefer not allocting a stackslot for every value for three reasons: I don't think it is acceptible to regress the already poor debugmode performance of rust even more. I don't want the choice to generate debuginfo or jot to influence the generated code. Gcc also doesn't let it influence the generated code. This has the advantage that enabling debuginfo doesn't change the behaviour of a program in case of UB or miscompilations, thus making them easier to debug. Finally value debuginfo may be useful for on stack replacement in case of a tiered JIT. Regressing performance in this case is unacceptable.

@cfallin
Copy link
Member

cfallin commented Feb 23, 2021

Yeah, perhaps not, though I'd be interested to measure how large the regression would be.

The "right" answer here, I think, is to rely on regalloc.rs to provide us location info per vreg per program-point. Unfortunately without that we're forced to do an analysis of some sort to recover the info.

It's possible the analysis data structures could be improved: I notice that a lot of time is spent in cloning HashMaps for example; perhaps a delta-based scheme with some structure sharing could be designed. If someone has the time to experiment with this (I unfortunately don't at the moment, first priority is completeness for the switchover) I'd be happy to discuss or review...

@froydnj
Copy link
Contributor

froydnj commented Feb 23, 2021

FWIW, GCC does (or at least used to) go with the second approach suggested above of giving everything a fixed stack slot.

@akirilov-arm akirilov-arm added cranelift Issues related to the Cranelift code generator cranelift:goal:compile-time Focus area: how fast Cranelift can compile or how much memory it uses. labels Oct 4, 2021
@cfallin
Copy link
Member

cfallin commented May 4, 2022

This implementation has been deleted now that regalloc2 is in use! RA2 natively supports passing through and translating debug info, so there's no need to reverse-engineer it from the allocated instructions anymore.

@cfallin cfallin closed this as completed May 4, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cranelift:goal:compile-time Focus area: how fast Cranelift can compile or how much memory it uses. cranelift Issues related to the Cranelift code generator
Projects
None yet
Development

No branches or pull requests

4 participants