Make emitted egal code more loopy #54121

Merged: 1 commit merged into master from kf/loopyegal on Apr 25, 2024
Conversation

@Keno (Member) commented Apr 17, 2024:

The strategy here is to look at (data, padding) pairs and RLE them into loops, so that repeated adjacent patterns use a loop rather than getting unrolled. On the test case from #54109, this makes compilation essentially instant, while also being faster at runtime (turns out LLVM spends a massive amount of time AND the answer is bad).

There are some obvious further enhancements possible here:

  1. The memcmp constant is small. LLVM has a pass to inline these with better code. However, we don't have it turned on. We should consider vendoring it, though we may want to add some shortcutting to it to avoid having it iterate through each function.
  2. This only does one level of sequence matching. It could be recursed to turn things into nested loops.

However, this solves the immediate issue, so hopefully it's a useful start. Fixes #54109.
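
(To illustrate the run-length-encoding idea described above, here is a minimal, hedged C sketch; the chunk_t/run_t/rle_chunks names are hypothetical and are not the PR's actual codegen data structures.)

#include <stddef.h>

/* Hypothetical per-chunk descriptor: `data` bytes that must be compared,
   followed by `padding` bytes that must be skipped. */
typedef struct { size_t data; size_t padding; } chunk_t;

/* A run of `count` identical adjacent chunks: the emitted comparison can
   loop `count` times over one (data, padding) pattern instead of being
   unrolled `count` times. */
typedef struct { chunk_t chunk; size_t count; } run_t;

/* Run-length encode adjacent equal chunks; `runs` must have room for up
   to `n` entries. Returns the number of runs written. */
static size_t rle_chunks(const chunk_t *chunks, size_t n, run_t *runs)
{
    size_t nruns = 0;
    for (size_t i = 0; i < n;) {
        size_t j = i + 1;
        while (j < n && chunks[j].data == chunks[i].data &&
                        chunks[j].padding == chunks[i].padding)
            j++;
        runs[nruns].chunk = chunks[i];
        runs[nruns].count = j - i;
        nruns++;
        i = j;
    }
    return nruns;
}

(Each run with count > 1 can then be lowered to a loop whose body compares `data` bytes and advances past `data + padding` bytes, rather than emitting that comparison `count` times.)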

@gbaraldi (Member):

We do run the pass that inlines memcmps, but it's a backend pass. I've been chipping away at moving it to the middle end in llvm/llvm-project#77370, but it still needs some love
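
(For context: the expansion referred to here replaces a small, fixed-size memcmp call with wide loads and integer compares. A hedged, hand-written C sketch of the general idea for an equality-only 16-byte compare, not LLVM's actual output:)

#include <stdint.h>
#include <string.h>

/* Equality-only expansion of memcmp(a, b, 16): two 8-byte loads per side,
   XORed and ORed together instead of a call into libc. memcpy keeps the
   loads well-defined regardless of alignment; compilers lower it to plain
   load instructions. */
static int bytes16_equal(const void *a, const void *b)
{
    uint64_t a0, a1, b0, b1;
    memcpy(&a0, a, 8);
    memcpy(&a1, (const char *)a + 8, 8);
    memcpy(&b0, b, 8);
    memcpy(&b1, (const char *)b + 8, 8);
    return ((a0 ^ b0) | (a1 ^ b1)) == 0;
}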

@Keno (Member, Author) commented Apr 18, 2024:

That probably explains why the performance is better than I expected.

@gbaraldi (Member):

I played a bit with this in #52719, which is what made me start doing the LLVM work.

@Keno (Member, Author) commented Apr 21, 2024:

@nanosoldier runtests()

@nanosoldier (Collaborator):

The package evaluation job you requested has completed - possible new issues were detected.
The full report is available.

@gbaraldi (Member):

@nanosoldier runbenchmarks(ALL, vs=":master")

@nanosoldier (Collaborator):

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here.

@gbaraldi (Member):

Performance looks good

@oscardssmith added the labels "performance (Must go faster)" and "compiler:codegen (Generation of LLVM IR and native code)" on Apr 23, 2024
@Keno (Member, Author) commented Apr 23, 2024:

Performance is fine, but as shown in pkgeval, we're missing some edge cases, since we're double-booking the meaning of `haspadding` to also mean bits-comparable. I need to think about whether to implement the edge case or just fall back.
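
(Background for the remark above: padding is one reason a raw bitwise comparison is not always a valid egal check, and the `isbitsegal` layout flag visible in the diff below appears intended to track bitwise-comparability separately from padding. A hedged C illustration with a hypothetical type, not code from the PR:)

#include <stdint.h>

/* On typical 64-bit ABIs this struct has 7 padding bytes after `tag` whose
   contents are unspecified, so two logically equal values may differ when
   compared byte-for-byte with memcmp. */
struct padded {
    int8_t  tag;    /* followed by 7 padding bytes */
    int64_t value;
};

/* A sound equality check compares only the defined fields and skips the
   padding. */
static int padded_equal(const struct padded *a, const struct padded *b)
{
    return a->tag == b->tag && a->value == b->value;
}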

@Keno force-pushed the kf/loopyegal branch 2 times, most recently from b5a5ea1 to fe67c79, on April 24, 2024 at 05:46
@Keno (Member, Author) commented Apr 24, 2024:

@nanosoldier runtests(["ReduceWindows", "NonlinearSystems", "StatsModels", "GeoParquet", "GeoStatsModels", "MicroCanonicalHMC", "DistributedStwdLDA", "Agents", "SurfaceCoverage", "JumpProblemLibrary", "SDEProblemLibrary", "ConceptualClimateModels", "ReactionNetworkImporters", "Phylo", "ModelOrderReduction", "LibTrixi", "ReactionSensitivity", "Biofilm", "Turkie", "BloqadeODE", "IonSim"])

@nanosoldier (Collaborator):

The package evaluation job you requested has completed - possible new issues were detected.
The full report is available.

@Keno force-pushed the kf/loopyegal branch 3 times, most recently from ee760e0 to 1ac52df, on April 25, 2024 at 18:13
@Keno merged commit 50833c8 into master on Apr 25, 2024 (5 of 7 checks passed)
@Keno deleted the kf/loopyegal branch on April 25, 2024 at 23:21
@vtjnash (Member) left a comment:

Very fancy! Sgtm (though I didn't review the recursive algorithm fully, I trust our tests would have caught any issues with it)

@@ -621,18 +624,17 @@ void jl_compute_field_offsets(jl_datatype_t *st)
// if we have no fields, we can trivially skip the rest
if (st == jl_symbol_type || st == jl_string_type) {
// opaque layout - heap-allocated blob
-    static const jl_datatype_layout_t opaque_byte_layout = {0, 0, 1, -1, 1, {0}};
+    static const jl_datatype_layout_t opaque_byte_layout = {0, 0, 1, -1, 1, { .haspadding = 0, .fielddesc_type=0, .isbitsegal=1, .arrayelem_isboxed=0, .arrayelem_isunion=0 }};
Member:

I thought this was a GCC extension. Did it finally make it into the standard with reasonable behavior? It is certainly much better syntax for readability/maintainability!

@Keno (Member, Author):

I believe it's in C99. The C++ version was a GCC extension until C++20, so that may be what you're thinking of.
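
(For reference, a minimal example of the syntax in question; the struct below is illustrative, not the actual jl_datatype_layout_t definition. Designated initializers are standard C since C99; C++ only gained them, in a restricted form requiring declaration order, in C++20.)

/* Fields are named explicitly; anything omitted is zero-initialized, which
   is easier to keep correct than a positional initializer list. */
struct flags_example {
    unsigned haspadding : 1;
    unsigned isbitsegal : 1;
    unsigned arrayelem_isboxed : 1;
};

static const struct flags_example opaque_flags = {
    .haspadding = 0,
    .isbitsegal = 1,
    .arrayelem_isboxed = 0,
};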

// Emit memcmp. TODO: LLVM has a pass to expand this for additional
// performance.
Value *this_answer = ctx.builder.CreateCall(prepare_call(memcmp_func),
{ ptr1,
Member:

This load from a GC value appears to be unrooted (as required by the ABI for memcmp). Please copy the code from the other memcmp branch to add the roots to this (unless I missed something).

Keno added a commit that referenced this pull request on Apr 26, 2024: "As requested in post-commit review on #54121."
Labels: compiler:codegen (Generation of LLVM IR and native code), performance (Must go faster)
Closes: Excessive LLVM time in egal codegen of large struct (#54109)
5 participants