
Conversation

@hsharsha

Motivation

Technical Details

Test Plan

Test Result

Submission Checklist

basioli-k and others added 30 commits November 11, 2025 04:30
…ut running HLO passes

PiperOrigin-RevId: 830848849
…ferization.

I am going to rewrite the tiled lowering pipeline to start taking advantage of destination-passing-style ops & bufferization, but first I need to ensure the fusion compiler works for it.

PiperOrigin-RevId: 830849292
This change adds checks to ensure that indices used to access `lhs_output_tile_sizes` and `block_level_parameters.output_tile_sizes` are within the valid range, preventing potential out-of-bounds errors.
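A minimal sketch of the kind of guard this describes, assuming an `absl::Status`-returning caller; the container names come from the message, everything else is illustrative:

```cpp
#include "absl/status/status.h"
#include "absl/strings/str_cat.h"

// Hypothetical guard mirroring the described check; the real surrounding
// code and types are not shown in this commit message.
absl::Status CheckTileSizeIndex(size_t index, size_t lhs_tile_sizes_count,
                                size_t output_tile_sizes_count) {
  if (index >= lhs_tile_sizes_count || index >= output_tile_sizes_count) {
    return absl::OutOfRangeError(
        absl::StrCat("output tile size index ", index, " is out of range"));
  }
  return absl::OkStatus();
}
```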

PiperOrigin-RevId: 830855360
PiperOrigin-RevId: 830858428
PiperOrigin-RevId: 830863081
PiperOrigin-RevId: 830867633
This change extends the `BufferDebugFloatCheckEntry` struct to include counts for infinities and zeros. The CUDA kernel for float checking is updated to compute and store these additional counts in the log entries.
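As a rough illustration of the classification the kernel performs, here is a host-side C++ equivalent; the struct name comes from the message, while the field names, types, and layout are assumptions:

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>

struct BufferDebugFloatCheckEntry {
  uint64_t nan_count = 0;
  uint64_t inf_count = 0;   // newly added count
  uint64_t zero_count = 0;  // newly added count
};

// Tallies NaNs, infinities, and zeros the way the CUDA kernel is described
// as doing, but sequentially on the host.
void TallyFloats(const float* data, size_t n, BufferDebugFloatCheckEntry& e) {
  for (size_t i = 0; i < n; ++i) {
    if (std::isnan(data[i])) {
      ++e.nan_count;
    } else if (std::isinf(data[i])) {
      ++e.inf_count;
    } else if (data[i] == 0.0f) {
      ++e.zero_count;
    }
  }
}
```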

PiperOrigin-RevId: 830870471
The legacy emitter will be deleted soon. Change tests to use the new generic emitter instead.

Migrates various HLO test cases from the `__triton_gemm` backend config to the new `__triton_nested_gemm_fusion` structure, which involves defining separate nested fusions for the LHS and RHS of the dot operation. Also adjusts legacy test configurations.

PiperOrigin-RevId: 830888333
Add native_backend before block_level_emitter_backend. When tests compare against goldens with `--xla_gpu_deterministic_ops=true`, the first config is simply chosen, so they will now pick the native emitter backend and I don't have to update the sensitive goldens for the flag flip.

PiperOrigin-RevId: 830895045
…parts of the fusion_emitter APIs.

This change introduces a `FusionEmitter` base class which currently emits xtile.
This is done to remove se::DeviceDescription from the fusion_emitter API in the interest of sharing the infra between CPU and GPU.

The Triton specific logic is implemented in `TritonFusionEmitter`.
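A sketch of the shape this split suggests; the two class names appear in the message, while the method name, its signature, and the use of `absl::Status` are assumptions:

```cpp
#include "absl/status/status.h"
#include "mlir/IR/BuiltinOps.h"

// Target-neutral base: note there is no se::DeviceDescription in the
// interface, which is what lets CPU and GPU share it.
class FusionEmitter {
 public:
  virtual ~FusionEmitter() = default;
  // Hypothetical entry point: emits xtile IR for a fusion.
  virtual absl::Status EmitXTile(mlir::ModuleOp module) = 0;
};

// The Triton-specific lowering lives in the subclass.
class TritonFusionEmitter : public FusionEmitter {
 public:
  absl::Status EmitXTile(mlir::ModuleOp module) override;
};
```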

PiperOrigin-RevId: 830908750
This just means it will use the original operand value if there is no mapping. This covers the odd case where the return op is not within the reduction space, for example a constant:

```
%cst = arith.constant dense<true> : tensor<i1>
...
%12 = stablehlo.reduce(%11 init: %cst_0) across dimensions = [1] : (tensor<4x8xi1>, tensor<i1>) -> tensor<4xi1>
     reducer(%arg3: tensor<i1>, %arg4: tensor<i1>)  {
      stablehlo.return %cst : tensor<i1>
    }
...
```

Previously, this case crashed; now it works.
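The fallback pattern, assuming MLIR's `IRMapping`, whose `lookupOrDefault` provides exactly this behavior; whether the actual change uses `IRMapping` is an assumption:

```cpp
#include "mlir/IR/IRMapping.h"
#include "mlir/IR/Value.h"

// Returns the mapped value if the reducer return operand was remapped, and
// falls back to the original value otherwise (e.g. a constant defined
// outside the reduction body, as in the snippet above).
mlir::Value RemapReturnOperand(const mlir::IRMapping& mapping,
                               mlir::Value operand) {
  return mapping.lookupOrDefault(operand);
}
```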

PiperOrigin-RevId: 830916653
Cholesky HLO was the only user of cusolver. After cr/817718825 switched JAX to using FFI for cusolver instead of the dedicated op, we can remove the lowering to the cusolver custom-call and CholeskyThunk. TF users, if any, can rely on the CholeskyExpander added to the GPU pipeline.

PiperOrigin-RevId: 830927167
This change makes `BufferDebugLog` a template class `BufferDebugLog<Entry>`, inheriting from a new `BufferDebugLogBase`. This allows methods like `RequiredSizeForEntries`, `CreateOnDevice`, and `ReadFromDevice` to be type-safe and simplifies their usage by removing the need to explicitly pass the `Entry` type in many calls.
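A sketch of the refactor's shape, assuming the base class holds the untyped plumbing and the template binds `Entry` once; names beyond those in the message are hypothetical:

```cpp
#include <cstddef>

class BufferDebugLogBase {
 protected:
  // Hypothetical untyped implementation shared by all entry types.
  static size_t RequiredSizeForEntriesImpl(size_t entry_count,
                                           size_t entry_size) {
    return entry_count * entry_size;
  }
};

template <typename Entry>
class BufferDebugLog : public BufferDebugLogBase {
 public:
  // Callers no longer pass the Entry type explicitly on each call.
  static size_t RequiredSizeForEntries(size_t entry_count) {
    return RequiredSizeForEntriesImpl(entry_count, sizeof(Entry));
  }
  // CreateOnDevice / ReadFromDevice would likewise be typed on Entry.
};
```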

PiperOrigin-RevId: 830956449
…usion_compiler_opt`.

PiperOrigin-RevId: 830997907
PiperOrigin-RevId: 831009555
PiperOrigin-RevId: 831020229
…ds. Remaining methods will be updated in follow-up CLs.

PiperOrigin-RevId: 831046450
… LocalToGlobal custom-calls.

cl/826134489 only clears the frontend attributes for mhlo.custom-call. This CL adds support for stablehlo.custom-call.

PiperOrigin-RevId: 831055853
…utations but with force delay

PiperOrigin-RevId: 831133591
The image now has pyyaml.

PiperOrigin-RevId: 831202578
PiperOrigin-RevId: 831206765
…so on Apple.

TSan seems not to work with `mstats()` (other sanitizers are fine).
As per `lite/profiling/BUILD:160`, this could happen, but due to how the `tflite_portable_test_suite_combined` rule works, this test will be included if the tsan flag is passed, even if the test itself is `notsan`.

PiperOrigin-RevId: 831261356
Imported from GitHub PR openxla/xla#31348

Since ROCm 7, hipGetLastError returns the last recorded error even if the most recent call was successful. This change makes XLA work correctly with ROCm 7: we need to call hipGetLastError to reset the per-thread error state.

In particular, it fixes a previously failing test in TransformerEngine compiled with ROCm 7: `pytest -vvv -s tests/jax/test_distributed_layernorm_mlp.py::TestDistributedLayernormMLP::test_layernorm_fp8_mlp_primitive[True-bfloat16-activation_type0-input_shape0-mesh_config1]`
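A sketch of the reset pattern under the ROCm 7 semantics described above; the surrounding peer-access handling is an assumption, and only the `hipGetLastError()` reset is the point:

```cpp
#include <hip/hip_runtime.h>

// Enables peer access and clears the sticky per-thread error state, so a
// later hipGetLastError() does not observe a stale or benign error under
// ROCm 7 semantics.
hipError_t EnablePeerAccessAndResetError(int peer_device) {
  hipError_t err = hipDeviceEnablePeerAccess(peer_device, /*flags=*/0);
  if (err == hipErrorPeerAccessAlreadyEnabled) {
    err = hipSuccess;  // benign: access was already enabled
  }
  (void)hipGetLastError();  // reset the per-thread error state
  return err;
}
```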

@xla-rotation could you review my PR, please?

Copybara import of the project:

--
0b72c66a759147a9f85ca0e3bc7c59e33d0ac9b8 by songlin <Songlin.Piao@amd.com>:

added rocm7 support to EnablePeerAccess

--
5166268be4e1246a960b4bef687bb8e4122ed9bf by songlin <Songlin.Piao@amd.com>:

use wrap namespace, clang-format and add comments

Merging this change closes tensorflow#31348

PiperOrigin-RevId: 831263261
Imported from GitHub PR openxla/xla#33794

📝 Summary of Changes
Support int4 in cuDNN GEMM fusions.

🎯 Justification
Accelerates some int4 GEMM fusions (under the flag xla_gpu_cudnn_gemm_fusion_level).

🚀 Kind of Contribution
⚡️ Performance Improvement

📊 Benchmark (for Performance Improvements)
> Please measure and include speedups for one of the public HLOs in
> `compiler/xla/tools/benchmarks/hlo/`.

These do not use int4.

🧪 Unit Tests:
yes

🧪 Execution Tests:
yes
Copybara import of the project:

--
e1b8dc7daff4963b93152d2a5c81c4d91a9f14d8 by Ilia Sergachev <isergachev@nvidia.com>:

[GPU] Support int4 in cuDNN GEMM fusions.

Merging this change closes tensorflow#33794

PiperOrigin-RevId: 831264661
tensorflower-gardener and others added 14 commits November 17, 2025 05:37
- This fixes the xla_gpu_cublas_fallback flag behavior.

PiperOrigin-RevId: 833294523
This was probably an oversight. Our tests did not detect this, as SortRewriter
will not rewrite if there is no kernel registered for the data type.
Adjust the SortSupportsType test so that it catches this issue.

PiperOrigin-RevId: 833305248
This change introduces free functions like `CreateSymbolicConstant`, `CreateSymbolicVariable`, and `CreateSymbolicBinaryOp` that directly accept an `mlir::MLIRContext*`. The `SymbolicExprContext` class is now deprecated, and its methods are refactored to call these new free functions. The internal `SymbolicExprStorage` now holds a pointer to `mlir::MLIRContext`. I left `SymbolicExprContext` in place for now, until the rest of the refactor is done.

PiperOrigin-RevId: 833305709
This heuristic should improve runtime on the Blackwell architecture for fusions
with small data types (<= 16 bits).

PiperOrigin-RevId: 833308087
…5d8f6c9a6554d0b004bc0…

Imported from GitHub PR openxla/xla#33906

…cd474

📝 Summary of Changes
Upgrade bitcode library. Small cleanup of build rules.

🎯 Justification
Pull in the changes for the new gfx archs.

🚀 Kind of Contribution
♻️ Cleanup

📊 Benchmark (for Performance Improvements)
N/A

🧪 Unit Tests:
N/A

🧪 Execution Tests:
N/A

Copybara import of the project:

--
688346550b3733693459bbcef4e6d3a754bb2746 by Dragan Mladjenovic <Dragan.Mladjenovic@amd.com>:

[ROCm] upgrade bitcode library to fcc50fb091b7c75d8f6c9a6554d0b004bc0cd474

Merging this change closes tensorflow#33906

PiperOrigin-RevId: 833312610
For now this just unconditionally copies from the memref buffer to the bufferized tensor; in the future we can do some analysis to avoid this in the un-strided case when we know all instances are in bounds.

PiperOrigin-RevId: 833313677
Use a placeholder cuDNN version instead of the hard-coded 9.10 to prevent the test from failing after a cuDNN update.

The test only fills in the current version when the stored version is exactly "1.2.3". The version is used as part of the cache key, so hard-coding 9.10 only works specifically on cuDNN 9.10.
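The substitution rule, as a minimal sketch; the helper name is hypothetical, and only the "1.2.3" sentinel comes from the message:

```cpp
#include <string>

// Only the sentinel stored with the test data is replaced by the runtime
// cuDNN version; real versions pass through, so the version component of
// the cache key stays meaningful across cuDNN updates.
std::string ResolveCudnnVersionForTest(const std::string& stored,
                                       const std::string& current) {
  return stored == "1.2.3" ? current : stored;
}
```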

PiperOrigin-RevId: 833319023
This change removes the dependency on `SymbolicExprContext` by using `mlir::MLIRContext` as the primary context for creating symbolic expressions within the `SymbolicMap` and its conversion utilities.

This simplifies the context management and aligns better with MLIR's infrastructure.

PiperOrigin-RevId: 833323582
PiperOrigin-RevId: 833326696
- There are cases in gemm_fusion_autotuner where we don't have a reference output from cuBLAS and we skip the requested correctness check. Hence update the code in the new infra to match the current state.

PiperOrigin-RevId: 833337857
@hsharsha hsharsha force-pushed the develop-upstream-sync-20251117 branch from e5f8f74 to 6210425 on November 19, 2025 12:19
@hsharsha hsharsha marked this pull request as draft November 19, 2025 15:41
@hsharsha hsharsha force-pushed the develop-upstream-sync-20251117 branch 3 times, most recently from ac6f47c to 7e71814 on November 20, 2025 10:51
@hsharsha hsharsha force-pushed the develop-upstream-sync-20251117 branch from 7e71814 to 067d23a on November 20, 2025 11:13
@hsharsha hsharsha marked this pull request as ready for review November 20, 2025 14:46
@hsharsha (Author)

Looks like it works

@hsharsha hsharsha force-pushed the develop-upstream-sync-20251117 branch from ee15a6b to 6e5266c on November 22, 2025 20:38