
Conversation

@hsharsha

Motivation

Technical Details

Test Plan

Test Result

Submission Checklist

basioli-k and others added 30 commits November 11, 2025 04:30
…ut running HLO passes

PiperOrigin-RevId: 830848849
…ferization.

I am going to rewrite the tiled lowering pipeline to start taking advantage of destination-passing-style ops & bufferization, but first I need to ensure the fusion compiler works for it.

PiperOrigin-RevId: 830849292
This change adds checks to ensure that indices used to access `lhs_output_tile_sizes` and `block_level_parameters.output_tile_sizes` are within the valid range, preventing potential out-of-bounds errors.
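A minimal sketch of the kind of guard this describes, assuming an `absl::Status`-returning caller; the container names come from the message, everything else is illustrative:

```cpp
#include "absl/status/status.h"
#include "absl/strings/str_cat.h"

// Hypothetical guard mirroring the described check; the real surrounding
// code and types are not shown in this commit message.
absl::Status CheckTileSizeIndex(size_t index, size_t lhs_tile_sizes_count,
                                size_t output_tile_sizes_count) {
  if (index >= lhs_tile_sizes_count || index >= output_tile_sizes_count) {
    return absl::OutOfRangeError(
        absl::StrCat("output tile size index ", index, " is out of range"));
  }
  return absl::OkStatus();
}
```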

PiperOrigin-RevId: 830855360
PiperOrigin-RevId: 830858428
PiperOrigin-RevId: 830863081
PiperOrigin-RevId: 830867633
This change extends the `BufferDebugFloatCheckEntry` struct to include counts for infinities and zeros. The CUDA kernel for float checking is updated to compute and store these additional counts in the log entries.
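As a rough illustration of the classification the kernel performs, here is a host-side C++ equivalent; the struct name comes from the message, while the field names, types, and layout are assumptions:

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>

struct BufferDebugFloatCheckEntry {
  uint64_t nan_count = 0;
  uint64_t inf_count = 0;   // newly added count
  uint64_t zero_count = 0;  // newly added count
};

// Tallies NaNs, infinities, and zeros the way the CUDA kernel is described
// as doing, but sequentially on the host.
void TallyFloats(const float* data, size_t n, BufferDebugFloatCheckEntry& e) {
  for (size_t i = 0; i < n; ++i) {
    if (std::isnan(data[i])) {
      ++e.nan_count;
    } else if (std::isinf(data[i])) {
      ++e.inf_count;
    } else if (data[i] == 0.0f) {
      ++e.zero_count;
    }
  }
}
```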

PiperOrigin-RevId: 830870471
The legacy emitter will be deleted soon. Change tests to use the new generic emitter instead.

Migrates various HLO test cases from the `__triton_gemm` backend config to the new `__triton_nested_gemm_fusion` structure, which involves defining separate nested fusions for the LHS and RHS of the dot operation. Also adjusts legacy test configurations.

PiperOrigin-RevId: 830888333
Add native_backend before block_level_emitter_backend. When tests compare against goldens with `--xla_gpu_deterministic_ops=true`, the first config is simply chosen, so they will now pick the native emitter backend and I don't have to update the sensitive goldens for the flag flip.

PiperOrigin-RevId: 830895045
…parts of the fusion_emitter APIs.

This change introduces a `FusionEmitter` base class which currently emits xtile.
This is done to remove se::DeviceDescription from the fusion_emitter API in the interest of sharing the infra between CPU and GPU.

The Triton specific logic is implemented in `TritonFusionEmitter`.
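A sketch of the shape this split suggests; the two class names appear in the message, while the method name, its signature, and the use of `absl::Status` are assumptions:

```cpp
#include "absl/status/status.h"
#include "mlir/IR/BuiltinOps.h"

// Target-neutral base: note there is no se::DeviceDescription in the
// interface, which is what lets CPU and GPU share it.
class FusionEmitter {
 public:
  virtual ~FusionEmitter() = default;
  // Hypothetical entry point: emits xtile IR for a fusion.
  virtual absl::Status EmitXTile(mlir::ModuleOp module) = 0;
};

// The Triton-specific lowering lives in the subclass.
class TritonFusionEmitter : public FusionEmitter {
 public:
  absl::Status EmitXTile(mlir::ModuleOp module) override;
};
```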

PiperOrigin-RevId: 830908750
This just means it will use the original operand value if there is no mapping. This covers the odd case where the return op is not within the reduction space, for example a constant:

```
%cst = arith.constant dense<true> : tensor<i1>
...
%12 = stablehlo.reduce(%11 init: %cst_0) across dimensions = [1] : (tensor<4x8xi1>, tensor<i1>) -> tensor<4xi1>
     reducer(%arg3: tensor<i1>, %arg4: tensor<i1>)  {
      stablehlo.return %cst : tensor<i1>
    }
...
```

Previously, this case crashed; now it works.
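The fallback pattern, assuming MLIR's `IRMapping`, whose `lookupOrDefault` provides exactly this behavior; whether the actual change uses `IRMapping` is an assumption:

```cpp
#include "mlir/IR/IRMapping.h"
#include "mlir/IR/Value.h"

// Returns the mapped value if the reducer return operand was remapped, and
// falls back to the original value otherwise (e.g. a constant defined
// outside the reduction body, as in the snippet above).
mlir::Value RemapReturnOperand(const mlir::IRMapping& mapping,
                               mlir::Value operand) {
  return mapping.lookupOrDefault(operand);
}
```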

PiperOrigin-RevId: 830916653
Cholesky HLO was the only user of cusolver. After cr/817718825 switched JAX to using FFI for cusolver instead of the dedicated op, we can remove the lowering to the cusolver custom-call and CholeskyThunk. TF users, if any, can rely on the CholeskyExpander added to the GPU pipeline.

PiperOrigin-RevId: 830927167
This change makes `BufferDebugLog` a template class `BufferDebugLog<Entry>`, inheriting from a new `BufferDebugLogBase`. This allows methods like `RequiredSizeForEntries`, `CreateOnDevice`, and `ReadFromDevice` to be type-safe and simplifies their usage by removing the need to explicitly pass the `Entry` type in many calls.
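A sketch of the refactor's shape, assuming the base class holds the untyped plumbing and the template binds `Entry` once; names beyond those in the message are hypothetical:

```cpp
#include <cstddef>

class BufferDebugLogBase {
 protected:
  // Hypothetical untyped implementation shared by all entry types.
  static size_t RequiredSizeForEntriesImpl(size_t entry_count,
                                           size_t entry_size) {
    return entry_count * entry_size;
  }
};

template <typename Entry>
class BufferDebugLog : public BufferDebugLogBase {
 public:
  // Callers no longer pass the Entry type explicitly on each call.
  static size_t RequiredSizeForEntries(size_t entry_count) {
    return RequiredSizeForEntriesImpl(entry_count, sizeof(Entry));
  }
  // CreateOnDevice / ReadFromDevice would likewise be typed on Entry.
};
```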

PiperOrigin-RevId: 830956449
…usion_compiler_opt`.

PiperOrigin-RevId: 830997907
PiperOrigin-RevId: 831009555
PiperOrigin-RevId: 831020229
…ds. Remaining methods will be updated in follow-up CLs.

PiperOrigin-RevId: 831046450
… LocalToGlobal custom-calls.

cl/826134489 only clears the frontend attributes for mhlo.custom-call. This CL adds support for stablehlo.custom-call.

PiperOrigin-RevId: 831055853
…utations but with force delay

PiperOrigin-RevId: 831133591
The image now has pyyaml.

PiperOrigin-RevId: 831202578
PiperOrigin-RevId: 831206765
…so on Apple.

TSan seems not to work with `mstats()` (other sanitizers are fine).
As per `lite/profiling/BUILD:160`, this could happen, but due to how the `tflite_portable_test_suite_combined` rule works, this test will be included if the tsan flag is passed, even if the test itself is `notsan`.

PiperOrigin-RevId: 831261356
Imported from GitHub PR openxla/xla#31348

Since ROCm 7, hipGetLastError returns the last recorded error even if the most recent call was successful. This change makes XLA work correctly with ROCm 7: we need to call hipGetLastError to reset the per-thread error state.

In particular, it fixes a previously failing test in TransformerEngine compiled with ROCm 7: `pytest -vvv -s tests/jax/test_distributed_layernorm_mlp.py::TestDistributedLayernormMLP::test_layernorm_fp8_mlp_primitive[True-bfloat16-activation_type0-input_shape0-mesh_config1]`
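A sketch of the reset pattern under the ROCm 7 semantics described above; the surrounding peer-access handling is an assumption, and only the `hipGetLastError()` reset is the point:

```cpp
#include <hip/hip_runtime.h>

// Enables peer access and clears the sticky per-thread error state, so a
// later hipGetLastError() does not observe a stale or benign error under
// ROCm 7 semantics.
hipError_t EnablePeerAccessAndResetError(int peer_device) {
  hipError_t err = hipDeviceEnablePeerAccess(peer_device, /*flags=*/0);
  if (err == hipErrorPeerAccessAlreadyEnabled) {
    err = hipSuccess;  // benign: access was already enabled
  }
  (void)hipGetLastError();  // reset the per-thread error state
  return err;
}
```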

@xla-rotation could you review my PR, please?

Copybara import of the project:

--
0b72c66a759147a9f85ca0e3bc7c59e33d0ac9b8 by songlin <Songlin.Piao@amd.com>:

added rocm7 support to EnablePeerAccess

--
5166268be4e1246a960b4bef687bb8e4122ed9bf by songlin <Songlin.Piao@amd.com>:

use wrap namespace, clang-format and add comments

Merging this change closes tensorflow#31348

PiperOrigin-RevId: 831263261
Imported from GitHub PR openxla/xla#33794

📝 Summary of Changes
Support int4 in cuDNN GEMM fusions.

🎯 Justification
Accelerates some int4 GEMM fusions (under the flag xla_gpu_cudnn_gemm_fusion_level).

🚀 Kind of Contribution
⚡️ Performance Improvement

📊 Benchmark (for Performance Improvements)
> Please measure and include speedups for one of the public HLOs in
> `compiler/xla/tools/benchmarks/hlo/`.

These do not use int4.

🧪 Unit Tests:
yes

🧪 Execution Tests:
yes
Copybara import of the project:

--
e1b8dc7daff4963b93152d2a5c81c4d91a9f14d8 by Ilia Sergachev <isergachev@nvidia.com>:

[GPU] Support int4 in cuDNN GEMM fusions.

Merging this change closes tensorflow#33794

PiperOrigin-RevId: 831264661
tensorflower-gardener and others added 14 commits November 17, 2025 05:37
- This fixes the xla_gpu_cublas_fallback flag behavior.

PiperOrigin-RevId: 833294523
This was probably an oversight. Our tests did not detect this, as SortRewriter
will not rewrite if there is no kernel registered for the data type.
Adjust the SortSupportsType test so that it catches this issue.

PiperOrigin-RevId: 833305248
This change introduces free functions like `CreateSymbolicConstant`, `CreateSymbolicVariable`, and `CreateSymbolicBinaryOp` that directly accept an `mlir::MLIRContext*`. The `SymbolicExprContext` class is now deprecated, and its methods are refactored to call these new free functions. The internal `SymbolicExprStorage` now holds a pointer to `mlir::MLIRContext`. I left `SymbolicExprContext` in place for now, until the rest of the refactor is done.

PiperOrigin-RevId: 833305709
This heuristic should improve runtime on the Blackwell architecture for fusions
with small data types (<= 16 bits).

PiperOrigin-RevId: 833308087
…5d8f6c9a6554d0b004bc0…

Imported from GitHub PR openxla/xla#33906

…cd474

📝 Summary of Changes
Upgrade bitcode library. Small cleanup of build rules.

🎯 Justification
Pull in the changes for the new gfx archs.

🚀 Kind of Contribution
♻️ Cleanup

📊 Benchmark (for Performance Improvements)
N/A

🧪 Unit Tests:
N/A

🧪 Execution Tests:
N/A

Copybara import of the project:

--
688346550b3733693459bbcef4e6d3a754bb2746 by Dragan Mladjenovic <Dragan.Mladjenovic@amd.com>:

[ROCm] upgrade bitcode library to fcc50fb091b7c75d8f6c9a6554d0b004bc0cd474

Merging this change closes tensorflow#33906

PiperOrigin-RevId: 833312610
For now this just unconditionally copies from the memref buffer to the bufferized tensor; in the future we can do some analysis to avoid this in the un-strided case when we know all instances are in bounds.

PiperOrigin-RevId: 833313677
Use a placeholder cuDNN version instead of the hard-coded 9.10 to prevent the test from failing after a cuDNN update.

The test only fills in the current version when the stored version is exactly "1.2.3". The version is used as part of the cache key, so hard-coding 9.10 only works specifically on cuDNN 9.10.
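The substitution rule, as a minimal sketch; the helper name is hypothetical, and only the "1.2.3" sentinel comes from the message:

```cpp
#include <string>

// Only the sentinel stored with the test data is replaced by the runtime
// cuDNN version; real versions pass through, so the version component of
// the cache key stays meaningful across cuDNN updates.
std::string ResolveCudnnVersionForTest(const std::string& stored,
                                       const std::string& current) {
  return stored == "1.2.3" ? current : stored;
}
```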

PiperOrigin-RevId: 833319023
This change removes the dependency on `SymbolicExprContext` by using `mlir::MLIRContext` as the primary context for creating symbolic expressions within the `SymbolicMap` and its conversion utilities.

This simplifies the context management and aligns better with MLIR's infrastructure.

PiperOrigin-RevId: 833323582
PiperOrigin-RevId: 833326696
- There are cases in gemm_fusion_autotuner where we don't have a reference output from cuBLAS and we skip the requested correctness check. Hence update the code in the new infra to match the current state.

PiperOrigin-RevId: 833337857
@hsharsha hsharsha force-pushed the develop-upstream-sync-20251117 branch from e5f8f74 to 6210425 on November 19, 2025 12:19
@hsharsha hsharsha marked this pull request as draft November 19, 2025 15:41
@hsharsha hsharsha force-pushed the develop-upstream-sync-20251117 branch 3 times, most recently from ac6f47c to 7e71814 on November 20, 2025 10:51
@hsharsha hsharsha force-pushed the develop-upstream-sync-20251117 branch from 7e71814 to 067d23a on November 20, 2025 11:13
@hsharsha hsharsha marked this pull request as ready for review November 20, 2025 14:46
@hsharsha (Author)

Looks like it works

@hsharsha hsharsha force-pushed the develop-upstream-sync-20251117 branch from ee15a6b to 6e5266c on November 22, 2025 20:38