forked from tensorflow/tensorflow
Develop upstream sync 20251117 #3158
Open: hsharsha wants to merge 1,046 commits into develop-upstream from develop-upstream-sync-20251117
+83,397 −50,766
Conversation
…ut running HLO passes PiperOrigin-RevId: 830848849
…ferization. I am going to rewrite the tiled lowering pipeline to start taking advantage of destination-passing style ops & bufferization, but first I need to ensure the fusion compiler works for it. PiperOrigin-RevId: 830849292
This change adds checks to ensure that indices used to access `lhs_output_tile_sizes` and `block_level_parameters.output_tile_sizes` are within the valid range, preventing potential out-of-bounds errors. PiperOrigin-RevId: 830855360
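A minimal sketch of the kind of bounds checking the commit describes (illustrative only, not the actual XLA code): instead of indexing into a tile-size list directly, the access returns an empty optional when the index is out of range.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical helper: return the tile size at index i, or nullopt if the
// index is out of bounds, instead of reading past the end of the vector.
std::optional<long> TileSizeAt(const std::vector<long>& tile_sizes, size_t i) {
  if (i >= tile_sizes.size()) return std::nullopt;
  return tile_sizes[i];
}
```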
PiperOrigin-RevId: 830858428
PiperOrigin-RevId: 830867633
This change extends the `BufferDebugFloatCheckEntry` struct to include counts for infinities and zeros. The CUDA kernel for float checking is updated to compute and store these additional counts in the log entries. PiperOrigin-RevId: 830870471
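A hedged host-side analogue of the classification the CUDA kernel performs (the struct name comes from the commit; the field names and helper are assumptions): each value in a buffer is counted as NaN, infinity, or zero.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical log entry mirroring the extended struct described above:
// NaN counts plus the newly added infinity and zero counts.
struct BufferDebugFloatCheckEntry {
  uint64_t nan_count = 0;
  uint64_t inf_count = 0;
  uint64_t zero_count = 0;
};

// Host-side sketch of the per-buffer float check (the real check runs in a
// CUDA kernel and writes entries into a device-side log).
BufferDebugFloatCheckEntry CheckFloats(const std::vector<float>& buffer) {
  BufferDebugFloatCheckEntry entry;
  for (float v : buffer) {
    if (std::isnan(v)) {
      ++entry.nan_count;
    } else if (std::isinf(v)) {
      ++entry.inf_count;
    } else if (v == 0.0f) {
      ++entry.zero_count;
    }
  }
  return entry;
}
```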
The legacy emitter will be deleted soon. Change tests to use the new generic emitter instead. Updates various HLO test cases from using the `__triton_gemm` backend config to the new `__triton_nested_gemm_fusion` structure, which involves defining separate nested fusions for the LHS and RHS of the dot operation. Also adjusts legacy test configurations. PiperOrigin-RevId: 830888333
PiperOrigin-RevId: 830894329
Add native_backend before block_level_emitter_backend. This means that when tests are comparing against goldens and using `--xla_gpu_deterministic_ops=true`, it will choose the native emitter backend (it just chooses the first config). This means I don't have to update the sensitive goldens for the flag flip. PiperOrigin-RevId: 830895045
…parts of the fusion_emitter APIs. This change introduces a `FusionEmitter` base class which currently emits xtile. This is done to remove se::DeviceDescription from the fusion_emitter API in the interest of sharing the infra between CPU and GPU. The Triton specific logic is implemented in `TritonFusionEmitter`. PiperOrigin-RevId: 830908750
This just means it will use the original operand value if there is no mapping. This covers the odd case where the value returned by the reducer is not defined within the reduction body - for example, a constant:
```
%cst = arith.constant dense<true> : tensor<i1>
...
%12 = stablehlo.reduce(%11 init: %cst_0) across dimensions = [1] : (tensor<4x8xi1>, tensor<i1>) -> tensor<4xi1>
reducer(%arg3: tensor<i1>, %arg4: tensor<i1>) {
stablehlo.return %cst : tensor<i1>
}
...
```
Previously this code crashed; now the case is handled correctly.
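The lookup-with-fallback behavior described above can be sketched as follows (names are illustrative, not the actual XLA/StableHLO code): when a value has no entry in the mapping - as for a constant defined outside the reducer region - the original value is returned unchanged instead of crashing.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Hypothetical helper: look up a value in the remapping table built during
// lowering; if the value was defined outside the region being rewritten
// (e.g. a constant), there is no mapping, so fall back to the value itself.
std::string MapValueOrSelf(
    const std::unordered_map<std::string, std::string>& mapping,
    const std::string& value) {
  auto it = mapping.find(value);
  return it != mapping.end() ? it->second : value;  // no mapping: keep original
}
```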
PiperOrigin-RevId: 830916653
Cholesky HLO was the only user of cusolver. After cr/817718825 switched JAX to using FFI for cusolver instead of the dedicated op, we can remove the lowering to the cusolver custom-call and CholeskyThunk. TF users, if any, can rely on the CholeskyExpander added to the GPU pipeline. PiperOrigin-RevId: 830927167
This change makes `BufferDebugLog` a template class `BufferDebugLog<Entry>`, inheriting from a new `BufferDebugLogBase`. This allows methods like `RequiredSizeForEntries`, `CreateOnDevice`, and `ReadFromDevice` to be type-safe and simplifies their usage by removing the need to explicitly pass the `Entry` type in many calls. PiperOrigin-RevId: 830956449
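A simplified sketch of that refactor (not the real XLA types - a minimal illustration under assumed names): a non-template base class owns the raw storage, while the `BufferDebugLog<Entry>` template layers type-safe methods on top, so callers no longer pass the entry type explicitly at every call site.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Non-template base holding the untyped byte storage.
class BufferDebugLogBase {
 protected:
  std::vector<unsigned char> bytes_;
};

// Typed wrapper: the Entry type is fixed once at the class level, making
// size computation and appends type-safe.
template <typename Entry>
class BufferDebugLog : public BufferDebugLogBase {
 public:
  static size_t RequiredSizeForEntries(size_t n) { return n * sizeof(Entry); }

  void Append(const Entry& e) {
    const auto* p = reinterpret_cast<const unsigned char*>(&e);
    bytes_.insert(bytes_.end(), p, p + sizeof(Entry));
  }

  size_t num_entries() const { return bytes_.size() / sizeof(Entry); }
};
```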
…usion_compiler_opt`. PiperOrigin-RevId: 830997907
PiperOrigin-RevId: 831020229
PiperOrigin-RevId: 831023473
…ds. Remaining methods will be updated in follow-up CLs. PiperOrigin-RevId: 831046450
… LocalToGlobal custom-calls. cl/826134489 only clears the frontend attributes for mhlo.custom-call. This cl adds support for stablehlo.custom-call. PiperOrigin-RevId: 831055853
PiperOrigin-RevId: 831125137
…gyDescription. PiperOrigin-RevId: 831130604
PiperOrigin-RevId: 831133271
…utations but with force delay PiperOrigin-RevId: 831133591
PiperOrigin-RevId: 831147246
The image now has pyyaml PiperOrigin-RevId: 831202578
PiperOrigin-RevId: 831206765
…so on Apple. Tsan seems not to work with `mstats()` (other sanitizers are fine). As per `lite/profiling/BUILD:160`, this could happen, but due to how `tflite_portable_test_suite_combined` rule works, this test will be included if the tsan flag is passed, even if the test itself is `notsan`. PiperOrigin-RevId: 831261356
Imported from GitHub PR openxla/xla#31348. Since ROCm 7, hipGetLastError returns the last error even if the last call was successful, so we need to call hipGetLastError to reset the per-thread error state. This change makes the code work correctly with ROCm 7. In particular, it fixes a previously failing test in TransformerEngine compiled with ROCm 7: `pytest -vvv -s tests/jax/test_distributed_layernorm_mlp.py::TestDistributedLayernormMLP::test_layernorm_fp8_mlp_primitive[True-bfloat16-activation_type0-input_shape0-mesh_config1]` @xla-rotation could you review my PR, please? Copybara import of the project: -- 0b72c66a759147a9f85ca0e3bc7c59e33d0ac9b8 by songlin <Songlin.Piao@amd.com>: added rocm7 support to EnablePeerAccess -- 5166268be4e1246a960b4bef687bb8e4122ed9bf by songlin <Songlin.Piao@amd.com>: use wrap namespace, clang-format and add comments Merging this change closes tensorflow#31348 PiperOrigin-RevId: 831263261
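The sticky-error behavior can be modeled without HIP as follows (this is an illustrative mock, not actual HIP code): since ROCm 7, a successful call no longer clears the per-thread last-error state, so a call-and-discard of the last-error getter is needed to reset it before code that inspects errors.

```cpp
#include <cassert>

// Mock of HIP's per-thread sticky error state (stand-in for the real
// hipGetLastError semantics described in the PR above).
thread_local int g_last_error = 0;

// Like hipGetLastError: returns the sticky error AND clears it, which is
// why a call-and-discard resets the per-thread state.
int GetLastError() {
  int e = g_last_error;
  g_last_error = 0;
  return e;
}

void FailingCall() { g_last_error = 1; }
void SucceedingCall() { /* since ROCm 7: does NOT clear g_last_error */ }
```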
Imported from GitHub PR openxla/xla#33794 📝 Summary of Changes Support int4 in cuDNN GEMM fusions. 🎯 Justification Accelerates some int4 GEMM fusions (under the flag xla_gpu_cudnn_gemm_fusion_level). 🚀 Kind of Contribution ⚡️ Performance Improvement 📊 Benchmark (for Performance Improvements) > Please measure and include speedups for one of the public HLOs in `compiler/xla/tools/benchmarks/hlo/`. These do not use int4. 🧪 Unit Tests: yes 🧪 Execution Tests: yes Copybara import of the project: -- e1b8dc7daff4963b93152d2a5c81c4d91a9f14d8 by Ilia Sergachev <isergachev@nvidia.com>: [GPU] Support int4 in cuDNN GEMM fusions. Merging this change closes tensorflow#33794 PiperOrigin-RevId: 831264661
- This fixes the xla_gpu_cublas_fallback flag behavior. PiperOrigin-RevId: 833294523
This was probably an oversight. Our tests did not detect this, as SortRewriter will not rewrite if there is no kernel registered for the data type. Adjust the SortSupportsType test so that it would catch this issue. PiperOrigin-RevId: 833305248
This change introduces free functions like `CreateSymbolicConstant`, `CreateSymbolicVariable`, and `CreateSymbolicBinaryOp` that directly accept an `mlir::MLIRContext*`. The `SymbolicExprContext` class is now deprecated, and its methods are refactored to call these new free functions. The internal `SymbolicExprStorage` now holds a pointer to `mlir::MLIRContext`. I left the SymbolicExprContext for now until all the refactor is done. PiperOrigin-RevId: 833305709
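The shape of that refactor can be sketched like this (types heavily simplified - the real code uses `mlir::MLIRContext` and XLA's symbolic-expression storage): creation moves from methods on a dedicated context class to free functions that take the context pointer directly.

```cpp
#include <cassert>

// Stand-in for mlir::MLIRContext; here it only hands out unique ids.
struct ContextStub {
  int next_id = 0;
};

struct SymbolicExpr {
  int id;
  bool is_variable;
};

// Free functions accepting the context directly, replacing methods on a
// SymbolicExprContext-style wrapper class.
SymbolicExpr CreateSymbolicConstant(ContextStub* ctx) {
  return SymbolicExpr{ctx->next_id++, /*is_variable=*/false};
}

SymbolicExpr CreateSymbolicVariable(ContextStub* ctx) {
  return SymbolicExpr{ctx->next_id++, /*is_variable=*/true};
}
```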
This heuristic should improve runtime on Blackwell architecture for fusions with small data types (<= 16 bits). PiperOrigin-RevId: 833308087
…5d8f6c9a6554d0b004bc0… Imported from GitHub PR openxla/xla#33906 …cd474 📝 Summary of Changes Upgrade bitcode library. Small cleanup of build rules. 🎯 Justification Pull in the changes for the new gfx archs. 🚀 Kind of Contribution ♻️ Cleanup 📊 Benchmark (for Performance Improvements) N/A 🧪 Unit Tests: N/A 🧪 Execution Tests: N/A Copybara import of the project: -- 688346550b3733693459bbcef4e6d3a754bb2746 by Dragan Mladjenovic <Dragan.Mladjenovic@amd.com>: [ROCm] upgrade bitcode library to fcc50fb091b7c75d8f6c9a6554d0b004bc0cd474 Merging this change closes tensorflow#33906 PiperOrigin-RevId: 833312610
For now this just unconditionally does a copy from the memref buffer to the bufferized tensor, in the future we can do some analysis to avoid this in the un-strided case when we know all instances are in-bounds. PiperOrigin-RevId: 833313677
Use placeholder cuDNN version instead of hard-coded 9.10 to prevent failing after cuDNN update. The test only fills in the current version when the stored version is exactly "1.2.3". The version is used as part of the cache key, so hard-coding 9.10 only works specifically on cuDNN 9.10. PiperOrigin-RevId: 833319023
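The placeholder scheme described above can be sketched as follows (the helper name and key format are illustrative assumptions): the stored cache key uses the sentinel version "1.2.3", which gets replaced with the runtime cuDNN version, so the key no longer breaks when cuDNN moves past a hard-coded 9.10.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: substitute the sentinel version "1.2.3" in a stored
// cache key with the cuDNN version detected at runtime. Only keys containing
// the exact sentinel are rewritten.
std::string SubstituteVersion(std::string key, const std::string& runtime_version) {
  const std::string placeholder = "1.2.3";
  auto pos = key.find(placeholder);
  if (pos != std::string::npos) {
    key.replace(pos, placeholder.size(), runtime_version);
  }
  return key;
}
```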
This change removes the dependency on `SymbolicExprContext` by using `mlir::MLIRContext` as the primary context for creating symbolic expressions within the `SymbolicMap` and its conversion utilities. This simplifies the context management and aligns better with MLIR's infrastructure. PiperOrigin-RevId: 833323582
PiperOrigin-RevId: 833326696
- There are cases in gemm_fusion_autotuner where we don't have a reference output from cuBLAS and we skip the requested correctness check. Hence we update the code in the new infra to match the current state. PiperOrigin-RevId: 833337857
Author: Looks like it works
Motivation
Technical Details
Test Plan
Test Result
Submission Checklist