Conversation

@mikaylagawarecki (Contributor) commented Aug 22, 2025

The current state of the world is that there are two implementations of torch's parallel interface, OpenMP and ParallelNative. `INTRA_OP_PARALLEL` (which is used to gate whether the "parallel logic" in `parallel_for` is used, see below) is defined if

1. `AT_PARALLEL_OPENMP = 1` (at libtorch build time, per the generated ATen/Config.h) and `_OPENMP` is defined at extension build time (meaning that **both libtorch and the extension compile/link against OpenMP**)
2. `AT_PARALLEL_OPENMP = 0 && AT_PARALLEL_NATIVE = 1` (at libtorch build time, per the generated ATen/Config.h)
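
Concretely, the gating described above amounts to preprocessor logic along these lines (a simplified sketch of the conditions just listed, not a verbatim copy of the ATen headers):

```cpp
// AT_PARALLEL_OPENMP / AT_PARALLEL_NATIVE come from the generated ATen/Config.h
// (fixed at libtorch build time); _OPENMP is predefined by the compiler only when
// the extension itself is built with OpenMP enabled.
#if AT_PARALLEL_OPENMP
#ifdef _OPENMP
#define INTRA_OP_PARALLEL
#endif
#elif AT_PARALLEL_NATIVE
#define INTRA_OP_PARALLEL
#endif
```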

```cpp
template <class F>
inline void parallel_for(
    const int64_t begin,
    const int64_t end,
    const int64_t grain_size,
    const F& f) {
  TORCH_INTERNAL_ASSERT_DEBUG_ONLY(grain_size >= 0);
  if (begin >= end) {
    return;
  }
#ifdef INTRA_OP_PARALLEL
  at::internal::lazy_init_num_threads();
  const auto numiter = end - begin;
  const bool use_parallel =
      (numiter > grain_size && numiter > 1 && !at::in_parallel_region() &&
       at::get_num_threads() > 1);
  if (!use_parallel) {
    internal::ThreadIdGuard tid_guard(0);
    c10::ParallelGuard guard(true);
    f(begin, end);
    return;
  }
  internal::invoke_parallel(
      begin, end, grain_size, [&](int64_t begin, int64_t end) {
        c10::ParallelGuard guard(true);
        f(begin, end);
      });
#else
  internal::ThreadIdGuard tid_guard(0);
  c10::ParallelGuard guard(true);
  f(begin, end);
#endif
}
```
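
For reference, a typical caller of this API looks like the following (a generic usage sketch of `at::parallel_for`, not code from this PR):

```cpp
#include <ATen/Parallel.h>
#include <cstdint>

// Scale a buffer in parallel chunks; the lambda receives a [begin, end) sub-range.
void scale(float* data, int64_t n, float alpha) {
  at::parallel_for(0, n, /*grain_size=*/2048, [&](int64_t begin, int64_t end) {
    for (int64_t i = begin; i < end; ++i) {
      data[i] *= alpha;
    }
  });
}
```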

The approach taken in this PR is to paste the implementation of `parallel_for` from `ATen/Parallel-inl.h` into `torch/csrc/stable/ops.h` with the following modifications:

For perf, we want the function passed to `parallel_for` to be inlined all the way into `invoke_parallel`.

- This is possible for the OpenMP implementation, which templates `F` just like `parallel_for` does:
```cpp
#ifdef _OPENMP
namespace at::internal {
template <typename F>
inline void invoke_parallel(
    int64_t begin,
    int64_t end,
    int64_t grain_size,
    const F& f) {
  std::atomic_flag err_flag = ATOMIC_FLAG_INIT;
  std::exception_ptr eptr;
#pragma omp parallel
  {
    // choose number of tasks based on grain size and number of threads
    // can't use num_threads clause due to bugs in GOMP's thread pool (See
    // #32008)
    int64_t num_threads = omp_get_num_threads();
    if (grain_size > 0) {
      num_threads = std::min(num_threads, divup((end - begin), grain_size));
    }
    int64_t tid = omp_get_thread_num();
    int64_t chunk_size = divup((end - begin), num_threads);
    int64_t begin_tid = begin + tid * chunk_size;
    if (begin_tid < end) {
      try {
        internal::ThreadIdGuard tid_guard(tid);
        f(begin_tid, std::min(end, chunk_size + begin_tid));
      } catch (...) {
        if (!err_flag.test_and_set()) {
          eptr = std::current_exception();
        }
      }
    }
  }
  if (eptr) {
    std::rethrow_exception(eptr);
  }
}
} // namespace at::internal
```

We paste the implementation of `invoke_parallel` into `torch::stable::internal`, **with the modification that `ThreadIdGuard` (the only non-header-only piece) uses the shimmed version. We do not move it to torch/headeronly due to this ThreadIdGuard limitation.**

When I compile the call I added in kernel.cpp,

```cpp
torch::stable::parallel_for(
    0, size, grain_size, [data_ptr](int64_t begin, int64_t end) {
      for (int64_t i = begin; i < end; i++) {
        int thread_id = aoti_torch_get_thread_num();
        data_ptr[i] = i | (static_cast<int64_t>(thread_id) << 32);
      }
    });
```

I can find the following in the objdump, which I think indicates that the function is getting inlined correctly into `invoke_parallel` for this path:

<img width="1194" height="260" alt="Screenshot 2025-10-08 at 7 03 49 PM" src="https://github.com/user-attachments/assets/32982cfc-8b5f-4765-84db-4aaeb5e77591" />
- This is not possible for the ParallelNative implementation:
    - it takes in an `std::function` for `f`
    - it is defined in a .cpp (and relies on other non-header-only functions)

For the above two reasons, we shim the ParallelNative version of `invoke_parallel`:
```cpp
void invoke_parallel(
    const int64_t begin,
    const int64_t end,
    const int64_t grain_size,
    const std::function<void(int64_t, int64_t)>& f) {
  at::internal::lazy_init_num_threads();

  size_t num_tasks = 0, chunk_size = 0;
  std::tie(num_tasks, chunk_size) =
      internal::calc_num_tasks_and_chunk_size(begin, end, grain_size);

  struct {
    std::atomic_flag err_flag = ATOMIC_FLAG_INIT;
    std::exception_ptr eptr;
    std::mutex mutex;
    std::atomic_size_t remaining{0};
    std::condition_variable cv;
  } state;

  auto task = [f, &state, begin, end, chunk_size](size_t task_id) {
    int64_t local_start = static_cast<int64_t>(begin + task_id * chunk_size);
    if (local_start < end) {
      int64_t local_end = std::min(end, static_cast<int64_t>(chunk_size + local_start));
      try {
        ParallelRegionGuard guard(static_cast<int>(task_id));
        f(local_start, local_end);
      } catch (...) {
        if (!state.err_flag.test_and_set()) {
          state.eptr = std::current_exception();
        }
      }
    }
    {
      std::unique_lock<std::mutex> lk(state.mutex);
      if (--state.remaining == 0) {
        state.cv.notify_one();
      }
    }
  };

  state.remaining = num_tasks;
  _run_with_pool(std::move(task), num_tasks);

  // Wait for all tasks to finish.
  {
    std::unique_lock<std::mutex> lk(state.mutex);
    if (state.remaining != 0) {
      state.cv.wait(lk);
    }
  }
  if (state.eptr) {
    std::rethrow_exception(state.eptr);
  }
}
} // namespace internal
```
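
Because the shimmed ParallelNative entry point has to cross a C ABI boundary, the templated callable gets adapted on the extension side into a plain function pointer plus a type-erased context (the PR's `Trampoline` helper, discussed in the review below, plays this role). The following is a hedged sketch of the general idea only; the shim name `assumed_shimmed_invoke_parallel` and its signature are assumptions for illustration, not the actual C shim:

```cpp
#include <cstdint>

// Stand-in declaration for the C shim entry point (assumed name and signature).
extern "C" void assumed_shimmed_invoke_parallel(
    int64_t begin,
    int64_t end,
    int64_t grain_size,
    void (*fn)(int64_t, int64_t, void* ctx),
    void* ctx);

template <typename F>
void invoke_parallel_via_shim(
    int64_t begin, int64_t end, int64_t grain_size, const F& f) {
  // A captureless lambda converts to a plain function pointer; the user's callable
  // is recovered from the type-erased context on the other side of the C boundary.
  auto trampoline = +[](int64_t b, int64_t e, void* ctx) {
    (*static_cast<const F*>(ctx))(b, e);
  };
  assumed_shimmed_invoke_parallel(
      begin,
      end,
      grain_size,
      trampoline,
      const_cast<void*>(static_cast<const void*>(&f)));
}
```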

The rest of the APIs are shimmed:

- `at::internal::lazy_init_num_threads()` --> `aoti_torch_lazy_init_num_threads`
  Reason for shimming: the [implementation of `at::internal::lazy_init_num_threads()`](https://github.com/pytorch/pytorch/blob/e0cb1848d0fd9fb4467ad8b844c565aea5071838/aten/src/ATen/Parallel.h#L32-L38) calls `at::init_num_threads`, which is not header-only
- `at::in_parallel_region` --> `aoti_torch_in_parallel_region`
  Reason for shimming: the [OpenMP implementation](https://github.com/pytorch/pytorch/blob/e0cb1848d0fd9fb4467ad8b844c565aea5071838/aten/src/ATen/ParallelOpenMP.cpp#L94-L100) is defined in a .cpp and depends on whether OpenMP is linked against at libtorch build time; the [ParallelNative implementation](https://github.com/pytorch/pytorch/blob/e0cb1848d0fd9fb4467ad8b844c565aea5071838/aten/src/ATen/ParallelNative.cpp#L266-L276) is defined in a .cpp and depends on whether `C10_MOBILE` is defined at libtorch build time
- `at::get_num_threads` --> `aoti_torch_get_num_threads`
  Reason for shimming: similar story to `in_parallel_region`; see the [ParallelNative impl](https://github.com/pytorch/pytorch/blob/e0cb1848d0fd9fb4467ad8b844c565aea5071838/aten/src/ATen/ParallelNative.cpp#L241-L260) and the [OpenMP impl](https://github.com/pytorch/pytorch/blob/e0cb1848d0fd9fb4467ad8b844c565aea5071838/aten/src/ATen/ParallelOpenMP.cpp#L75-L82)
- `ThreadIdGuard` --> `aoti_torch_create_thread_id_guard` and `aoti_torch_delete_thread_id_guard`, with a C++ wrapper `torch::stable::ThreadIdGuard` (a sketch of this wrapper pattern follows the list)
  Reason for shimming: the [ThreadIdGuard impl](https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/Parallel.h#L42-L50) depends on `set_thread_num`, which is not header-only
- `c10::ParallelGuard` --> `aoti_torch_create_parallel_guard`, `aoti_torch_delete_parallel_guard`, and `aoti_torch_parallel_guard_is_enabled`, with a C++ wrapper `torch::stable::ParallelGuard`
  Reason for shimming: has a .cpp file ([ParallelGuard.cpp](https://github.com/pytorch/pytorch/blob/main/c10/util/ParallelGuard.cpp))
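
As an illustration of the guard shimming pattern referenced in the `ThreadIdGuard` item above, the C++ wrapper can be a small RAII class over the create/delete pair. This is a hedged sketch only: the shim names come from the list above, but the handle type and exact C signatures below are assumptions, not the actual shim header.

```cpp
#include <cstdint>

// Assumed opaque handle and C signatures, for illustration only.
using ThreadIdGuardHandle = void*;
extern "C" ThreadIdGuardHandle aoti_torch_create_thread_id_guard(int64_t thread_id);
extern "C" void aoti_torch_delete_thread_id_guard(ThreadIdGuardHandle guard);

namespace torch::stable {

// RAII wrapper: construct to set the thread id for the current scope, destroy to restore it.
class ThreadIdGuard {
 public:
  explicit ThreadIdGuard(int64_t thread_id)
      : guard_(aoti_torch_create_thread_id_guard(thread_id)) {}
  ~ThreadIdGuard() {
    aoti_torch_delete_thread_id_guard(guard_);
  }
  ThreadIdGuard(const ThreadIdGuard&) = delete;
  ThreadIdGuard& operator=(const ThreadIdGuard&) = delete;

 private:
  ThreadIdGuardHandle guard_;
};

} // namespace torch::stable
```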

Stack from ghstack (oldest at bottom):

[ghstack-poisoned]

pytorch-bot bot commented Aug 22, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/161320

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 1 Unrelated Failure

As of commit 19ab872 with merge base f06e669:

NEW FAILURE - The following job has failed:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

mikaylagawarecki added a commit that referenced this pull request Aug 22, 2025
ghstack-source-id: 691b9f9
Pull Request resolved: #161320
mikaylagawarecki added a commit that referenced this pull request Aug 22, 2025
ghstack-source-id: 5ed00f0
Pull Request resolved: #161320

Attention! One of the PyTorch C-stable API files was changed

You MUST NOT change existing function declarations in this file, as this header defines a stable C ABI. If you need to change the signature for a function, introduce a new v2 version of the function and modify code generation to target the new version of the function.


Caused by:

mikaylagawarecki added a commit that referenced this pull request Aug 23, 2025
ghstack-source-id: b3f7fdf
Pull Request resolved: #161320
mikaylagawarecki added a commit that referenced this pull request Oct 3, 2025
ghstack-source-id: e9a7389
Pull Request resolved: #161320
mikaylagawarecki added a commit that referenced this pull request Oct 6, 2025
ghstack-source-id: 33fb487
Pull Request resolved: #161320
mikaylagawarecki added a commit that referenced this pull request Oct 7, 2025
ghstack-source-id: ae2e218
Pull Request resolved: #161320
mikaylagawarecki added a commit that referenced this pull request Oct 8, 2025
ghstack-source-id: c236a43
Pull Request resolved: #161320
mikaylagawarecki added a commit that referenced this pull request Oct 8, 2025
ghstack-source-id: 9e95d70
Pull Request resolved: #161320
mikaylagawarecki added a commit that referenced this pull request Oct 8, 2025
ghstack-source-id: dd66444
Pull Request resolved: #161320
@mikaylagawarecki mikaylagawarecki added the ciflow/trunk Trigger trunk jobs on your pull request label Oct 8, 2025
mikaylagawarecki added a commit that referenced this pull request Oct 8, 2025
ghstack-source-id: 1955d1a
Pull Request resolved: #161320
mikaylagawarecki added a commit that referenced this pull request Oct 8, 2025
ghstack-source-id: 3114a1e
Pull Request resolved: #161320
mikaylagawarecki added a commit that referenced this pull request Oct 9, 2025
ghstack-source-id: 9d9f9e4
Pull Request resolved: #161320
@swolchok (Contributor) commented Oct 9, 2025

An aside before I get to reviewing the actual code here: in ExecuTorch I got a significant code size win by using a port of llvm::function_ref instead of std::function because parallel_for blocks until the computation is done, so it doesn't need to own the function it's calling. If we're thinking of stabilizing and committing to these interfaces forever we should probably first investigate whether function_ref is also appropriate in PyTorch core.
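
For context on the llvm::function_ref suggestion, the core idea is a non-owning callable reference: a plain function pointer plus a type-erased pointer to the caller's callable, which is safe here precisely because parallel_for blocks until the work is done. Below is a minimal sketch of that idea (illustrative only; not the LLVM or ExecuTorch implementation):

```cpp
#include <cstdint>
#include <type_traits>
#include <utility>

template <typename Fn>
class function_ref;

template <typename Ret, typename... Params>
class function_ref<Ret(Params...)> {
  // Non-owning: stores only the address of the caller's callable, so the referenced
  // callable must outlive every call (fine for a blocking parallel_for).
  Ret (*callback_)(intptr_t callable, Params... params) = nullptr;
  intptr_t callable_ = 0;

  template <typename Callable>
  static Ret callback_fn(intptr_t callable, Params... params) {
    return (*reinterpret_cast<Callable*>(callable))(std::forward<Params>(params)...);
  }

 public:
  template <typename Callable>
  function_ref(Callable&& callable)
      : callback_(callback_fn<std::remove_reference_t<Callable>>),
        callable_(reinterpret_cast<intptr_t>(&callable)) {}

  Ret operator()(Params... params) const {
    return callback_(callable_, std::forward<Params>(params)...);
  }
};
```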

mikaylagawarecki added a commit that referenced this pull request Oct 9, 2025
ghstack-source-id: 25e58b1
Pull Request resolved: #161320
@mikaylagawarecki (Contributor Author) commented Oct 9, 2025

@swolchok

Could you help me understand this better, please?

My impression was that templating `F` was desirable because we want to allow the `f` passed to `parallel_for` to be inlined into the loop in `invoke_parallel`. Wouldn't `llvm::function_ref` prevent `f` from being inlined, or at least make it harder?

I see that smaller code size would be beneficial when binary size is a concern, but I'm not sure that is the goal here: my impression is that an extension using torch/csrc/stable/ops.h would need to depend on libtorch.so (or a binary that implements a large part of the AOTI C shim), which might not be so lightweight anyway.

@swolchok (Contributor) commented Oct 9, 2025

> @swolchok
>
> Could you help me understand this better, please?

This is what I get for commenting based on the description without looking at the code, sorry. I see now that we aren't exposing std::function anywhere, so we needn't worry about committing to it.

```cpp
// matches the existing semantic.
#ifdef _OPENMP
template <typename F>
inline void invoke_parallel(
```
Contributor:

OK, so if I understand correctly, this implementation isn't actually a stable ABI to anything, it's just an OpenMP-based invoke_parallel implementation that will get built into clients' binaries. That's fine, so long as there are no concerns about clients using gcc/clang when pytorch uses the other one (they have different OpenMP implementations that they use by default), and the overall process getting linked against two different OpenMP support libraries, and that somehow causing problems. Those problems, to the extent they are real (which I don't know), already exist for customers that want to use OpenMP in their extension anyway; however here we are sort of advertising/pushing OpenMP to them and so we should make sure that that's actually a reasonable thing to do.

Contributor Author:

Yes, I think your understanding is right; the only reason I call it "stable" is because it uses a shimmed `ThreadIdGuard` (though perhaps that is a misnomer).

> Those problems, to the extent they are real (which I don't know), already exist for customers that want to use OpenMP in their extension anyway

Yes exactly! 😄

> however here we are sort of advertising/pushing OpenMP to them and so we should make sure that that's actually a reasonable thing to do.

Makes sense, my intent was mostly for extension writers to be able to use `parallel_for` in the same way they used to. But I see what you mean that having this here might incentivize them to try to use OpenMP.

I think this is the first op we've added in torch/csrc/stable that the compiler toolchain issues apply to (cc @janeyx99 is that right?). Would it be reasonable if I:

- add more explicit documentation here + in our user-facing docs re: the potential issues users might run into if they use different compilers
- (if helpful) put all the OpenMP-related code in some "more private"-looking file, rather than exposing it directly in ops.h

```cpp
// For the ParallelNative path, this helps with converting C++ lambdas
// etc. to a C-style function pointer expected by the C-shim
template <typename F>
struct Trampoline {
```
Contributor:

Sure, this is basically a minimal llvm::function_ref so I'm fine with it. Note that they moved to intptr_t from void* 11 years ago because some compilers warn on these casts. llvm/llvm-project@36e1295

```cpp
AOTI_TORCH_EXPORT bool aoti_torch_get_intra_op_parallel_enabled();

// Value of AT_PARALLEL_OPENMP
AOTI_TORCH_EXPORT bool aoti_torch_get_parallel_openmp_enabled();
```
Contributor:

Should we try to mimic naming patterns of existing AOTI APIs (See pattern on line 172) as well as torch.backends.openmp.is_available()? (get_ prefix implies there must be a matching set_ API, but OpenMP is either enabled or not, isn't it?)

Suggested change:
```diff
- AOTI_TORCH_EXPORT bool aoti_torch_get_parallel_openmp_enabled();
+ AOTI_TORCH_EXPORT bool aoti_torch_openmp_is_available();
```

Contributor Author:

Good point, thank you!

```cpp
// If using a parallel path, the thread id is encoded in the upper 32 bits
torch::stable::parallel_for(
    0, size, grain_size, [data_ptr](int64_t begin, int64_t end) {
      for (int64_t i = begin; i < end; i++) {
```
Contributor:

Why not use c10::irange there? (or at the very least use auto for i)

Suggested change:
```diff
- for (int64_t i = begin; i < end; i++) {
+ for (const auto i : c10::irange(begin, end)) {
```

```cpp
    StableIValue* stack,
    uint64_t num_args,
    uint64_t num_outputs) {
  Tensor res = test_parallel_for(to<int64_t>(stack[0]), to<int64_t>(stack[1]));
```
Contributor:

Suggested change:
```diff
- Tensor res = test_parallel_for(to<int64_t>(stack[0]), to<int64_t>(stack[1]));
+ auto& res = test_parallel_for(to<int64_t>(stack[0]), to<int64_t>(stack[1]));
```

```python
# always use OPENMP path, OpenMP path will only be used if (1) AND (2)
# (1) libtorch was built with OpenMP
# (2) extension compiles and links with -fopenmp
# macOS clang does not support -fopenmp so we need to skip it
```
Contributor:

This is an incorrect statement: it does, the option simply needs to be wrapped with -Xcompiler -fopenmp.

And indeed, if we want some sort of an abstraction there, shouldn't we have a helper function, say torch.utils.cpp_extensions.get_openmp_flags() (which I think already exists)?

Contributor Author:

ooh thank you, let me try this

Sorry, I saw other errors in the codebase that seemed to corroborate this so I assumed it was the case

```python
if openmp_problem and sys.platform == "darwin":
    instruction = (
        "\n\nOpenMP support not found. Please try one of the following solutions:\n"
        "(1) Set the `CXX` environment variable to a compiler other than Apple clang++/g++ "
        "that has builtin OpenMP support;\n"
        "(2) install OpenMP via conda: `conda install llvm-openmp`;\n"
        "(3) install libomp via brew: `brew install libomp`;\n"
        "(4) manually setup OpenMP and set the `OMP_PREFIX` environment variable to point to a path"
        " with `include/omp.h` under it."
    )
```

```cpp
AOTI_TORCH_EXPORT AOTITorchError aoti_torch_zero_(AtenTensorHandle self);

// parallel utilities
AOTI_TORCH_EXPORT void aoti_torch_lazy_init_num_threads();
```
Contributor:

This API looks quite confusing; maybe some explanation of why this is needed would be good? (For example, I don't know myself why; on all non-embedded OSes it's pretty safe to query the number of cores at initialization time.)

Contributor Author:

Will add a comment that it's only for use by parallel_for in torch/csrc/stable!

```python
# (2) extension compiles and links with -fopenmp
# macOS clang does not support -fopenmp so we need to skip it
if sys.platform != "darwin":
    extra_compile_args["cxx"].extend(["-fopenmp", "-D_OPENMP"])
```
Contributor:

Why do you need to define _OPENMP here, isn't it a compiler's job? See https://godbolt.org/z/7dEcn1vfP

Also, you'll need to pass a different flag if you are on Windows

```
@@ -0,0 +1,55 @@
#pragma once

#include <torch/csrc/inductor/aoti_torch/c/shim.h>
```
Contributor:

I'm somewhat new to the codebase, but this header name makes me a bit uncomfortable. Are the stable APIs simply piggybacking on some of the AOTI shim definitions? And is the intention to move it later into the torch/csrc/stable folder?

Contributor Author:

Yes, torch/csrc/stable mainly provides C++ wrappers around the AOTI shim definitions (my understanding is that some of these are intended to be more ergonomic, e.g. do memory management that the C header does not, provide kwarg defaults that the C header can't, etc.). I don't think there is an intention to move it later into torch/csrc/stable; cc @janeyx99 for why.

Contributor:

Yea, I can see why the headers give discomfort. I think based on offline discussion, we should start moving our shims to a non-aoti file and go from there.

```cpp
namespace internal {

// Copied from aten/src/ATen/Parallel.h
inline int64_t divup(int64_t x, int64_t y) {
```
Contributor:

I think this is the 5th copy of this function that I'm seeing in the codebase. Why not move it to, say, torch/csrc/stable/utils.h? Also, what's wrong with the template below?

```cpp
template <typename T>
inline T divup(T x, T y) {
  ...
}
```

Contributor Author:

Good point, let me add it to torch/headeronly and make the rest of libtorch include that.
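
For completeness, the ceil-division helper under discussion is one line; a minimal sketch of the templated form follows (the torch/headeronly placement is the plan stated above, not something this snippet claims has landed):

```cpp
// Ceil division: smallest integer q such that q * y >= x (for positive y).
template <typename T>
inline T divup(T x, T y) {
  return (x + y - 1) / y;
}
```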

@malfet (Contributor) left a comment:

Looks like torch::stable::internal::parallel_for is just a copy of the respective implementations from ParallelOpenMP.h and ParallelNative.h. If this is the case, why aren't you deleting the implementation there and just making them use this "Stable" implementation?

@mikaylagawarecki (Contributor Author) commented Oct 15, 2025

@malfet Are you referring to torch::stable::internal::invoke_parallel or torch::stable::parallel_for?

For `torch::stable::parallel_for`, I think the main difference is that the version I pasted into torch/csrc/stable uses shimmed functions + classes, which I assume might have some slight overhead from extra function calls (we have no choice but to incur this for torch/csrc/stable), but for the libtorch version we can avoid this.

Separately, the principle we are following is that torch/csrc/stable is allowed to include the aoti shim.h and anything in torch/headeronly (but not anything else in libtorch)

The rest of libtorch is allowed to include anything in torch/headeronly but not torch/csrc/stable. See the diagram from Jane below.

@janeyx99, is this an accurate representation, or is there any other fundamental reasoning that I'm missing?

For `torch::stable::internal::invoke_parallel`, it's the same justification as the above (the shimmed `ThreadIdGuard` is used).

See my comment in the PR description

> For perf, we want the function passed to `parallel_for` to be inlined all the way into `invoke_parallel`
>
> - This is possible for the OpenMP implementation, which templates `F` just like `parallel_for` does (see the OpenMP `invoke_parallel` listing in the PR description above).
>
>   We paste the implementation of `invoke_parallel` into `torch::stable::internal` with the modification that `ThreadIdGuard` (the only non-header-only piece) uses the shimmed version. We do not move it to torch/headeronly due to this ThreadIdGuard limitation.

mikaylagawarecki added a commit that referenced this pull request Oct 16, 2025
ghstack-source-id: 1e9eb12
Pull Request resolved: #161320
mikaylagawarecki added a commit that referenced this pull request Oct 16, 2025
ghstack-source-id: 2538044
Pull Request resolved: #161320
mikaylagawarecki added a commit that referenced this pull request Oct 16, 2025
ghstack-source-id: 9a9d6d4
Pull Request resolved: #161320
mikaylagawarecki added a commit that referenced this pull request Oct 16, 2025
ghstack-source-id: df62375
Pull Request resolved: #161320
mikaylagawarecki added a commit that referenced this pull request Oct 16, 2025
ghstack-source-id: 157c942
Pull Request resolved: #161320
mikaylagawarecki added a commit that referenced this pull request Oct 16, 2025
ghstack-source-id: c1b18aa
Pull Request resolved: #161320
mikaylagawarecki added a commit that referenced this pull request Oct 16, 2025
ghstack-source-id: cde0bdb
Pull Request resolved: #161320
mikaylagawarecki added a commit that referenced this pull request Oct 17, 2025
ghstack-source-id: c0ae771
Pull Request resolved: #161320
```cpp
if (begin_tid < end) {
  try {
    ThreadIdGuard tid_guard(static_cast<uint64_t>(tid));
    f(begin_tid, std::min(end, chunk_size + begin_tid));
```
mikaylagawarecki (Contributor, Author) commented:

@swolchok This is where I thought we could inline `f` to.
