Disable autocast #794
Conversation
We can probably provide a private API to run without AMP.
Yes, this would be super helpful. Just to confirm: we would like the private API to be applicable to a TorchScript graph. We already have global flags that we can enable/disable, but those change the behavior of the user code.
@anijain2305 I guess using the JIT compiler for forward/backward isn't in tree currently?
Not sure I fully understand. We can definitely use TorchScript - https://github.com/pytorch/functorch/blob/main/functorch/_src/compilers.py#L23. This is the place where we call it - https://github.com/pytorch/functorch/blob/main/functorch/_src/aot_autograd.py#L169. This is the AOTAutograd return object - https://github.com/pytorch/functorch/blob/main/functorch/_src/aot_autograd.py#L143-L185: an autograd.Function with its forward and backward set to the compiled graphs. It is at that point that this PR wraps the forward and backward calls in an autocast-disabled context.
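Roughly, the pattern being described looks like the sketch below. Here `compiled_fw` and `compiled_bw` are placeholders standing in for the compiled graphs AOTAutograd produces, not the actual internals; the sketch only illustrates wrapping the calls in `torch.cuda.amp.autocast(enabled=False)`.

```python
import torch

def wrap_compiled_fns(compiled_fw, compiled_bw):
    # compiled_fw / compiled_bw are hypothetical placeholders for the compiled graphs.
    class CompiledFunction(torch.autograd.Function):
        @staticmethod
        def forward(ctx, *args):
            # The traced graph already contains the AMP casts, so run it
            # with autocast disabled to avoid a second round of casting.
            with torch.cuda.amp.autocast(enabled=False):
                out = compiled_fw(*args)
            ctx.save_for_backward(*args)
            return out

        @staticmethod
        def backward(ctx, grad_out):
            # Same for the backward graph: execute it outside autocast.
            with torch.cuda.amp.autocast(enabled=False):
                return compiled_bw(*ctx.saved_tensors, grad_out)

    return CompiledFunction.apply
```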
Thanks! Yeah, I was just looking for that.
I was about to file a separate issue for an AMP problem, but it might be covered here. Testing PT 1.12 with the 0.2 release branch built locally, I can no longer use AOT Autograd with AMP (and that's really the only combo I'm interested in). PT 1.11 with the current PyPI release of functorch seemed fine. Now, with AMP enabled, I get errors.
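For reference, a minimal version of that combination (a sketch only: it assumes `memory_efficient_fusion` from `functorch.compile`, a CUDA build, and a toy model that stands in for the real one) would be something like:

```python
import torch
from functorch.compile import memory_efficient_fusion

# Illustrative toy model in place of the actual training setup.
model = torch.nn.Sequential(torch.nn.Linear(128, 128), torch.nn.GELU()).cuda()
fused = memory_efficient_fusion(model)
x = torch.randn(8, 128, device="cuda")

# Running the AOT-compiled module under AMP is the combination reported as failing.
with torch.cuda.amp.autocast():
    out = fused(x)
    out.sum().backward()
```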
Hi @rwightman, does this PR work for you? If you have a script, I can try it on my end as well. We did not merge this one because there is a slightly better fix that @eellison is working on in PyTorch core, but it's still a work in progress. So, if this PR works for you, I am inclined to merge this in and bring it into the functorch 0.2 release to unblock.
LGTM, one minor nit.
@anijain2305 I cherry-picked this onto the 0.2 release branch locally and it appears to resolve the AMP issues for me.
* Disable autocast
* Add global flag
* Add a test
With AMP, the AOT Autograd traced graph already reflects the AMP modifications. However, TorchScript does not know that and can try to AMP-ify the already AMP-ified AOT Autograd traced graph, resulting in weird type promotion errors.
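To see why, here is a rough illustration (assuming `make_fx` is importable from functorch in this release and a CUDA device is available): tracing under autocast bakes the casts into the captured graph, so applying AMP again when that graph is compiled and replayed amounts to AMP-ifying it a second time.

```python
import torch
from functorch import make_fx

def f(x, w):
    return torch.mm(x, w)

x = torch.randn(4, 4, device="cuda")
w = torch.randn(4, 4, device="cuda")

# Tracing under autocast records the low-precision casts as explicit ops in the graph.
with torch.cuda.amp.autocast():
    traced = make_fx(f)(x, w)

print(traced.code)  # the captured graph already contains the casts inserted by autocast
```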
Concern (and that's why this is WIP): disabling autocast around each forward and backward call might add overhead. Is there any better way?
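One rough way to gauge that overhead (purely illustrative; numbers vary by machine and say nothing about interaction with the rest of the dispatcher) is to time just the enter/exit of an `autocast(enabled=False)` context in isolation:

```python
import time
import torch

def measure(n=10_000):
    # Time only the enter/exit of the autocast(enabled=False) context manager,
    # i.e. the extra work this PR adds around every forward and backward call.
    start = time.perf_counter()
    for _ in range(n):
        with torch.cuda.amp.autocast(enabled=False):
            pass
    return (time.perf_counter() - start) / n

print(f"~{measure() * 1e6:.2f} us per enter/exit")
```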