
Upstream merge 0803 #1887

Merged
merged 2,619 commits into from
Aug 4, 2022
Conversation

jjsjann123 (Collaborator) commented:

merging upstream/master into csarofeen/devel

Upstream master commit: 9647bec
Corresponding PR to bump our master branch: #1886

Chillee and others added 30 commits July 27, 2022 00:31
…ted data types (pytorch#82183)

This continues the fixes for TestConsistency for the MPS backend.

* Add error messages for unsupported matmul ops

* Add error handling for int inputs for linear op


Pull Request resolved: pytorch#82183
Approved by: https://github.com/razarmehr
The Docker docs say, "For other items (files, directories) that do not require ADD’s tar auto-extraction capability, you should always use COPY": https://docs.docker.com/develop/develop-images/dockerfile_best-practices/#add-or-copy

I've found this by running https://github.com/hadolint/hadolint

This is a follow-up after pytorch#81944
Pull Request resolved: pytorch#82151
Approved by: https://github.com/huydhn, https://github.com/jeffdaily, https://github.com/ZainRizvi
### Description
We need to make sure the int overload of expand gets redispatched to the same device; otherwise at::native::expand just calls a bunch of lower-level ops. (A minimal illustration follows.)
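A minimal, hypothetical illustration of the expected behavior (not code from this PR); meta tensors make the dispatch device easy to observe:

```
import torch

# Sketch only: expand with int sizes should stay on the calling
# tensor's backend rather than decomposing into lower-level ops.
t = torch.empty(2, 1, 3, device="meta")
out = t.expand(2, 4, 3)
print(out.shape, out.device)  # torch.Size([2, 4, 3]) meta
```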


Pull Request resolved: pytorch#82264
Approved by: https://github.com/bdhirsh

Pull Request resolved: pytorch#82269
Approved by: https://github.com/kit1980
Per title; unfortunately, testing invalid reads with the caching allocator is hard.
Pull Request resolved: pytorch#82272
Approved by: https://github.com/cpuhrsch
Implements linspace with arange, and logspace with linspace.
- Implements a more precise path in linspace's ref when the dtype is integral, to avoid off-by-one issues when the output of the computation is cast to int (a short sketch follows this list). The trade-off is an increased chance of overflow.
- Files several issues (pytorch#82242, pytorch#82230, pytorch#81996) on preexisting problems with linspace and logspace. These mainly concern integral dtypes; the affected tests are xfailed in this PR.
- Updates the check that the reference implementation is closer to the precise implementation than the torch implementation, so that the dtype kwarg is also set to the precise dtype.
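A minimal sketch of the integer-dtype hazard (illustrative, not from this PR):

```
import torch

# Values are computed in floating point and then cast to the integral
# dtype; truncation after an inexact step (10/3 here) is where
# off-by-one results can creep in.
print(torch.linspace(0, 10, steps=4, dtype=torch.int64))
# tensor([ 0,  3,  6, 10])
```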

TODO:
- ~support negative bases~ (not in this PR)
- ~support complex. Since arange does not support complex, but linspace does, one solution is to just call linspace separately on the real and imag components and sum the results in the end~ (not in this PR)
- ~default dtypes need to be explicitly handled since computation is done in a different dtype than result~ (done)
Pull Request resolved: pytorch#81826
Approved by: https://github.com/ngimel
### Description
Adds a compiler function to dump the forward, backward, and joint graphs. The partitioner is the default partition.
The input metadata for each dumped graph is also dumped as a pickle file.

Example usage:

```
    save_fx_func = graph_dumper_aot(current_name, folder_name, dump_example_input = False)
    optimize_ctx = torchdynamo.optimize(
        save_fx_func
    )
    with torch.enable_grad():
        with optimize_ctx:
            result = forward_and_backward_pass(model, example_inputs)
```

Pull Request resolved: pytorch#82184
Approved by: https://github.com/Chillee
…1522)

Moves the aten.native_batch_norm_backward decomposition from https://github.com/pytorch/functorch/blob/main/functorch/_src/decompositions.py#L148.

Changed it to not recompute mean and invstd, and added a type cast.

In functorch, changed `@register_decomposition_for(aten.native_batch_norm_backward)` to `@register_decomposition_for_jvp(aten.native_batch_norm_backward)`

Passing `pytest test/test_decomp.py -k norm`

Note that when the output mask is False for grad_weight and grad_bias, we should return None to be consistent with the non-decomposed operator's behavior. But None doesn't work with vjp, so the functorch version of the decomposition used zeros. See https://github.com/pytorch/pytorch/blob/b33c1f7dd4a4d30ebc912f555e56d105ae66aa84/functorch/functorch/_src/decompositions.py#L210.
Pull Request resolved: pytorch#81522
Approved by: https://github.com/Chillee
As the migration from Jenkins to GHA is complete.
Pull Request resolved: pytorch#82280
Approved by: https://github.com/huydhn
Benchmarking of the regular conv2d op.

Differential Revision: [D38118137](https://our.internmc.facebook.com/intern/diff/D38118137/)
Pull Request resolved: pytorch#82125
Approved by: https://github.com/SS-JIA
Based on pytorch#80511 with extra changes:
- Update pybind to the latest release, as it contains some needed fixes
- Extend the compat header to reduce changes in code
Pull Request resolved: pytorch#81242
Approved by: https://github.com/malfet, https://github.com/mattip
…ion of new parameter (pytorch#82273)

### Description
PR pytorch#80336 introduced a new parameter to the Sparse Adam optimizer. The new parameter is accessed inside the `step` method of the optimizer. If we deserialize and run a version of the optimizer serialized before this change was introduced, it fails at the step that tries to access the missing parameter.

I have added a workaround to set a default value in case the parameter is unavailable in the optimizer (a sketch of the pattern follows).
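A minimal sketch of the workaround pattern; the param-group key name `maximize` is an assumption for illustration, not necessarily the actual parameter:

```
import torch

def ensure_defaults(optimizer: torch.optim.Optimizer) -> None:
    # Older serialized optimizers may predate a newly added
    # hyperparameter, so set a default before `step` reads it.
    # The key name "maximize" is assumed for illustration.
    for group in optimizer.param_groups:
        group.setdefault("maximize", False)
```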


### Testing
* Testing on PyTorch CI
* Manual validation against existing serialized models to make sure they continue to work
Pull Request resolved: pytorch#82273
Approved by: https://github.com/mehtanirav, https://github.com/albanD
…elper (pytorch#81828)

Introduces the _DistWrapper class, which wraps a process group and provides functional variants of collectives. It works without c10d enabled and is exception robust.

Introduces tensor_narrow_n, which handles narrowing over multiple dimensions. (A hedged sketch of the wrapper idea follows.)
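A hedged sketch of the wrapper idea described above; the class name and behavior here are assumptions for illustration, not the actual implementation:

```
import torch.distributed as dist

class DistWrapperSketch:
    # Sketch: wraps an optional process group and exposes functional
    # collectives that also work when c10d is not initialized.
    def __init__(self, group=None):
        self.group = group
        self.use_dist = dist.is_available() and dist.is_initialized()

    def all_gather_object(self, obj):
        # Functional variant: returns the gathered list rather than
        # filling a caller-provided output list.
        if not self.use_dist:
            return [obj]  # single-process fallback
        out = [None] * dist.get_world_size(self.group)
        dist.all_gather_object(out, obj, group=self.group)
        return out
```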


Pull Request resolved: pytorch#81828
Approved by: https://github.com/wanchaol
It looks like the DEBUG macro is never actually set anywhere; see
pytorch#82276

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: pytorch#82277
Approved by: https://github.com/malfet
)

### Description
Improve the incremental build process on ROCm by eliminating unnecessary file changes.

### Issue
N/A

### Testing
1. Run `python tools/amd_build/build_amd.py --out-of-place-only` multiple times, and ensure the file `third_party/gloo/cmake/Modules/Findrccl.cmake` does not contain patterns like `RCCL_LIBRARY_PATH_PATH`
2. Run `python tools/amd_build/build_amd.py; USE_ROCM=1 python3 setup.py develop` twice, and confirm the second run does not trigger recompilation of thousands of files.

Pull Request resolved: pytorch#82190
Approved by: https://github.com/jithunnair-amd, https://github.com/ezyang
The next PR up in the stack requires this for lintrunner to be happy.
There are no logical changes; the file was autoformatted via the
following:
```
mv functorch/codegen/gen_vmap_plumbing.py torchgen/gen_vmap_plumbing.py
lintrunner torchgen/gen_vmap_plumbing.py -a
mv torchgen/gen_vmap_plumbing.py functorch/codegen/gen_vmap_plumbing.py
```

Test Plan:
- build functorch

Differential Revision: [D38171956](https://our.internmc.facebook.com/intern/diff/D38171956)
Pull Request resolved: pytorch#82246
Approved by: https://github.com/kit1980
zengk95 and others added 22 commits August 2, 2022 17:44
### Description
We forgot that `<!--` starts a comment in markdown. Also added a link to the wiki in the "start land checks" message so users can see why their PR is taking extra time to land.

### Issue
n/a

### Testing
n/a
Pull Request resolved: pytorch#82649
Approved by: https://github.com/janeyx99, https://github.com/ZainRizvi
fixes pytorch#81457
fixes pytorch#81216
fixes pytorch#81212
fixes pytorch#81207
fixes pytorch#81206
fixes pytorch#81218
fixes pytorch#81203
fixes pytorch#81202
fixes pytorch#81214
fixes pytorch#81220
fixes pytorch#81205
fixes pytorch#81200
fixes pytorch#81204
fixes pytorch#81221
fixes pytorch#81209
fixes pytorch#81210
fixes pytorch#81215
fixes pytorch#81217
fixes pytorch#81222
fixes pytorch#81211
fixes pytorch#81201
fixes pytorch#81208

As part of this PR I'm also re-enabling all of the functionalization tests that got marked as flaky in CI (they're not actually flaky - I think they got marked because a PR that should have changed their expect-test output made it to master without the changes. I'll let CI run on this PR to confirm though).

reland of pytorch#80897
Pull Request resolved: pytorch#82407
Approved by: https://github.com/ezyang
Adds the dispatch boilerplate for the MPS backend.
Pull Request resolved: pytorch#82612
Approved by: https://github.com/malfet
…e case (pytorch#82441)

- Refactor SchemaInfo to be able to handle cases where other variables besides running_mean and running_var mutate due to training = true
- Add special case rrelu_with_noise to fix pytorch#82434
- Tested by running SchemaInfo tests
Pull Request resolved: pytorch#82441
Approved by: https://github.com/davidberard98
This reverts commit 714669e.

Reverted pytorch#82626 on behalf of https://github.com/zengk95 due to This looks like it's breaking trunk
…uts (pytorch#82176)"

This reverts commit 1dfcad8.

Reverted pytorch#82176 on behalf of https://github.com/zengk95 due to This looks like it's breaking functorch tests on master
Update production ops (7/28). This is only for calculating mobile op test coverage.

Meta employees can update it using
```
python test/mobile/model_test/update_production_ops.py ~/fbsource/xplat/pytorch_models/build/all_mobile_model_configs.yaml
```

Pull Request resolved: pytorch#82444
Approved by: https://github.com/kit1980
Re-lands pytorch#81558, which got reverted due to failing tests.

The failure happened because of a test that I poorly designed. [The loop here](https://github.com/pytorch/pytorch/pull/81558/files#diff-893b1eea27352f336f4cd832919e48d721e4e90186e63400b8596db6b82e7450R3837) runs `cache_enabled=False` and then `cache_enabled=True`. With this loop, the graph from the previous iteration (case `False`) conflicts with the next one (case `True`). I redesigned the test so that it does not loop; the new test makes separate function calls with different argument values (see the sketch below).
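A minimal sketch of the redesigned test shape, using CPU autocast for runnability (illustrative, not the actual test):

```
import torch

# Separate calls with different argument values instead of a loop, so
# state captured under one cache_enabled setting cannot leak into the
# other.
def run(cache_enabled):
    with torch.autocast("cpu", dtype=torch.bfloat16, cache_enabled=cache_enabled):
        a = torch.randn(8, 8)
        b = torch.randn(8, 8)
        return a @ b

out_no_cache = run(cache_enabled=False)
out_cache = run(cache_enabled=True)
```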
Pull Request resolved: pytorch#81896
Approved by: https://github.com/ngimel
This moves first-class dimensions, as prototyped in https://github.com/facebookresearch/torchdim,
into the functorch build. This makes them more easily available for use in PrimTorch.
Pull Request resolved: pytorch#82454
Approved by: https://github.com/ezyang, https://github.com/zou3519
Differential Revision: D38368525

Pull Request resolved: pytorch#82676
Approved by: https://github.com/ngimel
Currently, if we run softmax_backward/logsoftmax_backward along a dim that is not the last, the calculation falls back to a [scalar version](https://github.com/pytorch/pytorch/blob/32593ef2dd26e32ed44d3c03d3f5de4a42eb149a/aten/src/ATen/native/SoftMax.cpp#L220-L287). We found that we actually have the chance to vectorize the calculation along the inner_size dim.

Changes we made:

Use the vectorized softmax_backward_kernel/log_softmax_backward_kernel instead of host_softmax_backward when not operating along the last dim (a small example exercising this path follows).
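A small example that exercises the non-last-dim path (assumption: any dim other than the last hits the newly vectorized kernel):

```
import torch

# dim=1 on a 3-D input: softmax backward runs along a non-last dim
# with inner_size > 1, the case this change vectorizes.
x = torch.randn(32, 128, 64, requires_grad=True)
y = torch.nn.functional.log_softmax(x, dim=1)
y.sum().backward()
```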

We collected benchmark data for softmax_backward and logsoftmax_backward for the BFloat16 and Float32 data types using PyTorch's operator_benchmark tool on an Intel(R) Xeon(R) Platinum 8260L CPU @ 2.40GHz.
Number of cores: 24 cores (1 socket)
[softmax_benchmark_32593ef.log](https://github.com/pytorch/pytorch/files/8962956/softmax_benchmark_32593ef.log)
[softmax_benchmark_the_pr.log](https://github.com/pytorch/pytorch/files/8962958/softmax_benchmark_the_pr.log)

Pull Request resolved: pytorch#80114
Approved by: https://github.com/frank-wei
Summary: No functional changes; just testing to make sure this is working.

Test Plan: python test/test_ao_sparsity.py TestFxComposability

Pull Request resolved: pytorch#82204
Approved by: https://github.com/supriyar
…h#81802)

Summary: Needed to refactor this PR to add tests for some new layers without copy-pasting the entirety of the code. It's basically just a helper that does exactly what the other tests did, since they were essentially copies of one another. It's possible to do something similar with the quantized kernels test, but it's different enough that it seemed more effort than it was worth. Also a bugfix: I believe line 150 was originally wrong, since model.weight was never used, though the only effect was that the specific weight wasn't used.

Test Plan: python test/test_ao_sparsity.py TestQuantizedSparseLayers

Pull Request resolved: pytorch#81802
Approved by: https://github.com/supriyar
`torch.cuda.is_bf16_supported()` returns False on ROCm, which is not correct, since BF16 is supported on all AMD GPU archs: gfx906, gfx908, and gfx90a. (A quick check appears below.)
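A quick illustrative check:

```
import torch

# After this change, ROCm builds on BF16-capable AMD GPUs should
# report True here instead of False.
if torch.cuda.is_available():
    print(torch.cuda.is_bf16_supported())
```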

cc @jithunnair-amd
Pull Request resolved: pytorch#80410
Approved by: https://github.com/jeffdaily, https://github.com/malfet
…2688)

Need to use `ASSERT_FLOAT_EQ` for floats.

Right now the test often fails internally like this:

```
xplat/caffe2/aten/src/ATen/native/quantized/cpu/qnnpack/test/fully-connected-operator-tester.h:362
Expected equality of these values:
  output_dynamic[i * outputChannels() + c]
    Which is: -601.09
  ((float)accumulators[i * outputChannels() + c] * requantization_scales[c]) + float(bias[c])
    Which is: -601.09
at 0, 18: reference = -601.0899658203125, optimized = -601.09002685546875
```

```
xplat/caffe2/aten/src/ATen/native/quantized/cpu/qnnpack/test/fully-connected-operator-tester.h:362
Expected equality of these values:
  output_dynamic[i * outputChannels() + c]
    Which is: -65.6251
  ((float)accumulators[i * outputChannels() + c] * requantization_scales[c]) + float(bias[c])
    Which is: -65.6251
at 0, 7: reference = -65.625106811523438, optimized = -65.625099182128906
```
Pull Request resolved: pytorch#82688
Approved by: https://github.com/mehtanirav
@jjsjann123 jjsjann123 marked this pull request as ready for review August 3, 2022 22:30
@jjsjann123 jjsjann123 changed the base branch from master to devel August 3, 2022 22:30
@csarofeen csarofeen (Owner) left a comment:

LGTM

@csarofeen csarofeen merged commit 7cfb779 into devel Aug 4, 2022
jjsjann123 added a commit that referenced this pull request Aug 29, 2022
Syncing nvfuser devel branch to upstream master. https://github.com/csarofeen/pytorch/

Code changes include:

- codegen improvements:
  1. removes unnecessary syncs using redundant thread compute analysis
  2. symmetric API for BestEffortReplay
  3. support merge on trivial reductions
  4. Ampere async copy improvements
- bug fixes:
  1. vectorization bug fixes
  2. type inference patch: fixes upstream pytorch#81725
  3. segmenter bug fix with deterministic iteration ordering
- parser update
  1. added leaky_relu
- scheduler
  1. normalization scheduler cleanup
  2. simplifies matmul scheduling with new transform propagator
  3. merge all dimensions in PW scheduler
  4. various gemm related improvements
- debuggability
  1. nsight compute support
  2. debug dump for InlinePropagator
  3. Add `UnaryOpType::Print`

Squashed the commits to work around the GitHub API.
Commits actually in this PR from the devel branch:

```
dfe02f3 Merge remote-tracking branch 'csarofeen/devel' into HEAD
1617373 Add `TensorViewBuilder::shape(std::vector<Val*> shape)` (#1884)
7cfb779 Merge pull request #1887 from csarofeen/upstream_merge_0803
3399f6d Merge remote-tracking branch 'origin/viable/strict' into HEAD
01208f5 Add `UnaryOpType::Print` which can be helpful for debugging (#1878)
0646522 Remove redundant TORCH_INTERNAL_ASSERT in lower_magic_zero.cpp (#1881)
7bc76aa Fix most inlined propagator for mismatched dims (#1875)
501f4aa Nonaffine swizzle formulation ep.2: Loop swizzle variant. (#1826)
d863d69 Ampere async copy ep.2: circular buffering extension to support pipelined matmul operand load (#1827)
e0ae11a Larger sized mma instructions to support full vectorization (#1824)
9bb4cf7 fragment iteration to support fully unrolled mma ops (#1823)
a48270a Merge all dims in pointwise scheduler (#1872)
172fb36 Make MostInlined and BestEffort inline propagation no longer assert replayed (#1868)
a64462a Allow trivial reduction to be merged (#1871)
440102b Symmetric API for BestEffortReplay (#1870)
d1caf33 Some misc cleanups/refactor split out from #1854 (#1867)
1013eda Remove some welford specific logic. (#1864)
51589d3 Some cleanups on tests and heuristics params (#1866)
a6b3e70 Segmenter bug fix, and deterministic iteration ordering.  (#1865)
1b665b9 Add nullptr checks to IrBuilder (#1861)
1cd9451 Simplify matmul scheduling with the new transform propagator.  (#1817)
bbc1fb9 Add leaky_relu operation (#1852)
e842a9b Minor cleanup in pointwise scheduler (#1858)
9ee850c Fix stringstream usage (#1857)
20a36c1 Improve nsight compute support (#1855)
4059103 Remove debugging `true ||` from getPointwiseHeuristics (#1822)
01117bf Misc cleanup (#1853)
5cc6494 Apply the magic-zero protection to each indexed domain individually for predicate indexing (#1846)
92e6f02 Cleanup normalization scheduler (#1845)
db89c65 Type inference patch (#1848)
102fe93 Add debug dump for InlinePropagator (#1847)
b7a4d93 Redundant thread compute analysis to avoid un-necessary sync insertion (#1687)
942be5b Upstream ci build fixes (#1842)
0b83645 Fix vectorization bug introduced in #1831 (#1840)
63630f1 Move MaxProducerPosUpdater into InlinePropagator::tearDown (#1825)
9135a96 Fix transpose benchmark dtype (#1839)
2c9a6c0 Add extra configurability to `parallelizeAllLike` (#1831)
```

RUN_TORCHBENCH: nvfuser

Differential Revision: [D38543000](https://our.internmc.facebook.com/intern/diff/D38543000)
Pull Request resolved: pytorch#83067
Approved by: https://github.com/davidberard98