Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ifu 2023 03 31 #38

Merged
merged 35 commits into from
Apr 3, 2023
Merged

Ifu 2023 03 31 #38

merged 35 commits into from
Apr 3, 2023

Conversation

liligwu
Copy link
Collaborator

@liligwu liligwu commented Mar 31, 2023

update hipify_torch and remove the manual mapping of the C10 macros.
all tests pass (batched_unary_embeddings_test skipped)
test_log.txt

banitag1 and others added 30 commits March 10, 2023 13:04
Summary:
Pull Request resolved: pytorch#1638

This diff adds another mechanism for allocating the host mapped pinned memory to reduce adverse affect on other processes running on the same host when one process is doing some large allocations.

Reviewed By: zyan0, jianyuh

Differential Revision: D43950253

fbshipit-source-id: 41a434cb63354509d32e00c851c5f3a2d68be686
Summary:
This PR addresses the issue pytorch#1636

akin to https://github.com/pytorch/FBGEMM/blob/8616ed701015f8b9e4c2825ce592b204b4cfaf28/fbgemm_gpu/test/split_table_batched_embeddings_test.py#L1009

Pull Request resolved: pytorch#1635

Reviewed By: shintaro-iwasaki

Differential Revision: D44033725

Pulled By: q10

fbshipit-source-id: 49f28fc2f1c20948a42728eebf3defc5195baa5d
… when using freq based methods (pytorch#1352)

Summary:
Pull Request resolved: pytorch#1352

1. Update interface to accomadate rowwise_adagrad_with_counter.
2. Route backend for rowwise_adagrad to the new rowwise_adagrad_with_counter when freq based methods (e.g. freq sgd, counter adjusted regularization) are used.

Reviewed By: csmiler

Differential Revision: D36788395

fbshipit-source-id: 8eb5da8a5c8b52bc1e237af1054aac9f7245c443
…ard (pytorch#1642)

Summary:
Pull Request resolved: pytorch#1642

Remove sync point in jagged_dense_elementwise_add_jagged_output backward

Reviewed By: brad-mengchi

Differential Revision: D44039901

fbshipit-source-id: 8e7e23e4d9e01359e67e5b166adc57f894a1224d
…ytorch#1639)

Summary:
- Remove `.post0` suffix from the autogenerated package version
- Document the full FBGEMM_GPU OSS build process in a separate Markdown file
- Remove installation of packages not needed for ROCm builds
- Migrate CPU and ROCm jobs to run on top of Docker containers instead of bare metal instances
- Update GitHub workflow configuration to cancel previous jobs for a PR if a new commit is pushed to the PR

Pull Request resolved: pytorch#1639

Reviewed By: shintaro-iwasaki

Differential Revision: D44076312

Pulled By: q10

fbshipit-source-id: 6b2d083022feb7421b26da2d998678e00c11f283
Summary:
fix build with gcc-13

Pull Request resolved: pytorch#1640

Reviewed By: shintaro-iwasaki

Differential Revision: D44044422

Pulled By: q10

fbshipit-source-id: 692ec9c34f4aaf726294a2b643fbceabf8159033
Summary:
Pull Request resolved: pytorch#1611

If group size is larger than 54, internally breaks the group down into
smaller groups (each subgroup size is less than or equal to 54).

Reviewed By: jianyuh

Differential Revision: D43585937

fbshipit-source-id: bf14eeb79881a5737dcf7660e3e0f56d21f7b326
Summary:
Pull Request resolved: pytorch#1637

Enforce cache misses (even if trace-driven testing doesn't experience cache miss due to limited trace size) so that we can evaluate performance under cache misses.

Note that it's not exactly cache misses; enforce access to UVM by overriding lxu_cache_locations -- N / 256 requests.

Reviewed By: YuzeDaiMeta

Differential Revision: D42194019

fbshipit-source-id: ab04c1cc7a749e84d605cfe4f1687489ceab5725
Summary:
Pull Request resolved: pytorch#1602

Illegal memory access is a common problem during GPU kernel execution.
The FBGEMM GPU relies on PyTorch's `C10_CUDA_KERNEL_LAUNCH_CHECK()` and
the CUDA runtime to detect such problems and throw an error.  However,
there are a few known issues with this approach.

(1) `C10_CUDA_KERNEL_LAUNCH_CHECK()` detects errors on the host.
However, due to the non-blocking, asynchronous nature of GPU kernel
execution, the error is caught on the host at a later point than where
the problematic kernel was launched.  This can cause the stack trace
to be inaccurate and make debugging more difficult.  Although the
issue can be fixed by running the code with `CUDA_LAUNCH_BLOCKING=1`,
this can change the state of the execution and cause Heisenbugs.

(2) Not all illegal memory accesses are caught by the runtime.  This
means that the system may not always throw an error when illegal
memory access occurs.

(3) Although the runtime throws an error for illegal memory access, it
is difficult to pinpoint the specific kernel and memory buffer/address
that is causing the problem.

For all the aforementioned reasons, we attempt to catch and throw an
error as soon as possible in the kernel when illegal memory accesses
occur in FBGEMM GPU.  We introduce the `FBGEMM_GPU_MEMCHECK` flag
to enable memory checking during compile time.  We copy PyTorch's
`TensorAccessor.h` into the FBGEMM GPU and extend it to check every
memory access through the `PackedTensorAccessor`.  If an invalid memory
access occurs, we throw an error using `CUDA_KERNEL_ASSERT`.  The error
message includes the name of the tensor and the kernel that caused the
problem.

If `FBGEMM_GPU_MEMCHECK` is enabled, FBGEMM operators will use
`fbgemm::PackedTensorAccessor`.  Otherwise, they will use
`at::PackedTensorAccessor`

`FBGEMM_GPU_MEMCHECK` integration in FBGEMM ops will be done in
subsequent diffs

Reviewed By: r-barnes

Differential Revision: D43421838

fbshipit-source-id: c8ef04970d94bb097cb5f09b42f994db72845167
Summary:
Pull Request resolved: pytorch#1648

This hack is not needed in Xcode 14.3 anymore, where the clang version is 14.0.3. So change the workaround to only include up to 14.0.2.

Reviewed By: MatzeB

Differential Revision: D44130421

fbshipit-source-id: 1fb2948567941bdf6ee9487ccfaa9dfb2caf92dd
…ch#1646)

Summary:
- Parallelize the FBGEMM CI builds to build and test static and shared libraries independently instead of in serial
- Move the FBGEMM CI builds to run inside Docker containers
- Add support for building FBGEMM_GPU against Python 3.11 in OSS
- Move all FBGEMM_GPU nightly and release build jobs to run inside `amazonlinux:2023` Docker container
- Assuming no build errors or resource starvation, the full OSS build process now runs under 30 minutes.

Pull Request resolved: pytorch#1646

Reviewed By: shintaro-iwasaki

Differential Revision: D44157228

Pulled By: q10

fbshipit-source-id: 6403ea9955856157785c50837b0b8e4c0cd26d53
Summary:
Pull Request resolved: pytorch#1629

Replaces magic numbers with constexpr variables

Reviewed By: sryap

Differential Revision: D43776442

fbshipit-source-id: 5cef7566816f8730f5daa08948ee3260367787aa
Summary:
Pull Request resolved: pytorch#1645

as in title

Reviewed By: jianyuh

Differential Revision: D44096435

fbshipit-source-id: a7a87a14ffecc2fb6e0be74d199d385357946672
Summary:
Pull Request resolved: pytorch#1643

This diff optimizes the jagged_dense_bmm operator with the following optimizations:
* tiling across thread blocks, and use GPU shared memory for thread block
* tiling across threads within a thread block, and use registers for each thread

Reviewed By: brad-mengchi

Differential Revision: D43674845

fbshipit-source-id: 85f0abf89fa958f79636ef59c3070a1c569b73c2
Summary:
This patch fixes test failures on AMD GPUs.

1. Remove `__restrict__ `. I don't think it is needed even for CUDA, but it confuses HIPCC.
2. Use `uint32_t` instead of `auto`: old ROCm (including ROCm <= 5.3) does not have `+=` operator for the type of `blockIdx.z`, causing a compilation error. We observed that this issue is fixed in ROCm 5.4.3, but let's use `uint32_t` for now. We should revisit and use `auto` later. See this for details: ROCm/hipamd@86a1634

Pull Request resolved: pytorch#1655

Test Plan: GitHub Actions' AMD CI

Reviewed By: q10, brad-mengchi

Differential Revision: D44242622

Pulled By: shintaro-iwasaki

fbshipit-source-id: c9b88155ebf1ed881b2d03e3be0e8991b4b30174
Summary:
Pull Request resolved: pytorch#1656

wushirong reported the failure on https://fburl.com/code/hae91ra7 .

- The embedding config is from  f418615450 .
- `max_int8_128b_rows` is 10 --> D = 1280

Our embedding dim has grown to 1024 + ?

Note that the static shared memory can only go up to 48 KB:

> Kernels relying on shared memory allocations over 48 KB per block are architecture-specific, as such they must use dynamic shared memory (rather than statically sized arrays)

in https://docs.nvidia.com/cuda/cuda-c-programming-guide/

for ptx shared mem error:
```
[2023-03-21T22:04:33.899-07:00] ptxas error   : Entry function '_ZN4nbit60INT8_split_embedding_codegen_forward_weighted_kernel_small_LIiN3c104HalfELm2ELm4ELm4E
Lm8ELm16ELb1EEEvN2at27GenericPackedTensorAccessorIhLm1ENS3_17RestrictPtrTraitsElEES6_NS4_IiLm1ES5_iEENS4_IlLm1ES5_iEENS4_IhLm1ES5_iEES7_N10fbgemm_gpu12FixedDiv
isorENS4_IT_Lm1ES5_iEESD_llNS4_IfLm1ES5_iEENS4_IT0_Lm2ES5_iEENS4_IhLm2ES5_lEES7_' uses too much shared data (0x10080 bytes, 0xc000 max)
```

Currently we reduce `InputRowsInFlight` to bypass the issue (the static shared memory used in the kernel is
```
  typedef uint4 AllBuffers[WarpsPerBlock][OutputRowsPerThread][InputRowsInFlight][NumUint4LoadsPerRow];
  __shared__ AllBuffers buffers;
```

Long term, we can change the static shared memory to dynamic shared memory, and increase the shared memory size to be 64 KB +.

Reviewed By: wushirong

Differential Revision: D44270081

fbshipit-source-id: 367ae838ea073dfe58d859ea3c0e6c7190beca6a
Summary:
- Containerize the remaining FBGEMM_GPU CI jobs
- Add Conda cleanups to make PyTorch and CUDA installs more reliable
- Update post-install checks for PyTorch to work with ROCm
- Update the CI to continue running on jobs that fail on just a few variants
- Use PIP to install PyTorch GPU nightly as the nightly packages show up in PIP more reliably than in Conda

Pull Request resolved: pytorch#1658

Reviewed By: shintaro-iwasaki

Differential Revision: D44306708

Pulled By: q10

fbshipit-source-id: 5f0862f18eca7151759d9983aa97849222539d7d
Summary:
Pull Request resolved: pytorch#1647

Implement `tbe_input_combine_with_length` for GPU.  The operator takes
3 lists of tensors (`indices`, `lengths`, and `per_sample_weights`)
and concatenates each one into a single tensor.  Implicit type casting
is also performed if the input types are different from the output
types.  `indices` and `lengths` tensors can be of type `int32_t` or
`int64_t`.  The outputs for `indices` concatenation and `lengths`
concatenation are fixed to `int32_t`.  `per_sample_weights` must be
`float`.

Reviewed By: bangshengtang

Differential Revision: D44076452

fbshipit-source-id: f6ce8628e7345093bb55835f9523870c2914516f
Summary:
Pull Request resolved: pytorch#1644

This diff optimizes the jagged_jagged_bmm operator using tiling across thread blocks and GPU shared memory.

Reviewed By: brad-mengchi

Differential Revision: D44029528

fbshipit-source-id: fa5cd5a26893f935427bce5efb7dfcc731c3f47d
Summary:
Pull Request resolved: pytorch#1660

When enabled emulate cache miss, it caused illegal memory access, if we're using more than one GPU. It turns out that previous diff didn't specify device within emulate_cache_miss kernel.

This diff fixes it. In addition, cleaned up a bit (e.g., no need to used index_t based kernel launch for emulate_cache_miss kernel, as lxu_cache_locations is always with int32_t.

Reviewed By: sryap, YuzeDaiMeta

Differential Revision: D44340131

fbshipit-source-id: d99ba2364e9030cbca6c1166e578d24d99646bb1
Summary:
- Add C++17 support for the entire FBGEMM_GPU build
- Add C++17 support for the entire FBGEMM build
- Update FBGEMM tests and benchmarks to be C++17-compatible
- Make FBGEMM builds output more logging
- Cherry-pick code changes from D43776442 v4 now that C++17 is fully supported

Pull Request resolved: pytorch#1652

Reviewed By: shintaro-iwasaki

Differential Revision: D44287321

Pulled By: q10

fbshipit-source-id: 4bf2bcf66d528939865d42b6deafc470bee55d17
Summary:
Pull Request resolved: pytorch#1659

This diff aims to reduce the build time and libary size of
`//deeplearning/fbgemm/fbgemm_gpu/codegen:embedding_ops`.

The diff modifies the build target to generate and compile only the
necessary files. This is based on the fact that CPU and GPU do not
support all optimizers in `SplitTBE`.  (Before this diff, all optimizers
were generated and compiled for both CPU and GPU.)

The following is the list of supported optimizers

|OptimType|Generated optimizer|Supported on CPU|Supported on GPU|
|EXACT_ADAGRAD|adagrad|x|x|
|EXACT_ROWWISE_ADAGRAD|rowwise_adagrad_with_counter|x|x|
||rowwise_adagrad|x|x|
|EXACT_ROWWISE_WEIGHTED_ADAGRAD|rowwise_weighted_adagrad|x|x|
|EXACT_SGD|sgd|x|x|
|SGD|approx_sgd|x|x|
|ROWWISE_ADAGRAD|approx_rowwise_adagrad_with_counter|x||
||approx_rowwise_adagrad|x||
|ADAM|adam||x|
|LAMB|lamb||x|
|LARS_SGD|lars_sgd||x|
|PARTIAL_ROWWISE_ADAM|partial_rowwise_adam||x|
|PARTIAL_ROWWISE_LAMB|partial_rowwise_lamb||x|
|-|rowwise_adagrad_with_weight_decay|||
|-|approx_rowwise_adagrad_with_weight_decay|||
Note: x = supported

Reviewed By: jianyuh

Differential Revision: D44326540

fbshipit-source-id: 02413256b4a675f13ada8e8820820cb5112cb405
Summary:
- Rewrite the documentation builds job to use the build infrastructure tooling
- Rename workflow files for consistency

Pull Request resolved: pytorch#1673

Reviewed By: shintaro-iwasaki

Differential Revision: D44472660

Pulled By: q10

fbshipit-source-id: 60434c1f7098b7efa8c750133bb22f14fc98d5dc
Summary:
Pull Request resolved: pytorch#1675

Original commit changeset: 02413256b4a6

Original Phabricator Diff: D44326540

Reviewed By: q10, jianyuh

Differential Revision: D44475251

fbshipit-source-id: 5be66944a833e03a2737fc6d1baaa5c351455b2c
Summary:
Pull Request resolved: pytorch#1633

Prepare `bounds_check_indices` for variable batch size TBE (VBE).

- Update the frontend API to accept VBE args
- Update the backend logic to process VBE data

Reviewed By: jianyuh

Differential Revision: D43253703

fbshipit-source-id: 2870f0c41a96265650281a9b6362d4e6dc48009b
pytorch#1667)

Summary:
Pull Request resolved: pytorch#1667

As title. This diff moves pruning/index_remapping support to embedding inplace update files.

Reviewed By: jianyuh

Differential Revision: D44409419

fbshipit-source-id: 93fc91d83502eb95cb0feca2a8a03b003c336078
Summary:
Pull Request resolved: pytorch#1661

This diff optimizes jagged_softmax forward with more efficient reduction from cub library.

Reviewed By: brad-mengchi

Differential Revision: D44161021

fbshipit-source-id: bf2e059d14ef4d7ad311edac65155a463ba653ff
Summary:
Pull Request resolved: pytorch#1662

This diff optimizes jagged_softmax backward with more efficient reduction from cub library

Reviewed By: brad-mengchi

Differential Revision: D44205819

fbshipit-source-id: cd1d7a886d6ba68201dc1ad782c2e8cde7ff706b
Summary:
Pull Request resolved: pytorch#1674

improved multi-gpu all_to_one with:
	1. new intermediate hop selection taking advantage of distinct NVLinks
	2. overlapping of intermediate hop transfers with each-other and with direct-peer transfers

Reviewed By: doehyun

Differential Revision: D44285941

fbshipit-source-id: 0202083f04388b5ba60b8155809433f334993ef4
pytorch#1669)

Summary:
Pull Request resolved: pytorch#1669

Extract portions initializing the weights_placements/offsets tensors into separate functions and jit.export them.
SplitState is converted to a NamedTuple since we can't jit.script a dataclass that also holds an enum.

Reviewed By: houseroad

Differential Revision: D44338256

fbshipit-source-id: e1c12e5956f7217d51cd190958c3764d220e521d
q10 and others added 5 commits March 30, 2023 22:55
Summary:
- Clean up the ROCm test job and re-enable ROCm testing on the rocm instances.
- Update the build scripts framework to build FBGEMM_GPU against the correct hardware target that it is intended to be tested on.  One thing that was discovered was that if FBGEMM_GPU was built with `PYTORCH_ROCM_ARCH=gfx90a` but run on `gfx908` target, the tests will fail with a segfault.  While the failure is expected, the segfault can be unfriendly and confusing for users.
- Enable correct compilation of `merge_pooled_embeddings` operator under ROCm
- Fix existing code in `jagged_tensor_ops` from PR pytorch#1661 and pytorch#1662 that break its compilation under ROCm 5.3

Pull Request resolved: pytorch#1668

Reviewed By: shintaro-iwasaki

Differential Revision: D44453594

Pulled By: q10

fbshipit-source-id: 2030cd0e00c6ff9694c2783dfd62c31cf5543da2
…ts loading (pytorch#1676)

Summary:
Pull Request resolved: pytorch#1676

Export a function to reset the embedding specs by target location

Reviewed By: RoshanPAN, houseroad

Differential Revision: D44338258

fbshipit-source-id: 502733e9f3a164450a02656d2822492fbf69f994
…rch#1670)

Summary:
Pull Request resolved: pytorch#1670

ATT

Reviewed By: RoshanPAN, houseroad

Differential Revision: D44338257

fbshipit-source-id: c091666c7a4d294c283f5e3774d0494089fc3478
@liligwu liligwu requested a review from amathews-amd March 31, 2023 18:19
@liligwu liligwu self-assigned this Mar 31, 2023
@liligwu liligwu merged commit 642da31 into main Apr 3, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.