🌐 [i18n-KO] Translated debugging.md to Korean #26246

Merged
merged 3 commits into from
Sep 27, 2023
4 changes: 2 additions & 2 deletions docs/source/ko/_toctree.yml
@@ -141,8 +141,8 @@
title: Custom hardware for training
- local: in_translation
title: (In translation) Instantiating a big model
- local: in_translation
title: (In translation) Debugging
- local: debugging
title: Debugging
- local: hpo_train
title: Hyperparameter Search using Trainer API
- local: tf_xla
306 changes: 306 additions & 0 deletions docs/source/ko/debugging.md
@@ -0,0 +1,306 @@
<!--Copyright 2021 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->

# 디버깅 [[debugging]]

## Multi-GPU λ„€νŠΈμ›Œν¬ 문제 디버그 [[multigpu-network-issues-debug]]

`DistributedDataParallel` 및 닀쀑 GPUλ₯Ό μ‚¬μš©ν•˜μ—¬ ν›ˆλ ¨ν•˜κ±°λ‚˜ μΆ”λ‘ ν•  λ•Œ, ν”„λ‘œμ„ΈμŠ€ 및/λ˜λŠ” λ…Έλ“œ κ°„μ˜ μƒν˜Έ 톡신 λ¬Έμ œκ°€ λ°œμƒν•˜λŠ” 경우, λ‹€μŒ 슀크립트λ₯Ό μ‚¬μš©ν•˜μ—¬ λ„€νŠΈμ›Œν¬ 문제λ₯Ό 진단할 수 μžˆμŠ΅λ‹ˆλ‹€.

```bash
wget https://raw.githubusercontent.com/huggingface/transformers/main/scripts/distributed/torch-distributed-gpu-test.py
```

예λ₯Ό λ“€μ–΄, 2개의 GPUκ°€ μƒν˜Έ μž‘μš©ν•˜λŠ” 방식을 ν…ŒμŠ€νŠΈν•˜λ €λ©΄ λ‹€μŒμ„ μ‹€ν–‰ν•˜μ„Έμš”:

```bash
python -m torch.distributed.run --nproc_per_node 2 --nnodes 1 torch-distributed-gpu-test.py
```

If both processes can talk to each other and allocate GPU memory, each will print an "OK" status.

For more GPUs or nodes, adjust the arguments accordingly (for example `--nproc_per_node` and `--nnodes`).

You will find more details inside the diagnostics script, including a recipe for running it in a SLURM environment.

For an additional level of debugging, add the `NCCL_DEBUG=INFO` environment variable as follows:

```bash
NCCL_DEBUG=INFO python -m torch.distributed.run --nproc_per_node 2 --nnodes 1 torch-distributed-gpu-test.py
```

This will dump a lot of NCCL-related debug information, which you can then search online if it reports any problems. If you're not sure how to interpret the output, you can share the log file in an Issue.
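When the check has to be repeated for different GPU counts, the launch command can be assembled programmatically. The following is a minimal sketch and not part of the diagnostic script: `build_launch_command` is a hypothetical helper that only uses the `--nproc_per_node` and `--nnodes` flags and the `NCCL_DEBUG` variable shown above. Multi-node runs typically need additional rendezvous arguments that are not covered here.

```python
import os
import sys


def build_launch_command(script, nproc_per_node=2, nnodes=1, nccl_debug=False):
    """Return (argv, env) for launching `script` via torch.distributed.run."""
    argv = [
        sys.executable, "-m", "torch.distributed.run",
        "--nproc_per_node", str(nproc_per_node),
        "--nnodes", str(nnodes),
        script,
    ]
    env = dict(os.environ)  # copy, so the current process environment is untouched
    if nccl_debug:
        env["NCCL_DEBUG"] = "INFO"  # ask NCCL for verbose diagnostics
    return argv, env


argv, env = build_launch_command(
    "torch-distributed-gpu-test.py", nproc_per_node=2, nccl_debug=True
)
print(" ".join(argv[1:]))
# To actually launch it: subprocess.run(argv, env=env, check=True)
```

Returning the environment separately keeps `NCCL_DEBUG` scoped to the child process instead of changing it for the whole session.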



## Underflow and Overflow Detection [[underflow-and-overflow-detection]]


<Tip>

이 κΈ°λŠ₯은 ν˜„μž¬ PyTorchμ—μ„œλ§Œ μ‚¬μš©ν•  수 μžˆμŠ΅λ‹ˆλ‹€.

</Tip>

<Tip>

닀쀑 GPU ν›ˆλ ¨μ„ μœ„ν•΄μ„œλŠ” DDP (`torch.distributed.launch`)κ°€ ν•„μš”ν•©λ‹ˆλ‹€.

</Tip>

<Tip>

이 κΈ°λŠ₯은 `nn.Module`을 기반으둜 ν•˜λŠ” λͺ¨λΈκ³Ό ν•¨κ»˜ μ‚¬μš©ν•  수 μžˆμŠ΅λ‹ˆλ‹€.

</Tip>

If you start getting `loss=NaN`, or the model exhibits other abnormal behavior due to `inf` or `nan` in activations or weights, you need to find where the first underflow or overflow happens and what led to it. Luckily, you can do that easily by activating a special module that performs the detection automatically.

If you're using [`Trainer`], just add the following to your usual command-line arguments:

```bash
--debug underflow_overflow
```

Alternatively, pass `debug="underflow_overflow"` when creating the [`TrainingArguments`] object.

If you're using your own training loop or another Trainer, you can accomplish the same with:

```python
from transformers.debug_utils import DebugUnderflowOverflow

debug_overflow = DebugUnderflowOverflow(model)
```

[`~debug_utils.DebugUnderflowOverflow`] inserts hooks into the model that, immediately after each forward call, test the input and output variables as well as the corresponding module's weights. As soon as `inf` or `nan` is detected in at least one element of the activations or weights, the program aborts and prints a report like the following (this one was captured with `google/mt5-small` under fp16 mixed precision):

```
Detected inf/nan during batch_number=0
Last 21 forward frames:
abs min  abs max  metadata
                  encoder.block.1.layer.1.DenseReluDense.dropout Dropout
0.00e+00 2.57e+02 input[0]
0.00e+00 2.85e+02 output
[...]
                  encoder.block.2.layer.0 T5LayerSelfAttention
6.78e-04 3.15e+03 input[0]
2.65e-04 3.42e+03 output[0]
             None output[1]
2.25e-01 1.00e+04 output[2]
                  encoder.block.2.layer.1.layer_norm T5LayerNorm
8.69e-02 4.18e-01 weight
2.65e-04 3.42e+03 input[0]
1.79e-06 4.65e+00 output
                  encoder.block.2.layer.1.DenseReluDense.wi_0 Linear
2.17e-07 4.50e+00 weight
1.79e-06 4.65e+00 input[0]
2.68e-06 3.70e+01 output
                  encoder.block.2.layer.1.DenseReluDense.wi_1 Linear
8.08e-07 2.66e+01 weight
1.79e-06 4.65e+00 input[0]
1.27e-04 2.37e+02 output
                  encoder.block.2.layer.1.DenseReluDense.dropout Dropout
0.00e+00 8.76e+03 input[0]
0.00e+00 9.74e+03 output
                  encoder.block.2.layer.1.DenseReluDense.wo Linear
1.01e-06 6.44e+00 weight
0.00e+00 9.74e+03 input[0]
3.18e-04 6.27e+04 output
                  encoder.block.2.layer.1.DenseReluDense T5DenseGatedGeluDense
1.79e-06 4.65e+00 input[0]
3.18e-04 6.27e+04 output
                  encoder.block.2.layer.1.dropout Dropout
3.18e-04 6.27e+04 input[0]
0.00e+00      inf output
```

예제 좜λ ₯은 κ°„λž΅μ„±μ„ μœ„ν•΄ 쀑간 뢀뢄이 잘렀 μžˆμŠ΅λ‹ˆλ‹€.

두 번째 열은 μ ˆλŒ€μ μœΌλ‘œ κ°€μž₯ 큰 μš”μ†Œμ˜ 값이며, λ”°λΌμ„œ λ§ˆμ§€λ§‰ λͺ‡ 개의 ν”„λ ˆμž„μ„ μžμ„Ένžˆ μ‚΄νŽ΄λ³΄λ©΄ μž…λ ₯κ³Ό 좜λ ₯이 `1e4` λ²”μœ„μ— μžˆμŒμ„ μ•Œ 수 μžˆμŠ΅λ‹ˆλ‹€. λ”°λΌμ„œ 이 ν›ˆλ ¨μ€ `fp16` ν˜Όν•© μ •λ°€λ„λ‘œ μˆ˜ν–‰λ  λ•Œ κ°€μž₯ λ§ˆμ§€λ§‰ λ‹¨κ³„μ—μ„œ μ˜€λ²„ν”Œλ‘œμš°κ°€ λ°œμƒν–ˆμŠ΅λ‹ˆλ‹€ (`fp16`μ—μ„œ `inf` μ΄μ „μ˜ κ°€μž₯ 큰 μˆ«μžλŠ” `64e3`μž…λ‹ˆλ‹€). `fp16` μ•„λž˜μ—μ„œ μ˜€λ²„ν”Œλ‘œμš°λ₯Ό ν”Όν•˜κΈ° μœ„ν•΄μ„œλŠ” ν™œμ„±ν™”λŠ” `1e4`보닀 훨씬 μž‘μ•„μ•Ό ν•©λ‹ˆλ‹€. μ™œλƒν•˜λ©΄ `1e4 * 1e4 = 1e8`이기 λ•Œλ¬Έμ— 큰 ν™œμ„±ν™”μ™€μ˜ ν–‰λ ¬ 곱은 수치적인 μ˜€λ²„ν”Œλ‘œμš° 쑰건으둜 μ΄μ–΄μ§ˆ κ²ƒμž…λ‹ˆλ‹€.

At the very start of the trace you can see at which batch number the problem occurred (here `Detected inf/nan during batch_number=0` means the problem occurred on the first batch).

Each reported frame starts by declaring the fully qualified entry for the module that the frame reports on. Looking at just this frame:

```
                  encoder.block.2.layer.1.layer_norm T5LayerNorm
8.69e-02 4.18e-01 weight
2.65e-04 3.42e+03 input[0]
1.79e-06 4.65e+00 output
```

Here, `encoder.block.2.layer.1.layer_norm` indicates the layer norm of `layer.1` inside `block.2` of the encoder (the indices are zero-based), and `T5LayerNorm` is the class whose `forward` was called.
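To make the two numeric columns concrete, here is a minimal sketch of how one such row could be produced. A flat list of Python floats stands in for a tensor, and `abs_min_max_line` is a hypothetical name; this illustrates the format only and is not the library's code.

```python
def abs_min_max_line(values, label):
    """Format one report row: absolute min, absolute max, then the label."""
    magnitudes = [abs(v) for v in values]
    return f"{min(magnitudes):8.2e} {max(magnitudes):8.2e} {label}"


# The sign is ignored: -3420.0 contributes 3.42e+03 to the "abs max" column.
row = abs_min_max_line([-3420.0, 0.000265, 1.5], "input[0]")
print(row)
```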

이 λ³΄κ³ μ„œμ˜ λ§ˆμ§€λ§‰ λͺ‡ 개 ν”„λ ˆμž„μ„ μ‚΄νŽ΄λ³΄κ² μŠ΅λ‹ˆλ‹€:

```
Detected inf/nan during batch_number=0
Last 21 forward frames:
abs min  abs max  metadata
[...]
                  encoder.block.2.layer.1.DenseReluDense.wi_0 Linear
2.17e-07 4.50e+00 weight
1.79e-06 4.65e+00 input[0]
2.68e-06 3.70e+01 output
                  encoder.block.2.layer.1.DenseReluDense.wi_1 Linear
8.08e-07 2.66e+01 weight
1.79e-06 4.65e+00 input[0]
1.27e-04 2.37e+02 output
                  encoder.block.2.layer.1.DenseReluDense.wo Linear
1.01e-06 6.44e+00 weight
0.00e+00 9.74e+03 input[0]
3.18e-04 6.27e+04 output
                  encoder.block.2.layer.1.DenseReluDense T5DenseGatedGeluDense
1.79e-06 4.65e+00 input[0]
3.18e-04 6.27e+04 output
                  encoder.block.2.layer.1.dropout Dropout
3.18e-04 6.27e+04 input[0]
0.00e+00      inf output
```

The last frame reports on the `Dropout.forward` function, with the first entry for its only input and the second for its only output. The frame name, `encoder.block.2.layer.1.dropout`, shows that it is the `dropout` attribute of `layer.1` in `block.2` of the encoder, applied to the output of `DenseReluDense`, and the report header shows that this happened during the very first batch. Finally, the absolute largest input element was `6.27e+04`, and the absolute largest output element was `inf`.

Here you can see that `T5DenseGatedGeluDense.forward` produced output activations whose absolute maximum was around 62.7K, which is very close to fp16's upper limit of 64K. In the next frame, `Dropout` zeroes some of the elements and rescales the remaining ones, which pushes the absolute maximum above 64K and produces an overflow (`inf`).
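The arithmetic behind that last step can be sketched in a few lines. The dropout rate of `0.1` is an assumption chosen for illustration (it is not stated in the report), and the sketch assumes that the largest element is among those that survive dropout.

```python
FP16_MAX = 65504.0      # largest finite fp16 value
dropout_rate = 0.1      # assumed for illustration; not taken from the report
abs_max_input = 6.27e4  # abs max of the Dropout input in the last frame

# Inverted dropout multiplies every kept element by 1 / (1 - p)
scale = 1.0 / (1.0 - dropout_rate)
abs_max_output = abs_max_input * scale

print(f"scale factor: {scale:.3f}")
print(f"rescaled abs max: {abs_max_output:.3e}")
print("overflows fp16:", abs_max_output > FP16_MAX)
```

Even a modest rescaling is enough here because the input was already within a few percent of the fp16 maximum.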

As you can see, it is the previous frames that we need to look into once the numbers start getting very large for fp16.

Let's match the report to the code in `models/t5/modeling_t5.py`:

```python
class T5DenseGatedGeluDense(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.wi_0 = nn.Linear(config.d_model, config.d_ff, bias=False)
        self.wi_1 = nn.Linear(config.d_model, config.d_ff, bias=False)
        self.wo = nn.Linear(config.d_ff, config.d_model, bias=False)
        self.dropout = nn.Dropout(config.dropout_rate)
        self.gelu_act = ACT2FN["gelu_new"]

    def forward(self, hidden_states):
        hidden_gelu = self.gelu_act(self.wi_0(hidden_states))
        hidden_linear = self.wi_1(hidden_states)
        hidden_states = hidden_gelu * hidden_linear
        hidden_states = self.dropout(hidden_states)
        hidden_states = self.wo(hidden_states)
        return hidden_states
```

이제 `dropout` 호좜과 μ΄μ „μ˜ λͺ¨λ“  ν˜ΈμΆœμ„ μ‰½κ²Œ 확인할 수 μžˆμŠ΅λ‹ˆλ‹€.

κ°μ§€λŠ” `forward` ν›„ν¬μ—μ„œ λ°œμƒν•˜λ―€λ‘œ, μ΄λŸ¬ν•œ λ³΄κ³ μ„œλŠ” 각 `forward`κ°€ λ°˜ν™˜λœ 직후에 μ¦‰μ‹œ 좜λ ₯λ©λ‹ˆλ‹€.

전체 λ³΄κ³ μ„œλ‘œ λŒμ•„κ°€μ„œ λ¬Έμ œμ— λŒ€ν•œ 쑰치 및 μˆ˜μ •μ„ ν•˜λ €λ©΄, μˆ«μžκ°€ μ¦κ°€ν•˜κΈ° μ‹œμž‘ν•œ λͺ‡ 개의 ν”„λ ˆμž„ μœ„λ‘œ μ΄λ™ν•΄μ„œ μ—¬κΈ°μ„œ `fp32` λͺ¨λ“œλ‘œ μ „ν™˜ν•΄μ•Ό ν•©λ‹ˆλ‹€. μ΄λ ‡κ²Œ ν•΄μ•Ό μˆ«μžκ°€ κ³±ν•΄μ§€κ±°λ‚˜ ν•©μ³μ§ˆ λ•Œ μ˜€λ²„ν”Œλ‘œμš°λ˜μ§€ μ•Šμ„ κ°€λŠ₯성이 λ†’μŠ΅λ‹ˆλ‹€. λ¬Όλ‘  λ‹€λ₯Έ 해결책도 μžˆμ„ 수 μžˆμŠ΅λ‹ˆλ‹€. 예λ₯Ό λ“€μ–΄, `amp`κ°€ ν™œμ„±ν™”λœ 경우 μΌμ‹œμ μœΌλ‘œ 끄고 μ›λž˜μ˜ `forward`λ₯Ό λ„μš°λ―Έ 래퍼둜 μ΄λ™ν•œ ν›„ λ‹€μŒκ³Ό 같이 ν•  수 μžˆμŠ΅λ‹ˆλ‹€:

```python
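import torch


# Both functions are methods of the module being patched (note the `self` argument).
def _forward(self, hidden_states):
    hidden_gelu = self.gelu_act(self.wi_0(hidden_states))
    hidden_linear = self.wi_1(hidden_states)
    hidden_states = hidden_gelu * hidden_linear
    hidden_states = self.dropout(hidden_states)
    hidden_states = self.wo(hidden_states)
    return hidden_states


def forward(self, hidden_states):
    # Run the original computation with autocast (amp) disabled when it is active.
    # Inputs that are already fp16 may additionally need an explicit `.float()` cast.
    if torch.is_autocast_enabled():
        with torch.cuda.amp.autocast(enabled=False):
            return self._forward(hidden_states)
    else:
        return self._forward(hidden_states)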
```

Since the automatic detector only reports on the inputs and outputs of full frames, once you know where to look you may also want to analyze the intermediate stages of a specific `forward` function. In that case you can use the `detect_overflow` helper function to inject the detector where you want it, for example:

```python
from transformers.debug_utils import detect_overflow


class T5LayerFF(nn.Module):
    [...]

    def forward(self, hidden_states):
        forwarded_states = self.layer_norm(hidden_states)
        detect_overflow(forwarded_states, "after layer_norm")
        forwarded_states = self.DenseReluDense(forwarded_states)
        detect_overflow(forwarded_states, "after DenseReluDense")
        return hidden_states + self.dropout(forwarded_states)
```

Here we added two of these calls, so we now track whether `inf` or `nan` was detected in `forwarded_states` somewhere in between.

Actually, the detector already reports these, because each call in the example above is an `nn.Module`, but this is how you would do it if you had some local direct calculations.
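For intuition about what such a check does on "local direct calculations", here is a dependency-free sketch. Plain Python lists stand in for tensors, and `has_inf_or_nan` and `check_finite` are hypothetical names; in real model code you would call `detect_overflow` on the tensor as shown above.

```python
import math


def has_inf_or_nan(values):
    """Return True if any element is inf, -inf, or nan."""
    return any(math.isinf(v) or math.isnan(v) for v in values)


def check_finite(values, ctx):
    """Print a short notice when a non-finite value shows up; return the flag."""
    detected = has_inf_or_nan(values)
    if detected:
        print(f"{ctx}: inf/nan detected")
    return detected


gate = [1.5e2, 3.0e2]
linear = [2.0e2, float("inf")]  # pretend an earlier step already overflowed
product = [g * l for g, l in zip(gate, linear)]  # a local direct calculation

check_finite(gate, "after gate")        # nothing is printed
check_finite(product, "after product")  # reports the problem
```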

Additionally, if you instantiate the debugger in your own code, you can adjust the number of frames printed from its default, for example:

```python
from transformers.debug_utils import DebugUnderflowOverflow

debug_overflow = DebugUnderflowOverflow(model, max_frames_to_save=100)
```
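The effect of `max_frames_to_save` can be pictured as a bounded buffer that keeps only the most recent frames, which is why the report above starts with "Last 21 forward frames". The sketch below assumes such a buffer and uses a tiny limit for readability; it does not reproduce the internals of the class.

```python
from collections import deque

max_frames_to_save = 3  # small value so the effect is easy to see
frames = deque(maxlen=max_frames_to_save)

for step in range(6):
    frames.append(f"frame {step}")  # the oldest frame is dropped automatically

print(list(frames))
```

A larger limit gives more context before the failure, at the cost of a longer report.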

### Specific batch absolute min and max value tracing [[specific-batch-absolute-min-and-max-value-tracing]]

The same debugging class can be used for per-batch tracing with the underflow/overflow detection feature turned off.

Let's say you want to watch the absolute min and max values for all the ingredients of each `forward` call of a given batch, and only do that for batches 1 and 3. Then you instantiate this class as:

```python
debug_overflow = DebugUnderflowOverflow(model, trace_batch_nums=[1, 3])
```

그러면 이제 배치 1κ³Ό 3 전체가 μ–Έλ”ν”Œλ‘œμš°/μ˜€λ²„ν”Œλ‘œμš° 감지기와 λ™μΌν•œ ν˜•μ‹μœΌλ‘œ μΆ”μ λ©λ‹ˆλ‹€.

λ°°μΉ˜λŠ” 0λΆ€ν„° μ‹œμž‘ν•©λ‹ˆλ‹€.

μ΄λŠ” ν”„λ‘œκ·Έλž¨μ΄ νŠΉμ • 배치 번호 이후에 μ˜€μž‘λ™ν•˜κΈ° μ‹œμž‘ν•˜λŠ” 것을 μ•Œκ³  μžˆλŠ” κ²½μš°μ— μœ μš©ν•©λ‹ˆλ‹€. κ·Έλ ‡κΈ° λ•Œλ¬Έμ— ν•΄λ‹Ή μ˜μ—­μœΌλ‘œ λ°”λ‘œ 이동할 수 μžˆμŠ΅λ‹ˆλ‹€. 이런 ꡬ성에 λŒ€ν•œ μƒ˜ν”Œ μΆ•μ†Œλœ 좜λ ₯은 λ‹€μŒκ³Ό κ°™μŠ΅λ‹ˆλ‹€.

```
                  *** Starting batch number=1 ***
abs min  abs max  metadata
                  shared Embedding
1.01e-06 7.92e+02 weight
0.00e+00 2.47e+04 input[0]
5.36e-05 7.92e+02 output
[...]
                  decoder.dropout Dropout
1.60e-07 2.27e+01 input[0]
0.00e+00 2.52e+01 output
                  decoder T5Stack
     not a tensor output
                  lm_head Linear
1.01e-06 7.92e+02 weight
0.00e+00 1.11e+00 input[0]
6.06e-02 8.39e+01 output
                   T5ForConditionalGeneration
     not a tensor output

                  *** Starting batch number=3 ***
abs min  abs max  metadata
                  shared Embedding
1.01e-06 7.92e+02 weight
0.00e+00 2.78e+04 input[0]
5.36e-05 7.92e+02 output
[...]
```

Here you get a huge number of frames dumped, as many as there were forward calls in your model, so it may or may not be what you want, but sometimes it is easier to use for debugging than a normal debugger. For example, if a problem starts happening at batch number 150, you can dump traces for batches 149 and 150 and compare where the numbers started to diverge.

You can also specify the batch number after which to stop the training, with:

```python
debug_overflow = DebugUnderflowOverflow(model, trace_batch_nums=[1, 3], abort_after_batch_num=3)
```
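How the two options interact can be sketched with a simulated loop. This is an illustration under stated assumptions, not the class's implementation: `run_batches` is a hypothetical function, batch numbers are 0-indexed as noted above, and training is assumed to stop once the batch with the given number has completed.

```python
def run_batches(num_batches, trace_batch_nums=(), abort_after_batch_num=None):
    """Simulate a training loop; return (traced batches, batches completed)."""
    traced = []
    completed = 0
    for batch_number in range(num_batches):  # batch numbers start at 0
        if batch_number in trace_batch_nums:
            traced.append(batch_number)  # this batch would be dumped in full
        completed += 1
        if abort_after_batch_num is not None and batch_number >= abort_after_batch_num:
            break  # stop once the requested batch has finished
    return traced, completed


traced, completed = run_batches(10, trace_batch_nums=[1, 3], abort_after_batch_num=3)
print(traced, completed)
```

With these settings the second and fourth batches are traced and the run ends after four batches, so no time is spent on batches that are not of interest.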