Commit c5b2847

intermediate_source/scaled_dot_product_attention_tutorial.py translation (#773)
1 parent d1bcce9 commit c5b2847


intermediate_source/scaled_dot_product_attention_tutorial.py

Lines changed: 82 additions & 91 deletions
@@ -1,77 +1,74 @@
"""
(Beta) Implementing High-Performance Transformers with Scaled Dot Product Attention (SDPA)
==========================================================================================


**Author:** `Driss Guessous <https://github.com/drisspg>`_
**Translator:** `이강희 <https://github.com/khleexv>`_

"""

######################################################################
# Summary
# ~~~~~~~~
#
# In this tutorial, we want to highlight a new ``torch.nn.functional`` function
# that can be helpful for implementing transformer architectures. The
# function is named ``torch.nn.functional.scaled_dot_product_attention``.
# For a detailed description of the function, see the `PyTorch documentation <https://pytorch.org/docs/master/generated/torch.nn.functional.scaled_dot_product_attention.html#torch.nn.functional.scaled_dot_product_attention>`__.
# This function has already been incorporated into ``torch.nn.MultiheadAttention`` and ``torch.nn.TransformerEncoderLayer``.
#
# Overview
# ~~~~~~~~~
# At a high level, this PyTorch function calculates the
# scaled dot product attention (SDPA) between query, key, and value according to
# the definition found in the paper `Attention is all you
# need <https://arxiv.org/abs/1706.03762>`__. While this function can
# be written in PyTorch using existing functions, a fused implementation can provide
# large performance benefits over a naive implementation.
#
# Fused implementations
# ~~~~~~~~~~~~~~~~~~~~~~
#
# For CUDA tensor inputs, the function will dispatch into one of the following
# implementations:
#
# * `FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness <https://arxiv.org/abs/2205.14135>`__
# * `Memory-Efficient Attention <https://github.com/facebookresearch/xformers>`__
# * A PyTorch implementation defined in C++
#
# .. note::
#
#   This tutorial requires PyTorch 2.0.0 or later.
#

import torch
import torch.nn as nn
import torch.nn.functional as F
device = "cuda" if torch.cuda.is_available() else "cpu"

# Example usage:
query, key, value = torch.randn(2, 3, 8, device=device), torch.randn(2, 3, 8, device=device), torch.randn(2, 3, 8, device=device)
F.scaled_dot_product_attention(query, key, value)
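
# As a hedged illustration of what the fused kernels compute (not part of the original
# tutorial), the naive PyTorch equivalent of the call above is roughly:
import math
attn_weight = torch.softmax(query @ key.transpose(-2, -1) / math.sqrt(query.size(-1)), dim=-1)
naive_out = attn_weight @ value
# With no mask, no dropout, and default scaling, ``naive_out`` matches the fused result
# up to numerical tolerance.
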
######################################################################
# Explicit Dispatcher Control
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~
#
# While the function will implicitly dispatch to one of the three
# implementations, the user can also explicitly control the dispatch via
# the use of a context manager. This context manager allows users to
# explicitly disable certain implementations. If a user wants to ensure
# the function is indeed using the fastest implementation for their
# specific inputs, the context manager can be used to sweep through
# measuring performance.
#

# Let's define a helpful benchmarking function:
import torch.utils.benchmark as benchmark
def benchmark_torch_function_in_microseconds(f, *args, **kwargs):
    t0 = benchmark.Timer(
        stmt="f(*args, **kwargs)", globals={"args": args, "kwargs": kwargs, "f": f}
    )
    return t0.blocked_autorange().mean * 1e6

# Let's define the hyperparameters of our input
batch_size = 32
max_sequence_len = 1024
num_heads = 32
@@ -85,7 +82,7 @@ def benchmark_torch_function_in_microseconds(f, *args, **kwargs):

print(f"The default implementation runs in {benchmark_torch_function_in_microseconds(F.scaled_dot_product_attention, query, key, value):.3f} microseconds")

# Let's explore the speed of each of the three implementations
from torch.backends.cuda import sdp_kernel, SDPBackend

# Helpful arguments mapper
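
# The mapper and the per-backend timing calls that follow are not shown in this hunk.
# A minimal, hedged sketch of such a sweep (the dictionary and loop below are
# illustrative, not necessarily the tutorial's exact code) might look like:
backend_map = {
    SDPBackend.MATH: {"enable_math": True, "enable_flash": False, "enable_mem_efficient": False},
    SDPBackend.FLASH_ATTENTION: {"enable_math": False, "enable_flash": True, "enable_mem_efficient": False},
    SDPBackend.EFFICIENT_ATTENTION: {"enable_math": False, "enable_flash": False, "enable_mem_efficient": True},
}

for backend, flags in backend_map.items():
    with sdp_kernel(**flags):
        try:
            t = benchmark_torch_function_in_microseconds(F.scaled_dot_product_attention, query, key, value)
            print(f"{backend} runs in {t:.3f} microseconds")
        except RuntimeError:
            # Not every backend supports every device, dtype, or input shape.
            print(f"{backend} is not supported for this input.")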
@@ -114,24 +111,22 @@ def benchmark_torch_function_in_microseconds(f, *args, **kwargs):


######################################################################
# Hardware dependence
# ~~~~~~~~~~~~~~~~~~~
#
# Depending on what machine you ran the above cell on and what hardware is
# available, your results might be different.
# - If you don't have a GPU and are running on CPU then the context manager
#   will have no effect and all three runs should return similar timings.
# - Depending on what compute capability your graphics card supports,
#   flash attention or memory-efficient attention might have failed.


######################################################################
# Causal Self Attention
# ~~~~~~~~~~~~~~~~~~~~~
#
# Below is an example implementation of a multi-headed causal self
# attention block inspired by the
# `Andrej Karpathy NanoGPT <https://github.com/karpathy/nanoGPT>`__ repository.
#

class CausalSelfAttention(nn.Module):
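
# (The body of ``CausalSelfAttention`` is not shown in this hunk. As a hedged,
# standalone sketch of the core idea -- not the tutorial's exact module -- a
# NanoGPT-style causal block boils down to one SDPA call with ``is_causal=True``;
# the names below are illustrative.)
#
# ::
#
#    def causal_sdpa_sketch(x, c_attn, c_proj, num_heads):
#        # x: (batch, seq_len, embed_dim); c_attn: Linear(C, 3 * C); c_proj: Linear(C, C)
#        B, T, C = x.size()
#        q, k, v = c_attn(x).chunk(3, dim=-1)
#        # split heads: (batch, num_heads, seq_len, head_dim)
#        q = q.view(B, T, num_heads, C // num_heads).transpose(1, 2)
#        k = k.view(B, T, num_heads, C // num_heads).transpose(1, 2)
#        v = v.view(B, T, num_heads, C // num_heads).transpose(1, 2)
#        # the causal mask is applied inside the fused kernel when is_causal=True
#        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
#        y = y.transpose(1, 2).contiguous().view(B, T, C)
#        return c_proj(y)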
@@ -187,12 +182,13 @@ def forward(self, x):


#####################################################################
# ``NestedTensor`` and Dense tensor support
# -----------------------------------------
#
# SDPA supports both ``NestedTensor`` and Dense tensor inputs. ``NestedTensors`` handle the case where the input is a batch of variable length sequences
# without needing to pad each sequence to the maximum length in the batch. For more information about ``NestedTensors`` see
# `torch.nested <https://pytorch.org/docs/stable/nested.html>`__ and the `NestedTensors Tutorial <https://tutorials.pytorch.kr/prototype/nestedtensor.html>`__.
#

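
# As a quick, hedged illustration (not part of the original file): a ``NestedTensor``
# holding two sequences of different lengths, laid out as (num_heads, seq_len, head_dim)
# per element, can be passed to SDPA directly; backend support for nested inputs
# depends on the device, so the call is guarded here.
nt_q = torch.nested.nested_tensor([
    torch.randn(8, 10, 16, device=device),  # sequence of length 10
    torch.randn(8, 4, 16, device=device),   # sequence of length 4
])
try:
    nt_out = F.scaled_dot_product_attention(nt_q, nt_q, nt_q)
except RuntimeError as err:
    print("Nested inputs are not supported by the active backend:", err)
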
import random
@@ -236,7 +232,7 @@ def generate_rand_batch(
random_nt, _ = generate_rand_batch(32, 512, embed_dimension, pad_percentage=0.5, dtype=dtype, device=device)
random_dense, _ = generate_rand_batch(32, 512, embed_dimension, pad_percentage=None, dtype=dtype, device=device)

# Currently the fused implementations don't support ``NestedTensor`` for training
model.eval()

with sdp_kernel(**backend_map[SDPBackend.FLASH_ATTENTION]):
@@ -248,15 +244,14 @@ def generate_rand_batch(


######################################################################
# Using SDPA with ``torch.compile``
# =================================
#
# With the release of PyTorch 2.0, a new feature called
# ``torch.compile()`` has been introduced, which can provide
# significant performance improvements over eager mode.
# Scaled dot product attention is fully composable with ``torch.compile()``.
# To demonstrate this, let's compile the ``CausalSelfAttention`` module using
# ``torch.compile()`` and observe the resulting performance improvements.
#

batch_size = 32
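
# The rest of the setup and the timing calls are elided from this hunk. A hedged
# sketch of the comparison described above (names are illustrative, not the
# tutorial's exact code) might look like:
#
# ::
#
#    model = CausalSelfAttention(num_heads=num_heads, embed_dimension=embed_dimension,
#                                bias=False, is_causal=True, dropout=0.0).to(device)
#    x = torch.rand(batch_size, max_sequence_len, embed_dimension, device=device)
#    print(f"Non-compiled: {benchmark_torch_function_in_microseconds(model, x):.3f} microseconds")
#
#    compiled_model = torch.compile(model)
#    compiled_model(x)  # warm-up: the first call triggers compilation
#    print(f"Compiled: {benchmark_torch_function_in_microseconds(compiled_model, x):.3f} microseconds")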
@@ -276,12 +271,11 @@ def generate_rand_batch(

######################################################################
#
# The exact execution time is dependent on machine, however the results for mine:
# The non-compiled module runs in 166.616 microseconds
# The compiled module runs in 166.726 microseconds
# That is not what we were expecting. Let's dig a little deeper.
# PyTorch comes with an amazing built-in profiler that you can use to
# inspect the performance characteristics of your code.
#

from torch.profiler import profile, record_function, ProfilerActivity
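
# The profiling cell itself is elided from this hunk. A hedged sketch of how such a
# run is typically set up (illustrative, not the tutorial's exact code):
#
# ::
#
#    activities = [ProfilerActivity.CPU]
#    if device == "cuda":
#        activities.append(ProfilerActivity.CUDA)
#
#    with profile(activities=activities, record_shapes=False) as prof:
#        with record_function("Non-compiled Causal Attention"):
#            for _ in range(25):
#                model(x)
#    print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))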
@@ -302,7 +296,7 @@ def generate_rand_batch(
            compiled_model(x)
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))

# For even more insights, you can export the trace and use ``chrome://tracing`` to view the results
# ::
#
#    prof.export_chrome_trace("compiled_causal_attention_trace.json")
@@ -311,33 +305,30 @@ def generate_rand_batch(


######################################################################
# The previous code snippet generates a report of the top 10 PyTorch functions
# that consumed the most GPU execution time, for both the compiled and non-compiled module.
# The analysis reveals that the majority of time spent on the GPU is concentrated
# on the same set of functions for both modules.
# The reason for this is that ``torch.compile`` is very good at removing the
# framework overhead associated with PyTorch. If your model is launching
# large, efficient CUDA kernels, which in this case ``CausalSelfAttention``
# is, then the overhead of PyTorch can be hidden.
#
# In reality, your module does not normally consist of a singular
# ``CausalSelfAttention`` block. When experimenting with the `Andrej Karpathy NanoGPT <https://github.com/karpathy/nanoGPT>`__ repository, compiling
# the module took the time per train step from ``6090.49ms`` to
# ``3273.17ms``! This was done on commit ``ae3a8d5`` of NanoGPT, training on
# the Shakespeare dataset.
#


######################################################################
# Conclusion
# ==========
#
# In this tutorial, we have demonstrated the basic usage of
# ``torch.nn.functional.scaled_dot_product_attention``. We have shown how
# the ``sdp_kernel`` context manager can be used to assert a certain
# implementation is used on GPU. As well, we built a simple
# ``CausalSelfAttention`` module that works with ``NestedTensor`` and is torch
# compilable. In the process we have shown how the profiling tools can
# be used to explore the performance characteristics of a user-defined
# module.
#
