From f79a910983b4ec47ad83ae548cd28938760ed387 Mon Sep 17 00:00:00 2001
From: KH <ganghe74@gmail.com>
Date: Tue, 5 Sep 2023 16:15:31 +0900
Subject: [PATCH 01/12] SDPA translation: title, Summary, and Overview sections
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

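1. "Scaled Dot Product Attention" is left untranslated; that seemed the
   better choice.
2. On first use, "Transformers" is written as the transliteration together
   with the English original: 트랜스포머(Transformers). Later occurrences of
   "Transformer" will use the transliteration "트랜스포머" alone.
3. query, key, and value are translated as 쿼리, 키, and 값, respectively.
   Two existing translations were consulted:
   ref 1) "딥 러닝을 이용한 자연어 처리 입문" (https://wikidocs.net/31379)
   ref 2) "트랜스포머를 활용한 자연어 처리" by Lewis Tunstall, Leandro von
          Werra, and Thomas Wolf, translated by 박해선
4. No suitable translation for "fused" came to mind, so it is written as the
   transliteration alongside the English original: 퓨즈드(fused).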

issue) https://github.com/PyTorchKorea/tutorials-kr/issues/747
---
 .../scaled_dot_product_attention_tutorial.py  | 34 +++++++++----------
 1 file changed, 17 insertions(+), 17 deletions(-)

diff --git a/intermediate_source/scaled_dot_product_attention_tutorial.py b/intermediate_source/scaled_dot_product_attention_tutorial.py
index 4ec20d077..82d02fcad 100644
--- a/intermediate_source/scaled_dot_product_attention_tutorial.py
+++ b/intermediate_source/scaled_dot_product_attention_tutorial.py
@@ -1,29 +1,29 @@
 """
-(Beta) Implementing High-Performance Transformers with Scaled Dot Product Attention (SDPA)
-==========================================================================================
+(Beta) Scaled Dot Product Attention (SDPA)로 고성능 트랜스포머(Transformers) 구현하기
+=================================================================================
 
 
 **Author:** `Driss Guessous <https://github.com/drisspg>`_
+**번역** : `이강희 <https://github.com/khleexv>`_
 """
 
 ######################################################################
-# Summary
-# ~~~~~~~~
+# 요약
+# ~~~~
 #
-# In this tutorial, we want to highlight a new ``torch.nn.functional`` function
-# that can be helpful for implementing transformer architectures. The
-# function is named ``torch.nn.functional.scaled_dot_product_attention``.
-# For detailed description of the function, see the `PyTorch documentation <https://pytorch.org/docs/master/generated/torch.nn.functional.scaled_dot_product_attention.html#torch.nn.functional.scaled_dot_product_attention>`__.
-# This function has already been incorporated into ``torch.nn.MultiheadAttention`` and ``torch.nn.TransformerEncoderLayer``.
+# 이 튜토리얼에서, 트랜스포머(Transformer) 아키텍처 구현에 도움이 되는 새로운
+# ``torch.nn.functional`` 모듈의 함수를 소개합니다. 이 함수의 이름은 ``torch.nn.functional.scaled_dot_product_attention``
+# 입니다. 함수에 대한 자세한 설명은 `PyTorch 문서 <https://pytorch.org/docs/master/generated/torch.nn.functional.scaled_dot_product_attention.html#torch.nn.functional.scaled_dot_product_attention>`__
+# 를 참고하세요. 이 함수는 이미 ``torch.nn.MultiheadAttention`` 과 ``torch.nn.TransformerEncoderLayer``
+# 에서 사용되고 있습니다.
 #
-# Overview
-# ~~~~~~~~~
-# At a high level, this PyTorch function calculates the
-# scaled dot product attention (SDPA) between query, key, and value according to
-# the definition found in the paper `Attention is all you
-# need <https://arxiv.org/abs/1706.03762>`__. While this function can
-# be written in PyTorch using existing functions, a fused implementation can provide
-# large performance benefits over a naive implementation.
+# 개요
+# ~~~~
+# 높은 수준에서 이 PyTorch 함수는 쿼리(query), 키(key), 값(value) 사이의
+# scaled dot product attention (SDPA)을 계산합니다.
+# 이 함수의 정의는 `Attention is all you need <https://arxiv.org/abs/1706.03762>`__
+# 논문에서 찾을 수 있습니다. 이 함수는 기존 함수를 사용하여 PyTorch로 작성할 수 있지만,
+# 퓨즈드(fused) 구현은 단순한 구현보다 큰 성능 이점을 제공할 수 있습니다.
 #
 # Fused implementations
 # ~~~~~~~~~~~~~~~~~~~~~~

From 06681073075e66ac68df4e9510d24c56b75b4032 Mon Sep 17 00:00:00 2001
From: KH <ganghe74@gmail.com>
Date: Tue, 5 Sep 2023 16:22:24 +0900
Subject: [PATCH 02/12] SDPA: translate the Fused implementations section
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

issue) https://github.com/PyTorchKorea/tutorials-kr/issues/747
---
 .../scaled_dot_product_attention_tutorial.py          | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/intermediate_source/scaled_dot_product_attention_tutorial.py b/intermediate_source/scaled_dot_product_attention_tutorial.py
index 82d02fcad..b11e84285 100644
--- a/intermediate_source/scaled_dot_product_attention_tutorial.py
+++ b/intermediate_source/scaled_dot_product_attention_tutorial.py
@@ -25,11 +25,12 @@
 # 논문에서 찾을 수 있습니다. 이 함수는 기존 함수를 사용하여 PyTorch로 작성할 수 있지만,
 # 퓨즈드(fused) 구현은 단순한 구현보다 큰 성능 이점을 제공할 수 있습니다.
 #
-# Fused implementations
+# 퓨즈드(Fused) 구현
 # ~~~~~~~~~~~~~~~~~~~~~~
 #
-# For CUDA tensor inputs, the function will dispatch into one of the following
-# implementations:
+# 이 함수는 CUDA 텐서 입력을 다음 중 하나의 구현을 사용합니다.
+#
+# 구현:
 #
 # * `FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness <https://arxiv.org/abs/2205.14135>`__
 # * `Memory-Efficient Attention <https://github.com/facebookresearch/xformers>`__
@@ -37,7 +38,7 @@
 #
 # .. note::
 #
-#   This tutorial requires PyTorch 2.0.0 or later.
+#   이 튜토리얼은 PyTorch 버전 2.0.0 이상이 필요합니다.
 #
 
 import torch
@@ -45,7 +46,7 @@
 import torch.nn.functional as F
 device = "cuda" if torch.cuda.is_available() else "cpu"
 
-# Example Usage:
+# 사용 예시:
 query, key, value = torch.randn(2, 3, 8, device=device), torch.randn(2, 3, 8, device=device), torch.randn(2, 3, 8, device=device)
 F.scaled_dot_product_attention(query, key, value)
 

From 95c16214a5f47a9a15791845c5f50af852257ca0 Mon Sep 17 00:00:00 2001
From: KH <ganghe74@gmail.com>
Date: Tue, 5 Sep 2023 16:25:02 +0900
Subject: [PATCH 03/12] SDPA: translate the Explicit Dispatcher Control section
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

No suitable Korean word was found for "Dispatcher", so the English term is
kept as-is.
"Helpful arguments mapper" is also left in English.

issue) https://github.com/PyTorchKorea/tutorials-kr/issues/747
---
 .../scaled_dot_product_attention_tutorial.py  | 22 ++++++++-----------
 1 file changed, 9 insertions(+), 13 deletions(-)

diff --git a/intermediate_source/scaled_dot_product_attention_tutorial.py b/intermediate_source/scaled_dot_product_attention_tutorial.py
index b11e84285..d7a87d6d4 100644
--- a/intermediate_source/scaled_dot_product_attention_tutorial.py
+++ b/intermediate_source/scaled_dot_product_attention_tutorial.py
@@ -52,19 +52,15 @@
 
 
 ######################################################################
-# Explicit Dispatcher Control
-# ~~~~~~~~~~~~~~~~~~~~~~~~~~~
-#
-# While the function will implicitly dispatch to one of the three
-# implementations, the user can also explicitly control the dispatch via
-# the use of a context manager. This context manager allows users to
-# explicitly disable certain implementations. If a user wants to ensure
-# the function is indeed using the fastest implementation for their
-# specific inputs, the context manager can be used to sweep through
-# measuring performance.
+# 명시적 Dispatcher 제어
+# ~~~~~~~~~~~~~~~~~~~~~~
 #
+# 이 함수는 암시적으로 세 가지 구현 중 하나를 사용합니다. 하지만 컨텍스트 매니저를
+# 사용하면 명시적으로 어떤 구현을 사용할 지 제어할 수 있습니다. 컨텍스트 매니저를 통해
+# 특정 구현을 명시적으로 비활성화 할 수 있습니다. 특정 입력에 대한 가장 빠른 구현을 찾고자
+# 한다면, 컨텍스트 매니저로 모든 구현의 성능을 측정해볼 수 있습니다.
 
-# Lets define a helpful benchmarking function:
+# 벤치마크 함수를 정의합니다
 import torch.utils.benchmark as benchmark
 def benchmark_torch_function_in_microseconds(f, *args, **kwargs):
     t0 = benchmark.Timer(
@@ -72,7 +68,7 @@ def benchmark_torch_function_in_microseconds(f, *args, **kwargs):
     )
     return t0.blocked_autorange().mean * 1e6
 
-# Lets define the hyper-parameters of our input
+# 입력의 하이퍼파라미터를 정의합니다
 batch_size = 32
 max_sequence_len = 1024
 num_heads = 32
@@ -86,7 +82,7 @@ def benchmark_torch_function_in_microseconds(f, *args, **kwargs):
 
 print(f"The default implementation runs in {benchmark_torch_function_in_microseconds(F.scaled_dot_product_attention, query, key, value):.3f} microseconds")
 
-# Lets explore the speed of each of the 3 implementations
+# 세 가지 구현의 속도를 측정합니다
 from torch.backends.cuda import sdp_kernel, SDPBackend
 
 # Helpful arguments mapper

From 4be1dea18845ae45b69371adeaf05d58bb55cde7 Mon Sep 17 00:00:00 2001
From: KH <ganghe74@gmail.com>
Date: Tue, 5 Sep 2023 16:28:45 +0900
Subject: [PATCH 04/12] SDPA: translate the Hardware dependence section
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

issue) https://github.com/PyTorchKorea/tutorials-kr/issues/747
---
 .../scaled_dot_product_attention_tutorial.py      | 15 +++++++--------
 1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/intermediate_source/scaled_dot_product_attention_tutorial.py b/intermediate_source/scaled_dot_product_attention_tutorial.py
index d7a87d6d4..e13f47b58 100644
--- a/intermediate_source/scaled_dot_product_attention_tutorial.py
+++ b/intermediate_source/scaled_dot_product_attention_tutorial.py
@@ -111,15 +111,14 @@ def benchmark_torch_function_in_microseconds(f, *args, **kwargs):
 
 
 ######################################################################
-# Hardware dependence
-# ~~~~~~~~~~~~~~~~~~~
+# 하드웨어 의존성
+# ~~~~~~~~~~~~~~~
 #
-# Depending on what machine you ran the above cell on and what hardware is
-# available, your results might be different.
-# - If you don’t have a GPU and are running on CPU then the context manager
-# will have no effect and all three runs should return similar timings.
-# - Depending on what compute capability your graphics card supports
-# flash attention or memory efficient might have failed.
+# 위 셀을 어떤 머신에서 실행했는지와 사용 가능한 하드웨어에 따라 결과가 다를 수 있습니다.
+# - GPU가 없고 CPU에서 실행 중이라면 컨텍스트 매니저는 효과가 없고 세 가지 실행 모두
+# 유사한 시간을 반환할 것입니다.
+# - 그래픽 카드가 지원하는 컴퓨팅 능력에 따라 flash attention 또는
+# memory efficient 구현이 동작하지 않을 수 있습니다.
 
 
 ######################################################################

From 79c1d0c9424e2c92a5beb9f5fb2e1b6dcad53b25 Mon Sep 17 00:00:00 2001
From: KH <ganghe74@gmail.com>
Date: Tue, 5 Sep 2023 16:31:47 +0900
Subject: [PATCH 05/12] SDPA: translate the Causal Self Attention and ``NestedTensor`` and Dense tensor support sections
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

issue) https://github.com/PyTorchKorea/tutorials-kr/issues/747
---
 .../scaled_dot_product_attention_tutorial.py   | 18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/intermediate_source/scaled_dot_product_attention_tutorial.py b/intermediate_source/scaled_dot_product_attention_tutorial.py
index e13f47b58..07cf9f93f 100644
--- a/intermediate_source/scaled_dot_product_attention_tutorial.py
+++ b/intermediate_source/scaled_dot_product_attention_tutorial.py
@@ -125,9 +125,8 @@ def benchmark_torch_function_in_microseconds(f, *args, **kwargs):
 # Causal Self Attention
 # ~~~~~~~~~~~~~~~~~~~~~
 #
-# Below is an example implementation of a multi-headed causal self
-# attention block inspired by
-# `Andrej Karpathy NanoGPT <https://github.com/karpathy/nanoGPT>`__ repository.
+# 아래는 multi-head causal self attention 블록의 구현 예시입니다.
+# `Andrej Karpathy NanoGPT <https://github.com/karpathy/nanoGPT>`__ 저장소를 참고했습니다.
 #
 
 class CausalSelfAttention(nn.Module):
@@ -183,12 +182,13 @@ def forward(self, x):
 
 
 #####################################################################
-# ``NestedTensor`` and Dense tensor support
-# -----------------------------------------
+# ``NestedTensor`` 및 Dense tensor 지원
+# ------------------------------------
 #
-# SDPA supports both ``NestedTensor`` and Dense tensor inputs. ``NestedTensors`` handle the case where the input is a batch of variable length sequences
-# without needing to pad each sequence to the maximum length in the batch. For more information about ``NestedTensors`` see
-# `torch.nested <https://pytorch.org/docs/stable/nested.html>`__ and `NestedTensors Tutorial <https://tutorials.pytorch.kr/prototype/nestedtensor.html>`__.
+# SDPA는 ``NestedTensor``와 Dense tensor 입력을 모두 지원합니다.
+# ``NestedTensors``는 입력이 가변 길이 시퀀스로 구성된 배치인 경우에
+# 배치 내 시퀀스의 최대 길이에 맞춰 각 시퀀스를 패딩할 필요가 없습니다. ``NestedTensors``에 대한 자세한 내용은
+# `torch.nested <https://pytorch.org/docs/stable/nested.html>`__와 `NestedTensors 튜토리얼 <https://tutorials.pytorch.kr/prototype/nestedtensor.html>`__을 참고하세요.
 #
 
 import random
@@ -232,7 +232,7 @@ def generate_rand_batch(
 random_nt, _ = generate_rand_batch(32, 512, embed_dimension, pad_percentage=0.5, dtype=dtype, device=device)
 random_dense, _ = generate_rand_batch(32, 512, embed_dimension, pad_percentage=None, dtype=dtype, device=device)
 
-# Currently the fused implementations don't support ``NestedTensor`` for training
+# 현재 퓨즈드(fused) 구현은 ``NestedTensor``로 학습하는 것을 지원하지 않습니다.
 model.eval()
 
 with sdp_kernel(**backend_map[SDPBackend.FLASH_ATTENTION]):

From c01137a110364dd1f4787ee30bba97197f901bcf Mon Sep 17 00:00:00 2001
From: KH <ganghe74@gmail.com>
Date: Tue, 5 Sep 2023 16:44:56 +0900
Subject: [PATCH 06/12] SDPA: translate Using SDPA with ``torch.compile`` (needs discussion)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Help is needed with translating the second-to-last paragraph (lines 308-312).
A somewhat free translation seems necessary to make it easier to understand.

issue) https://github.com/PyTorchKorea/tutorials-kr/issues/747
---
 .../scaled_dot_product_attention_tutorial.py  | 53 +++++++++----------
 1 file changed, 25 insertions(+), 28 deletions(-)

diff --git a/intermediate_source/scaled_dot_product_attention_tutorial.py b/intermediate_source/scaled_dot_product_attention_tutorial.py
index 07cf9f93f..3b258df53 100644
--- a/intermediate_source/scaled_dot_product_attention_tutorial.py
+++ b/intermediate_source/scaled_dot_product_attention_tutorial.py
@@ -244,15 +244,14 @@ def generate_rand_batch(
 
 
 ######################################################################
-# Using SDPA with ``torch.compile``
-# =================================
+# ``torch.compile``과 함께 SDPA 사용하기
+# ===================================
 #
-# With the release of PyTorch 2.0, a new feature called
-# ``torch.compile()`` has been introduced, which can provide
-# significant performance improvements over eager mode.
-# Scaled dot product attention is fully composable with ``torch.compile()``.
-# To demonstrate this, let's compile the ``CausalSelfAttention`` module using
-# ``torch.compile()`` and observe the resulting performance improvements.
+# PyTorch 2.0 릴리즈와 함께 ``torch.compile()``이라는 새로운 기능이 추가되었는데,
+# 이는 eager mode보다 상당한 성능 향상을 제공할 수 있습니다.
+# Scaled dot product attention은 ``torch.compile()``로 완전히 구성할 수 있습니다.
+# 이를 확인하기 위해 ``torch.compile()``을 통해 ``CausalSelfAttention`` 모듈을 컴파일하고
+# 결과적으로 얻어지는 성능 향상을 알아봅시다.
 #
 
 batch_size = 32
@@ -272,12 +271,11 @@ def generate_rand_batch(
 
 ######################################################################
 #
-# The exact execution time is dependent on machine, however the results for mine:
-# The non compiled module runs in  166.616 microseconds
-# The compiled module runs in  166.726 microseconds
-# That is not what we were expecting. Let's dig a little deeper.
-# PyTorch comes with an amazing built-in profiler that you can use to
-# inspect the performance characteristics of your code.
+# 정확한 실행 시간은 환경에 따라 다르지만, 다음은 저자의 결과입니다.
+# 컴파일 되지 않은 모듈은 실행에 166.616 마이크로초가 소요되었습니다.
+# 컴파일 된 모듈은 실행에 166.726 마이크로초가 소요되었습니다.
+# 이는 우리의 예상과는 다릅니다. 좀 더 자세히 알아봅시다.
+# PyTorch는 코드의 성능 특성을 점검할 수 있는 놀라운 내장(built-in) 프로파일러를 제공합니다.
 #
 
 from torch.profiler import profile, record_function, ProfilerActivity
@@ -298,7 +296,7 @@ def generate_rand_batch(
             compiled_model(x)
 print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
 
-# For even more insights, you can export the trace and use ``chrome://tracing`` to view the results
+# 더 많은 정보를 얻기 위해 추적(trace)를 내보내고 ``chrome://tracing``을 사용하여 결과를 확인해보세요.
 # ::
 #
 #    prof.export_chrome_trace("compiled_causal_attention_trace.json").
@@ -307,20 +305,19 @@ def generate_rand_batch(
 
 
 ######################################################################
-# The previous code snippet generates a report of the top 10 PyTorch functions
-# that consumed the most GPU execution time, for both the compiled and non-compiled module.
-# The analysis reveals that the majority of time spent on the GPU is concentrated
-# on the same set of functions for both modules.
-# The reason for this here is that ``torch.compile`` is very good at removing the
-# framework overhead associated with PyTorch. If your model is launching
-# large, efficient CUDA kernels, which in this case ``CausaulSelfAttention``
-# is, then the overhead of PyTorch can be hidden.
+# 이전 코드 조각(snippet)은 컴파일 된 모듈과 컴파일되지 않은 모듈 모두에 대해
+# 가장 많은 GPU 실행 시간을 차지한 상위 10개의 PyTorch 함수에 대한 보고서를 생성합니다.
+# 분석 결과, 두 모듈 모두 GPU에서 소요된 시간의 대부분이
+# 동일한 함수들에 집중되어 있음을 보여줍니다.
+# PyTorch가 프레임워크 오버헤드를 제거하는 데 매우 탁월한 ``torch.compile``를
+# 제공하기 때문입니다. ``CausaulSelfAttention`` 같은 경우처럼 크고, 효율적인 CUDA 커널을
+# 사용하는 모델에서 PyTorch 오버헤드는 작아질 것입니다.
 #
-# In reality, your module does not normally consist of a singular
-# ``CausalSelfAttention`` block. When experimenting with `Andrej Karpathy NanoGPT <https://github.com/karpathy/nanoGPT>`__ repository, compiling
-# the module took the time per train step from: ``6090.49ms`` to
-# ``3273.17ms``! This was done on commit: ``ae3a8d5`` of NanoGPT training on
-# the Shakespeare dataset.
+# 사실, 모듈은 보통 ``CausalSelfAttention`` 블럭 하나만으로 구성되지 않습니다.
+# `Andrej Karpathy NanoGPT <https://github.com/karpathy/nanoGPT>`__ 저장소에서 실험한 경우,
+# 모듈을 컴파일 하는 것은 학습의 각 단계별 소요 시간을 ``6090.49ms``에서 ``3273.17ms``로
+# 줄일 수 있었습니다. 이 실험은 NanoGPT 저장소의 ``ae3a8d5`` 커밋에서 Shakespeare
+# 데이터셋을 사용하여 진행되었습니다.
 #
 
 

From b649e1001f2462676005f108fa25db4b12fdd26d Mon Sep 17 00:00:00 2001
From: KH <ganghe74@gmail.com>
Date: Tue, 5 Sep 2023 16:47:47 +0900
Subject: [PATCH 07/12] SDPA: translate the Conclusion section
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

issue) https://github.com/PyTorchKorea/tutorials-kr/issues/747
---
 .../scaled_dot_product_attention_tutorial.py   | 18 ++++++++----------
 1 file changed, 8 insertions(+), 10 deletions(-)

diff --git a/intermediate_source/scaled_dot_product_attention_tutorial.py b/intermediate_source/scaled_dot_product_attention_tutorial.py
index 3b258df53..1b6e20413 100644
--- a/intermediate_source/scaled_dot_product_attention_tutorial.py
+++ b/intermediate_source/scaled_dot_product_attention_tutorial.py
@@ -322,15 +322,13 @@ def generate_rand_batch(
 
 
 ######################################################################
-# Conclusion
-# ==========
+# 결론
+# ===
 #
-# In this tutorial, we have demonstrated the basic usage of
-# ``torch.nn.functional.scaled_dot_product_attention``. We have shown how
-# the ``sdp_kernel`` context manager can be used to assert a certain
-# implementation is used on GPU. As well, we built a simple
-# ``CausalSelfAttention`` module that works with ``NestedTensor`` and is torch
-# compilable. In the process we have shown how to the profiling tools can
-# be used to explore the performance characteristics of a user defined
-# module.
+# 이 튜토리얼에서, ``torch.nn.functional.scaled_dot_product_attention``의 기본적인
+# 사용법을 살펴봤습니다. ``sdp_kernel`` 컨텍스트 매니저로 GPU가 특정 구현을
+# 사용하도록 할 수 있다는 것을 보았습니다. 또한, 간단한 ``NestedTensor``에서 작동하고
+# 컴파일 가능한 ``CausalSelfAttention``모듈을 만들었습니다.
+# 이 과정에서 프로파일링 도구를 사용하여 유저가 정의한 모듈의 성능 특성을 어떻게
+# 확인할 수 있는지도 살펴봤습니다.
 #

From ca5bd75497c0d48675be16af858b191bce1138d3 Mon Sep 17 00:00:00 2001
From: KH <ganghe74@gmail.com>
Date: Tue, 5 Sep 2023 17:12:07 +0900
Subject: [PATCH 08/12] SDPA: fix reStructuredText syntax errors
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Added a space after each ``inline literal``.
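
For reviewers: reStructuredText only recognizes the closing backticks of an
inline literal when they are followed by whitespace or punctuation, so a
Korean particle attached directly to the literal breaks the markup. The
sketch below shows one way the same fix could be applied mechanically. The
helper name add_space_after_literals is hypothetical and not part of this
repository; the patch itself was edited by hand.

```python
import re

# Matches a single-line inline literal such as ``NestedTensor``.
_LITERAL = re.compile(r"``[^`\n]+``")


def add_space_after_literals(text):
    """Insert a space after each inline literal that is directly followed
    by a letter, digit, or underscore (for example a Korean particle)."""

    def fix(match):
        following = text[match.end():match.end() + 1]
        if following and (following.isalnum() or following == "_"):
            return match.group(0) + " "
        return match.group(0)

    return _LITERAL.sub(fix, text)


print(add_space_after_literals("``NestedTensor``와 Dense tensor"))
```

Literals that are already followed by a space or punctuation are left
unchanged, so running the helper twice should give the same result as
running it once.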
---
 .../scaled_dot_product_attention_tutorial.py  | 32 +++++++++----------
 1 file changed, 16 insertions(+), 16 deletions(-)

diff --git a/intermediate_source/scaled_dot_product_attention_tutorial.py b/intermediate_source/scaled_dot_product_attention_tutorial.py
index 1b6e20413..2d6cf60dc 100644
--- a/intermediate_source/scaled_dot_product_attention_tutorial.py
+++ b/intermediate_source/scaled_dot_product_attention_tutorial.py
@@ -185,10 +185,10 @@ def forward(self, x):
 # ``NestedTensor`` 및 Dense tensor 지원
 # ------------------------------------
 #
-# SDPA는 ``NestedTensor``와 Dense tensor 입력을 모두 지원합니다.
-# ``NestedTensors``는 입력이 가변 길이 시퀀스로 구성된 배치인 경우에
-# 배치 내 시퀀스의 최대 길이에 맞춰 각 시퀀스를 패딩할 필요가 없습니다. ``NestedTensors``에 대한 자세한 내용은
-# `torch.nested <https://pytorch.org/docs/stable/nested.html>`__와 `NestedTensors 튜토리얼 <https://tutorials.pytorch.kr/prototype/nestedtensor.html>`__을 참고하세요.
+# SDPA는 ``NestedTensor`` 와 Dense tensor 입력을 모두 지원합니다.
+# ``NestedTensors`` 는 입력이 가변 길이 시퀀스로 구성된 배치인 경우에
+# 배치 내 시퀀스의 최대 길이에 맞춰 각 시퀀스를 패딩할 필요가 없습니다. ``NestedTensors`` 에 대한 자세한 내용은
+# `torch.nested <https://pytorch.org/docs/stable/nested.html>`__ 와 `NestedTensors 튜토리얼 <https://tutorials.pytorch.kr/prototype/nestedtensor.html>`__ 을 참고하세요.
 #
 
 import random
@@ -232,7 +232,7 @@ def generate_rand_batch(
 random_nt, _ = generate_rand_batch(32, 512, embed_dimension, pad_percentage=0.5, dtype=dtype, device=device)
 random_dense, _ = generate_rand_batch(32, 512, embed_dimension, pad_percentage=None, dtype=dtype, device=device)
 
-# 현재 퓨즈드(fused) 구현은 ``NestedTensor``로 학습하는 것을 지원하지 않습니다.
+# 현재 퓨즈드(fused) 구현은 ``NestedTensor`` 로 학습하는 것을 지원하지 않습니다.
 model.eval()
 
 with sdp_kernel(**backend_map[SDPBackend.FLASH_ATTENTION]):
@@ -244,13 +244,13 @@ def generate_rand_batch(
 
 
 ######################################################################
-# ``torch.compile``과 함께 SDPA 사용하기
-# ===================================
+# ``torch.compile`` 과 함께 SDPA 사용하기
+# =======================================
 #
-# PyTorch 2.0 릴리즈와 함께 ``torch.compile()``이라는 새로운 기능이 추가되었는데,
+# PyTorch 2.0 릴리즈와 함께 ``torch.compile()`` 이라는 새로운 기능이 추가되었는데,
 # 이는 eager mode보다 상당한 성능 향상을 제공할 수 있습니다.
-# Scaled dot product attention은 ``torch.compile()``로 완전히 구성할 수 있습니다.
-# 이를 확인하기 위해 ``torch.compile()``을 통해 ``CausalSelfAttention`` 모듈을 컴파일하고
+# Scaled dot product attention은 ``torch.compile()`` 과 함께 완전히 조합(compose)하여 사용할 수 있습니다.
+# 이를 확인하기 위해 ``torch.compile()`` 을 통해 ``CausalSelfAttention`` 모듈을 컴파일하고
 # 결과적으로 얻어지는 성능 향상을 알아봅시다.
 #
 
@@ -309,13 +309,13 @@ def generate_rand_batch(
 # 가장 많은 GPU 실행 시간을 차지한 상위 10개의 PyTorch 함수에 대한 보고서를 생성합니다.
 # 분석 결과, 두 모듈 모두 GPU에서 소요된 시간의 대부분이
 # 동일한 함수들에 집중되어 있음을 보여줍니다.
-# PyTorch가 프레임워크 오버헤드를 제거하는 데 매우 탁월한 ``torch.compile``를
+# PyTorch가 프레임워크 오버헤드를 제거하는 데 매우 탁월한 ``torch.compile`` 를
 # 제공하기 때문입니다. ``CausaulSelfAttention`` 같은 경우처럼 크고, 효율적인 CUDA 커널을
 # 사용하는 모델에서 PyTorch 오버헤드는 작아질 것입니다.
 #
 # 사실, 모듈은 보통 ``CausalSelfAttention`` 블럭 하나만으로 구성되지 않습니다.
 # `Andrej Karpathy NanoGPT <https://github.com/karpathy/nanoGPT>`__ 저장소에서 실험한 경우,
-# 모듈을 컴파일 하는 것은 학습의 각 단계별 소요 시간을 ``6090.49ms``에서 ``3273.17ms``로
+# 모듈을 컴파일 하는 것은 학습의 각 단계별 소요 시간을 ``6090.49ms`` 에서 ``3273.17ms`` 로
 # 줄일 수 있었습니다. 이 실험은 NanoGPT 저장소의 ``ae3a8d5`` 커밋에서 Shakespeare
 # 데이터셋을 사용하여 진행되었습니다.
 #
@@ -323,12 +323,12 @@ def generate_rand_batch(
 
 ######################################################################
 # 결론
-# ===
+# ====
 #
-# 이 튜토리얼에서, ``torch.nn.functional.scaled_dot_product_attention``의 기본적인
+# 이 튜토리얼에서, ``torch.nn.functional.scaled_dot_product_attention`` 의 기본적인
 # 사용법을 살펴봤습니다. ``sdp_kernel`` 컨텍스트 매니저로 GPU가 특정 구현을
-# 사용하도록 할 수 있다는 것을 보았습니다. 또한, 간단한 ``NestedTensor``에서 작동하고
-# 컴파일 가능한 ``CausalSelfAttention``모듈을 만들었습니다.
+# 사용하도록 할 수 있다는 것을 보았습니다. 또한, ``NestedTensor`` 에서 작동하고
+# 컴파일 가능한 간단한 ``CausalSelfAttention`` 모듈을 만들었습니다.
 # 이 과정에서 프로파일링 도구를 사용하여 유저가 정의한 모듈의 성능 특성을 어떻게
 # 확인할 수 있는지도 살펴봤습니다.
 #

From 51748d9b03d735e277a353aba2ff3097171c6f20 Mon Sep 17 00:00:00 2001
From: KH <ganghe74@gmail.com>
Date: Fri, 8 Sep 2023 19:56:05 +0900
Subject: [PATCH 09/12] SDPA: fetch tutorials

ref) https://github.com/pytorch/tutorials/pull/2549
---
 intermediate_source/scaled_dot_product_attention_tutorial.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/intermediate_source/scaled_dot_product_attention_tutorial.py b/intermediate_source/scaled_dot_product_attention_tutorial.py
index 2d6cf60dc..e9c7a025f 100644
--- a/intermediate_source/scaled_dot_product_attention_tutorial.py
+++ b/intermediate_source/scaled_dot_product_attention_tutorial.py
@@ -310,7 +310,7 @@ def generate_rand_batch(
 # 분석 결과, 두 모듈 모두 GPU에서 소요된 시간의 대부분이
 # 동일한 함수들에 집중되어 있음을 보여줍니다.
 # PyTorch가 프레임워크 오버헤드를 제거하는 데 매우 탁월한 ``torch.compile`` 를
-# 제공하기 때문입니다. ``CausaulSelfAttention`` 같은 경우처럼 크고, 효율적인 CUDA 커널을
+# 제공하기 때문입니다. ``CausalSelfAttention`` 같은 경우처럼 크고, 효율적인 CUDA 커널을
 # 사용하는 모델에서 PyTorch 오버헤드는 작아질 것입니다.
 #
 # 사실, 모듈은 보통 ``CausalSelfAttention`` 블럭 하나만으로 구성되지 않습니다.

From e956e1d7d6461304622cc9b2b3a2307552d11795 Mon Sep 17 00:00:00 2001
From: KH <ganghe74@gmail.com>
Date: Sun, 17 Sep 2023 22:57:34 +0900
Subject: [PATCH 10/12] =?UTF-8?q?SDPA:=20`Author`=20to=20`=EC=A0=80?=
 =?UTF-8?q?=EC=9E=90`?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 intermediate_source/scaled_dot_product_attention_tutorial.py | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/intermediate_source/scaled_dot_product_attention_tutorial.py b/intermediate_source/scaled_dot_product_attention_tutorial.py
index e9c7a025f..bea99ef1b 100644
--- a/intermediate_source/scaled_dot_product_attention_tutorial.py
+++ b/intermediate_source/scaled_dot_product_attention_tutorial.py
@@ -3,8 +3,8 @@
 =================================================================================
 
 
-**Author:** `Driss Guessous <https://github.com/drisspg>`_
-**번역** : `이강희 <https://github.com/khleexv>`_
+**저자:** `Driss Guessous <https://github.com/drisspg>`_
+**번역:** `이강희 <https://github.com/khleexv>`_
 """
 
 ######################################################################

From a374998d1bb2308573773d84c413aa2b3c20e471 Mon Sep 17 00:00:00 2001
From: KH <ganghe74@gmail.com>
Date: Sun, 17 Sep 2023 23:02:06 +0900
Subject: [PATCH 11/12] =?UTF-8?q?SDPA:=20`=ED=93=A8=EC=A6=88=EB=93=9C(Fuse?=
 =?UTF-8?q?d)`=20=EB=B3=91=EA=B8=B0=20=ED=91=9C=ED=98=84=20to=20`=ED=93=A8?=
 =?UTF-8?q?=EC=A6=88=EB=93=9C`?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Use the parenthetical English gloss only at the first occurrence.
---
 intermediate_source/scaled_dot_product_attention_tutorial.py | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/intermediate_source/scaled_dot_product_attention_tutorial.py b/intermediate_source/scaled_dot_product_attention_tutorial.py
index bea99ef1b..3f8f1ead9 100644
--- a/intermediate_source/scaled_dot_product_attention_tutorial.py
+++ b/intermediate_source/scaled_dot_product_attention_tutorial.py
@@ -25,7 +25,7 @@
 # 논문에서 찾을 수 있습니다. 이 함수는 기존 함수를 사용하여 PyTorch로 작성할 수 있지만,
 # 퓨즈드(fused) 구현은 단순한 구현보다 큰 성능 이점을 제공할 수 있습니다.
 #
-# 퓨즈드(Fused) 구현
+# 퓨즈드 구현
 # ~~~~~~~~~~~~~~~~~~~~~~
 #
 # 이 함수는 CUDA 텐서 입력을 다음 중 하나의 구현을 사용합니다.
@@ -232,7 +232,7 @@ def generate_rand_batch(
 random_nt, _ = generate_rand_batch(32, 512, embed_dimension, pad_percentage=0.5, dtype=dtype, device=device)
 random_dense, _ = generate_rand_batch(32, 512, embed_dimension, pad_percentage=None, dtype=dtype, device=device)
 
-# 현재 퓨즈드(fused) 구현은 ``NestedTensor`` 로 학습하는 것을 지원하지 않습니다.
+# 현재 퓨즈드 구현은 ``NestedTensor`` 로 학습하는 것을 지원하지 않습니다.
 model.eval()
 
 with sdp_kernel(**backend_map[SDPBackend.FLASH_ATTENTION]):

From c20752b26915588a734f8c285dbb8c5e71289efc Mon Sep 17 00:00:00 2001
From: KH <ganghe74@gmail.com>
Date: Mon, 18 Sep 2023 23:09:55 +0900
Subject: [PATCH 12/12] =?UTF-8?q?SDPA:=20`=EB=86=92=EC=9D=80=20=EC=88=98?=
 =?UTF-8?q?=EC=A4=80`=20to=20`=EA=B3=A0=EC=88=98=EC=A4=80`,=20`=ED=85=90?=
 =?UTF-8?q?=EC=84=9C`=20to=20`tensor`?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Apply review feedback.
---
 intermediate_source/scaled_dot_product_attention_tutorial.py | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/intermediate_source/scaled_dot_product_attention_tutorial.py b/intermediate_source/scaled_dot_product_attention_tutorial.py
index 3f8f1ead9..68a51fd12 100644
--- a/intermediate_source/scaled_dot_product_attention_tutorial.py
+++ b/intermediate_source/scaled_dot_product_attention_tutorial.py
@@ -19,7 +19,7 @@
 #
 # 개요
 # ~~~~
-# 높은 수준에서 이 PyTorch 함수는 쿼리(query), 키(key), 값(value) 사이의
+# 고수준에서, 이 PyTorch 함수는 쿼리(query), 키(key), 값(value) 사이의
 # scaled dot product attention (SDPA)을 계산합니다.
 # 이 함수의 정의는 `Attention is all you need <https://arxiv.org/abs/1706.03762>`__
 # 논문에서 찾을 수 있습니다. 이 함수는 기존 함수를 사용하여 PyTorch로 작성할 수 있지만,
@@ -28,7 +28,7 @@
 # 퓨즈드 구현
 # ~~~~~~~~~~~~~~~~~~~~~~
 #
-# 이 함수는 CUDA 텐서 입력을 다음 중 하나의 구현을 사용합니다.
+# 이 함수는 CUDA tensor 입력에 대해 다음 중 하나의 구현을 사용합니다.
 #
 # 구현:
 #