# -*- coding: utf-8 -*-
"""
TorchMultimodal Tutorial: Finetuning FLAVA
============================================

**Translation:** `chanmuzi <https://github.com/chanmuzi>`__

"""


######################################################################
# Multimodal AI has recently become very popular owing to its ubiquitous
# nature, from use cases like image captioning and visual search to more
# recent applications like image generation from text. **TorchMultimodal
# is a library powered by PyTorch consisting of building blocks and
# end-to-end examples, aiming to enable and accelerate research in
# multimodality**.
#
# In this tutorial, we will demonstrate how to use a **pretrained SoTA
# model called** `FLAVA <https://arxiv.org/pdf/2112.04482.pdf>`__ **from
# the TorchMultimodal library to finetune on a multimodal task, namely
# visual question answering** (VQA). The model consists of two unimodal
# transformer-based encoders for text and image and a multimodal encoder
# that combines the two embeddings. It is pretrained using contrastive,
# image-text matching, and text, image, and multimodal masking losses.
#


######################################################################
# Installation
# -----------------
# We will use the TextVQA dataset and the ``BertTokenizer`` from Hugging
# Face for this tutorial, so you need to install ``datasets`` and
# ``transformers`` in addition to TorchMultimodal.
#
# .. note::
#
#    When running this tutorial in Google Colab, install the required
#    packages by creating a new cell and running the following commands:
#
# .. code-block::
#
#    !pip install torchmultimodal-nightly
#    !pip install datasets
#    !pip install transformers
#

######################################################################
# Steps
# -----
#
# 1. Download the Hugging Face dataset to a directory on your computer
#    by running the following command:
#
# .. code-block::
#
#    wget http://dl.fbaipublicfiles.com/pythia/data/vocab.tar.gz
#    tar xf vocab.tar.gz
#
# .. note::
#    If you are running this tutorial in Google Colab, run these commands
#    in a new cell and prepend them with an exclamation mark (!).
#
#
# 2. For this tutorial, we treat VQA as a classification task where the
#    inputs are images and questions (text) and the output is an answer
#    class, so we need to download the vocab file with answer classes and
#    create the answer-to-label mapping.
#
# We also load the `textvqa
# dataset <https://arxiv.org/pdf/1904.08920.pdf>`__, which contains 34602
# training samples (images, questions, and answers), from Hugging Face.
#
# We see there are 3997 answer classes, including a class representing
# unknown answers.
#

from datasets import load_dataset

with open("data/vocabs/answers_textvqa_more_than_1.txt") as f:
    vocab = f.readlines()
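
# Build the answer-to-label mapping described in step 2. This is a minimal
# sketch: the variable name ``answer_to_idx`` is our own, and it is reused
# in the transform sketch below.
answer_to_idx = {ans.strip("\n"): idx for idx, ans in enumerate(vocab)}
print(len(vocab))  # 3997 answer classes, including one for unknown answers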

dataset = load_dataset("textvqa")

######################################################################
# Let's display a sample entry from the dataset:
#

import matplotlib.pyplot as plt
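
# Show one training sample. This is a sketch assuming the standard TextVQA
# schema ("question", "answers", and a PIL "image" field); the sample index
# is arbitrary.
import numpy as np

idx = 5
print("Question: ", dataset["train"][idx]["question"])
print("Answers: ", dataset["train"][idx]["answers"])
im = np.asarray(dataset["train"][idx]["image"].resize((500, 500)))
plt.imshow(im)
plt.show()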


######################################################################
# 3. Next, we write the transform function to convert the image and text
#    into Tensors consumable by our model:
#
#    - For images, we use the transforms from torchvision to convert to
#      Tensor and resize to uniform sizes.
#    - For text, we tokenize (and pad) them using the ``BertTokenizer``
#      from Hugging Face.
#    - For answers (i.e. labels), we take the most frequently occurring
#      answer as the label to train with:
#

import torch
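from collections import defaultdict
from functools import partial

from torchvision import transforms
from transformers import BertTokenizer


# A minimal transform sketch following the description above. The resize
# shape (224x224), the max sequence length (512), and the checkpoint name
# ("bert-base-uncased") are our own assumptions, not values fixed by the
# tutorial text.
def transform(tokenizer, input):
    batch = {}

    # Convert the image to a Tensor and resize it to a uniform size; the
    # leading batch dimension of 1 lets a collator concatenate samples.
    image_transform = transforms.Compose(
        [transforms.ToTensor(), transforms.Resize([224, 224])]
    )
    image = image_transform(input["image"][0].convert("RGB"))
    batch["image"] = image.unsqueeze(0)

    # Tokenize and pad the question text.
    tokenized = tokenizer(
        input["question"],
        return_tensors="pt",
        padding="max_length",
        truncation=True,
        max_length=512,
    )
    batch.update(tokenized)

    # Take the most frequently occurring answer as the training label,
    # falling back to label 0 for answers missing from the vocab.
    ans_to_count = defaultdict(int)
    for ans in input["answers"][0]:
        ans_to_count[ans] += 1
    most_common = max(ans_to_count, key=ans_to_count.get)
    batch["answers"] = torch.as_tensor([answer_to_idx.get(most_common, 0)])
    return batch


tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
dataset.set_transform(partial(transform, tokenizer))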


######################################################################
# 4. Finally, we import the ``flava_model_for_classification`` from
#    ``torchmultimodal``. It loads the pretrained FLAVA checkpoint by
#    default and includes a classification head.
#
# The model forward function passes the image through the visual encoder
# and the question through the text encoder. The image and question
# embeddings are then passed through the multimodal encoder. The final
# embedding, corresponding to the CLS token, is passed through an MLP head,
# which finally gives the probability distribution over each possible
# answer.
#

from torchmultimodal.models.flava.model import flava_model_for_classification
model = flava_model_for_classification(num_classes=len(vocab))
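
# A quick smoke test on one transformed sample. This is a sketch: the
# ``text``/``image``/``labels`` keyword arguments and the ``loss`` field on
# the output are assumptions about the classification model's interface.
sample = dataset["train"][0]
out = model(text=sample["input_ids"], image=sample["image"], labels=sample["answers"])
print(out.loss)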


######################################################################
# 5. We put together the dataset and model in a toy training loop to
#    demonstrate how to train the model for 3 iterations:
#

from torch import nn
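from torch.utils.data import DataLoader

# A toy training loop sketch for 3 iterations. The ``text``/``image``/
# ``labels`` keyword arguments and the ``loss`` field on the output are
# assumptions about the classification model's interface; the batch size
# and optimizer choice are arbitrary.
BATCH_SIZE = 2
MAX_STEPS = 3


def collator(batch):
    # Concatenate per-sample tensors (each carrying a leading batch
    # dimension of 1, as produced by the transform above) into batches.
    return {k: torch.cat([b[k] for b in batch]) for k in batch[0].keys()}


train_dataloader = DataLoader(
    dataset["train"], batch_size=BATCH_SIZE, collate_fn=collator
)
optimizer = torch.optim.AdamW(model.parameters())

model.train()
for step, batch in enumerate(train_dataloader):
    if step >= MAX_STEPS:
        break
    optimizer.zero_grad()
    out = model(
        text=batch["input_ids"], image=batch["image"], labels=batch["answers"]
    )
    out.loss.backward()
    optimizer.step()
    print(f"Loss at step {step} = {out.loss}")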


######################################################################
# Conclusion
# -------------------
#
# This tutorial introduced the basics of how to finetune on a multimodal
# task using FLAVA from TorchMultimodal. Please also check out other
# examples from the library like
# `MDETR <https://github.com/facebookresearch/multimodal/tree/main/torchmultimodal/models/mdetr>`__,
# which is a multimodal model for object detection, and
# `Omnivore <https://github.com/facebookresearch/multimodal/blob/main/torchmultimodal/models/omnivore.py>`__,
# which is a multitask model spanning image, video, and 3D classification.
#