🌐 [i18n-KO] Translated tasks/image_captioning.mdx to Korean #22943

Merged: 1 commit, Apr 26, 2023
4 changes: 2 additions & 2 deletions docs/source/ko/_toctree.yml
@@ -91,8 +91,8 @@
       title: (번역중) Depth estimation
     title: (번역중) 컴퓨터 비전
 - sections:
-  - local: in_translation
-    title: (번역중) Image captioning
+  - local: tasks/image_captioning
+    title: 이미지 캡셔닝
   - local: in_translation
     title: (번역중) Document Question Answering
   title: (번역중) 멀티모달
277 changes: 277 additions & 0 deletions docs/source/ko/tasks/image_captioning.mdx
@@ -0,0 +1,277 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->


# Image captioning[[image-captioning]]

[[open-in-colab]]

Image captioning is the task of predicting a caption for a given image.
Common real-world applications include aiding the visually impaired, for example by helping them navigate through different situations.
Image captioning therefore helps to improve content accessibility by describing images to people.

This guide will show you how to:

* Fine-tune an image captioning model.
* Use the fine-tuned model for inference.

Before you begin, make sure you have all the necessary libraries installed:

```bash
pip install transformers datasets evaluate -q
pip install jiwer -q
```

Log in to your Hugging Face account so you can upload your model and share it with the community.
When prompted, enter your token to log in:


```python
from huggingface_hub import notebook_login

notebook_login()
```
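
If you are working in a terminal rather than a notebook, the same login can be done with the Hugging Face CLI:

```bash
huggingface-cli login
```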

## Load the Pokémon BLIP captions dataset[[load-the-pokmon-blip-captions-dataset]]

Use the 🤗 Datasets library to load a dataset that consists of {image-caption} pairs.
To create your own image captioning dataset in PyTorch, you can follow [this notebook](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/GIT/Fine_tune_GIT_on_an_image_captioning_dataset.ipynb).


```python
from datasets import load_dataset

ds = load_dataset("lambdalabs/pokemon-blip-captions")
ds
```
```bash
DatasetDict({
    train: Dataset({
        features: ['image', 'text'],
        num_rows: 833
    })
})
```

The dataset has two features, `image` and `text`.
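
You can peek at a single example to see what the two features hold (in a notebook, the `image` value renders as a picture):

```python
example = ds["train"][0]
print(example["text"])  # the caption string for this image
example["image"]  # a PIL image object
```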

<Tip>

๋งŽ์€ ์ด๋ฏธ์ง€ ์บก์…˜ ๋ฐ์ดํ„ฐ์„ธํŠธ์—๋Š” ์ด๋ฏธ์ง€๋‹น ์—ฌ๋Ÿฌ ๊ฐœ์˜ ์บก์…˜์ด ํฌํ•จ๋˜์–ด ์žˆ์Šต๋‹ˆ๋‹ค.
์ด๋Ÿฌํ•œ ๊ฒฝ์šฐ, ์ผ๋ฐ˜์ ์œผ๋กœ ํ•™์Šต ์ค‘์— ์‚ฌ์šฉ ๊ฐ€๋Šฅํ•œ ์บก์…˜ ์ค‘์—์„œ ๋ฌด์ž‘์œ„๋กœ ์ƒ˜ํ”Œ์„ ์ถ”์ถœํ•ฉ๋‹ˆ๋‹ค.

</Tip>
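
As a sketch of that strategy, assuming a hypothetical dataset whose `text` feature holds a *list* of candidate captions (which is not the case for the Pokémon dataset used here), one caption could be drawn at random with a simple map:

```python
import random


def sample_caption(example):
    # Hypothetical: `text` is assumed to be a list of candidate captions
    example["text"] = random.choice(example["text"])
    return example


# multi_caption_ds = multi_caption_ds.map(sample_caption)
```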

Split the dataset's train split into a train and a test set with the [`~datasets.Dataset.train_test_split`] method:


```python
ds = ds["train"].train_test_split(test_size=0.1)
train_ds = ds["train"]
test_ds = ds["test"]
```
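
As a quick sanity check, inspect the resulting split sizes (the exact counts depend on how the 10% test fraction is rounded):

```python
print(len(train_ds), len(test_ds))  # roughly 750 and 83 for this 833-example dataset
```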

Let's visualize a couple of samples from the training set.


```python
from textwrap import wrap
import matplotlib.pyplot as plt
import numpy as np


def plot_images(images, captions):
    plt.figure(figsize=(20, 20))
    for i in range(len(images)):
        ax = plt.subplot(1, len(images), i + 1)
        caption = captions[i]
        caption = "\n".join(wrap(caption, 12))
        plt.title(caption)
        plt.imshow(images[i])
        plt.axis("off")


sample_images_to_visualize = [np.array(train_ds[i]["image"]) for i in range(5)]
sample_captions = [train_ds[i]["text"] for i in range(5)]
plot_images(sample_images_to_visualize, sample_captions)
```

<div class="flex justify-center">
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/sample_training_images_image_cap.png" alt="Sample training images"/>
</div>

## Preprocess the dataset[[preprocess-the-dataset]]

๋ฐ์ดํ„ฐ์„ธํŠธ์—๋Š” ์ด๋ฏธ์ง€์™€ ํ…์ŠคํŠธ๋ผ๋Š” ๋‘ ๊ฐ€์ง€ ์–‘์‹์ด ์žˆ๊ธฐ ๋•Œ๋ฌธ์—, ์ „์ฒ˜๋ฆฌ ํŒŒ์ดํ”„๋ผ์ธ์—์„œ ์ด๋ฏธ์ง€์™€ ์บก์…˜์„ ๋ชจ๋‘ ์ „์ฒ˜๋ฆฌํ•ฉ๋‹ˆ๋‹ค.

์ „์ฒ˜๋ฆฌ ์ž‘์—…์„ ์œ„ํ•ด, ํŒŒ์ธํŠœ๋‹ํ•˜๋ ค๋Š” ๋ชจ๋ธ์— ์—ฐ๊ฒฐ๋œ ํ”„๋กœ์„ธ์„œ ํด๋ž˜์Šค๋ฅผ ๊ฐ€์ ธ์˜ต๋‹ˆ๋‹ค.

```python
from transformers import AutoProcessor

checkpoint = "microsoft/git-base"
processor = AutoProcessor.from_pretrained(checkpoint)
```

The processor will internally preprocess the image (which includes resizing and pixel scaling) and tokenize the caption.

```python
def transforms(example_batch):
    images = [x for x in example_batch["image"]]
    captions = [x for x in example_batch["text"]]
    inputs = processor(images=images, text=captions, padding="max_length")
    inputs.update({"labels": inputs["input_ids"]})
    return inputs


train_ds.set_transform(transforms)
test_ds.set_transform(transforms)
```
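
Indexing into the transformed datasets now applies the processor on the fly. As a quick check, the keys of a transformed example should include the processor outputs plus the labels (the exact tensor shapes depend on the checkpoint's processor configuration):

```python
item = train_ds[0]
print(item.keys())  # expected: input_ids, attention_mask, pixel_values, labels
```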

๋ฐ์ดํ„ฐ์„ธํŠธ๊ฐ€ ์ค€๋น„๋˜์—ˆ์œผ๋‹ˆ ์ด์ œ ํŒŒ์ธํŠœ๋‹์„ ์œ„ํ•ด ๋ชจ๋ธ์„ ์„ค์ •ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

## Load a base model[[load-a-base-model]]

["microsoft/git-base"](https://huggingface.co/microsoft/git-base)๋ฅผ [`AutoModelForCausalLM`](https://huggingface.co/docs/transformers/model_doc/auto#transformers.AutoModelForCausalLM) ๊ฐ์ฒด๋กœ ๊ฐ€์ ธ์˜ต๋‹ˆ๋‹ค.


```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(checkpoint)
```
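
As a rough sanity check that the checkpoint loaded correctly, you can count its parameters (the exact figure may differ across revisions; git-base is on the order of a couple hundred million parameters):

```python
num_params = sum(p.numel() for p in model.parameters())
print(f"{num_params / 1e6:.1f}M parameters")
```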

## Evaluate[[evaluate]]

Image captioning models are typically evaluated with the [Rouge Score](https://huggingface.co/spaces/evaluate-metric/rouge) or [Word Error Rate](https://huggingface.co/spaces/evaluate-metric/wer).
This guide uses the Word Error Rate (WER).

Use the 🤗 Evaluate library to do so.
For potential limitations and other gotchas of the WER, refer to [this guide](https://huggingface.co/spaces/evaluate-metric/wer).


```python
from evaluate import load
import torch

wer = load("wer")


def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predicted = logits.argmax(-1)
    decoded_labels = processor.batch_decode(labels, skip_special_tokens=True)
    decoded_predictions = processor.batch_decode(predicted, skip_special_tokens=True)
    wer_score = wer.compute(predictions=decoded_predictions, references=decoded_labels)
    return {"wer_score": wer_score}
```
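
To get a feel for the metric itself, here is a toy check with made-up strings: dropping one word from a six-word reference yields a WER of 1/6:

```python
wer.compute(
    predictions=["a drawing of a pokemon"],
    references=["a drawing of a blue pokemon"],
)  # 0.1666...
```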

## Train![[train!]]

Now you are ready to start fine-tuning the model. You will use the 🤗 [`Trainer`] for this.

First, define the training arguments using [`TrainingArguments`].


```python
from transformers import TrainingArguments, Trainer

model_name = checkpoint.split("/")[1]

training_args = TrainingArguments(
    output_dir=f"{model_name}-pokemon",
    learning_rate=5e-5,
    num_train_epochs=50,
    fp16=True,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    gradient_accumulation_steps=2,
    save_total_limit=3,
    evaluation_strategy="steps",
    eval_steps=50,
    save_strategy="steps",
    save_steps=50,
    logging_steps=50,
    remove_unused_columns=False,
    push_to_hub=True,
    label_names=["labels"],
    load_best_model_at_end=True,
)
```

Then pass the training arguments, along with the datasets and the model, to 🤗 Trainer.

```python
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=test_ds,
    compute_metrics=compute_metrics,
)
```

ํ•™์Šต์„ ์‹œ์ž‘ํ•˜๋ ค๋ฉด [`Trainer`] ๊ฐ์ฒด์—์„œ [`~Trainer.train`]์„ ํ˜ธ์ถœํ•˜๊ธฐ๋งŒ ํ•˜๋ฉด ๋ฉ๋‹ˆ๋‹ค.

```python
trainer.train()
```

ํ•™์Šต์ด ์ง„ํ–‰๋˜๋ฉด์„œ ํ•™์Šต ์†์‹ค์ด ์›ํ™œํ•˜๊ฒŒ ๊ฐ์†Œํ•˜๋Š” ๊ฒƒ์„ ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

ํ•™์Šต์ด ์™„๋ฃŒ๋˜๋ฉด ๋ชจ๋“  ์‚ฌ๋žŒ์ด ๋ชจ๋ธ์„ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ๋„๋ก [`~Trainer.push_to_hub`] ๋ฉ”์†Œ๋“œ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๋ชจ๋ธ์„ ํ—ˆ๋ธŒ์— ๊ณต์œ ํ•˜์„ธ์š”:


```python
trainer.push_to_hub()
```
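
Once pushed, the checkpoint can be loaded back like any other model on the Hub; a minimal sketch, assuming the hypothetical repo id `your-username/git-base-pokemon`:

```python
from transformers import AutoModelForCausalLM, AutoProcessor

# Replace "your-username" with your actual Hub username
finetuned_model = AutoModelForCausalLM.from_pretrained("your-username/git-base-pokemon")
finetuned_processor = AutoProcessor.from_pretrained("your-username/git-base-pokemon")
```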

## Inference[[inference]]

Take a sample image from `test_ds` to test the model.


```python
from PIL import Image
import requests

url = "https://huggingface.co/datasets/sayakpaul/sample-datasets/resolve/main/pokemon.png"
image = Image.open(requests.get(url, stream=True).raw)
image
```

<div class="flex justify-center">
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/test_image_image_cap.png" alt="Test image"/>
</div>

๋ชจ๋ธ์— ์‚ฌ์šฉํ•  ์ด๋ฏธ์ง€๋ฅผ ์ค€๋น„ํ•ฉ๋‹ˆ๋‹ค.

```python
device = "cuda" if torch.cuda.is_available() else "cpu"

inputs = processor(images=image, return_tensors="pt").to(device)
pixel_values = inputs.pixel_values
```

Call [`generate`] and decode the predictions.

```python
generated_ids = model.generate(pixel_values=pixel_values, max_length=50)
generated_caption = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(generated_caption)
```
```bash
a drawing of a pink and blue pokemon
```

ํŒŒ์ธํŠœ๋‹๋œ ๋ชจ๋ธ์ด ๊ฝค ๊ดœ์ฐฎ์€ ์บก์…˜์„ ์ƒ์„ฑํ•œ ๊ฒƒ ๊ฐ™์Šต๋‹ˆ๋‹ค!