Dataset Card for CAMERA 📷

Dataset Description

Homepage: https://github.com/CyberAgentAILab/camera
Repository: https://github.com/shunk031/huggingface-datasets_CAMERA

Dataset Summary

From the official README.md:

CAMERA (CyberAgent Multimodal Evaluation for Ad Text GeneRAtion) is the Japanese ad text generation dataset. We hope that our dataset will be useful in research for realizing more advanced ad text generation models.

Supported Tasks and Leaderboards

[More Information Needed]

Supported Tasks

[More Information Needed]

Leaderboard

[More Information Needed]

Languages

The language data in CAMERA is in Japanese (BCP-47 ja-JP).

Dataset Structure

Data Instances

When loading the dataset, users have to select a configuration (without-lp-images or with-lp-images) via the name argument:

without-lp-images

from datasets import load_dataset

dataset = load_dataset("shunk031/CAMERA", name="without-lp-images")

print(dataset)
# DatasetDict({
#     train: Dataset({
#         features: ['asset_id', 'kw', 'lp_meta_description', 'title_org', 'title_ne1', 'title_ne2', 'title_ne3', 'domain', 'parsed_full_text_annotation'],
#         num_rows: 12395
#     })
#     validation: Dataset({
#         features: ['asset_id', 'kw', 'lp_meta_description', 'title_org', 'title_ne1', 'title_ne2', 'title_ne3', 'domain', 'parsed_full_text_annotation'],
#         num_rows: 3098
#     })
#     test: Dataset({
#         features: ['asset_id', 'kw', 'lp_meta_description', 'title_org', 'title_ne1', 'title_ne2', 'title_ne3', 'domain', 'parsed_full_text_annotation'],
#         num_rows: 872
#     })
# })

An example of the CAMERA (w/o LP images) dataset looks as follows:

{
    "asset_id": 13861, 
    "kw": "仙台 ホテル", 
    "lp_meta_description": "仙台のホテルや旅館をお探しなら楽天トラベルへ！楽天ポイントが使えて、貯まって、とってもお得な宿泊予約サイトです。さらに割引クーポンも使える！国内ツアー・航空券・レンタカー・バス予約も！", 
    "title_org": "仙台市のホテル", 
    "title_ne1": "", 
    "title_ne2": "", 
    "title_ne3": "", 
    "domain": "", 
    "parsed_full_text_annotation": {
        "text": [
            "trivago", 
            "Oops...AccessDenied 可", 
            "Youarenotallowedtoviewthispage!Ifyouthinkthisisanerror,pleasecontacttrivago.", 
            "Errorcode:0.3c99e86e.1672026945.25ba640YourIP:240d:1a:4d8:2800:b9b0:ea86:2087:d141AffectedURL:https://www.trivago.jp/ja/odr/%E8%BB%92", 
            "%E4%BB%99%E5%8F%B0-%E5%9B%BD%E5%86%85?search=20072325", 
            "Backtotrivago"
        ], 
        "xmax": [
            653, 
            838, 
            765, 
            773, 
            815, 
            649
        ], 
        "xmin": [
            547, 
            357, 
            433, 
            420, 
            378, 
            550
        ], 
        "ymax": [
            47, 
            390, 
            475, 
            558, 
            598, 
            663
        ], 
        "ymin": [
            18, 
            198, 
            439, 
            504, 
            566, 
            651
        ]
    }
}
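
The parsed_full_text_annotation field stores the OCR output as parallel lists: in the instance above, text, xmin, ymin, xmax, and ymax each have six entries. The following is a minimal sketch, assuming the i-th entries of all five lists describe the same text region. The helper name iter_ocr_boxes is hypothetical and not part of the dataset loader.

```python
from typing import Dict, Iterator, List, Tuple

Box = Tuple[int, int, int, int]  # (xmin, ymin, xmax, ymax)


def iter_ocr_boxes(annotation: Dict[str, List]) -> Iterator[Tuple[str, Box]]:
    """Yield (text, box) pairs from the parallel OCR lists (hypothetical helper)."""
    keys = ("text", "xmin", "ymin", "xmax", "ymax")
    # Assumption: all five lists are index-aligned and equally long.
    if len({len(annotation[key]) for key in keys}) != 1:
        raise ValueError("OCR lists differ in length")
    for text, xmin, ymin, xmax, ymax in zip(*(annotation[key] for key in keys)):
        yield text, (xmin, ymin, xmax, ymax)


# First and last regions of the instance shown above, shortened for brevity.
sample_annotation = {
    "text": ["trivago", "Backtotrivago"],
    "xmin": [547, 550],
    "ymin": [18, 651],
    "xmax": [653, 649],
    "ymax": [47, 663],
}

ocr_regions = list(iter_ocr_boxes(sample_annotation))
for text, box in ocr_regions:
    print(text, box)
```

With a loaded dataset, the same helper could be applied to dataset["train"][0]["parsed_full_text_annotation"].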

with-lp-images

from datasets import load_dataset

dataset = load_dataset("shunk031/CAMERA", name="with-lp-images")

print(dataset)
# DatasetDict({
#     train: Dataset({
#         features: ['asset_id', 'kw', 'lp_meta_description', 'title_org', 'title_ne1', 'title_ne2', 'title_ne3', 'domain', 'parsed_full_text_annotation', 'lp_image'],
#         num_rows: 12395
#     })
#     validation: Dataset({
#         features: ['asset_id', 'kw', 'lp_meta_description', 'title_org', 'title_ne1', 'title_ne2', 'title_ne3', 'domain', 'parsed_full_text_annotation', 'lp_image'],
#         num_rows: 3098
#     })
#     test: Dataset({
#         features: ['asset_id', 'kw', 'lp_meta_description', 'title_org', 'title_ne1', 'title_ne2', 'title_ne3', 'domain', 'parsed_full_text_annotation', 'lp_image'],
#         num_rows: 872
#     })
# })

An example of the CAMERA (w/ LP images) dataset looks as follows:

{
    "asset_id": 13861, 
    "kw": "仙台 ホテル", 
    "lp_meta_description": "仙台のホテルや旅館をお探しなら楽天トラベルへ！楽天ポイントが使えて、貯まって、とってもお得な宿泊予約サイトです。さらに割引クーポンも使える！国内ツアー・航空券・レンタカー・バス予約も！", 
    "title_org": "仙台市のホテル", 
    "title_ne1": "", 
    "title_ne2": "", 
    "title_ne3": "", 
    "domain": "", 
    "parsed_full_text_annotation": {
        "text": [
            "trivago", 
            "Oops...AccessDenied 可", 
            "Youarenotallowedtoviewthispage!Ifyouthinkthisisanerror,pleasecontacttrivago.", 
            "Errorcode:0.3c99e86e.1672026945.25ba640YourIP:240d:1a:4d8:2800:b9b0:ea86:2087:d141AffectedURL:https://www.trivago.jp/ja/odr/%E8%BB%92", 
            "%E4%BB%99%E5%8F%B0-%E5%9B%BD%E5%86%85?search=20072325", 
            "Backtotrivago"
        ], 
        "xmax": [
            653, 
            838, 
            765, 
            773, 
            815, 
            649
        ], 
        "xmin": [
            547, 
            357, 
            433, 
            420, 
            378, 
            550
        ], 
        "ymax": [
            47, 
            390, 
            475, 
            558, 
            598, 
            663
        ], 
        "ymin": [
            18, 
            198, 
            439, 
            504, 
            566, 
            651
        ]
    },
    "lp_image": <PIL.PngImagePlugin.PngImageFile image mode=RGBA size=1200x680 at 0x7F8513446B20>
}
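
In the instance above, lp_image is a 1200x680 image and every OCR coordinate falls inside that range, which suggests the boxes are pixel coordinates in the image's frame. The sketch below assumes exactly that and rescales the boxes to the [0, 1] range, as layout-aware models often expect. The helper name normalize_boxes is hypothetical.

```python
from typing import Dict, List, Tuple

NormalizedBox = Tuple[float, float, float, float]


def normalize_boxes(
    annotation: Dict[str, List[int]], image_size: Tuple[int, int]
) -> List[NormalizedBox]:
    """Scale pixel boxes to [0, 1] relative to the image size (hypothetical helper)."""
    width, height = image_size
    if width <= 0 or height <= 0:
        raise ValueError("image_size must be positive")
    # Assumption: OCR coordinates are pixels measured on the LP image itself.
    return [
        (xmin / width, ymin / height, xmax / width, ymax / height)
        for xmin, ymin, xmax, ymax in zip(
            annotation["xmin"],
            annotation["ymin"],
            annotation["xmax"],
            annotation["ymax"],
        )
    ]


# First and last boxes of the instance shown above; the size comes from lp_image.
sample_boxes = {
    "xmin": [547, 550],
    "ymin": [18, 651],
    "xmax": [653, 649],
    "ymax": [47, 663],
}
normalized = normalize_boxes(sample_boxes, (1200, 680))
for box in normalized:
    print([round(value, 3) for value in box])
```

For a real example, the size can be taken from the PIL image itself, for instance normalize_boxes(example["parsed_full_text_annotation"], example["lp_image"].size).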

Data Fields

without-lp-images

asset_id: ids (associated with LP images)
kw: search keyword
lp_meta_description: meta description extracted from LP (i.e., LP Text)
title_org: ad text (original gold reference)
title_ne{1-3}: ad text (additional gold references for multi-reference evaluation)
domain: industry domain (HR, EC, Fin, Edu) for industry-wise evaluation
parsed_full_text_annotation: OCR results for LP images

with-lp-images

asset_id: ids (associated with LP images)
kw: search keyword
lp_meta_description: meta description extracted from LP (i.e., LP Text)
title_org: ad text (original gold reference)
title_ne{1-3}: ad text (additional gold references for multi-reference evaluation)
domain: industry domain (HR, EC, Fin, Edu) for industry-wise evaluation
parsed_full_text_annotation: OCR results for LP images
lp_image: Landing page (LP) image
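
For multi-reference evaluation, title_org and title_ne{1-3} can be gathered into a single list of references. The sketch below assumes that an empty string marks a missing reference, as in the instance shown earlier, where title_ne1 to title_ne3 are empty. The helper name collect_references is hypothetical, and the second sample uses made-up placeholder strings rather than real dataset content.

```python
from typing import Dict, List


def collect_references(example: Dict[str, str]) -> List[str]:
    """Return all non-empty reference ad texts of one example (hypothetical helper)."""
    candidates = [example["title_org"]]
    candidates += [example[f"title_ne{i}"] for i in (1, 2, 3)]
    # Assumption: an empty string means the reference is not available.
    return [title for title in candidates if title]


# Fields copied from the instance shown earlier (only one reference present).
single_reference = {
    "title_org": "仙台市のホテル",
    "title_ne1": "",
    "title_ne2": "",
    "title_ne3": "",
}

# Made-up placeholder values illustrating an example with several references.
multi_reference = {
    "title_org": "reference A",
    "title_ne1": "reference B",
    "title_ne2": "",
    "title_ne3": "reference C",
}

single_refs = collect_references(single_reference)
multi_refs = collect_references(multi_reference)
print(single_refs)
print(multi_refs)
```

The resulting lists can be passed to any metric that accepts several references per prediction.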

Data Splits

From the official paper:

Split	# of data	# of reference ad text	industry domain label
Train	12,395	1	-
Valid	3,098	1	-
Test	869	4	✔

Dataset Creation

Curation Rationale

[More Information Needed]

Source Data

Initial Data Collection and Normalization

[More Information Needed]

Who are the source language producers?

[More Information Needed]

Annotations

Annotation process

[More Information Needed]

Who are the annotators?

[More Information Needed]

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

[More Information Needed]

Dataset Curators

[More Information Needed]

Licensing Information

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Citation Information

@inproceedings{mita-et-al:nlp2023,
    author =    "三田 雅人 and 村上 聡一朗 and 張 培楠",
    title =	    "広告文生成タスクの規定とベンチマーク構築",
    booktitle = "言語処理学会 第 29 回年次大会",
    year =      2023,
}

Contributions

Thanks to Masato Mita, Soichiro Murakami, and Peinan Zhang for creating this dataset.
