- This dataset is based on the paper that describes the English wikiHow summarization dataset (see the reference at the bottom of this README).
- This dataset is crawled from Japanese wikiHow to build a Japanese summarization dataset.
- Python3
- `pip install -r requirements.txt`
- For a quick start, run `bash get.sh`.
- The train/dev/test json files are made in `data/output`.
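  To sanity-check the output, you can inspect the generated files with a few lines of Python. This is a minimal sketch that assumes the files are JSON-lines (one object per line) with the keys listed in the train/dev/test.json table below; if they turn out to be single JSON arrays, use `json.load` instead.

  ```python
  import json
  from pathlib import Path

  # Minimal inspection sketch; assumes data/output/train.json is JSON-lines.
  with Path("data/output/train.json").open(encoding="utf-8") as f:
      for line in f:
          example = json.loads(line)
          # Keys as documented in the train/dev/test.json table below.
          print(example["title"])
          print("src:", example["src"][:80])
          print("tgt:", example["tgt"][:80])
          break  # only show the first record
  ```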
- In detail, the pipeline runs the following steps.
1. `bash crawl_article.sh`
   - Crawl each article from the URL addresses in `data/urls` (see the sketch below).
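   `crawl_article.sh` is a shell wrapper, so the sketch below is only a rough Python equivalent of what a crawler like this does. The url-list file name, the output directory, and the one-second delay are assumptions for illustration, not the script's actual parameters.

   ```python
   import time
   from pathlib import Path

   import requests

   # Hypothetical crawling sketch: fetch every URL listed under data/urls
   # and store the raw HTML. All names and the delay are assumptions.
   out_dir = Path("data/html")
   out_dir.mkdir(parents=True, exist_ok=True)

   urls = Path("data/urls/urls.txt").read_text(encoding="utf-8").splitlines()
   for i, url in enumerate(urls):
       resp = requests.get(url, timeout=30)
       resp.raise_for_status()
       (out_dir / f"article_{i:05d}.html").write_text(resp.text, encoding="utf-8")
       time.sleep(1.0)  # stay polite to the wikiHow servers
   ```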
2. `bash scrape2jsonl.sh`
   - Extract the knowledge from the html files crawled in step 1 (see the sketch below).
   - The extracted info is described in #json_description below.
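   In spirit, the scraping step does something like the following sketch, assuming BeautifulSoup. The selectors and the output path are illustrative only; the real script extracts all the fields listed in the Howto json table.

   ```python
   import json
   from pathlib import Path

   from bs4 import BeautifulSoup

   # Hypothetical scraping sketch: pull a couple of the Howto json fields out
   # of each crawled page. Selectors and paths are illustrative assumptions.
   with open("data/howto.jsonl", "w", encoding="utf-8") as out:
       for html_file in sorted(Path("data/html").glob("*.html")):
           soup = BeautifulSoup(html_file.read_text(encoding="utf-8"), "html.parser")
           record = {
               "meta_title": soup.title.get_text(strip=True) if soup.title else "",
               "original_title": soup.h1.get_text(strip=True) if soup.h1 else "",
           }
           out.write(json.dumps(record, ensure_ascii=False) + "\n")
   ```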
3. `python script/make_data.py`
   - Make the train/dev/test data from the json files extracted in step 2, based on `data/divided_data.tsv` (see the sketch below).
   - The details of the json files are described in the section below.
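   A sketch of the splitting logic, under two assumptions that are not documented here: `data/divided_data.tsv` maps an article identifier to one of train/dev/test, and step 2's records sit in a single `data/howto.jsonl` (a name used only in these sketches).

   ```python
   import csv
   import json

   # Hypothetical split sketch: route each record to train/dev/test according
   # to data/divided_data.tsv. The TSV layout (article id, split name) and the
   # use of meta_title as the join key are assumptions about the file formats.
   split_of = {}
   with open("data/divided_data.tsv", encoding="utf-8") as tsv:
       for article_id, split in csv.reader(tsv, delimiter="\t"):
           split_of[article_id] = split

   outputs = {s: open(f"data/output/{s}.json", "w", encoding="utf-8")
              for s in ("train", "dev", "test")}
   with open("data/howto.jsonl", encoding="utf-8") as f:
       for line in f:
           record = json.loads(line)
           split = split_of.get(record.get("meta_title"))
           if split in outputs:
               outputs[split].write(line)
   for out in outputs.values():
       out.close()
   ```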
- Howto json
| key | value |
|---|---|
| meta_title | html meta title text |
| num_part | the total number of parts in the article |
| original_title | title text |
| part_name_exist | whether the part title exists or not |
| contents | list of parts (the number of parts equals num_part) |
| - part_title | part title |
| - part_contents | list of each step's content in the part |
| -- article | the article text in the step |
| -- bold_line | the bold line in the step |

(Rows prefixed with `-` and `--` are fields nested inside `contents` and `part_contents`, respectively.)
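For concreteness, here is a hypothetical Howto json record matching the table, shown as a Python literal; every value is invented for illustration.

```python
# Hypothetical Howto json record; all values are invented for illustration.
howto_example = {
    "meta_title": "How to Brew Green Tea - wikiHow",
    "num_part": 1,
    "original_title": "How to Brew Green Tea",
    "part_name_exist": True,
    "contents": [  # one entry per part; len(contents) == num_part
        {
            "part_title": "Preparing the Leaves",
            "part_contents": [  # one entry per step in this part
                {
                    "article": "Measure the leaves, then warm the pot before pouring.",
                    "bold_line": "Measure the leaves.",
                },
            ],
        },
    ],
}
```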
- train/dev/test.json
| key | value |
|---|---|
| src | source text |
| tgt | target text; bold lines in the article |
| title | article title + current part number |
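Similarly, a hypothetical record from `data/output/train.json`; the values and the exact way the title and part number are concatenated are assumptions for illustration.

```python
# Hypothetical train/dev/test record; all values are invented for illustration.
train_example = {
    "src": "Measure the leaves, then warm the pot before pouring the water.",
    "tgt": "Measure the leaves.",
    "title": "How to Brew Green Tea (part 1)",
}
```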
- English wikiHow summarization dataset: https://github.com/mahnazkoupaee/WikiHow-Dataset
The articles are provided by wikiHow. Content on wikiHow can be shared under a Creative Commons License (CC-BY-NC-SA).