- This dataset is based on the paper that describes the English wikiHow summarization dataset (see the reference at the bottom of this README).
- This dataset is crawled from Japanese wikiHow to build a Japanese summarization dataset.
- Python3
- `pip install -r requirements.txt`
- For a quick start, run `bash get.sh`.
- The train/dev/test json files are made in `data/output`.
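  To sanity-check the output, you can inspect the generated files with a few lines of Python. This is a minimal sketch that assumes the files are JSON-lines (one object per line) with the keys listed in the train/dev/test.json table below; if they turn out to be single JSON arrays, use `json.load` instead.

  ```python
  import json
  from pathlib import Path

  # Minimal inspection sketch; assumes data/output/train.json is JSON-lines.
  with Path("data/output/train.json").open(encoding="utf-8") as f:
      for line in f:
          example = json.loads(line)
          # Keys as documented in the train/dev/test.json table below.
          print(example["title"])
          print("src:", example["src"][:80])
          print("tgt:", example["tgt"][:80])
          break  # only show the first record
  ```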
- In detail, the pipeline runs the following steps.
1. `bash crawl_article.sh`
   - Crawl each article from the URL addresses in `data/urls` (see the sketch below).
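   `crawl_article.sh` is a shell wrapper, so the sketch below is only a rough Python equivalent of what a crawler like this does. The url-list file name, the output directory, and the one-second delay are assumptions for illustration, not the script's actual parameters.

   ```python
   import time
   from pathlib import Path

   import requests

   # Hypothetical crawling sketch: fetch every URL listed under data/urls
   # and store the raw HTML. All names and the delay are assumptions.
   out_dir = Path("data/html")
   out_dir.mkdir(parents=True, exist_ok=True)

   urls = Path("data/urls/urls.txt").read_text(encoding="utf-8").splitlines()
   for i, url in enumerate(urls):
       resp = requests.get(url, timeout=30)
       resp.raise_for_status()
       (out_dir / f"article_{i:05d}.html").write_text(resp.text, encoding="utf-8")
       time.sleep(1.0)  # stay polite to the wikiHow servers
   ```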
2. `bash scrape2jsonl.sh`
   - Extract the knowledge from the html files crawled in step 1 (see the sketch below).
   - The extracted info is described in #json_description below.
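   In spirit, the scraping step does something like the following sketch, assuming BeautifulSoup. The selectors and the output path are illustrative only; the real script extracts all the fields listed in the Howto json table.

   ```python
   import json
   from pathlib import Path

   from bs4 import BeautifulSoup

   # Hypothetical scraping sketch: pull a couple of the Howto json fields out
   # of each crawled page. Selectors and paths are illustrative assumptions.
   with open("data/howto.jsonl", "w", encoding="utf-8") as out:
       for html_file in sorted(Path("data/html").glob("*.html")):
           soup = BeautifulSoup(html_file.read_text(encoding="utf-8"), "html.parser")
           record = {
               "meta_title": soup.title.get_text(strip=True) if soup.title else "",
               "original_title": soup.h1.get_text(strip=True) if soup.h1 else "",
           }
           out.write(json.dumps(record, ensure_ascii=False) + "\n")
   ```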
3. `python script/make_data.py`
   - Make the train/dev/test data from the json files extracted in step 2, based on `data/divided_data.tsv` (see the sketch below).
   - The details of the json files are described in the section below.
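   A sketch of the splitting logic, under two assumptions that are not documented here: `data/divided_data.tsv` maps an article identifier to one of train/dev/test, and step 2's records sit in a single `data/howto.jsonl` (a name used only in these sketches).

   ```python
   import csv
   import json

   # Hypothetical split sketch: route each record to train/dev/test according
   # to data/divided_data.tsv. The TSV layout (article id, split name) and the
   # use of meta_title as the join key are assumptions about the file formats.
   split_of = {}
   with open("data/divided_data.tsv", encoding="utf-8") as tsv:
       for article_id, split in csv.reader(tsv, delimiter="\t"):
           split_of[article_id] = split

   outputs = {s: open(f"data/output/{s}.json", "w", encoding="utf-8")
              for s in ("train", "dev", "test")}
   with open("data/howto.jsonl", encoding="utf-8") as f:
       for line in f:
           record = json.loads(line)
           split = split_of.get(record.get("meta_title"))
           if split in outputs:
               outputs[split].write(line)
   for out in outputs.values():
       out.close()
   ```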
- Howto json
| key | value |
|---|---|
| meta_title | html meta title text |
| num_part | the total number of parts in the article |
| original_title | title text |
| part_name_exist | whether the part title exists or not |
| contents | list of parts (the number of parts equals num_part) |
| - part_title | part title |
| - part_contents | list of each step's content in the part |
| -- article | the article text in the step |
| -- bold_line | the bold line in the step |

(Rows prefixed with `-` and `--` are fields nested inside `contents` and `part_contents`, respectively.)
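For concreteness, here is a hypothetical Howto json record matching the table, shown as a Python literal; every value is invented for illustration.

```python
# Hypothetical Howto json record; all values are invented for illustration.
howto_example = {
    "meta_title": "How to Brew Green Tea - wikiHow",
    "num_part": 1,
    "original_title": "How to Brew Green Tea",
    "part_name_exist": True,
    "contents": [  # one entry per part; len(contents) == num_part
        {
            "part_title": "Preparing the Leaves",
            "part_contents": [  # one entry per step in this part
                {
                    "article": "Measure the leaves, then warm the pot before pouring.",
                    "bold_line": "Measure the leaves.",
                },
            ],
        },
    ],
}
```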
- train/dev/test.json
| key | value |
|---|---|
| src | source text |
| tgt | target text; bold lines in the article |
| title | article title + current part number |
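Similarly, a hypothetical record from `data/output/train.json`; the values and the exact way the title and part number are concatenated are assumptions for illustration.

```python
# Hypothetical train/dev/test record; all values are invented for illustration.
train_example = {
    "src": "Measure the leaves, then warm the pot before pouring the water.",
    "tgt": "Measure the leaves.",
    "title": "How to Brew Green Tea (part 1)",
}
```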
- English wikiHow summarization dataset: https://github.com/mahnazkoupaee/WikiHow-Dataset
The articles are provided by wikiHow. Content on wikiHow can be shared under a Creative Commons License (CC-BY-NC-SA).