Apply English translation of modu_ne.md #160

Merged
merged 15 commits into from
Nov 11, 2020
55 changes: 54 additions & 1 deletion en-docs/corpuslist/modu_ne.md
@@ -4,4 +4,57 @@ sort: 16

# Modu: Named Entity

TBD
Modu: Named Entity is a dataset released by the National Institute of Korean Language.
The data specification is as follows.


- author: National Institute of Korean Language
- repository: [https://corpus.korean.go.kr](https://corpus.korean.go.kr)
- references: [document](https://rlkujwkk7.toastcdn.net/NIKL_NE(v1.0).pdf)
I have opened an issue about this in #165. I am leaving this comment as an index for later revisions.

- size:
  - train: 20,188 examples

```warning
Because of licensing restrictions on the Modu corpus, `Korpora` does not provide a download function for this corpus; it only offers a load function.
If you wish to use this corpus, please complete the authentication process required by the National Institute of Korean Language and download the corpus manually.
```

You can load the corpus from your Python console as follows.

```python
from Korpora import Korpora
corpus = Korpora.load("modu_ne")
```

```warning
The code assumes that the corpus has already been unzipped into the `NIKL_NE` directory within `~/Korpora` (`~/Korpora/NIKL_NE`).
If the root directory is not `~/Korpora`, pass the `root_dir=custom_path` argument to the `load` method.
```
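Since the corpus must be downloaded manually, it can help to check that the expected directory exists before loading. The following is a minimal sketch, not part of `Korpora`: the helper names `expected_corpus_dir` and `load_modu_ne` are hypothetical, and the only `Korpora` call used is `Korpora.load` with the `root_dir` argument described above.

```python
from pathlib import Path


def expected_corpus_dir(root_dir="~/Korpora", dirname="NIKL_NE"):
    """Return the directory where the unzipped corpus is expected to live."""
    return Path(root_dir).expanduser() / dirname


def load_modu_ne(root_dir="~/Korpora"):
    """Load Modu: Named Entity, failing early if the files are missing."""
    corpus_dir = expected_corpus_dir(root_dir)
    if not corpus_dir.is_dir():
        raise FileNotFoundError(
            f"{corpus_dir} not found; download the corpus manually and unzip it there."
        )
    # Imported lazily so the path check works even without Korpora installed.
    from Korpora import Korpora

    return Korpora.load("modu_ne", root_dir=str(Path(root_dir).expanduser()))
```

Calling `load_modu_ne("/data/corpora")` would then look for `/data/corpora/NIKL_NE` and raise a readable error if it is absent.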

You can also load the corpus as follows.
The result is identical to that of the previous code.

```python
from Korpora import ModuNEKorpus
corpus = ModuNEKorpus()
```

```warning
The code assumes that the corpus has already been unzipped into `~/Korpora/NIKL_NE` under the current user's home directory.
If the corpus is in another directory, pass the `root_dir_or_paths=custom_path` argument when instantiating `ModuNEKorpus`.
```

Either of the examples above loads the corpus into the variable `corpus`.
`train` holds the training split of the corpus; you can inspect its first example as follows.

```
>>> corpus.train[0]
NamedEntityExample(
id=NWRW1800000029.315.1.1,
sentence=[횡설수설/권순활]北 ‘외화벌이’ 뜯어먹기,
tags=['AF', 'PS', 'LC'],
positions=[(1, 5), (6, 9), (10, 11)]
)
>>> corpus.train[0].sentence
[횡설수설/권순활]北 ‘외화벌이’ 뜯어먹기
```
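In the example above, each entry in `positions` appears to be a half-open `(begin, end)` character span into `sentence`, paired by index with `tags`. This reading is inferred from this single example and may not hold everywhere. The sketch below uses a hypothetical stand-in class (`ExampleStandIn`) with the same field names, so it runs without the corpus; `entity_spans` is also a hypothetical helper.

```python
from typing import List, NamedTuple, Tuple


class ExampleStandIn(NamedTuple):
    """Hypothetical stand-in mirroring the fields printed above."""
    id: str
    sentence: str
    tags: List[str]
    positions: List[Tuple[int, int]]


def entity_spans(example):
    """Pair each tag with the surface text its position covers."""
    return [
        (example.sentence[begin:end], tag)
        for tag, (begin, end) in zip(example.tags, example.positions)
    ]


example = ExampleStandIn(
    id="NWRW1800000029.315.1.1",
    sentence="[횡설수설/권순활]北 ‘외화벌이’ 뜯어먹기",
    tags=["AF", "PS", "LC"],
    positions=[(1, 5), (6, 9), (10, 11)],
)
print(entity_spans(example))
# [('횡설수설', 'AF'), ('권순활', 'PS'), ('北', 'LC')]
```

The same function should work on `corpus.train[0]` if the loaded object exposes `sentence`, `tags`, and `positions` as attributes, as the printed output suggests.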
71 changes: 70 additions & 1 deletion en-docs/corpuslist/modu_news.md
@@ -4,4 +4,73 @@ sort: 13

# Modu: Newspaper

TBD
Modu: Newspaper is a dataset released by the National Institute of Korean Language.
The data specification is as follows.

- author: National Institute of Korean Language
- repository: [https://corpus.korean.go.kr](https://corpus.korean.go.kr)
- references: [document](https://rlkujwkk7.toastcdn.net/NIKL_NEWSPAPER(v1.0).pdf)
- size:
  - train: about 3,500,000 examples

The data structure is as follows:

| Attribute | Description |
| --- | --- |
| document_id | unique ID of the article |
| title | title of the metadata (not the actual title of the article) |
| author | author of the article |
| publisher | newspaper publisher |
| date | publication date |
| topic | topic of the article (politics, business, social affairs, lifestyle, IT/science, entertainment, sports, culture, beauty/health) |
| original_topic | original topic assigned by the newspaper publisher |
| paragraph | body of the article (the first line appears to be the article's title) |

```warning
Because of licensing restrictions on the Modu corpus, `Korpora` does not provide a download function for this corpus; it only offers a load function.
If you wish to use this corpus, please complete the authentication process required by the National Institute of Korean Language and download the corpus manually.
```

You can load the corpus from your Python console as follows.

```python
from Korpora import Korpora
corpus = Korpora.load("modu_news")
```

```warning
The code assumes that the corpus has already been unzipped into the `NIKL_WRITTEN` directory within `~/Korpora` (`~/Korpora/NIKL_WRITTEN`).
If the root directory is not `~/Korpora`, pass the `root_dir=custom_path` argument to the `load` method.
```

You can also load the corpus as follows.
The result is identical to that of the previous code.

```python
from Korpora import ModuNewsKorpus
corpus = ModuNewsKorpus()
```

```warning
The code assumes that the corpus has already been unzipped into `~/Korpora/NIKL_WRITTEN` under the current user's home directory.
If the corpus is in another directory, pass the `root_dir_or_paths=custom_path` argument when instantiating `ModuNewsKorpus`.
```

```tip
If `load_light=True`, only the paragraphs and `document_id` are loaded. If it is set to `False`, all metadata is loaded as well. The default value of `load_light` is `True`.
```

Either of the examples above loads the corpus into the variable `corpus`.
`train` holds the training split of the corpus; you can inspect its first example as follows.

```
>>> corpus.train[0]
ModuNews(document_id='NPRW1900000010.1', title='한국경제 2018년 기사', author='김현석', publisher='한국경제신문사', date='20180101', topic='생활', original_topic='국제', paragraph=['"라니냐로 겨울 가뭄 온다"…', '...'])
```
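Once metadata is loaded, the fields shown above can be used to summarize the corpus. The sketch below counts articles per `topic` and pulls the first line of each `paragraph`. It uses a hypothetical stand-in record (`NewsStandIn`) with the same field names so that it runs without the corpus; the second sample record is invented purely for illustration. With `load_light=True`, metadata such as `topic` is not loaded, so the counts are presumably only meaningful when all metadata is loaded.

```python
from collections import Counter, namedtuple

# Hypothetical stand-in with the same field names as the output above.
NewsStandIn = namedtuple(
    "NewsStandIn",
    "document_id title author publisher date topic original_topic paragraph",
)


def count_topics(examples):
    """Count how many articles fall under each topic."""
    return Counter(example.topic for example in examples)


def first_lines(examples):
    """Return (document_id, first paragraph line) pairs, skipping empty bodies."""
    return [(e.document_id, e.paragraph[0]) for e in examples if e.paragraph]


sample = [
    NewsStandIn(
        "NPRW1900000010.1", "한국경제 2018년 기사", "김현석", "한국경제신문사",
        "20180101", "생활", "국제", ['"라니냐로 겨울 가뭄 온다"…', "..."],
    ),
    # Invented record, for illustration only.
    NewsStandIn("example-2", "", "", "", "", "IT/과학", "", []),
]
print(count_topics(sample))
print(first_lines(sample))
```

For a corpus of several million articles, passing a generator or a slice of `corpus.train` instead of the full list keeps memory use down.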

By executing the `get_all_texts` method, you can access all paragraphs (bodies of all articles) within the corpus.

```
>>> corpus.get_all_texts()
['"라니냐로 겨울 가뭄 온다"...', ... ]
```
47 changes: 46 additions & 1 deletion en-docs/corpuslist/modu_spoken.md
@@ -4,4 +4,49 @@ sort: 17

# Modu: Spoken

TBD
Modu: Spoken is a dataset released by the National Institute of Korean Language.
The data specification is as follows.

- author: National Institute of Korean Language
- repository: [https://corpus.korean.go.kr](https://corpus.korean.go.kr)
- references: [document](https://rlkujwkk7.toastcdn.net/NIKL_SPOKEN(v1.0).pdf)
- size:
  - train: 27,920 examples

```warning
Because of licensing restrictions on the Modu corpus, `Korpora` does not provide a download function for this corpus; it only offers a load function.
If you wish to use this corpus, please complete the authentication process required by the National Institute of Korean Language and download the corpus manually.
```

You can load the corpus from your Python console as follows.

```python
from Korpora import Korpora
corpus = Korpora.load("modu_spoken")
```

```warning
The code assumes that the corpus has already been unzipped into the `NIKL_WRITTEN` directory within `~/Korpora` (`~/Korpora/NIKL_WRITTEN`).
If the root directory is not `~/Korpora`, pass the `root_dir=custom_path` argument to the `load` method.
```

You can also load the corpus as follows.
The result is identical to that of the previous code.

```python
from Korpora import ModuSpokenKorpus
corpus = ModuSpokenKorpus()
```

```warning
The code assumes that the corpus has already been unzipped into `~/Korpora/NIKL_WRITTEN` under the current user's home directory.
If the corpus is in another directory, pass the `root_dir_or_paths=custom_path` argument when instantiating `ModuSpokenKorpus`.
```

Either of the examples above loads the corpus into the variable `corpus`.
`train` holds the training split of the corpus; you can inspect its first example as follows.

```
>>> corpus.train[0]
요즘처럼 추운 날씨에는 따뜻한 라테 한잔 찾는 분들 많으실 텐데요...
```
47 changes: 46 additions & 1 deletion en-docs/corpuslist/modu_web.md
@@ -4,4 +4,49 @@ sort: 18

# Modu: Web

TBD
Modu: Web is a dataset released by the National Institute of Korean Language.
The data specification is as follows.

- author: National Institute of Korean Language
- repository: [https://corpus.korean.go.kr](https://corpus.korean.go.kr)
- references: [document](https://rlkujwkk7.toastcdn.net/NIKL_WEB(v1.0).pdf)
- size:
  - train: 2,107,076 examples

```warning
Because of licensing restrictions on the Modu corpus, `Korpora` does not provide a download function for this corpus; it only offers a load function.
If you wish to use this corpus, please complete the authentication process required by the National Institute of Korean Language and download the corpus manually.
```

You can load the corpus from your Python console as follows.

```python
from Korpora import Korpora
corpus = Korpora.load("modu_web")
```

```warning
The code assumes that the corpus has already been unzipped into the `NIKL_WRITTEN` directory within `~/Korpora` (`~/Korpora/NIKL_WRITTEN`).
If the root directory is not `~/Korpora`, pass the `root_dir=custom_path` argument to the `load` method.
```

You can also load the corpus as follows.
The result is identical to that of the previous code.

```python
from Korpora import ModuWebKorpus
corpus = ModuWebKorpus()
```

```warning
The code assumes that the corpus has already been unzipped into `~/Korpora/NIKL_WRITTEN` under the current user's home directory.
If the corpus is in another directory, pass the `root_dir_or_paths=custom_path` argument when instantiating `ModuWebKorpus`.
```

Either of the examples above loads the corpus into the variable `corpus`.
`train` holds the training split of the corpus; you can inspect its first example as follows.

```
>>> corpus.train[0]
오메가3와 비타민C, 달맞이꽃종자유 등을 사려고 몇 시간을 검색하며 공부했다 ...
```
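With roughly two million examples, printing the whole training split is impractical, so a small preview helper can be handy. The sketch below is an illustration only: `preview` is a hypothetical helper, and it assumes that `corpus.train` supports `len()` in addition to the integer indexing shown above. A plain list stands in for `corpus.train` so the snippet runs without the corpus.

```python
def preview(train, n=3, width=40):
    """Return the first n examples, each truncated to `width` characters."""
    count = min(n, len(train))
    return [str(train[i])[:width] for i in range(count)]


# Stand-in for corpus.train; the text is the example shown above.
stand_in_train = [
    "오메가3와 비타민C, 달맞이꽃종자유 등을 사려고 몇 시간을 검색하며 공부했다 ...",
]
for line in preview(stand_in_train, n=5):
    print(line)
```

Because the helper only indexes the first few items, it avoids copying or iterating over the full split.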
47 changes: 46 additions & 1 deletion en-docs/corpuslist/modu_written.md
@@ -4,4 +4,49 @@ sort: 19

# Modu: Written

TBD
Modu: Written is a dataset released by the National Institute of Korean Language.
The data specification is as follows.

- author: National Institute of Korean Language
- repository: [https://corpus.korean.go.kr](https://corpus.korean.go.kr)
- references: [document](https://rlkujwkk7.toastcdn.net/NIKL_WRITTEN(v1.0).pdf)
- size:
  - train: 20,188 examples

```warning
Because of licensing restrictions on the Modu corpus, `Korpora` does not provide a download function for this corpus; it only offers a load function.
If you wish to use this corpus, please complete the authentication process required by the National Institute of Korean Language and download the corpus manually.
```

You can load the corpus from your Python console as follows.

```python
from Korpora import Korpora
corpus = Korpora.load("modu_written")
```

```warning
The code assumes that the corpus has already been unzipped into the `NIKL_WRITTEN` directory within `~/Korpora` (`~/Korpora/NIKL_WRITTEN`).
If the root directory is not `~/Korpora`, pass the `root_dir=custom_path` argument to the `load` method.
```

You can also load the corpus as follows.
The result is identical to that of the previous code.

```python
from Korpora import ModuWrittenKorpus
corpus = ModuWrittenKorpus()
```

```warning
The code assumes that the corpus has already been unzipped into `~/Korpora/NIKL_WRITTEN` under the current user's home directory.
If the corpus is in another directory, pass the `root_dir_or_paths=custom_path` argument when instantiating `ModuWrittenKorpus`.
```

Either of the examples above loads the corpus into the variable `corpus`.
`train` holds the training split of the corpus; you can inspect its first example as follows.

```
>>> corpus.train[0]
01범보다 무서운 곶감
```