ko-nlp · lovit · Nov 11, 2020 · Nov 5, 2020 · Nov 5, 2020 · Nov 5, 2020
diff --git a/en-docs/corpuslist/modu_ne.md b/en-docs/corpuslist/modu_ne.md
@@ -4,4 +4,57 @@ sort: 16
 
 # Modu: Named Entity
 
-TBD
+Modu: Named Entity is a dataset released by the National Institute of Korean Language.
+The data specification is as follows.
+
+
+- author: National Institute of Korean Language
+- repository: [https://corpus.korean.go.kr](https://corpus.korean.go.kr)
+- references: [document](https://rlkujwkk7.toastcdn.net/NIKL_NE(v1.0).pdf)
+- size:
+  - train: 20,188 examples
+
+```warning
+Because of the Modu corpus license, `Korpora` does not provide a download function for this corpus; it only provides a load function.
+To use this corpus, complete the authentication process required by the National Institute of Korean Language and download the corpus manually.
+```
+
+You can load the corpus from your Python console as follows.
+
+```python
+from Korpora import Korpora
+corpus = Korpora.load("modu_ne")
+```
+
+```warning
+The code assumes that the corpus has already been unzipped into the `NIKL_NE` directory under `~/Korpora` (`~/Korpora/NIKL_NE`).
+If the root directory is not `~/Korpora`, pass the `root_dir=custom_path` argument to the `load` method.
+```
+
+You can also load the corpus as follows.
+The result is identical to that of the previous code.
+
+```python
+from Korpora import ModuNEKorpus
+corpus = ModuNEKorpus()
+```
+
+```warning
+The code assumes that the corpus has already been unzipped into `~/Korpora/NIKL_NE` under the current user's home directory.
+If the corpus is in another directory, pass the `root_dir_or_paths=custom_path` argument when instantiating `ModuNEKorpus`.
+```
+
+Either of the examples above loads the corpus into the variable `corpus`.
+`train` refers to the training set of the corpus; you can inspect its first example as follows.
+
+```
+>>> corpus.train[0]
+NamedEntityExample(
+    id=NWRW1800000029.315.1.1,
+    sentence=[횡설수설/권순활]北 ‘외화벌이’ 뜯어먹기,
+    tags=['AF', 'PS', 'LC'],
+    positions=[(1, 5), (6, 9), (10, 11)]
+)
+>>> corpus.train[0].sentence
+[횡설수설/권순활]北 ‘외화벌이’ 뜯어먹기
+```
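Note on the `modu_ne.md` example above: the `tags` and `positions` fields appear to line up one-to-one, with each position selecting a substring of `sentence`. The following is a minimal sketch of that reading, not Korpora code. The `NamedEntityExample` class here is a hypothetical stand-in that only mirrors the printed fields, `entity_spans` is an illustrative helper, and the half-open offset interpretation is an assumption inferred from this single example.

```python
from typing import List, NamedTuple, Tuple


class NamedEntityExample(NamedTuple):
    """Hypothetical stand-in mirroring the printed fields; not Korpora's class."""
    id: str
    sentence: str
    tags: List[str]
    positions: List[Tuple[int, int]]


def entity_spans(example: NamedEntityExample) -> List[Tuple[str, str]]:
    """Pair each tag with the substring selected by its (begin, end) position.

    Assumes positions are half-open character offsets into the sentence.
    """
    return [
        (example.sentence[begin:end], tag)
        for tag, (begin, end) in zip(example.tags, example.positions)
    ]


# Values copied from the documented output of corpus.train[0].
example = NamedEntityExample(
    id="NWRW1800000029.315.1.1",
    sentence="[횡설수설/권순활]北 ‘외화벌이’ 뜯어먹기",
    tags=["AF", "PS", "LC"],
    positions=[(1, 5), (6, 9), (10, 11)],
)
spans = entity_spans(example)
print(spans)
```

Under that assumption, the three tags select the column name, the columnist's name, and the single character `北`. Whether the offsets are half-open throughout the corpus should be checked against the NIKL specification linked in the page.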
diff --git a/en-docs/corpuslist/modu_news.md b/en-docs/corpuslist/modu_news.md
@@ -4,4 +4,73 @@ sort: 13
 
 # Modu: Newspaper
 
-TBD
+Modu: Newspaper is a dataset released by the National Institute of Korean Language.
+The data specification is as follows.
+
+- author: National Institute of Korean Language
+- repository: [https://corpus.korean.go.kr](https://corpus.korean.go.kr)
+- references: [document](https://rlkujwkk7.toastcdn.net/NIKL_NEWSPAPER(v1.0).pdf)
+- size:
+  - train: about 3,500,000 examples
+
+Data structure is as follows:
+
+| Attribute | Description |
+| --- | --- |
+| document_id | Unique ID of the article |
+| title | Title of the metadata (not the actual title of the article) |
+| author | Author of the article |
+| publisher | Newspaper publisher |
+| date | Publication date |
+| topic | Topic of the article (politics, business, social affairs, lifestyle, IT/science, entertainment, sports, culture, beauty/health) |
+| original_topic | Original topic as categorized by the newspaper publisher |
+| paragraph | Body of the article (the first line appears to be the title of the article) |
+
+```warning
+Because of the Modu corpus license, `Korpora` does not provide a download function for this corpus; it only provides a load function.
+To use this corpus, complete the authentication process required by the National Institute of Korean Language and download the corpus manually.
+```
+
+You can load the corpus from your Python console as follows.
+
+```python
+from Korpora import Korpora
+corpus = Korpora.load("modu_news")
+```
+
+```warning
+The code assumes that the corpus has already been unzipped into the `NIKL_WRITTEN` directory under `~/Korpora` (`~/Korpora/NIKL_WRITTEN`).
+If the root directory is not `~/Korpora`, pass the `root_dir=custom_path` argument to the `load` method.
+```
+
+You can also load the corpus as follows.
+The result is identical to that of the previous code.
+
+```python
+from Korpora import ModuNewsKorpus
+corpus = ModuNewsKorpus()
+```
+
+```warning
+The code assumes that the corpus has already been unzipped into `~/Korpora/NIKL_WRITTEN` under the current user's home directory.
+If the corpus is in another directory, pass the `root_dir_or_paths=custom_path` argument when instantiating `ModuNewsKorpus`.
+```
+
+```tip
+If `load_light=True`, only the paragraphs and `document_id` are loaded. If it is set to `False`, all metadata is loaded as well. The default value of `load_light` is `True`.
+```
+
+Either of the examples above loads the corpus into the variable `corpus`.
+`train` refers to the training set of the corpus; you can inspect its first example as follows.
+
+```
+>>> corpus.train[0]
+ModuNews(document_id='NPRW1900000010.1', title='한국경제 2018년 기사', author='김현석', publisher='한국경제신문사', date='20180101', topic='생활', original_topic='국제', paragraph=['"라니냐로 겨울 가뭄 온다"…', '...'])
+```
+
+Calling the `get_all_texts` method returns all paragraphs (the bodies of all articles) in the corpus.
+
+```
+>>> corpus.get_all_texts()
+['"라니냐로 겨울 가뭄 온다"...', ... ]
+```
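Note on the `modu_news.md` table above: it says the first line of `paragraph` appears to be the article title. A rough sketch of separating that line from the rest of the body could look like the following. The `NewsRecord` class is a hypothetical stand-in carrying only two of the documented fields, `headline_and_body` is an illustrative helper rather than part of Korpora, and treating the first line as the headline is an assumption taken from the table, not a guarantee.

```python
from typing import List, NamedTuple, Tuple


class NewsRecord(NamedTuple):
    """Hypothetical stand-in with two of the documented fields; not Korpora's class."""
    document_id: str
    paragraph: List[str]


def headline_and_body(record: NewsRecord) -> Tuple[str, List[str]]:
    """Split a record's paragraph lines into (first line, remaining lines).

    Assumes the first line is the article title, as the attribute table suggests.
    """
    lines = record.paragraph
    if not lines:
        return "", []
    return lines[0], lines[1:]


# Values copied from the documented output of corpus.train[0].
record = NewsRecord(
    document_id="NPRW1900000010.1",
    paragraph=['"라니냐로 겨울 가뭄 온다"…', "..."],
)
headline, body = headline_and_body(record)
print(headline)
print(body)
```

Because the table only says the first line "seems" to be the title, it is worth spot-checking a few records before relying on this split.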
diff --git a/en-docs/corpuslist/modu_spoken.md b/en-docs/corpuslist/modu_spoken.md
@@ -4,4 +4,49 @@ sort: 17
 
 # Modu: Spoken
 
-TBD
+Modu: Spoken is a dataset released by the National Institute of Korean Language.
+The data specification is as follows.
+
+- author: National Institute of Korean Language
+- repository: [https://corpus.korean.go.kr](https://corpus.korean.go.kr)
+- references: [document](https://rlkujwkk7.toastcdn.net/NIKL_SPOKEN(v1.0).pdf)
+- size:
+  - train: 27,920 examples
+
+```warning
+Because of the Modu corpus license, `Korpora` does not provide a download function for this corpus; it only provides a load function.
+To use this corpus, complete the authentication process required by the National Institute of Korean Language and download the corpus manually.
+```
+
+You can load the corpus from your Python console as follows.
+
+```python
+from Korpora import Korpora
+corpus = Korpora.load("modu_spoken")
+```
+
+```warning
+The code assumes that the corpus has already been unzipped into the `NIKL_WRITTEN` directory under `~/Korpora` (`~/Korpora/NIKL_WRITTEN`).
+If the root directory is not `~/Korpora`, pass the `root_dir=custom_path` argument to the `load` method.
+```
+
+You can also load the corpus as follows.
+The result is identical to that of the previous code.
+
+```python
+from Korpora import ModuSpokenKorpus
+corpus = ModuSpokenKorpus()
+```
+
+```warning
+The code assumes that the corpus has already been unzipped into `~/Korpora/NIKL_WRITTEN` under the current user's home directory.
+If the corpus is in another directory, pass the `root_dir_or_paths=custom_path` argument when instantiating `ModuSpokenKorpus`.
+```
+
+Either of the examples above loads the corpus into the variable `corpus`.
+`train` refers to the training set of the corpus; you can inspect its first example as follows.
+
+```
+>>> corpus.train[0]
+요즘처럼 추운 날씨에는 따뜻한 라테 한잔 찾는 분들 많으실 텐데요...
+```
diff --git a/en-docs/corpuslist/modu_web.md b/en-docs/corpuslist/modu_web.md
@@ -4,4 +4,49 @@ sort: 18
 
 # Modu: Web
 
-TBD
+Modu: Web is a dataset released by the National Institute of Korean Language.
+The data specification is as follows.
+
+- author: National Institute of Korean Language
+- repository: [https://corpus.korean.go.kr](https://corpus.korean.go.kr)
+- references: [document](https://rlkujwkk7.toastcdn.net/NIKL_WEB(v1.0).pdf)
+- size:
+  - train: 2,107,076 examples
+
+```warning
+Because of the Modu corpus license, `Korpora` does not provide a download function for this corpus; it only provides a load function.
+To use this corpus, complete the authentication process required by the National Institute of Korean Language and download the corpus manually.
+```
+
+You can load the corpus from your Python console as follows.
+
+```python
+from Korpora import Korpora
+corpus = Korpora.load("modu_web")
+```
+
+```warning
+The code assumes that the corpus has already been unzipped into the `NIKL_WRITTEN` directory under `~/Korpora` (`~/Korpora/NIKL_WRITTEN`).
+If the root directory is not `~/Korpora`, pass the `root_dir=custom_path` argument to the `load` method.
+```
+
+You can also load the corpus as follows.
+The result is identical to that of the previous code.
+
+```python
+from Korpora import ModuWebKorpus
+corpus = ModuWebKorpus()
+```
+
+```warning
+The code assumes that the corpus has already been unzipped into `~/Korpora/NIKL_WRITTEN` under the current user's home directory.
+If the corpus is in another directory, pass the `root_dir_or_paths=custom_path` argument when instantiating `ModuWebKorpus`.
+```
+
+Either of the examples above loads the corpus into the variable `corpus`.
+`train` refers to the training set of the corpus; you can inspect its first example as follows.
+
+```
+>>> corpus.train[0]
+오메가3와 비타민C, 달맞이꽃종자유 등을 사려고 몇 시간을 검색하며 공부했다 ...
+```
diff --git a/en-docs/corpuslist/modu_written.md b/en-docs/corpuslist/modu_written.md
@@ -4,4 +4,49 @@ sort: 19
 
 # Modu: Written
 
-TBD
+Modu: Written is a dataset released by the National Institute of Korean Language.
+The data specification is as follows.
+
+- author: National Institute of Korean Language
+- repository: [https://corpus.korean.go.kr](https://corpus.korean.go.kr)
+- references: [document](https://rlkujwkk7.toastcdn.net/NIKL_WRITTEN(v1.0).pdf)
+- size:
+  - train: 20,188 examples
+
+```warning
+Because of the Modu corpus license, `Korpora` does not provide a download function for this corpus; it only provides a load function.
+To use this corpus, complete the authentication process required by the National Institute of Korean Language and download the corpus manually.
+```
+
+You can load the corpus from your Python console as follows.
+
+```python
+from Korpora import Korpora
+corpus = Korpora.load("modu_written")
+```
+
+```warning
+The code assumes that the corpus has already been unzipped into the `NIKL_WRITTEN` directory under `~/Korpora` (`~/Korpora/NIKL_WRITTEN`).
+If the root directory is not `~/Korpora`, pass the `root_dir=custom_path` argument to the `load` method.
+```
+
+You can also load the corpus as follows.
+The result is identical to that of the previous code.
+
+```python
+from Korpora import ModuWrittenKorpus
+corpus = ModuWrittenKorpus()
+```
+
+```warning
+The code assumes that the corpus has already been unzipped into `~/Korpora/NIKL_WRITTEN` under the current user's home directory.
+If the corpus is in another directory, pass the `root_dir_or_paths=custom_path` argument when instantiating `ModuWrittenKorpus`.
+```
+
+Either of the examples above loads the corpus into the variable `corpus`.
+`train` refers to the training set of the corpus; you can inspect its first example as follows.
+
+```
+>>> corpus.train[0]
+01범보다 무서운 곶감
+```