Merge pull request #869 from konbraphat51/corpus_wiki
Add Thai wikipedia titles corpus.
Thanks @konbraphat51
bact authored Dec 1, 2023
2 parents de098f3 + 0b52e14 commit f877567
Showing 5 changed files with 290,170 additions and 37 deletions.
12 changes: 7 additions & 5 deletions pythainlp/corpus/__init__.py
@@ -46,6 +46,7 @@
    "thai_words",
    "thai_wsd_dict",
    "volubilis",
    "wikipedia_titles",
]

import os
@@ -102,22 +103,23 @@ def corpus_db_path() -> str:
    get_corpus_default_db,
    get_corpus_path,
    get_path_folder_corpus,
    remove,
    path_pythainlp_corpus,
    remove,
) # these imports must come before other pythainlp.corpus.* imports
from pythainlp.corpus.common import (
    countries,
    provinces,
    thai_dict,
    thai_family_names,
    thai_female_names,
    thai_male_names,
    thai_negations,
    thai_synonym,
    thai_stopwords,
    thai_syllables,
    thai_words,
    thai_synonym,
    thai_orst_words,
    thai_dict,
    thai_wsd_dict
    thai_words,
    thai_wsd_dict,
)
from pythainlp.corpus.volubilis import volubilis
from pythainlp.corpus.wikipedia_titles import wikipedia_titles
47 changes: 29 additions & 18 deletions pythainlp/corpus/corpus_license.md
@@ -4,6 +4,7 @@
- Language models created by PyThaiNLP project are released under [Creative Commons Attribution 4.0 International Public License](https://creativecommons.org/licenses/by/4.0/) (CC-by).
- For more information about corpora that PyThaiNLP use, see [https://github.com/PyThaiNLP/pythainlp-corpus/](https://github.com/PyThaiNLP/pythainlp-corpus/).


## Dictionaries and Word Lists

The following word lists are created by the PyThaiNLP project and released under
@@ -35,6 +36,7 @@ https://creativecommons.org/licenses/by-sa/4.0/
| person_names_female_th.txt | List of female names in Thailand |
| person_names_male_th.txt | List of male names in Thailand |


## Models

The following language models are created by the PyThaiNLP project
@@ -50,6 +52,7 @@ https://creativecommons.org/licenses/by/4.0/
| pos_ud_unigram.json | Part-of-speech tagging model, trained from Parallel Universal Dependencies treebank, using unigram |
| sentenceseg_crfcut.model | Sentence segmentation model, trained from TED subtitles, using CRF |


## Thai WordNet

Thai WordNet (wordnet_th.db) is created by Thai Computational Linguistic
@@ -100,25 +103,33 @@ in Proceedings of the 7th Workshop on Asian Language Resources,
Suntec, Singapore, Aug. 2009, pp. 139–144.
https://www.aclweb.org/anthology/W09-3420.pdf


## Thai Wikipedia Titles

Thai Wikipedia titles corpus (wikipedia_titles.txt),
prepared by konbraphat51, using a Thai Wikipedia dump from
21 November 2023, and released under their original license which is
**Creative Commons Attribution-ShareAlike 4.0 International Public License**
https://creativecommons.org/licenses/by-sa/4.0/

Original data:
https://dumps.wikimedia.org/thwiki/latest/thwiki-latest-all-titles.gz

Preparation code:
https://github.com/konbraphat51/Thai_Dictionary_Cleaner/


## Volubilis

Corpus of Thai words registered in Volubilis (volubilis.txt) which was processed by konbraphat51 (https://github.com/konbraphat51/Thai_Dictionary_Cleaner/tree/main)
A corpus of Thai words registered in Volubilis dictionary
(volubilis.txt), prepared by konbraphat51,
using data from Volubilis 23.1 (Mar. 2023) by Francis Bastien,
and released under their original license which is
**Creative Commons Attribution-ShareAlike 4.0 International Public License**
https://creativecommons.org/licenses/by-sa/4.0/

The original data is VOLUBILIS 23.1 (Mar. 2023) Database from [Volubilis](https://belisan-volubilis.blogspot.com/) which Francis Bastien has created.
Original data:
https://belisan-volubilis.blogspot.com/

```
VOLUBILIS MULTILINGUAL THAI DICT. & DATABASE by Francis Bastien (Belisan) is licensed under CC BY-SA 4.0
This is a human-readable summary of (and not a substitute for) the license below.
You are free:
to Share—copy and redistribute the material in any medium or format
to Adapt—remix, transform, and build upon the material
for any purpose, even commercially.
The licensor cannot revoke these freedoms as long as you follow the license terms.
Under the following terms:
Attribution—You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
Share Alike—If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.
No additional restrictions—You may not apply legal terms or technological measures that legally restrict others from doing anything the license permits.
Notices:
You do not have to comply with the license for elements of the material in the public domain or where your use is permitted by an applicable exception or limitation. No warranties are given. The license may not give you all of the permissions necessary for your intended use. For example, other rights such as publicity, privacy, or moral rights may limit how you use the material.
```
Preparation code:
https://github.com/konbraphat51/Thai_Dictionary_Cleaner/
43 changes: 43 additions & 0 deletions pythainlp/corpus/wikipedia_titles.py
@@ -0,0 +1,43 @@
# -*- coding: utf-8 -*-
# Copyright (C) 2016-2023 PyThaiNLP Project
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Provides an optional word list from Thai Wikipedia titles.
"""
from typing import FrozenSet

from pythainlp.corpus.common import get_corpus

_WIKIPEDIA_TITLES = None
_WIKIPEDIA_TITLES_FILENAME = "wikipedia_titles.txt"


def wikipedia_titles() -> FrozenSet[str]:
    """
    Return a frozenset of words from Thai Wikipedia titles corpus.
    They are mostly nouns and noun phrases,
    including event, organization, people, place, and product names.
    Commonly misspelled words are included intentionally.
    More info:
    https://github.com/PyThaiNLP/pythainlp/blob/dev/pythainlp/corpus/corpus_license.md
    :return: :class:`frozenset` containing words in Thai Wikipedia titles.
    :rtype: :class:`frozenset`
    """
    global _WIKIPEDIA_TITLES
    if not _WIKIPEDIA_TITLES:
        _WIKIPEDIA_TITLES = get_corpus(_WIKIPEDIA_TITLES_FILENAME)

    return _WIKIPEDIA_TITLES