Add license info to /tests and README_TH.md #886

Merged 2 commits on Dec 10, 2023
46 changes: 39 additions & 7 deletions README_TH.md
@@ -123,13 +123,11 @@ thainlp help

 ## การอ้างอิง

-ถ้าคุณใช้ `PyThaiNLP` ในโปรเจคหรืองานวิจัยของคุณ คุณสามารถอ้างอิงได้ตามนี้
+หากคุณใช้ซอฟต์แวร์ `PyThaiNLP` ในโครงงานหรืองานวิจัยของคุณ คุณสามารถอ้างอิงได้ตามนี้

-```
 Wannaphong Phatthiyaphaibun, Korakot Chaovavanich, Charin Polpanumas, Arthit Suriyawongkul, Lalita Lowphansirikul, & Pattarawat Chormai. (2016, Jun 27). PyThaiNLP: Thai Natural Language Processing in Python. Zenodo. http://doi.org/10.5281/zenodo.3519354
-```

-หรือ BibTeX entry:
+โดยสามารถใช้ BibTeX นี้:

 ``` bib
 @misc{pythainlp,
@@ -143,6 +141,40 @@ Wannaphong Phatthiyaphaibun, Korakot Chaovavanich, Charin Polpanumas, Arthit Sur
 }
 ```

+บทความของเราในงานประชุมวิชาการ [NLP-OSS 2023](https://nlposs.github.io/2023/):
+
+Wannaphong Phatthiyaphaibun, Korakot Chaovavanich, Charin Polpanumas, Arthit Suriyawongkul, Lalita Lowphansirikul, Pattarawat Chormai, Peerat Limkonchotiwat, Thanathip Suntorntip, and Can Udomcharoenchaikit. 2023. [PyThaiNLP: Thai Natural Language Processing in Python.](https://aclanthology.org/2023.nlposs-1.4) In Proceedings of the 3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS 2023), pages 25–36, Singapore, Singapore. Empirical Methods in Natural Language Processing.
+
+โดยสามารถใช้ BibTeX นี้:
+
+```bib
+@inproceedings{phatthiyaphaibun-etal-2023-pythainlp,
+    title = "{P}y{T}hai{NLP}: {T}hai Natural Language Processing in Python",
+    author = "Phatthiyaphaibun, Wannaphong and
+      Chaovavanich, Korakot and
+      Polpanumas, Charin and
+      Suriyawongkul, Arthit and
+      Lowphansirikul, Lalita and
+      Chormai, Pattarawat and
+      Limkonchotiwat, Peerat and
+      Suntorntip, Thanathip and
+      Udomcharoenchaikit, Can",
+    editor = "Tan, Liling and
+      Milajevs, Dmitrijs and
+      Chauhan, Geeticka and
+      Gwinnup, Jeremy and
+      Rippeth, Elijah",
+    booktitle = "Proceedings of the 3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS 2023)",
+    month = dec,
+    year = "2023",
+    address = "Singapore, Singapore",
+    publisher = "Empirical Methods in Natural Language Processing",
+    url = "https://aclanthology.org/2023.nlposs-1.4",
+    pages = "25--36",
+    abstract = "We present PyThaiNLP, a free and open-source natural language processing (NLP) library for Thai language implemented in Python. It provides a wide range of software, models, and datasets for Thai language. We first provide a brief historical context of tools for Thai language prior to the development of PyThaiNLP. We then outline the functionalities it provided as well as datasets and pre-trained language models. We later summarize its development milestones and discuss our experience during its development. We conclude by demonstrating how industrial and research communities utilize PyThaiNLP in their work. The library is freely available at https://github.com/pythainlp/pythainlp.",
+}
+```
+
 ## ร่วมสนับสนุน PyThaiNLP

 - กรุณา fork แล้วพัฒนาต่อ จากนั้นสร้าง pull request กลับมา :)
@@ -157,10 +189,10 @@ Wannaphong Phatthiyaphaibun, Korakot Chaovavanich, Charin Polpanumas, Arthit Sur

 | | สัญญาอนุญาต |
 |:---|:----|
-| PyThaiNLP Source Code and Notebooks | [Apache Software License 2.0](https://github.com/PyThaiNLP/pythainlp/blob/dev/LICENSE) |
-| Corpora, datasets, and documentations created by PyThaiNLP | [Creative Commons Zero 1.0 Universal Public Domain Dedication License (CC0)](https://creativecommons.org/publicdomain/zero/1.0/)|
+| ต้นรหัสซอร์สโค้ดและโน๊ตบุ๊กของ PyThaiNLP | [Apache Software License 2.0](https://github.com/PyThaiNLP/pythainlp/blob/dev/LICENSE) |
+| ฐานข้อมูลภาษา ชุดข้อมูล และเอกสารที่สร้างโดยโครงการ PyThaiNLP | [Creative Commons Zero 1.0 Universal Public Domain Dedication License (CC0)](https://creativecommons.org/publicdomain/zero/1.0/)|
 | Language models created by PyThaiNLP | [Creative Commons Attribution 4.0 International Public License (CC-by)](https://creativecommons.org/licenses/by/4.0/) |
-| Other corpora and models that may included with PyThaiNLP | See [Corpus License](https://github.com/PyThaiNLP/pythainlp/blob/dev/pythainlp/corpus/corpus_license.md) |
+| สำหรับฐานข้อมูลภาษาและโมเดลอื่นที่อาจมาพร้อมกับซอฟต์แวร์ PyThaiNLP | ดู [Corpus License](https://github.com/PyThaiNLP/pythainlp/blob/dev/pythainlp/corpus/corpus_license.md) |


 ## บัตรโมเดล
33 changes: 19 additions & 14 deletions pythainlp/tag/pos_tag.py
@@ -4,7 +4,6 @@
 from typing import List, Tuple


-
 def pos_tag(
     words: List[str], engine: str = "perceptron", corpus: str = "orchid"
 ) -> List[Tuple[str, str]]:
@@ -169,10 +168,10 @@ def pos_tag_sents(


 def pos_tag_transformers(
-    sentence: str, 
+    sentence: str,
     engine: str = "bert",
     corpus: str = "blackboard",
-)->List[List[Tuple[str, str]]]:
+) -> List[List[Tuple[str, str]]]:
     """
     Marks sentences with part-of-speech (POS) tags.

@@ -202,29 +201,33 @@ def pos_tag_transformers(
     """

     try:
-        from transformers import AutoModelForTokenClassification, \
-            AutoTokenizer, TokenClassificationPipeline
+        from transformers import (
+            AutoModelForTokenClassification,
+            AutoTokenizer,
+            TokenClassificationPipeline,
+        )
     except ImportError:
         raise ImportError(
-            "Not found transformers! Please install transformers by pip install transformers")
+            "Not found transformers! Please install transformers by pip install transformers"
+        )

     if not sentence:
         return []

     _blackboard_support_engine = {
-        "bert" : "lunarlist/pos_thai",
+        "bert": "lunarlist/pos_thai",
     }

     _pud_support_engine = {
-        "wangchanberta" : "Pavarissy/wangchanberta-ud-thai-pud-upos",
-        "mdeberta" : "Pavarissy/mdeberta-v3-ud-thai-pud-upos",
+        "wangchanberta": "Pavarissy/wangchanberta-ud-thai-pud-upos",
+        "mdeberta": "Pavarissy/mdeberta-v3-ud-thai-pud-upos",
     }

-    if corpus == 'blackboard' and engine in _blackboard_support_engine.keys():
+    if corpus == "blackboard" and engine in _blackboard_support_engine.keys():
         base_model = _blackboard_support_engine.get(engine)
         model = AutoModelForTokenClassification.from_pretrained(base_model)
         tokenizer = AutoTokenizer.from_pretrained(base_model)
-    elif corpus == 'pud' and engine in _pud_support_engine.keys():
+    elif corpus == "pud" and engine in _pud_support_engine.keys():
         base_model = _pud_support_engine.get(engine)
         model = AutoModelForTokenClassification.from_pretrained(base_model)
         tokenizer = AutoTokenizer.from_pretrained(base_model)
@@ -235,8 +238,10 @@ def pos_tag_transformers(
             )
         )

-    pipeline = TokenClassificationPipeline(model=model, tokenizer=tokenizer, aggregation_strategy="simple")
+    pipeline = TokenClassificationPipeline(
+        model=model, tokenizer=tokenizer, aggregation_strategy="simple"
+    )

     outputs = pipeline(sentence)
-    word_tags = [[(tag['word'], tag['entity_group']) for tag in outputs]]
-    return word_tags
+    word_tags = [[(tag["word"], tag["entity_group"]) for tag in outputs]]
+    return word_tags
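
The engine/corpus dispatch in `pos_tag_transformers` amounts to a two-level table lookup. The sketch below isolates that lookup so it can be read without loading any model. `resolve_pos_model` is a hypothetical helper, not part of PyThaiNLP, and raising `ValueError` for an unsupported pair is an assumption, since the corresponding lines are collapsed in this diff. The checkpoint names are copied from the hunk above.

```python
from typing import Dict

# Engine -> checkpoint tables, copied from the diff above.
_BLACKBOARD_ENGINES: Dict[str, str] = {
    "bert": "lunarlist/pos_thai",
}
_PUD_ENGINES: Dict[str, str] = {
    "wangchanberta": "Pavarissy/wangchanberta-ud-thai-pud-upos",
    "mdeberta": "Pavarissy/mdeberta-v3-ud-thai-pud-upos",
}
_SUPPORTED: Dict[str, Dict[str, str]] = {
    "blackboard": _BLACKBOARD_ENGINES,
    "pud": _PUD_ENGINES,
}


def resolve_pos_model(engine: str = "bert", corpus: str = "blackboard") -> str:
    """Hypothetical helper: map an (engine, corpus) pair to a checkpoint name."""
    engines = _SUPPORTED.get(corpus, {})
    if engine not in engines:
        # Assumption: the exception type used by the real function is not
        # visible in this diff; ValueError is used here for illustration.
        raise ValueError(
            f"engine {engine!r} is not supported for corpus {corpus!r}"
        )
    return engines[engine]


if __name__ == "__main__":
    print(resolve_pos_model())                   # default engine and corpus
    print(resolve_pos_model("mdeberta", "pud"))  # PUD-trained checkpoint
```

Keeping the tables in one mapping means a new corpus only needs a new dictionary entry rather than another `elif` branch; whether that trade-off suits the library is a design decision for the maintainers.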
2 changes: 2 additions & 0 deletions tests/__init__.py
@@ -1,4 +1,6 @@
 # -*- coding: utf-8 -*-
+# SPDX-FileCopyrightText: Copyright 2016-2023 PyThaiNLP Project
+# SPDX-License-Identifier: Apache-2.0
 """
 Unit test.

28 changes: 15 additions & 13 deletions tests/test_ancient.py
@@ -1,20 +1,22 @@
 # -*- coding: utf-8 -*-
+# SPDX-FileCopyrightText: Copyright 2016-2023 PyThaiNLP Project
+# SPDX-License-Identifier: Apache-2.0
 import unittest
 from pythainlp.ancient import aksonhan_to_current


 class TestAncientPackage(unittest.TestCase):
     def test_aksonhan_to_current(self):
-        self.assertEqual(aksonhan_to_current("ก"), 'ก')
-        self.assertEqual(aksonhan_to_current("กก"), 'กก')
-        self.assertEqual(aksonhan_to_current("ถนน"), 'ถนน')
-        self.assertEqual(aksonhan_to_current("จกก"), 'จัก')
-        self.assertEqual(aksonhan_to_current("ดง่ง"), 'ดั่ง')
-        self.assertEqual(aksonhan_to_current("นน้น"), 'นั้น')
-        self.assertEqual(aksonhan_to_current("ขดด"), 'ขัด')
-        self.assertEqual(aksonhan_to_current("ตรสส"), 'ตรัส')
-        self.assertEqual(aksonhan_to_current("ขบบ"), 'ขับ')
-        self.assertEqual(aksonhan_to_current("วนน"), 'วัน')
-        self.assertEqual(aksonhan_to_current("หลงง"), 'หลัง')
-        self.assertEqual(aksonhan_to_current("บงงคบบ"), 'บังคับ')
-        self.assertEqual(aksonhan_to_current("สรรเพชญ"), 'สรรเพชญ')
+        self.assertEqual(aksonhan_to_current("ก"), "ก")
+        self.assertEqual(aksonhan_to_current("กก"), "กก")
+        self.assertEqual(aksonhan_to_current("ถนน"), "ถนน")
+        self.assertEqual(aksonhan_to_current("จกก"), "จัก")
+        self.assertEqual(aksonhan_to_current("ดง่ง"), "ดั่ง")
+        self.assertEqual(aksonhan_to_current("นน้น"), "นั้น")
+        self.assertEqual(aksonhan_to_current("ขดด"), "ขัด")
+        self.assertEqual(aksonhan_to_current("ตรสส"), "ตรัส")
+        self.assertEqual(aksonhan_to_current("ขบบ"), "ขับ")
+        self.assertEqual(aksonhan_to_current("วนน"), "วัน")
+        self.assertEqual(aksonhan_to_current("หลงง"), "หลัง")
+        self.assertEqual(aksonhan_to_current("บงงคบบ"), "บังคับ")
+        self.assertEqual(aksonhan_to_current("สรรเพชญ"), "สรรเพชญ")
2 changes: 2 additions & 0 deletions tests/test_augment.py
@@ -1,4 +1,6 @@
 # -*- coding: utf-8 -*-
+# SPDX-FileCopyrightText: Copyright 2016-2023 PyThaiNLP Project
+# SPDX-License-Identifier: Apache-2.0

 import unittest
 import nltk
8 changes: 6 additions & 2 deletions tests/test_benchmarks.py
@@ -1,3 +1,7 @@
+# -*- coding: utf-8 -*-
+# SPDX-FileCopyrightText: Copyright 2016-2023 PyThaiNLP Project
+# SPDX-License-Identifier: Apache-2.0
+
 import unittest

 import numpy as np
@@ -63,8 +67,8 @@ def test_count_correctly_tokenised_words(self):
         rb = list(word_tokenization._find_word_boundaries(ref_sample))

         # in binary [{0, 1}, ...]
-        correctly_tokenized_words = word_tokenization._find_words_correctly_tokenised(
-            rb, sb
+        correctly_tokenized_words = (
+            word_tokenization._find_words_correctly_tokenised(rb, sb)
         )

         self.assertEqual(
3 changes: 3 additions & 0 deletions tests/test_classify.py
@@ -1,4 +1,7 @@
 # -*- coding: utf-8 -*-
+# SPDX-FileCopyrightText: Copyright 2016-2023 PyThaiNLP Project
+# SPDX-License-Identifier: Apache-2.0
+
 import unittest
 from pythainlp.classify import GzipModel

8 changes: 4 additions & 4 deletions tests/test_cli.py
@@ -1,4 +1,6 @@
 # -*- coding: utf-8 -*-
+# SPDX-FileCopyrightText: Copyright 2016-2023 PyThaiNLP Project
+# SPDX-License-Identifier: Apache-2.0

 import unittest
 from argparse import ArgumentError
@@ -41,7 +43,7 @@ def test_cli_benchmark(self):
                     "./tests/data/input.txt",
                     "--test-file",
                     "./tests/data/test.txt",
-                    "--save-details"
+                    "--save-details",
                 ]
             )
         )
@@ -117,9 +119,7 @@ def test_cli_tokenize(self):
         self.assertEqual(ex.exception.code, 2)

         self.assertIsNotNone(
-            cli.tokenize.App(
-                ["thainlp", "tokenize", "NOT_EXIST", "ไม่มีอยู่ จริง"]
-            )
+            cli.tokenize.App(["thainlp", "tokenize", "NOT_EXIST", "ไม่มีอยู่ จริง"])
         )
         self.assertIsNotNone(
             cli.tokenize.App(
2 changes: 2 additions & 0 deletions tests/test_coref.py
@@ -1,4 +1,6 @@
 # -*- coding: utf-8 -*-
+# SPDX-FileCopyrightText: Copyright 2016-2023 PyThaiNLP Project
+# SPDX-License-Identifier: Apache-2.0

 import unittest
 from pythainlp.coref import coreference_resolution
29 changes: 17 additions & 12 deletions tests/test_corpus.py
@@ -1,4 +1,6 @@
 # -*- coding: utf-8 -*-
+# SPDX-FileCopyrightText: Copyright 2016-2023 PyThaiNLP Project
+# SPDX-License-Identifier: Apache-2.0

 import os
 import unittest
@@ -23,14 +25,15 @@
     thai_icu_words,
     thai_male_names,
     thai_negations,
+    thai_orst_words,
     thai_stopwords,
     thai_syllables,
-    thai_synonym,
+    thai_synonyms,
+    thai_volubilis_words,
+    thai_wikipedia_titles,
     thai_words,
     tnc,
     ttc,
-    volubilis,
-    wikipedia_titles,
     wordnet,
 )
 from pythainlp.corpus.util import revise_newmm_default_wordset
@@ -41,24 +44,26 @@ def test_conceptnet(self):
         self.assertIsNotNone(conceptnet.edges("รัก"))

     def test_corpus(self):
-        self.assertIsInstance(thai_icu_words(), frozenset)
-        self.assertGreater(len(thai_icu_words()), 0)
         self.assertIsInstance(thai_negations(), frozenset)
         self.assertGreater(len(thai_negations()), 0)
         self.assertIsInstance(thai_stopwords(), frozenset)
         self.assertGreater(len(thai_stopwords()), 0)
         self.assertIsInstance(thai_syllables(), frozenset)
         self.assertGreater(len(thai_syllables()), 0)
-        self.assertIsInstance(thai_synonym(), dict)
-        self.assertGreater(len(thai_synonym()), 0)
+        self.assertIsInstance(thai_synonyms(), dict)
+        self.assertGreater(len(thai_synonyms()), 0)
+
+        self.assertIsInstance(thai_icu_words(), frozenset)
+        self.assertGreater(len(thai_icu_words()), 0)
+        self.assertIsInstance(thai_orst_words(), frozenset)
+        self.assertGreater(len(thai_orst_words()), 0)
+        self.assertIsInstance(thai_volubilis_words(), frozenset)
+        self.assertGreater(len(thai_volubilis_words()), 0)
+        self.assertIsInstance(thai_wikipedia_titles(), frozenset)
+        self.assertGreater(len(thai_wikipedia_titles()), 0)
         self.assertIsInstance(thai_words(), frozenset)
         self.assertGreater(len(thai_words()), 0)

-        self.assertIsInstance(volubilis(), frozenset)
-        self.assertGreater(len(volubilis()), 0)
-        self.assertIsInstance(wikipedia_titles(), frozenset)
-        self.assertGreater(len(wikipedia_titles()), 0)
-
         self.assertIsInstance(countries(), frozenset)
         self.assertGreater(len(countries()), 0)
         self.assertIsInstance(provinces(), frozenset)
3 changes: 3 additions & 0 deletions tests/test_el.py
@@ -1,4 +1,7 @@
 # -*- coding: utf-8 -*-
+# SPDX-FileCopyrightText: Copyright 2016-2023 PyThaiNLP Project
+# SPDX-License-Identifier: Apache-2.0
+
 import unittest
 from pythainlp.el import EntityLinker

2 changes: 2 additions & 0 deletions tests/test_generate.py
@@ -1,4 +1,6 @@
 # -*- coding: utf-8 -*-
+# SPDX-FileCopyrightText: Copyright 2016-2023 PyThaiNLP Project
+# SPDX-License-Identifier: Apache-2.0

 import unittest

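
Most test files in this PR gain the same two SPDX comment lines. The sketch below shows one way such headers could be checked across a directory. `has_spdx_header` and `files_missing_header` are hypothetical helpers that are not part of this PR, and the assumption that the header appears within the first five lines is mine; the two header strings are copied from the diff.

```python
from pathlib import Path
from typing import List

# Header lines copied verbatim from the diff above.
SPDX_LINES = (
    "# SPDX-FileCopyrightText: Copyright 2016-2023 PyThaiNLP Project",
    "# SPDX-License-Identifier: Apache-2.0",
)


def has_spdx_header(source: str, max_lines: int = 5) -> bool:
    """Hypothetical check: both SPDX lines occur in the first `max_lines` lines."""
    head = [line.strip() for line in source.splitlines()[:max_lines]]
    return all(tag in head for tag in SPDX_LINES)


def files_missing_header(root: str) -> List[Path]:
    """Hypothetical helper: list *.py files under `root` lacking the header."""
    return sorted(
        path
        for path in Path(root).rglob("*.py")
        if not has_spdx_header(path.read_text(encoding="utf-8"))
    )


if __name__ == "__main__":
    # Assumes the script is run from the repository root.
    for path in files_missing_header("tests"):
        print(path)
```

Run from the repository root, this would print any test module that still lacks the header, which could serve as a quick follow-up check after a change like this one.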