Commit
commit 82cb8c8, 0 parents
Showing 20 changed files with 417 additions and 0 deletions.
Binary file not shown.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,29 @@ | ||
name: CLDF-validation | ||
|
||
on: | ||
  push: | ||
    branches: [ main ] | ||
  pull_request: | ||
    branches: [ main ] | ||
|
||
jobs: | ||
  build: | ||
|
||
    runs-on: ubuntu-latest | ||
    strategy: | ||
      matrix: | ||
        python-version: [3.6] | ||
|
||
    steps: | ||
    - uses: actions/checkout@v2 | ||
    - name: Set up Python ${{ matrix.python-version }} | ||
      uses: actions/setup-python@v2 | ||
      with: | ||
        python-version: ${{ matrix.python-version }} | ||
    - name: Install dependencies | ||
      run: | | ||
        python -m pip install --upgrade pip | ||
        pip install pytest-cldf | ||
    - name: Test with pytest | ||
      run: | | ||
        pytest --cldf-metadata=cldf/cldf-metadata.json test.py |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,102 @@ | ||
# Byte-compiled / optimized / DLL files | ||
__pycache__/ | ||
*.py[cod] | ||
*$py.class | ||
|
||
# C extensions | ||
*.so | ||
|
||
# Distribution / packaging | ||
.Python | ||
env/ | ||
build/ | ||
develop-eggs/ | ||
dist/ | ||
downloads/ | ||
eggs/ | ||
.eggs/ | ||
lib/ | ||
lib64/ | ||
parts/ | ||
sdist/ | ||
var/ | ||
wheels/ | ||
*.egg-info/ | ||
.installed.cfg | ||
*.egg | ||
|
||
# PyInstaller | ||
# Usually these files are written by a python script from a template | ||
# before PyInstaller builds the exe, so as to inject date/other infos into it. | ||
*.manifest | ||
*.spec | ||
|
||
# Installer logs | ||
pip-log.txt | ||
pip-delete-this-directory.txt | ||
|
||
# Unit test / coverage reports | ||
htmlcov/ | ||
.tox/ | ||
.coverage | ||
.coverage.* | ||
.cache | ||
nosetests.xml | ||
coverage.xml | ||
*.cover | ||
.hypothesis/ | ||
|
||
# Translations | ||
*.mo | ||
*.pot | ||
|
||
# Django stuff: | ||
*.log | ||
local_settings.py | ||
|
||
# Flask stuff: | ||
instance/ | ||
.webassets-cache | ||
|
||
# Scrapy stuff: | ||
.scrapy | ||
|
||
# Sphinx documentation | ||
docs/_build/ | ||
|
||
# PyBuilder | ||
target/ | ||
|
||
# Jupyter Notebook | ||
.ipynb_checkpoints | ||
|
||
# pyenv | ||
.python-version | ||
|
||
# celery beat schedule file | ||
celerybeat-schedule | ||
|
||
# SageMath parsed files | ||
*.sage.py | ||
|
||
# dotenv | ||
.env | ||
|
||
# virtualenv | ||
.venv | ||
venv/ | ||
ENV/ | ||
|
||
# Spyder project settings | ||
.spyderproject | ||
.spyproject | ||
|
||
# Rope project settings | ||
.ropeproject | ||
|
||
# mkdocs documentation | ||
/site | ||
|
||
# mypy | ||
.mypy_cache/ | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
*.csv text eol=crlf |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,3 @@ | ||
# CLDF directory | ||
|
||
This directory contains the dataset formatted as a [CLDF dataset](https://cldf.clld.org). |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,26 @@ | ||
import pathlib | ||
|
||
from cldfbench import Dataset as BaseDataset | ||
|
||
|
||
class Dataset(BaseDataset): | ||
    dir = pathlib.Path(__file__).parent | ||
    id = "vrosenberg1853" | ||
|
||
    def cldf_specs(self):  # A dataset must declare all CLDF sets it creates. | ||
        return super().cldf_specs() | ||
|
||
    def cmd_download(self, args): | ||
        """ | ||
        Download files to the raw/ directory. You can use helper methods of `self.raw_dir`, e.g. | ||
        >>> self.raw_dir.download(url, fname) | ||
        """ | ||
        pass | ||
|
||
    def cmd_makecldf(self, args): | ||
        """ | ||
        Convert the raw data to a CLDF dataset. | ||
        >>> args.writer.objects['LanguageTable'].append(...) | ||
        """ |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,16 @@ | ||
source /Users/Primahadi/Documents/cldf_project/myenv/bin/activate | ||
|
||
cldfbench new | ||
# id: vrosenberg1853 | ||
# title: CLDF dataset derived from von Rosenberg's "De Mentawei-Eilanden en Hunne Bewoners" from 1853 for comparative numeral data | ||
# license: https://creativecommons.org/licenses/by-nc-sa/4.0/ | ||
# url: | ||
# citation: Rosenberg, Carl Benjamin Hermann von. 1853. De Mentawei-Eilanden en Hunne Bewoners. Tijdschrift voor Indische Taal-, Land- en Volkenkunde 1. 403–440. | ||
|
||
cd vrosenberg1853 | ||
|
||
# pre-process the raw data in `01-raw-data-pre-processing.R` | ||
|
||
# run pyconcepticon | ||
concepticon --repos "/Users/Primahadi/Documents/cldf_project/concepticon/concepticon-data" \ | ||
map_concepts "etc/gloss-to-map.tsv" --language en --output "etc/concepts.tsv" |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,53 @@ | ||
library(tidyverse) | ||
library(deeplr) | ||
|
||
df <- read_tsv("https://raw.githubusercontent.com/complexico/mentawai-word-list-1853/main/data/vrosenberg1853p434.tsv") | ||
|
||
df <- df |> | ||
  mutate(ID = row_number()) |> | ||
  select(ID, everything()) |> | ||
  pivot_longer(cols = -c(Dutch, ID), names_to = "Doculect", values_to = "Forms") | ||
|
||
# prepare the translation data (run once, then save the result) | ||
# source("etc/00-deeplr.R") | ||
# dutch <- df |> | ||
# select(ID, Dutch) |> | ||
# distinct() | ||
|
||
# dutch_w_english <- dutch |> | ||
# mutate(English = translate2(text = Dutch, target_lang = "EN", source_lang = "NL", preserve_formatting = TRUE, auth_key = mydeepl)) | ||
|
||
# dutch_w_english <- dutch_w_english |> | ||
# mutate(English = replace(English, English == "a", "one")) | ||
# write_tsv(dutch_w_english, file = "etc/dutch_w_english.tsv") | ||
|
||
dutch_w_english <- read_tsv("etc/dutch_w_english.tsv") | ||
|
||
doculect <- tribble( | ||
  ~Doculect, ~Doculect_Eng, ~Glottocode, ~Glottolog_name, | ||
  "Maleisch", "Central Malay", "mala1479", "Central Malay", | ||
  "Lampongsch", "Lampungic", "lamp1241", "Lampungic", | ||
  "Redjangsch", "Rejang", "reja1240", "Rejang", | ||
  "Battasch", "Batakic", "toba1265", "Batakic", | ||
  "Atjinesch", "Acehnese", "achi1257", "Acehnese", | ||
  "Niasch", "Nias", "nias1242", "Nias", | ||
  "Mentaweijsch", "Mentawai", "ment1249", "Mentawai", | ||
  "Engganoosch", "Enggano", "engg1245", "Enggano" | ||
) | ||
# I chose Glottolog's Central Malay because I am not sure which specific Malay variety von Rosenberg refers to. Glottolog describes Central Malay as the primary source of the Malay varieties spoken throughout Southeast Asia: "Malay (zlm-zlm) = 3 (Wider communication). Originated in Sumatra; spoken throughout southeast Asia. With the advent of Islam, Malay became widespread in the 15th and 16th centuries. Lingua franca for Malaysia’s multiethnic population. Used in trade, literature, and story telling." | ||
|
||
df <- df |> | ||
  left_join(dutch_w_english, by = c("ID", "Dutch")) |> | ||
  select(ID, Forms, Dutch, English, Doculect) |> | ||
  left_join(doculect, by = "Doculect") |> | ||
  rename(Doculect_orig = Doculect, | ||
         Doculect = Doculect_Eng) | ||
|
||
df |> write_tsv("raw/raw-dat.tsv") | ||
|
||
# Prepare the concept for Concepticon mapping | ||
# df |> | ||
# select(ENGLISH = English, ID) |> | ||
# distinct() |> | ||
# write_tsv("etc/gloss-to-map.tsv") | ||
|
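For readers who do not use R, the wide-to-long reshaping done by `pivot_longer` in the script above can be sketched in Python. This is a minimal illustration under stated assumptions, not the author's code: the two-row `WIDE_TSV` excerpt and the Python function name `pivot_longer` are hypothetical, and only the column names (`ID`, `Dutch`, `Doculect`, `Forms`) are taken from the script.

```python
import csv
import io

# Hypothetical excerpt in the assumed wide layout: one "Dutch" gloss column
# followed by one column per doculect.
WIDE_TSV = (
    "Dutch\tMaleisch\tNiasch\n"
    "een\tsatoe\tsara\n"
    "twee\tdoea\tdoea\n"
)


def pivot_longer(tsv_text):
    """Return one record per (gloss, doculect) cell of the wide table."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    long_rows = []
    # The row number plays the role of mutate(ID = row_number()) in the R script.
    for row_number, row in enumerate(reader, start=1):
        for column, value in row.items():
            if column == "Dutch":
                continue  # the gloss column is kept as an identifier, not pivoted
            long_rows.append(
                {
                    "ID": row_number,
                    "Dutch": row["Dutch"],
                    "Doculect": column,
                    "Forms": value,
                }
            )
    return long_rows


rows = pivot_longer(WIDE_TSV)
for record in rows:
    print(record)
```

Each wide row yields as many long records as there are doculect columns, which is why the ten numerals and eight doculects end up as 80 data rows in `raw/raw-dat.tsv`.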
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,4 @@ | ||
# Configuration directory | ||
|
||
This directory contains "configuration" data, i.e. data which helps with and | ||
guides the conversion of the raw data to CLDF. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,15 @@ | ||
ENGLISH ID CONCEPTICON_ID CONCEPTICON_GLOSS SIMILARITY | ||
one 1 1493 ONE 2 | ||
#<<< | ||
two 2 1498 TWO 2 | ||
two 2 1384 SECOND 2 | ||
#>>> | ||
three 3 492 THREE 2 | ||
four 4 1500 FOUR 2 | ||
five 5 493 FIVE 2 | ||
six 6 1703 SIX 2 | ||
seven 7 1704 SEVEN 2 | ||
eight 8 1705 EIGHT 2 | ||
nine 9 1483 NINE 2 | ||
ten 10 1515 TEN 2 | ||
# 10/10 100% |
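The mapping output above brackets glosses that have more than one candidate between `#<<<` and `#>>>` lines; here "two" matches both TWO and SECOND and needs a manual decision. A minimal Python sketch for listing such cases follows. It assumes the file is tab-separated (the separators are not visible in this view), uses a shortened excerpt, and the function names are hypothetical.

```python
import csv

# Shortened excerpt of the mapping shown above, assuming tab-separated columns.
MAPPING_LINES = [
    "ENGLISH\tID\tCONCEPTICON_ID\tCONCEPTICON_GLOSS\tSIMILARITY",
    "one\t1\t1493\tONE\t2",
    "#<<<",
    "two\t2\t1498\tTWO\t2",
    "two\t2\t1384\tSECOND\t2",
    "#>>>",
    "three\t3\t492\tTHREE\t2",
]


def read_mapping(lines):
    """Group candidate rows by gloss ID, skipping '#' marker and summary lines."""
    data_lines = [line for line in lines if not line.startswith("#")]
    candidates = {}
    for row in csv.DictReader(data_lines, delimiter="\t"):
        candidates.setdefault(row["ID"], []).append(row)
    return candidates


def ambiguous(candidates):
    """Return {gloss ID: [candidate glosses]} for IDs with several candidates."""
    return {
        gloss_id: [row["CONCEPTICON_GLOSS"] for row in rows]
        for gloss_id, rows in candidates.items()
        if len(rows) > 1
    }


candidates = read_mapping(MAPPING_LINES)
print(ambiguous(candidates))
```

For a numeral list, the cardinal reading (TWO) is presumably the intended one, but the sketch only reports the ambiguity and leaves the choice to the editor.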
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,11 @@ | ||
ID Dutch English | ||
1 een one | ||
2 twee two | ||
3 drie three | ||
4 vier four | ||
5 vijf five | ||
6 zes six | ||
7 zeven seven | ||
8 acht eight | ||
9 negen nine | ||
10 tien ten |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,11 @@ | ||
ENGLISH ID | ||
one 1 | ||
two 2 | ||
three 3 | ||
four 4 | ||
five 5 | ||
six 6 | ||
seven 7 | ||
eight 8 | ||
nine 9 | ||
ten 10 |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,9 @@ | ||
ID	Doculect_Dutch	Name	Glottocode	Glottolog_Name	Sources | ||
1	Maleisch	Central Malay	mala1479	Central Malay	VonRosenberg1853 | ||
2	Lampongsch	Lampungic	lamp1241	Lampungic	VonRosenberg1853 | ||
3	Redjangsch	Rejang	reja1240	Rejang	VonRosenberg1853 | ||
4	Battasch	Batakic	toba1265	Batakic	VonRosenberg1853 | ||
5	Atjinesch	Acehnese	achi1257	Acehnese	VonRosenberg1853 | ||
6	Niasch	Nias	nias1242	Nias	VonRosenberg1853 | ||
7	Mentaweijsch	Mentawai	ment1249	Mentawai	VonRosenberg1853 | ||
8	Engganoosch	Enggano	engg1245	Enggano	VonRosenberg1853 |
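A cheap sanity check on the language table above is to confirm that every Glottocode has the expected shape: four lowercase alphanumeric characters followed by four digits. The Python sketch below checks only that shape. It does not confirm that a code exists in Glottolog or that it names the intended language, and the function name `malformed` is hypothetical.

```python
import re

# Shape of a Glottocode: four lowercase alphanumerics, then four digits.
GLOTTOCODE = re.compile(r"[a-z0-9]{4}[0-9]{4}")

# Doculect names and codes copied from the table above.
LANGUAGES = {
    "Maleisch": "mala1479",
    "Lampongsch": "lamp1241",
    "Redjangsch": "reja1240",
    "Battasch": "toba1265",
    "Atjinesch": "achi1257",
    "Niasch": "nias1242",
    "Mentaweijsch": "ment1249",
    "Engganoosch": "engg1245",
}


def malformed(codes):
    """Return the names whose code does not have the Glottocode shape."""
    return [name for name, code in codes.items() if not GLOTTOCODE.fullmatch(code)]


print(malformed(LANGUAGES))
```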
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,8 @@ | ||
{ | ||
"id": "vrosenberg1853", | ||
"title": "CLDF dataset derived from von Rosenberg's \"De Mentawei-Eilanden en Hunne Bewoners\" from 1853 for comparative numeral data", | ||
"description": null, | ||
"license": "https://creativecommons.org/licenses/by-nc-sa/4.0/", | ||
"url": null, | ||
"citation": "Rosenberg, Carl Benjamin Hermann von. 1853. De Mentawei-Eilanden en Hunne Bewoners. Tijdschrift voor Indische Taal-, Land- en Volkenkunde 1. 403\u2013440." | ||
} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,4 @@ | ||
# Source directory | ||
|
||
This directory contains the "raw" source data of the dataset from which the | ||
CLDF dataset in [`cldf/`](../cldf) is derived. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,81 @@ | ||
ID Forms Dutch English Doculect_orig Doculect Glottocode Glottolog_name | ||
1 satoe een one Maleisch Central Malay mala1479 Central Malay | ||
1 sije een one Lampongsch Lampungic lamp1241 Lampungic | ||
1 do een one Redjangsch Rejang reja1240 Rejang | ||
1 sada een one Battasch Batakic toba1265 Batakic | ||
1 sa een one Atjinesch Acehnese achi1257 Acehnese | ||
1 sara een one Niasch Nias nias1242 Nias | ||
1 sara een one Mentaweijsch Mentawai ment1249 Mentawai | ||
1 daheij een one Engganoosch Enggano engg1245 Enggano | ||
2 doea twee two Maleisch Central Malay mala1479 Central Malay | ||
2 rowa twee two Lampongsch Lampungic lamp1241 Lampungic | ||
2 dooij twee two Redjangsch Rejang reja1240 Rejang | ||
2 doeo twee two Battasch Batakic toba1265 Batakic | ||
2 doea twee two Atjinesch Acehnese achi1257 Acehnese | ||
2 doea twee two Niasch Nias nias1242 Nias | ||
2 doea twee two Mentaweijsch Mentawai ment1249 Mentawai | ||
2 adoea twee two Engganoosch Enggano engg1245 Enggano | ||
3 tiga drie three Maleisch Central Malay mala1479 Central Malay | ||
3 pullo drie three Lampongsch Lampungic lamp1241 Lampungic | ||
3 tellau drie three Redjangsch Rejang reja1240 Rejang | ||
3 tolo drie three Battasch Batakic toba1265 Batakic | ||
3 tlo drie three Atjinesch Acehnese achi1257 Acehnese | ||
3 feloe drie three Niasch Nias nias1242 Nias | ||
3 teloe drie three Mentaweijsch Mentawai ment1249 Mentawai | ||
3 agoloe drie three Engganoosch Enggano engg1245 Enggano | ||
4 ampat vier four Maleisch Central Malay mala1479 Central Malay | ||
4 ampa vier four Lampongsch Lampungic lamp1241 Lampungic | ||
4 mpat vier four Redjangsch Rejang reja1240 Rejang | ||
4 opat vier four Battasch Batakic toba1265 Batakic | ||
4 pat vier four Atjinesch Acehnese achi1257 Acehnese | ||
4 effa vier four Niasch Nias nias1242 Nias | ||
4 eppat vier four Mentaweijsch Mentawai ment1249 Mentawai | ||
4 aöpa vier four Engganoosch Enggano engg1245 Enggano | ||
5 lima vijf five Maleisch Central Malay mala1479 Central Malay | ||
5 lema vijf five Lampongsch Lampungic lamp1241 Lampungic | ||
5 lema vijf five Redjangsch Rejang reja1240 Rejang | ||
5 lema vijf five Battasch Batakic toba1265 Batakic | ||
5 lemoeng vijf five Atjinesch Acehnese achi1257 Acehnese | ||
5 limang vijf five Niasch Nias nias1242 Nias | ||
5 liman vijf five Mentaweijsch Mentawai ment1249 Mentawai | ||
5 alima vijf five Engganoosch Enggano engg1245 Enggano | ||
6 anam zes six Maleisch Central Malay mala1479 Central Malay | ||
6 anam zes six Lampongsch Lampungic lamp1241 Lampungic | ||
6 noom zes six Redjangsch Rejang reja1240 Rejang | ||
6 onam zes six Battasch Batakic toba1265 Batakic | ||
6 nam zes six Atjinesch Acehnese achi1257 Acehnese | ||
6 e nem zes six Niasch Nias nias1242 Nias | ||
6 ennem zes six Mentaweijsch Mentawai ment1249 Mentawai | ||
6 akiakoio zes six Engganoosch Enggano engg1245 Enggano | ||
7 toedjoe zeven seven Maleisch Central Malay mala1479 Central Malay | ||
7 peto zeven seven Lampongsch Lampungic lamp1241 Lampungic | ||
7 tojoa zeven seven Redjangsch Rejang reja1240 Rejang | ||
7 paito zeven seven Battasch Batakic toba1265 Batakic | ||
7 tojo zeven seven Atjinesch Acehnese achi1257 Acehnese | ||
7 fitoe zeven seven Niasch Nias nias1242 Nias | ||
7 pitoe zeven seven Mentaweijsch Mentawai ment1249 Mentawai | ||
7 alima-adoea zeven seven Engganoosch Enggano engg1245 Enggano | ||
8 delapan acht eight Maleisch Central Malay mala1479 Central Malay | ||
8 oallo acht eight Lampongsch Lampungic lamp1241 Lampungic | ||
8 delapon acht eight Redjangsch Rejang reja1240 Rejang | ||
8 oallo acht eight Battasch Batakic toba1265 Batakic | ||
8 dlappang acht eight Atjinesch Acehnese achi1257 Acehnese | ||
8 walloe acht eight Niasch Nias nias1242 Nias | ||
8 walloe acht eight Mentaweijsch Mentawai ment1249 Mentawai | ||
8 alima-agoloe acht eight Engganoosch Enggano engg1245 Enggano | ||
9 sembilan negen nine Maleisch Central Malay mala1479 Central Malay | ||
9 sewa negen nine Lampongsch Lampungic lamp1241 Lampungic | ||
9 sempilan negen nine Redjangsch Rejang reja1240 Rejang | ||
9 sea negen nine Battasch Batakic toba1265 Batakic | ||
9 sakorang negen nine Atjinesch Acehnese achi1257 Acehnese | ||
9 siwa negen nine Niasch Nias nias1242 Nias | ||
9 siwa negen nine Mentaweijsch Mentawai ment1249 Mentawai | ||
9 alima-aöpa negen nine Engganoosch Enggano engg1245 Enggano | ||
10 sapoeloe tien ten Maleisch Central Malay mala1479 Central Malay | ||
10 polo tien ten Lampongsch Lampungic lamp1241 Lampungic | ||
10 depoolo tien ten Redjangsch Rejang reja1240 Rejang | ||
10 sapolo tien ten Battasch Batakic toba1265 Batakic | ||
10 sapolo tien ten Atjinesch Acehnese achi1257 Acehnese | ||
10 foeloe tien ten Niasch Nias nias1242 Nias | ||
10 poeloe tien ten Mentaweijsch Mentawai ment1249 Mentawai | ||
10 tahapoeloe tien ten Engganoosch Enggano engg1245 Enggano |
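Since `cmd_makecldf` is still an empty stub in this commit, the sketch below shows one way the raw rows above could be turned into form records using only the Python standard library. It is an illustration under stated assumptions, not the dataset's implementation: the file is assumed to be tab-separated, the `make_forms` name and the `<Glottocode>-<ID>` identifier scheme are hypothetical, and the output keys mirror the default `ID`, `Language_ID`, `Parameter_ID` and `Form` columns of a CLDF FormTable.

```python
import csv
import io

# Two rows from raw/raw-dat.tsv, assuming tab-separated columns.
RAW_TSV = (
    "ID\tForms\tDutch\tEnglish\tDoculect_orig\tDoculect\tGlottocode\tGlottolog_name\n"
    "1\tsatoe\teen\tone\tMaleisch\tCentral Malay\tmala1479\tCentral Malay\n"
    "6\te nem\tzes\tsix\tNiasch\tNias\tnias1242\tNias\n"
)


def make_forms(tsv_text):
    """Build one form record per raw row."""
    forms = []
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        forms.append(
            {
                # Hypothetical ID scheme: language code plus numeral ID.
                "ID": "{}-{}".format(row["Glottocode"], row["ID"]),
                "Language_ID": row["Glottocode"],
                "Parameter_ID": row["ID"],
                "Form": row["Forms"],
            }
        )
    return forms


forms = make_forms(RAW_TSV)
for form in forms:
    print(form)
```

Splitting on tabs rather than on whitespace matters here: the Nias form for "six" is recorded as `e nem`, with an internal space that a whitespace split would break into two fields.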