First commit.
gederajeg committed Jul 7, 2024
0 parents commit 82cb8c8
Showing 20 changed files with 417 additions and 0 deletions.
Binary file added .DS_Store
Binary file not shown.
29 changes: 29 additions & 0 deletions .github/workflows/python-package.yml
@@ -0,0 +1,29 @@
name: CLDF-validation

on:
  push:
    branches: [ main ]
  pull_request:
    branches: [ main ]

jobs:
  build:

    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: [3.6]

    steps:
    - uses: actions/checkout@v2
    - name: Set up Python ${{ matrix.python-version }}
      uses: actions/setup-python@v2
      with:
        python-version: ${{ matrix.python-version }}
    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        pip install pytest-cldf
    - name: Test with pytest
      run: |
        pytest --cldf-metadata=cldf/cldf-metadata.json test.py
102 changes: 102 additions & 0 deletions .gitignore
@@ -0,0 +1,102 @@
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
env/
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
.hypothesis/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# pyenv
.python-version

# celery beat schedule file
celerybeat-schedule

# SageMath parsed files
*.sage.py

# dotenv
.env

# virtualenv
.venv
venv/
ENV/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/

1 change: 1 addition & 0 deletions cldf/.gitattributes
@@ -0,0 +1 @@
*.csv text eol=crlf
3 changes: 3 additions & 0 deletions cldf/README.md
@@ -0,0 +1,3 @@
# CLDF directory

This directory contains the dataset formatted as a [CLDF dataset](https://cldf.clld.org).
26 changes: 26 additions & 0 deletions cldfbench_vrosenberg1853.py
@@ -0,0 +1,26 @@
import pathlib

from cldfbench import Dataset as BaseDataset


class Dataset(BaseDataset):
    dir = pathlib.Path(__file__).parent
    id = "vrosenberg1853"

    def cldf_specs(self):  # A dataset must declare all CLDF sets it creates.
        return super().cldf_specs()

    def cmd_download(self, args):
        """
        Download files to the raw/ directory. You can use helper methods of `self.raw_dir`, e.g.
        >>> self.raw_dir.download(url, fname)
        """
        pass

    def cmd_makecldf(self, args):
        """
        Convert the raw data to a CLDF dataset.
        >>> args.writer.objects['LanguageTable'].append(...)
        """
16 changes: 16 additions & 0 deletions etc/00-steps-on-terminal.sh
@@ -0,0 +1,16 @@
source /Users/Primahadi/Documents/cldf_project/myenv/bin/activate

cldfbench new
# id: vrosenberg1853
# title: CLDF dataset derived from von Rosenberg's "De Mentawei-Eilanden en Hunne Bewoners" from 1853 for comparative numeral data
# license: https://creativecommons.org/licenses/by-nc-sa/4.0/
# url:
# citation: Rosenberg, Carl Benjamin Hermann von. 1853. De Mentawei-Eilanden en Hunne Bewoners. Tijdschrift voor Indische Taal-, Land- en Volkenkunde 1. 403–440.

cd vrosenberg1853

# pre-process the raw data in `01-raw-data-pre-processing.R`

# run pyconcepticon
concepticon --repos "/Users/Primahadi/Documents/cldf_project/concepticon/concepticon-data" \
map_concepts "etc/gloss-to-map.tsv" --language en --output "etc/concepts.tsv"
53 changes: 53 additions & 0 deletions etc/01-raw-data-pre-processing.R
@@ -0,0 +1,53 @@
library(tidyverse)
library(deeplr)

df <- read_tsv("https://raw.githubusercontent.com/complexico/mentawai-word-list-1853/main/data/vrosenberg1853p434.tsv")

df <- df |>
  mutate(ID = row_number()) |>
  select(ID, everything()) |>
  pivot_longer(cols = -c(Dutch, ID), names_to = "Doculect", values_to = "Forms")

# prepare the translation data (run once, then save it)
# source("etc/00-deeplr.R")
# dutch <- df |>
#   select(ID, Dutch) |>
#   distinct()

# dutch_w_english <- dutch |>
#   mutate(English = translate2(text = Dutch, target_lang = "EN", source_lang = "NL", preserve_formatting = TRUE, auth_key = mydeepl))

# dutch_w_english <- dutch_w_english |>
#   mutate(English = replace(English, English == "a", "one"))
# write_tsv(dutch_w_english, file = "etc/dutch_w_english.tsv")

dutch_w_english <- read_tsv("etc/dutch_w_english.tsv")

doculect <- tribble(
  ~Doculect, ~Doculect_Eng, ~Glottocode, ~Glottolog_name,
  "Maleisch", "Central Malay", "mala1479", "Central Malay",
  "Lampongsch", "Lampungic", "lamp1241", "Lampungic",
  "Redjangsch", "Rejang", "reja1240", "Rejang",
  "Battasch", "Batakic", "toba1265", "Batakic",
  "Atjinesch", "Acehnese", "achi1257", "Acehnese",
  "Niasch", "Nias", "nias1242", "Nias",
  "Mentaweijsch", "Mentawai", "ment1249", "Mentawai",
  "Engganoosch", "Enggano", "engg1245", "Enggano"
)
# I chose Glottolog's Central Malay because I am not sure which specific Malay variety von Rosenberg refers to. Glottolog describes Central Malay as the primary source of the variety of Malay spoken throughout Southeast Asia: "Malay (zlm-zlm) = 3 (Wider communication). Originated in Sumatra; spoken throughout southeast Asia. With the advent of Islam, Malay became widespread in the 15th and 16th centuries. Lingua franca for Malaysia’s multiethnic population. Used in trade, literature, and story telling."

df <- df |>
  left_join(dutch_w_english) |>
  select(ID, Forms, Dutch, English, Doculect) |>
  left_join(doculect) |>
  rename(Doculect_orig = Doculect,
         Doculect = Doculect_Eng)

df |> write_tsv("raw/raw-dat.tsv")

# Prepare the concepts for Concepticon mapping
# df |>
#   select(ENGLISH = English, ID) |>
#   distinct() |>
#   write_tsv("etc/gloss-to-map.tsv")
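The `pivot_longer()` call in this script reshapes the word list from one column per doculect into one row per form. For readers who do not use R, that step can be sketched with the Python standard library as follows. This is an illustration only: the wide layout (a `Dutch` column followed by one column per doculect) is an assumption inferred from the R code, the function name `pivot_longer` is merely borrowed from tidyr, and the two sample rows reuse forms that appear in `raw/raw-dat.tsv`.

```python
import csv
import io

# Assumed wide layout: a Dutch gloss column plus one column per doculect.
# The forms are taken from raw/raw-dat.tsv; only two doculects are shown.
WIDE = (
    "Dutch\tMaleisch\tLampongsch\n"
    "een\tsatoe\tsije\n"
    "twee\tdoea\trowa\n"
)


def pivot_longer(handle, id_column="Dutch"):
    """Turn one row per gloss into one row per (gloss, doculect) pair."""
    long_rows = []
    reader = csv.DictReader(handle, delimiter="\t")
    # Mirror mutate(ID = row_number()): number the glosses from 1.
    for number, row in enumerate(reader, start=1):
        for doculect, form in row.items():
            if doculect == id_column:
                continue  # the gloss column is kept as an identifier, not pivoted
            long_rows.append(
                {"ID": number, id_column: row[id_column],
                 "Doculect": doculect, "Forms": form}
            )
    return long_rows


long_rows = pivot_longer(io.StringIO(WIDE))
for row in long_rows:
    print(row["ID"], row["Dutch"], row["Doculect"], row["Forms"])
```

Each gloss yields as many output rows as there are doculect columns, which is why the ten numerals and eight doculects in this dataset give the 80 data rows of `raw/raw-dat.tsv`.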

4 changes: 4 additions & 0 deletions etc/README.md
@@ -0,0 +1,4 @@
# Configuration directory

This directory contains "configuration" data, i.e. data which helps with and
guides the conversion of the raw data to CLDF.
15 changes: 15 additions & 0 deletions etc/concepts.tsv
@@ -0,0 +1,15 @@
ENGLISH ID CONCEPTICON_ID CONCEPTICON_GLOSS SIMILARITY
one 1 1493 ONE 2
#<<<
two 2 1498 TWO 2
two 2 1384 SECOND 2
#>>>
three 3 492 THREE 2
four 4 1500 FOUR 2
five 5 493 FIVE 2
six 6 1703 SIX 2
seven 7 1704 SEVEN 2
eight 8 1705 EIGHT 2
nine 9 1483 NINE 2
ten 10 1515 TEN 2
# 10/10 100%
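In the mapping output above, the rows between `#<<<` and `#>>>` are alternative Concepticon candidates for a single gloss (here `two` matches both TWO and SECOND), so they still need a manual decision. A minimal stdlib sketch for listing such ambiguous glosses is given below. The helper name `read_mapping` is hypothetical, the sample is a shortened copy of `etc/concepts.tsv`, and treating every line that starts with `#` as a marker or summary line is an assumption based on this file alone.

```python
import csv
import io

# Shortened copy of etc/concepts.tsv, tab-separated.
MAPPING = (
    "ENGLISH\tID\tCONCEPTICON_ID\tCONCEPTICON_GLOSS\tSIMILARITY\n"
    "one\t1\t1493\tONE\t2\n"
    "#<<<\n"
    "two\t2\t1498\tTWO\t2\n"
    "two\t2\t1384\tSECOND\t2\n"
    "#>>>\n"
    "three\t3\t492\tTHREE\t2\n"
)


def read_mapping(handle):
    """Return {gloss: [candidate rows]}, ignoring lines that start with '#'."""
    data_lines = [line for line in handle if not line.startswith("#")]
    candidates = {}
    for row in csv.DictReader(data_lines, delimiter="\t"):
        candidates.setdefault(row["ENGLISH"], []).append(row)
    return candidates


candidates = read_mapping(io.StringIO(MAPPING))

# Glosses with more than one candidate need to be resolved by hand.
ambiguous = {
    gloss: [row["CONCEPTICON_GLOSS"] for row in rows]
    for gloss, rows in candidates.items()
    if len(rows) > 1
}
print(ambiguous)
```

For a numeral word list the cardinal reading (TWO) is the likely intended one, but the sketch deliberately only reports the ambiguity and leaves the choice to the editor.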
11 changes: 11 additions & 0 deletions etc/dutch_w_english.tsv
@@ -0,0 +1,11 @@
ID Dutch English
1 een one
2 twee two
3 drie three
4 vier four
5 vijf five
6 zes six
7 zeven seven
8 acht eight
9 negen nine
10 tien ten
11 changes: 11 additions & 0 deletions etc/gloss-to-map.tsv
@@ -0,0 +1,11 @@
ENGLISH ID
one 1
two 2
three 3
four 4
five 5
six 6
seven 7
eight 8
nine 9
ten 10
9 changes: 9 additions & 0 deletions etc/languages.tsv
@@ -0,0 +1,9 @@
ID Doculect_Dutch Name Glottocode Glottolog_Name Sources
1 Maleisch Central Malay mala1479 Central Malay VonRosenberg1853
2 Lampongsch Lampungic lamp1241 Lampungic VonRosenberg1853
3 Redjangsch Rejang reja1240 Rejang VonRosenberg1853
4 Battasch Batakic toba1265 Batakic VonRosenberg1853
5 Atjinesch Acehnese achi1257 Acehnese VonRosenberg1853
6 Niasch Nias nias1242 Nias VonRosenberg1853
7 Mentaweijsch Mentawai ment1249 Mentawai VonRosenberg1853
8 Engganoosch Enggano engg1245 Enggano VonRosenberg1853
8 changes: 8 additions & 0 deletions metadata.json
@@ -0,0 +1,8 @@
{
    "id": "vrosenberg1853",
    "title": "CLDF dataset derived from von Rosenberg's \"De Mentawei-Eilanden en Hunne Bewoners\" from 1853 for comparative numeral data",
    "description": null,
    "license": "https://creativecommons.org/licenses/by-nc-sa/4.0/",
    "url": null,
    "citation": "Rosenberg, Carl Benjamin Hermann von. 1853. De Mentawei-Eilanden en Hunne Bewoners. Tijdschrift voor Indische Taal-, Land- en Volkenkunde 1. 403\u2013440."
}
4 changes: 4 additions & 0 deletions raw/README.md
@@ -0,0 +1,4 @@
# Source directory

This directory contains the "raw" source data of the dataset from which the
CLDF dataset in [`cldf/`](../cldf) is derived.
81 changes: 81 additions & 0 deletions raw/raw-dat.tsv
@@ -0,0 +1,81 @@
ID Forms Dutch English Doculect_orig Doculect Glottocode Glottolog_name
1 satoe een one Maleisch Central Malay mala1479 Central Malay
1 sije een one Lampongsch Lampungic lamp1241 Lampungic
1 do een one Redjangsch Rejang reja1240 Rejang
1 sada een one Battasch Batakic toba1265 Batakic
1 sa een one Atjinesch Acehnese achi1257 Acehnese
1 sara een one Niasch Nias nias1242 Nias
1 sara een one Mentaweijsch Mentawai ment1249 Mentawai
1 daheij een one Engganoosch Enggano engg1245 Enggano
2 doea twee two Maleisch Central Malay mala1479 Central Malay
2 rowa twee two Lampongsch Lampungic lamp1241 Lampungic
2 dooij twee two Redjangsch Rejang reja1240 Rejang
2 doeo twee two Battasch Batakic toba1265 Batakic
2 doea twee two Atjinesch Acehnese achi1257 Acehnese
2 doea twee two Niasch Nias nias1242 Nias
2 doea twee two Mentaweijsch Mentawai ment1249 Mentawai
2 adoea twee two Engganoosch Enggano engg1245 Enggano
3 tiga drie three Maleisch Central Malay mala1479 Central Malay
3 pullo drie three Lampongsch Lampungic lamp1241 Lampungic
3 tellau drie three Redjangsch Rejang reja1240 Rejang
3 tolo drie three Battasch Batakic toba1265 Batakic
3 tlo drie three Atjinesch Acehnese achi1257 Acehnese
3 feloe drie three Niasch Nias nias1242 Nias
3 teloe drie three Mentaweijsch Mentawai ment1249 Mentawai
3 agoloe drie three Engganoosch Enggano engg1245 Enggano
4 ampat vier four Maleisch Central Malay mala1479 Central Malay
4 ampa vier four Lampongsch Lampungic lamp1241 Lampungic
4 mpat vier four Redjangsch Rejang reja1240 Rejang
4 opat vier four Battasch Batakic toba1265 Batakic
4 pat vier four Atjinesch Acehnese achi1257 Acehnese
4 effa vier four Niasch Nias nias1242 Nias
4 eppat vier four Mentaweijsch Mentawai ment1249 Mentawai
4 aöpa vier four Engganoosch Enggano engg1245 Enggano
5 lima vijf five Maleisch Central Malay mala1479 Central Malay
5 lema vijf five Lampongsch Lampungic lamp1241 Lampungic
5 lema vijf five Redjangsch Rejang reja1240 Rejang
5 lema vijf five Battasch Batakic toba1265 Batakic
5 lemoeng vijf five Atjinesch Acehnese achi1257 Acehnese
5 limang vijf five Niasch Nias nias1242 Nias
5 liman vijf five Mentaweijsch Mentawai ment1249 Mentawai
5 alima vijf five Engganoosch Enggano engg1245 Enggano
6 anam zes six Maleisch Central Malay mala1479 Central Malay
6 anam zes six Lampongsch Lampungic lamp1241 Lampungic
6 noom zes six Redjangsch Rejang reja1240 Rejang
6 onam zes six Battasch Batakic toba1265 Batakic
6 nam zes six Atjinesch Acehnese achi1257 Acehnese
6 e nem zes six Niasch Nias nias1242 Nias
6 ennem zes six Mentaweijsch Mentawai ment1249 Mentawai
6 akiakoio zes six Engganoosch Enggano engg1245 Enggano
7 toedjoe zeven seven Maleisch Central Malay mala1479 Central Malay
7 peto zeven seven Lampongsch Lampungic lamp1241 Lampungic
7 tojoa zeven seven Redjangsch Rejang reja1240 Rejang
7 paito zeven seven Battasch Batakic toba1265 Batakic
7 tojo zeven seven Atjinesch Acehnese achi1257 Acehnese
7 fitoe zeven seven Niasch Nias nias1242 Nias
7 pitoe zeven seven Mentaweijsch Mentawai ment1249 Mentawai
7 alima-adoea zeven seven Engganoosch Enggano engg1245 Enggano
8 delapan acht eight Maleisch Central Malay mala1479 Central Malay
8 oallo acht eight Lampongsch Lampungic lamp1241 Lampungic
8 delapon acht eight Redjangsch Rejang reja1240 Rejang
8 oallo acht eight Battasch Batakic toba1265 Batakic
8 dlappang acht eight Atjinesch Acehnese achi1257 Acehnese
8 walloe acht eight Niasch Nias nias1242 Nias
8 walloe acht eight Mentaweijsch Mentawai ment1249 Mentawai
8 alima-agoloe acht eight Engganoosch Enggano engg1245 Enggano
9 sembilan negen nine Maleisch Central Malay mala1479 Central Malay
9 sewa negen nine Lampongsch Lampungic lamp1241 Lampungic
9 sempilan negen nine Redjangsch Rejang reja1240 Rejang
9 sea negen nine Battasch Batakic toba1265 Batakic
9 sakorang negen nine Atjinesch Acehnese achi1257 Acehnese
9 siwa negen nine Niasch Nias nias1242 Nias
9 siwa negen nine Mentaweijsch Mentawai ment1249 Mentawai
9 alima-aöpa negen nine Engganoosch Enggano engg1245 Enggano
10 sapoeloe tien ten Maleisch Central Malay mala1479 Central Malay
10 polo tien ten Lampongsch Lampungic lamp1241 Lampungic
10 depoolo tien ten Redjangsch Rejang reja1240 Rejang
10 sapolo tien ten Battasch Batakic toba1265 Batakic
10 sapolo tien ten Atjinesch Acehnese achi1257 Acehnese
10 foeloe tien ten Niasch Nias nias1242 Nias
10 poeloe tien ten Mentaweijsch Mentawai ment1249 Mentawai
10 tahapoeloe tien ten Engganoosch Enggano engg1245 Enggano
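The `cmd_makecldf` method in `cldfbench_vrosenberg1853.py` is still an empty stub in this commit, and the rows of `raw/raw-dat.tsv` are what it will eventually have to convert. As a rough, stdlib-only sketch of that conversion (it does not use the cldfbench writer API), each raw row could be turned into a form record like this. The function name `rows_to_forms`, the output field names, and the `<doculect>-<number>` ID scheme are all assumptions made for illustration.

```python
import csv
import io

# Two rows copied from raw/raw-dat.tsv, tab-separated as in a TSV file.
SAMPLE = (
    "ID\tForms\tDutch\tEnglish\tDoculect_orig\tDoculect\tGlottocode\tGlottolog_name\n"
    "1\tsatoe\teen\tone\tMaleisch\tCentral Malay\tmala1479\tCentral Malay\n"
    "1\tsara\teen\tone\tNiasch\tNias\tnias1242\tNias\n"
)


def rows_to_forms(handle):
    """Yield one form record per row of a raw-dat.tsv-like file."""
    for row in csv.DictReader(handle, delimiter="\t"):
        yield {
            # Hypothetical ID scheme: original doculect label plus concept number.
            "ID": "{}-{}".format(row["Doculect_orig"], row["ID"]),
            "Language": row["Doculect"],
            "Glottocode": row["Glottocode"],
            "Concept": row["English"],
            "Form": row["Forms"],
        }


forms = list(rows_to_forms(io.StringIO(SAMPLE)))
for form in forms:
    print(form["ID"], form["Language"], form["Form"])
```

Reading with `csv.DictReader` and an explicit tab delimiter keeps multi-word values such as `Central Malay` and forms such as `e nem` intact, which naive whitespace splitting would not.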