Edit R code and configuration files + first run of CLDF.

complexico · Jul 7, 2024
commit 2958ec8 · 1 parent 82cb8c8
Showing 22 changed files with 1,426 additions and 111 deletions.
diff --git a/.DS_Store b/.DS_Store
diff --git a/.zenodo.json b/.zenodo.json
@@ -0,0 +1,20 @@
+{
+    "title": "CLDF dataset derived from von Rosenberg's \"De Mentawei-Eilanden en Hunne Bewoners\" from 1853 for comparative numeral data",
+    "access_right": "open",
+    "keywords": [
+        "cldf:Wordlist",
+        "linguistics"
+    ],
+    "creators": [],
+    "contributors": [],
+    "communities": [
+        {
+            "identifier": "lexibank"
+        }
+    ],
+    "upload_type": "dataset",
+    "description": "<p>Cite the source of the dataset as:</p>\n\n<blockquote>\n<p>von Rosenberg, Carl Benjamin Hermann. 1853. De Mentawei-Eilanden en Hunne Bewoners. Tijdschrift voor Indische Taal-, Land- en Volkenkunde 1. 403\u2013440.</p>\n</blockquote>",
+    "license": {
+        "id": "CC-BY-NC-SA-4.0"
+    }
+}
diff --git a/FORMS.md b/FORMS.md
@@ -0,0 +1,27 @@
+## Specification of form manipulation
+
+
+Specification of the value-to-form processing in Lexibank datasets:
+
+The value-to-form processing is divided into two steps, implemented as methods:
+- `FormSpec.split`: Splits a string into individual form chunks.
+- `FormSpec.clean`: Normalizes a form chunk.
+
+These methods use the attributes of a `FormSpec` instance to configure their behaviour.
+
+- `brackets`: `{'(': ')'}`
+  Pairs of strings that should be recognized as brackets, specified as `dict` mapping opening string to closing string
+- `separators`: `,`
+  Iterable of single character tokens that should be recognized as word separator
+- `missing_data`: `('?', '-')`
+  Iterable of strings that are used to mark missing data
+- `strip_inside_brackets`: `True`
+  Flag signaling whether to strip content in brackets (**and** strip leading and trailing whitespace)
+- `replacements`: `[]`
+  List of pairs (`source`, `target`) used to replace occurrences of `source` in forms with `target` (before stripping content in brackets)
+- `first_form_only`: `False`
+  Flag signaling whether at most one form should be returned from `split` - effectively ignoring any spelling variants, etc.
+- `normalize_whitespace`: `True`
+  Flag signaling whether to normalize whitespace - stripping leading and trailing whitespace and collapsing multi-character whitespace to single spaces
+- `normalize_unicode`: `NFD`
+  UNICODE normalization form to use for input of `split` (`None`, 'NFD' or 'NFC')
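
Note on FORMS.md: the two steps described above (split a raw value into chunks, then clean each chunk) can be illustrated with a small standalone sketch. This is not pylexibank's implementation. `split_value` and `clean_chunk` are hypothetical helpers, the input string is made up, and the exact order of operations inside the real `FormSpec` methods is an assumption. Only the default settings are taken from the list above.

```python
import re
import unicodedata

DEFAULT_BRACKETS = {"(": ")"}


def split_value(value, separators=",", brackets=DEFAULT_BRACKETS):
    """Split a raw value on separator characters found outside brackets."""
    chunks, current, closers = [], [], []
    for ch in value:
        if closers and ch == closers[-1]:
            closers.pop()                      # leaving a bracketed span
        elif ch in brackets:
            closers.append(brackets[ch])       # entering a bracketed span
        if ch in separators and not closers:
            chunks.append("".join(current))
            current = []
        else:
            current.append(ch)
    chunks.append("".join(current))
    return chunks


def clean_chunk(chunk, missing_data=("?", "-"), brackets=DEFAULT_BRACKETS,
                replacements=()):
    """Normalize one chunk; return None for empty or missing-data chunks."""
    chunk = unicodedata.normalize("NFD", chunk)
    for source, target in replacements:        # applied before bracket stripping
        chunk = chunk.replace(source, target)
    for opener, closer in brackets.items():    # drop bracketed content
        chunk = re.sub(re.escape(opener) + ".*?" + re.escape(closer), "", chunk)
    chunk = re.sub(r"\s+", " ", chunk).strip() # collapse and trim whitespace
    if not chunk or chunk in missing_data:
        return None
    return chunk


# Made-up raw value with a bracketed comment, extra whitespace and a missing marker.
raw = "foo (gloss),  bar baz , ?"
forms = [clean_chunk(chunk) for chunk in split_value(raw)]
print(forms)  # ['foo', 'bar baz', None]
```

With `first_form_only` set, a caller would presumably keep only the first non-empty entry of `forms`; the sketch leaves that choice to the caller.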
diff --git a/LICENSE b/LICENSE
diff --git a/README.md b/README.md
@@ -0,0 +1,32 @@
+# CLDF dataset derived from von Rosenberg's "De Mentawei-Eilanden en Hunne Bewoners" from 1853 for comparative numeral data
+
+## How to cite
+
+If you use these data please cite
+- the original source
+  > von Rosenberg, Carl Benjamin Hermann. 1853. De Mentawei-Eilanden en Hunne Bewoners. Tijdschrift voor Indische Taal-, Land- en Volkenkunde 1. 403–440.
+- the derived dataset using the DOI of the [particular released version](../../releases/) you were using
+
+## Description
+
+
+This dataset is licensed under a https://creativecommons.org/licenses/by-nc-sa/4.0/ license
+
+## Statistics
+
+
+![Glottolog: 100%](https://img.shields.io/badge/Glottolog-100%25-brightgreen.svg "Glottolog: 100%")
+![Concepticon: 100%](https://img.shields.io/badge/Concepticon-100%25-brightgreen.svg "Concepticon: 100%")
+![Source: 100%](https://img.shields.io/badge/Source-100%25-brightgreen.svg "Source: 100%")
+
+- **Varieties:** 8 (linked to 8 different Glottocodes)
+- **Concepts:** 10 (linked to 10 different Concepticon concept sets)
+- **Lexemes:** 80
+- **Sources:** 1
+- **Synonymy:** 1.00
+
+## CLDF Datasets
+
+The following CLDF datasets are available in [cldf](cldf):
+
+- CLDF [Wordlist](https://github.com/cldf/cldf/tree/master/modules/Wordlist) at [cldf/cldf-metadata.json](cldf/cldf-metadata.json)
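
Note on the statistics in README.md: the reported counts are mutually consistent if every variety has exactly one form per concept (8 × 10 = 80). The sketch below checks this on synthetic data. It assumes that the Synonymy figure is the mean number of forms per concept within a variety, averaged over varieties. That definition is an assumption and is not stated in the README. The helper `synonymy_index` and the identifiers `lang0`, `concept0`, and so on are hypothetical.

```python
from collections import Counter, defaultdict


def synonymy_index(pairs):
    """Mean forms per concept within each variety, averaged over varieties."""
    per_variety = defaultdict(Counter)
    for language_id, parameter_id in pairs:
        per_variety[language_id][parameter_id] += 1
    if not per_variety:
        return 0.0
    means = [sum(counts.values()) / len(counts) for counts in per_variety.values()]
    return sum(means) / len(means)


# Synthetic grid matching the reported counts: 8 varieties x 10 concepts, one form each.
pairs = [(f"lang{i}", f"concept{j}") for i in range(8) for j in range(10)]
print(len(pairs), synonymy_index(pairs))  # 80 1.0
```

Under this definition, any variety with two forms for the same concept would push the value above 1.00.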
diff --git a/TRANSCRIPTION.md b/TRANSCRIPTION.md
@@ -0,0 +1,29 @@
+
+# Detailed transcription record
+
+## Segments
+
+| Segment | Occurrence | BIPA | CLTS SoundClass |
+|-----------|--------------|--------|-------------------|
+
+(0 rows)
+
+
+
+## Unsegmentable lexemes (up to 100 only)
+
+| ID | LANGUAGE | CONCEPT | FORM |
+|------|------------|-----------|--------|
+
+(0 rows)
+
+
+
+## Words with invalid segments (up to 100 only)
+
+| ID | LANGUAGE | CONCEPT | FORM | SEGMENTS |
+|------|------------|-----------|--------|------------|
+
+(0 rows)
+
+
diff --git a/cldf/.transcription-report.json b/cldf/.transcription-report.json
@@ -0,0 +1,15 @@
+{
+    "by_language": {},
+    "stats": {
+        "bad_words": [],
+        "bad_words_count": 0,
+        "bipa_errors": [],
+        "general_errors": 0,
+        "invalid_words": [],
+        "invalid_words_count": 0,
+        "inventory_size": 0,
+        "replacements": {},
+        "sclass_errors": [],
+        "segments": {}
+    }
+}
diff --git a/cldf/README.md b/cldf/README.md
@@ -1,3 +1,89 @@
-# CLDF directory
+<a name="ds-cldfmetadatajson"> </a>
+
+# Wordlist CLDF dataset derived from von Rosenberg's "De Mentawei-Eilanden en Hunne Bewoners" from 1853 for comparative numeral data
+
+**CLDF Metadata**: [cldf-metadata.json](./cldf-metadata.json)
+
+**Sources**: [sources.bib](./sources.bib)
+
+property | value
+ --- | ---
+[dc:bibliographicCitation](http://purl.org/dc/terms/bibliographicCitation) | von Rosenberg, Carl Benjamin Hermann. 1853. De Mentawei-Eilanden en Hunne Bewoners. Tijdschrift voor Indische Taal-, Land- en Volkenkunde 1. 403–440.
+[dc:conformsTo](http://purl.org/dc/terms/conformsTo) | [CLDF Wordlist](http://cldf.clld.org/v1.0/terms.rdf#Wordlist)
+[dc:license](http://purl.org/dc/terms/license) | https://creativecommons.org/licenses/by-nc-sa/4.0/
+[dcat:accessURL](http://www.w3.org/ns/dcat#accessURL) | git@github.com:complexico/vrosenberg1853-numeral
+[prov:wasDerivedFrom](http://www.w3.org/ns/prov#wasDerivedFrom) | <ol><li><a href="git@github.com:complexico/vrosenberg1853-numeral/tree/82cb8c8">git@github.com:complexico/vrosenberg1853-numeral 82cb8c8</a></li><li><a href="glottolog-glottolog-d9da5e2">Glottolog glottolog-glottolog-d9da5e2</a></li><li><a href="https://github.com/concepticon/concepticon-data/tree/7c0b6ae3">Concepticon v3.1.0-19-g7c0b6ae3</a></li><li><a href="cldf-clts-clts-6dc73af">CLTS cldf-clts-clts-6dc73af</a></li></ol>
+[prov:wasGeneratedBy](http://www.w3.org/ns/prov#wasGeneratedBy) | <ol><li><strong>lingpy-rcParams</strong>: <a href="./lingpy-rcParams.json">lingpy-rcParams.json</a></li><li><strong>python</strong>: 3.9.6</li><li><strong>python-packages</strong>: <a href="./requirements.txt">requirements.txt</a></li></ol>
+[rdf:ID](http://www.w3.org/1999/02/22-rdf-syntax-ns#ID) | vrosenberg1853
+[rdf:type](http://www.w3.org/1999/02/22-rdf-syntax-ns#type) | http://www.w3.org/ns/dcat#Distribution
+
+
+## <a name="table-formscsv"></a>Table [forms.csv](./forms.csv)
+
+property | value
+ --- | ---
+[dc:conformsTo](http://purl.org/dc/terms/conformsTo) | [CLDF FormTable](http://cldf.clld.org/v1.0/terms.rdf#FormTable)
+[dc:extent](http://purl.org/dc/terms/extent) | 80
+
+
+### Columns
+
+Name/Property | Datatype | Description
+ --- | --- | --- 
+[ID](http://cldf.clld.org/v1.0/terms.rdf#id) | `string` | Primary key
+[Local_ID](http://purl.org/dc/terms/identifier) | `string` | 
+[Language_ID](http://cldf.clld.org/v1.0/terms.rdf#languageReference) | `string` | References [languages.csv::ID](#table-languagescsv)
+[Parameter_ID](http://cldf.clld.org/v1.0/terms.rdf#parameterReference) | `string` | References [parameters.csv::ID](#table-parameterscsv)
+[Value](http://cldf.clld.org/v1.0/terms.rdf#value) | `string` | 
+[Form](http://cldf.clld.org/v1.0/terms.rdf#form) | `string` | 
+[Segments](http://cldf.clld.org/v1.0/terms.rdf#segments) | list of `string` (separated by ` `) | 
+[Comment](http://cldf.clld.org/v1.0/terms.rdf#comment) | `string` | 
+[Source](http://cldf.clld.org/v1.0/terms.rdf#source) | list of `string` (separated by `;`) | References [sources.bib::BibTeX-key](./sources.bib)
+`Cognacy` | `string` | 
+`Loan` | `boolean` | 
+`Graphemes` | `string` | 
+`Profile` | `string` | 
+`CommonTranscription` | `string` | 
+
+## <a name="table-languagescsv"></a>Table [languages.csv](./languages.csv)
+
+property | value
+ --- | ---
+[dc:conformsTo](http://purl.org/dc/terms/conformsTo) | [CLDF LanguageTable](http://cldf.clld.org/v1.0/terms.rdf#LanguageTable)
+[dc:extent](http://purl.org/dc/terms/extent) | 8
+
+
+### Columns
+
+Name/Property | Datatype | Description
+ --- | --- | --- 
+[ID](http://cldf.clld.org/v1.0/terms.rdf#id) | `string` | Primary key
+[Name](http://cldf.clld.org/v1.0/terms.rdf#name) | `string` | 
+[Glottocode](http://cldf.clld.org/v1.0/terms.rdf#glottocode) | `string` | 
+`Glottolog_Name` | `string` | 
+[ISO639P3code](http://cldf.clld.org/v1.0/terms.rdf#iso639P3code) | `string` | 
+[Macroarea](http://cldf.clld.org/v1.0/terms.rdf#macroarea) | `string` | 
+[Latitude](http://cldf.clld.org/v1.0/terms.rdf#latitude) | `decimal` | 
+[Longitude](http://cldf.clld.org/v1.0/terms.rdf#longitude) | `decimal` | 
+`Family` | `string` | 
+`Sources` | `string` | 
+`Doculect_Dutch` | `string` | 
+
+## <a name="table-parameterscsv"></a>Table [parameters.csv](./parameters.csv)
+
+property | value
+ --- | ---
+[dc:conformsTo](http://purl.org/dc/terms/conformsTo) | [CLDF ParameterTable](http://cldf.clld.org/v1.0/terms.rdf#ParameterTable)
+[dc:extent](http://purl.org/dc/terms/extent) | 10
+
+
+### Columns
+
+Name/Property | Datatype | Description
+ --- | --- | --- 
+[ID](http://cldf.clld.org/v1.0/terms.rdf#id) | `string` | Primary key
+[Name](http://cldf.clld.org/v1.0/terms.rdf#name) | `string` | 
+[Concepticon_ID](http://cldf.clld.org/v1.0/terms.rdf#concepticonReference) | `string` | 
+`Concepticon_Gloss` | `string` | 
+`Number` | `string` | 
 
-This directory contains the dataset formatted as [CLDF dataset](https://cldf.clld.org).
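
Note on cldf/README.md: the column descriptions above imply that `forms.csv` rows point to `languages.csv` and `parameters.csv` through `Language_ID` and `Parameter_ID`, and that `Segments` and `Source` are list-valued (separated by a space and by `;`). A minimal sketch of resolving those references with Python's standard `csv` module follows. The helper names and all row values are made up for illustration, and only a subset of the documented columns is used. Real data would be read from the files in `cldf/`.

```python
import csv
import io


def read_table(handle):
    """Read a CSV table into a list of dicts keyed by column name."""
    return list(csv.DictReader(handle))


def describe_forms(forms, languages, parameters):
    """Resolve the foreign keys of each form row and split list-valued columns."""
    language_names = {row["ID"]: row["Name"] for row in languages}
    concept_names = {row["ID"]: row["Name"] for row in parameters}
    described = []
    for row in forms:
        described.append({
            "language": language_names[row["Language_ID"]],
            "concept": concept_names[row["Parameter_ID"]],
            "form": row["Form"],
            "segments": row["Segments"].split(),                       # space-separated
            "sources": [s for s in row["Source"].split(";") if s],     # ';'-separated
        })
    return described


# Made-up rows standing in for forms.csv, languages.csv and parameters.csv.
forms = read_table(io.StringIO(
    "ID,Language_ID,Parameter_ID,Form,Segments,Source\n"
    "lang1-one-1,lang1,one,abc,a b c,src1;src2\n"
))
languages = read_table(io.StringIO("ID,Name\nlang1,Variety One\n"))
parameters = read_table(io.StringIO("ID,Name\none,one\n"))

for entry in describe_forms(forms, languages, parameters):
    print(entry)
```

Passing open file handles for the actual CSV files to `read_table` should work the same way, provided the files carry the column names listed above.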