Add data structure description
or at least a draft
simar0at authored Jul 16, 2024
1 parent 1577b7f commit 1280220
Showing 1 changed file with 55 additions and 0 deletions.
55 changes: 55 additions & 0 deletions DATASTRUCTURE.md
@@ -0,0 +1,55 @@
# Relations between text types and collections/DBs in BaseX

The vicav app treats different TEI text types differently and stores them in different collections.
Each collection can be listed, which means that the teiHeader metadata is returned as JSON. A frontend can consume this to show lists and filter them.
By default, a virtual teiCorpus document is created from the teiHeader data in a collection.
If a collection contains a document with a teiCorpus, that document is used instead.
The idea is to have such a document whenever there are resources of that type that can only be presented as their metadata.
Examples are texts that exist only as audio recordings and were never transcribed, or recordings that cannot be presented for legal and/or privacy protection reasons.
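
The fallback rule described above can be sketched in Python. This is only an illustration, not the app's actual implementation (which runs inside BaseX): the function names, the choice to wrap each teiHeader in an otherwise empty TEI element, and the reduction of the JSON listing to titles are all assumptions made for the example.

```python
import json
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
NS = {"t": TEI_NS}


def tei(tag: str) -> str:
    """Return the namespaced (Clark notation) name of a TEI element."""
    return f"{{{TEI_NS}}}{tag}"


def corpus_for_collection(documents: list[ET.Element]) -> ET.Element:
    """Return the stored teiCorpus if there is one, else build a virtual one.

    `documents` holds the root elements of all documents in one collection
    (hypothetical input shape, chosen for this sketch).
    """
    for root in documents:
        if root.tag == tei("teiCorpus"):
            return root  # a stored teiCorpus document takes precedence
    corpus = ET.Element(tei("teiCorpus"))
    for root in documents:
        header = root.find(tei("teiHeader"))
        if header is not None:
            # Assumption: each header is wrapped in an otherwise empty TEI element.
            ET.SubElement(corpus, tei("TEI")).append(header)
    return corpus


def list_titles(corpus: ET.Element) -> str:
    """Serialize the titles found in the corpus headers as a JSON array."""
    titles = [t.text for t in corpus.iterfind(".//t:teiHeader//t:title", NS)]
    return json.dumps(titles)
```

A frontend-facing listing would carry far more of the teiHeader than titles; the point here is only the order of precedence between a stored and a virtual teiCorpus.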

## Supported TEI text types

* Meta texts (About, News, etc.) -> vicav_texts
* Bibliographic entries -> vicav_biblio
* Language profiles -> vicav_profiles
* Sample texts -> vicav_samples
* Linguistic feature descriptions -> vicav_lingfeatures
* Text corpora -> vicav_corpus
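
For tooling that needs to route a document to its collection, the list above can be expressed as a lookup table. The following is a minimal sketch: the collection names come from the list, while the dictionary keys and the function name are hypothetical labels chosen for this example.

```python
# Collection names are taken from the list above; the keys are hypothetical labels.
COLLECTION_BY_TEXT_TYPE = {
    "meta": "vicav_texts",
    "biblio": "vicav_biblio",
    "profile": "vicav_profiles",
    "sample": "vicav_samples",
    "lingfeature": "vicav_lingfeatures",
    "corpus": "vicav_corpus",
}


def collection_for(text_type: str) -> str:
    """Return the BaseX collection for a text type, or raise for unknown types."""
    try:
        return COLLECTION_BY_TEXT_TYPE[text_type]
    except KeyError:
        raise ValueError(f"unsupported TEI text type: {text_type!r}") from None
```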

### Recommended structure for meta texts

For a schema example see:

### Recommended structure for bibliographic entries

We usually collect bibliographic entries in Zotero and export them to TEI.

For a schema example see:

### Recommended structure for language profiles

For a schema example see:

### Recommended structure for sample texts

For a schema example see:

### Recommended structure for linguistic feature descriptions

For a schema example see:

### Recommended structure for text corpora

We use NoSketchEngine as the search engine backend.
There is a workflow that takes TEI texts or ELAN files and converts them to TEI with the text tokenized.
The same workflow also generates the NoSketchEngine verticals.
Search results from NoSketchEngine are resolved to w-tags in the XML files generated by this workflow.
The xml:id attribute of every w-tag in the vicav_corpus collection needs to be unique within the collection. We therefore usually prefix the token ID with a document ID.
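
The uniqueness requirement can be checked, and the prefixing applied, with a short script. This is a sketch under assumptions, not part of the actual conversion workflow: the function names and the underscore separator between document ID and token ID are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

TEI_NS = "http://www.tei-c.org/ns/1.0"
W_TAG = f"{{{TEI_NS}}}w"
# ElementTree exposes xml:id under the reserved XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def prefix_token_ids(root: ET.Element, doc_id: str) -> None:
    """Prefix the xml:id of every w element with the document ID, in place.

    The "_" separator is an assumption made for this sketch.
    """
    for w in root.iter(W_TAG):
        token_id = w.get(XML_ID)
        if token_id is not None and not token_id.startswith(f"{doc_id}_"):
            w.set(XML_ID, f"{doc_id}_{token_id}")


def duplicate_token_ids(roots: list[ET.Element]) -> list[str]:
    """Return the xml:id values used by more than one w element in the collection."""
    counts = Counter(
        w.get(XML_ID)
        for root in roots
        for w in root.iter(W_TAG)
        if w.get(XML_ID) is not None
    )
    return sorted(token_id for token_id, n in counts.items() if n > 1)
```

Running `duplicate_token_ids` over all documents before loading them into vicav_corpus gives an early warning when two documents share a token ID.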

For a schema example see:

## Other collections

* vicav_projects -> settings for a particular instance of the vicav-app
* prerendered_json -> the project config, including all metadata requested via the settings. Created on build.
* dict_users -> users for the dictionary part. See the vleserver-basex API
* Dictionary collections -> See the vleserver-basex API
