
Provide data in CLDF format #2

Open
xrotwang opened this issue May 30, 2017 · 13 comments

@xrotwang commented May 30, 2017

It may be worthwhile to change the data format in this repo to CLDF. As far as I can tell, not too many changes would be required to do so.

What you would gain:

  • Some of the documentation could be offloaded to the CLDF spec.
  • Better machine readability, in particular of the metadata (while YAML has good support in many languages, support for JSON is still better; the additional benefit of using JSON-LD is machine-readable info about license, proper citation, etc.).
  • A python API (see the sketch after this list).
  • A first (big) step towards serving the autotyp data from a clld app.
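
For a sense of what the metadata and API points buy you, a minimal sketch of reading such a dataset with pycldf (the metadata file name is hypothetical):

from pycldf import StructureDataset

# Load a dataset from its JSON-LD metadata file (hypothetical name):
ds = StructureDataset.from_metadata('StructureDataset-metadata.json')
# License and citation info comes straight from the metadata:
print(ds.properties.get('dc:license'))
for value in ds['ValueTable']:
    print(value['Language_ID'], value['Parameter_ID'], value['Value'])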
@tzakharko (Contributor) commented Jun 7, 2017

For the initial export, we chose YAML because of its human-readability. This was done to make the data more accessible to a wider audience. We aim to provide supplementary JSON-LD metadata in future releases that expose more of the database internal structure.

In the meantime, if there is interest, it should be possible to create a pipeline that converts the current CSV/YAML format to CLDF CSV/JSON-LD automatically. We will gladly accept community help in setting this up.
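
For concreteness, a rough sketch of the kind of conversion we have in mind, using pycldf (all file and column names here are hypothetical; the real repository layout differs):

import csv

import yaml  # pyyaml
from pycldf import StructureDataset

# Hypothetical inputs: one YAML metadata file and one CSV data file per variable.
meta = yaml.safe_load(open('metadata/variable.yaml'))
with open('data/variable.csv', newline='') as f:
    rows = list(csv.DictReader(f))

ds = StructureDataset.in_dir('cldf')
ds.add_component('ParameterTable')
ds.write(
    ParameterTable=[dict(ID=meta['name'], Name=meta['name'], Description=meta.get('description', ''))],
    ValueTable=[
        dict(ID=str(i), Language_ID=row['Language_ID'], Parameter_ID=meta['name'], Value=row['Value'])
        for i, row in enumerate(rows, start=1)
    ],
)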

@ALL: Please comment in this thread if you are interested in creating such a pipeline; we could use this thread to draft a roadmap.

@xrotwang (Author)

I will look into this. It could be a good example for the refactored CLDF structure dataset spec.

@xflr6 commented Jan 10, 2018

First stab at the conversion is here: https://github.com/clld/autotyp-data/blob/cldf/autotyp_to_cldf.py
To work around #9 and #10, the ill-formed data was removed; cf. the commits in the issues branch.
Result as a ZIP file: autotyp-cldf.zip

@tzakharko (Contributor)

First of all, apologies that it took a while: our previous database pipeline was unmaintainable, so we had to redesign and rebuild it from scratch. The new pipeline is better equipped for tracking dependencies between datasets (not explicitly part of the metadata yet, but they will be soonish), so this is a good time to revisit this issue and chart a way to provide a robust solution.

One potential difficulty I see is that we decided to go with nested/repeated data for some datasets, as it simplifies handling and conceptualisation in practice. What would be a good way of mapping this kind of data model to CLDF? If I understand correctly, there is some support for repeated simple values, but what about nested records?

@xrotwang (Author)

A relatively straightforward way to handle this is to use JSON serialized as strings for the values, and to add enough metadata to the ParameterTable to make this transparent. A complete example using pycldf looks like this:

from csvw.metadata import Datatype
from pycldf import StructureDataset

ds = StructureDataset.in_dir('ds')
# Add a ParameterTable with an extra column 'datatype' holding csvw datatype specs:
ds.add_component('ParameterTable', {'name': 'datatype', 'datatype': 'json'})
ds.write(
    ParameterTable=[dict(ID='pid', datatype='json')],
    ValueTable=[dict(ID='1', Language_ID='l', Parameter_ID='pid', Value='{"a": 2}')])

# Instantiate a Datatype from the spec stored in the ParameterTable ...
dt = Datatype.fromvalue(ds.get_object('ParameterTable', 'pid').data['datatype'])
# ... and use it to parse the serialized values:
for v in ds['ValueTable']:
    v = dt.parse(v['Value'])
    assert isinstance(v, dict)
    print(v['a'])

Here, we add a column 'datatype' to the ParameterTable and mark it as a JSON column (which is understood by csvw). When reading data from the ValueTable, we first instantiate a csvw.metadata.Datatype from the datatype spec in the ParameterTable, and then use this object to parse the value accordingly.

@xrotwang (Author)

Btw., I'm in the process of putting together a conversion from AUTOTYP v1.0 to a CLDF dataset - that's how I turn up all the issues I posted :)

@tzakharko (Contributor)

A relatively straightforward way to handle this is to use JSON serialized as strings for the values, and to add enough metadata to the ParameterTable to make this transparent. A complete example using pycldf looks like this:

That's neat! But at this point, what is the value of using CSV at all? Why not just go all JSON?

Btw., I'm in the process of putting together a conversion from AUTOTYP v1.0 to a CLDF dataset - that's how I turn up all the issues I posted :)

Keep them coming :) One problem is that the published YAML metadata is just a subset of the much richer metadata we maintain internally for the export pipeline, and the mapping is not perfect. There are many improvements planned here, e.g. relationships between fields, more precise types and constraints, etc.; these things unfortunately didn't make it into the big release.

@xrotwang (Author)

As soon as a particular datatype for values becomes more widespread - including standard analysis methods - it becomes a candidate for "more" standardisation in CLDF. Putting it into CLDF now basically puts it "on track" for this. Also, CSV - even if it includes smallish JSON snippets - plays nicer with version control, because it mostly avoids the "unspecified whitespace" and "attribute order" issues of JSON or XML.

I should add that CLDF comes with "built-in" validation: things like invalid values for categorical data, non-existent Glottocodes, etc. will be flagged "out of the box". And generating human-readable metadata descriptions is easy, e.g. with cldfbench (see e.g. https://github.com/glottolog/glottolog-cldf/blob/master/cldf/README.md). So arguably, making CLDF the target release format for AUTOTYP might solve some of the issues here.
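
For example, a minimal sketch of running that validation from Python (the metadata path is hypothetical; the same check is also available on the command line via pycldf's cldf validate):

from pycldf import Dataset

ds = Dataset.from_metadata('cldf/StructureDataset-metadata.json')
# Checks the data against the schema: datatypes, foreign keys, etc.
ds.validate()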

@tzakharko (Contributor)

@xrotwang could you share your CLDF conversion pipeline with me? I would like to add it to the build system, so that we have CLDF as a first-class target.

@xrotwang (Author)

It's here: https://github.com/cldf-datasets/autotypcldf
Using https://github.com/cldf/cldfbench
autotyp-data is pulled in as a git submodule, see https://github.com/cldf-datasets/autotypcldf/tree/main/raw
And the conversion is run via

cldfbench makecldf cldfbench_autotypcldf.py --glottolog-version v4.5

which basically runs the code in https://github.com/cldf-datasets/autotypcldf/blob/main/cldfbench_autotypcldf.py
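
For readers unfamiliar with cldfbench: such a module is a subclass of cldfbench's Dataset. A simplified sketch of the general shape (not the actual file; the raw file and column names here are made up):

from pathlib import Path

from cldfbench import CLDFSpec, Dataset as BaseDataset


class Dataset(BaseDataset):
    dir = Path(__file__).parent
    id = 'autotypcldf'

    def cldf_specs(self):
        # Target a CLDF StructureDataset:
        return CLDFSpec(dir=self.cldf_dir, module='StructureDataset')

    def cmd_makecldf(self, args):
        # Read the raw AUTOTYP data (the git submodule under raw/) and
        # write CLDF tables via args.writer:
        args.writer.cldf.add_component('LanguageTable')
        for row in self.raw_dir.read_csv('languages.csv', dicts=True):
            args.writer.objects['LanguageTable'].append(
                dict(ID=row['ID'], Name=row['Name']))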

@tzakharko (Contributor) commented Apr 7, 2022

The CLDF dataset is now available in the cldf-export branch.

The python dataset classes are here. I have copied your code verbatim, just adjusted the file paths and removed the bibliography fix, since it is no longer necessary.

Could you have a look whether the CLDF data is ok like this? If there are no concerns I can draft a 1.1.0 release.

@xrotwang (Author) commented Apr 7, 2022

Looks ok:

$ cldf stats StructureDataset-metadata.json 
<cldf:v1.0:StructureDataset at .>
                     value
-------------------  -----------------------------------------------------
dc:conformsTo        http://cldf.clld.org/v1.0/terms.rdf#StructureDataset
dc:source            sources.bib
prov:wasDerivedFrom  [{'rdf:about': 'new-autotyp-preview', 'rdf:type': 'prov:Entity', 'dc:created': 'v1.0.1-1-g1d0af14', 'dc:title': 'Repository'}, {'rdf:about': 'https://github.com/glottolog/glottolog', 'rdf:type': 'prov:Entity', 'dc:created': 'v4.5', 'dc:title': 'Glottolog'}, {'rdf:about': 'new-autotyp-preview', 'rdf:type': 'prov:Entity', 'dc:created': 'v1.0.1-1-g1d0af14', 'dc:title': 'Repository'}]
prov:wasGeneratedBy  [{'dc:title': 'python', 'dc:description': '3.9.10'}, {'dc:title': 'python-packages', 'dc:relation': 'requirements.txt'}]
rdf:ID               autotyp
rdf:type             http://www.w3.org/ns/dcat#Distribution

                   Type                 Rows
-----------------  -----------------  ------
values.csv         ValueTable         278536
languages.csv      LanguageTable        3053
contributions.csv  ContributionTable      46
parameters.csv     ParameterTable       1013
codes.csv          CodeTable            1402
sources.bib        Sources              5001

and creating a SQLite db from it works as well.
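
For reference, creating the SQLite db can be done with pycldf's cldf createdb command; from Python it amounts to something like this (file names hypothetical):

from pycldf import Dataset
from pycldf.db import Database

ds = Dataset.from_metadata('StructureDataset-metadata.json')
# Write all tables of the dataset to a fresh SQLite file:
Database(ds, fname='autotyp.sqlite').write_from_tg()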

So, looks good to me.

@nataliacp

I have posted a comment on the closed issue #51 (which I can't reopen), so I am copying it here, as it is relevant for the conversion to the CLDF format. It is about the synthesis module, but it could be applicable to other complex modules too.

I have a proposal to increase data reusability in CLDF. Right now, the variables listed in the first comment of issue #51 sit in JSON format under the MaximallyInflectedVerbSynthesis umbrella variable. Most of these variables, though, are simple binary per-language variables, and they could be incorporated straightforwardly into the CLDF format. The only problem is that the values of these variables can only be trusted for languages that are TRUE for both housekeeping variables (IsVerbAgreementSurveyComplete and IsVerbInflectionSurveyComplete). What do you think @tzakharko and @xrotwang?
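
To illustrate the proposal, a sketch of unpacking the nested values into per-language binary variables, keeping only languages where both housekeeping variables are TRUE (the parameter and field names follow this comment; the actual CLDF layout may differ):

import json

from pycldf import Dataset

ds = Dataset.from_metadata('StructureDataset-metadata.json')
for row in ds['ValueTable']:
    if row['Parameter_ID'] != 'MaximallyInflectedVerbSynthesis':
        continue
    record = json.loads(row['Value'])
    # Only trust the simple binary variables for languages where both surveys are complete:
    if record.get('IsVerbAgreementSurveyComplete') and record.get('IsVerbInflectionSurveyComplete'):
        for field, value in record.items():
            if isinstance(value, bool):
                print(row['Language_ID'], field, value)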
