Let's start thinking about how to document models #16

Open
alix-tz opened this issue Dec 2, 2022 · 4 comments
Labels
question (Further information is requested), schema

Comments

@alix-tz
Member

alix-tz commented Dec 2, 2022

See: HTR-United/htr-united#91

Off the top of my head, properties should include (* : required):

  • Title*
  • Description*
  • Software (Name, Link, Version)*
  • DOI Link*
  • Project
  • Authors
  • Used datasets
  • Manuscript / Print / Both (simpler than what we have for datasets)*
  • Languages*
  • Scripts*
  • Known characters
  • License*
  • Encoding*
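
To make the list above easier to discuss, here is a minimal sketch of how the required and optional properties could be checked on a record. The property names (`used_datasets`, `known_characters`, `material`, ...) and the `validate` helper are assumptions made for illustration; nothing here is an agreed schema.

```python
from typing import Dict, List

# Hypothetical property names derived from the list above; not an agreed schema.
REQUIRED = (
    "title", "description", "software", "doi", "material",
    "languages", "scripts", "license", "encoding",
)
OPTIONAL = ("project", "authors", "used_datasets", "known_characters")
SOFTWARE_KEYS = ("name", "link", "version")
MATERIAL_VALUES = ("manuscript", "print", "both")


def validate(record: Dict) -> List[str]:
    """Return a list of problems; an empty list means the record passes."""
    problems = [
        f"missing required property: {key}"
        for key in REQUIRED
        if not record.get(key)
    ]
    # Software is itself a small structure: name, link and version.
    software = record.get("software")
    if isinstance(software, dict):
        problems += [
            f"missing software.{key}"
            for key in SOFTWARE_KEYS
            if not software.get(key)
        ]
    material = record.get("material")
    if material and material not in MATERIAL_VALUES:
        problems.append(f"unknown material: {material}")
    # Flag anything that is neither required nor optional.
    unknown = set(record) - set(REQUIRED) - set(OPTIONAL)
    problems += [f"unknown property: {key}" for key in sorted(unknown)]
    return problems
```

Returning a list of problems rather than raising on the first one would let a contributor fix a record in a single pass.
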
alix-tz added the question (Further information is requested) and schema labels on Dec 2, 2022
@alix-tz
Member Author

alix-tz commented Dec 2, 2022

I think the software should be one of the first things to appear, because if I'm using Transkribus, I won't care that model X or Y is able to handle French if it is a Kraken model.

Now that raises an important question: given that Transkribus already provides a page listing public transcription models (https://readcoop.eu/transkribus/public-models/), do we want to also cover Transkribus models?

Personally, I would lean in favor of it [1], but it makes things a little more complicated: for example, License, Encoding and DOI [2] might be impossible to fill for Transkribus models.

Footnotes

  1. Because 1) it might attract Transkribus users who didn't think of sharing their data/ground truth, 2) users might choose a software depending on the availability of models, 3) we can do better than the current metadata used by Transkribus.

  2. No DOI in Transkribus but models do have a unique ID.
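
One way to handle the missing DOI could be to accept a software-specific identifier as a fallback. The sketch below assumes hypothetical property names (`software_name`, `doi`, `model_id`) and a hypothetical `identifier_problem` helper; it only illustrates the idea and says nothing about how Transkribus exposes its model IDs.

```python
from typing import Dict, Optional

# Assumption: software for which no DOI exists, mapped to the hypothetical
# property that would hold the software-specific model ID instead.
FALLBACK_ID_PROPERTY = {"transkribus": "model_id"}


def identifier_problem(record: Dict) -> Optional[str]:
    """Return None if the record is identifiable, else what is missing."""
    if record.get("doi"):
        return None
    software = str(record.get("software_name", "")).lower()
    fallback = FALLBACK_ID_PROPERTY.get(software)
    if fallback and record.get(fallback):
        return None
    if fallback:
        return f"{software} model needs a DOI or a {fallback}"
    return "model needs a DOI"
```

The same pattern could relax License and Encoding for software where they cannot be filled, at the cost of a schema whose required properties depend on the software.
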

@mittagessen

Sorry for only starting to participate now. Something that is rather important is a field that indicates the type of model, e.g. transcription, segmentation, reading order, ... in addition to the software so it is possible to filter according to what one is actually looking for without having to download individual models. That would probably require changing the semantics of the known characters field to something like possible outputs.
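
As a rough sketch of what such filtering could look like, assuming hypothetical `software`, `model_type` and `possible_outputs` properties on each record (the catalog entries below are invented for illustration):

```python
from enum import Enum
from typing import Dict, Iterable, List


class ModelType(Enum):
    # Values taken from the examples above; the list is open-ended.
    TRANSCRIPTION = "transcription"
    SEGMENTATION = "segmentation"
    READING_ORDER = "reading order"


def find_models(
    catalog: Iterable[Dict], software: str, model_type: ModelType
) -> List[Dict]:
    """Select records by software and model type, using metadata only."""
    return [
        record
        for record in catalog
        if record["software"].lower() == software.lower()
        and record["model_type"] == model_type.value
    ]


# Invented records: "possible_outputs" holds characters for a transcription
# model and zone/line types for a segmentation model.
catalog = [
    {"title": "A", "software": "Kraken", "model_type": "transcription",
     "possible_outputs": ["a", "b", "c"]},
    {"title": "B", "software": "Kraken", "model_type": "segmentation",
     "possible_outputs": ["MainZone", "DefaultLine"]},
]
segmenters = find_models(catalog, "kraken", ModelType.SEGMENTATION)
```

Nothing needs to be downloaded for this kind of query, which is the point of recording the model type in the metadata.
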

As @PonteIneptique correctly identified, models are somewhat ephemeral. In my opinion we should at least provide guidelines on how to deal with that. One (not particularly well thought out) way could be to treat the record/DOI as a 'prototype' model for the given dataset(s) and a particular software, and to publish replacement models, e.g. a tweaked architecture improving performance, as a version linked to that original model instead of creating a completely new record. This is primarily to reduce the noise level in any model repository, but it might have other benefits as well, such as incentivizing early publication of models.
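
A minimal sketch of that 'prototype record plus linked versions' idea, assuming a hypothetical `ModelRecord` structure and ISO 8601 date strings (none of these names come from an existing schema):

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ModelVersion:
    version: str
    created: str  # ISO 8601 date, e.g. "2022-12-02"
    note: str = ""


@dataclass
class ModelRecord:
    doi: str  # identifies the 'prototype' record, not one model file
    software: str
    datasets: List[str]
    versions: List[ModelVersion] = field(default_factory=list)

    def publish_replacement(self, version: str, created: str, note: str = "") -> None:
        """Attach a new model version to this record instead of creating a new record."""
        if any(v.version == version for v in self.versions):
            raise ValueError(f"version {version} already exists")
        self.versions.append(ModelVersion(version, created, note))

    def latest(self) -> Optional[ModelVersion]:
        """Return the most recently created version, or None if there is none."""
        # ISO 8601 dates sort correctly as plain strings.
        return max(self.versions, key=lambda v: v.created, default=None)
```

A repository built this way would list one entry per dataset and software combination, with the version history behind it.
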

@alix-tz
Member Author

alix-tz commented Dec 2, 2022

Ah your comment reminds me that we should probably include a "date of creation" property!

@tboenig
Contributor

tboenig commented Mar 6, 2023

Hello all,

Unfortunately, I could not take part in the discussion earlier, and I would like to pick it up now.
If I understood everything correctly, there should be

  • a schema for GT metadata and
  • a schema for models.

The two schemas are closely related in content, but each has its own specific features.

The schema for GT is currently stable.
The schema for models is still under development.

My proposal for describing model metadata has always been based on the GT.
Example scenario:
GT is created and described with metadata. A model is trained on this GT, and the model is recorded in that metadata record.

Of course, there are other scenarios. Either

  1. I use a single, very specific GT and create only one model,
    or
  2. I combine several GTs, merge them into one GT, and create a model from it.

In the first case, there should be a link between the model and the GT.
In the second case, I would consider the merged result to be new GT, which

  1. gets its own independent metadata record plus a model metadata record,
  2. with a note in the standalone metadata record stating that it was created from GT ...

For now I have described all of this in natural language only, since I assume the formal notation will then be easier to write.
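
As a starting point for that formalisation, the two scenarios could be sketched as follows. The names `GTRecord`, `ModelRecord`, `based_on` and `trained_on` are assumptions made for illustration and are not part of the existing GT schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class GTRecord:
    identifier: str
    # Empty for original GT; lists the source GT identifiers for merged GT.
    based_on: List[str] = field(default_factory=list)


@dataclass
class ModelRecord:
    identifier: str
    trained_on: str  # identifier of exactly one GT record


def merge_gt(new_identifier: str, sources: List[GTRecord]) -> GTRecord:
    """Scenario 2: merged GT gets its own record that notes its sources."""
    return GTRecord(new_identifier, based_on=[s.identifier for s in sources])


def provenance(model: ModelRecord, gt_index: Dict[str, GTRecord]) -> List[str]:
    """Return every GT identifier the model ultimately depends on."""
    seen: List[str] = []
    stack = [model.trained_on]
    while stack:
        identifier = stack.pop()
        if identifier in seen:
            continue
        seen.append(identifier)
        record = gt_index.get(identifier)
        if record is not None:
            stack.extend(record.based_on)
    return seen
```

In scenario 1, `provenance` returns just the one GT; in scenario 2, it returns the merged GT followed by the GTs it was built from, so the link to the original data is never lost.
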


3 participants