Describing data type: what exactly to describe and what controlled vocab(s) to use #14

Open
kmexter opened this issue Dec 4, 2023 · 15 comments
Assignees
Labels
todo 2025 need more data before can do: a MUST for 2025

Comments

@kmexter
Contributor

kmexter commented Dec 4, 2023

We think it is useful to add metadata describing the type of data that a dataset contains, but we are not sure exactly what we want to describe here:

  • the data format (i.e. the suffix)
  • a general data type (e.g. "spreadsheet")
  • something else

In this issue we need to decide on this, and decide on the semantics to use.

We have to decide whether we want this metadatum to be useful as a piece of technical information (e.g. for OceanInfoHub) or for the audience (scientists, who are also those providing the descriptions in the first place). Personally, I think the second is better, mainly because the scientists describing the data will find that easier.

I copy below the discussion we have had so far in email

@kmexter
Contributor Author

kmexter commented Dec 4, 2023

first email
Can we choose the terms in schema.org that we should use to describe data types in the dataset descriptions?

This list should include:

  • text - I know there is schema:Text
  • spreadsheet: schema:SpreadsheetDigitalDocument? or application?
  • media (even images, video separately) schema:audio and schema:image and video?
  • not sure how omics files would be listed: probably text, since they are text of one sort or another
  • schema:Map (for indicators that are maps)?
  • Plots = I guess would be images?
  • any other types of data we know MBO cover?
  • DigitalDocument for documents?

@kmexter
Contributor Author

kmexter commented Dec 4, 2023

From @pieterprovoost

I'm not entirely sure what the intention is here. Is this meant to go in Dataset.additionalType for example to categorize datasets, or rather in DataDownload.encodingFormat to indicate in what format the data are available? If it's the latter then we should use MIME types. If this is about categorizing datasets then I don't think we'll get there with schema.org classes, but I'm not sure what to suggest instead. Dublin Core has things like dcterms:StillImage, dcterms:Text, dcterms:MovingImage but maybe we are more interested in differentiating between sequence based data, imaging based data, acoustics based data, etc.

schema:Text is not applicable here as this is a DataType (as in boolean, number, date, etc).

@kmexter
Contributor Author

kmexter commented Dec 4, 2023

From @marc-portier

there is indeed a range of different aspects in here, all of which could be useful at some point

  1. a loose (human-like) description of what kind of file --> schema.org/CreativeWork has a number of subtypes that could fit too (schema.org is probably closer to the ODIS approach, as well as to things like RO-Crate)
  2. technical formats like MIME types, possibly also including character encodings --> https://schema.org/encodingFormat does seem to do the trick there
  3. deeper content-conformity and schema-descriptions --> things we have been experimenting with inside Fair-Ease

My guess is that we should keep 2 and 3 above as recommended and optional, respectively, for the time being -- but make clear that as one grows in formally describing the distributions, one unlocks more and more useful side-effects.

But more importantly -- let us not try to mix these distinct aspects...

@kmexter
Contributor Author

kmexter commented Dec 4, 2023

MARCO-BOLO-WP1

In the case of mime types (but I agree that it's not so easy-to-use for scientists involved in metadata creation) I would consider the IANA mime-types list (https://www.iana.org/assignments/media-types/media-types.xhtml#application)

Or is the intent instead to define a dataset type such as “series” in general? In that case I could suggest the INSPIRE registry, but it is focused on spatial datasets. Series is defined by http://inspire.ec.europa.eu/metadata-codelist/ResourceType/series

I will try to find something more abstract; at the moment I have no better idea.

@kmexter
Contributor Author

kmexter commented Dec 7, 2023

Since we need this Google Sheet released asap, Marc and I have chosen https://www.iana.org/assignments/media-types/media-types.xhtml as the place to choose the MimeType from, and that is what the column is now called. Please shout if you disagree.

@pieterprovoost
Contributor

@kmexter I wonder how useful this is if we are not collecting distribution URLs at the same time. How are we going to use this information? A MIME type is a property of a specific file, not of a dataset. Most datasets will include a variety of MIME types, Darwin Core archives for example are collections of text/csv and text/xml in application/zip.
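
To make the point about archives concrete, here is a minimal Python sketch that counts the MIME types of the files inside a zip archive. It is only an illustration under stated assumptions: the suffix lookup table, the helper names, and the three example file names are made up for this sketch, and a real check would draw its mapping from the IANA registry.

```python
import io
import zipfile
from collections import Counter
from pathlib import PurePosixPath

# Hypothetical lookup table for this sketch only; not an exhaustive mapping.
MIME_BY_SUFFIX = {".csv": "text/csv", ".xml": "text/xml", ".txt": "text/plain"}


def mime_types_in_archive(archive_bytes: bytes) -> Counter:
    """Count the MIME types of the files contained in a zip archive."""
    counts = Counter()
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        for name in archive.namelist():
            if name.endswith("/"):
                continue  # skip directory entries
            suffix = PurePosixPath(name).suffix.lower()
            counts[MIME_BY_SUFFIX.get(suffix, "application/octet-stream")] += 1
    return counts


def build_example_archive() -> bytes:
    """Build a tiny in-memory archive laid out like a Darwin Core archive."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("occurrence.csv", "id,scientificName\n1,Mytilus edulis\n")
        archive.writestr("meta.xml", "<archive/>")
        archive.writestr("eml.xml", "<eml/>")
    return buffer.getvalue()


counts = mime_types_in_archive(build_example_archive())
print(dict(counts))
```

The archive itself would be served as application/zip, while its members are a mix of text/csv and text/xml, so a single MIME type per dataset cannot capture both levels.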

@kmexter
Contributor Author

kmexter commented Dec 11, 2023

Well, yes and no. It is useful to the person looking at the record ("ah, these are image data, yes I want image data"), but to ODIS it may not be useful information. In my mind it is a bit like the usefulness that keywords provide.
Yes, there could be several MIME types; that is OK, as indeed a single dataset can contain different types of data.
We can choose to remove this. Personally I think it is useful, but I don't object to being overridden.

@kmexter
Contributor Author

kmexter commented Dec 12, 2023

Also...we could collect the distribution URLs - I mean, there is a field for it in the ODIS online example, so I am a bit uncertain why we are not asking for this from the MBO peeps also. It depends on the purpose of the ODIS record, I guess: for data already published in a catalogue, this record is a secondary one, but for data NOT already published, then this would be the primary record.....

@marc-portier

I agree that the media type is only meaningful when associated with a downloadURL of the distribution (and then it is also obvious there is only one).

I also agree that in many cases the MIME type has only limited value -- but better than nothing? The next level would be the schema conformity of the dataset (as suggested as one of the other aspects).

@kmexter
Contributor Author

kmexter commented Dec 12, 2023

More comments welcome, everyone from MBO WP1! As I do need to know which to do, ideally by end of this week

@kmexter
Contributor Author

kmexter commented Dec 14, 2023

I am leaning away from MIME type now. For me, the point of this was to allow scientists to understand what is in the dataset before they bother to download it. So I would have this field as a literal -- because we cannot accommodate all the data types via schema. I would suggest using:

  • schema:Text
  • schema:SpreadsheetDigitalDocument
  • schema:audioObject
  • schema:ImageObject
  • not sure how omics files would be listed: text is not suitable, we will have to diverge from schema here....
  • schema:Map

that, or get rid of this metadatum completely.

@pieterprovoost
Contributor

May I suggest the following, which also covers sequence data. If we can find a term for sequence data in some other ontology, it can go into additionalType.

{
    "@type": "Dataset",
    "hasPart": [
        {
            "@type": "ImageObject"
        },
        {
            "@type": "TextObject",
            "encodingFormat": "text/csv"
        },
        {
            "@type": "TextObject",
            "encodingFormat": "text/fasta"
        }
    ]
}

Alternatively, we can use distribution:

{
    "@type": "Dataset",
    "distribution": [
        {
            "@type": "DataDownload",
            "additionalType": "ImageObject"
        },
        {
            "@type": "DataDownload",
            "additionalType": "TextObject",
            "encodingFormat": "text/csv"
        },
        {
            "@type": "DataDownload",
            "additionalType": "TextObject",
            "encodingFormat": "text/fasta"
        }
    ]
}

In any case, don't use schema:Text.
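
The warning against schema:Text could be checked automatically. Below is a rough Python sketch, not an agreed validation rule: the function name, the path notation, and the example record are assumptions made for illustration. It walks a JSON-LD description and reports wherever a schema.org DataType name is used as `@type` or `additionalType`.

```python
import json

# schema.org DataType names: they describe literal values, not kinds of files.
SCHEMA_DATA_TYPES = {"Boolean", "Date", "DateTime", "Number", "Text", "Time"}


def find_data_type_misuse(node, path="$"):
    """Return the paths where a DataType name is used as a type of thing."""
    problems = []
    if isinstance(node, dict):
        for key, value in node.items():
            child_path = f"{path}.{key}"
            if key in ("@type", "additionalType"):
                values = value if isinstance(value, list) else [value]
                for item in values:
                    if not isinstance(item, str):
                        continue
                    # Accept both "Text" and the prefixed form "schema:Text".
                    name = item[len("schema:"):] if item.startswith("schema:") else item
                    if name in SCHEMA_DATA_TYPES:
                        problems.append(child_path)
            else:
                problems.extend(find_data_type_misuse(value, child_path))
    elif isinstance(node, list):
        for index, item in enumerate(node):
            problems.extend(find_data_type_misuse(item, f"{path}[{index}]"))
    return problems


# Made-up record: the second distribution misuses "Text" on purpose.
dataset = json.loads("""
{
    "@type": "Dataset",
    "distribution": [
        {"@type": "DataDownload", "additionalType": "ImageObject"},
        {"@type": "DataDownload", "additionalType": "Text", "encodingFormat": "text/csv"}
    ]
}
""")

problems = find_data_type_misuse(dataset)
print(problems)
```

Parsing the record with `json.loads` also catches syntax slips such as trailing commas before the type check runs.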

@kmexter
Contributor Author

kmexter commented Dec 14, 2023

Thumbs-up to that suggestion, Pieter.

@marc-portier

minor glitch probably:

  • schema:audioObject

Type-names are typically Uppercase --> https://schema.org/AudioObject
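
Casing slips like this are easy to catch with a case-insensitive lookup. Here is a small Python sketch, assuming a hand-maintained list of the schema.org type names we expect to see; the list, the function name, and the example inputs are illustrative assumptions rather than an agreed vocabulary.

```python
# Hypothetical shortlist of schema.org type names expected in our records.
KNOWN_TYPES = [
    "AudioObject",
    "ImageObject",
    "TextObject",
    "VideoObject",
    "Map",
    "Dataset",
    "DataDownload",
    "SpreadsheetDigitalDocument",
]

_BY_LOWER = {name.lower(): name for name in KNOWN_TYPES}


def canonical_type_name(name):
    """Return the correctly cased type name, or None if it is not in the list.

    Handles bare names ("audioObject") and prefixed names ("schema:audioObject");
    full URLs are not handled in this sketch.
    """
    prefix, sep, local = name.rpartition(":")
    canonical = _BY_LOWER.get(local.lower())
    if canonical is None:
        return None
    return f"{prefix}{sep}{canonical}"


print(canonical_type_name("schema:audioObject"))
print(canonical_type_name("SpreadSheetDigitalDocument"))
```

Returning None for unknown names keeps the check conservative: it only corrects casing for names already on the list and never guesses at new ones.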

@kmexter
Contributor Author

kmexter commented Aug 22, 2024

When we have enough datasets from where we can harvest metadata (and perhaps ping those data), we can go further on this. For source data I think it is unlikely we will get this info, as it is not routinely held in metadata records (with some notable exceptions) but we should ask ourselves if we really want to push the WPs into providing this for the data that they create in MBO. We are already struggling to get info from them! TBD.

@kmexter kmexter self-assigned this Sep 4, 2024
@kmexter kmexter added the todo 2025 need more data before can do: a MUST for 2025 label Sep 4, 2024
3 participants