Allow to pass additional metadata to codemeta that is not or cannot be automatically harvested #51

Open
broeder-j opened this issue Oct 27, 2023 · 6 comments
Labels
enhancement New feature or request

Comments

@broeder-j
Member

broeder-j commented Oct 27, 2023

Describe the bug
somesy sync deletes existing '@id's from the affiliations of authors and contributors.
It also removes the '@id' of the resource itself, as well as all key-value pairs that codemeta allows but that somesy does not handle internally.

For me this is a critical bug; I won't use somesy until it is fixed.

Expected behavior

  • Leave all codemeta keys which are there (and not part of somesy metadata) untouched.
  • Do not delete any '@id's unless somesy supports them. Affiliations are not supported, so their '@id's should at least be left alone.

To Reproduce
Steps to reproduce the behavior:

  1. Have a rich codemeta.json.
  2. Run somesy with a somesy.toml that is as complete as possible.

(Sorry for not condensing this to a minimal example.)
Some of the changes stem from the Python project metadata.
Example codemeta.json (input):

{
    "@context": [
        "https://doi.org/10.5063/schema/codemeta-2.0",
        "https://w3id.org/software-iodata",
        "https://raw.githubusercontent.com/jantman/repostatus.org/master/badges/latest/ontology.jsonld",
        "https://schema.org",
        "https://w3id.org/software-types"
    ],
    "@id" : "https://codebase.helmholtz.cloud/hmc/hmc-public/unhide/data_harvesting",
    "@type": "SoftwareSourceCode",
    "author": [
        {
            "@id" : "https://orcid.org/0000-0001-7939-226X",
            "@type": "Person",
            "email": "j.broeder@fz-juelich.de",
            "familyName": "Bröder",
            "givenName": "Jens",
            "affiliation": {
                "@id": "https://ror.org/02nv7yv05",
                "name": "Forschungszentrum Jülich GmbH"
            }
        },
        {
        "@id":"https://orcid.org/0000-0002-0070-4337",
            "@type": "Person",
            "email": "a.strupp@fz-juelich.de",
            "familyName": "Strupp",
            "givenName": "Annika",
            "affiliation": {
                "@id": "https://ror.org/02nv7yv05",
                "name": "Forschungszentrum Jülich GmbH"
            }
        },
        {
        "@id": "https://orcid.org/0000-0003-0000-4784",
            "@type": "Person",
            "email": "p.videgain.barranco@fz-juelich.de",
            "familyName": "Videgain Barranco",
            "givenName": "Pedro",
            "affiliation": {
                "@id": "https://ror.org/02nv7yv05",
                "name": "Forschungszentrum Jülich GmbH"
            }
        },
        {
        "@id": "https://orcid.org/0000-0002-2818-5890",
            "@type": "Person",
            "email": "s.fathalla@fz-juelich.de",
            "familyName": "Fathalla",
            "givenName": "Said",
            "affiliation": {
                "@id": "https://ror.org/02nv7yv05",
                "name": "Forschungszentrum Jülich GmbH"
            }
        },
        {
        "@id": "https://orcid.org/0000-0002-3968-2446",
            "@type": "Person",
            "email": "gabriel.preuss@helmholtz-berlin.de",
            "familyName": "Preuß",
            "givenName": "Gabriel",
            "affiliation": {
                "@id": "https://ror.org/02aj13c28",
                "name": "Helmholtz-Zentrum Berlin für Materialien und Energie"
            }
        }
    ],
    "codeRepository": "https://codebase.helmholtz.cloud/hmc/hmc-public/unhide/data_harvesting.git",
    "contributor": [
        {
            "@id" : "https://orcid.org/0000-0001-7939-226X",
            "@type": "Person",
            "email": "j.broeder@fz-juelich.de",
            "familyName": "Bröder",
            "givenName": "Jens",
            "affiliation": {
                "@type": "Organization",
                "@id": "https://ror.org/02nv7yv05",
                "name": "Forschungszentrum Jülich GmbH"
            }
        },
        {   
        "@id":"https://orcid.org/0000-0002-0070-4337",
            "@type": "Person",
            "email": "a.strupp@fz-juelich.de",
            "familyName": "Strupp",
            "givenName": "Annika",
            "affiliation": {
                "@type": "Organization",
                "@id": "https://ror.org/02nv7yv05",
                "name": "Forschungszentrum Jülich GmbH"
            }
        },
        {
            "@id": "https://orcid.org/0000-0003-0000-4784",
            "@type": "Person",
            "email": "p.videgain.barranco@fz-juelich.de",
            "familyName": "Videgain Barranco",
            "givenName": "Pedro",
            "affiliation": {
                "@type": "Organization",
                "@id": "https://ror.org/02nv7yv05",
                "name": "Forschungszentrum Jülich GmbH"
            }
        },
        {
        "@id": "https://orcid.org/0000-0002-2818-5890",
            "@type": "Person",
            "email": "s.fathalla@fz-juelich.de",
            "familyName": "Fathalla",
            "givenName": "Said",
            "affiliation": {
                "@id": "https://ror.org/02nv7yv05",
                "name": "Forschungszentrum Jülich GmbH"
            }
        },
        {
        "@id": "https://orcid.org/0000-0002-3968-2446",
            "@type": "Person",
            "email": "gabriel.preuss@helmholtz-berlin.de",
            "familyName": "Preuß",
            "givenName": "Gabriel",
            "affiliation": {
                "@id": "https://ror.org/02aj13c28",
                "name": "Helmholtz-Zentrum Berlin für Materialien und Energie"
            }
        }
    ],
    "dateCreated": "2021-07-09",
    "dateModified": "2022-12-19",
    "datePublished": "2023-03-01",
    "developmentStatus": "https://www.repostatus.org/#wip",
    "identifier": "data_harvesting",
    "maintainer": {
        "@id" : "https://orcid.org/0000-0001-7939-226X",
        "@type": "Person",
        "email": "j.broeder@fz-juelich.de",
        "familyName": "Bröder",
        "givenName": "Jens",
        "affiliation": {
            "@type": "Organization",
            "@id": "https://ror.org/02nv7yv05",
            "name": "Forschungszentrum Jülich GmbH"
        }
    },
    "programmingLanguage": [
        "Python",
        "Shell"
    ],
    "readme": "https://codebase.helmholtz.cloud/hmc/hmc-public/unhide/data_harvesting/-/blob/main/README.md",
    "softwareHelp": "https://codebase.helmholtz.cloud/hmc/hmc-public/unhide/data_harvesting",
    "license": "http://spdx.org/licenses/MIT",
    "name": "data-harvesting",
    "runtimePlatform": [
        "Python 3",
        "Python 3.7",
        "Python 3.8",
        "Python 3.9",
        "Python 3.10",
        "Python 3.11"
    ],
    "keywords": [
        "unhide", "Helmholtz association", "data mining", "HMC", "metadata", "data publications", 
        "software publication", "RSE", "FAIR", "linked data", "knowledge graph", "json-ld", "schema.org", "restruct"
    ],
    "softwareRequirements": [
        {
            "@type": "SoftwareApplication",
            "identifier": "advertools",
            "name": "advertools",
            "runtimePlatform": "Python 3"
        },
        {
            "@type": "SoftwareApplication",
            "identifier": "beautifulsoup4",
            "name": "beautifulsoup4",
            "runtimePlatform": "Python 3"
        },
        {
            "@type": "SoftwareApplication",
            "identifier": "pandas",
            "name": "pandas",
            "runtimePlatform": "Python 3"
        },
        {
            "@type": "SoftwareApplication",
            "identifier": "requests",
            "name": "requests",
            "runtimePlatform": "Python 3"
        },
        {
            "@type": "SoftwareApplication",
            "identifier": "typer",
            "name": "typer",
            "runtimePlatform": "Python 3"
        },
        {
            "@type": "SoftwareApplication",
            "identifier": "rich",
            "name": "rich",
            "runtimePlatform": "Python 3"
        },
        {
            "@type": "SoftwareApplication",
            "identifier": "progressbar2",
            "name": "progressbar2",
            "runtimePlatform": "Python 3"
        },
        {
            "@type": "SoftwareApplication",
            "identifier": "rdflib",
            "name": "rdflib",
            "runtimePlatform": "Python 3"
        },
        {
            "@type": "SoftwareApplication",
            "identifier": "codemetapy",
            "name": "codemetapy",
            "runtimePlatform": "Python 3"
        },
        {
            "@type": "SoftwareApplication",
            "identifier": "pyld",
            "name": "pyld",
            "runtimePlatform": "Python 3"
        },
        {
            "@type": "SoftwareApplication",
            "identifier": "extruct",
            "name": "extruct",
            "runtimePlatform": "Python 3"
        },
        {
            "@type": "SoftwareApplication",
            "identifier": "pyld",
            "name": "pyld",
            "runtimePlatform": "Python 3"
        },
        {
            "@type": "SoftwareApplication",
            "identifier": "jq",
            "name": "jq",
            "runtimePlatform": "Python 3"
        },
        {
            "@type": "SoftwareApplication",
            "identifier": "jsonschema",
            "name": "jsonschema",
            "runtimePlatform": "Python 3"
        },
        {
            "@type": "SoftwareApplication",
            "identifier": "xmljson",
            "name": "xmljson",
            "runtimePlatform": "Python 3"
        },
        {
            "@type": "SoftwareApplication",
            "identifier": "codemeta-harvester",
            "name": "codemeta-harvester",
            "runtimePlatform": "Shell"
        }
    ],
    "targetProduct": {
        "@type": "CommandLineApplication",
        "executableName": "hmc_unhide",
        "name": "hmc_unhide",
        "runtimePlatform": ["Shell", "Python 3"]
    },
    "applicationCategory": "Library, Data science",
    "downloadUrl": "https://codebase.helmholtz.cloud/hmc/hmc-public/unhide/data_harvesting/-/archive/main/data_harvesting-main.zip",
    "fileSize": "57MB",
    "operatingSystem": ["OSX", "Linux"],
    "releaseNotes": "https://codebase.helmholtz.cloud/hmc/hmc-public/unhide/data_harvesting/-/blob/main/CHANGELOG.md",    
    "url": "https://codebase.helmholtz.cloud/hmc/hmc-public/unhide/data_harvesting",
    "description": "Set of tools to harvest, process and uplift (meta)data from metadata providers within the Helmholtz association to be included in the Helmholtz Knowledge Graph (Helmholtz-KG). The harvested linked data in the form of schema.org jsonld is aggregated and uplifted in data pipelines to be included into a single large knowledge graph (KG). The tool set and harvesters can be used as a python library or over a commandline interface (CLI, hmc-unhide). Provenance of metadata changes is tracked rudimentary by saving graph patches of changes on rdflib Graph data structures on the semantic triple level. Harvesters support extracting data via sitemap, gitlab API, datacite API and OAI-PMH endpoints.",
    "copyrightHolder": [
        {"@id" : "Helmholtz Metadata Collaboration (HMC)",
             "@type": "Organization",
             "name": "Helmholtz Metadata Collaboration (HMC)"},
         {"@type": "Organization",
         "name": "Forschungszentrum Jülich GmbH (Materials Data Science and Informatics (IAS-9), Institute for Advanced Simulation (IAS)), Jülich, Germany"}],
    "copyrightYear" : "2022",
    "softwareVersion": "1.1.0",
    "isAccessibleForFree": "True",
    "contIntegration": "https://codebase.helmholtz.cloud/hmc/hmc-public/unhide/data_harvesting/-/pipelines",
    "buildInstructions": "https://codebase.helmholtz.cloud/hmc/hmc-public/unhide/data_harvesting/-/blob/main/pyproject.toml",
    "issueTracker": "https://codebase.helmholtz.cloud/hmc/hmc-public/unhide/data_harvesting/-/issues"
}

somesy.toml (input):

[project]
name = "data-harvesting"
version = "1.1.0"
description = "Set of tools to harvest, process and uplift (meta)data from metadata providers within the Helmholtz association to be included in the Helmholtz Knowledge Graph (Helmholtz-KG). The harvested linked data in the form of schema.org jsonld is aggregated and uplifted in data pipelines to be included into a single large knowledge graph (KG). The tool set and harvesters can be used as a python library or over a commandline interface (CLI, hmc-unhide). Provenance of metadata changes is tracked rudimentary by saving graph patches of changes on rdflib Graph data structures on the semantic triple level. Harvesters support extracting data via sitemap, gitlab API, datacite API and OAI-PMH endpoints."

keywords = [
        "unhide", "Helmholtz association", "data mining", "HMC", "metadata", "data publications",
        "software publication", "RSE", "FAIR", "linked data", "knowledge graph", "json-ld", "schema.org", "restruct"
    ]
license = "MIT"
repository = "https://codebase.helmholtz.cloud/hmc/hmc-public/unhide/data_harvesting.git"
homepage = "https://codebase.helmholtz.cloud/hmc/hmc-public/unhide/data_harvesting"

[[project.people]]
given-names = "Jens"
family-names = "Bröder"
email = "j.broeder@fz-juelich.de"
orcid = "https://orcid.org/0000-0001-7939-226X"
author = true      # is a full author of the project (i.e. appears in citations)
maintainer = true  # currently maintains the project (i.e. is a contact person)
# publication_author = true

[[project.people]]
given-names = "Annika"
family-names = "Strupp"
email = "a.strupp@fz-juelich.de"
orcid = "https://orcid.org/0000-0002-0090-4337"
author = true

[[project.people]]
orcid = "https://orcid.org/0000-0003-0000-4784"
email = "p.videgain.barranco@fz-juelich.de"
family-names = "Videgain Barranco"
given-names = "Pedro"
affiliation = "Forschungszentrum Jülich GmbH"
author = true
# publication_author = true

[[project.people]]
orcid = "https://orcid.org/0000-0002-2818-5890"
email = "s.fathalla@fz-juelich.de"
family-names = "Fathalla"
given-names = "Said"
affiliation = "Forschungszentrum Jülich GmbH"
author = true
# publication_author = true

[[project.people]]
orcid = "https://orcid.org/0000-0002-3968-2446"
email = "gabriel.preuss@helmholtz-berlin.de"
family-names = "Preuß"
given-names = "Gabriel"
affiliation = "Helmholtz-Zentrum Berlin für Materialien und Energie"
author = true
maintainer = true
# publication_author = true


[config]
verbose = true     # show detailed information about what somesy is doing

The 'crippled' codemeta.json (output):

{
    "@context": [
        "https://doi.org/10.5063/schema/codemeta-2.0",
        "https://w3id.org/software-iodata",
        "https://raw.githubusercontent.com/jantman/repostatus.org/master/badges/latest/ontology.jsonld",
        "https://schema.org",
        "https://w3id.org/software-types"
    ],
    "@type": "SoftwareSourceCode",
    "applicationCategory": [
        "Database",
        "Education",
        "Scientific/Engineering",
        "Scientific/Engineering > Information Analysis",
        "Scientific/Engineering > Visualization",
        "Text Processing"
    ],
    "audience": [
        {
            "@type": "Audience",
            "audienceType": "Science/Research"
        },
        {
            "@type": "Audience",
            "audienceType": "Information Technology"
        }
    ],
    "author": [
        {
            "@id": "https://orcid.org/0000-0001-7939-226X",
            "@type": "Person",
            "familyName": "Bröder",
            "givenName": "Jens"
        },
        {
            "@id": "https://orcid.org/0000-0002-0070-4337",
            "@type": "Person",
            "familyName": "Strupp",
            "givenName": "Annika"
        },
        {
            "@id": "https://orcid.org/0000-0003-0000-4784",
            "@type": "Person",
            "affiliation": {
                "@type": "Organization",
                "legalName": "Forschungszentrum Jülich GmbH"
            },
            "familyName": "Videgain Barranco",
            "givenName": "Pedro"
        },
        {
            "@id": "https://orcid.org/0000-0002-2818-5890",
            "@type": "Person",
            "affiliation": {
                "@type": "Organization",
                "legalName": "Forschungszentrum Jülich GmbH"
            },
            "familyName": "Fathalla",
            "givenName": "Said"
        },
        {
            "@id": "https://orcid.org/0000-0002-3968-2446",
            "@type": "Person",
            "affiliation": {
                "@type": "Organization",
                "legalName": "Helmholtz-Zentrum Berlin für Materialien und Energie"
            },
            "familyName": "Preuß",
            "givenName": "Gabriel"
        }
    ],
    "codeRepository": "https://codebase.helmholtz.cloud/hmc/hmc-public/unhide/data_harvesting.git",
    "description": "Set of tools to harvest, process and uplift (meta)data from metadata providers within the Helmholtz association to be included in the Helmholtz Knowledge Graph (Helmholtz-KG). The harvested linked data in the form of schema.org jsonld is aggregated and uplifted in data pipelines to be included into a single large knowledge graph (KG). The tool set and harvesters can be used as a python library or over a commandline interface (CLI, hmc-unhide). Provenance of metadata changes is tracked rudimentary by saving graph patches of changes on rdflib Graph data structures on the semantic triple level. Harvesters support extracting data via sitemap, gitlab API, datacite API and OAI-PMH endpoints.",
    "developmentStatus": "https://www.repostatus.org/#wip",
    "identifier": "data-harvesting",
    "keywords": [
        "FAIR",
        "HMC",
        "Helmholtz association",
        "RSE",
        "data mining",
        "data publications",
        "json-ld",
        "knowledge graph",
        "linked data",
        "metadata",
        "restruct",
        "schema.org",
        "software publication",
        "unhide"
    ],
    "license": "https://spdx.org/licenses/MIT",
    "name": "data-harvesting",
    "operatingSystem": [
        "MacOS > MacOS X",
        "POSIX > Linux"
    ],
    "runtimePlatform": [
        "Python",
        "Python 3",
        "Python 3.10",
        "Python 3.11",
        "Python 3.12",
        "Python 3.9"
    ],
    "softwareRequirements": [
        {
            "@type": "SoftwareApplication",
            "identifier": "SPARQLWrapper",
            "name": "SPARQLWrapper",
            "runtimePlatform": "Python 3",
            "version": "^2.0.0"
        },
        {
            "@type": "SoftwareApplication",
            "identifier": "advertools",
            "name": "advertools",
            "runtimePlatform": "Python 3",
            "version": "^0.13.2"
        },
        {
            "@type": "SoftwareApplication",
            "identifier": "beautifulsoup4",
            "name": "beautifulsoup4",
            "runtimePlatform": "Python 3",
            "version": "^4.11.1"
        },
        {
            "@type": "SoftwareApplication",
            "identifier": "codemetapy",
            "name": "codemetapy",
            "runtimePlatform": "Python 3",
            "version": "^2.3.3"
        },
        {
            "@type": "SoftwareApplication",
            "identifier": "crontab",
            "name": "crontab",
            "runtimePlatform": "Python 3",
            "version": "^1.0.1"
        },
        {
            "@type": "SoftwareApplication",
            "identifier": "extruct",
            "name": "extruct",
            "runtimePlatform": "Python 3",
            "version": "^0.13.0"
        },
        {
            "@type": "SoftwareApplication",
            "identifier": "jq",
            "name": "jq",
            "runtimePlatform": "Python 3",
            "version": "^1.3.0"
        },
        {
            "@type": "SoftwareApplication",
            "identifier": "jsondiff",
            "name": "jsondiff",
            "runtimePlatform": "Python 3",
            "version": "^2.0.0"
        },
        {
            "@type": "SoftwareApplication",
            "identifier": "jsonschema",
            "name": "jsonschema",
            "runtimePlatform": "Python 3",
            "version": "^4.17.3"
        },
        {
            "@type": "SoftwareApplication",
            "identifier": "oaiharvest",
            "name": "oaiharvest",
            "runtimePlatform": "Python 3",
            "version": "^3.0.0"
        },
        {
            "@type": "SoftwareApplication",
            "identifier": "pandas",
            "name": "pandas",
            "runtimePlatform": "Python 3",
            "version": "^1.4.1"
        },
        {
            "@type": "SoftwareApplication",
            "identifier": "pathos",
            "name": "pathos",
            "runtimePlatform": "Python 3",
            "version": "^0.3.0"
        },
        {
            "@type": "SoftwareApplication",
            "identifier": "progressbar2",
            "name": "progressbar2",
            "runtimePlatform": "Python 3",
            "version": "^4.2.0"
        },
        {
            "@type": "SoftwareApplication",
            "identifier": "pydantic",
            "name": "pydantic",
            "runtimePlatform": "Python 3",
            "version": "^2.3.0"
        },
        {
            "@type": "SoftwareApplication",
            "identifier": "pyld",
            "name": "pyld",
            "runtimePlatform": "Python 3",
            "version": "^2.0.3"
        },
        {
            "@type": "SoftwareApplication",
            "identifier": "pylint",
            "name": "pylint",
            "runtimePlatform": "Python 3",
            "version": "^2.17.5"
        },
        {
            "@type": "SoftwareApplication",
            "identifier": "pyoai",
            "name": "pyoai",
            "runtimePlatform": "Python 3",
            "version": "2.5.0"
        },
        {
            "@type": "SoftwareApplication",
            "identifier": "python",
            "name": "python",
            "runtimePlatform": "Python 3",
            "version": "^3.9"
        },
        {
            "@type": "SoftwareApplication",
            "identifier": "python-crontab",
            "name": "python-crontab",
            "runtimePlatform": "Python 3",
            "version": "^3.0.0"
        },
        {
            "@type": "SoftwareApplication",
            "identifier": "python-dateutil",
            "name": "python-dateutil",
            "runtimePlatform": "Python 3",
            "version": "^2.8.2"
        },
        {
            "@type": "SoftwareApplication",
            "identifier": "rdflib",
            "name": "rdflib",
            "runtimePlatform": "Python 3",
            "version": "^6.2.0"
        },
        {
            "@type": "SoftwareApplication",
            "identifier": "requests",
            "name": "requests",
            "runtimePlatform": "Python 3",
            "version": "^2.28.1"
        },
        {
            "@type": "SoftwareApplication",
            "identifier": "rich",
            "name": "rich",
            "runtimePlatform": "Python 3",
            "version": "^12.6.0"
        },
        {
            "@type": "SoftwareApplication",
            "identifier": "shapely",
            "name": "shapely",
            "runtimePlatform": "Python 3",
            "version": "^2.0.1"
        },
        {
            "@type": "SoftwareApplication",
            "identifier": "typer",
            "name": "typer",
            "runtimePlatform": "Python 3",
            "version": "^0.9.0"
        },
        {
            "@type": "SoftwareApplication",
            "identifier": "wrapt",
            "name": "wrapt",
            "runtimePlatform": "Python 3",
            "version": "^1.15.0"
        },
        {
            "@type": "SoftwareApplication",
            "identifier": "xmljson",
            "name": "xmljson",
            "runtimePlatform": "Python 3",
            "version": "^0.2.1"
        }
    ],
    "targetProduct": {
        "@type": "CommandLineApplication",
        "executableName": "hmc-unhide",
        "name": "hmc-unhide",
        "runtimePlatform": "Python 3"
    },
    "url": "https://codebase.helmholtz.cloud/hmc/hmc-public/unhide/data_harvesting",
    "version": "1.1.0"
}
@broeder-j broeder-j added the bug Something isn't working label Oct 27, 2023
@broeder-j
Member Author

If I understand the somesy code correctly:

I think this issue arises mainly because the original codemeta.json file is not used/parsed as an additional input to codemetapy when the new codemeta.json is generated. So a new codemeta file is generated only from the given sources, and any additional information in the existing codemeta.json is ignored.

When somesy creates the new codemeta.json via codemetapy, one does not want conflicting information in the sources.
Therefore, one needs to think about how to solve this without conflicts. Maybe merge the original codemeta in afterwards to get all keys that somesy did not generate from the sources, i.e. it could be a simple dict update of the original one with the new one (not optimal, but OK).

This merge will not solve the affiliation id issues, or any other nested things. Maybe one should do a recursive @id merge, i.e. a special treatment just for @ids, or even a recursive dict merge.
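
The recursive merge idea could be sketched roughly as follows. This is only an illustration on plain dictionaries, not somesy or codemetapy code: `merge_codemeta` is a hypothetical helper, the sample data is a trimmed version of the files in this issue (the version numbers are made up), and matching list entries by '@id' is an assumption.

```python
from copy import deepcopy


def merge_codemeta(original, generated):
    """Recursively merge `generated` into a copy of `original`.

    Values from `generated` win; keys that exist only in `original`
    (such as a nested "@id") are kept.
    """
    if isinstance(original, dict) and isinstance(generated, dict):
        merged = deepcopy(original)
        for key, value in generated.items():
            if key in original:
                merged[key] = merge_codemeta(original[key], value)
            else:
                merged[key] = deepcopy(value)
        return merged
    if isinstance(original, list) and isinstance(generated, list):
        # Assumption: list entries (e.g. authors) are matched by "@id".
        by_id = {o["@id"]: o for o in original if isinstance(o, dict) and "@id" in o}
        result = []
        for g in generated:
            if isinstance(g, dict) and g.get("@id") in by_id:
                result.append(merge_codemeta(by_id[g["@id"]], g))
            else:
                result.append(deepcopy(g))
        return result
    return deepcopy(generated)


# Trimmed, illustrative data (hypothetical version numbers).
original = {
    "@id": "https://codebase.helmholtz.cloud/hmc/hmc-public/unhide/data_harvesting",
    "fileSize": "57MB",
    "version": "1.0.0",
    "author": [
        {
            "@id": "https://orcid.org/0000-0001-7939-226X",
            "email": "j.broeder@fz-juelich.de",
            "affiliation": {
                "@id": "https://ror.org/02nv7yv05",
                "name": "Forschungszentrum Jülich GmbH",
            },
        }
    ],
}
generated = {
    "version": "1.1.0",
    "author": [
        {
            "@id": "https://orcid.org/0000-0001-7939-226X",
            "affiliation": {
                "@type": "Organization",
                "legalName": "Forschungszentrum Jülich GmbH",
            },
        }
    ],
}

shallow = {**original, **generated}           # simple dict update
merged = merge_codemeta(original, generated)  # recursive merge
```

In this sketch the simple dict update keeps the top-level '@id' and `fileSize` but replaces the whole `author` list, so the affiliation '@id' is lost; the recursive variant keeps it. Two caveats: the merged affiliation ends up with both `name` and `legalName`, and persons missing from the generated list are dropped, because the generated list is treated as authoritative for membership.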

Also, there should be a test with a rich codemeta where (1) nothing changes and (2) something changes. In both cases only the given changes should happen, such as a version update or any other update.
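
Such a test could be built around a small diff helper, sketched here under assumptions: `changed_paths` and `fake_sync` are hypothetical names, and `fake_sync` merely stands in for a real sync run so that the example is self-contained.

```python
def changed_paths(old, new, prefix=""):
    """Return the set of JSON paths whose values differ between `old` and `new`.

    Dictionaries are compared key by key; lists and scalars are compared as a whole.
    """
    if isinstance(old, dict) and isinstance(new, dict):
        paths = set()
        for key in old.keys() | new.keys():
            path = f"{prefix}/{key}"
            if key not in old or key not in new:
                paths.add(path)
            else:
                paths |= changed_paths(old[key], new[key], path)
        return paths
    return set() if old == new else {prefix or "/"}


def fake_sync(codemeta, version):
    """Hypothetical stand-in for a sync run that only updates the version."""
    return {**codemeta, "softwareVersion": version}


rich = {
    "@id": "https://codebase.helmholtz.cloud/hmc/hmc-public/unhide/data_harvesting",
    "softwareVersion": "1.1.0",
    "fileSize": "57MB",
    "maintainer": {"affiliation": {"@id": "https://ror.org/02nv7yv05"}},
}

# Case 1: the inputs did not change, so nothing in the file may change.
assert changed_paths(rich, fake_sync(rich, "1.1.0")) == set()
# Case 2: only the version is bumped, so only that path may differ.
assert changed_paths(rich, fake_sync(rich, "1.2.0")) == {"/softwareVersion"}
```

A real test would replace `fake_sync` with an actual somesy run on a fixture directory and compare the codemeta.json before and after.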

@apirogov
Contributor

apirogov commented Oct 30, 2023

All somesy does for codemeta is basically collect the relevant sources and input files, call codemetapy, and save the result if it changed. Somesy will not do anything fancier than that.

What we could do is introduce an extra flag for a "supplemental" codemeta file that provides additional triples to be merged and that would also be passed to codemetapy. It would be the responsibility of codemetapy to overwrite or merge that extra information correctly.

We could also think about adding metadata fields to somesy that are passed to codemetapy using its interface (such as the base ID or certain specific codemeta fields not currently covered/inferred).

@apirogov apirogov removed the bug Something isn't working label Oct 31, 2023
@apirogov
Contributor

I removed the 'bug' tag, because overwriting codemeta.json is expected and documented behavior.

@apirogov apirogov added the enhancement New feature or request label Oct 31, 2023
@apirogov apirogov changed the title somesy deletes record and affilation ids from codemeta, and also other keys with values Allow to pass additional metadata to codemeta that is not or cannot be automatically harvested Oct 31, 2023
@broeder-j
Member Author

broeder-j commented Nov 2, 2023

I must say, I do not agree.

Every file is 'overwritten' by somesy, including pyproject.toml and CITATION.cff, but as a user I do not expect my own specifications to be overwritten (for example, by dropping all other tool sections or metadata keys in pyproject.toml). That somesy overwrites codemeta.json in a completely different and worse way is unwanted and unexpected for any user. The behavior of somesy is therefore inconsistent across the different files.
As I said above, this is critical in my view, and until it is fixed I will not use somesy.
So yes, it is currently the 'expected' behavior, but it is completely wrong. I would have applied a 'critical bug' label if one existed.

@apirogov
Contributor

apirogov commented Nov 2, 2023

Because CodeMeta is based on RDF / linked data, it is for technical reasons practically impossible to provide behavior equivalent to what we offer for TOML, YAML and XML files (i.e. more or less conservative patching).

Concerning codemeta specifically, you could disable codemeta synchronization (--no-sync-codemeta) if you dislike the behavior.

However, due to other idiosyncrasies of your project setup (e.g. mixing setuptools and poetry in one pyproject.toml), it looks like your use case and requirements are outside of the target group and scope of somesy.

The main target audience for somesy is people who:

  • have a reasonably simple project setup
  • do not really care too much about 'perfect metadata'
  • might not even have heard of Citation Files or CodeMeta before
  • care just enough to use somesy in order to get them 70-80% of the way without too much work

I would claim that:

  • having the codemeta support in the way somesy provides it right now is still better than none at all
  • the number of users who already use codemeta and hand-edit their codemeta.json for enrichment is vanishingly small

Therefore, unfortunately, I think we have to "agree to disagree" here.

@broeder-j
Member Author

broeder-j commented Nov 2, 2023

How about doing an RDF patch instead? I.e.:

  1. If no codemeta.json exists, do what somesy does now.

  2. Otherwise, if a codemeta.json exists, generate the new triples from the somesy.toml, read in the existing codemeta.json, patch the graph, and serialize it back if the triples from the somesy.toml changed the graph. One would first need to delete from the graph the triples that somesy can patch.
    Would that not be a way? (Changes to whole IDs could be a problem, but one could apply them to some extent earlier.)
    Granted, it is technically hard. This type of issue is all over the place when dealing with linked data.
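
The patch step in point 2 could look roughly like the sketch below. To stay self-contained it models the graph as a plain Python set of (subject, predicate, object) tuples rather than a real RDF library; `patch_graph` and the `MANAGED` predicate list are hypothetical and not somesy's actual behavior. With rdflib, `Graph.remove` and `Graph.add` would presumably play the same role.

```python
ROOT = "https://codebase.helmholtz.cloud/hmc/hmc-public/unhide/data_harvesting"
SCHEMA = "https://schema.org/"

# Assumption: predicates that somesy.toml is the source of truth for (illustrative list).
MANAGED = {SCHEMA + "version", SCHEMA + "name", SCHEMA + "description"}


def patch_graph(existing, generated, subject, managed):
    """Replace the managed triples of `subject` in `existing` with those from `generated`.

    All other triples in `existing` are left untouched.
    """
    kept = {(s, p, o) for (s, p, o) in existing if not (s == subject and p in managed)}
    added = {(s, p, o) for (s, p, o) in generated if s == subject and p in managed}
    return kept | added


# Triples read from the existing codemeta.json (trimmed, illustrative).
existing = {
    (ROOT, SCHEMA + "version", "1.1.0"),
    (ROOT, SCHEMA + "fileSize", "57MB"),
    ("https://orcid.org/0000-0001-7939-226X", SCHEMA + "affiliation", "https://ror.org/02nv7yv05"),
}
# Triples generated from somesy.toml (hypothetical new version).
generated = {
    (ROOT, SCHEMA + "version", "1.2.0"),
    (ROOT, SCHEMA + "name", "data-harvesting"),
}

patched = patch_graph(existing, generated, ROOT, MANAGED)
needs_write = patched != existing  # serialize back only if the graph changed
```

The sketch only handles triples whose subject is the project itself. Nested nodes such as persons and their affiliations would need their own matching rules, which is where the identifier problems mentioned above come in.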

The current support is ONLY good if there is no codemeta.json in the first place. People who currently use somesy care about metadata, so there is likely already a codemeta.json, because they probably started from here: https://codemeta.github.io/codemeta-generator/

It is good that there is already a feature to disable the codemeta sync.
That makes it usable for me in the other context. Thanks!

Another solution could be:

Put all the additional information in the somesy.toml as well, i.e. let somesy.toml be as rich as codemeta.json. Then one could keep the full recreation as it is now and not change any logic except for piping the information through to codemetapy,
i.e. allow users to provide any key that codemeta allows, in the richness that codemeta.json allows.
