support structured funding information #266

Closed
mbjones opened this issue Mar 24, 2017 · 13 comments

@mbjones
Member

mbjones commented Mar 24, 2017

EML 2.0.1 supports a funding field, but it is completely unstructured, and prevents effective linking to awards due to lack of standardization. The current funding field in eml-project is of type txt:TextType, which allows it to have sections and paragraphs. Some groups use multiple paragraphs for different awards, but there is still no structure to understand which agency provided the funding, nor a well-delimited award number, nor links to web services, fundRefIDs, etc. I propose to extend EML to support this structured information to allow more parseable funding information, particularly:

  • award agency name
  • award number
  • award title
  • agency fundref Identifier (or related)
  • program fundref Identifier (or related)
  • award URL

I also considered adding principal investigators, but chose not to, as they are in the containing project description. It's arguable that the award title should simply be in the containing project title. Given that 'funding' already exists as a description of the funding, I propose that a new optional and repeatable award element be added that can list the machine-parseable part of the funding info. Here's a proposed new structure for a funding section.

<project>
  ...
  <funding><para>General description goes here for backwards compatibility</para></funding>
  <award>
    <funder_name>National Science Foundation</funder_name>
    <funder_identifier>https://doi.org/10.13039/00000001</funder_identifier>
    <award_number>1546024</award_number>
    <title>Scientia Arctica: A Knowledge Archive for Discovery and Reproducible Science in the Arctic</title>
    <awardURL>https://www.nsf.gov/awardsearch/showAward?AWD_ID=1546024</awardURL>
  </award>
  <!-- Note award is repeatable -->
</project>

I chose the funder_name, funder_identifier, and award_number field names, even though they do not follow EML naming conventions, to match the CrossRef fields from their Funding Data initiative. That initiative includes their Funder Registry, with formal identifiers for funding programs, and a CrossRef API /funders endpoint, which can be linked to papers and resources following their overview for inclusion in CrossRef. I matched their particular fundref.xsd schema (see documentation), but we could also consider importing it directly and using their namespace.

Please add thoughts/feedback to this ticket and we'll work up a proposal.
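
To make "more parseable" concrete, here is a minimal Python sketch that reads the proposed award elements using only the standard library. The element names follow the draft structure above and are not final EML; the helper name `read_awards` and the `SAMPLE` document are assumptions made up for illustration.

```python
import xml.etree.ElementTree as ET

# Sample document using the draft element names proposed in this ticket.
SAMPLE = """
<project>
  <funding><para>General description goes here for backwards compatibility</para></funding>
  <award>
    <funder_name>National Science Foundation</funder_name>
    <funder_identifier>https://doi.org/10.13039/00000001</funder_identifier>
    <award_number>1546024</award_number>
    <title>Scientia Arctica: A Knowledge Archive for Discovery and Reproducible Science in the Arctic</title>
    <awardURL>https://www.nsf.gov/awardsearch/showAward?AWD_ID=1546024</awardURL>
  </award>
</project>
"""


def read_awards(project_xml):
    """Return one dict per <award>, keyed by child element name (hypothetical helper)."""
    project = ET.fromstring(project_xml)
    awards = []
    # <award> is repeatable, so collect every direct child with that name.
    for award in project.findall("award"):
        awards.append({child.tag: (child.text or "").strip() for child in award})
    return awards


awards = read_awards(SAMPLE)
print(awards[0]["funder_name"], awards[0]["award_number"])
```

With the free-text funding field alone, the same lookup would need pattern matching on prose, which is the problem this proposal is meant to avoid.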

@mbjones mbjones self-assigned this Mar 24, 2017
@mbjones mbjones added this to the EML2.2.0 milestone Mar 24, 2017
@csjx
Member

csjx commented Mar 24, 2017

While I like the idea of importing CrossRef's fundref.xsd schema to eliminate redundancy, I don't see any strong namespace versioning, which might mean that the schema could change on us. For this reason, I'd suggest we either work with CrossRef to strongly version their schemas by namespace and import it (preferable), or we add the elements directly to EML. Otherwise, I like this proposal, particularly since it's backwards-compatible with our current schema.

@amoeba
Contributor

amoeba commented Mar 24, 2017

I think putting the funding information inside the EML like this is great, as I'm a fan of putting as much as possible inside a single EML doc. And making it look like or be imported from the fundref schema is great too. Is there another option here where we include a fundref XML doc in the package instead of this or in addition to this?

mbjones added a commit that referenced this issue Jul 25, 2017
Added first version of a new `award` field to the eml-project schema to add info that can be
used to reference funding information for the project. This provides the first implementation
of the `award` field, including the award title, identifier, and open funder registry identifier.
These proposed additions are partial implementations for issue #266.
@mbjones
Member Author

mbjones commented Jul 25, 2017

Draft award element has now been added to eml-project.xsd in the 2.2 branch. Still need to add an example to the eml-sample.xml file.

@mbjones
Member Author

mbjones commented Sep 9, 2017

Draft award sample has now been added to the eml-sample.xml document in sha 5ce01c9, and it validates.

@mbjones
Member Author

mbjones commented Oct 30, 2017

See related discussions for CodeMeta and schema.org on how to incorporate funding info:

codemeta/codemeta#137
codemeta/codemeta#160

@mbjones
Member Author

mbjones commented Dec 3, 2017

@mfenner outlines the schema.org proposal in codemeta/codemeta#160 describing the structure they will use as:

{ 
  "award": [
     {
        "@type": "Award",
        "name": "MOTivational strength of ecosystem services and alternative ways to express the value of BIOdiversity",
        "identifier": "282625",
        "url": "http://cordis.europa.eu/project/rcn/100180_en.html",
        "funder": {
          "@id": "http://doi.org/10.13039/501100000780",
          "@type": "Organization",
          "name": "European Commission"
        }
    },
    {
        "@type": "Award",
        "name": "Institutionalizing global genetic-resource commons. Global Strategies for accessing and using essential public knowledge assets in the life science",
        "identifier": "284382",
        "url": "http://cordis.europa.eu/project/rcn/100603_en.html",
        "funder": {
          "@id": "http://doi.org/10.13039/501100000780",
          "@type": "Organization",
          "name": "European Commission"
        }
    }]
}

We could therefore modify the EML structure to be compatible, which mainly means shifting to camelCase to match both EML and schema.org conventions, and shifting field name conventions to match schema.org. The new EML structure for awards would then look like this:

<project>
  ...
  <funding><para>General description goes here for backwards compatibility</para></funding>
  <award>
    <funderName>National Science Foundation</funderName>
    <funderIdentifier>https://doi.org/10.13039/00000001</funderIdentifier>
    <awardNumber>1546024</awardNumber>
    <title>Scientia Arctica: A Knowledge Archive for Discovery and Reproducible Science in the Arctic</title>
    <awardURL>https://www.nsf.gov/awardsearch/showAward?AWD_ID=1546024</awardURL>
  </award>
  <!-- Note award is repeatable -->
</project>

Alternatively, we could also make a new Funder type to more explicitly parallel the schema.org structure, in which case we would have:

<project>
  ...
  <funding><para>General description goes here for backwards compatibility</para></funding>
  <award>
    <funder>
        <name>National Science Foundation</name>
        <identifier>https://doi.org/10.13039/00000001</identifier>
    </funder>
    <awardNumber>1546024</awardNumber>
    <title>Scientia Arctica: A Knowledge Archive for Discovery and Reproducible Science in the Arctic</title>
    <awardURL>https://www.nsf.gov/awardsearch/showAward?AWD_ID=1546024</awardURL>
  </award>
  <!-- Note award is repeatable -->
</project>

The advantage of the former is that it is shallower and easier to parse. The advantage of the latter is that it allows funder to become a type of Organization, and disambiguates the name and identifier fields.
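
As a rough illustration of that trade-off, the following Python sketch reads the funder from both layouts. The two sample strings are trimmed versions of the structures above, and the helper names `funder_from_flat` and `funder_from_nested` are hypothetical. The nested form costs one extra path step, but its `name` and `identifier` are scoped to the funder rather than distinguished by a prefix.

```python
import xml.etree.ElementTree as ET

# Trimmed versions of the two candidate layouts discussed above.
FLAT = (
    "<award>"
    "<funderName>National Science Foundation</funderName>"
    "<funderIdentifier>https://doi.org/10.13039/00000001</funderIdentifier>"
    "<awardNumber>1546024</awardNumber>"
    "</award>"
)
NESTED = (
    "<award>"
    "<funder>"
    "<name>National Science Foundation</name>"
    "<identifier>https://doi.org/10.13039/00000001</identifier>"
    "</funder>"
    "<awardNumber>1546024</awardNumber>"
    "</award>"
)


def funder_from_flat(award):
    """Flat layout: funder fields are direct children with a 'funder' prefix."""
    return {
        "name": award.findtext("funderName"),
        "identifier": award.findtext("funderIdentifier"),
    }


def funder_from_nested(award):
    """Nested layout: funder fields live one level down, under <funder>."""
    return {
        "name": award.findtext("funder/name"),
        "identifier": award.findtext("funder/identifier"),
    }


flat = funder_from_flat(ET.fromstring(FLAT))
nested = funder_from_nested(ET.fromstring(NESTED))
print(flat == nested)
```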

@mbjones
Member Author

mbjones commented Jan 4, 2018

To fully match the schema.org structures, we could continue to deviate from FundingData field names, and use schema.org names, and reorder fields to correspond with the DataCite proposal. This would look like:

<project>
  ...
  <funding><para>General description goes here for backwards compatibility</para></funding>
  <award>
    <name>Scientia Arctica: A Knowledge Archive for Discovery and Reproducible Science in the Arctic</name>
    <identifier>1546024</identifier>
    <url>https://www.nsf.gov/awardsearch/showAward?AWD_ID=1546024</url>
    <funder>
        <name>National Science Foundation</name>
        <identifier>https://doi.org/10.13039/00000001</identifier>
    </funder>
  </award>
  <!-- Note award is repeatable -->
</project>

This schema deviates from EML semantics, as it would be more consistent to type funder as an eml:ResponsibleParty. That typing would work well by allowing the funder to use organizationName or individualName, but would be awkward because the responsible party field for the identifier is called userId, which works well for people but not so well for organizations.

So, it seems we have three possible options:

Option 1: Follow FundingData naming conventions

This is how things are currently implemented in the 2.2 branch. It is also consistent with how many journals are providing funding info for their journal articles.

Option 2: Follow schema.org conventions

This is what is represented in the DataCite examples. As schema.org doesn't have an award type yet, this is really making a totally new path. But schema.org is a nice, generic vocabulary that Google recognizes.

Option 3: Follow EML naming conventions

In this case, we re-use existing EML types when they exist (namely for funder), and make up new types as needed. We would have to deal with the oddities of EML, but would benefit from the consistency of types in EML.

Comments please on these pros and cons so we can wrap this up and make a decision. @cboettig, @csjx, @mobb, @amoeba?

@amoeba
Contributor

amoeba commented Jan 5, 2018

Thanks for laying out the three options. I'm a fan of Option 3 at this point. I'd like to see us make as much use of existing EML types as possible when adding features to the spec in an effort to keep EML looking like EML. This has downstream benefits in terms of application development. I think this preference is different than my previous preference.

With the other solutions, even though we'd be making use of names from other standards, it seems like a crosswalk between the metadata standards would still be needed, no matter how similar we are. If we really want to use another schema, I'd rather see us embed the entire FundRef metadata record in the EML (which is probably super gross).

I think as long as we maintain a low-loss semantic equivalent between EML's funding information and the currently used and not-yet-invented schemas, we're good.

@cboettig
Member

cboettig commented Jan 5, 2018

How ugly is it to embed the full FundRef? Or a subset of FundRef properties (without altering the nesting structure).

There seems to be significant momentum behind improving schema.org award support; since the crossref folks are already thinking along these lines I imagine they'll define a mapping between FundingData and whatever schema.org modifications for Award emerge, but not worth us hacking this now.

@mbjones
Member Author

mbjones commented Jan 5, 2018

It's pretty ugly in its normal form, and specific to crossref's approach to assertions. I can't see importing the fundref.xsd directly. Check it out.

I think I agree with @amoeba that whatever we choose will require an EML conversion anyway, because the fields won't be in either the fundref or schema.org namespaces. So I think all we need is isomorphism; exact name matches are not critical.
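
To show what such a conversion might look like, here is a small Python sketch of a crosswalk from a flat camelCase EML award to the schema.org-style structure quoted earlier in this thread. The field mappings, the helper name `award_to_schema_org`, and the choice to put the funder identifier in `@id` are assumptions for illustration; neither EML nor schema.org defines this mapping.

```python
import json
import xml.etree.ElementTree as ET

# Assumed mapping from the flat camelCase EML draft to schema.org-style names.
AWARD_FIELDS = {"title": "name", "awardNumber": "identifier", "awardURL": "url"}
FUNDER_FIELDS = {"funderName": "name", "funderIdentifier": "@id"}


def award_to_schema_org(award):
    """Convert one EML <award> element into a schema.org-style dict (hypothetical helper)."""
    doc = {"@type": "Award"}
    for eml_name, target in AWARD_FIELDS.items():
        value = award.findtext(eml_name)
        if value:
            doc[target] = value.strip()
    funder = {"@type": "Organization"}
    for eml_name, target in FUNDER_FIELDS.items():
        value = award.findtext(eml_name)
        if value:
            funder[target] = value.strip()
    doc["funder"] = funder
    return doc


award = ET.fromstring(
    "<award>"
    "<funderName>National Science Foundation</funderName>"
    "<funderIdentifier>https://doi.org/10.13039/00000001</funderIdentifier>"
    "<awardNumber>1546024</awardNumber>"
    "<title>Scientia Arctica: A Knowledge Archive for Discovery and Reproducible Science in the Arctic</title>"
    "<awardURL>https://www.nsf.gov/awardsearch/showAward?AWD_ID=1546024</awardURL>"
    "</award>"
)
print(json.dumps({"award": [award_to_schema_org(award)]}, indent=2))
```

Because every EML field lands in exactly one target field, the mapping could be inverted to go the other way, which is the isomorphism that matters here.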

@cboettig
Member

cboettig commented Jan 5, 2018

Sounds good to me. 👍

@mbjones
Member Author

mbjones commented Feb 9, 2018

I've re-implemented this following the guidelines of Option 3, and checked it in as SHA eb4ed60. Here's an example funding section that validates.

<project>
  ...
  <funding><para>Funding is from a grant from the National Science Foundation.</para></funding>
  <award>
    <funderName>National Science Foundation</funderName>
    <funderIdentifier>https://doi.org/10.13039/00000001</funderIdentifier>
    <awardNumber>1546024</awardNumber>
    <title>Scientia Arctica: A Knowledge Archive for Discovery and Reproducible Science in the Arctic</title>
    <awardUrl>https://www.nsf.gov/awardsearch/showAward?AWD_ID=1546024</awardUrl>
  </award>
</project>

Unless there are further requests for modification, this enhancement is complete, and I will close this ticket.
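
For anyone generating this section from code, here is a minimal Python sketch that builds an award element with the names used in the example above. The helper `build_award` is hypothetical, and the child order simply mirrors the example; this thread does not state which children the schema requires or whether it enforces their order, so check eml-project.xsd before relying on either.

```python
import xml.etree.ElementTree as ET


def build_award(funder_name, funder_identifier, award_number, title, award_url):
    """Build an <award> element in the child order shown in the example (hypothetical helper)."""
    award = ET.Element("award")
    ET.SubElement(award, "funderName").text = funder_name
    ET.SubElement(award, "funderIdentifier").text = funder_identifier
    ET.SubElement(award, "awardNumber").text = award_number
    ET.SubElement(award, "title").text = title
    ET.SubElement(award, "awardUrl").text = award_url
    return award


award = build_award(
    funder_name="National Science Foundation",
    funder_identifier="https://doi.org/10.13039/00000001",
    award_number="1546024",
    title="Scientia Arctica: A Knowledge Archive for Discovery and Reproducible Science in the Arctic",
    award_url="https://www.nsf.gov/awardsearch/showAward?AWD_ID=1546024",
)

# Attach the award to a project; <award> is repeatable, so more can be appended.
project = ET.Element("project")
project.append(award)
xml_text = ET.tostring(project, encoding="unicode")
print(xml_text)
```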

@mobb
Contributor

mobb commented Feb 20, 2018

I agree with @amoeba on Option 3: follow EML naming and typing conventions. At present, though, I don't see the EML ResponsiblePartyType containing funderName and ID (in Branch_2_2).

As was said, there would need to be a crosswalk anyway between these elements and other systems. The extra structure of using that Type isn't much of a hardship, since code for any responsibleParty can be reused to insert it, rather than having to create a function just for funderName and funderID.
