Copyrights owner and licenses identification models #1078
I've done some tests and updated the Grobid.odd to add the copyright owners. I've also added some tests (one is failing, and I'm not sure whether it's a bug).
@lfoppiano I changed
return processHeaderDocumentReturnXml_post(inputStream, consolidate, includeRawAffiliations);
@DefaultValue("0") @FormDataParam(INCLUDE_RAW_AFFILIATIONS) String includeRawAffiliations,
@DefaultValue("0") @FormDataParam(INCLUDE_RAW_COPYRIGHTS) String includeRawCopyrights) {
return processHeaderDocumentReturnXml_post(inputStream, consolidate, includeRawAffiliations, includeRawCopyrights);
Check warning
Code scanning / CodeQL
Information exposure through a stack trace Medium
Update of XML schema (also for the latest Pub2TEI version) -> #1084
This PR integrates two new models: one identifies the copyright owner of a document (publisher, authors or unknown), and the other identifies the license, if provided, under which the document file is shared (e.g. CC-BY, CC-BY-NC, etc.). The models currently only work if the "delft" engine is selected. If this engine is not selected, the identification is skipped.

In the TEI, the result is serialized as follows; the example is https://peerj.com/articles/cs-1022/

To encode the copyright owner, we use a @resp ("responsible party") attribute and add a comment explaining how to interpret it. Note that the standard @resp in TEI should be a pointer; here we customize it to take 2 possible values to avoid overcomplicating it. When the copyright owner is undecided by the classifier or unknown, the <availability> element has no @resp attribute.

In addition, the service now includes a boolean parameter includeRawCopyrights that controls whether the <availability> part includes the full copyright/license section that has been extracted (under an added <p type="raw"> element). This section is what the classifier uses to determine the copyright owner and the license.

To have it working, edit grobid-home/config/grobid.yaml to indicate delft as engine for the two new models:

Latest evaluations:
TODO:
@resp
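
To make the @resp and includeRawCopyrights behaviour from the description concrete, here is a minimal Java sketch. It is not the PR's actual code: the class AvailabilitySketch, its method names, the literal attribute values "publisher" and "authors", and the reading of "1" as "enabled" are all assumptions made for illustration (the description only says that @resp takes 2 possible values and that the form parameter defaults to "0"). License serialization is left out because its TEI form is not shown here.

```java
import java.util.Optional;

public class AvailabilitySketch {

    /** Owner labels named in the PR description. */
    enum CopyrightOwner { PUBLISHER, AUTHORS, UNKNOWN }

    /**
     * Assumed mapping from classifier label to the @resp value.
     * The literal strings are hypothetical; UNKNOWN yields no attribute.
     */
    static Optional<String> respValue(CopyrightOwner owner) {
        switch (owner) {
            case PUBLISHER: return Optional.of("publisher");
            case AUTHORS:   return Optional.of("authors");
            default:        return Optional.empty();
        }
    }

    /** Assumption: the form parameter is a "0"/"1" string, with "1" meaning enabled. */
    static boolean isEnabled(String flag) {
        return "1".equals(flag);
    }

    /** Minimal XML text escaping for the raw section. */
    static String escape(String text) {
        return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Renders an <availability> element following the rules in the description. */
    static String renderAvailability(CopyrightOwner owner, String rawSection, String includeRawCopyrights) {
        StringBuilder tei = new StringBuilder("<availability");
        // @resp is written only when the owner was decided by the classifier.
        respValue(owner).ifPresent(v -> tei.append(" resp=\"").append(v).append("\""));
        tei.append(">");
        // The raw copyright/license section is included only on request.
        if (isEnabled(includeRawCopyrights) && rawSection != null && !rawSection.isEmpty()) {
            tei.append("<p type=\"raw\">").append(escape(rawSection)).append("</p>");
        }
        tei.append("</availability>");
        return tei.toString();
    }

    public static void main(String[] args) {
        System.out.println(renderAvailability(CopyrightOwner.AUTHORS, "Copyright 2022 The Authors", "1"));
        System.out.println(renderAvailability(CopyrightOwner.UNKNOWN, "Copyright 2022", "0"));
    }
}
```

Returning an Optional from respValue keeps the "no @resp when undecided or unknown" rule in one place, so the rendering code never has to special-case the unknown label.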