Copyrights owner and licenses identification models #1078

Merged
merged 20 commits into from
Feb 10, 2024

kermitt2 (Owner) commented Jan 29, 2024

This PR integrates two new models: one identifies the copyright owner of a document (publisher, authors or unknown), the other identifies the license, if provided, under which the document file is shared (e.g. CC-BY, CC-BY-NC, etc.). The models currently work only if the "delft" engine is selected. If this engine is not selected, the identification is skipped.

In the TEI, the result is serialized as follows (example: https://peerj.com/articles/cs-1022/):

            <publicationStmt>
                <publisher>PeerJ</publisher>
                <availability resp="authors" status="restricted">
                    <!-- the @resp attribute above gives the document copyrights owner (publisher, authors), if known -->
                    <licence>CC-BY</licence>
                </availability>
                <date type="published" when="2022-07-25">25 July 2022</date>
            </publicationStmt>

To encode the copyright owner, we use the attribute @resp ("responsible party") and add a comment explaining how to interpret it. Note that the standard @resp in TEI should be a pointer; here we restrict it to 2 possible values (publisher, authors) to avoid overcomplicating it. When the copyright owner is undecided by the classifier or unknown, the <availability> element has no @resp attribute.
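
For consumers of the TEI, here is a minimal sketch of reading these fields back. It assumes the fragment shown above; the function name `read_availability` and the returned dictionary keys are illustrative and not part of GROBID. It tries the TEI namespace first and falls back to un-namespaced elements, since the fragment above is shown without a namespace declaration.

```python
import xml.etree.ElementTree as ET

TEI_NS = "{http://www.tei-c.org/ns/1.0}"

# Fragment taken from the PR description (no namespace declaration here).
SAMPLE = """<publicationStmt>
    <publisher>PeerJ</publisher>
    <availability resp="authors" status="restricted">
        <licence>CC-BY</licence>
    </availability>
    <date type="published" when="2022-07-25">25 July 2022</date>
</publicationStmt>"""


def read_availability(tei_xml):
    """Return copyright owner, licence and raw section (hypothetical helper)."""
    root = ET.fromstring(tei_xml)
    # Try namespaced TEI first, then plain element names.
    for ns in (TEI_NS, ""):
        avail = next(root.iter(f"{ns}availability"), None)
        if avail is not None:
            break
    else:
        return {"owner": None, "licence": None, "raw": None}

    licence = avail.find(f"{ns}licence")
    raw = avail.find(f"{ns}p[@type='raw']")
    return {
        # No @resp means the owner is undecided or unknown.
        "owner": avail.get("resp"),
        "licence": licence.text.strip() if licence is not None and licence.text else None,
        "raw": raw.text.strip() if raw is not None and raw.text else None,
    }


if __name__ == "__main__":
    print(read_availability(SAMPLE))
```

A missing `owner` here should be read as "undecided or unknown", not as an error.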

In addition, the service now has a boolean parameter includeRawCopyrights. When it is set, the <availability> part also includes the full copyright/license section that has been extracted (under an added element <p type="raw">). This section is the input used by the classifier to determine the copyright owner and the license.

            <publicationStmt>
                <publisher>PeerJ</publisher>
                <availability resp="authors" status="restricted">
                    <!-- the @resp attribute above gives the document copyrights owner (publisher, authors), if known -->
                    <licence>CC-BY</licence>
                    <p type="raw">Copyright 2022 Du et al. Distributed under Creative Commons CC-BY 4.0</p>
                </availability>
                <date type="published" when="2022-07-25">25 July 2022</date>
            </publicationStmt>
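
A sketch of how a client might set includeRawCopyrights when calling the header service. The assumptions are: a local server at `http://localhost:8070`, the `/api/processHeaderDocument` endpoint, the PDF sent in a form field named `input`, and "1" as the enabling value (the default in this PR is "0"). The helper name `build_header_request` is made up for illustration. The request is only built, not sent.

```python
import urllib.request
import uuid


def build_header_request(pdf_bytes, base_url="http://localhost:8070",
                         include_raw_copyrights=True):
    """Build (but do not send) a multipart POST for header extraction."""
    boundary = uuid.uuid4().hex
    # Form parameter introduced by this PR; "0" is the default value.
    fields = {"includeRawCopyrights": "1" if include_raw_copyrights else "0"}

    parts = []
    for name, value in fields.items():
        parts.append(
            (f"--{boundary}\r\n"
             f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
             f"{value}\r\n").encode("utf-8")
        )
    # The PDF itself; the field name "input" is an assumption.
    parts.append(
        (f"--{boundary}\r\n"
         'Content-Disposition: form-data; name="input"; filename="document.pdf"\r\n'
         "Content-Type: application/pdf\r\n\r\n").encode("utf-8")
        + pdf_bytes + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))

    request = urllib.request.Request(
        f"{base_url}/api/processHeaderDocument",
        data=b"".join(parts),
        method="POST",
    )
    request.add_header("Content-Type", f"multipart/form-data; boundary={boundary}")
    request.add_header("Accept", "application/xml")
    return request


if __name__ == "__main__":
    req = build_header_request(b"%PDF-1.4 placeholder")
    print(req.get_method(), req.full_url)
    # To actually call a running server:
    # with urllib.request.urlopen(req) as resp:
    #     tei = resp.read().decode("utf-8")
```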

To enable this, edit grobid-home/config/grobid.yaml to select delft as the engine for the two new models:

    - name: "copyright"
      # at this time, we only have a DeLFT implementation, 
      # use "wapiti" if the deep learning library JNI is not available and model will then be ignored
      engine: "delft"
      #engine: "wapiti"
      delft:
        # deep learning parameters
        architecture: "gru"
        #architecture: "bert"
        #transformer: "allenai/scibert_scivocab_cased"

    - name: "license"
      # at this time, must always be DeLFT, not other implementation
      # use "wapiti" if the deep learning library JNI is not available and model will then be ignored
      engine: "delft"
      #engine: "wapiti"
      delft:
        # deep learning parameters
        architecture: "gru"
        #architecture: "bert"
        #transformer: "allenai/scibert_scivocab_cased"

Latest evaluations:

GRU ensemble 10, glove-840B
===========================

* Copyrights owner

Evaluation on 76 instances:
                   precision        recall       f-score       support
     publisher        0.9310        1.0000        0.9643            27
       authors        1.0000        1.0000        1.0000            24
     undecided        1.0000        0.9200        0.9583            25

* License identification

Evaluation on 92 instances:
                   precision        recall       f-score       support
          CC-0        0.0000        0.0000        0.0000             0
         CC-BY        1.0000        1.0000        1.0000            26
      CC-BY-NC        1.0000        0.8000        0.8889             5
   CC-BY-NC-ND        0.8000        1.0000        0.8889             8
      CC-BY-SA        1.0000        1.0000        1.0000             6
   CC-BY-NC-SA        1.0000        1.0000        1.0000             2
      CC-BY-ND        1.0000        0.5000        0.6667             2
     copyright        1.0000        0.9091        0.9524            11
         other        0.0000        0.0000        0.0000             0
     undecided        0.9697        1.0000        0.9846            32

SciBERT, base cased
===================

* Copyrights owner

Evaluation on 76 instances:
                   precision        recall       f-score       support
     publisher        0.9000        1.0000        0.9474            27
       authors        1.0000        1.0000        1.0000            24
     undecided        1.0000        0.8800        0.9362            25

* License identification

Evaluation on 83 instances:
                   precision        recall       f-score       support
          CC-0        0.0000        0.0000        0.0000             0
         CC-BY        0.7857        1.0000        0.8800            22
      CC-BY-NC        0.6000        0.7500        0.6667             4
   CC-BY-NC-ND        0.8182        0.5625        0.6667            16
      CC-BY-SA        0.2500        0.5000        0.3333             2
   CC-BY-NC-SA        0.0000        0.0000        0.0000             2
      CC-BY-ND        0.0000        0.0000        0.0000             1
     copyright        1.0000        1.0000        1.0000             8
         other        0.0000        0.0000        0.0000             1
     undecided        1.0000        1.0000        1.0000            27
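
For reading the tables above: the f-score column is the harmonic mean of precision and recall. The short check below recomputes it for a few rows copied from the GRU tables; the tolerance only accounts for the 4-decimal rounding of the reported values.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (F1); 0.0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported f-score), copied from the GRU tables above.
ROWS = {
    "publisher": (0.9310, 1.0000, 0.9643),
    "undecided (owner)": (1.0000, 0.9200, 0.9583),
    "CC-BY-NC": (1.0000, 0.8000, 0.8889),
    "CC-BY-ND": (1.0000, 0.5000, 0.6667),
}

if __name__ == "__main__":
    for label, (p, r, reported) in ROWS.items():
        computed = f_score(p, r)
        # Reported values are rounded to 4 decimals.
        assert abs(computed - reported) < 1e-3, label
        print(f"{label}: computed {computed:.4f}, reported {reported:.4f}")
```

Rows with support 0 (e.g. CC-0) have no test instances, so their 0.0000 scores say nothing about model quality.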


TODO:

  • update the TEI ODD schema for the customized attribute @resp
  • consider a lighter, CPU-only classifier that does not require the Deep Learning JNI libraries and their installation
  • consider extracting the license version, because it is necessary to create a target URL associated with the license element (e.g. @target="https://creativecommons.org/licenses/by/3.0/"). Without the version, there is no URL possible for the CC license :/
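
A rough sketch for the last TODO item, not what this PR implements: assuming the version can be found in the raw copyright section with a simple pattern, the target URL could be derived from the predicted label. The function name `licence_target` and the regex are illustrative only; a real solution would need something more robust than the first version-like number in the text.

```python
import re
from typing import Optional

# Known Creative Commons license versions.
_VERSION_RE = re.compile(r"\b(1\.0|2\.0|2\.5|3\.0|4\.0)\b")


def licence_target(label: str, raw_text: Optional[str]) -> Optional[str]:
    """Return a CC license URL for a predicted label, or None if not derivable."""
    if label == "CC-0":
        # CC0 exists only as version 1.0.
        return "https://creativecommons.org/publicdomain/zero/1.0/"
    if not label.startswith("CC-BY"):
        # "copyright", "other", "undecided": no CC URL.
        return None
    match = _VERSION_RE.search(raw_text or "")
    if match is None:
        # Without a version there is no URL to build.
        return None
    code = label[len("CC-"):].lower()  # e.g. "CC-BY-NC-ND" -> "by-nc-nd"
    return f"https://creativecommons.org/licenses/{code}/{match.group(1)}/"


if __name__ == "__main__":
    raw = "Copyright 2022 Du et al. Distributed under Creative Commons CC-BY 4.0"
    print(licence_target("CC-BY", raw))
```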

coveralls commented Jan 29, 2024

Coverage Status

coverage: 39.907% (+0.1%) from 39.771%
when pulling 261f975 on copyrights-licenses
into bcce229 on master.

kermitt2 marked this pull request as draft January 29, 2024 18:49

lfoppiano (Collaborator) commented Feb 1, 2024

I've done some tests, and updated the Grobid.odd to add the copyrightsOwner attribute.

I've added some tests (one is failing, I'm not sure it's a bug).

kermitt2 (Owner, Author) commented Feb 1, 2024

@lfoppiano I changed @copyrightsOwner to @resp after comments by Laurent.

kermitt2 self-assigned this Feb 2, 2024

return processHeaderDocumentReturnXml_post(inputStream, consolidate, includeRawAffiliations);
@DefaultValue("0") @FormDataParam(INCLUDE_RAW_AFFILIATIONS) String includeRawAffiliations,
@DefaultValue("0") @FormDataParam(INCLUDE_RAW_COPYRIGHTS) String includeRawCopyrights) {
return processHeaderDocumentReturnXml_post(inputStream, consolidate, includeRawAffiliations, includeRawCopyrights);

Check warning

Code scanning / CodeQL

Information exposure through a stack trace Medium

Error information can be exposed to an external user.
kermitt2 marked this pull request as ready for review February 4, 2024 17:00

kermitt2 (Owner, Author) commented Feb 9, 2024

Update of XML schema (also for the latest Pub2TEI version) -> #1084

kermitt2 merged commit ed9fef7 into master Feb 10, 2024
9 checks passed
lfoppiano added this to the 0.8.1 milestone Jun 9, 2024