Identify URLs and output them in TEI #1099

Merged (19 commits) on Jun 9, 2024

@lfoppiano (Collaborator) commented Apr 15, 2024

This PR is a continuation of #1097. URLs are now output in the XML-TEI as <ref type="url" target="cleaned_version">URL as it appears in the extracted text</ref>: the target attribute holds the cleaned URL, while the element content keeps the text as it was extracted.
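
A minimal sketch of the output shape described above, written in Python for illustration rather than taken from the GROBID code. The function name `make_url_ref` is hypothetical, and the cleaning rule (removing whitespace from the extracted text) is an assumption based on the examples in this PR.

```python
import re
import xml.etree.ElementTree as ET


def make_url_ref(raw_text: str) -> ET.Element:
    """Build a <ref type="url"> element for a URL found in the text.

    Assumption: the cleaned target is the raw text with all whitespace
    removed. The real cleaning may do more than this.
    """
    cleaned = re.sub(r"\s+", "", raw_text)
    ref = ET.Element("ref", {"type": "url", "target": cleaned})
    ref.text = raw_text  # the element content keeps the extracted text as-is
    return ref


ref = make_url_ref("https://sites.google. com/view/shijuanchen/research/shift_cult")
print(ET.tostring(ref, encoding="unicode"))
```

The target is what a consumer should follow; the element text is kept so that the TEI still mirrors what was read from the PDF.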

As with #1097, we exploit the PDF annotations for clickable URLs to correct the information extracted by the regular expressions (I tried to improve the regular expressions but did not obtain anything more robust).
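
The correction idea can be sketched as follows. This is an illustrative Python sketch, not the code in this PR: `normalise` and `resolve_url` are hypothetical names, and the prefix-based matching rule is an assumption about how a regex match and an annotation destination could be reconciled.

```python
import re
from typing import Iterable


def normalise(text: str) -> str:
    """Remove whitespace introduced by line breaks in the PDF layout."""
    return re.sub(r"\s+", "", text)


def resolve_url(regex_match: str, annotation_targets: Iterable[str]) -> str:
    """Prefer the destination of a PDF link annotation over the regex match.

    Assumption: an annotation refers to the same URL when one string is a
    prefix of the other. This covers the regex catching too much
    ("...supercon2)." vs "...supercon2") and too little (a URL cut at a
    line break). The first matching annotation wins.
    """
    candidate = normalise(regex_match)
    for target in annotation_targets:
        if candidate.startswith(target) or target.startswith(candidate):
            return target
    # No usable annotation: fall back to the regex result.
    return candidate


annotations = ["https://github.com/lfoppiano/supercon2"]
print(resolve_url("https://github.com/lfoppiano/ supercon2).", annotations))
```

A prefix test is deliberately crude: a very short regex match can pair with the wrong annotation, so a real implementation would probably also check that the annotation's bounding box overlaps the matched tokens.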

The extracted URLs are aggregated with the notes and processed in the same data flow, though with different formatting rules, as they appear in the text. This change will help in the future if we need to identify more elements. The implementation could be improved by replacing the Triple object with a more specialised data structure for carrying the information.
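
As an illustration of what a more specialised structure could look like, here is a small Python sketch. Everything in it is hypothetical: the class name `InlineMarker`, its fields, and the offsets in the usage example are assumptions, and it does not describe what the Triple currently holds.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class InlineMarker:
    """One element (URL, note, ...) to be wrapped in a tag inside a paragraph."""

    kind: str                     # e.g. "url" or "note" (hypothetical labels)
    start: int                    # offset of the first character, inclusive
    end: int                      # offset after the last character, exclusive
    raw_text: str                 # text as extracted from the PDF
    target: Optional[str] = None  # cleaned URL; None for markers without one


# Illustrative offsets only, not taken from a real document.
markers = [
    InlineMarker("url", 40, 64, "https://nfms.maf.gov.la/", "https://nfms.maf.gov.la/"),
    InlineMarker("note", 10, 11, "1"),
]

# URLs and notes share one flow: sort by position, then format per kind.
ordered = sorted(markers, key=lambda m: m.start)
print([m.kind for m in ordered])
```

Named fields make it explicit which value is the offset, which is the label, and which is the target, and adding a new kind of element does not change the shape of the data passed around.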

Here are some output examples:

<div type="availability">
    <div
            xmlns="http://www.tei-c.org/ns/1.0">
        <head>Data availability statements</head>
        <p>
            <s>Google Earth Engine applications to visualize the datasets:
                <ref type="url" target="https://github.com/shijuanchen/shift_cult">
                    https://github.com/shijuanchen/shift_cult
                </ref>
                Map products visualization:
                <ref type="url" target="https://sites.google.com/view/shijuanchen/research/shift_cult">
                    https://sites.google. com/view/shijuanchen/research/shift_cult
                </ref>
            </s>
        </p>
        <p>
            <s>The data that support the findings of this study are openly available at the following URL/DOI:
                <ref type="url" target="https://doi.org/10.5281/zenodo.7782782">https://
                    doi.org/10.5281/zenodo.7782782</ref>.
            </s>
        </p>
    </div>
</div>
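
For anyone consuming this output, a short Python sketch of how the URL references could be read back from the TEI. The function name `url_targets` is hypothetical; the namespace is the one shown in the example above, and the sample string is a shortened version of that example.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def url_targets(tei_xml: str) -> List[Optional[str]]:
    """Return the target attribute of every <ref type="url"> in the fragment."""
    root = ET.fromstring(tei_xml)
    refs = root.iterfind(".//tei:ref[@type='url']", TEI_NS)
    return [ref.get("target") for ref in refs]


sample = (
    '<div type="availability">'
    '<div xmlns="http://www.tei-c.org/ns/1.0"><p><s>Available at: '
    '<ref type="url" target="https://doi.org/10.5281/zenodo.7782782">'
    "https:// doi.org/10.5281/zenodo.7782782</ref>.</s></p></div>"
    "</div>"
)
print(url_targets(sample))
```

Reading the target attribute rather than the element text avoids the stray spaces and line breaks that survive in the extracted text.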

Here is an example with sentence segmentation:

<p>
    <s>We compared our area estimates of New Shifting Cultivation with the official forest change statistics from Laos (table
        <ref type="table" target="#tab_1">S1</ref>).
    </s>
    <s>The Laos official forest change maps (
        <ref type="url" target="https://nfms.maf.gov.la/">https://nfms.maf.gov.la/</ref>) are created from the land cover classification maps from the start year and end year for each period (see the periods in table
        <ref type="table" target="#tab_1">S1</ref>).
    </s>
    <s>Since shifting cultivation is the major driver of forest degradation and deforestation in Laos, we expect that there are some consistencies between the areas of New Shifting cultivation and the areas of forest degradation and deforestation.</s>
    <s>There are consistencies in the period
        <ref type="bibr">2006-2010 and 2011-2015, with</ref> the differences between our estimates and the official statistics both less than 1% of Laos.
    </s>
    <s>Our estimates of New Shifting Cultivation are generally higher than the Laos official estimates of deforestation and forest degradation, except for 2006-2010.</s>
    <s>This was partly due to the different monitoring approaches.</s>
    <s>Without using dense time series, the shifting cultivation events that occurred over five years may be difficult to detect using two classification maps from the start and the end.</s>
    <s>In the period
        <ref type="bibr">2001-2005 and 2016-2020, our</ref> estimates are about 2%-3% higher than the official estimates.
    </s>
    <s>For 2016-2020, the discrepancy is partly because the 2019 and 2020 changes are included in our estimates but not in the official statistics.</s>
    <s>Overall, our results and area estimates provide valuable information regarding the forest dynamics of Laos.</s>
</p>
<p>This work is available at
    <ref type="url" target="https://github.com/lfoppiano/supercon2">https://github.com/lfoppiano/ supercon2</ref>. The
    repository contains the code of the SuperCon 2 interface, the curation workflow, and the Table
    <ref type="table">2</ref>. Data support, the number of entities for each label in each of the datasets used for
    evaluating the ML models. The base dataset is the original dataset described in
    <ref type="bibr" target="#b17">[18]</ref>, and the curation dataset is automatically collected based on the database
    corrections by the interface and manually corrected.
</p>

The same example with sentence segmentation:

<div
        xmlns="http://www.tei-c.org/ns/1.0">
    <head n="5.">Code availability</head>
    <p>
        <s>This work is available at
            <ref type="url" target="https://github.com/lfoppiano/supercon2">https://github.com/lfoppiano/
                supercon2</ref>.
        </s>
        <s>The repository contains the code of the SuperCon 2 interface, the curation workflow, and the Table
            <ref type="table">2</ref>. Data support, the number of entities for each label in each of the datasets used
            for evaluating the ML models.
        </s>
        <s>The base dataset is the original dataset described in
            <ref type="bibr" target="#b17">[18]</ref>, and the curation dataset is automatically collected based on the
            database corrections by the interface and manually corrected.
        </s>
    </p>
</div>

TODO:
- fix some corner cases where a trailing . or ) is wrongly retained; these are cases where the regex catches too much (usually one or a few characters)
- sometimes the URL refs are attached to the following text; it seems that a space is missing somewhere
- validate the output against the XML-TEI schema
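
The first TODO item could be approached along these lines. This is a Python sketch of one possible trimming rule, not the fix implemented in GROBID: the function name `trim_trailing_punctuation` is hypothetical, and both the set of characters and the bracket-balancing heuristic are assumptions.

```python
def trim_trailing_punctuation(url: str) -> str:
    """Drop sentence punctuation that a greedy regex attached to a URL.

    A trailing ')' is removed only when the URL contains more closing
    than opening parentheses, so URLs that legitimately end with a
    balanced ')' are kept intact.
    """
    while url:
        last = url[-1]
        if last in ".,;:":
            url = url[:-1]
        elif last == ")" and url.count(")") > url.count("("):
            url = url[:-1]
        else:
            break
    return url


print(trim_trailing_punctuation("https://github.com/lfoppiano/supercon2)."))
```

When a PDF annotation is available for the same link, its destination is a more reliable reference than any trimming rule; a heuristic like this only matters when the regex is the sole source.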

@lfoppiano changed the title from "Identify and output URLs in output TEI" to "Identify URLs and output them in TEI" on Apr 15, 2024
@coveralls commented Apr 15, 2024

Coverage Status

coverage: 40.236% (+0.1%) from 40.116%
when pulling 4d4c1e3 on feature/identify-urls
into 5bcb8b1 on feature/preserve-urls.

@lfoppiano lfoppiano marked this pull request as ready for review April 16, 2024 12:05
@lfoppiano (Collaborator, Author) commented

I've solved additional corner cases:

  1. When there are clickable links but no annotations are extracted by grobid, the regex alone sometimes catches too much, e.g. a closing parenthesis. This is now fixed.
  2. A space before the annotation is added only for URLs, and only if there is a trailing space in the layout token (but not in the dehyphenised text).
  3. In some other cases the PDF document annotations are wrong (e.g. cut at the line break, or simply incorrect), so the resulting URL might be incorrect.
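
The spacing rule in point 2 can be read as the following Python sketch. It is only one interpretation of the rule: the function name `space_before_marker`, its parameters, and the string labels are hypothetical and do not mirror the actual code.

```python
def space_before_marker(kind: str, layout_token_text: str, dehyphenised_text: str) -> str:
    """Return the separator to emit before an inline element.

    kind: type of the element about to be written, e.g. "url" or "bibr".
    layout_token_text: text of the layout token preceding the element.
    dehyphenised_text: the dehyphenised text accumulated so far.
    """
    if kind != "url":
        return ""  # other reference types keep their existing behaviour
    # The space exists in the PDF layout but was lost in the dehyphenised text.
    if layout_token_text.endswith(" ") and not dehyphenised_text.endswith(" "):
        return " "
    return ""


print(repr(space_before_marker("url", "at ", "This work is available at")))
```

Restricting the rule to URLs keeps the output for the other reference types unchanged.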

@lfoppiano lfoppiano added this to the 0.8.1 milestone May 21, 2024
Base automatically changed from feature/preserve-urls to master June 9, 2024 20:54
@lfoppiano lfoppiano merged commit cb7118d into master Jun 9, 2024
4 checks passed
@lfoppiano lfoppiano deleted the feature/identify-urls branch June 9, 2024 20:55