bulkload publication citations #1565

Closed
KyndallH opened this issue Jun 12, 2018 · 11 comments
@KyndallH

It would be great if we could bulkload publications.

And import publications from a bibliography.

Attached is an example of the publications we need to get entered. Some have a DOI listed and some do not.

Papers using UAM birds by date working Apr 2018.txt

@KyndallH KyndallH assigned AJLinn and dustymc and unassigned dustymc and AJLinn Jun 12, 2018
@Jegelewicz
Member

Agree! I have a similar list that I have been plowing through as time permits (Ha! Ha!), which means they never get done... A bulkloader would be fabulous!

@dustymc
Contributor

dustymc commented Jun 13, 2018

I need two things:

  • format
  • rules

Here's the target.


UAM@ARCTOS> desc publication
 Name                   Null?    Type
 ---------------------- -------- --------------
 PUBLICATION_ID         NOT NULL NUMBER
 PUBLISHED_YEAR                  NUMBER
 PUBLICATION_TYPE       NOT NULL VARCHAR2(21)
 PUBLICATION_LOC                 VARCHAR2(255)
 PUBLICATION_REMARKS             VARCHAR2(4000)
 IS_PEER_REVIEWED_FG    NOT NULL NUMBER(1)
 FULL_CITATION          NOT NULL VARCHAR2(4000)
 SHORT_CITATION         NOT NULL VARCHAR2(4000)
 DOI                             VARCHAR2(4000)
 PMID                            VARCHAR2(4000)


UAM@ARCTOS> desc publication_agent
 Name                   Null?    Type
 ---------------------- -------- --------------
 PUBLICATION_AGENT_ID   NOT NULL NUMBER
 PUBLICATION_ID         NOT NULL NUMBER
 AGENT_ID               NOT NULL NUMBER
 AUTHOR_ROLE            NOT NULL VARCHAR2(255)

Code tables are

The NOT NULL things are required. Agents are optional (so NOT NULL is "required if there's an agent"). We do have a "require at least one" rule in the UI, I believe implemented to facilitate search. (I can't find the Issue - maybe it's in some AWG notes?)
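
As a rough illustration of what "rules" could mean for a bulkloader, here is a minimal sketch of a pre-load check built only from the `desc publication` output above. The function name, the dict-per-row input, the 0/1 reading of IS_PEER_REVIEWED_FG, and counting VARCHAR2 limits in characters are all assumptions, not anything Arctos actually does. PUBLICATION_ID is left out on the assumption that the loader would assign it, and code-table checks are omitted because the table contents are not shown here.

```python
# Hypothetical pre-load check for one bulkloader row (a dict keyed by
# lower-cased column name). Column names and limits come from the
# `desc publication` output; everything else is an assumption.

REQUIRED = ("publication_type", "is_peer_reviewed_fg",
            "full_citation", "short_citation")

MAX_LEN = {
    "publication_type": 21,
    "publication_loc": 255,
    "publication_remarks": 4000,
    "full_citation": 4000,
    "short_citation": 4000,
    "doi": 4000,
    "pmid": 4000,
}


def is_blank(value):
    """True for None, empty strings, and whitespace-only strings."""
    return value is None or str(value).strip() == ""


def validate_row(row):
    """Return a list of problems; an empty list means the row passes."""
    problems = []
    # NOT NULL columns must be present and non-blank.
    for col in REQUIRED:
        if is_blank(row.get(col)):
            problems.append(f"{col} is required")
    # VARCHAR2 limits, treated here as character counts (assumption).
    for col, limit in MAX_LEN.items():
        value = row.get(col)
        if not is_blank(value) and len(str(value)) > limit:
            problems.append(f"{col} exceeds {limit} characters")
    # Assumption: the NUMBER(1) flag only ever holds 0 or 1.
    flag = row.get("is_peer_reviewed_fg")
    if not is_blank(flag) and str(flag).strip() not in ("0", "1"):
        problems.append("is_peer_reviewed_fg must be 0 or 1")
    year = row.get("published_year")
    if not is_blank(year) and not str(year).strip().isdigit():
        problems.append("published_year must be a number")
    return problems
```

Returning every problem at once, rather than stopping at the first, would let a whole file be corrected in one pass.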

DOIs are critical in detecting duplicates and linking to funding and all that jazz. I'm not sure how successfully I could extract them from those example data - there is a LOT of variation in formatting. I checked a few entries that don't list a DOI, and the publications themselves all seem to have one. Can someone run this through some bibliography tool and see if there's any magic there? @mkoo
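
For the entries that do list a DOI, a simple pattern match may recover most of them. The sketch below is only that: the pattern ("10." followed by 4 to 9 digits, a slash, and a suffix) is an assumption that fits most modern DOIs but is not a full grammar, and `extract_doi` is a made-up name. It does nothing for citations where the DOI is missing.

```python
import re

# Assumption: DOIs look like "10.<4-9 digits>/<suffix with no spaces>".
DOI_RE = re.compile(r"\b10\.\d{4,9}/\S+")


def extract_doi(citation):
    """Return the first DOI-like string in a citation, or None."""
    match = DOI_RE.search(citation)
    if match is None:
        return None
    # Trailing punctuation usually belongs to the sentence, not the DOI.
    return match.group(0).rstrip(".,;")


citation = (
    "Molecular Phylogenetics and Evolution 103:41-54. "
    "doi: 10.1016/j.ympev.2016.06.008"
)
print(extract_doi(citation))  # 10.1016/j.ympev.2016.06.008
```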

Dealing with duplicate publications is a huge mess; each duplicate almost inevitably ends up holding part of the citations, and the authors never quite line up. I'm not quite sure how to avoid that with a bulkloader.

Given DOIs I can pull publication details from CrossRef as well. It's also a mess, but it's the mess everyone uses. In the data above:

Sonsthagen, S.A., R. E. Wilson, R. Terry Chesser, J-M. Pons, P-A. Crochet, A. Driskell, C. Dove. 2016. Recurrent hybridization and recent origin obscure phylogenetic relationships within the Ôwhite-headedÕ gull (Larus sp.) complex. Molecular Phylogenetics and Evolution 103:41-54. doi: 10.1016/j.ympev.2016.06.008

(Note "Ôwhite-headedÕ" - there's some sort of characterset conversion failure in these data.)

vs. from CrossRef via DOI:

<?xml version="1.0" encoding="UTF-8"?>
<doi_records>
  <doi_record owner="10.1016" timestamp="2017-10-07 03:37:06">
    <crossref>
      <journal>
        <journal_metadata language="en">
          <full_title>Molecular Phylogenetics and Evolution</full_title>
          <abbrev_title>Molecular Phylogenetics and Evolution</abbrev_title>
          <issn media_type="print">10557903</issn>
        </journal_metadata>
        <journal_issue>
          <publication_date media_type="print">
            <month>10</month>
            <year>2016</year>
          </publication_date>
          <journal_volume>
            <volume>103</volume>
          </journal_volume>
          <special_numbering>C</special_numbering>
        </journal_issue>
        <journal_article publication_type="full_text">
          <titles>
            <title>Recurrent hybridization and recent origin obscure phylogenetic relationships within the ‘white-headed’ gull ( Larus  sp.) complex</title>
          </titles>
          <contributors>
            <person_name contributor_role="author" sequence="first">
              <given_name>Sarah A.</given_name>
              <surname>Sonsthagen</surname>
            </person_name>
            <person_name contributor_role="author" sequence="additional">
              <given_name>Robert E.</given_name>
              <surname>Wilson</surname>
            </person_name>
            <person_name contributor_role="author" sequence="additional">
              <given_name>R. Terry</given_name>
              <surname>Chesser</surname>
            </person_name>
            <person_name contributor_role="author" sequence="additional">
              <given_name>Jean-Marc</given_name>
              <surname>Pons</surname>
            </person_name>
            <person_name contributor_role="author" sequence="additional">
              <given_name>Pierre-Andre</given_name>
              <surname>Crochet</surname>
            </person_name>
            <person_name contributor_role="author" sequence="additional">
              <given_name>Amy</given_name>
              <surname>Driskell</surname>
            </person_name>
            <person_name contributor_role="author" sequence="additional">
              <given_name>Carla</given_name>
              <surname>Dove</surname>
            </person_name>
          </contributors>
          <publication_date media_type="print">
            <month>10</month>
            <year>2016</year>
          </publication_date>
          <pages>
            <first_page>41</first_page>
            <last_page>54</last_page>
          </pages>
          <publisher_item>
            <item_number item_number_type="sequence-number">S1055790316301464</item_number>
            <identifier id_type="pii">S1055790316301464</identifier>
          </publisher_item>
          <crossmark>
            <crossmark_policy>10.1016/elsevier_cm_policy</crossmark_policy>
            <crossmark_domains>
              <crossmark_domain>
                <domain>elsevier.com</domain>
              </crossmark_domain>
              <crossmark_domain>
                <domain>sciencedirect.com</domain>
              </crossmark_domain>
            </crossmark_domains>
            <crossmark_domain_exclusive>true</crossmark_domain_exclusive>
            <custom_metadata>
              <assertion label="This article is maintained by " name="publisher">Elsevier</assertion>
              <assertion label="Article Title" name="articletitle">Recurrent hybridization and recent origin obscure phylogenetic relationships within the ‘white-headed’ gull (Larus sp.) complex</assertion>
              <assertion label="Journal Title" name="journaltitle">Molecular Phylogenetics and Evolution</assertion>
              <assertion label="CrossRef DOI link to publisher maintained version" name="articlelink">http://dx.doi.org/10.1016/j.ympev.2016.06.008</assertion>
              <assertion label="Content Type" name="content_type">article</assertion>
              <assertion label="Copyright" name="copyright">Published by Elsevier Inc.</assertion>
              <program name="fundref">
                <assertion name="fundgroup">
                  <assertion name="funder_name">
                    <assertion name="funder_identifier">http://dx.doi.org/10.13039/100006282</assertion>
                    Federal Aviation Administration (FAA)
                  </assertion>
                </assertion>
                <assertion name="fundgroup">
                  <assertion name="funder_name">
                    <assertion name="funder_identifier">http://dx.doi.org/10.13039/100006271</assertion>
                    Laboratories of Analytical Biology and Division of Birds, National Museum of Natural History, Smithsonian Institution
                  </assertion>
                </assertion>
              </program>
              <program name="AccessIndicators">
                <license_ref applies_to="tdm">http://www.elsevier.com/tdm/userlicense/1.0/</license_ref>
                <license_ref applies_to="am" start_date="2017-07-15">http://www.elsevier.com/open-access/userlicense/1.0/</license_ref>
              </program>
            </custom_metadata>
          </crossmark>
          <doi_data>
            <doi>10.1016/j.ympev.2016.06.008</doi>
            <resource>http://linkinghub.elsevier.com/retrieve/pii/S1055790316301464</resource>
            <collection property="text-mining">
              <item>
                <resource mime_type="text/xml">http://api.elsevier.com/content/article/PII:S1055790316301464?httpAccept=text/xml</resource>
              </item>
              <item>
                <resource mime_type="text/plain">http://api.elsevier.com/content/article/PII:S1055790316301464?httpAccept=text/plain</resource>
              </item>
            </collection>
          </doi_data>
        </journal_article>
      </journal>
    </crossref>
  </doi_record>
</doi_records>
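
To make the comparison concrete, here is a minimal sketch of pulling the pieces a publication row might need out of a record shaped like the one above, using Python's standard `xml.etree.ElementTree`. The sample is a trimmed copy of that record (same element names, only two authors); the function name and the returned dict keys are hypothetical, and how these pieces would be assembled into FULL_CITATION or SHORT_CITATION is left open.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the CrossRef record above: same element names, fewer authors.
SAMPLE = """<doi_records>
  <doi_record owner="10.1016" timestamp="2017-10-07 03:37:06">
    <crossref>
      <journal>
        <journal_metadata language="en">
          <full_title>Molecular Phylogenetics and Evolution</full_title>
        </journal_metadata>
        <journal_issue>
          <journal_volume><volume>103</volume></journal_volume>
        </journal_issue>
        <journal_article publication_type="full_text">
          <titles>
            <title>Recurrent hybridization and recent origin obscure phylogenetic relationships within the ‘white-headed’ gull ( Larus  sp.) complex</title>
          </titles>
          <contributors>
            <person_name contributor_role="author" sequence="first">
              <given_name>Sarah A.</given_name><surname>Sonsthagen</surname>
            </person_name>
            <person_name contributor_role="author" sequence="additional">
              <given_name>Robert E.</given_name><surname>Wilson</surname>
            </person_name>
          </contributors>
          <publication_date media_type="print"><month>10</month><year>2016</year></publication_date>
          <pages><first_page>41</first_page><last_page>54</last_page></pages>
          <doi_data><doi>10.1016/j.ympev.2016.06.008</doi></doi_data>
        </journal_article>
      </journal>
    </crossref>
  </doi_record>
</doi_records>"""


def publication_from_crossref(xml_text):
    """Extract basic publication fields from one CrossRef XML record."""
    root = ET.fromstring(xml_text)
    article = root.find(".//journal_article")
    # Keep authors in document order as (surname, given_name) pairs.
    authors = [
        (person.findtext("surname"), person.findtext("given_name"))
        for person in article.findall("contributors/person_name")
        if person.get("contributor_role") == "author"
    ]
    # Collapse stray whitespace such as "( Larus  sp.)" in the title.
    title = " ".join(article.findtext("titles/title").split())
    first = article.findtext("pages/first_page")
    last = article.findtext("pages/last_page")
    return {
        "published_year": int(article.findtext("publication_date/year")),
        "title": title,
        "journal": root.findtext(".//journal_metadata/full_title"),
        "volume": root.findtext(".//journal_issue/journal_volume/volume"),
        "pages": f"{first}-{last}",
        "doi": article.findtext("doi_data/doi"),
        "authors": authors,
    }


record = publication_from_crossref(SAMPLE)
print(record["published_year"], record["doi"])
```

Note that the sketch assumes every element it reads is present; a real record with a missing year or page range would need None checks.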

I think the first step is probably experimenting with bibliography tools - can anything deal with these data, and how does it format the output from that?

@Jegelewicz
Member

Dealing with duplicate publications is a huge mess; they almost inevitably each end up holding part of the citations and authors never quite line up and etc. I'm not quite sure how to avoid that with a bulkloader.

One way to potentially reduce duplicates would be to parse out some of the citation. We could leave the full citation field, but if we added something like:

PUBLICATION_TITLE
PUBLICATION_PAGES
PUBLICATION_JOURNAL

Maybe it would be easier to pick out when someone is attempting to add a duplicate?

@dustymc
Contributor

dustymc commented Jun 13, 2018

We had that structure WAY back in the day and got rid of it for simplicity. I don't think it ever did anything very useful: there are still about 800 ways to represent most titles (especially those with formatting), it was constant work to add to the code tables (journal name, etc.), and even that got duplicates fairly often. There were long discussions about publishers changing names and what to do with "gray literature," and putting it all back together into a "citation" that might be found outside of Arctos was near impossible. I'm less than enthusiastic about reintroducing any of that.

Those kinds of data from CrossRef are often a mess too, but with DOIs that's also (mostly) irrelevant - the DOI itself gets you where you need to be.

I REALLY like DOIs, and I really dislike dealing with publications without them. Maybe "we" (whoever that is!) should explore a partnership with BHL. They obviously have some relationship with crossref, they're assigning DOIs to old publications (https://doi.org/10.5962/bhl.title.327), perhaps we could enter (maybe via webservice - I have no idea what's possible) publications without DOIs there and require DOIs on the Arctos side?

@Jegelewicz
Member

Maybe "we" (whoever that is!) should explore a partnership with BHL

I like this idea a lot, unfortunately "we" are already overtaxed! This seems like a natural partnership though. I will see if I can find a BHL contact and just ask the question...

@campmlc

campmlc commented Jun 14, 2018 via email

@Jegelewicz
Member

We could do a joint grant application to IMLS....

Email sent today:
I am the treasurer for the Arctos Collaborative Collections Management Solution Consortium (http://arctos.database.museum/). We have worked hard to include citations related to our specimen records (http://arctos.database.museum/SpecimenUsage.cfm) and we wonder if BHL might be interested in a partnership that would help enrich biological specimen records as well as BHL records. One issue we would like to resolve is the lack of DOIs for some of the literature citing our specimens. If you are interested in a discussion about how we could work together, please email me at arctos.treasurer@gmail.com

Thank you,

Teresa Mayfield-Meyer
Arctos Treasurer

@dustymc
Contributor

dustymc commented Jun 14, 2018

A data entry form (at BHL - could be webservice, API, form, WHATEVER - I don't think that matters at all) that returns a DOI would be amazing. I'd absolutely push for that to be our only option. (See below - I'd push harder now!) Duplicates can still happen, but from there it's crossref's problem (we'd just have multiple DOIs that point to the same publication - not ideal, but still works). Some of those ~6K DOIless publications are field notes and such, but I think we could offer them at least a couple thousand publications. All we'd need from them is access to whatever they use to apply DOIs to old publications, and as above I think there's a great deal of flexibility in how we do that.

Recent publications are a complete mess. DOIs aren't being entered - the proportion of publications with a DOI keeps shrinking.

UAM@ARCTOS> select count(*) from publication;

  COUNT(*)
----------
      7533

1 row selected.

Elapsed: 00:00:00.03
UAM@ARCTOS> select count(*) from publication where doi is not null;

  COUNT(*)
----------
      1191

1 row selected.
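
Worked out from the two counts above (plain arithmetic; the variable names are just labels), this is where the "~6K DOIless publications" figure comes from:

```python
total = 7533      # select count(*) from publication
with_doi = 1191   # select count(*) from publication where doi is not null

without_doi = total - with_doi
share_with_doi = with_doi / total

print(without_doi)               # 6342
print(f"{share_with_doi:.1%}")   # 15.8%
```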

I found https://arctos.database.museum/publication/10006284 yesterday - I have no idea what's going on there, but it shouldn't have happened.

https://arctos.database.museum/publication/10007115 has no DOI, but a (flaky) URL to a page with a DOI in "storage location."

Annnd there are hundreds more, including malformed DOIs, good DOIs, things that lead to but are not DOIs, and even some actual storage location data! Can we talk about who has access to publications, or training, or something? This should not happen. I'm scared to look in remarks.

I'm not sure if a bulkloader might make it better or worse. (Unless BHL saves the day, maybe we should route everything through a bulkloader - like specimen records - so we can check this sort of thing before we let it in??)

UAM@ARCTOS> select PUBLICATION_LOC from publication where PUBLICATION_LOC is not null group by PUBLICATION_LOC order by publication_loc;

PUBLICATION_LOC
------------------------------------------------------------------------------------------------------------------------
Accession file UAM:EH 0236
Accession file UAM:EH UA2017-011
Arctos
BioSciences Library
DOI: 10.1002/jmor.10986
Grinnell-Miller Library
Grinnell-Miller Library; Grinnell-Miller Library
Herp Library
Herp Reprint
Herp Reprints
ISBN 1-884549-33-0
In, Mammal Collection Management(H. H. Genoways, C. Jones , and O. L. Rossolimo, eds.).
Kenai National Wildlife Refuge
MVZ
MVZ Mammal library
Mammal library
No suitable DOI could be found - D. Piquard 2018-06-13
On file in the UAM Ethnology & History department
Publication in Wake's office.
Special Publication of The Museum of Texas Tech University.
UAM Earth Science Department
UC/Jepson Herbaria
University of Alaska Museum of the North, Ethnology & HIstory collections library
Zootaxa 1948: 57–68 (2008).
bird reprints
curatorial area
doi:10.103812251155aO
doi:10.1098/rspb.2010.0171
herp reprints
http//digitalcommons.unl.edu/parasitologyfacpubs/547
http://advances.sciencemag.org/content/advances/1/5/e1500055.full.pdf
http://books.google.com/books?id=EEtAAAAAIAAJ&printsec=frontcover&source=gbs_ge_summary_r&cad=0#v=onepage&q&f=false
http://digitalcommons.unl.edu/cgi/viewcontent.cgi?article=1029&context=parasitologyfacpubs
http://digitalcommons.unl.edu/parasitologyfacpubs/548
http://digitalcommons.unl.edu/parasitologyfacpubs/549
http://digitalcommons.unl.edu/parasitologyfacpubs/721
http://dx.doi.org/10.5169/seals-88790
http://dx.doi.org/10.5169/seals-88801
http://mesoamericanherpetology.com/uploads/3/4/7/9/34798824/othercontributionsdec2015.pdf
http://mormyrids.lifedesks.org/node/20
http://pubs.usgs.gov/sir/2013/5048/
http://www.bione.org/doi/full/10.3398/064.070.0202
http://www.bioone.org/doi/full10.1644/09-MAMM-A-018.1
http://www.checklist.org.br/getpdf?NGD057-12
http://www.jstor.org
http://www.jstor.org/stable/1373780
http://www.jstor.org/stable/1373791
http://www.jstor.org/stable/1375116
http://www.jstor.org/stable/1375235
http://www.jstor.org/stable/1375270
http://www.jstor.org/stable/1375419
http://www.jstor.org/stable/1375438
http://www.jstor.org/stable/1375688
http://www.jstor.org/stable/1375847
http://www.jstor.org/stable/1375978
http://www.jstor.org/stable/1376091
http://www.jstor.org/stable/1376127
http://www.jstor.org/stable/1376557
http://www.jstor.org/stable/1376561
http://www.jstor.org/stable/1377027
http://www.jstor.org/stable/1377086
http://www.jstor.org/stable/1377090
http://www.jstor.org/stable/1377509
http://www.jstor.org/stable/1377856
http://www.jstor.org/stable/1378352
http://www.jstor.org/stable/1378548
http://www.jstor.org/stable/1378684
http://www.jstor.org/stable/1378748
http://www.jstor.org/stable/1379716
http://www.jstor.org/stable/1379850
http://www.jstor.org/stable/1381544
http://www.jstor.org/stable/1382279
http://www.jstor.org/stable/1382729
http://www.jstor.org/stable/1382831
http://www.jstor.org/stable/1437506
http://www.jstor.org/stable/1678391
http://www.jstor.org/stable/1748401
http://www.jstor.org/stable/1929729
http://www.jstor.org/stable/1930109
http://www.jstor.org/stable/2387658
http://www.jstor.org/stable/241778
http://www.jstor.org/stable/2421532
http://www.jstor.org/stable/2421574
http://www.jstor.org/stable/2421593
http://www.jstor.org/stable/2421613
http://www.jstor.org/stable/2421874
http://www.jstor.org/stable/2421931
http://www.jstor.org/stable/2425072
http://www.jstor.org/stable/2459788
http://www.jstor.org/stable/30055132
http://www.jstor.org/stable/30092316
http://www.jstor.org/stable/3223178
http://www.jstor.org/stable/3223251
http://www.jstor.org/stable/3223322
http://www.jstor.org/stable/3223396
http://www.jstor.org/stable/3223500
http://www.jstor.org/stable/3223545
http://www.jstor.org/stable/3223549
http://www.jstor.org/stable/3224557
http://www.jstor.org/stable/3272259
http://www.jstor.org/stable/3272599
http://www.jstor.org/stable/3272716
http://www.jstor.org/stable/3272717
http://www.jstor.org/stable/3273159
http://www.jstor.org/stable/3273168
http://www.jstor.org/stable/3273266
http://www.jstor.org/stable/3273306
http://www.jstor.org/stable/3273389
http://www.jstor.org/stable/3273578
http://www.jstor.org/stable/3273658
http://www.jstor.org/stable/3273681
http://www.jstor.org/stable/3273835
http://www.jstor.org/stable/3274128
http://www.jstor.org/stable/3503885
http://www.jstor.org/stable/3535866
http://www.jstor.org/stable/3669029
http://www.jstor.org/stable/3795565
http://www.jstor.org/stable/3795820
http://www.jstor.org/stable/3796325
http://www.jstor.org/stable/4070425
http://www.jstor.org/stable/4523630
http://www.mcz.harvard.edu/Publications/search_pubs.html?publication=bullmcz
http://www.mesoamericanherpetology.com/previous-issues.html
http://www.sciencedirect.com/science/article/pii/S105579030500357X
https://archive.org/details/BulletinV6N8
https://archive.org/details/NaturalHistorySurveyN6
https://archive.org/details/birdandallnature06naturich
https://archive.org/details/birdsnaturemagaz03marbrich
https://ib.berkeley.edu/labs/wake/369_newWormSalamander.pdf
https://repository.unm.edu/bitstream/handle/1928/26728/MSB-Special-Pub-N09-HopeParmenter2007.pdf?sequence=1&isAllowed=y
https://www.biodiversitylibrary.org/page/10444921
https://www.biodiversitylibrary.org/page/15506909#page/605/mode/1up
https://www.biodiversitylibrary.org/part/18781
https://www.dropbox.com/s/c68mbgk2990rruh/HR%2BDec%2B2010%2Bebook%2Bv2.pdf?dl=1
https://www.fs.fed.us/pnw/pubs/journals/pnw_2005_parmenter001.pdf
https://www.penn.museum/sites/expedition/the-vanishing-art-of-the-arctic/
journal=Z. Parasitenk.
mammal reprints
reprint library
www.mapress.com/zootaxa/

140 rows selected.
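
A first pass at sorting the values above into "good DOIs, malformed DOIs, URLs, and actual storage locations" could look like the sketch below. The category names, the function name, and the DOI pattern ("10." plus 4 to 9 digits, a slash, and a suffix) are assumptions for illustration. It only recognizes DOIs written with a "doi:" prefix or a doi.org URL, so a bare DOI buried inside some other URL is still reported as a URL.

```python
import re

# Assumption: a well-formed DOI is "10.<4-9 digits>/<suffix with no spaces>".
DOI_RE = re.compile(r"^10\.\d{4,9}/\S+$")
# Prefixes seen in the list above: "DOI: ", "doi:", "http://dx.doi.org/".
DOI_PREFIX_RE = re.compile(
    r"^(?:doi:\s*|https?://(?:dx\.)?doi\.org/)", re.IGNORECASE
)
# Also catches the "http//..." values that are missing the colon.
URL_RE = re.compile(r"^(?:https?:?//|www\.)", re.IGNORECASE)


def classify_location(value):
    """Label a PUBLICATION_LOC value: 'doi', 'malformed doi', 'url', or 'other'."""
    text = value.strip()
    stripped, count = DOI_PREFIX_RE.subn("", text)
    if count:
        return "doi" if DOI_RE.match(stripped) else "malformed doi"
    if URL_RE.match(text):
        return "url"
    return "other"


for sample in (
    "DOI: 10.1002/jmor.10986",
    "doi:10.103812251155aO",
    "http://dx.doi.org/10.5169/seals-88790",
    "http//digitalcommons.unl.edu/parasitologyfacpubs/547",
    "Herp Library",
):
    print(f"{classify_location(sample):14} {sample}")
```

Anything labeled "doi" or "malformed doi" is a candidate for moving into (or fixing on the way to) the DOI column; "url" values would still need a person to look at them.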

More problem pubs at #1570

@campmlc

campmlc commented Jul 3, 2018 via email

@dustymc dustymc added this to the Needs Discussion milestone Jul 17, 2018
@dustymc dustymc added Blocked Issue cannot be addressed until another Issue (which should be linked) is addressed. and removed Blocked: Needs Discussion labels Mar 17, 2021
@dustymc
Contributor

dustymc commented Feb 16, 2022

Is there still interest in this? If so, I still need what's requested in #1565 (comment); if not, please close.

@dustymc
Contributor

dustymc commented Aug 29, 2024

If this gets revived, it should not proceed before ArctosDB/dev#41.
