Harvesting geonode (pycsw) by OAI-PMH does not work #6242

Closed
kamil386 opened this issue Oct 2, 2019 · 16 comments

Comments

@kamil386

kamil386 commented Oct 2, 2019

Harvesting GeoNode (pycsw) via OAI-PMH does not work. The details:

XML response from geonode (pycsw):
http://master.demo.geonode.org/catalogue/csw?mode=oaipmh

<!-- pycsw 2.4.0 --><oai:OAI-PMH xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd"><oai:responseDate>2019-10-01T18:51:35Z</oai:responseDate><oai:request>http://master.demo.geonode.org/catalogue/csw?mode=oaipmh</oai:request><oai:error code="badArgument">Missing 'verb' parameter</oai:error></oai:OAI-PMH>

On the other hand, here is a valid XML response from Dataverse, which can be harvested without any problem:
https://demo.dataverse.org/oai

<OAI-PMH xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd"><responseDate>2019-10-01T23:52:01Z</responseDate><request>https://demo.dataverse.org/oai</request><error code="badVerb">Illegal verb</error></OAI-PMH>

Logs:
[2019-10-02T01:58:18.164+0200] [glassfish 4.1] [INFO] [] [edu.harvard.iq.dataverse.HarvestingClientsPage] [tid: _ThreadID=51 _ThreadName=jk-connector(4)] [timeMillis: 1569974298164] [levelValue: 800] [[ metadataformats: failed;received empty list from ListMetadataFormats]]

Dataverse version: 4.16

Screenshot of the create harvesting client window in the dashboard: (image)

Maybe it is related to the oai: prefix in the tags (it is the only difference between the two XML responses), and Dataverse can't process that XML.
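
On the prefix theory: to a namespace-aware XML parser, a prefixed oai:error and an unprefixed error in the same default namespace are the same element. The Python sketch below illustrates only that point. The two sample documents are trimmed, hypothetical versions of the responses above with the xmlns declarations written out (the pasted responses do not show them), and error_code is an illustrative helper. It says nothing about how Dataverse or its OAI-PMH library actually parses responses.

```python
import xml.etree.ElementTree as ET

OAI_NS = "http://www.openarchives.org/OAI/2.0/"

# Hypothetical, trimmed-down responses: same namespace, different notation.
prefixed = (
    '<oai:OAI-PMH xmlns:oai="http://www.openarchives.org/OAI/2.0/">'
    '<oai:error code="badArgument">Missing \'verb\' parameter</oai:error>'
    '</oai:OAI-PMH>'
)
default_ns = (
    '<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">'
    '<error code="badVerb">Illegal verb</error>'
    '</OAI-PMH>'
)


def error_code(xml_text):
    """Return the code attribute of the first OAI-PMH error child, or None."""
    root = ET.fromstring(xml_text)
    # Look the element up by namespace URI, not by prefix.
    error = root.find(f"{{{OAI_NS}}}error")
    return None if error is None else error.get("code")


print(error_code(prefixed))
print(error_code(default_ns))
```

Both lookups succeed with the same namespace-qualified name, so the prefix alone would only matter to a client that matches tag names as plain strings.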

Providing the full OAI URL of GeoNode also does not work:
http://master.demo.geonode.org/catalogue/csw?mode=oaipmh&verb=ListRecords&set=citable&metadataPrefix=oai_dc

@djbrooke
Contributor

djbrooke commented Oct 2, 2019

Hi @kamil386, I'm not sure this is a Dataverse issue. When I click "Metadata search via OAI-PMH:" on the right side of the page here:

http://master.demo.geonode.org/developer/

It takes me to http://master.demo.geonode.org/catalogue/csw?mode=oaipmh&verb=Identify which shows what appears to be an error.

Can you reach out to the Geonode team to make sure this is working as expected before we start troubleshooting here? Thanks!

@djbrooke djbrooke closed this as completed Oct 2, 2019
@kamil386
Author

kamil386 commented Oct 7, 2019

@djbrooke The GeoNode team has just fixed that bug, but the same error still appears.
All six OAI-PMH protocol requests are valid and work in geonode/pycsw. This valid URL:
http://master.demo.geonode.org/catalogue/csw?mode=oaipmh&verb=ListMetadataFormats
gives an error in Dataverse logs:
[2019-10-07T12:34:04.775+0200] [glassfish 4.1] [INFO] [] [edu.harvard.iq.dataverse.HarvestingClientsPage] [tid: _ThreadID=52 _ThreadName=jk-connector(5)] [timeMillis: 1570444444775] [levelValue: 800] [[ metadataformats: failed;received empty list from ListMetadataFormats]]
which is not true.

Maybe the problem is that the oai: prefix in the tags is interpreted incorrectly, or that the additional required parameter "mode=oaipmh" in geonode/pycsw is not included in the Dataverse queries?

@pdurbin
Member

pdurbin commented Oct 7, 2019

@kamil386 is right, there are no more errors at http://master.demo.geonode.org/catalogue/csw?mode=oaipmh&verb=Identify (weird Django errors last week) so I'm re-opening this issue.

@pdurbin pdurbin reopened this Oct 7, 2019
@JingMa87
Contributor

JingMa87 commented Jun 19, 2020

Error
The error I'm getting when trying to harvest the GeoNode repository is an SSLHandshakeException. After digging into the code, I found that a faulty SSL certificate is not actually the cause; the exception is thrown by a library which handles the OAI-PMH HTTP requests. Anyhow, the real issue is that the request URL isn't constructed properly. The third-party library XOAI appends
"?verb=ListMetadataFormats" to the base URL ("https://master.demo.geonode.org/catalogue/csw?mode=oaipmh"), which results in the faulty URL "https://master.demo.geonode.org/catalogue/csw?mode=oaipmh?verb=ListMetadataFormats" with two question marks.
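
To see why the second question mark matters, here is a small Python sketch that splits the faulty URL the way a standard query-string parser would. query_params is an illustrative helper, not part of XOAI or pycsw, and the sketch does not show how the GeoNode server itself reacts.

```python
from urllib.parse import parse_qs, urlsplit

base_url = "https://master.demo.geonode.org/catalogue/csw?mode=oaipmh"
# Naive concatenation, as described above.
faulty_url = base_url + "?verb=ListMetadataFormats"


def query_params(url):
    """Parse the query string of a URL into a dict of value lists."""
    return parse_qs(urlsplit(url).query)


params = query_params(faulty_url)
print(params)            # {'mode': ['oaipmh?verb=ListMetadataFormats']}
print("verb" in params)  # False
```

Only the first question mark starts the query string, so the rest becomes part of the value of mode and no verb parameter reaches the server.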

Fix
There are two ways of fixing this issue:

  1. GeoNode changes its base URL so that it contains no question mark.
  2. XOAI fixes its code so that it checks whether the base URL already contains a question mark and, if so, appends the request parameters with an ampersand instead.
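
The second option could look roughly like the Python sketch below. It is not XOAI's code (XOAI is a Java library); build_request_url is a hypothetical helper that merges the OAI-PMH parameters into whatever query string the base URL already has, which covers both the question-mark and the ampersand case.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def build_request_url(base_url, **oai_params):
    """Append OAI-PMH parameters to base_url, keeping any existing query."""
    parts = urlsplit(base_url)
    # Existing parameters (e.g. mode=oaipmh) come first, new ones after.
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.extend(oai_params.items())
    return urlunsplit(parts._replace(query=urlencode(query)))


print(build_request_url(
    "https://master.demo.geonode.org/catalogue/csw?mode=oaipmh",
    verb="ListMetadataFormats",
))
print(build_request_url("https://demo.dataverse.org/oai", verb="Identify"))
```

A base URL with an existing query gets the verb appended after an ampersand, and a base URL without one gets a single question mark.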

@pdurbin
Member

pdurbin commented Jun 22, 2020

@JingMa87 thanks for all the investigation. Do you think the bug is on the Dataverse side or the Geonode side (or both)?

@JingMa87
Contributor

@pdurbin I did some more investigation and updated my last comment. The issue is definitely not with Dataverse.

@pdurbin
Member

pdurbin commented Jun 22, 2020

@JingMa87 thanks, from looking at our pom.xml we seem to be running a fork:

<!-- EXPERIMENTAL: -->
<!-- lyncode xoai OAI-PMH implementation: -->
<!-- unfortunately, their 4.10 version -->
<!-- is still buggy. As an experiment, I'm using -->
<!-- a patched version I built locally. -->
<!-- (pull requests pending - L.A. -->
<dependency>
    <groupId>com.lyncode</groupId>
    <artifactId>xoai-common</artifactId>
    <version>4.1.0-header-patch</version>
</dependency>
<dependency>
    <groupId>com.lyncode</groupId>
    <artifactId>xoai-data-provider</artifactId>
    <version>4.1.0-header-patch</version>
</dependency>
<dependency>
    <groupId>com.lyncode</groupId>
    <artifactId>xoai-service-provider</artifactId>
    <version>4.1.0-header-patch</version>
</dependency>

It looks like the code moved to https://github.com/DSpace/xoai and there's a version called 4.2.0.

@JingMa87
Contributor

JingMa87 commented Jun 22, 2020

@pdurbin I can test the newest version of the library with Dataverse, but I'm wondering what changes L.A. patched in. It might be that the newest version of XOAI doesn't have those changes, so I would have to test for that.

@pdurbin
Member

pdurbin commented Jun 22, 2020

@JingMa87 I mean, you certainly could but is this issue a high priority for you? If so, I can ask L.A. about that patch. If not, are there other issues we could re-direct your energy into? We really appreciate all the pull requests!

@JingMa87
Contributor

@pdurbin If any issues have more priority just let me know so I can discuss it with my coordinator. Do note that I'm a recently hired engineer for Data Archiving and Networked Services (DANS) in the Netherlands so I'm quite new to Dataverse. My current goal is to get to know the app more and in particular the harvesting client feature.

@pdurbin
Member

pdurbin commented Jun 23, 2020

@JingMa87 welcome to the Dataverse community! Here are a few harvesting-related issues you might want to read through:

@jggautier thinks a lot about harvesting and might have some other issues in mind. I'd also like to bring it to @landreev 's attention that we may have a future harvesting hacker in our midst. 😄 Thanks!

@jggautier
Contributor

Thanks @pdurbin and hi @JingMa87. The issue IQSS/dataverse.harvard.edu#72 - about special characters in dataset metadata preventing other repositories from harvesting from Harvard Dataverse - is the harvesting-related issue most pressing to me right now. It's in Harvard Dataverse's GitHub repo, but as far as I know it's possible that other repositories would be or are being affected by it. That is, other repositories have characters in their metadata exports that are preventing others from harvesting from them. I did as much digging as I could, but wouldn't know how to proceed.

@JingMa87
Contributor

@jggautier Sounds like something I can look into! Also do you or @pdurbin think I can get permissions to add issues and PRs to a project? Otherwise I would have to ask a colleague to do it every time I do something.

@pdurbin
Member

pdurbin commented Jun 24, 2020

@JingMa87 I just invited you to join https://github.com/orgs/IQSS/teams/dataverse-readonly . I hope that helps.

@JingMa87
Contributor

@pdurbin I validated GeoNode's repo URL on https://www.openarchives.org/Register/ValidateSite and it gives me the same error caused by the double question marks. I asked our functional manager to report this issue and a fix to GeoNode. Since the original poster of this issue is not commenting, I'd like to close it. Agreed?

@pdurbin
Member

pdurbin commented Jun 29, 2020

@JingMa87 thanks for the leg work. @kamil386 presented at our Dataverse Community Meeting just a couple weeks ago so he should be around. I'm getting the sense that the feeling is that there isn't a bug in Dataverse after all so yes, I'll close it, at least for now. Thanks.

@pdurbin pdurbin closed this as completed Jun 29, 2020