Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

8372 gdcc xoai library #8753

Merged
merged 40 commits into from
Oct 13, 2022
Merged

8372 gdcc xoai library #8753

merged 40 commits into from
Oct 13, 2022

Conversation

landreev
Copy link
Contributor

@landreev landreev commented May 27, 2022

What this PR does / why we need it:
Big technical debt PR; eliminates the custom jars of the old lyncode XOAI library; replaces them with dependencies for the new io.gdcc.xoai packages; fixes some known issues in the process.
Still a draft; basic functionality is there; still working on more tests, although I'm considering splitting that out as a separate issue.

Which issue(s) this PR closes:

Special notes for your reviewer:

The scope of this PR has ballooned A LOT. It was initially started under the assumption that it was simply to accommodate the move of the XOAI libraries from local_lib into a gdcc repo as they were, with only some minimal changes, leaving any further improvements and cleanup of the modifications in the original fork for later, as separate issues/PRs. Instead a very thorough refactoring of the entire xoai package was done before a stable 5.0 RC release was made on gdcc/xoai, reimplementing and/or getting rid of the early dataverse hacks and solving many issues in the process. Huge thanks to @poikilotherm for all the work there.
This is all a "good thing, not a bad thing"; it was all very real technical debt that we needed to address rather sooner than later. This has allowed to very much streamline the dataverse-side OAI implementation and get rid of most of the code that was maintained there. There's more that still needs to be done on the dataverse side for sure. But since this has already been a much larger effort than was originally scheduled, I really want to move the PR along and then proceed to handle the remaining improvements as separate tasks.

Suggestions on how to test this:

Does this PR introduce a user interface change? If mockups are available, please link/include them here:

Is there a release notes update needed for this change?:

Additional documentation:

@landreev
Copy link
Contributor Author

landreev commented Jun 1, 2022

@poikilotherm Hi, Harvard just gave us all an extra mini-vacation, but I'm back at work. I made this draft PR last week that finally gets rid of the old custom jars in local_lib. Once again, I cannot fully express my gratitude for your leading of this effort.
I was able to easily get rid of the extra xoai classes on the dataverse side; and to implement almost everything using the new mechanisms provided in gdcc.xoai, as we planned. I created a local Condition in order to pass the MetadataFormat to the getItems() methods in the ItemRepository implementation via a ScopedFilter. Etc. etc.
There are extra things that need to be cleaned up: the Date -> Instant change is currently handled via quick conversions between the two (i.e., the underlying entities are left unchanged, using Date for the stamps); it clearly needs more API tests. I'm working on this, but considering splitting the task into a separate issue.

There is one thing I could not implement 1:1 as it was: the handling of the proprietary "Dataverse JSON harvesting". I wanted to consult with you about this. The way the JSON api line has been encoded in the original implementation:

<metadata directApiCall="..."/>

- This doesn't seem to be possible, to implement this using the gdcc.xoai framework without adding ugly hacks on the library side again (?). And it was a somewhat unfortunate choice in the first place. I feel like if we want to keep this "json harvesting" going forward, it should be something like

<metadata>
   <dataverseCustom directApiCall="..."/>
</metadata>

instead. This I was able to easily implement in the item repository; but it obviously introduces a backward incompatibility. Do you have any thoughts on this? I'm also going to talk to the core dev. team, about whether we want to continue supporting this functionality; and/or if we want to change it.

@landreev landreev self-assigned this Jul 11, 2022
pom.xml Outdated Show resolved Hide resolved
pom.xml Show resolved Hide resolved
@kcondon kcondon self-assigned this Jul 27, 2022
@kcondon
Copy link
Contributor

kcondon commented Jul 27, 2022

Issues so far:

OK, editing to better spell out the client and server versions used. Using v5.12 to refer to this branch. Testing against both prod and demo for v5.11, will identify which host, too.

  1. Using v5.12 on Internal as client harvesting from v5.11 as the server, Datacite, oai_dataset, and dataverse_json do not work. Both datacite formats log import format not supported and json reports, edu.harvard.iq.dataverse.util.json.JsonParseException (Invalid license: CC0) when v5.11 server is either prod or demo, Failed to import harvested dataset: class edu.harvard.iq.dataverse.util.json.JsonParseException (Specified metadatalanguage not allowed.) appears to be another part of the stack traces.
  2. Using v5.12 on Internal acting as the OAI server with Demo as the v5.11 client, same formats fail but json reports a different error: unable to find valid certification path to requested target.

@landreev
Copy link
Contributor Author

As for the first 2 formats under 1., we really don't support them (at least for now), so we should probably hide them from the pulldown and/or not allow people to select any formats except the supported ones.
As for dataverse_json, just to confirm, are you saying that it's failing with the same license issue on develop/5.11 as well, not just on my branch?

@landreev
Copy link
Contributor Author

As for

  1. Harvesting as a server using a v5.11 client, same formats fail but json reports a different error: unable to find valid certification path to requested target

dataverse_json does require an additional http call (directly to the export API, to obtain json), outside of the OAI library. So it does not surprise me that the code that makes this call may be more sensitive to whether the cert is valid or not. But if the client is 5.11 in this context, then this is the existing behavior.

@kcondon
Copy link
Contributor

kcondon commented Jul 27, 2022

@landreev I wasn't saying that but since you asked I have now tested json using v5.11 as both client (demo) and server (prod) and get the same license error.

@poikilotherm
Copy link
Contributor

poikilotherm commented Jul 28, 2022

@kcondon sounds like this reveals some internal network problems?! If there is anything I can do, please reach out (here/Matrix/Slack/Issue in XOAI/...)

@landreev
Copy link
Contributor Author

@poikilotherm

sounds like this reveals some internal network problems?! If there is anything I can do, please reach out (here/Matrix/Slack/Issue in XOAI/...)

I don't think it's network problems. It looks like it literally is what it says - a problem trying to validate the cert of the remote server. Which in turn could have happened for various reasons... I'm working with Kevin on taking these cert issues out of the equation, so that we can focus on testing the XOAI functionality proper.
Regardless, if we need anything from you, we absolutely know where you live, yes!

@landreev
Copy link
Contributor Author

I wasn't saying that but since you asked I have now tested json using v5.11 as both client (demo) and server (prod) and get the same license error.

OK, good to know. I'll figure this out, even if it's not specific to my branch. Again, to be able to take this error out of the equation/be able to test before vs. after.
We were having various license-related import issues right after that PR that refactored everything license-related; with somewhat similar errors in the exception stack traces. Most of those were trivial - like, there were places in the code where the license server bean was not properly passed to the json parser (the new requirement). Although I thought we had fixed them all. Anyway, I'll figure it out, will talk to Jim if necessary.

@landreev
Copy link
Contributor Author

So, about those dataverse_json import errors:
The Invalid license: CC0 error was there because some of the prod. datasets in the set you were testing did not get re-exported after 5.10, when the license was changed. All the datasets have now been re-exported, so this should no longer be an issue.
However, every single prod. dataset is going to be failing to import/harvest in dataverse_json by 5.11 clients now, with a different error: Specified metadatalanguage not allowed. This is on account of #8868.

I'm guessing this means that the only way to test harvesting dataverse_json now is to have the new metadataLanguages setting defined on both the server and client; and have every dataset in the OAI set configured with a metadataLanguage value recognized by the client... which frankly feels like serious pain. So waiting to have a fix for #8868 merged may be a better idea.

On the [somewhat] plus side, I still think that the very fact that these dataverse_json harvests are failing with the same import errors in my branch does mean that the harvesting framework is working as it should be. By the nature of the changes there, if it were broken, it would be failing to obtain the json, and never getting to the import stage.

@landreev landreev self-assigned this Aug 15, 2022
@landreev
Copy link
Contributor Author

(Just put my name on the PR to update the status, as requested in standup)
The last comment above is still an accurate description of the status of the PR: the serious issues with harvesting JSON that were identified during the first QA pass are unrelated to the PR itself. The issues in question should be addressed in #8868, which appears to be close to being done. Once it's merged, this PR can be re-tested. To the best of my knowledge everything else in it should be working; at least working as well, or better than it has been up until now, which is the goal of this PR. (for example, one small thing identified above - that we allow an admin to select a format from the pulldown that Dataverse does not know how to import - is a pre-existing issue; that I'm assuming we don't want to try to address during this pass).

@kcondon kcondon removed their assignment Aug 15, 2022
@landreev
Copy link
Contributor Author

landreev commented Sep 7, 2022

Since the json export PR has been merged, this issue is ready to be re-tested.
Re-testing the functionality directly involving json export (i.e., harvesting in dataverse_json format) is fairly straightforward (just needs going through all the combinations of "old" and "new" on the client and server ends). But then I'm not sure if whoever ends up working on wrapping up this PR will want to re-test everything from scratch (?).
I will work with this person to provide guidance/answer questions.

@landreev landreev removed their assignment Sep 7, 2022
@landreev landreev self-assigned this Sep 22, 2022
@coveralls
Copy link

coveralls commented Sep 23, 2022

Coverage Status

Coverage increased (+0.09%) to 19.974% when pulling e448b68 on 8372-gdcc-xoai-library into 7c1683b on develop.

@landreev
Copy link
Contributor Author

OK, everything has been re-tested very carefully.
Additionally, harvesting dataverse_json forrmat, that could not be fully tested during the first QA attempt mainly because of the export bug in 5.11 was thoroughly tested as follows:

This branch simplified the process of harvesting of our proprietary json records; it skips the step where the client gets a hacked GetRecord entry from the server and extracts the url for obtaining the JSON record. It just calls the export API directly, since the url for it can be derived from knowing the server address. However, any remaining Dataverse clients not yet upgraded to this version will continue using the old model. So for backward compatibility, this new version of xoai-server still knows how to produce the old-style, custom version of GetRecord with the url embedded (which is unfortunate, because it's really hacky. But we will get rid of it in a future version, once it is safe to assume that few if any old-style clients harvesting this format are still out there). So all three possible combinations of harvesting dataverse_json had to be tested: new-to-new; old-to-new; new-to-old. Testing was done using my own dev. build as the client, and either dataverse-internal or the perf. cluster as the server.

@landreev landreev merged commit da6db57 into develop Oct 13, 2022
@pdurbin pdurbin added this to the 5.12.1 milestone Oct 28, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Update and reorganize the XOAI dependencies under local_lib
5 participants