
7052 Fix infinite IN PROGRESS and DELETE IN PROGRESS labels #7053

Closed
wants to merge 2 commits into from

Conversation

@JingMa87 (Contributor) commented Jul 4, 2020

What this PR does / why we need it: Fixes infinite IN PROGRESS and DELETE IN PROGRESS labels

Which issue(s) this PR closes: 7052

Closes #7052

Special notes for your reviewer: People will be happy about this

Suggestions on how to test this: Run a harvesting client and delete another one at the same time, then restart Glassfish immediately when you see the IN PROGRESS and DELETE IN PROGRESS labels. After the restart you'll see the RUN FAILED and DELETE FAILED labels instead of the old ones, and you can edit the client again, so you can rerun or delete it as usual.

Does this PR introduce a user interface change? If mockups are available, please link/include them here: Yes

[screenshot attached]

Is there a release notes update needed for this change?: Yes, it's a noticeable change.

Additional documentation: N/A

@coveralls commented Jul 4, 2020

Coverage Status

Coverage decreased (-0.006%) to 19.647% when pulling e017f6c on JingMa87:7052-fix-hanging-harvesting-run into 6daf219 on IQSS:develop.

@pdurbin (Member) left a comment

I'm not in a great position to review this code (having not worked on harvesting) but from a quick look it seems ok. It would be nice for @landreev or another developer who has worked on harvesting to take a look.

@@ -5,16 +5,9 @@
*/
package edu.harvard.iq.dataverse.harvest.client;

import javax.persistence.*;
(Member)

I'm surprised it isn't already written at http://guides.dataverse.org/en/4.20/developers/coding-style.html but we tend not to use wildcard imports like this.

import javax.ejb.Singleton;
import javax.ejb.Startup;

@Singleton
(Member)

Maybe I'm just being paranoid but we've had trouble with Singleton beans recently. See IQSS/dataverse.harvard.edu#73

@JingMa87 (Contributor, Author) commented Jul 13, 2020

I read the issue you linked very closely since the last thing I want to do is introduce new errors. In the issue there's an authentication bean that's a singleton, and when the bean gets held up by a process, no other clients can use it to authenticate because there's only one instance. This is bad. However, in my use case the singleton is only used during the startup of Dataverse to reset any hanging harvesting clients. After that there's no use for it, so a singleton bean seems the right approach here. Moreover, Dataverse has many singleton beans with an additional @Startup annotation to initialize a certain process, and those don't seem to be causing issues either. So I'm fairly confident that a singleton won't cause issues here.
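For illustration, a minimal sketch of the kind of @Singleton @Startup bean described above; the class name, the injected HarvestingClientServiceBean field, and the resetHangingClientStatuses() helper are assumptions for this sketch, not necessarily what the PR branch contained:

package edu.harvard.iq.dataverse.harvest.client;

import javax.annotation.PostConstruct;
import javax.ejb.EJB;
import javax.ejb.Singleton;
import javax.ejb.Startup;

// Hypothetical sketch: the bean runs once while the application is starting
// and is not looked up by request-handling code afterwards, so the single
// instance is never contended the way the authentication bean in the linked
// issue was.
@Singleton
@Startup
public class HarvestingClientCleanupBean {

    @EJB
    HarvestingClientServiceBean harvestingClientService;

    @PostConstruct
    public void resetHangingHarvestJobs() {
        // Assumed helper: flips any leftover "in progress" / "delete in progress"
        // statuses from an interrupted run to their corresponding failed states.
        harvestingClientService.resetHangingClientStatuses();
    }
}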

@landreev (Contributor) commented Jul 21, 2020

I'm not worried about it being a singleton - seeing how it has nothing in it but the @PostConstruct code.
But there's another reason why this is not going to work, specifically for us here at Harvard. We have 2 production servers. So if one of the two is being restarted, it'll reset the flags even though the actions in question may still be in progress on the other node.

It's easy to address for the most part: in a multi-server situation only one node is designated to run harvesting and the other timer-run jobs. This "master node" is the one that has the JVM option -Ddataverse.timerServer=true, and you can use if (systemConfig.isTimerServer()) in your code to check whether this node is the timer node before proceeding to clear the flags.

HOWEVER, there is an exception to the rule - when a harvesting run is started not by the timer, but manually, through the harvesting client page and/or API, the job can end up running on either of the multiple nodes.

Admittedly, this will not affect many other dataverse installations. (I'm actually not aware of another production installation at this point that is using a multiple server setup). But this would affect us, so I am a bit hesitant about doing this.
And I'm not really positive as to what the best reliable solution is. That ClientHarvestRun object would need an extra entry identifying the node actually running the harvest, maybe?
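A minimal sketch of that guard, extending the hypothetical startup bean sketched earlier (the SystemConfig injection point and the helper method remain assumptions; isTimerServer() is the check named above):

    @EJB
    SystemConfig systemConfig;

    @PostConstruct
    public void resetHangingHarvestJobs() {
        // Only the node started with -Ddataverse.timerServer=true clears the
        // flags; other nodes in a multi-server installation skip the reset.
        if (!systemConfig.isTimerServer()) {
            return;
        }
        harvestingClientService.resetHangingClientStatuses(); // assumed helper, as above
    }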

@landreev (Contributor) commented Jul 21, 2020

Another solution would be to change how the manually-run harvests are handled. I.e., instead of actually running the harvest it could create a one-time timer entry for 5 sec. from now. So it will then be picked up by the timer service on the master node - and we are guaranteed to have all the harvesting runs staying on one server.
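Roughly, and only as an illustration under the assumption of a plain EJB TimerService (the actual Dataverse timer plumbing may differ, and the class and method names here are made up), that idea could look like:

import java.io.Serializable;
import javax.annotation.Resource;
import javax.ejb.Stateless;
import javax.ejb.Timeout;
import javax.ejb.Timer;
import javax.ejb.TimerConfig;
import javax.ejb.TimerService;

@Stateless
public class ManualHarvestScheduler { // hypothetical class name

    @Resource
    TimerService timerService;

    // Instead of running the harvest directly on whichever node served the
    // page/API request, queue a one-shot timer ~5 seconds out; per the
    // suggestion above, the timer service on the master node would then
    // pick up the run.
    public void scheduleHarvest(Serializable harvestingClientId) {
        timerService.createSingleActionTimer(5000, new TimerConfig(harvestingClientId, false));
    }

    @Timeout
    public void onTimeout(Timer timer) {
        Serializable clientId = timer.getInfo();
        // ... start the actual harvesting run for clientId here ...
    }
}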

(Contributor)

I'm not 100% sure that having these harvest-related flags cleared automatically is worth adding complicated code to address the multiple-server scenario...
So yet another, much simpler, solution would be to implement it as above, but with an option to disable this mechanism; on startup the service above would check the setting before proceeding: if (enableClearingHarvestJobsOnStartup) { ... }. It would be enabled by default, so it would work for most installations. But then the guide would say "make sure to disable this mechanism if you are running multiple servers!", somewhere in the same part of the guide where we mention the dataverse.timerServer setting.

@JingMa87 (Contributor, Author) commented Aug 4, 2020

@landreev I think it's best to take the simplest approach and only build in support for single-server scenarios. My plan is to add a sane default of "true" for a configurable DB key "ClearingHarvestJobsOnStartupEnabled" using the SettingsServiceBean. Do you have any objections?
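A minimal sketch of that plan, again inside the hypothetical startup bean sketched earlier, assuming the usual isTrueForKey(key, default) accessor on SettingsServiceBean and a new (hypothetical) Key enum entry for the proposed setting:

    @EJB
    SettingsServiceBean settingsService;

    @PostConstruct
    public void resetHangingHarvestJobs() {
        // Hypothetical DB setting, defaulting to true as proposed above;
        // multi-server installations would set it to false.
        if (!settingsService.isTrueForKey(
                SettingsServiceBean.Key.ClearingHarvestJobsOnStartupEnabled, true)) {
            return;
        }
        // ... proceed with clearing the hanging "in progress" statuses ...
    }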

@@ -40,12 +40,13 @@ public void setId(Long id) {
this.id = id;
}

-    public enum RunResultType { SUCCESS, FAILURE, INPROGRESS };
+    public enum RunResultType { SUCCESS, RUN_FAILED, RUN_IN_PROGRESS, DELETE_FAILED };
@landreev (Contributor) commented Jul 21, 2020

Not a big deal, but what is the value of adding this status condition for a failure to delete a client? It has some value for a dataverse admin, to know that the last attempt to harvest from a certain server resulted in a failure. Is it really useful to know that an attempt to delete a client failed? - should the startup check simply remove the "delete in progress" flag quietly instead?
After all, seeing how the client is still there makes it somewhat clear that the attempt to get rid of it didn't work out, so they should try again?

I may be missing some situation where it could actually be useful - so I'm open to hearing it.

@JingMa87 (Contributor, Author) commented Aug 4, 2020

@landreev I guess handling it quietly would be ok. So instead of DELETE FAILED or FAILED we just show a success label (those are the only options)?

@pdurbin (Member) commented Jan 3, 2023

I'm not sure of the status of this PR. @JingMa87 recently explained that he switched to another job: #7412 (comment)

@janvanmansum has been helping explain which issues and PRs are a priority for DANS.

For this one, @landreev recently added the ability to stop a harvest in progress...

... which seems related, at least. So I'm not sure if we need this PR or if it should be rewritten (the branch is gone from GitHub, by the way).

@mreekie removed the bk2211 label Jan 11, 2023
@pdurbin added the "Size: 3" label (a percentage of a sprint; 2.1 hours) Feb 28, 2024
@pdurbin (Member) commented Feb 28, 2024

@JingMa87 I'm closing this PR but if you're still interested in it, please reply! Thanks!

@pdurbin closed this Feb 28, 2024
Labels
Size: 3 (a percentage of a sprint; 2.1 hours)
Development

Successfully merging this pull request may close these issues.

Harvesting run infinitely stays at IN PROGRESS and DELETE IN PROGRESS
5 participants