7052 Fix infinite IN PROGRESS and DELETE IN PROGRESS labels #7053
Conversation
I'm not in a great position to review this code (having not worked on harvesting) but from a quick look it seems ok. It would be nice for @landreev or another developer who has worked on harvesting to take a look.
@@ -5,16 +5,9 @@
 */
package edu.harvard.iq.dataverse.harvest.client;

import javax.persistence.*;
I'm surprised it isn't already written at http://guides.dataverse.org/en/4.20/developers/coding-style.html but we tend not to use wildcard imports like this.
import javax.ejb.Singleton;
import javax.ejb.Startup;

@Singleton
Maybe I'm just being paranoid but we've had trouble with Singleton beans recently. See IQSS/dataverse.harvard.edu#73
I read the issue you linked very closely, since the last thing I want to do is introduce new errors. In that issue, an authentication bean is a singleton, and when the bean gets held up by a process, no other clients can use it to authenticate because there is only one instance. That is bad. However, in my use case the singleton is only used during the startup of Dataverse to reset any hanging harvesting clients. After that there is no use for it, so a singleton bean seems like the right approach here. Moreover, Dataverse already has many singleton beans with an additional @Startup annotation that initialize a certain process, and those don't seem to be causing issues either. So I'm fairly confident that a singleton won't cause issues here.
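For reference, the startup-only pattern being described looks roughly like this. It is just a sketch: the class name and the resetHangingStatuses() method are illustrative placeholders, not the actual code in this PR.

```java
package edu.harvard.iq.dataverse.harvest.client;

import javax.annotation.PostConstruct;
import javax.ejb.EJB;
import javax.ejb.Singleton;
import javax.ejb.Startup;

/**
 * Sketch of a startup-only singleton: instantiated once when the application
 * starts, runs its @PostConstruct method, and is not used again afterwards.
 */
@Singleton
@Startup
public class HarvestingClientsCleanupBean {

    // Injected service that would know how to reset the leftover
    // IN PROGRESS / DELETE IN PROGRESS flags after a crash or restart.
    @EJB
    HarvestingClientServiceBean harvestingClientService;

    @PostConstruct
    public void resetHangingHarvestStatuses() {
        // resetHangingStatuses() is a placeholder name for whatever method
        // actually clears the stale flags in the database.
        harvestingClientService.resetHangingStatuses();
    }
}
```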
I'm not worried about it being a singleton, seeing how it has nothing in it but the @PostConstruct code.
But there's another reason why this is not going to work, specifically for us here at Harvard. We have 2 production servers. So if one of the two is being restarted, it'll reset the flags even though the actions in question may still be in progress on the other node.
It's easy to address for the most part: in a multi-server setup, only one node is designated to run harvesting and other timer-driven jobs. This "master node" is the one that has the JVM option -Ddataverse.timerServer=true. In your code, you can use if (systemConfig.isTimerServer()) to check whether this node is the timer node before proceeding to clear the flags.
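Applied to the sketch above, that guard might look like the following (illustrative only; imports as in the earlier sketch, plus edu.harvard.iq.dataverse.util.SystemConfig):

```java
@Singleton
@Startup
public class HarvestingClientsCleanupBean {

    @EJB
    SystemConfig systemConfig;

    @EJB
    HarvestingClientServiceBean harvestingClientService;

    @PostConstruct
    public void resetHangingHarvestStatuses() {
        // Only the designated timer ("master") node clears the flags, so a restart
        // of a secondary node doesn't wipe the status of a harvest that may still
        // be running on another node.
        if (!systemConfig.isTimerServer()) {
            return;
        }
        harvestingClientService.resetHangingStatuses(); // placeholder method, as above
    }
}
```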
HOWEVER, there is an exception to the rule: when a harvesting run is started not by the timer but manually, through the harvesting client page and/or the API, the job can end up running on any of the nodes.
Admittedly, this will not affect many other Dataverse installations (I'm actually not aware of another production installation at this point that uses a multi-server setup). But it would affect us, so I am a bit hesitant about doing this.
And I'm not really sure what the best reliable solution is. Maybe that ClientHarvestRun object would need an extra field identifying the node actually running the harvest?
Another solution would be to change how manually-run harvests are handled: instead of actually running the harvest, the page/API call could create a one-time timer entry for 5 seconds from now. The job would then be picked up by the timer service on the master node, and we would be guaranteed that all harvesting runs stay on one server.
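With the standard EJB timer API, that one-shot timer could be created roughly like this. This is only a sketch of the mechanism: how the timer ends up being created on, or picked up by, the master node would depend on Dataverse's existing timer service wiring, which isn't shown here, and the bean and handler names are made up.

```java
import javax.annotation.Resource;
import javax.ejb.Stateless;
import javax.ejb.Timeout;
import javax.ejb.Timer;
import javax.ejb.TimerConfig;
import javax.ejb.TimerService;

@Stateless
public class ManualHarvestSchedulerBean {

    @Resource
    TimerService timerService;

    /** Schedule a harvest of the given client roughly 5 seconds from now. */
    public void scheduleOneTimeHarvest(String harvestingClientNickname) {
        // The client nickname is carried as the timer's info payload so the
        // @Timeout handler knows which client to harvest.
        TimerConfig config = new TimerConfig(harvestingClientNickname, false);
        timerService.createSingleActionTimer(5000L, config);
    }

    @Timeout
    public void onTimeout(Timer timer) {
        String nickname = (String) timer.getInfo();
        // Placeholder: hand the nickname off to whatever service actually runs
        // the harvest, e.g. harvesterService.doHarvest(nickname).
    }
}
```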
I'm not 100% sure that having these harvest-related flags cleared automatically is worth adding any complicated code to address the multiple-server scenario...
So yet another, much simpler, solution would be to implement it as above, but with an option to disable the mechanism; on startup, the service above would check the setting before proceeding: if (enableClearingHarvestJobsOnStartup) { ... }. It would be enabled by default, so it would work for most installations. But then the guide would say "make sure to disable this mechanism if you are running multiple servers!", somewhere in the same part of the guide where we mention the dataverse.timerServer setting.
@landreev I think it's best to take the simplest approach and only build in support for the single-server scenario. My plan is to add a sane default of "true" for a configurable DB key "ClearingHarvestJobsOnStartupEnabled" using the SettingsServiceBean. Do you have any objections?
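That check could look something like the following, added to the startup sketch from earlier. It assumes a new entry in SettingsServiceBean.Key and the boolean lookup that other on/off settings use; treat the exact key name and signature as assumptions rather than as what the PR ships.

```java
    @EJB
    SettingsServiceBean settingsService;

    @PostConstruct
    public void resetHangingHarvestStatuses() {
        // Assumed new Key enum entry; defaults to enabled ("true") so single-server
        // installations get the cleanup without any extra configuration, while
        // multi-server installations can switch it off.
        if (!settingsService.isTrueForKey(
                SettingsServiceBean.Key.ClearingHarvestJobsOnStartupEnabled, true)) {
            return;
        }
        harvestingClientService.resetHangingStatuses(); // placeholder, see earlier sketch
    }
```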
@@ -40,12 +40,13 @@ public void setId(Long id) {
         this.id = id;
     }
 
-    public enum RunResultType { SUCCESS, FAILURE, INPROGRESS };
+    public enum RunResultType { SUCCESS, RUN_FAILED, RUN_IN_PROGRESS, DELETE_FAILED };
Not a big deal, but what is the value of adding this status condition for a failure to delete a client? It has some value for a Dataverse admin to know that the last attempt to harvest from a certain server resulted in a failure. But is it really useful to know that an attempt to delete a client failed? Should the startup check simply remove the "delete in progress" flag quietly instead?
After all, seeing how the client is still there makes it somewhat clear that the attempt to get rid of it didn't work out, so they should try again?
I may be missing some situation where it could actually be useful - so I'm open to hearing it.
@landreev I guess handling it quietly would be ok. So instead of DELETE FAILED or FAILED we just show a success label (those are the only options)?
I'm not sure of the status of this PR. @JingMa87 recently explained that he switched to another job: #7412 (comment). @janvanmansum has been helping explain which issues and PRs are a priority for DANS. For this one, @landreev recently added the ability to stop a harvest in progress, which seems related, at least. So I'm not sure if we need this PR or if it should be rewritten (the branch is gone from GitHub, by the way).
@JingMa87 I'm closing this PR but if you're still interested in it, please reply! Thanks!
What this PR does / why we need it: Fixes infinite IN PROGRESS and DELETE IN PROGRESS labels
Which issue(s) this PR closes: 7052
Closes #7052
Special notes for your reviewer: People will be happy about this
Suggestions on how to test this: Run a harvesting client and delete another one at the same time, then restart Glassfish immediately when you see the IN PROGRESS and DELETE IN PROGRESS labels. After the restart you'll see the RUN FAILED and DELETE FAILED labels instead of the old ones, and you can edit the client again, so you can rerun or delete as normal.
Does this PR introduce a user interface change? If mockups are available, please link/include them here: Yes
Is there a release notes update needed for this change?: Yes, it's a noticeable change.
Additional documentation: N/A