Problems with the EJB Timer (in production environment, specifically) #3672
Comments
I noticed this in the notes from yesterday's community call. Odum's production Dataverse seems to be in the same boat (as best my untrained eyes can tell). We'll do some investigating on this end.
I found this https://dennis.gesker.com/2014/07/25/glassfish-4-0-1-expunge-timer/ and as a blind stab added this setting to a test server. Will report back.
In Sprint Planning 10/11, we decided to estimate the investigation of this as a 3. @donsizemore any info about this?
Hi @djbrooke, I added
to a test server, and honestly didn't see a timer die, or anything otherwise interesting in the Glassfish logs. This particular server was used for test ingests and actively harvests, but is otherwise pretty quiet. So... nothing to report, yet anyway?
Hmm, interesting. What we are seeing in our production now is that the timer isn't even dying; it doesn't even start anymore. We restart Glassfish, or redeploy the app on the "master" server, and we don't get the "I am the master timer..." message in the logs at all. And yes, it has to be somehow specific to this prod. server of ours, because it looks like timers are working properly on our test boxes.
@landreev I hadn't seen (or at least, noticed) any problems with our EJB timers, in production or on test servers. I just popped the property above in place to see if I did catch anything and, for the past two weeks... nothing. How much RAM do your production VMs have, what's the JVM heap, etc? (anything I can do on Odum's test machines to help troubleshoot further?)
Mystery solved - at least with our prod. server. It was happening simply because the version of the Postgres JDBC driver (on the Glassfish side) got seriously out of sync with the actual version of Postgres. Based on our experience with the rest of the app, we had assumed this driver version didn't really matter: it appeared that you could use Postgres 9.3 with, say, version 8.4 of the driver, and everything worked ok. The timer app (it's an EJB application of its own), however, relies on storing serialized Java "timer info" objects as byte arrays, and the serialization format may differ between versions. So the app could not de-serialize and read the objects back from the timer table. Upgrading the driver to the same version as the production database has solved this. I'll update the installer script to match drivers to the running database more strictly. And we'll add a line to the next release notes, advising other installations to check and upgrade their drivers.
Moving into review. The installer now comes with the specific versions of the JDBC driver for Postgres versions 9.2, 9.3 and 9.4, plus driver version 42.1.4, which covers Postgres 9.5 and 9.6. The installer will automatically install the driver version that matches the running PostgresQL. Added some extra text to the "Dataverse Application Timers" and "Troubleshooting" sections of the Admin guide.
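For anyone matching drivers by hand, the selection rule described above can be sketched roughly as follows. This is only an illustration, not the installer's actual code: the function names and the lookup table are made up for the example, and the labels for 9.2 through 9.4 are placeholders because the exact driver versions are not named in this thread. Only 42.1.4 for 9.5 and 9.6 comes from the comment above.

```python
import re

# Hypothetical lookup table built from the comment above: dedicated drivers
# for PostgreSQL 9.2, 9.3 and 9.4, and driver 42.1.4 for 9.5 and 9.6.
# The 9.2-9.4 labels are placeholders, not real driver versions or file names.
DRIVER_FOR_SERVER = {
    "9.2": "driver built for 9.2",
    "9.3": "driver built for 9.3",
    "9.4": "driver built for 9.4",
    "9.5": "42.1.4",
    "9.6": "42.1.4",
}


def server_series(version_text):
    """Return the 'major.minor' series found in a version string."""
    match = re.search(r"(\d+)\.(\d+)", version_text)
    if match is None:
        raise ValueError(f"no version number found in {version_text!r}")
    return f"{match.group(1)}.{match.group(2)}"


def pick_driver(version_text):
    """Return the driver label listed for the given server version."""
    series = server_series(version_text)
    if series not in DRIVER_FOR_SERVER:
        # Refuse to guess: an unlisted server version gets no driver.
        raise LookupError(f"no bundled driver listed for PostgreSQL {series}")
    return DRIVER_FOR_SERVER[series]
```

Calling `pick_driver("PostgreSQL 9.6.5")` would return `"42.1.4"`, while an unlisted version such as 8.4 raises an error instead of silently falling back to some other driver, which is the kind of loose matching that caused the problem here.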
PR: #4222
Looks great. Moving to QA. I made a couple tweaks, including changing PostgresQL to PostgreSQL. 😄
This is probably related to the issue where the EJB Timer's lock on the database prevents Glassfish from starting and/or the application from deploying (for example, #3669).
The issue with the timer, currently observed on our prod. dedicated timer service (dvn-sum1-app-2/picard), is that it just stops working completely. Even the top-level, master timer stops firing - and then none of the scheduled harvests and exports are happening.
We need to finally figure out what is going on with that timer. Part of the difficulty with diagnosing it is that the timer is a standalone EJB app, supplied with Glassfish. But it should still be possible to obtain its source and see what's going on there, if all else fails.
An interesting observation is that nobody has ever seen these timer issues in their dev. environments. It only happens on "real" servers... but what does that mean exactly? It could be as trivial as Mac OS vs. Linux. Or is it something about running the database on localhost vs. over a non-local network, with more ports firewalled?