Problems with the EJB Timer (in production environment, specifically) #3672

landreev · 2017-03-07T19:26:32Z

This is probably related to the issue where the EJB Timer's lock on the database prevents Glassfish from starting and/or the application from deploying (for example, #3669).
The issue with the timer, currently observed on our prod. dedicated timer service (dvn-sum1-app-2/picard), is that it just stops working completely. Even the top-level, master timer stops firing - and then none of the scheduled harvests and exports are happening.

We need to finally figure out what is going on with that timer. Part of the difficulty with diagnosing it is that the timer is a standalone EJB app, supplied with Glassfish. But it should still be possible to obtain its source and see what's going on there - if everything else fails.

An interesting observation is that nobody has ever seen these timer issues in their dev. environments. It only happens on "real" servers... but what does that mean exactly? It could be as trivial as Mac OS vs. Linux. Or is it something about running the database on localhost vs. over a non-local network, with more ports firewalled?

donsizemore · 2017-09-27T11:54:17Z

I noticed this in the notes from yesterday's community call. Odum's production Dataverse seems to be in the same boat (as best my untrained eyes can tell). We'll do some investigating on this end.

donsizemore · 2017-09-27T12:14:24Z

I found this https://dennis.gesker.com/2014/07/25/glassfish-4-0-1-expunge-timer/ and as a blind stab added this setting to a test server. Will report back.

djbrooke · 2017-10-11T20:13:57Z

In Sprint Planning 10/11, we decided to estimate the investigation of this as a 3.

@donsizemore any info about this?

donsizemore · 2017-10-11T20:26:15Z

Hi @djbrooke , I added

<ejb-timer-service>
  <property name="reschedule-failed-timer" value="true"></property>
</ejb-timer-service>

to a test server, and honestly didn't see a timer die, or anything otherwise interesting in the Glassfish logs. This particular server was used for test ingests and actively harvests, but is otherwise pretty quiet. So... nothing to report, yet anyway?
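For anyone who wants to confirm the property is actually in place, here is a minimal sketch in Python. It only parses an XML fragment. The surrounding `<config>` and `<ejb-container>` wrapper elements and the `timer_property` helper are assumptions for illustration; only the `ejb-timer-service` element and its property come from the comment above.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment standing in for the relevant part of a Glassfish
# domain.xml; only the ejb-timer-service element is taken from the thread.
SAMPLE = """
<config>
  <ejb-container>
    <ejb-timer-service>
      <property name="reschedule-failed-timer" value="true"></property>
    </ejb-timer-service>
  </ejb-container>
</config>
"""


def timer_property(xml_text, name):
    """Return the value of a named ejb-timer-service property, or None."""
    root = ET.fromstring(xml_text)
    # iter() walks the whole tree, so the nesting depth does not matter.
    for service in root.iter("ejb-timer-service"):
        for prop in service.findall("property"):
            if prop.get("name") == name:
                return prop.get("value")
    return None


print(timer_property(SAMPLE, "reschedule-failed-timer"))
```

To check a real server, the contents of its domain.xml would be passed in place of `SAMPLE`.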

landreev · 2017-10-11T23:21:55Z

Hmm, interesting.
@donsizemore could you please clarify, what are the symptoms you were seeing in your prod server? Was the timer working for a while, and then died? (It had to be working at some point, because you were harvesting from us - correct?).

What we are seeing in our production now is that the timer isn't even dying; it doesn't even start anymore. We restart Glassfish, or redeploy the app on the "master" server, and we don't get the "I am the master timer..." in the logs at all.

And yes, it has to be somehow specific to this prod. server of ours. Because it looks like timers are working properly on our test boxes.

donsizemore · 2017-10-11T23:51:22Z

@landreev I hadn't seen (or at least, noticed) any problems with our EJB timers, in production or on test servers. I just popped the property above in place to see if I did catch anything and, for the past two weeks... nothing.

How much RAM do your production VMs have, what's the JVM heap, etc? (anything I can do on Odum's test machines to help troubleshoot further?)

landreev · 2017-10-17T20:12:25Z

Mystery solved - at least with our prod. server. It was happening simply because the version of the Postgres jdbc driver (on the Glassfish side) got seriously out of sync with the actual version of Postgres.

Based on our experience with the rest of the app, we had assumed this driver version didn't really matter: it appeared that you could use Postgres 9.3 with, say, version 8.4 of the driver, and everything worked ok. The timer app (it's an EJB application of its own), however, relies on storing serialized Java "timer info" objects as byte arrays, and the serialization format may differ between versions. So the app could not de-serialize and read the objects back from the timer table.

Upgrading the driver to the same version as the production database has solved this.
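To illustrate why a byte-array round trip is so much more fragile than ordinary column reads, here is a rough sketch in Python. It uses `pickle` as a stand-in for Java serialization. `mismatched_read` is a purely hypothetical model of an out-of-sync driver that hands back a textual rendering of the column instead of the original bytes. The thread does not say exactly how the bytes differed, so treat that part as an assumption.

```python
import pickle


def store(obj):
    """Serialize an object to the raw bytes that would go into the timer table."""
    return pickle.dumps(obj)


def matched_read(raw):
    """Model of a matching driver: the stored bytes come back unchanged."""
    return raw


def mismatched_read(raw):
    """Hypothetical model of an out-of-sync driver: it returns a textual hex
    rendering of the column rather than the original bytes."""
    return ("\\x" + raw.hex()).encode("ascii")


def can_restore(blob):
    """Try to deserialize the blob; report failure instead of raising."""
    try:
        pickle.loads(blob)
        return True
    except Exception:
        return False


# Made-up payload standing in for a serialized "timer info" object.
timer_info = {"name": "master-timer", "interval_minutes": 60}
raw = store(timer_info)

print("matching driver can restore:", can_restore(matched_read(raw)))
print("mismatched driver can restore:", can_restore(mismatched_read(raw)))
```

Plain text and numeric columns tolerate this kind of difference far better, which would be consistent with the rest of the app appearing to work with an old driver.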

I'll update the installer script to match drivers to the running database more strictly. And we'll add a line to the next release notes, advising other installations to check and upgrade their drivers.

landreev · 2017-10-19T22:34:53Z

Moving into review.
This is what was done:

The installer now comes with specific versions of the JDBC driver for Postgres versions 9.2, 9.3 and 9.4, plus driver version 42.1.4, which covers Postgres 9.5 and 9.6. The installer will automatically install the driver version that matches the version of PostgreSQL that is running.

Added some extra text to the "Dataverse Application Timers" and "Troubleshooting" sections of the Admin guide.
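The driver-matching rule described above can be sketched roughly as follows. This is an illustrative Python sketch, not the installer's actual code. The function names are made up, and the returned strings are labels rather than real jar file names. It assumes the server version is taken from the banner returned by `SELECT version()`, which starts with "PostgreSQL" followed by the version number.

```python
import re

# Driver choices as listed in the comment above. The values are labels for
# illustration only, not the names of the bundled jar files.
DRIVER_FOR_SERVER = {
    (9, 2): "driver for 9.2",
    (9, 3): "driver for 9.3",
    (9, 4): "driver for 9.4",
    (9, 5): "driver 42.1.4",
    (9, 6): "driver 42.1.4",
}


def server_major_minor(version_banner):
    """Extract (major, minor) from a banner such as 'PostgreSQL 9.3.17 on ...'."""
    match = re.search(r"PostgreSQL (\d+)\.(\d+)", version_banner)
    if match is None:
        raise ValueError("unrecognized version banner: %r" % version_banner)
    return int(match.group(1)), int(match.group(2))


def pick_driver(version_banner):
    """Return the driver label that matches the running server version."""
    key = server_major_minor(version_banner)
    if key not in DRIVER_FOR_SERVER:
        # Fail loudly rather than silently falling back to some other driver.
        raise LookupError("no bundled driver for PostgreSQL %d.%d" % key)
    return DRIVER_FOR_SERVER[key]


print(pick_driver("PostgreSQL 9.3.17 on x86_64-pc-linux-gnu"))
print(pick_driver("PostgreSQL 9.6.5 on x86_64-pc-linux-gnu"))
```

Raising on an unknown version is a design choice of this sketch. Given how quietly the mismatch broke the timers, an explicit failure seems preferable to a silent fallback.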

landreev · 2017-10-19T22:37:14Z

PR: #4222

pdurbin · 2017-10-20T14:07:01Z

Looks great. Moving to QA. I made a couple tweaks, including changing PostgresQL to PostgreSQL. 😄

landreev self-assigned this Mar 7, 2017

djbrooke added the ready label Mar 8, 2017

djbrooke unassigned landreev Mar 8, 2017

djbrooke removed the ready label Mar 8, 2017

pdurbin added Feature: Harvesting Component: Code Infrastructure formerly "Feature: Code Infrastructure" labels Apr 25, 2017

djbrooke added Status: Backlog and removed Component: Code Infrastructure formerly "Feature: Code Infrastructure" Feature: Harvesting labels Sep 22, 2017

pdurbin added Component: Code Infrastructure formerly "Feature: Code Infrastructure" Feature: Harvesting and removed Component: Code Infrastructure labels Sep 27, 2017

djbrooke added Status: This/Next Sprint and removed Status: Backlog labels Sep 27, 2017

pameyer mentioned this issue Oct 6, 2017

document monitoring of EJB timers #4180

Closed

djbrooke added Status: Development and removed Status: This/Next Sprint labels Oct 16, 2017

djbrooke assigned landreev Oct 16, 2017

landreev added a commit that referenced this issue Oct 19, 2017

Added postgres jdbc drivers to the installer specific to the newer ve…

277190b

…rsions of PostgresQL. (ref. #3672)

landreev mentioned this issue Oct 19, 2017

installer and guide fixes for the timer problem (3672) #4222

Merged

5 tasks

landreev added Status: Code Review and removed Status: Development labels Oct 19, 2017

landreev removed their assignment Oct 19, 2017

pdurbin self-assigned this Oct 20, 2017

djbrooke added this to the 4.8.2 - Docker Images, Dataset Locking Updates milestone Oct 20, 2017

pdurbin added a commit that referenced this issue Oct 20, 2017

%s/PostgresQL/PostgreSQL/g #3672

e227e77

pdurbin added a commit that referenced this issue Oct 20, 2017

more links between pages, small tweaks #3672

4d9b1df

pdurbin added Status: QA and removed Status: Code Review labels Oct 20, 2017

pdurbin removed their assignment Oct 20, 2017

djbrooke assigned dlmurphy Oct 20, 2017

dlmurphy added a commit that referenced this issue Oct 20, 2017

Review - typo fixes (#3672)

fd5ddd3

Small typo and grammar fixes.

dlmurphy removed their assignment Oct 20, 2017

kcondon self-assigned this Oct 20, 2017

kcondon closed this as completed Oct 23, 2017

kcondon removed the Status: QA label Oct 23, 2017

djbrooke mentioned this issue May 21, 2018

Scheduled harvests aren't running #4686

Closed

poikilotherm mentioned this issue Nov 27, 2018

Proposal: make persistent EJB timers non-persistent or not rely on database #5345

Closed
