WIP: 5345: Nonpersistent EJB timers #5371

Closed
wants to merge 10 commits
7 changes: 7 additions & 0 deletions doc/sphinx-guides/source/admin/harvestclients.rst
@@ -22,6 +22,13 @@ Clients are managed on the "Harvesting Clients" page accessible via the :doc:`da

The process of creating a new, or editing an existing client, is largely self-explanatory. It is split into logical steps, in a way that allows the user to go back and correct the entries made earlier. The process is interactive and guidance text is provided. For example, the user is required to enter the URL of the remote OAI server. When they click *Next*, the application will try to establish a connection to the server in order to verify that it is working, and to obtain the information about the sets of metadata records and the metadata formats it supports. The choices offered to the user on the next page will be based on this extra information. If the application fails to establish a connection to the remote archive at the address specified, or if an invalid response is received, the user is given an opportunity to check and correct the URL they entered.

Known issues
~~~~~~~~~~~~
When running harvest clients, you should verify from the logs that all of your harvesters complete their jobs.
Troublesome or incomplete harvests may occur when a harvest takes longer than one hour, or when multiple harvests
are scheduled to start within an hour or two of each other. If you run into this, please open an issue referencing
the :doc:`../developers/timers` part of the docs.
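
A quick, hypothetical way to scan for harvest activity is to grep the application server log. The log path below
assumes a default Glassfish domain location; adjust it for your installation.

.. code-block:: bash

    # Show recent harvest-related log lines; the domain log path is an assumption, adjust as needed.
    grep -i "harvest" /usr/local/glassfish4/glassfish/domains/domain1/logs/server.log | tail -n 50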

New in Dataverse 4, vs. DVN 3
-----------------------------

15 changes: 14 additions & 1 deletion doc/sphinx-guides/source/admin/harvestserver.rst
@@ -14,6 +14,14 @@ harvesting protocol. Note that the terms "Harvesting Server" and "OAI
Server" are being used interchangeably throughout this guide and in
the inline help text.

If you want to learn more about OAI-PMH, take a look at the
`DataCite OAI-PMH guide <https://support.datacite.org/docs/datacite-oai-pmh>`_
or the `OAI-PMH protocol definition <https://www.openarchives.org/OAI/openarchivesprotocol.html>`_.

You might consider adding your OAI-enabled production instance of Dataverse to
`this shared list <https://docs.google.com/spreadsheets/d/12cxymvXCqP_kCsLKXQD32go79HBWZ1vU_tdG4kvP5S8/>`_
of such instances.

How does it work?
-----------------

@@ -28,6 +36,10 @@ Harvesting server can be enabled or disabled on the "Harvesting
Server" page accessible via the :doc:`dashboard`. Harvesting server is by
default disabled on a brand new, "out of the box" Dataverse.

The OAI-PMH endpoint can be accessed at ``http(s)://<Your Dataverse FQDN>/oai``.
If you want other services to harvest your repository, point them to this URL.
*Example URL for the 'Identify' verb*: `Harvard Dataverse OAI <https://dataverse.harvard.edu/oai?verb=Identify>`_
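
As a quick sanity check, you can query the endpoint yourself, for example with ``curl`` (shown here against the
Harvard Dataverse URL above; substitute your own FQDN):

.. code-block:: bash

    # Ask the OAI-PMH endpoint to identify itself; replace the hostname with your own Dataverse FQDN.
    curl "https://dataverse.harvard.edu/oai?verb=Identify"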

OAI Sets
--------

@@ -124,7 +136,8 @@ runs every night (at 2AM, by default). This export timer is created
and activated automatically every time the application is deployed
or restarted. Once again, this is new in Dataverse 4, and unlike DVN
v3, where export jobs had to be scheduled and activated by the admin
user. See the "Export" section of the Admin guide, for more information on the automated metadata exports.
user. See the :doc:`/admin/metadataexport` section of the Admin guide,
for more information on the automated metadata exports.

It is still possible however to make changes like this be immediately
reflected in the OAI server, by going to the *Harvesting Server* page
26 changes: 24 additions & 2 deletions doc/sphinx-guides/source/admin/metadataexport.rst
@@ -7,14 +7,36 @@ Metadata Export
Automatic Exports
-----------------

Publishing a dataset automatically starts a metadata export job, that will run in the background, asynchronously. Once completed, it will make the dataset metadata exported and cached in all the supported formats:
Publishing a dataset automatically starts a metadata export job that runs in the background, asynchronously.
Once completed, the dataset metadata will have been exported and cached in all of the supported formats:

- Dublin Core
- Data Documentation Initiative (DDI)
- Schema.org JSON-LD
- native JSON (Dataverse-specific)

A scheduled timer job that runs nightly will attempt to export any published datasets that for whatever reason haven't been exported yet. This timer is activated automatically on the deployment, or restart, of the application. So, again, no need to start or configure it manually. (See the "Application Timers" section of this guide for more information)
Scheduled Timer Export
----------------------

A scheduled timer job that runs nightly will attempt to export, in all supported metadata formats, any published
datasets that for whatever reason haven't been exported yet, and will cache the results on the filesystem.

**Note** that normally an export happens automatically whenever a dataset is published. This scheduled job is there
to catch any datasets for which that export did not succeed, for one reason or another. Also, since this functionality
was added in version 4.5, if you are upgrading from an earlier version, none of your datasets will have been exported
yet; the first time this job runs, it will attempt to export them all.

This daily job will also update all the harvestable OAI sets configured on your server, adding new and/or newly
published datasets or marking deaccessioned datasets as "deleted" in the corresponding sets as needed.

This timer is activated automatically on the deployment, or restart, of the application, so, again, there is no need
to start or configure it manually. (See the :doc:`timers` section of this guide for more information about timer usage
in Dataverse.) There is no admin user-accessible configuration for this timer.

This job is automatically scheduled to run at 2AM local time every night.

Before Dataverse 4.10 it was possible (for an advanced and adventurous user) to change that time by directly editing
the EJB timer table in the application database. From 4.10 onward, timers are no longer persisted. If you have a
pressing need for a configurable time, please open an issue on GitHub describing your use case.

Batch exports through the API
-----------------------------
72 changes: 44 additions & 28 deletions doc/sphinx-guides/source/admin/timers.rst
@@ -3,50 +3,66 @@
Dataverse Application Timers
============================

Dataverse uses timers to automatically run scheduled Harvest and Metadata export jobs.
Dataverse uses timers to automatically run scheduled jobs for:

.. contents:: |toctitle|
:local:

Dedicated timer server in a Dataverse server cluster
----------------------------------------------------

When running a Dataverse cluster - i.e. multiple Dataverse application
servers talking to the same database - **only one** of them must act
as the *dedicated timer server*. This is to avoid starting conflicting
batch jobs on multiple nodes at the same time.
* Harvesting metadata

  * See :doc:`/admin/harvestserver` and :doc:`/admin/harvestclients`.
  * Created only when scheduling is enabled by an admin (via the "Manage Harvesting Clients" page) and canceled when it is disabled.

* :doc:`/admin/metadataexport`

  * Enabled by default, not configurable.

This does not affect a single-server installation. So you can safely skip this section unless you are running a multi-server cluster.
All timers are created on application startup, and their firing times are not configurable. Since Dataverse 4.10 they
are no longer persisted to a database, as they were deleted and re-created on every startup anyway.

The following JVM option instructs the application to act as the dedicated timer server:
.. contents:: |toctitle|
:local:

``-Ddataverse.timerServer=true``
Dataverse server clusters and EJB timers
----------------------------------------

**IMPORTANT:** Note that this option is automatically set by the Dataverse installer script. That means that when **configuring a multi-server cluster**, it will be the responsibility of the installer to remove the option from the :fixedwidthplain:`domain.xml` of every node except the one intended to be the timer server. We also recommend that the following entry in the :fixedwidthplain:`domain.xml`: ``<ejb-timer-service timer-datasource="jdbc/VDCNetDS">`` is changed back to ``<ejb-timer-service>`` on all the non-timer server nodes. Similarly, this option is automatically set by the installer script. Changing it back to the default setting on a server that doesn't need to run the timer will prevent a potential race condition, where multiple servers try to get a lock on the timer database.
In a multi-node cluster, all timers are created on a dedicated timer node (see below). This is not necessarily the
node where an admin configured the harvesting clients or the metadata export.

**Note** that for the timer to work, the version of the PostgreSQL JDBC driver your instance is using must match the version of your PostgreSQL database. See the 'Timer not working' section of the :doc:`/admin/troubleshooting` guide.
Dedicated timer server node
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Harvesting Timers
-----------------
When running a "cluster" with multiple instances of Dataverse connected to the same database, **only one** of them must
act as the *dedicated timer server*. This is to avoid starting conflicting batch jobs on multiple nodes at the same time.
(Might get addressed for automation in a later Dataverse version using cluster support from the application server.)

These timers are created when scheduled harvesting is enabled by a local admin user (via the "Manage Harvesting Clients" page).
This does not affect a single-server installation. So you can safely skip this section unless you are running a multi-server cluster.

In a multi-node cluster, all these timers will be created on the dedicated timer node (and not necessarily on the node where the harvesting clients were created and/or saved).
The following system property instructs the application to act as the dedicated timer server:

A timer will be automatically removed when a harvesting client with an active schedule is deleted, or if the schedule is turned off for an existing client.
``dataverse.timerServer=true``

Metadata Export Timer
---------------------
**Note** that if you set this via a JVM option, use ``-Ddataverse.timerServer=true``. However, you should prefer the
``asadmin`` system-properties commands.

This timer is created automatically whenever the application is deployed or restarted. There is no admin user-accessible configuration for this timer.
**IMPORTANT:** This is automatically set by the Dataverse installer script on every node.

This timer runs a daily job that tries to export all the local, published datasets that haven't been exported yet, in all supported metadata formats, and cache the results on the filesystem. (Note that normally an export will happen automatically whenever a dataset is published. This scheduled job is there to catch any datasets for which that export did not succeed, for one reason or another). Also, since this functionality has been added in version 4.5: if you are upgrading from a previous version, none of your datasets are exported yet. So the first time this job runs, it will attempt to export them all.
That means that *when configuring a multi-server cluster*, it is the responsibility of the sysadmin to remove
the option from every node except the one intended to be the timer server. The easiest way to achieve this is by running
``asadmin delete-system-property "dataverse.timerServer"``.
(This option will not be set to ``true`` in future Docker images of Dataverse; it will need to be configured explicitly.)
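
For example, a minimal sketch of the relevant ``asadmin`` calls (assuming the standard ``create-system-properties``
subcommand is available on your application server):

.. code-block:: bash

    # On the one node that should act as the dedicated timer server:
    asadmin create-system-properties dataverse.timerServer=true

    # On every other node in the cluster:
    asadmin delete-system-property dataverse.timerServer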

This daily job will also update all the harvestable OAI sets configured on your server, adding new and/or newly published datasets or marking deaccessioned datasets as "deleted" in the corresponding sets as needed.
As persistent timers are no longer used from Dataverse 4.10 onward, it is up to you whether to follow the formerly
recommended steps below when upgrading. In new installations, this is not necessary.

This job is automatically scheduled to run at 2AM local time every night. If really necessary, it is possible (for an advanced user) to change that time by directly editing the EJB timer application table in the database.
We also recommend that the following entry in the :fixedwidthplain:`domain.xml`:
``<ejb-timer-service timer-datasource="jdbc/VDCNetDS">`` be changed back to ``<ejb-timer-service>``
A reviewer (Contributor) commented:
I remember experimenting with this myself (although it was a while ago). I thought leaving the timer-datasource attribute blank would result in the default data source (in practice that would be the local instance of Derby db) being used to persist the timers... I'm assuming I was wrong.
If leaving it blank results in not persisting the timers at all, let's stick with it going forward.
But should this PR then modify the installer script accordingly too?

The PR author (Contributor Author) replied:
Just to be sure we are talking about the same thing: this instruction has been included in the docs already. Once this PR is merged, IMHO the described steps are not necessary anymore.

A blank option does not mean "non-persistent" by default, a blank option means any new persistent timers introduced later on will be stored in the default datasource (local H2 or Hazelcast cache in Payara). (You need to explicitly declare timers as "non-persistent".)

Yes, the installer (and other places) should be changed, too. The asadmin commands from setup-glassfish.sh regarding JDBC timers can be pruned. Will put that in a separate commit to have more logical chunks.

on all the non-timer-server nodes. This entry, too, is set automatically by the installer script. Changing it back
to the default setting on a server that doesn't need to run the timer will prevent a potential race condition where
multiple servers try to get a lock on the timer database.

Known Issues
------------

We've received several reports of an intermittent issue where the application fails to deploy with the error message "EJB Timer Service is not available." Please see the :doc:`/admin/troubleshooting` section of this guide for a workaround.
Prior to Dataverse 4.10, we received several reports of an intermittent issue where the application failed to deploy
with the error message "EJB Timer Service is not available." Please see the :doc:`/admin/troubleshooting` section of
this guide for a workaround.

When running harvest clients, you should verify from the logs that all of your harvesters complete their jobs.
Troublesome or incomplete harvests may occur when a harvest takes longer than one hour, or when multiple harvests
are scheduled to start within an hour or two of each other. If you run into this, please open an issue referencing
the :doc:`../developers/timers` part of the docs.
4 changes: 4 additions & 0 deletions doc/sphinx-guides/source/developers/big-data-support.rst
@@ -396,3 +396,7 @@ Available variables are:
* ``minorVersion``
* ``majorVersion``
* ``releaseStatus``

----

Previous: :doc:`selinux` | Next: :doc:`timers`
1 change: 1 addition & 0 deletions doc/sphinx-guides/source/developers/index.rst
@@ -31,3 +31,4 @@ Developer Guide
geospatial
selinux
big-data-support
timers
24 changes: 24 additions & 0 deletions doc/sphinx-guides/source/developers/timers.rst
@@ -0,0 +1,24 @@
==========
EJB Timers
==========

As described in :doc:`../admin/timers`, Dataverse uses EJB timers for scheduled jobs. This section is about the
techniques used for scheduling.

* :doc:`../admin/metadataexport` is done via a ``@Schedule`` annotation on ``OAISetServiceBean.exportAllSets()`` and
  ``DatasetServiceBean.exportAll()``. Fixed to 2AM local time every day, non-persistent (see the sketch below).
* Harvesting is a bit more complicated. The timer is attached to ``HarvesterServiceBean.harvestEnabled()`` via an
  hourly, non-persistent ``@Schedule`` annotation. That method collects all enabled ``HarvestingClient`` entries and
  runs them if the time from the client config matches.
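
A minimal sketch of the pattern, modeled on the ``DatasetServiceBean.exportAll()`` code added in this PR; the bean
and helper names below are illustrative, not the actual Dataverse classes:

.. code-block:: java

    import java.util.logging.Logger;
    import javax.ejb.Lock;
    import javax.ejb.LockType;
    import javax.ejb.Schedule;
    import javax.ejb.Singleton;

    @Singleton
    public class NightlyExportTimer {

        private static final Logger logger = Logger.getLogger(NightlyExportTimer.class.getName());

        @Lock(LockType.READ)
        @Schedule(hour = "2", persistent = false) // fires at 2AM local time, never written to a timer table
        public void exportAll() {
            // Only the node configured as the dedicated timer server should do the actual work.
            if (isTimerServer()) {
                logger.info("Running the scheduled export job.");
                // ... call into the export service here ...
            }
        }

        private boolean isTimerServer() {
            // In Dataverse this check is backed by the dataverse.timerServer system property.
            return Boolean.parseBoolean(System.getProperty("dataverse.timerServer", "false"));
        }
    }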

**NOTE:** the harvesting timers might cause trouble when a harvest takes longer than one hour, or when multiple
harvests configured for the same starting hour stack up. There is a lock in place to prevent "bad things", but it
might result in a lost harvest. If this really causes trouble in the future, the code should be refactored to use a
proper task scheduler, the JBatch API, or asynchronous execution. A *TODO* message has been left in the code.

.. contents:: |toctitle|
:local:

----

Previous: :doc:`big-data-support`
5 changes: 2 additions & 3 deletions scripts/installer/glassfish-setup.sh
@@ -122,9 +122,8 @@ function final_setup(){
./asadmin $ASADMIN_OPTS create-jdbc-resource --connectionpoolid dvnDbPool jdbc/VDCNetDS

###
# Set up the data source for the timers

./asadmin $ASADMIN_OPTS set configs.config.server-config.ejb-container.ejb-timer-service.timer-datasource=jdbc/VDCNetDS
# Obsolete since merge of GH-5345, using only non-persistent timers from now on.
#./asadmin $ASADMIN_OPTS set configs.config.server-config.ejb-container.ejb-timer-service.timer-datasource=jdbc/VDCNetDS

./asadmin $ASADMIN_OPTS create-jvm-options "\-Djavax.xml.parsers.SAXParserFactory=com.sun.org.apache.xerces.internal.jaxp.SAXParserFactoryImpl"

32 changes: 25 additions & 7 deletions src/main/java/edu/harvard/iq/dataverse/DatasetServiceBean.java
@@ -31,12 +31,7 @@
import java.util.logging.FileHandler;
import java.util.logging.Level;
import java.util.logging.Logger;
import javax.ejb.Asynchronous;
import javax.ejb.EJB;
import javax.ejb.EJBException;
import javax.ejb.Stateless;
import javax.ejb.TransactionAttribute;
import javax.ejb.TransactionAttributeType;
import javax.ejb.*;
import javax.inject.Named;
import javax.persistence.EntityManager;
import javax.persistence.NoResultException;
@@ -576,10 +571,33 @@ public void exportAllAsync() {
exportAllDatasets(false);
}

/**
* Scheduled method triggering the export of all local, published datasets,
* but only on the node that is configured as the dedicated timer server.
*
* TODO: this is not unit testable as long as the functions it depends on aren't.
*/
@Lock(LockType.READ)
@Schedule(hour = "2", persistent = false)
public void exportAll() {
exportAllDatasets(false);
if (systemConfig.isTimerServer()) {
logger.info("DatasetService: Running a scheduled export job.");
exportAllDatasets(false);
}
}

/**
* TODO: this code needs refactoring to be unit testable:
* 1) Move the Logger/FileHandler stuff to a factory in a Service
* (Export or Logging service) a) to make it mockable and
* b) to have common, reusable code.
* 2) Move this to OAIRecordServiceBean. The additional pieces for a
* complete OAI export are in OAISetServiceBean, so it makes more
* sense for this code to live there and call into this service.
* 3) Moving this to OAIRecordServiceBean makes findAllLocalDatasetIds(), etc
* mockable, so this class (DatasetServiceBean) does not need immediate action.
* @param forceReExport
*/
public void exportAllDatasets(boolean forceReExport) {
Integer countAll = 0;
Integer countSuccess = 0;
@@ -9,14 +9,12 @@
import edu.harvard.iq.dataverse.engine.command.DataverseRequest;
import edu.harvard.iq.dataverse.engine.command.exception.CommandException;
import edu.harvard.iq.dataverse.engine.command.impl.CreateHarvestingClientCommand;
import edu.harvard.iq.dataverse.engine.command.impl.DeleteHarvestingClientCommand;
import edu.harvard.iq.dataverse.engine.command.impl.UpdateHarvestingClientCommand;
import edu.harvard.iq.dataverse.harvest.client.HarvesterServiceBean;
import edu.harvard.iq.dataverse.harvest.client.HarvestingClient;
import edu.harvard.iq.dataverse.harvest.client.HarvestingClientServiceBean;
import edu.harvard.iq.dataverse.harvest.client.oai.OaiHandler;
import edu.harvard.iq.dataverse.search.IndexServiceBean;
import edu.harvard.iq.dataverse.timer.DataverseTimerServiceBean;
import edu.harvard.iq.dataverse.util.BundleUtil;
import edu.harvard.iq.dataverse.util.JsfHelper;
import static edu.harvard.iq.dataverse.util.JsfHelper.JH;
@@ -65,8 +63,6 @@ public class HarvestingClientsPage implements java.io.Serializable {
IndexServiceBean indexService;
@EJB
EjbDataverseEngine engineService;
@EJB
DataverseTimerServiceBean dataverseTimerService;
@Inject
DataverseRequestServiceBean dvRequestService;
@Inject
@@ -453,9 +449,6 @@ public void saveClient(ActionEvent ae) {

configuredHarvestingClients = harvestingClientService.getAllHarvestingClients();

if (!harvestingClient.isScheduled()) {
dataverseTimerService.removeHarvestTimer(harvestingClient);
}
JsfHelper.addSuccessMessage(BundleUtil.getStringFromBundle("harvest.update.success") + harvestingClient.getName());

} catch (CommandException ex) {