diff --git a/doc/sphinx-guides/source/admin/harvestclients.rst b/doc/sphinx-guides/source/admin/harvestclients.rst index cd841aeba85..8d4eec50ad6 100644 --- a/doc/sphinx-guides/source/admin/harvestclients.rst +++ b/doc/sphinx-guides/source/admin/harvestclients.rst @@ -22,6 +22,13 @@ Clients are managed on the "Harvesting Clients" page accessible via the :doc:`da The process of creating a new, or editing an existing client, is largely self-explanatory. It is split into logical steps, in a way that allows the user to go back and correct the entries made earlier. The process is interactive and guidance text is provided. For example, the user is required to enter the URL of the remote OAI server. When they click *Next*, the application will try to establish a connection to the server in order to verify that it is working, and to obtain the information about the sets of metadata records and the metadata formats it supports. The choices offered to the user on the next page will be based on this extra information. If the application fails to establish a connection to the remote archive at the address specified, or if an invalid response is received, the user is given an opportunity to check and correct the URL they entered. +Known issues +~~~~~~~~~~~~ +When running harvest clients, you should check the logs to verify that all of your harvesters complete their jobs. +Incomplete harvests can occur when a single harvest takes longer than one hour, or when several harvests are scheduled +to start within an hour or two of each other. If you run into this, please open an issue referencing +the :doc:`../developers/timers` section of the docs. + New in Dataverse 4, vs. DVN 3 ----------------------------- diff --git a/doc/sphinx-guides/source/admin/harvestserver.rst b/doc/sphinx-guides/source/admin/harvestserver.rst index c952a1f17e7..f98eae252ac 100644 --- a/doc/sphinx-guides/source/admin/harvestserver.rst +++ b/doc/sphinx-guides/source/admin/harvestserver.rst @@ -14,6 +14,14 @@ harvesting protocol. Note that the terms "Harvesting Server" and "OAI Server" are being used interchangeably throughout this guide and in the inline help text. +If you want to learn more about OAI-PMH, you could take a look at +the `DataCite OAI-PMH guide `_ +or the `OAI-PMH protocol definition `_. + +You might consider adding your OAI-enabled production instance of Dataverse to +`this shared list `_ +of such instances. + How does it work? ----------------- @@ -28,6 +36,10 @@ Harvesting server can be enabled or disabled on the "Harvesting Server" page accessible via the :doc:`dashboard`. Harvesting server is by default disabled on a brand new, "out of the box" Dataverse. +The OAI-PMH endpoint can be accessed at ``http(s)://<your Dataverse server>/oai``. +If you want other services to harvest your repository, point them to this URL (a quick way to test the endpoint is sketched below). +*Example URL for the 'Identify' verb*: `Harvard Dataverse OAI `_ + OAI Sets -------- @@ -124,7 +136,8 @@ runs every night (at 2AM, by default). This export timer is created and activated automatically every time the application is deployed or restarted. Once again, this is new in Dataverse 4, unlike DVN v3, where export jobs had to be scheduled and activated by the admin -user. See the "Export" section of the Admin guide, for more information on the automated metadata exports. +user. See the :doc:`/admin/metadataexport` section of the Admin guide, +for more information on the automated metadata exports. 
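+A quick way to test the endpoint from the outside is to request the ``Identify`` verb over plain HTTP. The following
+is a minimal, illustrative sketch (the host name is a placeholder; substitute your own installation):
+
+.. code-block:: java
+
+    import java.io.BufferedReader;
+    import java.io.InputStreamReader;
+    import java.net.HttpURLConnection;
+    import java.net.URL;
+
+    public class OaiIdentifyCheck {
+        public static void main(String[] args) throws Exception {
+            // Hypothetical host name; replace with your own Dataverse server.
+            URL identify = new URL("https://dataverse.example.edu/oai?verb=Identify");
+            HttpURLConnection conn = (HttpURLConnection) identify.openConnection();
+            conn.setRequestMethod("GET");
+
+            // A healthy OAI-PMH endpoint answers 200 with an <Identify> XML document.
+            System.out.println("HTTP " + conn.getResponseCode());
+            try (BufferedReader in = new BufferedReader(
+                    new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
+                String line;
+                while ((line = in.readLine()) != null) {
+                    System.out.println(line);
+                }
+            }
+        }
+    }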
It is still possible, however, to have changes like this reflected immediately in the OAI server, by going to the *Harvesting Server* page diff --git a/doc/sphinx-guides/source/admin/metadataexport.rst b/doc/sphinx-guides/source/admin/metadataexport.rst index 8efb100f003..047bf0d4c60 100644 --- a/doc/sphinx-guides/source/admin/metadataexport.rst +++ b/doc/sphinx-guides/source/admin/metadataexport.rst @@ -7,14 +7,36 @@ Metadata Export Automatic Exports ----------------- -Publishing a dataset automatically starts a metadata export job, that will run in the background, asynchronously. Once completed, it will make the dataset metadata exported and cached in all the supported formats: +Publishing a dataset automatically starts a metadata export job that will run in the background, asynchronously. +Once completed, the dataset metadata will be exported and cached in all the supported formats: - Dublin Core - Data Documentation Initiative (DDI) - Schema.org JSON-LD - native JSON (Dataverse-specific) -A scheduled timer job that runs nightly will attempt to export any published datasets that for whatever reason haven't been exported yet. This timer is activated automatically on the deployment, or restart, of the application. So, again, no need to start or configure it manually. (See the "Application Timers" section of this guide for more information) +Scheduled Timer Export +---------------------- + +A scheduled timer job that runs nightly will attempt to export any published datasets in all supported metadata formats +that for whatever reason haven't been exported yet, and cache the results on the filesystem. + +**Note** that normally an export will happen automatically whenever a dataset is published. This scheduled job is there +to catch any datasets for which that export did not succeed, for one reason or another. Also, since this functionality +was added in version 4.5: if you are upgrading from an earlier version, none of your datasets have been exported yet. + +This daily job will also update all the harvestable OAI sets configured on your server, adding new and/or newly +published datasets or marking deaccessioned datasets as "deleted" in the corresponding sets as needed. + +This timer is activated automatically on the deployment, or restart, of the application. So, again, no need to start or +configure it manually. (See also the :doc:`timers` section of this guide for more information about timer usage in Dataverse.) +There is no admin user-accessible configuration for this timer. + +This job is automatically scheduled to run at 2AM local time every night. + +Before Dataverse 4.10 it was possible (for an advanced and adventurous user) to change that time by directly editing +the EJB timer table in the database. From 4.10 onward, timers are no longer persisted. If you have +a desperate need for a configurable time, please open an issue on GitHub describing your use case. Batch exports through the API ----------------------------- diff --git a/doc/sphinx-guides/source/admin/timers.rst b/doc/sphinx-guides/source/admin/timers.rst index 3c1ff40f935..26a612e8f1c 100644 --- a/doc/sphinx-guides/source/admin/timers.rst +++ b/doc/sphinx-guides/source/admin/timers.rst @@ -3,50 +3,66 @@ Dataverse Application Timers ============================ -Dataverse uses timers to automatically run scheduled Harvest and Metadata export jobs. +Dataverse uses timers to automatically run scheduled jobs for: -.. 
contents:: |toctitle| - :local: - -Dedicated timer server in a Dataverse server cluster ---------------------------------------------------- - -When running a Dataverse cluster - i.e. multiple Dataverse application -servers talking to the same database - **only one** of them must act -as the *dedicated timer server*. This is to avoid starting conflicting -batch jobs on multiple nodes at the same time. +* Harvesting metadata + * See :doc:`/admin/harvestserver` and :doc:`/admin/harvestclients` + * Created only when scheduling is enabled by an admin (via the "Manage Harvesting Clients" page) and canceled when it is disabled. +* :doc:`/admin/metadataexport` + * Enabled by default, non-configurable. -This does not affect a single-server installation. So you can safely skip this section unless you are running a multi-server cluster. +All timers are created on application startup, and their firing times are not configurable. Since Dataverse 4.10 they are +no longer persisted to the database, as they were deleted and re-created on every startup anyway. -The following JVM option instructs the application to act as the dedicated timer server: +.. contents:: |toctitle| + :local: -``-Ddataverse.timerServer=true`` +Dataverse server clusters and EJB timers +---------------------------------------- -**IMPORTANT:** Note that this option is automatically set by the Dataverse installer script. That means that when **configuring a multi-server cluster**, it will be the responsibility of the installer to remove the option from the :fixedwidthplain:`domain.xml` of every node except the one intended to be the timer server. We also recommend that the following entry in the :fixedwidthplain:`domain.xml`: ``<ejb-timer-service timer-datasource="jdbc/VDCNetDS">`` is changed back to the default ``<ejb-timer-service>`` on all the non-timer server nodes. Similarly, this option is automatically set by the installer script. Changing it back to the default setting on a server that doesn't need to run the timer will prevent a potential race condition, where multiple servers try to get a lock on the timer database. +In a multi-node cluster, all timers will be created on a dedicated timer node (see below). This is not necessarily the +node where an admin configured the harvesting clients or the metadata export. -**Note** that for the timer to work, the version of the PostgreSQL JDBC driver your instance is using must match the version of your PostgreSQL database. See the 'Timer not working' section of the :doc:`/admin/troubleshooting` guide. +Dedicated timer server node +~~~~~~~~~~~~~~~~~~~~~~~~~~~ -Harvesting Timers ------------------ +When running a "cluster" with multiple instances of Dataverse connected to the same database, **only one** of them must +act as the *dedicated timer server*. This is to avoid starting conflicting batch jobs on multiple nodes at the same time. +(This might be automated in a later Dataverse version, using the application server's cluster support.) -These timers are created when scheduled harvesting is enabled by a local admin user (via the "Manage Harvesting Clients" page). +This does not affect a single-server installation. So you can safely skip this section unless you are running a multi-server cluster. -In a multi-node cluster, all these timers will be created on the dedicated timer node (and not necessarily on the node where the harvesting clients were created and/or saved). 
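+At runtime, the effect of being (or not being) the dedicated timer server is a plain guard inside each scheduled
+method. The sketch below is simplified and illustrative only; in the actual code the check is wrapped in
+``SystemConfig.isTimerServer()``, and the system property driving it is introduced just below:
+
+.. code-block:: java
+
+    import javax.ejb.Schedule;
+    import javax.ejb.Stateless;
+
+    @Stateless
+    public class TimerServerGuardExample {
+
+        // Non-persistent timers fire on every node in the cluster, because each
+        // instance creates them at startup. This guard makes sure the work
+        // itself only runs on the node configured as the dedicated timer server.
+        @Schedule(hour = "*", persistent = false)
+        public void runHourly() {
+            // Simplified stand-in for SystemConfig.isTimerServer():
+            if (!"true".equals(System.getProperty("dataverse.timerServer"))) {
+                return; // fail silently on all other nodes
+            }
+            // ... the actual scheduled work goes here ...
+        }
+    }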
+The following system property instructs the application to act as the dedicated timer server: -A timer will be automatically removed when a harvesting client with an active schedule is deleted, or if the schedule is turned off for an existing client. +``dataverse.timerServer=true`` -Metadata Export Timer --------------------- +**Note** that when setting this via a JVM option, the correct form is ``-Ddataverse.timerServer=true``. You should +prefer setting it via the ``asadmin`` system property commands. -This timer is created automatically whenever the application is deployed or restarted. There is no admin user-accessible configuration for this timer. +**IMPORTANT:** This is automatically set by the Dataverse installer script on every node. -This timer runs a daily job that tries to export all the local, published datasets that haven't been exported yet, in all supported metadata formats, and cache the results on the filesystem. (Note that normally an export will happen automatically whenever a dataset is published. This scheduled job is there to catch any datasets for which that export did not succeed, for one reason or another). Also, since this functionality has been added in version 4.5: if you are upgrading from a previous version, none of your datasets are exported yet. So the first time this job runs, it will attempt to export them all. +That means that *when configuring a multi-server cluster*, it will be the responsibility of the sysadmin to remove +the option from every node except the one intended to be the timer server. The easiest way to achieve this is to run +``asadmin delete-system-property "dataverse.timerServer"``. +(This option will not be set to ``true`` in future Docker images of Dataverse; it will need to be configured explicitly.) -This daily job will also update all the harvestable OAI sets configured on your server, adding new and/or newly published datasets or marking deaccessioned datasets as "deleted" in the corresponding sets as needed. +As Dataverse 4.10 and later no longer use persistent timers, it is up to you whether to follow the former +recommendation below when upgrading. In new installations, this is not necessary. -This job is automatically scheduled to run at 2AM local time every night. If really necessary, it is possible (for an advanced user) to change that time by directly editing the EJB timer application table in the database. + We also recommend that the following entry in the :fixedwidthplain:`domain.xml`: + ``<ejb-timer-service timer-datasource="jdbc/VDCNetDS">`` is changed back to the default ``<ejb-timer-service>`` + on all the non-timer server nodes. Similarly, this entry is set automatically by the installer script. + Changing it back to the default setting on a server that doesn't need to run the timer will prevent a potential + race condition, where multiple servers try to get a lock on the timer database. Known Issues ------------ -We've received several reports of an intermittent issue where the application fails to deploy with the error message "EJB Timer Service is not available." Please see the :doc:`/admin/troubleshooting` section of this guide for a workaround. +Prior to Dataverse 4.10, we received several reports of an intermittent issue where the application fails to deploy +with the error message "EJB Timer Service is not available." Please see the :doc:`/admin/troubleshooting` section of +this guide for a workaround. + +When running harvest clients, you should check the logs to verify that all of your harvesters complete their jobs. +Incomplete harvests can occur when a single harvest takes longer than one hour, or when several harvests are scheduled +to start within an hour or two of each other. If you run into this, please open an issue referencing the +:doc:`../developers/timers` section of the docs. \ No newline at end of file diff --git a/doc/sphinx-guides/source/developers/big-data-support.rst b/doc/sphinx-guides/source/developers/big-data-support.rst index e0d0b4ffd25..30c42a06ee1 100644 --- a/doc/sphinx-guides/source/developers/big-data-support.rst +++ b/doc/sphinx-guides/source/developers/big-data-support.rst @@ -396,3 +396,7 @@ Available variables are: * ``minorVersion`` * ``majorVersion`` * ``releaseStatus`` + +---- + +Previous: :doc:`selinux` | Next: :doc:`timers` \ No newline at end of file diff --git a/doc/sphinx-guides/source/developers/index.rst b/doc/sphinx-guides/source/developers/index.rst index 52bde9ee184..b5d44777815 100755 --- a/doc/sphinx-guides/source/developers/index.rst +++ b/doc/sphinx-guides/source/developers/index.rst @@ -31,3 +31,4 @@ Developer Guide geospatial selinux big-data-support + timers diff --git a/doc/sphinx-guides/source/developers/timers.rst b/doc/sphinx-guides/source/developers/timers.rst new file mode 100644 index 00000000000..3620909bc3f --- /dev/null +++ b/doc/sphinx-guides/source/developers/timers.rst @@ -0,0 +1,24 @@ +========== +EJB Timers +========== + +As described in :doc:`../admin/timers`, Dataverse uses EJB timers for scheduled jobs. This section is about the +techniques used for scheduling. + +.. contents:: |toctitle| + :local: + +* :doc:`../admin/metadataexport` is done via the ``@Schedule`` annotation on ``OAISetServiceBean.exportAllSets()`` and + ``DatasetServiceBean.exportAll()``. Fixed to 2AM local time every day, non-persistent. +* Harvesting is a bit more complicated. The timer is attached to ``HarvesterServiceBean.schedule()`` via a + ``@Schedule`` annotation firing every hour, non-persistent. + That method collects all enabled ``HarvestingClient`` configurations and runs each one whose configured schedule + matches the current time. + +**NOTE:** the harvesting timer might cause trouble when a harvest takes longer than one hour, or when multiple +harvests configured for the same starting hour stack up. There is a lock in place to prevent concurrent runs, but it +may result in lost harvests. If this really causes trouble in the future, the code should be refactored to use either +a proper task scheduler, the JBatch API, or asynchronous execution. A *TODO* message has been left in the code. + +---- + +Previous: :doc:`big-data-support` \ No newline at end of file
diff --git a/scripts/installer/glassfish-setup.sh b/scripts/installer/glassfish-setup.sh index 2f7ae279923..a54447e481b 100755 --- a/scripts/installer/glassfish-setup.sh +++ b/scripts/installer/glassfish-setup.sh @@ -122,9 +122,8 @@ function final_setup(){ ./asadmin $ASADMIN_OPTS create-jdbc-resource --connectionpoolid dvnDbPool jdbc/VDCNetDS ### - # Set up the data source for the timers - - ./asadmin $ASADMIN_OPTS set configs.config.server-config.ejb-container.ejb-timer-service.timer-datasource=jdbc/VDCNetDS + # Obsolete since merge of GH-5345, using only non-persistent timers from now on. + #./asadmin $ASADMIN_OPTS set configs.config.server-config.ejb-container.ejb-timer-service.timer-datasource=jdbc/VDCNetDS ./asadmin $ASADMIN_OPTS create-jvm-options "\-Djavax.xml.parsers.SAXParserFactory=com.sun.org.apache.xerces.internal.jaxp.SAXParserFactoryImpl" diff --git a/src/main/java/edu/harvard/iq/dataverse/DatasetServiceBean.java b/src/main/java/edu/harvard/iq/dataverse/DatasetServiceBean.java index 55f8d1e1a92..3b9e515eeaf 100644 --- a/src/main/java/edu/harvard/iq/dataverse/DatasetServiceBean.java +++ b/src/main/java/edu/harvard/iq/dataverse/DatasetServiceBean.java @@ -31,12 +31,7 @@ import java.util.logging.FileHandler; import java.util.logging.Level; import java.util.logging.Logger; -import javax.ejb.Asynchronous; -import javax.ejb.EJB; -import javax.ejb.EJBException; -import javax.ejb.Stateless; -import javax.ejb.TransactionAttribute; -import javax.ejb.TransactionAttributeType; +import javax.ejb.*; import javax.inject.Named; import javax.persistence.EntityManager; import javax.persistence.NoResultException; @@ -576,10 +571,33 @@ public void exportAllAsync() { exportAllDatasets(false); } + /** + * Scheduled function triggering the export of all local & published datasets, + * but only on the node that is configured as the dedicated timer server. + * + * TODO: this is not unit testable as long as dependent functions aren't. + */ + @Lock(LockType.READ) + @Schedule(hour = "2", persistent = false) public void exportAll() { - exportAllDatasets(false); + if (systemConfig.isTimerServer()) { + logger.info("DatasetService: Running a scheduled export job."); + exportAllDatasets(false); + } } + /** + * TODO: this code needs refactoring to be unit testable: + * 1) Move the Logger/FileHandler stuff to a factory in a Service + * (Export or Logging service) a) to make it mockable and + * b) to have common, reusable code. + * 2) Move this to OAIRecordServiceBean. The additional pieces for a + * complete OAI export are in OAISetServiceBean, so it makes more + * sense for this code to live there and use this bean as a service. + * 3) Moving this to OAIRecordServiceBean makes findAllLocalDatasetIds(), etc. + * mockable, so this class (DatasetServiceBean) does not need immediate action. 
+ * @param forceReExport + */ public void exportAllDatasets(boolean forceReExport) { Integer countAll = 0; Integer countSuccess = 0; diff --git a/src/main/java/edu/harvard/iq/dataverse/HarvestingClientsPage.java b/src/main/java/edu/harvard/iq/dataverse/HarvestingClientsPage.java index 826cb2b37d5..609f9f938df 100644 --- a/src/main/java/edu/harvard/iq/dataverse/HarvestingClientsPage.java +++ b/src/main/java/edu/harvard/iq/dataverse/HarvestingClientsPage.java @@ -9,14 +9,12 @@ import edu.harvard.iq.dataverse.engine.command.DataverseRequest; import edu.harvard.iq.dataverse.engine.command.exception.CommandException; import edu.harvard.iq.dataverse.engine.command.impl.CreateHarvestingClientCommand; -import edu.harvard.iq.dataverse.engine.command.impl.DeleteHarvestingClientCommand; import edu.harvard.iq.dataverse.engine.command.impl.UpdateHarvestingClientCommand; import edu.harvard.iq.dataverse.harvest.client.HarvesterServiceBean; import edu.harvard.iq.dataverse.harvest.client.HarvestingClient; import edu.harvard.iq.dataverse.harvest.client.HarvestingClientServiceBean; import edu.harvard.iq.dataverse.harvest.client.oai.OaiHandler; import edu.harvard.iq.dataverse.search.IndexServiceBean; -import edu.harvard.iq.dataverse.timer.DataverseTimerServiceBean; import edu.harvard.iq.dataverse.util.BundleUtil; import edu.harvard.iq.dataverse.util.JsfHelper; import static edu.harvard.iq.dataverse.util.JsfHelper.JH; @@ -65,8 +63,6 @@ public class HarvestingClientsPage implements java.io.Serializable { IndexServiceBean indexService; @EJB EjbDataverseEngine engineService; - @EJB - DataverseTimerServiceBean dataverseTimerService; @Inject DataverseRequestServiceBean dvRequestService; @Inject @@ -453,9 +449,6 @@ public void saveClient(ActionEvent ae) { configuredHarvestingClients = harvestingClientService.getAllHarvestingClients(); - if (!harvestingClient.isScheduled()) { - dataverseTimerService.removeHarvestTimer(harvestingClient); - } JsfHelper.addSuccessMessage(BundleUtil.getStringFromBundle("harvest.update.success") + harvestingClient.getName()); } catch (CommandException ex) { diff --git a/src/main/java/edu/harvard/iq/dataverse/harvest/client/HarvesterServiceBean.java b/src/main/java/edu/harvard/iq/dataverse/harvest/client/HarvesterServiceBean.java index 40058dc734f..6df9c0624c3 100644 --- a/src/main/java/edu/harvard/iq/dataverse/harvest/client/HarvesterServiceBean.java +++ b/src/main/java/edu/harvard/iq/dataverse/harvest/client/HarvesterServiceBean.java @@ -8,32 +8,26 @@ import edu.harvard.iq.dataverse.Dataset; import edu.harvard.iq.dataverse.DatasetServiceBean; import edu.harvard.iq.dataverse.Dataverse; -import edu.harvard.iq.dataverse.DataverseServiceBean; -import edu.harvard.iq.dataverse.timer.DataverseTimerServiceBean; +import edu.harvard.iq.dataverse.authorization.AuthenticationServiceBean; +import edu.harvard.iq.dataverse.authorization.users.AuthenticatedUser; + import java.io.File; import java.io.IOException; import java.text.SimpleDateFormat; -import java.util.ArrayList; -import java.util.Date; -import java.util.Iterator; -import java.util.List; +import java.util.*; //import java.net.URLEncoder; import java.util.logging.FileHandler; import java.util.logging.Level; import java.util.logging.Logger; -import javax.annotation.Resource; -import javax.ejb.Asynchronous; -import javax.ejb.EJB; -import javax.ejb.EJBException; -import javax.ejb.Stateless; -import javax.ejb.Timer; -import javax.ejb.TransactionAttribute; -import javax.ejb.TransactionAttributeType; +import javax.ejb.*; import 
javax.faces.bean.ManagedBean; import javax.inject.Named; //import javax.xml.bind.Unmarshaller; +import javax.servlet.http.HttpServletRequest; import javax.xml.parsers.ParserConfigurationException; import javax.xml.transform.TransformerException; + +import edu.harvard.iq.dataverse.util.SystemConfig; import org.apache.commons.lang.mutable.MutableBoolean; import org.apache.commons.lang.mutable.MutableLong; import org.xml.sax.SAXException; @@ -50,6 +44,7 @@ import edu.harvard.iq.dataverse.harvest.client.oai.OaiHandler; import edu.harvard.iq.dataverse.harvest.client.oai.OaiHandlerException; import edu.harvard.iq.dataverse.search.IndexServiceBean; + import java.io.FileWriter; import java.io.PrintWriter; import javax.persistence.EntityManager; @@ -66,14 +61,8 @@ public class HarvesterServiceBean { @PersistenceContext(unitName="VDCNet-ejbPU") private EntityManager em; - @EJB - DataverseServiceBean dataverseService; @EJB DatasetServiceBean datasetService; - @Resource - javax.ejb.TimerService timerService; - @EJB - DataverseTimerServiceBean dataverseTimerService; @EJB HarvestingClientServiceBean harvestingClientService; @EJB @@ -82,6 +71,10 @@ EjbDataverseEngine engineService; @EJB IndexServiceBean indexService; + @EJB + AuthenticationServiceBean authSvc; + @EJB + SystemConfig systemConfig; private static final Logger logger = Logger.getLogger("edu.harvard.iq.dataverse.harvest.client.HarvesterServiceBean"); private static final SimpleDateFormat formatter = new SimpleDateFormat("yyyy-MM-dd"); @@ -96,6 +89,69 @@ public HarvesterServiceBean() { } + /** + * Run scheduled harvesting every hour. + * Check each client's admin UI settings to determine whether it should run. + * + * This code uses a WRITE lock. As it is run every hour, harvesting needs to be done + * within that time, or things will fail. + * + * TODO: This is not unit testable as long as the dependent method doHarvest() is not. + * TODO: Maybe switch harvesting to use async or JBatch if necessary (timeouts, running longer than 1 hour, ...) + */ + @Lock(LockType.WRITE) + @Schedule(hour = "*", persistent = false) + public void schedule() { + // Fail silently when this is not the dedicated timer node. + if (!systemConfig.isTimerServer()) { + return; + } + + // Timer batch jobs are run by the main Admin user. + AuthenticatedUser adminUser = authSvc.getAdminUser(); + if (adminUser == null) { + logger.severe("Scheduled harvest failed to locate the admin user! Exiting."); + return; + } + + // TODO: Refactor and add a lookup retrieving enabled clients only from the service. + for (HarvestingClient client : harvestingClientService.getAllHarvestingClients()) { + // Skip clients whose scheduled harvesting is not enabled. + if (!client.isScheduled()) { + continue; + } + + // Determine if this client needs to be run (avoids code duplication) + boolean run = ( + // Check schedule: if "daily", check current hour and run on match. + HarvestingClient.SCHEDULE_PERIOD_DAILY.equals(client.getSchedulePeriod()) && + Calendar.getInstance().get(Calendar.HOUR_OF_DAY) == client.getScheduleHourOfDay() + ) || ( + // Check schedule: if "weekly", check current day of week plus hour and run on match. + HarvestingClient.SCHEDULE_PERIOD_WEEKLY.equals(client.getSchedulePeriod()) && + // ("Day of week" in DB zero-based but Calendar is one-based!) 
+ Calendar.getInstance().get(Calendar.DAY_OF_WEEK)-1 == client.getScheduleDayOfWeek() && + Calendar.getInstance().get(Calendar.HOUR_OF_DAY) == client.getScheduleHourOfDay() + ); + + if (run) { + logger.info("Running harvesting client: id=" + client.getId() + " name=" + client.getName() + + " using admin user id=" + adminUser.getId() + " name=" + adminUser.getName()); + DataverseRequest dataverseRequest = new DataverseRequest(adminUser, (HttpServletRequest)null); + try { + // TODO: if harvests are reported to be slow or fail, it might be necessary to switch to + // async calling, so harvests happen simultaneously. + doHarvest(dataverseRequest, client.getId()); + logger.info("Harvesting client id=" + client.getId() + " finished."); + } catch (Exception ex) { + // Just log and continue with the next client. + logger.log(Level.SEVERE, "Scheduled harvest failed with Exception:", ex); + continue; + } + } + } + } + /** * Called to run an "On Demand" harvest. */ @@ -109,34 +165,15 @@ public void doAsyncHarvest(DataverseRequest dataverseRequest, HarvestingClient h } } - public void createScheduledHarvestTimers() { - logger.log(Level.INFO, "HarvesterService: going to (re)create Scheduled harvest timers."); - dataverseTimerService.removeHarvestTimers(); - - List configuredClients = harvestingClientService.getAllHarvestingClients(); - for (Iterator it = configuredClients.iterator(); it.hasNext();) { - HarvestingClient harvestingConfig = (HarvestingClient) it.next(); - if (harvestingConfig.isScheduled()) { - dataverseTimerService.createHarvestTimer(harvestingConfig); - } - } - } - - public List getHarvestTimers() { - ArrayList timers = new ArrayList<>(); - - for (Iterator it = timerService.getTimers().iterator(); it.hasNext();) { - Timer timer = (Timer) it.next(); - if (timer.getInfo() instanceof HarvestTimerInfo) { - HarvestTimerInfo info = (HarvestTimerInfo) timer.getInfo(); - timers.add(info); - } - } - return timers; - } - /** * Run a harvest for an individual harvesting Dataverse + * + * TODO: this code needs refactoring to be unit testable: + * 1) Move the Logger/FileHandler stuff to a factory in a Service + * (Export or Logging service) a) to make it mockable and + * b) to have common, reusable code. + * 2) Dependent functions below are not yet testable. 
+ * * @param dataverseRequest * @param harvestingClientId * @throws IOException diff --git a/src/main/java/edu/harvard/iq/dataverse/harvest/client/HarvestingClientServiceBean.java b/src/main/java/edu/harvard/iq/dataverse/harvest/client/HarvestingClientServiceBean.java index 0af73550190..f5ed8cae885 100644 --- a/src/main/java/edu/harvard/iq/dataverse/harvest/client/HarvestingClientServiceBean.java +++ b/src/main/java/edu/harvard/iq/dataverse/harvest/client/HarvestingClientServiceBean.java @@ -5,10 +5,8 @@ import edu.harvard.iq.dataverse.DataverseRequestServiceBean; import edu.harvard.iq.dataverse.DataverseServiceBean; import edu.harvard.iq.dataverse.EjbDataverseEngine; -import edu.harvard.iq.dataverse.engine.command.exception.CommandException; -import edu.harvard.iq.dataverse.engine.command.impl.DeleteHarvestingClientCommand; import edu.harvard.iq.dataverse.search.IndexServiceBean; -import edu.harvard.iq.dataverse.timer.DataverseTimerServiceBean; + import java.util.ArrayList; import java.util.Date; import java.util.List; @@ -44,8 +42,6 @@ public class HarvestingClientServiceBean implements java.io.Serializable { DataverseRequestServiceBean dvRequestService; @EJB IndexServiceBean indexService; - @EJB - DataverseTimerServiceBean dataverseTimerService; @PersistenceContext(unitName = "VDCNet-ejbPU") private EntityManager em; @@ -139,9 +135,6 @@ public void deleteClient(Long clientId) { try { //engineService.submit(new DeleteHarvestingClientCommand(dvRequestService.getDataverseRequest(), victim)); HarvestingClient merged = em.merge(victim); - - // if this was a scheduled harvester, make sure the timer is deleted: - dataverseTimerService.removeHarvestTimer(victim); // purge indexed objects: indexService.deleteHarvestedDocuments(victim); diff --git a/src/main/java/edu/harvard/iq/dataverse/harvest/server/OAIRecordServiceBean.java b/src/main/java/edu/harvard/iq/dataverse/harvest/server/OAIRecordServiceBean.java index 973c712b5c8..2e4c4873847 100644 --- a/src/main/java/edu/harvard/iq/dataverse/harvest/server/OAIRecordServiceBean.java +++ b/src/main/java/edu/harvard/iq/dataverse/harvest/server/OAIRecordServiceBean.java @@ -225,8 +225,13 @@ public void markOaiRecordsAsRemoved(Collection records, Date updateTi // (why these need to be in an EJB bean at all, what's wrong with keeping // them in the loadable ExportService? - since we need to modify the // "last export" timestamp on the dataset, being able to do that in the - // @EJB context is convenient. + // @EJB context is convenient. + /** + * TODO: This should be refactored to be the scheduled export routine. + * See DatasetServiceBean.exportAllDatasets(). + * This should use an Export service EJB to make things mockable. 
+ */ public void exportAllFormats(Dataset dataset) { try { ExportService exportServiceInstance = ExportService.getInstance(settingsService); diff --git a/src/main/java/edu/harvard/iq/dataverse/harvest/server/OAISetServiceBean.java b/src/main/java/edu/harvard/iq/dataverse/harvest/server/OAISetServiceBean.java index 3f4aa3e43fc..e1eb329a80e 100644 --- a/src/main/java/edu/harvard/iq/dataverse/harvest/server/OAISetServiceBean.java +++ b/src/main/java/edu/harvard/iq/dataverse/harvest/server/OAISetServiceBean.java @@ -16,11 +16,7 @@ import java.util.logging.FileHandler; import java.util.logging.Level; import java.util.logging.Logger; -import javax.ejb.Asynchronous; -import javax.ejb.EJB; -import javax.ejb.Stateless; -import javax.ejb.TransactionAttribute; -import javax.ejb.TransactionAttributeType; +import javax.ejb.*; import javax.inject.Named; import javax.persistence.EntityManager; import javax.persistence.PersistenceContext; @@ -185,9 +181,28 @@ public void exportOaiSet(OAISet oaiSet, Logger exportLogger) { //} managedSet.setUpdateInProgress(false); - } + } + /** + * Scheduled export of all local & published datasets for OAI interface harvesting. + * Only runs on the node configured as the dedicated timer server. + * + * TODO: this code needs refactoring to be unit testable: + * Move the Logger/FileHandler stuff to a factory in a Service + * (Export or Logging service) a) to make it mockable and + * b) to have common, reusable code. + */ + @Lock(LockType.READ) + @Schedule(hour = "2", persistent = false) public void exportAllSets() { + // In case this node is not the timer server, skip silently. + if (!systemConfig.isTimerServer()) { + return; + } + logger.info("OAISetService: Running a scheduled export job."); + + // TODO: this should be refactored to handle container usage, where these logs should not be + // saved locally, but get streamed to a handler like STDOUT. String logTimestamp = logFormatter.format(new Date()); Logger exportLogger = Logger.getLogger("edu.harvard.iq.dataverse.harvest.client.OAISetServiceBean." + "UpdateAllSets." + logTimestamp); String logFileName = "../logs" + File.separator + "oaiSetsUpdate_" + logTimestamp + ".log"; diff --git a/src/main/java/edu/harvard/iq/dataverse/timer/DataverseTimerServiceBean.java b/src/main/java/edu/harvard/iq/dataverse/timer/DataverseTimerServiceBean.java deleted file mode 100644 index f4a30139a97..00000000000 --- a/src/main/java/edu/harvard/iq/dataverse/timer/DataverseTimerServiceBean.java +++ /dev/null @@ -1,350 +0,0 @@ -/* - * To change this license header, choose License Headers in Project Properties. - * To change this template file, choose Tools | Templates - * and open the template in the editor. 
- */ -package edu.harvard.iq.dataverse.timer; - -import edu.harvard.iq.dataverse.DatasetServiceBean; -import edu.harvard.iq.dataverse.Dataverse; -import edu.harvard.iq.dataverse.DataverseServiceBean; -import edu.harvard.iq.dataverse.authorization.AuthenticationServiceBean; -import edu.harvard.iq.dataverse.authorization.providers.builtin.BuiltinUser; -import edu.harvard.iq.dataverse.authorization.providers.builtin.BuiltinUserServiceBean; -import edu.harvard.iq.dataverse.authorization.users.AuthenticatedUser; -import edu.harvard.iq.dataverse.engine.command.DataverseRequest; -import edu.harvard.iq.dataverse.harvest.client.HarvestingClient; -import edu.harvard.iq.dataverse.harvest.client.HarvestTimerInfo; -import edu.harvard.iq.dataverse.harvest.client.HarvesterServiceBean; -import edu.harvard.iq.dataverse.harvest.client.HarvestingClientServiceBean; -import edu.harvard.iq.dataverse.harvest.server.OAISetServiceBean; -import edu.harvard.iq.dataverse.util.SystemConfig; -import java.io.IOException; -import java.io.Serializable; -import java.net.InetAddress; -import java.net.UnknownHostException; -import java.util.Calendar; -import java.util.Date; -import java.util.Iterator; -import java.util.logging.Level; -import java.util.logging.Logger; -import javax.annotation.PostConstruct; -import javax.annotation.Resource; -import javax.ejb.EJB; -import javax.ejb.Singleton; -import javax.ejb.Startup; -import javax.ejb.Stateless; -import javax.ejb.Timeout; -import javax.ejb.Timer; -import javax.ejb.TransactionAttribute; -import javax.ejb.TransactionAttributeType; -import javax.persistence.EntityManager; -import javax.persistence.PersistenceContext; -import javax.servlet.http.HttpServletRequest; - - -/** - * - * This is a largely intact DVN3 implementation. - * original - * @author roberttreacy - * ported by - * @author Leonid Andreev - */ -//@Stateless - -@Singleton -@Startup -public class DataverseTimerServiceBean implements Serializable { - @Resource - javax.ejb.TimerService timerService; - @PersistenceContext(unitName = "VDCNet-ejbPU") - private EntityManager em; - private static final Logger logger = Logger.getLogger("edu.harvard.iq.dataverse.timer.DataverseTimerServiceBean"); - @EJB - HarvesterServiceBean harvesterService; - @EJB - DataverseServiceBean dataverseService; - @EJB - HarvestingClientServiceBean harvestingClientService; - @EJB - AuthenticationServiceBean authSvc; - @EJB - DatasetServiceBean datasetService; - @EJB - OAISetServiceBean oaiSetService; - @EJB - SystemConfig systemConfig; - - - // The init method that wipes and recreates all the timers on startup - //@PostConstruct - - @PostConstruct - public void init() { - logger.info("PostConstruct timer check."); - - - if (systemConfig.isTimerServer()) { - logger.info("I am the dedicated timer server. Initializing mother timer."); - - removeAllTimers(); - // create mother timer: - createMotherTimer(); - // And the export timer (there is only one) - createExportTimer(); - - } else { - logger.info("Skipping timer server init (I am not the dedicated timer server)"); - } - } - - public void createTimer(Date initialExpiration, long intervalDuration, Serializable info) { - try { - logger.log(Level.INFO,"Creating timer on " + InetAddress.getLocalHost().getCanonicalHostName()); - } catch (UnknownHostException ex) { - Logger.getLogger(DataverseTimerServiceBean.class.getName()).log(Level.SEVERE, null, ex); - } - timerService.createTimer(initialExpiration, intervalDuration, info); - } - - /** - * This method is called whenever an EJB Timer goes off. 
- * Check to see if this is a Harvest Timer, and if it is - * Run the harvest for the given (scheduled) dataverse - * @param timer - */ - @Timeout - @TransactionAttribute(TransactionAttributeType.NOT_SUPPORTED) - public void handleTimeout(javax.ejb.Timer timer) { - // We have to put all the code in a try/catch block because - // if an exception is thrown from this method, Glassfish will automatically - // call the method a second time. (The minimum number of re-tries for a Timer method is 1) - - if (!systemConfig.isTimerServer()) { - //logger.info("I am not the timer server! - bailing out of handleTimeout()"); - Logger.getLogger(DataverseTimerServiceBean.class.getName()).log(Level.WARNING, null, "I am not the timer server! - but handleTimeout() got called. Please investigate!"); - } - - try { - logger.log(Level.INFO,"Handling timeout on " + InetAddress.getLocalHost().getCanonicalHostName()); - } catch (UnknownHostException ex) { - Logger.getLogger(DataverseTimerServiceBean.class.getName()).log(Level.SEVERE, null, ex); - } - - if (timer.getInfo() instanceof MotherTimerInfo) { - logger.info("Behold! I am the Master Timer, king of all timers! I'm here to create all the lesser timers!"); - removeHarvestTimers(); - for (HarvestingClient client : harvestingClientService.getAllHarvestingClients()) { - createHarvestTimer(client); - } - } else if (timer.getInfo() instanceof HarvestTimerInfo) { - HarvestTimerInfo info = (HarvestTimerInfo) timer.getInfo(); - try { - - logger.log(Level.INFO, "running a harvesting client: id=" + info.getHarvestingClientId()); - // Timer batch jobs are run by the main Admin user. - // TODO: revisit how we retrieve the superuser here. - // Should it be configurable somewhere, which superuser - // runs these jobs? Should there be a central mechanism for obtaining - // the "major", builtin superuser for this Dataverse instance? - // -- L.A. 4.5, Aug. 2016 - AuthenticatedUser adminUser = authSvc.getAdminUser(); // getAuthenticatedUser("admin"); - if (adminUser == null) { - logger.info("Scheduled harvest: failed to locate the admin user! Exiting."); - throw new IOException("Scheduled harvest: failed to locate the admin user"); - } - logger.info("found admin user "+adminUser.getName()); - DataverseRequest dataverseRequest = new DataverseRequest(adminUser, (HttpServletRequest)null); - harvesterService.doHarvest(dataverseRequest, info.getHarvestingClientId()); - - } catch (Throwable e) { - // Harvester Service should be handling any error notifications, - // if/when things go wrong. - // (TODO: -- verify this logic; harvesterService may still be able - // to throw an IOException, if it could not run the harvest at all, - // or could not for whatever reason modify the database record... - // in this case we should, probably, log the error and try to send - // a mail notification. -- L.A. 
4.4) - //dataverseService.setHarvestResult(info.getHarvestingDataverseId(), harvesterService.HARVEST_RESULT_FAILED); - //mailService.sendHarvestErrorNotification(dataverseService.find().getSystemEmail(), dataverseService.find().getName()); - logException(e, logger); - } - } else if (timer.getInfo() instanceof ExportTimerInfo) { - try { - ExportTimerInfo info = (ExportTimerInfo) timer.getInfo(); - logger.info("Timer Service: Running a scheduled export job."); - - // try to export all unexported datasets: - datasetService.exportAll(); - // and update all oai sets: - oaiSetService.exportAllSets(); - } catch (Throwable e) { - logException(e, logger); - } - } - - } - - public void removeAllTimers() { - logger.info("Removing ALL existing timers."); - - int i = 0; - - for (Iterator it = timerService.getTimers().iterator(); it.hasNext();) { - - Timer timer = (Timer) it.next(); - - logger.info("Removing timer " + i + ";"); - timer.cancel(); - - i++; - } - logger.info("Done!"); - } - - public void removeHarvestTimers() { - // Remove all the harvest timers, if exist: - // - // (the logging messages below are set to level INFO; it's ok, - // since this code is only called on startup of the application, - // and it may be useful to know what existing timers were encountered). - - logger.log(Level.INFO,"Removing existing harvest timers.."); - - int i = 1; - for (Iterator it = timerService.getTimers().iterator(); it.hasNext();) { - - Timer timer = (Timer) it.next(); - logger.log(Level.INFO, "HarvesterService: checking timer "+i); - - if (timer.getInfo() instanceof HarvestTimerInfo) { - logger.log(Level.INFO, "HarvesterService: timer "+i+" is a harvesting one; removing."); - timer.cancel(); - } - - i++; - } - } - - public void createMotherTimer() { - MotherTimerInfo info = new MotherTimerInfo(); - Calendar initExpiration = Calendar.getInstance(); - long intervalDuration = 60 * 60 * 1000; // every hour - initExpiration.set(Calendar.MINUTE, 50); - initExpiration.set(Calendar.SECOND, 0); - - Date initExpirationDate = initExpiration.getTime(); - Date currTime = new Date(); - if (initExpirationDate.before(currTime)) { - initExpirationDate.setTime(initExpiration.getTimeInMillis() + intervalDuration); - } - - logger.info("Setting the \"Mother Timer\", initial expiration: " + initExpirationDate); - createTimer(initExpirationDate, intervalDuration, info); - } - - public void createHarvestTimer(HarvestingClient harvestingClient) { - - if (harvestingClient.isScheduled()) { - long intervalDuration = 0; - Calendar initExpiration = Calendar.getInstance(); - initExpiration.set(Calendar.MINUTE, 0); - initExpiration.set(Calendar.SECOND, 0); - if (harvestingClient.getSchedulePeriod().equals(HarvestingClient.SCHEDULE_PERIOD_DAILY)) { - intervalDuration = 1000 * 60 * 60 * 24; - initExpiration.set(Calendar.HOUR_OF_DAY, harvestingClient.getScheduleHourOfDay()); - - } else if (harvestingClient.getSchedulePeriod().equals(harvestingClient.SCHEDULE_PERIOD_WEEKLY)) { - intervalDuration = 1000 * 60 * 60 * 24 * 7; - initExpiration.set(Calendar.HOUR_OF_DAY, harvestingClient.getScheduleHourOfDay()); - initExpiration.set(Calendar.DAY_OF_WEEK, harvestingClient.getScheduleDayOfWeek() + 1); //(saved as zero-based array but Calendar is one-based.) 
- - } else { - logger.log(Level.WARNING, "Could not set timer for harvesting client id=" + harvestingClient.getId() + ", unknown schedule period: " + harvestingClient.getSchedulePeriod()); - return; - } - Date initExpirationDate = initExpiration.getTime(); - Date currTime = new Date(); - if (initExpirationDate.before(currTime)) { - initExpirationDate.setTime(initExpiration.getTimeInMillis() + intervalDuration); - } - logger.log(Level.INFO, "Setting timer for harvesting client " + harvestingClient.getName() + ", initial expiration: " + initExpirationDate); - createTimer(initExpirationDate, intervalDuration, new HarvestTimerInfo(harvestingClient.getId(), harvestingClient.getName(), harvestingClient.getSchedulePeriod(), harvestingClient.getScheduleHourOfDay(), harvestingClient.getScheduleDayOfWeek())); - } - } - - public void updateHarvestTimer(HarvestingClient harvestingClient) { - removeHarvestTimer(harvestingClient); - createHarvestTimer(harvestingClient); - } - - - public void removeHarvestTimer(HarvestingClient harvestingClient) { - // Clear dataverse timer, if one exists - try { - logger.log(Level.INFO,"Removing harvest timer on " + InetAddress.getLocalHost().getCanonicalHostName()); - } catch (UnknownHostException ex) { - Logger.getLogger(DataverseTimerServiceBean.class.getName()).log(Level.SEVERE, null, ex); - } - for (Iterator it = timerService.getTimers().iterator(); it.hasNext();) { - Timer timer = (Timer) it.next(); - if (timer.getInfo() instanceof HarvestTimerInfo) { - HarvestTimerInfo info = (HarvestTimerInfo) timer.getInfo(); - if (info.getHarvestingClientId().equals(harvestingClient.getId())) { - timer.cancel(); - } - } - } - } - - public void createExportTimer() { - ExportTimerInfo info = new ExportTimerInfo(); - Calendar initExpiration = Calendar.getInstance(); - long intervalDuration = 24 * 60 * 60 * 1000; // every day - initExpiration.set(Calendar.MINUTE, 0); - initExpiration.set(Calendar.SECOND, 0); - initExpiration.set(Calendar.HOUR_OF_DAY, 2); // 2AM, fixed. - - - Date initExpirationDate = initExpiration.getTime(); - Date currTime = new Date(); - if (initExpirationDate.before(currTime)) { - initExpirationDate.setTime(initExpiration.getTimeInMillis() + intervalDuration); - } - - logger.info("Setting the Export Timer, initial expiration: " + initExpirationDate); - createTimer(initExpirationDate, intervalDuration, info); - } - - public void createExportTimer(Dataverse dataverse) { - /* Not yet implemented. The DVN 3 implementation can be used as a model */ - - } - - public void removeExportTimer() { - /* Not yet implemented. The DVN 3 implementation can be used as a model */ - } - - /* Utility methods: */ - private void logException(Throwable e, Logger logger) { - - boolean cause = false; - String fullMessage = ""; - do { - String message = e.getClass().getName() + " " + e.getMessage(); - if (cause) { - message = "\nCaused By Exception.................... 
" + e.getClass().getName() + " " + e.getMessage(); - } - StackTraceElement[] ste = e.getStackTrace(); - message += "\nStackTrace: \n"; - for (int m = 0; m < ste.length; m++) { - message += ste[m].toString() + "\n"; - } - fullMessage += message; - cause = true; - } while ((e = e.getCause()) != null); - logger.severe(fullMessage); - } - -} \ No newline at end of file diff --git a/src/main/java/edu/harvard/iq/dataverse/timer/ExportTimerInfo.java b/src/main/java/edu/harvard/iq/dataverse/timer/ExportTimerInfo.java deleted file mode 100644 index d0f93f1c9c5..00000000000 --- a/src/main/java/edu/harvard/iq/dataverse/timer/ExportTimerInfo.java +++ /dev/null @@ -1,54 +0,0 @@ -/* - Copyright (C) 2005-2012, by the President and Fellows of Harvard College. - - Licensed under the Apache License, Version 2.0 (the "License"); - you may not use this file except in compliance with the License. - You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - - Unless required by applicable law or agreed to in writing, software - distributed under the License is distributed on an "AS IS" BASIS, - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - See the License for the specific language governing permissions and - limitations under the License. - - Dataverse Network - A web application to share, preserve and analyze research data. - Developed at the Institute for Quantitative Social Science, Harvard University. - Version 3.0. -*/ - -package edu.harvard.iq.dataverse.timer; - -import java.io.Serializable; - -/** - * - * @author Leonid Andreev - * This is the Export Timer, that executes regular export jobs. - * As of now (4.5) there is only 1; it's not configurable - rather it gets started - * on every restart/deployment automatically. - * If we have to add more configurable exports further down the road, more settings - * can be added here. - */ -public class ExportTimerInfo implements Serializable { - - String serverId; - - public String getServerId() { - return serverId; - } - - public void setServerId(String serverId) { - this.serverId = serverId; - } - - public ExportTimerInfo() { - - } - - public ExportTimerInfo(String serverId) { - this.serverId = serverId; - } - -} \ No newline at end of file diff --git a/src/main/java/edu/harvard/iq/dataverse/timer/MotherTimerInfo.java b/src/main/java/edu/harvard/iq/dataverse/timer/MotherTimerInfo.java deleted file mode 100644 index 113355c663b..00000000000 --- a/src/main/java/edu/harvard/iq/dataverse/timer/MotherTimerInfo.java +++ /dev/null @@ -1,51 +0,0 @@ -/* - Copyright (C) 2005-2012, by the President and Fellows of Harvard College. - - Licensed under the Apache License, Version 2.0 (the "License"); - you may not use this file except in compliance with the License. - You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - - Unless required by applicable law or agreed to in writing, software - distributed under the License is distributed on an "AS IS" BASIS, - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - See the License for the specific language governing permissions and - limitations under the License. - - Dataverse Network - A web application to share, preserve and analyze research data. - Developed at the Institute for Quantitative Social Science, Harvard University. - Version 3.0. 
-*/ - -package edu.harvard.iq.dataverse.timer; - -import java.io.Serializable; - -/** - * - * @author Leonid Andreev - * This is the "Mother Timer", that runs on the dedicated timer service and - * starts other timers. - */ -public class MotherTimerInfo implements Serializable { - - String serverId; - - public String getServerId() { - return serverId; - } - - public void setServerId(String serverId) { - this.serverId = serverId; - } - - public MotherTimerInfo() { - - } - - public MotherTimerInfo(String serverId) { - this.serverId = serverId; - } - -} \ No newline at end of file
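For context on the classes deleted above: the DVN3-era approach created EJB timers programmatically and dispatched all expirations through a single @Timeout callback, using the info objects (HarvestTimerInfo, ExportTimerInfo, MotherTimerInfo) to tell them apart. The annotation-based approach this changeset moves to declares each schedule in place instead. A compact, illustrative sketch of the two styles side by side (not actual Dataverse code):

    import java.util.Date;
    import javax.annotation.Resource;
    import javax.ejb.Schedule;
    import javax.ejb.Singleton;
    import javax.ejb.Startup;
    import javax.ejb.Timeout;
    import javax.ejb.Timer;
    import javax.ejb.TimerService;

    @Singleton
    @Startup
    public class TimerStyleComparison {

        @Resource
        TimerService timerService;

        // Old, programmatic style (as in the deleted bean): the timer is created
        // by hand, persisted by the container, and every expiration is dispatched
        // through one @Timeout callback keyed on the timer's info object.
        public void createExportTimerOldStyle() {
            long oneDay = 24 * 60 * 60 * 1000L;
            timerService.createTimer(new Date(), oneDay, "export");
        }

        @Timeout
        public void handleTimeout(Timer timer) {
            if ("export".equals(timer.getInfo())) {
                // ... run the export ...
            }
        }

        // New, declarative style (as introduced by this changeset): one annotation
        // per job, no timer-info classes, and nothing persisted to the database.
        @Schedule(hour = "2", persistent = false)
        public void exportAllNewStyle() {
            // ... run the export ...
        }
    }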