This repository has been archived by the owner on Jan 30, 2024. It is now read-only.
Releases: PanDAWMS/pilot2
PanDA Pilot 2.1.11
- Rucio API instead of CLI
- Using TRF name to identify the PID that the memory monitor should track; increased waiting time for payload to start within a container
- Initial stage-in workflow module (in which the pilot only stages in data and then quits; lots of details remain to be implemented); internal thread and job queue setup only
- ‘storm’ added to allowed protocols (Requested by Tomas Javurek)
- Fix for pilot reading output files from job report in user jobs when it shouldn’t
- Now supporting allowNoOutput file lists from job definition
- Fix for exception thrown because of failed memory monitor
- Introduced BADMEMORYMONITORJSON error code 1337, plus improvements in handling JSON read failures (could lead to exception in previous pilot versions)
- Fixes for XML problems seen on ND (missing SURL) and symlink issues in combination with containers
Code contributions from Tomas Javurek, David Cameron, Paul Nilsson
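The JSON handling change above (read failures no longer raising an exception, and being mapped to error code 1337) can be illustrated roughly as follows. This is only a sketch, not the pilot's actual code: the function name `read_memory_monitor_json` and the `(data, error_code)` return convention are assumptions made for illustration; only the error name and value come from the notes.

```python
import json

# Error code named in the 2.1.11 notes.
BADMEMORYMONITORJSON = 1337


def read_memory_monitor_json(path):
    """Hypothetical helper: return (data, error_code), error_code 0 on success."""
    try:
        with open(path) as handle:
            return json.load(handle), 0
    except (OSError, ValueError):
        # OSError: file missing or unreadable.
        # ValueError: malformed JSON (json.JSONDecodeError subclasses it).
        return None, BADMEMORYMONITORJSON
```

Returning an error code instead of letting the exception propagate lets the caller report the failure in the job status rather than crashing the monitoring thread.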
PanDA Pilot 2.1.10
- Corrected nevents->nEvents which led to number of events not being read from jobReport
- Corrected getctime()->getmtime() which led to some issues with looping job killer
- Added Process and merged_lhef._0.events-new to list of files/directories to be removed before log file is created, to avoid large log files
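The getctime()->getmtime() correction matters because, on Unix, the ctime of a file is the time of the last inode change (which also moves on chmod or rename), while the mtime is the time of the last content write. A looping-job check is interested in the latter. A minimal sketch of such a check, with the hypothetical helper name `seconds_since_last_write` (not taken from the pilot source):

```python
import os
import time


def seconds_since_last_write(paths, now=None):
    """Hypothetical helper: seconds since the newest content modification.

    Returns None when no paths are given.
    """
    if not paths:
        return None
    now = time.time() if now is None else now
    # getmtime reflects content writes; getctime would also react to
    # metadata changes such as chmod on Unix.
    newest = max(os.path.getmtime(path) for path in paths)
    return now - newest
```

A caller would compare the returned value against a configured looping limit and kill the payload when the limit is exceeded.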
PanDA Pilot 2.1.9
- Now sending heartbeat during stage-in. Previously, heartbeats were only sent after stage-in had finished, which could lead to a lost heartbeat if stage-in took a long time
- Now setting exeerrorcode and exeerrordiag before final server update (detected missing in BOINC jobs)
- X509_USER_PROXY removed from command executed inside container (moved to command executing container)
- Bucket ID fix from Wen Guan (previously the code did not handle bucket IDs; problem seen with some ES merge jobs)
2.1.8
- Fixed urgent problem with a file not being closed, which led to problems with input file lists seen in ES jobs
- Only storing 1kB in log extracts
- Now returning NOPROXY error to wrapper if that occurs
2.1.7
- Added missing values in server update; nCores, nEvents and cpu consumption info (CPU model info is added to cpu consumption unit)
- Now exposing use_pcache field in queuedata (from A. Anisenkov)
Some minor refactoring was also done to please flake8. Thanks to Emmanouil for spotting the missing CPU consumption time.
2.1.6
- Added new error code 1335, MISSINGUSERCODE
- Now adding 10% grace margin in get_max_allowed_work_dir_size()
- Now displaying OS/architecture info after the pilot version banner
- Now fitting PSS+Swap as requested by Alessandra, Rod
- Added pool.root file pattern to cleanup_payload(); previously root files were left in log files in athenaMP jobs
2.1.5
- Exporting INDS in get_payload_environment_variables() to make it available inside the container, requested by EventIndex people
- Resetting getjob_requests counter after processed job (bug fix)
- Now fitting VMEM instead of PSS (leak estimation). Skipping tails. Requiring more than two data points to fit data
- Patch for surl in PFC, should be local path in non-direct access mode (reported by R. Walker)
- Now overwriting output file list instead of appending it when new output files have been discovered in job report
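The leak estimation mentioned above (fitting a memory value against time, skipping tails, and requiring more than two data points) could look roughly like the least-squares sketch below. The function name `fit_memory_slope`, the `skip_tail` parameter, and the reading of "skipping tails" as dropping the first and last samples are all assumptions for illustration; the notes do not specify the actual procedure.

```python
def fit_memory_slope(times, values, skip_tail=1):
    """Hypothetical helper: least-squares slope of values against times.

    Returns None when the series lengths differ or when too few points
    remain after trimming.
    """
    if len(times) != len(values):
        return None
    if skip_tail > 0:
        # Assumption: "tails" means the first and last skip_tail samples.
        times = times[skip_tail:-skip_tail]
        values = values[skip_tail:-skip_tail]
    n = len(times)
    if n <= 2:
        # The notes require more than two data points for a fit.
        return None
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    if sxx == 0:
        return None  # all samples share one timestamp, slope undefined
    sxy = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values))
    return sxy / sxx
```

With times in seconds and memory in kB, the returned slope is in kB/s; a persistently positive value would hint at a leak.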
2.1.4
- Removed usage of tarfile module in favour of CLI command, i.e. same as Pilot 1. This is currently necessary since tarfile does not support the --one-file-system instruction. This affected at least one production task, where a file to be tarred resided on /tmp, which could not be reached
- Now sending memory information with job updates
- Explicitly removing workDir directory in user jobs before log tarball creation
- Ignoring 'local' storage_token for .lib.-files (previously led to direct io bug reported by Rod)
- Now sending harvester_id and worker_id to server with getJob call
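As a rough illustration of the tarfile replacement described in these notes, the sketch below builds a GNU tar command line with --one-file-system and runs it through subprocess. The helper names `build_tar_command` and `create_log_tarball`, the gzip compression, and the use of `-C` are assumptions; the notes only state that a CLI command is used so that --one-file-system is available.

```python
import os
import subprocess


def build_tar_command(work_dir, tarball_path):
    """Hypothetical helper: build a GNU tar command that stays on one file system."""
    parent, name = os.path.split(os.path.abspath(work_dir))
    # -C switches to the parent directory so the archive holds relative paths.
    return ["tar", "--one-file-system", "-czf", tarball_path, "-C", parent, name]


def create_log_tarball(work_dir, tarball_path):
    """Run the command; raises subprocess.CalledProcessError if tar fails."""
    subprocess.run(build_tar_command(work_dir, tarball_path), check=True)
```

Passing the command as a list avoids shell quoting problems with unusual directory names.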
2.1.3
- Fix for no output files. Previous pilot version would fail with unknown internal problem.
- Appending PandaID, PanDA_TaskID and PanDA_AttemptNr to payload command, needed to bring these env variables into container. Requested by EventIndex people
2.1.2
- Improved exception handling in looping job algorithm
- Looping job killer now only verifies files that actually exist (previous version could fail if 'find' command returned a file name that no longer existed - happened at UKI-NORTHGRID-LIV-HEP_SL7_UCORE_PL2)
- Minor cleanup of useless log messages
- Bug fixes:
  - Formatting error in log message, '%d' -> '%s' (looping job killer)
  - Wrongly used internal pilot state variable (harmless)
  - Problematic server updates which led to lost heartbeat in some cases (for some reason many jobs at UKI-NORTHGRID-LIV-HEP_SL7_UCORE_PL2 failed like this, possibly because of slow server updates which led to a race condition)
  - Multi-job fix (pilot terminated prematurely at the end of the second job since a boolean was not reset and the final server update was skipped, leading to lost heartbeat of the second job)
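The looping job killer fix in this release concerns files that were listed by 'find' but no longer exist when they are inspected. One way to tolerate that race is sketched below; the helper name `latest_mtime` is hypothetical and the code is not taken from the pilot.

```python
import os


def latest_mtime(candidates):
    """Hypothetical helper: newest modification time among files that still exist.

    Returns None when none of the candidates can be read.
    """
    latest = None
    for path in candidates:
        try:
            mtime = os.path.getmtime(path)
        except OSError:
            # The file vanished (or became unreadable) after it was listed.
            continue
        if latest is None or mtime > latest:
            latest = mtime
    return latest
```

Catching the error at the point of use is more robust than checking os.path.exists() first, since the file can still disappear between the check and the stat call.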