This repository has been archived by the owner on Jan 30, 2024. It is now read-only.

Releases: PanDAWMS/pilot2

PanDA Pilot 2.1.11

19 Jun 10:25
2299b5b
  • Rucio API instead of CLI
  • Using TRF name to identify PID for memory monitor to track, increased waiting time for payload to start within a container
  • Initial stage-in workflow module (in which the pilot only stages in data and then quits; many details remain to be implemented); internal thread and job queue setup only
  • ‘storm’ added to allowed protocols (Requested by Tomas Javurek)
  • Fix for pilot reading output files from job report in user jobs when it shouldn’t
  • Now supporting allowNoOutput file lists from job definition
  • Fix for exception thrown because of failed memory monitor
  • Introduced BADMEMORYMONITORJSON error code 1337, plus improvements in handling JSON read failures (could lead to exception in previous pilot versions)
  • Fixes for XML problems seen on ND (missing SURL) and symlink issues in combination with containers
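
The JSON read handling for the memory monitor could look roughly like the sketch below. It is not the pilot's code: the function name `read_memory_monitor_json` and the `(data, error_code)` return shape are assumptions made for illustration; only the error code name and value (BADMEMORYMONITORJSON, 1337) come from the notes above.

```python
import json

BADMEMORYMONITORJSON = 1337  # error code named in the release notes


def read_memory_monitor_json(path):
    """Return (data, error_code); error_code is 0 on success.

    Hypothetical helper: read failures are turned into an error code
    instead of letting an exception propagate to the caller.
    """
    try:
        with open(path) as handle:
            return json.load(handle), 0
    except (OSError, ValueError):
        # OSError: file missing or unreadable.
        # ValueError: malformed JSON (json.JSONDecodeError subclasses it).
        return None, BADMEMORYMONITORJSON
```

A caller can then set the job error code from the second element rather than wrapping every read in its own try/except.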

Code contributions from Tomas Javurek, David Cameron, Paul Nilsson

PanDA Pilot 2.1.10

03 Jun 12:37
f0734b4
  • Corrected nevents->nEvents which led to number of events not being read from jobReport
  • Corrected getctime()->getmtime() which led to some issues with looping job killer
  • Added Process and merged_lhef._0.events-new to list of files/directories to be removed before log file is created, to avoid large log files
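
The getctime()->getmtime() correction matters because, on Unix, `os.path.getctime()` returns the time of the last metadata change (for example after a chmod or rename), whereas `os.path.getmtime()` returns the time the content was last written. A minimal sketch of a staleness check based on mtime follows; the function names and the `limit_seconds` parameter are hypothetical and not taken from the pilot.

```python
import os
import time


def seconds_since_modified(path, now=None):
    """Seconds since the file's content was last written (mtime)."""
    now = time.time() if now is None else now
    return now - os.path.getmtime(path)


def looks_stalled(path, limit_seconds, now=None):
    """True if the file has not been modified within limit_seconds."""
    return seconds_since_modified(path, now) > limit_seconds
```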

PanDA Pilot 2.1.9

20 May 08:59
9c5b03b
  • Now sending heartbeats during stage-in. Previously, heartbeats were only sent after stage-in had finished, which could lead to a lost heartbeat if stage-in took a long time
  • Now setting exeerrorcode and exeerrordiag before final server update (detected missing in BOINC jobs)
  • X509_USER_PROXY removed from command executed inside container (moved to command executing container)
  • Bucket id fix from Wen Guan (previously the code did not handle bucket ids; problem seen with some ES merge jobs)
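
One way to keep heartbeats flowing while a long stage-in runs is a background thread that is stopped when the transfer returns. The sketch below only illustrates that idea under stated assumptions: `run_with_heartbeat`, its callbacks and the default interval are hypothetical, and the pilot's actual threading and queue setup is not described in these notes.

```python
import threading


def run_with_heartbeat(stage_in, send_heartbeat, interval=1800.0):
    """Run stage_in() while a background thread calls send_heartbeat()."""
    done = threading.Event()

    def beat():
        # Send a heartbeat, then wait for the interval or for stage-in
        # to finish, whichever comes first.
        while not done.is_set():
            send_heartbeat()
            done.wait(interval)

    worker = threading.Thread(target=beat, daemon=True)
    worker.start()
    try:
        return stage_in()
    finally:
        done.set()     # wake the heartbeat thread and stop it
        worker.join()
```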

2.1.8

14 May 16:16
98d5d7b
  • Fixed urgent problem with a file not being closed, which led to problems with input file lists seen in ES jobs
  • Only storing 1kB in log extracts
  • Now returning NOPROXY error to wrapper if that occurs
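
Capping a log extract at 1 kB can be sketched as follows. The notes do not say whether the pilot keeps the head or the tail of the log, or whether 1 kB means 1000 or 1024 bytes; this sketch assumes the last 1024 bytes, and the name `tail_extract` is hypothetical.

```python
def tail_extract(text, max_bytes=1024):
    """Return at most max_bytes (UTF-8 encoded) from the end of text."""
    data = text.encode("utf-8")
    if len(data) <= max_bytes:
        return text
    # errors="ignore" drops a multi-byte character cut at the slice start.
    return data[-max_bytes:].decode("utf-8", errors="ignore")
```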

2.1.7

07 May 15:45
0625c8c
  • Added missing values in server update; nCores, nEvents and cpu consumption info (CPU model info is added to cpu consumption unit)
  • Now exposing use_pcache field in queuedata (from A. Anisenkov)

Some minor refactoring was also done to please flake8. Thanks to Emmanouil for spotting the missing CPU consumption time.

2.1.6

01 May 09:36
79caead
  • Added new error code 1335, MISSINGUSERCODE
  • Now adding 10% grace margin in get_max_allowed_work_dir_size()
  • Now displaying OS/architecture info after the pilot version banner
  • Now fitting PSS+Swap as requested by Alessandra, Rod
  • Added pool.root file pattern to cleanup_payload(); previously, root files were left in log files in athenaMP jobs
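
The 10% grace margin on the work directory size limit amounts to simple arithmetic, sketched below. The helper names are hypothetical and do not reproduce the signature of get_max_allowed_work_dir_size(); integer math is used to avoid floating-point rounding.

```python
def with_grace_margin(limit_bytes, margin_percent=10):
    """Return limit_bytes enlarged by margin_percent, using integer math."""
    return limit_bytes + limit_bytes * margin_percent // 100


def exceeds_limit(used_bytes, limit_bytes, margin_percent=10):
    """True only if usage is above the limit plus the grace margin."""
    return used_bytes > with_grace_margin(limit_bytes, margin_percent)
```

With a limit of 1000 bytes, usage of 1050 bytes is tolerated while 1101 bytes is not.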

2.1.5

24 Apr 13:00
1e2f604
  • Exporting INDS in get_payload_environment_variables() to make it available inside the container, requested by EventIndex people
  • Resetting getjob_requests counter after processed job (bug fix)
  • Now fitting VMEM instead of PSS (leak estimation). Skipping tails. Requiring more than two data points to fit data
  • Patch for surl in PFC, should be local path in non-direct access mode (reported by R. Walker)
  • Now overwriting output file list instead of appending it when new output files have been discovered in job report
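
The leak estimation fit can be illustrated with a plain least-squares slope. This is a sketch under assumptions, not the pilot's algorithm: "skipping tails" is interpreted here as dropping a fixed number of samples at each end, and the function name `fit_slope` and its `skip` parameter are hypothetical.

```python
def fit_slope(times, values, skip=1):
    """Least-squares slope of values vs. times, or None if too few points.

    skip: number of samples dropped at each end (the assumed "tails").
    """
    if len(times) != len(values):
        raise ValueError("times and values must have the same length")
    if skip > 0:
        times, values = times[skip:-skip], values[skip:-skip]
    n = len(times)
    if n <= 2:  # require more than two data points
        return None
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    if sxx == 0:
        return None  # all samples share one timestamp; slope undefined
    sxy = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values))
    return sxy / sxx
```

A positive slope over the trimmed samples suggests memory growth per unit time; a result of None means there was not enough data to fit.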

2.1.4

10 Apr 08:47
cf151e8
  • Removed usage of tarfile module in favour of CLI command, i.e. the same as Pilot 1. This is currently necessary since tarfile does not support the --one-file-system instruction. This affected at least one production task, where a file to be tarred resided on /tmp, which could not be reached
  • Now sending memory information with job updates
  • Explicitly removing workDir directory in user jobs before log tarball creation
  • Ignoring 'local' storage_token for .lib.-files (previously led to direct io bug reported by Rod)
  • Now sending harvester_id and worker_id to server with getJob call
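
Creating the log tarball through the tar CLI with --one-file-system might look like the sketch below. It assumes GNU tar is on the PATH; the function names and the gzip compression flags are illustrative choices, not the pilot's actual command.

```python
import os
import subprocess


def build_tar_command(archive, directory):
    """Build a GNU tar command that stays on one file system."""
    parent, name = os.path.split(os.path.abspath(directory))
    return ["tar", "--one-file-system", "-czf", archive, "-C", parent, name]


def create_tarball(archive, directory):
    """Run the command; raises CalledProcessError if tar fails."""
    subprocess.run(build_tar_command(archive, directory), check=True)
```

Passing the command as a list avoids shell quoting problems with unusual directory names.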

2.1.3

04 Apr 14:36
1c27e0c
  • Fix for no output files. Previous pilot version would fail with unknown internal problem.
  • Appending PandaID, PanDA_TaskID and PanDA_AttemptNr to payload command, needed to bring these env variables into container. Requested by EventIndex people
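
Passing the three job identifiers into the container through the payload command can be sketched as below. The exact syntax the pilot uses is not given in these notes; this sketch assumes export statements placed in front of the command, and `add_job_identifiers` is a hypothetical name.

```python
import shlex


def add_job_identifiers(command, panda_id, task_id, attempt_nr):
    """Prefix command with exports of the three job identifiers."""
    exports = {
        "PandaID": panda_id,
        "PanDA_TaskID": task_id,
        "PanDA_AttemptNr": attempt_nr,
    }
    prefix = " ".join(
        "export {}={};".format(name, shlex.quote(str(value)))
        for name, value in exports.items()
    )
    return "{} {}".format(prefix, command)
```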

2.1.2

03 Apr 08:10
0e02344
  • Improved exception handling in looping job algorithm
  • Looping job killer now only verifies files that actually exist (previous version could fail if 'find' command returned a file name that no longer existed - happened at UKI-NORTHGRID-LIV-HEP_SL7_UCORE_PL2)
  • Minor cleanup of useless log messages
  • Bug fixes:
    1. Formatting error in a log message ('%d' -> '%s') in the looping job killer
    2. Wrongly used internal pilot state variable (harmless)
    3. Problematic server updates which led to lost heartbeat in some cases (for some reason many jobs at UKI-NORTHGRID-LIV-HEP_SL7_UCORE_PL2 failed like this, possibly because of slow server updates which led to a race condition)
    4. Multi-job fix (pilot terminated at the end of second job prematurely since a boolean was not reset and the final server update was skipped leading to lost heartbeat of second job)
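
The looping job killer fix for vanished files can be illustrated as follows: a file name returned by a listing may no longer exist by the time it is inspected, so the lookup has to tolerate that. This is a minimal sketch with a hypothetical function name, not the pilot's implementation.

```python
import os


def newest_mtime(paths):
    """Newest modification time among paths that still exist, else None."""
    newest = None
    for path in paths:
        try:
            mtime = os.path.getmtime(path)
        except OSError:
            # File vanished between listing and inspection; skip it.
            continue
        if newest is None or mtime > newest:
            newest = mtime
    return newest
```

Catching the error at the point of use avoids the race that a separate existence check followed by a stat call would still leave open.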