This repository has been archived by the owner on Jan 30, 2024. It is now read-only.
Releases: PanDAWMS/pilot2
2.12.8.1
2.12.7.8
- Now explicitly avoid directory names in looping job algorithm
- Requested by R. Walker
- Fixed a problem with esmerge jobs using containerized stage-in, leading to unresolved ddm endpoint
- Problem seen on LRZ-LMU
- Requested by R. Walker
- Fixed a problem where the pilot could not abort after not getting any jobs
- Problem seen at least on CA-VICTORIA-K8S-TEST-T2
- Reported by R. Taylor
- Updated the chi2 formula used with memory leak calculation to ignore units
- Now using same formula as DA, chi2 = SUM((y_obs - y_exp)^2 / y_exp^2)
- Requested by M. Maeno, M. Villaplana
- Fixed bad variable affecting fake DBRelease setup files
- Requested by R. Walker
Contributions from A. Anisenkov, P. Nilsson
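The unit-independent chi2 formula mentioned in the 2.12.7.8 notes can be sketched as below. This is only an illustration of the formula as written there, not the pilot's actual code; the function name and the handling of zero expected values are assumptions.

```python
def chi2(y_obs, y_exp):
    """Return SUM((y_obs - y_exp)^2 / y_exp^2) over paired values."""
    if len(y_obs) != len(y_exp):
        raise ValueError("y_obs and y_exp must have the same length")
    total = 0.0
    for obs, exp in zip(y_obs, y_exp):
        if exp == 0:
            # Assumption: skip points where the relative residual is undefined.
            continue
        total += (obs - exp) ** 2 / exp ** 2
    return total


# Dividing by y_exp^2 makes the result independent of the unit used:
# scaling both series by the same factor leaves chi2 unchanged.
values_a = chi2([110.0, 190.0], [100.0, 200.0])
values_b = chi2([110.0 * 1024, 190.0 * 1024], [100.0 * 1024, 200.0 * 1024])
```

Both calls give the same value up to floating-point rounding, which is the point of normalising by the squared expected value.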
2.12.6.1
- Correction for lingering problem with non-existing files in dir size calculation
- Added additional exception handling
- Reported by R. Walker, P. Vokac
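A minimal sketch of a directory size calculation that tolerates files disappearing during the scan, as described in the 2.12.6.1 and 2.12.5.4 notes. The function name is hypothetical and the pilot's own implementation may differ.

```python
import os


def get_directory_size(directory):
    """Return the summed size in bytes of all files below directory."""
    total = 0
    for root, _dirs, files in os.walk(directory):
        for name in files:
            path = os.path.join(root, name)
            # The file may have been removed since os.walk listed it.
            if not os.path.exists(path):
                continue
            try:
                total += os.path.getsize(path)
            except OSError:
                # The file vanished between the existence check and the size call.
                continue
    return total
```

The existence check alone is not sufficient because a file can still disappear between the check and the size call, hence the additional exception handling.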
2.12.5.4
- Added a file existence check in directory size calculation function
- Problem seen in a few jobs where a file would disappear before its size was checked, resulting in taskbuffer:300 errors
- Note: not all taskbuffer:300 errors are because of this problem
- Reported by R. Walker, A. De Silva
- Queuedata fetching is now using locally defined $ATLAS_SW_BASE to determine where /cvmfs is mounted
- Requested by M. Svatos
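The $ATLAS_SW_BASE lookup in the 2.12.5.4 notes could look roughly like the sketch below. The helper name, the fallback to /cvmfs when the variable is unset, and the relative path in the usage lines are assumptions made for illustration.

```python
import os


def resolve_cvmfs_path(relative_path, environ=None):
    """Build a path below the locally defined cvmfs mount point."""
    environ = os.environ if environ is None else environ
    # Assumption: fall back to the conventional /cvmfs mount point.
    base = environ.get("ATLAS_SW_BASE") or "/cvmfs"
    return os.path.join(base, relative_path.lstrip("/"))


# Hypothetical relative path, used only to show the effect of the variable.
default_path = resolve_cvmfs_path("some/relative/file.json", {})
custom_path = resolve_cvmfs_path("some/relative/file.json", {"ATLAS_SW_BASE": "/alt/cvmfs"})
```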
2.12.4.24
- Replaced all usages of du with python function to calculate size of directory
- Using du causes a problem at CERN EOS with EOS mounts being active for too long
- ALICE is applying a similar change to their pilots
- Requested by L. Fernandez Alvarez (CERN IT)
- Update for voms-proxy-info command output (which is used as backup for arcproxy in case the latter fails)
- RC test jobs on some sites failed to extract the time left from the command output - added alternative extraction of integer value
- Support for pathless middleware container images
- Requested by M. Svatos
- By default use static JSON caches from CRIC instead of dynamic query for PandaQueues and DDMEndpoints, unless custom urls are explicitly specified in the pilot config file
- Bug fixes
- Previous update for fast exit after payload finish was not sufficient, now fixed (TM)
- Fixed case where a failed xcache kill operation led to failed job (RW)
- Core dumps from non-payload processes (e.g. prmon) led to failed job, now fixed (RW, ADS)
Contributions from A. Anisenkov, P. Nilsson.
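The "alternative extraction of integer value" from the voms-proxy-info output (2.12.4.24) might be approached as in this sketch. The function name and the sample outputs are hypothetical; the real command output and the pilot's parsing may differ.

```python
def extract_time_left(output):
    """Return the time left as an integer, or None if no integer is found."""
    text = output.strip()
    try:
        # Simple case: the whole output is a single integer.
        return int(text)
    except ValueError:
        pass
    # Alternative extraction: take the first line that consists only of digits,
    # which skips any warnings printed around the value.
    for line in text.splitlines():
        candidate = line.strip()
        if candidate.isdecimal():
            return int(candidate)
    return None


# Hypothetical outputs, for illustration only.
clean = extract_time_left("43200\n")
noisy = extract_time_left("WARNING: some message\n43200\n")
```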
2.12.3.13
- Changed disk space verification time from once per four minutes to once per minute
- This verification includes the payload stdout size check
- Requested by P. Vokac
- Faster termination of payload execution loop
- Previously pilot could spend up to a minute before exiting the payload execution loop
- Requested by T. Maeno
- Various Pylint recommended changes to improve the overall quality of the code base
- Bug fixes
- Fixed resetting of trf exit code after failure
- Previously led to wrong/misleading Pilot error message (output file not found, i.e. a secondary error)
- If Pilot has not assigned a dedicated Pilot error code to a non-zero TRF exit code, it now sets the more general “Failed to execute payload” error (code 1305)
- Fixed minor cleanup issue at the end of a multi-job
- Only led to some error messages about job object not being a string
- Fixed removal of core dumps prior to log file creation in looping jobs
- Pilot uses gdb to create a core dump in this case, and that core dump should be included in the job log rather than removed
- Non-ATLAS changes (Rubin)
- LFN field in job definition can now be used for PFNs for gs copy tool
- SURL is set from LFN value containing gs:// protocol, LFN reset to proper value
- Requested by S. Padolski
Contributions from B. Simmons, P. Nilsson.
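The fallback to error code 1305 described in the 2.12.3.13 bug fixes can be sketched as follows. Only the code 1305 and its message come from the notes; the function and constant names are hypothetical.

```python
# Error code quoted in the release notes: "Failed to execute payload".
FAILED_TO_EXECUTE_PAYLOAD = 1305


def resolve_pilot_error_code(trf_exit_code, pilot_error_code=0):
    """Return the Pilot error code to report for a finished payload."""
    if trf_exit_code != 0 and pilot_error_code == 0:
        # No dedicated Pilot error code was set for the failed TRF,
        # so report the general payload execution failure.
        return FAILED_TO_EXECUTE_PAYLOAD
    return pilot_error_code
```

A dedicated error code that was already set is kept, so the general code only fills the gap where nothing more specific is known.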
2.12.2.1
- Corrections to pylint changes
- Added missing KeyError to exception handling
- Affected jobs: https://bigpanda.cern.ch/jobs/?pilotversion=2.12.1+%2862%29&jobstatus=failed&piloterrorcode=1310
2.12.1.62
- Advanced debug mode
- Added support for an explicit debug command=“some_command some_option” instead of the static ‘debug’, which currently tells the pilot to tail the latest found non-binary file
- If the pilot receives, for example, the command ‘tail some_log_file’, it will locate that file and tail it
- Other supported commands are ls, ps, du, and gdb (plus options)
- For gdb, the pilot identifies the pid of the true athena child process (unless a pid is specified, e.g. one identified in an earlier ps debug command), produces a core file and copies it to the work dir for storing in the log (an exception is made for this core file, since core files in the work dir would normally be removed before the log is created), and kills the job with a descriptive error message
- Requested by R. Walker
- A core dump of the payload process is now produced and copied to the log if the payload is looping
- Requested by R. Walker
- The current event number (at the time of a job update) is now reported with job metrics
- Requested by R. Walker
- The number of ReSim events is now reported with jobMetrics also when resimevents=0
- Requested by R. Walker
- Xcache service updates
- Updated xcache kill, now using env variable for pid
- Xcache start now uses option ‘-b 4’ (only cleanup orphans older than four days)
- Xcache environmental variables are expanded in metadata xml
- Requested by A. de Silva
- Orphan handling
- Previously, Pilot explicitly killed identified orphan processes at the end of the job, which led to problems at RAL due to their special setup
- Pilot is now using ctypes library functions to make sure all subprocesses are parented, i.e. any ‘orphans’ at the end of the job will be removed when the child subprocesses are killed
- A new error code was added related to this, to make sure ctypes is available everywhere (as tested with RC jobs)
- 1365: "Python module ctypes not available on worker node"
- An RC test revealed that not all sites have the ctypes module installed (at least CSCS-LCG2)
- Note: the pilot continues to use the orphan killer algorithm, but if the ctypes library was successfully used, all processes that would otherwise become orphans will be parented. Sites that do not want the orphan killer to execute should define the environmental variable PILOT_NOKILL
- Summing up input file sizes
- Skipped when mv copy tool is used to avoid a subsequent problem with direct access
- Requested by R. Walker
- Updated gs copy tool
- Note: this copy tool is currently tailored for Rubin, but will be refactored at a later time for general use
- Fixed a problem with wrong localSite information in traces
- Previously, when middleware containers were used, the localSite was extracted from the RUCIO_LOCAL_SITE_ID env var. That variable is unknown inside the container, which led to the ddmendpoint value being used instead. The pilot now defaults to the value stored in the preliminary trace report
- Requested by I. Vukotic
- Code cleaning towards pylint compliance in preparation for a coming corresponding GitHub Action
- We are currently focusing on cleaning up errors detected by pylint
- Python 3 update
- Pilot can now handle cases where the payload produces illegal (non-printable) characters in the stdout
- Problem discovered in a user job
- Pilot now looks for singularity errors in the payload stderr even if the exit code is zero
- Seen with job running at Wuppertal (leading to misleading error message); https://bigpanda.cern.ch/job?pandaid=5101939120
- HPO updates
- Additional exit code handling to allow pre-process to abort after user defined iteration limit
- Requested by T. Maeno
- Raythena plug-in update
- Output files moved to an external directory (defined by pilot option --output-dir) as soon as they are reported by AthenaMP
- To be tested
- Improved error reporting after remote file verification failure
- Added diagnostics to error message
- Requested by P. Vokac
- Added prmon to list of unwanted files in looping job killer
- Parallel remote file open
- The remote file open script may now attempt to open turls in parallel
- Activated by specifying nopenfiles=N in PQ.catchall (for testing)
- The default number of file open threads is 1, i.e. same as before
- Requested by I. Vukotic
- Some cleanup of the main pilot log, less clutter (incl. no dumping of stage-in/out logs if container is not used)
Code contributions from O. Freyermuth, A. Anisenkov, S. Ye, B. Simmons, P. Vokac, P. Nilsson
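The advanced debug mode in 2.12.1.62 accepts a command with options instead of the static ‘debug’. A minimal sketch of how such a command string could be validated against the supported commands listed above (tail, ls, ps, du, gdb) is shown here; the function and constant names are hypothetical and the pilot's real parsing may differ.

```python
import shlex

# Commands named in the release notes as supported in advanced debug mode.
SUPPORTED_DEBUG_COMMANDS = ("tail", "ls", "ps", "du", "gdb")


def parse_debug_command(command):
    """Split a debug command into (name, options), or return None if unsupported."""
    parts = shlex.split(command)
    if not parts:
        return None
    name, options = parts[0], parts[1:]
    if name == "debug" and not options:
        # Static debug mode: tail the latest found non-binary file.
        return ("debug", [])
    if name in SUPPORTED_DEBUG_COMMANDS:
        return (name, options)
    return None
```

Restricting the first token to a fixed list keeps arbitrary commands from being executed on the worker node.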
2.11.2.22
- Free space updates
- Limit and frequency
- Pilot imposes a 2 GB limit on the available free space before downloading a payload. To alleviate cases with low available WN space, this limit is now lowered to 1 GB while the payload is running
- The interval between checks of the remaining available space was shortened from 5 minutes to 4 minutes
- Problem seen at OU with merge jobs on 16 GB nodes
- Related discussion: https://its.cern.ch/jira/browse/ATLASJT-382
- Requested by H. Severini
- Update for truePilots in PUSH mode
- The available space check is delayed until after the job definition has been processed, which allows the pilot to communicate any out of space error directly to the server instead of going via a batch exit code
- Requested by T. Maeno
- Raythena related updates
- Internal file lists are now properly updated when the external PILOT_LOGFILE environmental variable is set
- Requested by V. Tsulaia
- Updated orphan killer
- Now sending SIGTERM before SIGKILL with 10s sleep (with further improvements to come)
- Requested by O. Freyermuth (RAL)
- Now sending cpu model info in cpu_consumption_unit also for running jobs (previously only when payload had finished)
- Requested by R. Walker
- ReSim
- In case the job report contains resimevents info from the ReSim_tf, it will be added to the jobMetrics
- Requested by R. Walker
- HPO update
- The error code from a failed HPO main payload was previously overwritten by the ‘no more data points’ error in the subsequent payload iteration. Now fixed
- Requested by T. Maeno / R. Zhang
- Improvements related to AthenaMP with pile-up
- Job parameter filtering (such as quotation mark handling) has been made user specific (internal improvement)
- The --athenaopts field is treated specially in the job parameter filtering to allow for environmental variables to be used (ATHENA_CORE_NUMBER)
- Discussed in JIRA ticket: https://its.cern.ch/jira/browse/ATLPHYSVAL-756
- Requested by R. Walker
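The two free space limits from the 2.11.2.22 notes (2 GB before the payload is downloaded, 1 GB while it is running) can be illustrated with the sketch below. The function names and the choice of 1 GB = 1024^3 bytes are assumptions.

```python
import shutil

GB = 1024 ** 3  # assumption: binary gigabytes


def free_space_limit(payload_running):
    """Return the minimum required free space in bytes."""
    return 1 * GB if payload_running else 2 * GB


def has_enough_space(path, payload_running, free_bytes=None):
    """Check the free space at path against the applicable limit."""
    if free_bytes is None:
        # Ask the operating system for the free space on the file system.
        free_bytes = shutil.disk_usage(path).free
    return free_bytes >= free_space_limit(payload_running)
```

With 1.5 GB free, for example, the check fails before the download but passes once the payload is running.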
2.11.1.65
- Rucio traces
- The transfer protocol returned from the Rucio API is now reported with traces
- Requested by A. Forti, T. Beermann
- Trace report issues in jobs with remote input
- Fixed issue with messed up appid when middleware container was used (resulted in “--eventservicemerge=False” being added to appid)
- Fixed localSite and remoteSite values that became mixed up due to changes inside the middleware container (updated ddmendpoint in combination with a Rucio env variable that is not known inside the container). This led to the wrong values ending up in the base trace report, which is only updated and sent after the middleware container and remote file verification have finished
- Reported by I. Vukotic
- Raythena related updates
- New pilot options added to facilitate Raythena testing (previously pilot config file had to be manually merged after pilot release with config changes)
- -u (no value - turn off payload proxy verification; default is True when -u is not specified)
- -v (number of getjob requests; default is 2)
- --es-executor-type (event service executor; generic or raythena)
- Removed executor_type, maximum_getjob_requests, payload_proxy_from_server, use_middleware_container from Pilot config (not needed, set in CRIC)
- If the env variable PILOT_LOGFILE is set, Pilot can now use it to determine the name of the log, instead of taking the name from the job definition. Useful when the pilot is interrupted
- The log creation function may now be used (easily) by an external user (read: Harvester)
- Fix for killing payload process after receiving tobekilled server command
- Reported by M. Borodin
- Will not kill runpilot2-wrapper.sh process at the end if labelled as an orphan
- Problem seen at RAL
- Reported by J. Walder
- Discussed in ticket https://ggus.eu/index.php?mode=ticket_info&ticket_id=151098
- Support for new task parameter to control looping jobs
- The field ‘noLoopingCheck’ can now be used on the task level to instruct the Pilot to skip the looping payload check
- Looping job documentation: https://github.com/PanDAWMS/pilot2/wiki/Special-Algorithms-and-Functionalities
- Initial versions of s3 and gs copy tools
- Buckets are currently hardcoded, i.e. further development is needed
- Pilot Xcache service
- The Pilot may now launch a local xcache service on the WN, in an effort to solve problems with direct access on some sites (e.g. CYFRONET)
- Currently activated via catchall, with a dedicated CRIC option to come
- Documentation: https://github.com/PanDAWMS/pilot2/wiki/Xcache
- Direct access updates
- Added davs to schema list to allow such replicas in direct access
- Corrected usage of allowed schemas for direct access over WAN (full schema list was used, should only be ‘root’ and ‘davs’)
- Requested by R. Walker, A. Anisenkov
- The Pilot is now sending a list of supported CPU instruction sets to the server
- Currently only checking for AVX2, but any other set (or all sets) can be added
- The info is sent with updateJob for the time being (eventually it will be sent with getJob to be used for brokering)
- HPO payloads
- The Pilot now resets the output file list after the pre-process has finished with no more HPO points available, to solve a problem with missing output (which will not exist when the pre-process finishes with exit code 160 in the first iteration)
- Debug mode update
- In the debug mode, the Pilot sends a tail of the latest updated payload log every five minutes to the server
- From this version, the tail will be from the last updated non-binary file, instead of any .log or log. file
- Requested by R. Walker
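The check for supported CPU instruction sets in 2.11.1.65 (currently only AVX2) could be done along the lines of this sketch, which scans text in the format of /proc/cpuinfo on Linux. The function name is hypothetical and the pilot may obtain the information differently.

```python
def find_instruction_sets(cpuinfo_text, wanted=("AVX2",)):
    """Return the wanted instruction sets that appear in the CPU flags."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            # The flags line is a space-separated list of lower-case names.
            flags.update(value.split())
    return [name for name in wanted if name.lower() in flags]


# Shortened, made-up cpuinfo excerpt for illustration.
sample = "processor\t: 0\nflags\t\t: fpu sse2 avx avx2\n"
found = find_instruction_sets(sample)
```

Other sets can be checked by extending the `wanted` tuple, in line with the note that any other set (or all sets) can be added.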