This repository has been archived by the owner on Jan 30, 2024. It is now read-only.
Releases: PanDAWMS/pilot2
2.8.9.4
- Fixed broken prmon
- The prmon mechanism was accidentally broken by the introduction of coprocess support last week
- Fixed missing input_dir and output_dir in middleware container
- Could have affected some jobs where middleware containerisation was used (if input_dir/output_dir were set as a pilot option)
2.8.8.1
- The pilot now skips the input file size check completely when PQ.maxinputsize=-1, which avoids the unwanted file size checks on SToRM sites
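The behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the pilot's actual code: the helper name `input_size_ok` is hypothetical, and it assumes the file sizes and the limit are expressed in the same unit.

```python
from typing import Iterable

# Value of PQ.maxinputsize that disables the check (taken from the release note).
SKIP_SIZE_CHECK = -1


def input_size_ok(file_sizes: Iterable[int], maxinputsize: int) -> bool:
    """Return True if the total input size is acceptable (hypothetical helper)."""
    if maxinputsize == SKIP_SIZE_CHECK:
        # Skip the check completely: the file sizes are not even summed.
        return True
    return sum(file_sizes) <= maxinputsize


print(input_size_ok([600, 700], maxinputsize=1000))  # False
print(input_size_ok([600, 700], maxinputsize=-1))    # True
```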
2.8.7.14
- Fix for unwanted total input file size calculation in direct access production jobs
- Problem seen on SToRM sites
- Requested by Rodney Walker
- Fixed problem with replacing LFNs with TURLs in production jobs using direct access
- Requested by Rodney Walker
- Support for executing a coprocess in HPO jobs
- Fixed dumping of large failed find commands in looping job algorithm (rare)
- Requested by Fa-Hui Lin
- Added new error code 1362, “Xrdcp was unable to open file”
- Error seen in https://bigpanda.cern.ch/job?pandaid=4892946326 (which led to some exception)
- Added handling for None return from get events
- Corrections to proxy verifications
- Problems with returning error code found by A. Bogdanchikov
Code contributions from D. Benjamin, P. Nilsson
2.8.6.7
- Fix for large memory usage seen (at least) at MWT2
- It was discovered that requesting a logger object with Python's Logger.getChild() can sometimes lead to a huge memory leak
- Pilot does not need to use this so this functionality was removed
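The release note does not say why the leak occurred, so the sketch below only illustrates the general replacement pattern of a single module-level logger in place of child loggers created on demand. The logger name `pilot.example` and the function `process_job` are assumptions made for illustration.

```python
import logging

# One logger per module, created once at import time.
logger = logging.getLogger("pilot.example")  # hypothetical logger name


def process_job(job_id: str) -> str:
    """Log through the shared module logger; the job id goes into the message."""
    message = f"job {job_id}: processing"
    logger.info(message)
    return message


# getChild() is shorthand for a dotted logger name: both calls return the same
# object, which the logging module keeps registered for the life of the process.
parent = logging.getLogger("pilot")
same_logger = parent.getChild("example") is logger
print(same_logger)  # True
```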
2.8.5.15
- Internal improvements
- Added self monitoring of memory usage
- It was seen at MWT2 that the pilot process in some cases used up to 5 GB memory, which is currently not understood
- The output from a ps call is dumped to the log at regular intervals and around essential steps. Also, the size of the main job object is measured when it’s moved to internal queues
- No problems have been seen in testing, including staging in multiple and large input files - but there could still be an issue with having many input files in long running jobs (it’s just not seen in testing)
- Corrected the ‘url’ field in traces for lib files in user jobs
- Previously the wrong turl was sent with the trace
- 12 complex functions were refactored
- 8 functions and classes are currently deemed ‘too complex’ by flake8 (skipped)
- NB: there are currently 1140 functions + 133 classes in Pilot 2
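A minimal sketch of this kind of self monitoring is shown below. It assumes a Unix-like system where `ps -o rss= -p <pid>` prints the resident set size in kB. The function names are hypothetical, and `sys.getsizeof` is only one possible way to measure an object: it reports the shallow size and does not follow nested objects. The release note does not say how the pilot measures the job object.

```python
import os
import subprocess
import sys
from typing import Optional


def parse_rss_kb(ps_output: str) -> int:
    """Parse the output of `ps -o rss= -p <pid>` (a single number, in kB)."""
    return int(ps_output.strip())


def current_rss_kb(pid: Optional[int] = None) -> int:
    """Ask ps for the resident memory of a process (Unix-like systems only)."""
    pid = os.getpid() if pid is None else pid
    result = subprocess.run(
        ["ps", "-o", "rss=", "-p", str(pid)],
        capture_output=True, text=True, check=True,
    )
    return parse_rss_kb(result.stdout)


def shallow_size_bytes(obj) -> int:
    """Shallow size of an object in bytes; nested objects are not included."""
    return sys.getsizeof(obj)
```

Calling `current_rss_kb()` at regular intervals and logging the result would give a simple time series of the process memory.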
2.8.4.35: Merge pull request #297 from PanDAWMS/next
- Native HPO mode
- Pilot is now able to loop over multiple pre-, main payload and post-processes (it aborts when the preprocess returns a special exit code)
- Improved calculation for actualcorecount
- Previously, the grep command that was used could potentially return a non-unique pgrp id
- Requested by D. Cameron
- Discussing with G. Stewart to see if prmon can measure core count usage (needs to be seconded if this is wanted). Pilot currently uses ps for this
- Pilot is now sending the mean [actual] core count with the last server update (as ‘meanCoreCount’)
- Middleware container may now be specified in pilot config file (‘middleware_name’)
- The default value is the rucio container image taken from the unpacked cvmfs area
- In case the image does not exist, pilot will fall back to CentOS7
- Added missing container options from queuedata to middleware container setup
- This will otherwise cause a problem on SToRM sites using a middleware container
- File open verification for direct I/O files
- Added new error code 1361, “Remote file could not be opened” in combination with clientState=‘FAILED_REMOTE_OPEN’ in the rucio trace (to be extended in next pilot version)
- Requested by R. Walker
- Traces in direct access mode are now sent immediately before launching the payload to avoid complications with containers and remote input file open verification
- Only using root schema with list_replicas() call for VP jobs
- Also setting rse_expression = 'istape=False\type=SPECIAL' in the list_replicas() query
- Python 3 updates for Raythena plugin
Code contributions from D. Benjamin, P. Nilsson
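The native HPO loop over pre-, main payload and post-processes can be sketched as below. This is an illustration under stated assumptions, not the pilot's implementation: the three steps are modelled as plain callables that return an exit code, the abort code is a parameter because the release note does not give its value, and `max_iterations` is a safety cap added for the example.

```python
from typing import Callable, List


def run_hpo_loop(
    preprocess: Callable[[int], int],
    payload: Callable[[int], int],
    postprocess: Callable[[int], int],
    abort_code: int,
    max_iterations: int = 100,
) -> List[int]:
    """Run pre-, main and post-process repeatedly; return the payload exit codes.

    The loop stops as soon as the preprocess returns `abort_code`.
    """
    payload_exit_codes = []
    for iteration in range(max_iterations):
        if preprocess(iteration) == abort_code:
            break  # the preprocess signals that no further iterations are wanted
        payload_exit_codes.append(payload(iteration))
        postprocess(iteration)
    return payload_exit_codes


# Usage: the preprocess asks to stop on the fourth iteration (index 3).
ARBITRARY_ABORT_CODE = 99  # placeholder, not the pilot's real value
codes = run_hpo_loop(
    preprocess=lambda i: ARBITRARY_ABORT_CODE if i == 3 else 0,
    payload=lambda i: 0,
    postprocess=lambda i: 0,
    abort_code=ARBITRARY_ABORT_CODE,
)
print(len(codes))  # 3
```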
2.8.3.3
- Reverted the core count change from pilot version 2.8.1
- The earlier pilot version generated an unwanted change in monitoring and created a seemingly large increase in the number of running jobs since (the mean of the) actualcorecount was reported instead of the queue.corecount value
- Requested by DPA
2.8.2.3
- Removed overwriting of corecount with actualcorecount. It caused a scaling factor to be calculated wrongly, which led to jobs being killed for using too much memory when in fact they didn’t
- Note: the pilot is still setting corecount=actualcorecount in the server updates
- Correction for an issue with missing new settings in locally used pilot config files which led to a problem on HPCs reading CRIC data
- Note: this could be bypassed by updating the local config files with the recently introduced settings
- Correction for insufficient error information in the cases of too large stdout and workdir sizes
Code contributions from A. Anisenkov, P. Nilsson.
2.8.1.26
- Raythena related updates
- Skipped input file transfers
- HTTP time-outs now configurable (via default.cfg file)
- Python 3 update
- Corrected queuedata JSON download (byte stream -> string)
- Overwriting original core count value with actual core count if it is known (only report actual core count in job metrics)
- Final server update now has the average actual core count that was used during running
- Requested by M. Grigoryeva and T. Maeno
- Corrections for multiple input and output files in middleware container mode
- Reading GUIDs from metadata XML in case jobReport does not exist
- Useful for ancient releases
- Discussed in [long thread and in] JIRA ticket: https://its.cern.ch/jira/browse/ATLMCPROD-8736
- Asetup related changes for running on Summit (simplifications)
- In combination with resetting appdir in CRIC (after which pilot cannot use asetup)
- Always try to use the local image first (even if imagename has docker path)
- CRIC update
- Pilot now pulls info from CRIC instead of AGIS
- Full descriptions in PR: #285
Code contributions from D. Benjamin, A. Anisenkov, P. Nilsson
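The queuedata JSON correction (byte stream -> string) reflects a common Python 3 situation: data read from a URL arrives as bytes rather than str. A minimal sketch of the conversion is shown below. The function name `parse_queuedata`, the UTF-8 assumption, and the sample payload are all made up for illustration.

```python
import json


def parse_queuedata(raw) -> dict:
    """Decode a downloaded JSON document that may arrive as bytes or str."""
    if isinstance(raw, bytes):
        raw = raw.decode("utf-8")  # assumption: the server sends UTF-8
    return json.loads(raw)


sample = b'{"EXAMPLE_QUEUE": {"corecount": 8}}'  # made-up payload
queuedata = parse_queuedata(sample)
print(queuedata["EXAMPLE_QUEUE"]["corecount"])  # 8
```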
2.7.1.4
- Improved stage-in containerisation
- Previously stage-in script could not be found on Nordugrid queues
- Implemented stage-out containerisation
- Added new error code 1359, “PanDA queue is not active”
- Previously this would lead to an aborted pilot but not reported properly
- When this happens, pilot returns exit code 78 to the wrapper (which will receive it as 19968; 19968 >> 8 = 78, and equivalently 19968 mod 255 = 78)
- Added conversion function for Harvester to translate a batch system exit code back into a pilot error code
- Added new error code 1360, “Image not found”
- Pilot now verifies that an image with a given path exists
- Raythena updates
- Increased event processing waiting time
- It was noticed (by Miha Muskinja) that the pilot timed out processes that were still running when it asked for ‘too few’ events
- Skipped creating input file metadata
- Transfer time-out updates
- Added protection for unset trace report stateReason
- When the pilot times out a transfer the trace report will be empty and this could lead to an exception
- Corrected overwritten transfer failure message that can happen after a time-out (unless the time-out is very short)
- Details are discussed in JIRA ticket https://its.cern.ch/jira/browse/ATLASPANDA-526
- Corrected server URL update
- When a server communication fails, pilot should retry the communication but the URL was not updated correctly
- Output from ps command while searching for prmon process can now be dumped to log using catchall PRMON_DEBUG
- Now removing /cores dir from work dir before tarring up the pilot (in COVID jobs this will contain the main executable)
- Now setting up new version of prmon using lsetup rather than hardcoded release as before
- Note: prmon now generates JSON with memory values as floats rather than ints - pilot reports these to the server, but the server currently converts the floats back to ints since the DB can otherwise not handle them
Code contributions from Miha Muskinja and Paul Nilsson.
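The exit code arithmetic for error 1359 (exit code 78, seen by the wrapper as 19968) can be checked with a short sketch. The lookup table and function names below are hypothetical and only illustrate the idea of translating a batch system status back into a pilot error code. Returning 0 for unknown codes is an assumption of this example.

```python
PANDA_QUEUE_NOT_ACTIVE = 1359  # pilot error code named in the release note

# Hypothetical lookup table: pilot exit code -> pilot error code.
EXIT_CODE_TO_ERROR = {78: PANDA_QUEUE_NOT_ACTIVE}


def exit_code_from_wait_status(status: int) -> int:
    """Extract the exit code from a 16-bit wait status (exit code in the high byte)."""
    return (status >> 8) & 0xFF


def pilot_error_from_batch_status(status: int) -> int:
    """Translate a batch system status into a pilot error code (0 if unknown)."""
    return EXIT_CODE_TO_ERROR.get(exit_code_from_wait_status(status), 0)


print(exit_code_from_wait_status(19968))     # 78
print(pilot_error_from_batch_status(19968))  # 1359
```

The shift works because 19968 = 78 * 256, so the exit code sits in the high byte of the status.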