Releases · PanDAWMS/pilot2

23 Nov 17:35

PalNilsson

2.8.9.4

f76c3e9

2.8.9.4

Fixed broken prmon
- The prmon mechanism was accidentally broken by the introduction of coprocess support last week
Fixed missing input_dir and output_dir in middleware container
- Could have affected some jobs where middleware containerisation was used (if input_dir/output_dir were set as a pilot option)

19 Nov 14:37

PalNilsson

2.8.8.1

5edbfd8

2.8.8.1

The pilot now skips the input file size check completely when PQ.maxinputsize=-1, which avoids the unwanted file size checks on SToRM sites
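As a rough illustration of the behaviour described above, the check could be guarded as in the sketch below. The function names and the assumption that sizes are compared as plain numbers in the same unit are hypothetical; this is not the pilot's actual code.

```python
def should_check_input_size(maxinputsize):
    """Return False when the queue setting is -1 (check disabled)."""
    return maxinputsize != -1


def verify_input_size(file_sizes, maxinputsize):
    """Return True if the total input size is acceptable or the check is skipped.

    file_sizes and maxinputsize are assumed to use the same unit.
    """
    if not should_check_input_size(maxinputsize):
        # PQ.maxinputsize=-1: skip the size calculation entirely
        return True
    return sum(file_sizes) <= maxinputsize
```

With this shape, a queue that sets the value to -1 never triggers the total size calculation at all.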

18 Nov 19:46

PalNilsson

2.8.7.14

366a1ed

2.8.7.14

Fix for unwanted total input file size calculation in direct access production jobs
- Problem seen on SToRM sites
- Requested by Rodney Walker
Fixed problem with replacing LFNs with TURLs in production jobs using direct access
- Requested by Rodney Walker
Support for executing a coprocess in HPO jobs
Fixed dumping of large failed find commands in looping job algorithm (rare)
- Requested by Fa-Hui Lin
Added new error code 1362, “Xrdcp was unable to open file”
- Error seen in https://bigpanda.cern.ch/job?pandaid=4892946326 (which led to an exception)
Added handling for None return from get events
Corrections to proxy verifications
- Problems with returning error code found by A. Bogdanchikov

Code contributions from D. Benjamin, P. Nilsson

12 Nov 20:38

PalNilsson

2.8.6.7

05bb714

2.8.6.7

Fix for large memory usage seen (at least) at MWT2
- It was discovered that requesting a logger object with getChild() from the Python logging module can sometimes lead to a huge memory leak
- The pilot does not need this functionality, so it was removed

10 Nov 13:39

PalNilsson

2.8.5.15

ed3d0be

2.8.5.15

Internal improvements
- Added self monitoring of memory usage
  - It was seen at MWT2 that the pilot process in some cases used up to 5 GB memory, which is currently not understood
  - The output from a ps call is dumped to the log at regular intervals and around essential steps. Also, the size of the main job object is measured when it’s moved to internal queues
  - No problems have been seen in testing, including staging in multiple and large input files, but there could still be an issue with many input files in long-running jobs that testing has not reproduced
- Corrected the ‘url’ field in traces for lib files in user jobs
  - Previously the wrong turl was sent with the trace
- 12 complex functions were refactored
  - 8 functions and classes are currently deemed ‘too complex’ by flake8 (skipped)
  - NB: there are currently 1140 functions + 133 classes in Pilot 2
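The memory self-monitoring described above (a ps call plus an object size measurement) might look roughly like the following. This is a minimal sketch under stated assumptions: the helper names are hypothetical, the ps options are the common `-o rss= -p PID` form, and `sys.getsizeof` only gives a shallow size, which may differ from how the pilot measures the job object.

```python
import os
import subprocess
import sys


def get_rss_kb(pid=None):
    """Return the resident set size in kB as reported by ps, or None on failure."""
    pid = os.getpid() if pid is None else pid
    try:
        # 'rss=' prints the value without a header line
        out = subprocess.check_output(["ps", "-o", "rss=", "-p", str(pid)])
        return int(out.decode().strip())
    except (OSError, subprocess.CalledProcessError, ValueError):
        return None


def shallow_size(obj):
    """Return the shallow size of an object in bytes (referenced objects excluded)."""
    return sys.getsizeof(obj)
```

Calling `get_rss_kb()` at regular intervals and logging the result would be enough to spot steady growth of the pilot process.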

29 Oct 13:20

PalNilsson

2.8.4.35

0026537

2.8.4.35: Merge pull request #297 from PanDAWMS/next

Native HPO mode
- Pilot is now able to loop over multiple preprocess, main payload and postprocess steps (the loop is aborted when the preprocess returns a special exit code)
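The loop described above can be sketched as follows. The abort exit code value, the iteration cap and the function names are all hypothetical, chosen only for illustration; the release notes do not state them.

```python
ABORT_EXIT_CODE = 160  # hypothetical value for the "special exit code"


def run_hpo_loop(preprocess, payload, postprocess, max_iterations=10):
    """Run preprocess -> payload -> postprocess repeatedly.

    Stops when the preprocess returns ABORT_EXIT_CODE or when the
    (assumed) iteration cap is reached. Returns the number of completed
    iterations.
    """
    completed = 0
    for _ in range(max_iterations):
        if preprocess() == ABORT_EXIT_CODE:
            break  # preprocess signalled that no more iterations are wanted
        payload()
        postprocess()
        completed += 1
    return completed
```

Each of the three arguments is a callable, so the same loop can drive real subprocess launches or simple test stubs.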
Improved calculation for actualcorecount
- Previously, the grep command that was used could return a non-unique pgrp id
- Requested by D. Cameron
- Discussing with G. Stewart to see if prmon can measure core count usage (needs to be seconded if this is wanted). Pilot currently uses ps for this
Pilot is now sending the mean [actual] core count with the last server update (as ‘meanCoreCount’)
Middleware container may now be specified in pilot config file (‘middleware_name’)
- The default value is the rucio container image taken from the unpacked cvmfs area
- In case the image does not exist, pilot will fall back to CentOS7
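The fallback described above amounts to a path existence check. The sketch below assumes the image is referenced by a filesystem path; the function name and the argument values are hypothetical and do not reflect the real image locations.

```python
import os


def select_middleware_image(configured_image, fallback_image):
    """Return the configured image if its path exists, else the fallback."""
    if configured_image and os.path.exists(configured_image):
        return configured_image
    # Image missing (e.g. cvmfs area not available): use the fallback
    return fallback_image
```

For example, `select_middleware_image("/no/such/image", "centos7")` returns the fallback value because the first path does not exist.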
Added missing container options from queuedata to middleware container setup
- This will otherwise cause a problem on SToRM sites using a middleware container
File open verification for direct I/O files
- Added new error code 1361, “Remote file could not be opened” in combination with clientState=’FAILED_REMOTE_OPEN’ in the rucio trace (to be extended in next pilot version)
- Requested by R. Walker
Traces in direct access mode are now sent immediately before launching the payload to avoid complications with containers and remote input file open verification
Only using root schema with list_replicas() call for VP jobs
- Also setting rse_expression = 'istape=False\type=SPECIAL' in the list_replicas() query
Python 3 updates for Raythena plugin

Code contributions from D. Benjamin, P. Nilsson

23 Sep 19:06

PalNilsson

2.8.3.3

5999036

2.8.3.3

Reverted the core count change from pilot version 2.8.1
- The earlier pilot version caused an unwanted change in monitoring: the number of running jobs appeared to increase sharply because (the mean of the) actualcorecount was reported instead of the queue.corecount value
- Requested by DPA

23 Sep 10:54

PalNilsson

2.8.2.3

76f4991

2.8.2.3

Removed the overwriting of corecount with actualcorecount, since it caused a scaling factor to be calculated incorrectly, which in turn led to jobs being killed for using too much memory when in fact they had not
- Note: the pilot is still setting corecount=actualcorecount in the server updates
Correction for an issue with new settings missing from locally used pilot config files, which led to a problem reading CRIC data on HPCs
- Note: this could be bypassed by updating the local config files with the recently introduced settings
Correction for insufficient error information in the cases of too large stdout and workdir sizes

Code contributions from A. Anisenkov, P. Nilsson.

22 Sep 13:46

PalNilsson

2.8.1.26

9a3a8db

2.8.1.26

Raythena related updates
- Skipped input file transfers
- HTTP time-outs now configurable (via default.cfg file)
Python 3 update
- Corrected queuedata JSON download (byte stream -> string)
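The byte stream to string correction is a common Python 3 porting step: downloaded data arrives as `bytes` and is decoded before being parsed. A minimal sketch, assuming UTF-8 content and a hypothetical helper name:

```python
import json


def parse_queuedata(raw):
    """Parse downloaded queuedata JSON, accepting either bytes or str."""
    if isinstance(raw, bytes):
        # Python 3 downloads return bytes; decode before parsing
        raw = raw.decode("utf-8")
    return json.loads(raw)
```

Decoding explicitly keeps the behaviour the same on Python 3 versions where `json.loads` does not accept bytes.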
Overwriting original core count value with actual core count if it is known (only report actual core count in job metrics)
- Final server update now has the average actual core count that was used during running
- Requested by M. Grigoryeva and T. Maeno
Corrections for multiple input and output files in middleware container mode
Reading GUIDs from metadata XML in case jobReport does not exist
- Useful for ancient releases
- Discussed in [long thread and in] JIRA ticket: https://its.cern.ch/jira/browse/ATLMCPROD-8736
Asetup related changes for running on Summit (simplifications)
- In combination with resetting appdir in CRIC (after which pilot cannot use asetup)
- Always try to use the local image first (even if imagename has a docker path)
CRIC update
- Pilot now pulls info from CRIC instead of AGIS
- Full descriptions in PR: #285

Code contributions from D. Benjamin, A. Anisenkov, P. Nilsson

13 Aug 16:56

PalNilsson

2.7.1.4

7f92214

2.7.1.4

Improved stage-in containerisation
- Previously stage-in script could not be found on Nordugrid queues
Implemented stage-out containerisation
Added new error code 1359, “PanDA queue is not active”
- Previously this would lead to an aborted pilot but not reported properly
- When this happens, pilot returns exit code 78 to the wrapper (which will receive it as 19968; 19968 >> 8 = 78, and equally 19968 mod 255 = 78)
- Added conversion function for Harvester to translate a batch system exit code back into a pilot error code
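The arithmetic above can be sketched as a small conversion helper. This is not the Harvester function itself: the names are hypothetical, the rule "values above 255 are shifted wait statuses" is an assumption, and the lookup table only contains the one pair stated in these notes (exit code 78, error code 1359).

```python
# Only the pair mentioned in the release notes; a real table would be larger
PILOT_EXIT_TO_ERROR = {78: 1359}


def batch_status_to_exit_code(status):
    """Recover the process exit code from a batch system status value.

    Assumption: values above 255 are wait statuses with the exit code in
    the high byte, so they are shifted right by 8 bits.
    """
    return status >> 8 if status > 255 else status


def exit_code_to_pilot_error(exit_code):
    """Translate a pilot exit code into a pilot error code, or None if unknown."""
    return PILOT_EXIT_TO_ERROR.get(exit_code)
```

For the example in the notes, `batch_status_to_exit_code(19968)` gives 78, which then maps to error code 1359.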
Added new error code 1360, “Image not found”
- Pilot now verifies that an image with a given path exists
Raythena updates
- Increased event processing waiting time
- It was noticed (by Miha Muskinja) that the pilot timed out processes that were still running when it asked for ‘too few’ events
- Skipped creating input file metadata
Transfer time-out updates
- Added protection for unset trace report stateReason
- When the pilot times out a transfer, the trace report will be empty, which could lead to an exception
- Corrected overwritten transfer failure message that can happen after a time-out (unless the time-out is very short)
- Details are discussed in JIRA ticket https://its.cern.ch/jira/browse/ATLASPANDA-526
Corrected server URL update
- When a server communication fails, pilot should retry the communication but the URL was not updated correctly
Output from ps command while searching for prmon process can now be dumped to log using catchall PRMON_DEBUG
Now removing /cores dir from work dir before tarring up the pilot (in COVID jobs this will contain the main executable)
Now setting up new version of prmon using lsetup rather than hardcoded release as before
- Note: prmon now generates JSON with memory values as floats rather than ints - pilot reports these to the server, but the server currently converts the floats back to ints since the DB can otherwise not handle them

Code contributions from Miha Muskinja and Paul Nilsson.
