Releases: PanDAWMS/pilot2
2.3.3
- Internal changes to enable Python 3 support (in progress)
- 100% of the pilot code is Python 3 compliant [BUILD 7] as measured by the 2to3 command
- Testing is expected to reveal other issues since 2to3 does not find every incompatibility
- Three additional changes were resolved during interactive testing. A blocker was discovered on Nov 22 - there is a problem with the cvmfs Python 3 version (enum module is broken) that causes rucio API to fail at import (Asoka is working on a fix)
- Reduction of “Payload metadata does not exist”-errors
- Added recognition of (trf) “command not found” due to InstallArea error, using error code 1346 (“Transform not found”). Previously this led to secondary error 1187 (“Payload metadata does not exist”)
- Note: less important fix since trf error “127: Transformation not installed in CE” is also reported in this case
- Identification of “cannot create directory” due to permission errors, using error code 1199 ("Failed to create local directory") instead of 1187 error
- Note: these errors are also labelled with trf error 64: General failure in transform substep executor
- Added chi2 from memory leak calculation to job metrics
- The chi2 calculation can be improved further after study with different job types; the current calculation excludes the first five and last two measurements
- Updates for unified queues, pending tests on unified test queue
- Pilot uploads output to destination decided by server (unified queues only)
- Problematic time-out mechanism fixed
- Pilot is now once again protecting rucio stage-in/out
- Support for new HPC workflow; standard/normal jobs on restricted HPCs
- Tested on Cori
- Fix for prmon restarts during stage-out
- Reported by R. Walker
Contributions from A. Anisenkov, D. Benjamin, P. Nilsson
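The trimming described for the chi2 entry above (skip the first five and last two measurements) can be illustrated with a small sketch. This is not the pilot's implementation: the function name, the straight-line least-squares fit, and the definition of chi2 as an unweighted sum of squared residuals are all assumptions made for illustration.

```python
def trimmed_linear_fit_chi2(times, values, skip_first=5, skip_last=2):
    """Fit a straight line to the trimmed measurements.

    Returns (slope, intercept, chi2). chi2 is taken here as the plain sum
    of squared residuals, which is an assumption for this sketch.
    """
    end = len(times) - skip_last
    x = list(times[skip_first:end])
    y = list(values[skip_first:end])
    n = len(x)
    if n < 2:
        raise ValueError("not enough measurements left after trimming")

    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("all time stamps are identical")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    chi2 = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    return slope, intercept, chi2
```

The slope would play the role of the memory leak estimate, and a large chi2 would indicate that a straight line describes the trimmed measurements poorly.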
2.3.2 (1)
- Corrected a problem triggered by empty acopytools list (due to a python 3 related change). Failure example: https://bigpanda.cern.ch/job?pandaid=4546607417. Reported by David Cameron. In the meantime, Rod Walker "corrected" the missing AGIS info for at least one queue (Toronto)
- Removed annoying error message about missing pilot.util.mpi logging that ended up in stderr. Requested by Ivan Glushkov
2.3.1 (30)
- Internal changes to enable Python 3 support (in progress)
- Reset of a timing counter that led to increasing setup time measurements in multi jobs.
- Reported by Claire A. Bourdarios (not released as planned in v 2.2.2)
- Prevented a case where a failed transfer led to failure to send final server update. Reported by R. Walker
- Now cutting the tail of long stage-in/out error messages rather than the beginning, which cut the LFN+DDM endpoint info in some cases. Removed useless "Copy operation failed .." sub string from stage-in/out messages
- Switched off containers when neither platform nor alrbuserplatform are set. Requested by T. Maeno
- Now only including logExtracts for failed/holding jobs, which also cleans up the log somewhat (the logExtracts are just the tail of the pilot log and are never interesting for finished jobs)
- Using Rucio traces info to improve error messages from Rucio (requires version 1.20.8). Note: this change means we break compatibility with SLC6 since only deprecated rucio versions run there (a few queues in test mode remain on SLC6)
- Resolving the real source location when using mv
- Data API and top workflow updates
- Direct access workflow upgraded (support allow_lan & allow_wan, direct_access_lan & direct_access_wan)
- Removed direct access handling from copytools (gfal, lsm, rucio, xrdcp), the logic is applied at the top level
- Stage-in: ignore already processed files on stage-in retry with failover copy tool
- (Additional details below)
Contributions from A. Anisenkov, D. Cameron, P. Nilsson. Thanks to Ilija Vukotic for help with testing.
Affected (direct) access settings:
- Job settings requested by server: transfertype=direct (per job) for remoteio, storage_token=local * (per specific file) for copy2scratch
- Job input options: --accessmode=direct for remoteio, --accessmode=copy or --useLocalIO for copy2scratch
- PandaQueue settings: allow_lan + direct_access_lan for remoteio over LAN, allow_wan + direct_access_wan for remoteio over WAN
Logic
- By default the access method is copy2scratch.
- Each input file with transfertype = 'direct' and storage_token != local, as requested by PanDA, could potentially use remoteio (direct access)
- Job input options (--accessmode, --useLocalIO) overwrite the PanDA server decision
- The PandaQueue configuration finally controls whether a LAN or WAN replica can be used and whether the direct access method can be applied to a given file; the pilot takes the PanDA server decision and the job input options and applies the PandaQueue settings to them:
- for LAN: requires allow_lan=True + direct_access_lan=True + availability of an appropriate replica for remote io
- for WAN: requires allow_wan=True + direct_access_wan=True + availability of an appropriate replica for remote io
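The decision chain in the Logic list above can be sketched as a single function. This is an illustration only, not the pilot's code: the function name, the argument names, the dictionary representation of the PandaQueue settings, the preference for LAN over WAN, and the choice to let --accessmode=copy / --useLocalIO win over --accessmode=direct when both are given are assumptions.

```python
def select_access_mode(transfertype, storage_token, queue,
                       lan_replica=False, wan_replica=False,
                       accessmode=None, use_local_io=False):
    """Return (mode, network) for one input file.

    mode is 'remoteio' or 'copy2scratch'; network is 'lan', 'wan' or None.
    queue is a hypothetical dict holding the PandaQueue settings.
    """
    # Step 1: PanDA server decision for this file
    wants_direct = transfertype == "direct" and storage_token != "local"

    # Step 2: job input options overwrite the server decision
    if accessmode == "copy" or use_local_io:
        wants_direct = False
    elif accessmode == "direct":
        wants_direct = True

    if not wants_direct:
        return "copy2scratch", None

    # Step 3: PandaQueue settings have the final word (LAN tried first here)
    if queue.get("allow_lan") and queue.get("direct_access_lan") and lan_replica:
        return "remoteio", "lan"
    if queue.get("allow_wan") and queue.get("direct_access_wan") and wan_replica:
        return "remoteio", "wan"

    # Default access method
    return "copy2scratch", None
```

For example, a file with transfertype 'direct' on a queue that sets allow_lan and direct_access_lan, and that has a suitable LAN replica, would come out as remoteio over LAN; the same file with storage_token set to local would fall back to copy2scratch.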
2.2.2 (1)
- Hotfix for a rucio copy tool bug; added missing local file check after download that was labelled as successful by the rucio API. Previously, the failed transfer was discovered by the pilot but an error code was not set for the case that the supposedly downloaded file did not exist. An error code is now set (stage-in failed). Affected jobs failed at a later stage, obviously, as the input was not available. Reported by R. Walker.
2.2.1 (6)
- Reset of a timing counter that led to increasing setup time measurements in multi jobs. Reported by Claire A. Bourdarios
- Now waiting longer for the final server update to be done. Previously, hanging transfers could lead to the final server update not being sent at all before the pilot ended, since a break function did not wait long enough. Discovered by R. Walker
- (Temporarily) removed usages of timeout function in lsm, xrdcp copy tools pending a fix for occasionally casting a SIGTERM
- Rucio copy tool now adds DDM endpoint to error message after a failure. Requested by Stephane Jezequel
- It is no longer required to launch pilot with option -s site_name (-r resource_name can also be skipped since the previous pilot version). Pilot now only needs to be launched with -q queue_name
- Updated the handling of booleans in the pilot argparser. Now possible to use --allow-other-country and --allow-same-user with boolean values. Reported and fixed by F. Barreiro
- New error codes: 1353: "CPU consumption calculation failed: No such process", 1354: "General CPU consumption calculation problem (consult Pilot log)", 1355: "Core dump detected" (see https://twiki.cern.ch/twiki/bin/view/PanDA/Pilot2ErrorCodes for details). These error codes were all introduced to reduce the number of general ('unknown') PilotException (1301) errors that otherwise would have been set. Requested by P. Svirin
- Removed debug messages for ps output (prmon identification) which is often quite extensive
Contributions from F. Barreiro, P. Nilsson
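As an illustration of the boolean option handling mentioned above, a common argparse idiom is a small converter function passed as type. The sketch below is not taken from the pilot source: the helper name str2bool, the accepted spellings, the dest names, and the default values are assumptions; only the option names -q, --allow-other-country and --allow-same-user come from these notes.

```python
import argparse


def str2bool(value):
    """Convert a command-line string such as 'True' or 'false' to a bool."""
    if isinstance(value, bool):
        return value
    lowered = value.strip().lower()
    if lowered in ("true", "yes", "1", "t", "y"):
        return True
    if lowered in ("false", "no", "0", "f", "n"):
        return False
    raise argparse.ArgumentTypeError("boolean value expected, got %r" % value)


def build_parser():
    """Build a minimal parser with the options named in the release notes."""
    parser = argparse.ArgumentParser()
    parser.add_argument("-q", dest="queue", required=True)
    # Default values below are placeholders chosen for this sketch
    parser.add_argument("--allow-other-country", dest="allow_other_country",
                        type=str2bool, default=False)
    parser.add_argument("--allow-same-user", dest="allow_same_user",
                        type=str2bool, default=True)
    return parser
```

With such a parser, a launch line like `-q queue_name --allow-other-country True --allow-same-user false` would yield real booleans rather than non-empty (and therefore always truthy) strings.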
2.2.0 (25)
- Added MANIFEST file - Pilot 2 is now registered with pypi which means it can be pip-installed without referencing github
- pip install panda-pilot
- Data component upgrade
- Refactored and unified ES StagingClients
- Automatically prefer LAN protocol (read_lan/write_lan) for stage-in/stage-out file if source/destination RSE is local for given PQ (defined in inputddms=astorages['read_lan'])
- Base movers workflow upgraded
- Introduced require_input_protocols mode to look up and manually form input replicas for specific copytool (activated for the objectstore mover, ES workflow)
- Refactored and simplified the objectstore copytool
- Implemented fail-over transfer for ES stage-out
- Preparing for containerized middleware commands [minor update]
- Added debug messages for potential problem with relying on SC_CLK_TCK
- Local problem seen on MPPMU (https://bigpanda.cern.ch/job/4482660563/). Pilot cannot calculate CPU consumption without this value
- Added LFN in diagnostics message for checksum errors. Corrected mislabelled checksum types (MD5SUM reported instead of ADLER32). Requested by R. Walker
- Cleaned up stage-in/out error messages containing irrelevant Traceback info (should now be concise)
- Following an update in the auto-setup script, the pilot is now using RUCIO_LOCAL_SITE_ID instead of the deprecated DQ2_LOCAL_SITE_ID for localsite in Rucio traces
- Simplification of pilot arguments: now using resource name from queuedata instead of relying on pilot option -r (which can now be removed from wrapper)
- Instead of a traceback, now reporting the real error returned from rucio download or upload. However, the current version of rucio does not propagate errors well so the message will always be "None of the requested files have been downloaded". D. Cameron is working on fixing this so a future version of Rucio will report the real error
- Changed minimum allowed local space from 5 GB to 2 GB (as verified during payload running); the higher limit affected event index jobs run at OU. Requested by H. Severini
- Pilot is now always setting ATHENA_CORE_NUMBER (previously only set for event service jobs)
- Updated memory leak calculation to be consistent with new prmon field names (changed PSS+Swap to pss+swap)
- Added new error code 1352, “Failed to stat proc file for CPU consumption calculation” which is set when the pilot cannot access /proc/pid/stat. Requested by P. Svirin
- Corrected the local/remoteSite sent with the traces - previously if the pilot overwrote the requested ddmendpoint (ie if the requested ddmendpoint was not allowed), then the trace was not updated as well. Now it is.
Code contributions from D. Cameron, A. Anisenkov, W. Guan, F. Barreiro, P. Nilsson
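The dependence of the CPU consumption calculation on SC_CLK_TCK and on the proc stat file, mentioned in the notes above, can be sketched as follows. This is a minimal Linux-only illustration, not the pilot's code: the function names are hypothetical, and whether the pilot also counts the CPU time of waited-for children (cutime, cstime) is an assumption.

```python
import os


def get_clock_ticks():
    """Return the number of clock ticks per second, or None if unavailable."""
    try:
        return os.sysconf("SC_CLK_TCK")
    except (ValueError, OSError, AttributeError):
        # Without this value the tick counts cannot be converted to seconds
        return None


def cpu_seconds_from_stat(stat_line, clk_tck):
    """Convert the contents of /proc/<pid>/stat into CPU seconds.

    The process name (field 2) may contain spaces and parentheses, so the
    line is split after the last ')'. Relative to that point, utime, stime,
    cutime and cstime (fields 14 to 17) sit at positions 11 to 14.
    """
    rest = stat_line.rsplit(")", 1)[1].split()
    utime, stime, cutime, cstime = (int(v) for v in rest[11:15])
    return (utime + stime + cutime + cstime) / clk_tck
```

A caller would read `/proc/<pid>/stat`, pass its contents together with the result of `get_clock_ticks()`, and treat a missing file or a missing tick value as an error condition of the kind described above.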
2.1.25 (11)
- Added spacetoken from DDM conf to TURL used with mv output file list (ND)
- Improved error reporting (added diagnostics) for errors extracted from payload.stderr
- New error code 1351 (Unrecognized fatal error in transform stderr)
- Protection against failure to parse metadata in ancient release
- Problem seen in job using release 17, https://bigpanda.cern.ch/job?pandaid=4459932063
- Now ignoring nentries=0 in job report (previously failed with ‘Empty output file’)
- Added MANIFEST file - Pilot 2 is now registered with pypi which means it can be pip-installed without referencing github
- pip install panda-pilot
- Now creating pandaIDs.out file for wrapper. Requested by P. Love
- Removed useless 10s sleep before getting a new job
Code contributions from D. Cameron, F. Barreiro, P. Nilsson
2.1.24 (1)
- A secondary patch is needed for job report interpretation; the pilot should now handle the cases below. The case (7) when nentries = null / None was not handled correctly and led to some failures of jobs where the output file verification should have been "successful", ie the case should be ignored
- No output files in the job definition (abort output file verification)
- Output info is an empty list (consult with allowNoOutput list; fail job if output file is not present in that list, otherwise remove file from stage-out and approve verification)
- Output info is entirely missing (ignore output file verification)
- Output nentries are missing but output info exists (ignore validation)
- nentries is an int (good)
- nentries is 0 (consult with allowNoOutput list; fail job if output file is not present in that list, otherwise remove file from stage-out and approve verification)
- nentries is null or None (ignore validation if file is not in allowNoOutput; otherwise remove file from stage-out and approve verification)
- nentries is a string; only known value: “UNDEFINED” (ignored)
- any other cases will be ignored (ie output file verification will be approved)
- Pilot is now recognising the following stderr message patterns for all combinations of ‘ERROR:’, ’Error:’, ‘error:’, ‘WARNING:’, ‘Warning:’, ‘warning:’ plus optional extra space before the ‘:’. The previous version only recognised some of these which meant that the primary failure was missed in some cases and the secondary error was instead reported (e.g. HC tests using bad job options)
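The per-file nentries cases listed above for this release (missing, int, 0, null/None, string, anything else) can be summarised as a small decision function. This is a sketch of the rules as written in these notes, not the pilot's code: the function name, the dictionary layout of the job report entry, and the three return labels are assumptions.

```python
def check_nentries(lfn, file_info, allow_no_output):
    """Classify one output file from the job report.

    Returns 'approve' (verification passes), 'remove' (drop the file from
    stage-out and approve) or 'fail' (fail the job).
    file_info is a hypothetical dict for this file from the job report.
    """
    if "nentries" not in file_info:
        return "approve"  # nentries missing but output info exists: ignore

    nentries = file_info["nentries"]

    if nentries is None:
        # null/None: ignore unless the file is allowed to have no output
        return "remove" if lfn in allow_no_output else "approve"

    if isinstance(nentries, int) and not isinstance(nentries, bool):
        if nentries == 0:
            # Empty file: only acceptable if listed in allowNoOutput
            return "remove" if lfn in allow_no_output else "fail"
        return "approve"  # normal case

    # Strings such as "UNDEFINED" and any other value are ignored
    return "approve"
```

The cases where the job definition has no output files, or where the output info is empty or missing altogether, are decided before any per-file check and are left out of the sketch.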
2.1.23 (2)
Patch for unexpected values in the job report; nentries should normally be an int ('None' is also accepted) but is sometimes also a string (e.g. "UNDEFINED"). Problem seen with release 20.7.7, in which jobs with nentries=UNDEFINED led to an internal pilot problem. Pilot will now allow these jobs to finish.
2.1.22 (8)
- Silenced sourcing of atlasLocalSetup.sh (in proxy setup). Requested by F. Berghaus
- Preventing None values from ending up in the list of zombie processes to kill after running a job, which would cause the pilot to end rather than asking for a new job. Reported by D. Cameron
- Added new error codes for errors that previously were reported as “Payload metadata does not exist” (which was a secondary error)
- 1345 (Singularity: Failed to create user namespace)
- 1346 (Transform not found)
- 1347 (Unsupported SL5 OS)
- 1348 (Singularity: Resource temporarily unavailable)
- 1349 (Unrecognized transform arguments)
- 1350 (Empty output file detected)
- Now inserting DOCTYPE tag in PFC xml - this was revealed to be the missing piece for successfully running release 20.7.7 NTUP_PILEUP
- Trace report now contains pq=site name. Requested by T. Beermann
- Verification of output files and nentries listed in job report, secondary patch for output=[] case
- Pilot can now be pip installed
- pip install --upgrade git+git://github.com/PanDAWMS/pilot2.git@setup.py
Code contributions from F. Barreiro Megino, P. Nilsson