This repository was archived by the owner on Jan 30, 2024. It is now read-only.

2.12.1.62 #353

Merged
merged 106 commits into from
Jun 30, 2021

Conversation

PalNilsson
@PalNilsson PalNilsson commented Jun 30, 2021

  • Advanced debug mode
    • Added support for an explicit debug command="some_command some_option" instead of the static ‘debug’, which currently tells the pilot to tail the latest non-binary file found
    • If the pilot receives a command such as ‘tail some_log_file’, it locates that file and tails it
    • Other supported commands are ls, ps, du, and gdb (plus options)
    • For gdb, the pilot identifies the pid of the true athena child process (unless a pid is specified, e.g. one identified in an earlier ps debug command), produces a core file and copies it to the work dir for storing in the log, and kills the job with a descriptive error message. An exception is made for this core file, since core files present in the work dir would normally be removed before the log is created
    • Requested by R. Walker
  • A core dump of the payload process is now produced and copied to the log if the payload is looping
    • Requested by R. Walker
  • The current event number (at the time of a job update) is now reported with job metrics
    • Requested by R. Walker
  • The number of ReSim events is now reported with jobMetrics even when resimevents=0
    • Requested by R. Walker
  • Xcache service updates
    • Updated xcache kill, now using env variable for pid
    • Xcache start now uses option ‘-b 4’ (only cleanup orphans older than four days)
    • Xcache environment variables are expanded in metadata xml
    • Requested by A. de Silva
  • Orphan handling
    • Previously, the pilot explicitly killed identified orphan processes at the end of the job, which led to problems at RAL due to their special setup
    • Pilot is now using ctypes library functions to make sure all subprocesses are parented, i.e. any ‘orphans’ at the end of the job will be removed when the child subprocesses are killed
    • A new error code was added related to this, to make sure ctypes is available everywhere (as tested with RC jobs)
      • 1365: "Python module ctypes not available on worker node"
      • An RC test revealed that not all sites have the ctypes module installed (at least CSCS-LCG2)
      • Note: the pilot continues to use the orphan killer algorithm, but if the ctypes library was used successfully, all processes that would otherwise become orphans will be parented. Sites that do not want the orphan killer to execute should define the environment variable PILOT_NOKILL
  • Summing up input file sizes
    • Skipped when mv copy tool is used to avoid a subsequent problem with direct access
    • Requested by R. Walker
  • Updated gs copy tool
    • Note: this copy tool is currently tailored for Rubin, but will be refactored at a later time for general use
  • Fixed a problem with wrong localSite information in traces
    • Previously, when middleware containers were used, the localSite was extracted from the RUCIO_LOCAL_SITE_ID env var. That variable is unknown inside the container, so the ddmendpoint value was used instead. The pilot now defaults to the value stored in the preliminary trace report
    • Requested by I. Vukotic
  • Code cleaning towards pylint compliance in preparation for a coming corresponding GitHub Action
    • We are currently focusing on cleaning up errors detected by pylint
  • Python 3 update
    • Pilot can now handle cases where the payload produces illegal (non-printable) characters in the stdout
    • Problem discovered in a user job
  • Pilot now looks for singularity errors in the payload stderr even if the exit code is zero
  • HPO updates
    • Additional exit code handling to allow pre-process to abort after user defined iteration limit
    • Requested by T. Maeno
  • Raythena plug-in update
    • Output files are now moved to an external directory (defined by the pilot option --output-dir) as soon as they are reported by AthenaMP
    • To be tested
  • Improved error reporting after remote file verification failure
    • Added diagnostics to error message
    • Requested by P. Vokac
  • Added prmon to list of unwanted files in looping job killer
  • Parallel remote file open
    • The remote file open script may now attempt to open turls in parallel
    • Activated by specifying nopenfiles=N in PQ.catchall (for testing)
    • The default number of file open threads is 1, i.e. same as before
    • Requested by I. Vukotic
  • Some cleanup of the main pilot log, less clutter (incl. no dumping of stage-in/out logs if container is not used)
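
For the advanced debug mode item above, the following is a minimal sketch of how an explicit debug command string could be split into a command and its options and checked against the supported set (tail, ls, ps, du, gdb). The helper name `parse_debug_command` and the constant are hypothetical and are not taken from the pilot code.

```python
import shlex

# Commands listed in these release notes as supported in advanced debug mode.
SUPPORTED_DEBUG_COMMANDS = {"tail", "ls", "ps", "du", "gdb"}


def parse_debug_command(debug_command):
    """Split a debug command string into (command, options).

    Hypothetical helper, not the pilot's own function. Returns (None, [])
    if the string is empty, malformed, or names an unsupported command.
    """
    try:
        tokens = shlex.split(debug_command or "")
    except ValueError:
        # Unbalanced quotes or similar: treat as unsupported.
        return None, []
    if not tokens or tokens[0] not in SUPPORTED_DEBUG_COMMANDS:
        return None, []
    return tokens[0], tokens[1:]
```

With this sketch, `parse_debug_command("tail some_log_file")` yields the command `tail` and the single option `some_log_file`, while anything outside the supported set is rejected.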
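
For the orphan handling item above, one way on Linux to make sure would-be orphans stay parented is to mark the current process as a "child subreaper" through `prctl`, called via ctypes. The notes do not say which ctypes call the pilot uses, so treat the choice of `PR_SET_CHILD_SUBREAPER` and the function name `become_subreaper` as assumptions. The sketch returns False when ctypes or `prctl` is not available, which is the situation error code 1365 describes.

```python
PR_SET_CHILD_SUBREAPER = 36  # value from <linux/prctl.h>, Linux 3.4 and later


def become_subreaper():
    """Ask the kernel to reparent orphaned descendants to this process.

    Hypothetical sketch. Returns True on success, False if ctypes, libc
    or prctl is unavailable or the call fails (e.g. on non-Linux systems).
    """
    try:
        import ctypes
        import ctypes.util

        libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
        result = libc.prctl(PR_SET_CHILD_SUBREAPER, 1, 0, 0, 0)
    except (ImportError, OSError, AttributeError, TypeError):
        return False
    return result == 0
```

Once the call succeeds, descendants whose parent exits are reparented to the calling process instead of init, so they can be reaped or killed together with the other child subprocesses.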
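
For the Python 3 item above, the notes do not describe how the pilot handles non-printable characters in payload stdout. A minimal sketch of one tolerant approach, assuming the output arrives as bytes, is to decode with replacement and then mask remaining control characters. The function name `decode_payload_output` is hypothetical.

```python
def decode_payload_output(raw):
    """Decode payload output bytes without raising on invalid sequences.

    Hypothetical sketch: invalid UTF-8 becomes U+FFFD, and non-printable
    characters other than newline and tab are replaced with '?'.
    """
    text = raw.decode("utf-8", errors="replace")
    return "".join(
        ch if ch.isprintable() or ch in "\n\t" else "?" for ch in text
    )
```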
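
For the parallel remote file open item above, the sketch below reads `nopenfiles=N` from a catchall string and opens turls with that many threads, falling back to a single thread as before. It assumes the catchall fields are comma-separated, and the names `get_nopenfiles`, `open_in_parallel` and the `open_func` callback are hypothetical rather than the remote file open script's own.

```python
from concurrent.futures import ThreadPoolExecutor


def get_nopenfiles(catchall, default=1):
    """Extract nopenfiles=N from a comma-separated catchall string."""
    for field in catchall.split(","):
        key, _, value = field.strip().partition("=")
        if key == "nopenfiles" and value.isdigit() and int(value) > 0:
            return int(value)
    return default


def open_in_parallel(turls, open_func, nthreads=1):
    """Call open_func(turl) for each turl using up to nthreads threads.

    Returns a dict mapping each turl to True (opened) or False (failed).
    """
    def try_open(turl):
        try:
            return bool(open_func(turl))
        except Exception:
            # Any failure to open counts as "not opened".
            return False

    with ThreadPoolExecutor(max_workers=max(1, nthreads)) as pool:
        results = list(pool.map(try_open, turls))
    return dict(zip(turls, results))
```

With `nthreads=1` the turls are opened one at a time, which matches the default behaviour described in the notes.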

Code contributions from O. Freyermuth, A. Anisenkov, S. Ye, B. Simmons, P. Vokac, P. Nilsson

PalNilsson and others added 30 commits May 12, 2021 12:25
…ol name (this is probably irrelevant since the proper value is set later when it is known)
…ssage log. Added preliminary functions for advanced debug mode.
…ts of debugging info for xcache [to be removed again]
PalNilsson and others added 29 commits June 17, 2021 15:03
Improve code in jobreport parsing function
…o from log messages. No longer dumping stage-in/out in main log. Cleanup and pylint corrections
self.trace_report was being used before being bound to the instance.
logging.getLogger was being called with multiple args.
Fix initialisation bug and logging bug
Add nojekyll file to build docs workflow
@PalNilsson PalNilsson merged commit 93e123d into master Jun 30, 2021