Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improvement of the stdout/stderr handling in the WMCore logs #12149

Open
anpicci opened this issue Oct 17, 2024 · 0 comments
Open

Improvement of the stdout/stderr handling in the WMCore logs #12149

anpicci opened this issue Oct 17, 2024 · 0 comments

Comments

@anpicci
Copy link
Contributor

anpicci commented Oct 17, 2024

Impact of the new feature
WMAgent logs, potentially Microservices

Is your feature request related to a problem? Please describe.
As reported here, the stderr and stdout are mixed up in the job logs, and this make difficult to understand the reports, with obvious consequences on the capacity to detect and eventually fix bugs.

An example is reported:

2024-10-15 21:02:07,975:INFO:StageOutImpl:Creating output directory...
2024-10-15 21:02:07,982:INFO:StageOutImpl:Running the stage out...
2024-10-15 21:02:07,990:ERROR:StageOutImpl:Attempt 0 to stage out failed.
Error details:
<@========== WMException Start ==========@>
Exception Class: StageOutError
Message: Simulated stage out failure
        ClassName : None
        ModuleName : WMCore.Storage.StageOutError
        MethodName : __init__
        ClassInstance : None
        FileName : /srv/job/WMCore.zip/WMCore/Storage/StageOutError.py
        LineNumber : 32
        ErrorNr : 0
        ErrorCode : 60311
        ErrorType : GeneralStageOutFailure

Traceback:

<@---------- WMException End ----------@>

2024-10-15 21:07:08,091:ERROR:StageOutImpl:Maximum number of retries exhausted. Further details on the failed command reported below.
2024-10-15 21:07:08,100:INFO:StageOutMgr:===> Stage Out Failure for file:
2024-10-15 21:07:08,100:INFO:StageOutMgr:======>  /store/unmerged/logs/prod/2024/10/15/apiccine_TC_LHE_PFN_GFALStageoutTest_v11_FNAL_241015_194912_6819/DarkSUSY_14TeV_2023D17_GenSimHLBeamSpotFull14/0000/0/01280c59-36ef-4642-9465-eb098d075a4e-3-0-logArchive.tar.gz

2024-10-15 21:07:08,100:INFO:StageOutMgr:======>  storageSite: T1_US_FNAL, volume: FNAL_dCache_EOS, protocol: XRootD, command: xrdcp, options: None, phedex-node: T1_US_FNAL_Disk
 using this stage out
2024-10-15 21:07:08,101:INFO:StageOutMgr:LFN to PFN match made:
LFN: /store/unmerged/logs/prod/2024/10/15/apiccine_TC_LHE_PFN_GFALStageoutTest_v11_FNAL_241015_194912_6819/DarkSUSY_14TeV_2023D17_GenSimHLBeamSpotFull14/0000/0/01280c59-36ef-4642-9465-eb098d075a4e-3-0-logArchive.tar.gz
PFN: davs://cmsdcadisk.fnal.gov:2880/dcache/uscmsdisk/store/unmerged/logs/prod/2024/10/15/apiccine_TC_LHE_PFN_GFALStageoutTest_v11_FNAL_241015_194912_6819/DarkSUSY_14TeV_2023D17_GenSimHLBeamSpotFull14/0000/0/01280c59-36ef-4642-9465-eb098d075a4e-3-0-logArchive.tar.gz

2024-10-15 21:07:08,101:INFO:StageOutImpl:Creating output directory...
2024-10-15 21:07:08,102:INFO:StageOutImpl:Running the stage out...
2024-10-15 21:07:08,111:ERROR:StageOutImpl:Attempt 0 to stage out failed.
Error details:
<@========== WMException Start ==========@>
Exception Class: StageOutError
Message: Simulated stage out failure
        ClassName : None
        ModuleName : WMCore.Storage.StageOutError
        MethodName : __init__
        ClassInstance : None
        FileName : /srv/job/WMCore.zip/WMCore/Storage/StageOutError.py
        LineNumber : 32
        ErrorNr : 0
        ErrorCode : 60311
        ErrorType : GeneralStageOutFailure

Traceback:

<@---------- WMException End ----------@>

2024-10-15 21:12:08,211:ERROR:StageOutImpl:Maximum number of retries exhausted. Further details on the failed command reported below.
2024-10-15 21:12:08,421:INFO:StageOutImpl:Command exited with status: 0
Output message: stdout: -----------------------------------------------------------
===========================================================
Debugging information on failing gfal-copy command
Current date and time: 2024-10-15 21:12:08
gfal-copy command which failed: env -i X509_USER_PROXY=/srv/myproxy.pem JOBSTARTDIR=/srv bash -c '. /srv/startup_environment.sh; date; gfal-copy -t 2400 -T 2400 -p -v --abort-on-failure -K adler32  file:///srv/job/WMTaskSpace/logArch1/logArchive.tar.gz davs://cmsdcadisk.fnal.gov:2880/dcache/uscmsdisk/store/un\
merged/logs/prod/2024/10/15/apiccine_TC_LHE_PFN_GFALStageoutTest_v11_FNAL_241015_194912_6819/DarkSUSY_14TeV_2023D17_GenSimHLBeamSpotFull14/0000/0/01280c59-36ef-4642-9465-eb098d075a4e-3-0-logArchive.tar.gz'
Hostname:   cmsgli-428988-0-cmswn2423.fnal.gov
OS:  Linux 5.14.0-427.37.1.el9_4.x86_64
GFAL environment variables:
GFAL_PLUGIN_DIR=/cvmfs/oasis.opensciencegrid.org/mis/osg-wn-client/3.4/3.4.57/el6-x86_64/usr/lib64/gfal2-plugins/
GFAL_CONFIG_DIR=/cvmfs/oasis.opensciencegrid.org/mis/osg-wn-client/3.4/3.4.57/el6-x86_64/etc/gfal2.d/
PYTHON environment variables:
PYTHONPATH=/cvmfs/cms.cern.ch/slc6_amd64_gcc700/external/py2-future/0.16.0/lib/python3.6/site-packages:/cvmfs/oasis.opensciencegrid.org/mis/osg-wn-client/3.4/3.4.57/el6-x86_64/usr/lib/python2.6/site-packages:/cvmfs/oasis.opensciencegrid.org/mis/osg-wn-client/3.4/3.4.57/el6-x86_64/usr/lib64/python2.6/site-pack\
ages:/srv/job/WMCore.zip:/srv/job
LD_* environment variables:
LD_LIBRARY_PATH=/cvmfs/cms.cern.ch/slc6_amd64_gcc700/external/py2-future/0.16.0/lib:/cvmfs/cms.cern.ch/slc6_amd64_gcc700/external/python3/3.6.4/lib:/cvmfs/cms.cern.ch/slc6_amd64_gcc700/external/python3/3.6.4/lib:/cvmfs/cms.cern.ch/slc6_amd64_gcc700/external/libffi/3.2.1-omkpbe2/lib64:/cvmfs/cms.cern.ch/slc6_a\
md64_gcc700/external/sqlite/3.22.0-omkpbe/lib:/cvmfs/cms.cern.ch/slc6_amd64_gcc700/external/gdbm/1.10-omkpbe2/lib:/cvmfs/cms.cern.ch/slc6_amd64_gcc700/external/db6/6.0.30-omkpbe2/lib:/cvmfs/cms.cern.ch/slc6_amd64_gcc700/external/expat/2.1.0-omkpbe2/lib:/cvmfs/cms.cern.ch/slc6_amd64_gcc700/external/bz2lib/1.0.\
6-omkpbe2/lib:/cvmfs/cms.cern.ch/slc6_amd64_gcc700/external/openssl/1.0.2d-omkpbe2/lib:/cvmfs/cms.cern.ch/slc6_amd64_gcc700/external/zlib-x86_64/1.2.11-omkpbe2/lib:/cvmfs/cms.cern.ch/slc6_amd64_gcc700/external/zlib-x86_64/1.2.11-omkpbe2/lib:/cvmfs/cms.cern.ch/slc6_amd64_gcc700/external/gcc/7.0.0-omkpbe2/lib64\
:/cvmfs/cms.cern.ch/slc6_amd64_gcc700/external/gcc/7.0.0-omkpbe2/lib:/cvmfs/oasis.opensciencegrid.org/mis/osg-wn-client/3.4/3.4.57/el6-x86_64/lib64:/cvmfs/oasis.opensciencegrid.org/mis/osg-wn-client/3.4/3.4.57/el6-x86_64/lib:/cvmfs/oasis.opensciencegrid.org/mis/osg-wn-client/3.4/3.4.57/el6-x86_64/usr/lib64:/c\
vmfs/oasis.opensciencegrid.org/mis/osg-wn-client/3.4/3.4.57/el6-x86_64/usr/lib:/cvmfs/oasis.opensciencegrid.org/mis/osg-wn-client/3.4/3.4.57/el6-x86_64/usr/lib64/dcap:/cvmfs/oasis.opensciencegrid.org/mis/osg-wn-client/3.4/3.4.57/el6-x86_64/usr/lib64/lcgdm:/.singularity.d/libs
gfal-copy location: /cvmfs/oasis.opensciencegrid.org/mis/osg-wn-client/3.4/3.4.57/el6-x86_64/usr/bin/gfal-copy.
Source PFN: file:///srv/job/WMTaskSpace/logArch1/logArchive.tar.gz
Target PFN: davs://cmsdcadisk.fnal.gov:2880/dcache/uscmsdisk/store/unmerged/logs/prod/2024/10/15/apiccine_TC_LHE_PFN_GFALStageoutTest_v11_FNAL_241015_194912_6819/DarkSUSY_14TeV_2023D17_GenSimHLBeamSpotFull14/0000/0/01280c59-36ef-4642-9465-eb098d075a4e-3-0-logArchive.tar.gz
Here are informations about certification methods
Bearer token content:
Bearer token file:
httokendecode path:
httokendecode:
VOMS proxy info:
subject   : /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=amaltaro/CN=718748/CN=Alan Malta Rodrigues/CN=1132199593/CN=978440072/CN=1090606534/CN=1857206451/CN=1707977247
issuer    : /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=amaltaro/CN=718748/CN=Alan Malta Rodrigues/CN=1132199593/CN=978440072/CN=1090606534/CN=1857206451
identity  : /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=amaltaro/CN=718748/CN=Alan Malta Rodrigues/CN=1132199593/CN=978440072/CN=1090606534/CN=1857206451
type      : RFC compliant proxy
strength  : 2048 bits
path      : /srv/myproxy.pem
timeleft  : 22:52:19
key usage : Digital Signature, Key Encipherment
=== VO cms extension information ===
VO        : cms
subject   : /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=amaltaro/CN=718748/CN=Alan Malta Rodrigues
issuer    : /DC=ch/DC=cern/OU=computers/CN=cms-auth.web.cern.ch
attribute : /cms/Role=production/Capability=NULL
attribute : /cms/Role=NULL/Capability=NULL
attribute : /cms/country/Role=NULL/Capability=NULL
attribute : /cms/country/us/Role=NULL/Capability=NULL
attribute : /cms/uscms/Role=NULL/Capability=NULL
timeleft  : 159:42:55
uri       : voms-cms-auth.app.cern.ch:15000
===========================================================
-----------------------------------------------------------

stderr: which: no httokendecode in (/cvmfs/cms.cern.ch/slc6_amd64_gcc700/external/py2-future/0.16.0/bin:/cvmfs/cms.cern.ch/slc6_amd64_gcc700/external/python3/3.6.4/bin:/cvmfs/cms.cern.ch/slc6_amd64_gcc700/external/python3/3.6.4/bin:/cvmfs/cms.cern.ch/slc6_amd64_gcc700/external/sqlite/3.22.0-omkpbe/bin:/cvmfs/\
cms.cern.ch/slc6_amd64_gcc700/external/gdbm/1.10-omkpbe2/bin:/cvmfs/cms.cern.ch/slc6_amd64_gcc700/external/db6/6.0.30-omkpbe2/bin:/cvmfs/cms.cern.ch/slc6_amd64_gcc700/external/expat/2.1.0-omkpbe2/bin:/cvmfs/cms.cern.ch/slc6_amd64_gcc700/external/bz2lib/1.0.6-omkpbe2/bin:/cvmfs/cms.cern.ch/slc6_amd64_gcc700/ex\
ternal/openssl/1.0.2d-omkpbe2/bin:/cvmfs/cms.cern.ch/slc6_amd64_gcc700/external/gcc/7.0.0-omkpbe2/bin:/cvmfs/cms.cern.ch/common:/srv/.gwms.d/bin:/storage/local/data1/condor/execute/dir_269578/glide_kXmu43/main/condor/libexec:/cvmfs/oasis.opensciencegrid.org/mis/osg-wn-client/3.4/3.4.57/el6-x86_64/usr/bin:/cvm\
fs/oasis.opensciencegrid.org/mis/osg-wn-client/3.4/3.4.57/el6-x86_64/usr/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin)

2024-10-15 21:12:08,436:INFO:StageOutMgr:===> Stage Out Failure for file:
2024-10-15 21:12:08,436:INFO:StageOutMgr:======>  /store/unmerged/logs/prod/2024/10/15/apiccine_TC_LHE_PFN_GFALStageoutTest_v11_FNAL_241015_194912_6819/DarkSUSY_14TeV_2023D17_GenSimHLBeamSpotFull14/0000/0/01280c59-36ef-4642-9465-eb098d075a4e-3-0-logArchive.tar.gz

2024-10-15 21:12:08,437:INFO:StageOutMgr:======>  storageSite: T1_US_FNAL, volume: FNAL_dCache_EOS, protocol: WebDAV, command: gfal2, options: None, phedex-node: T1_US_FNAL_Disk
 using this stage out
2024-10-15 21:12:08,440:ERROR:ExecuteMaster:Exception occured when executing step
2024-10-15 21:12:08,440:ERROR:ExecuteMaster:Exception is <@========== WMException Start ==========@>
Exception Class: StageOutFailure
Message: Failure for stage out:
<@========== WMException Start ==========@>
Exception Class: StageOutError
Message: Simulated stage out failure
        ClassName : None
        ModuleName : WMCore.Storage.StageOutError
        MethodName : __init__
        ClassInstance : None
        FileName : /srv/job/WMCore.zip/WMCore/Storage/StageOutError.py
        LineNumber : 32
        ErrorNr : 0
        ErrorCode : 60311
        ErrorType : GeneralStageOutFailure

Traceback:

<@---------- WMException End ----------@>Traceback (most recent call last):

Describe the solution you'd like
Avoid separation bewteen stderr and stdout, for example in this way:

process = subprocess.Popen(
    ["ls", "-l", "/nonexistent_directory"], 
    stdout=subprocess.PIPE, 
    stderr=subprocess.STDOUT
)

output, _ = process.communicate()

Describe alternatives you've considered
None for now

FYI @amaltaro

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: ToDo
Development

No branches or pull requests

1 participant