
RequestExecutingAgent and StorageElementCache problems #3727

Closed
andresailer opened this issue Jun 20, 2018 · 10 comments

@andresailer
Contributor

When the RequestExecutingAgent (REA) executes requests for non-shifters, it obtains proxies for the user and deletes them after the execution of the operation. This causes problems when StorageElements are involved: when a StorageElement is created, a gfal2 context is created along with it, and that context picks up the location of the proxy ($X509_USER_PROXY) and keeps it for its whole lifetime, which equals the lifetime of the StorageElement. The proxy file, however, is deleted right after the execution of the request that caused the creation of the StorageElement.
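
To make the sequence concrete, here is a minimal sketch of the flow described above (illustrative only, not the actual REA/DataManager code; dump_user_proxy is a hypothetical stand-in for fetching the user proxy via the ProxyManager):

import os
import shutil
import tempfile
import gfal2

def dump_user_proxy():
    # Hypothetical stand-in: in the REA the proxy comes from the ProxyManager;
    # here we simply copy whatever proxy the environment already points at.
    fd, path = tempfile.mkstemp(prefix='x509up_')
    os.close(fd)
    shutil.copyfile(os.environ['X509_USER_PROXY'], path)
    return path

def execute_operation(surl, proxy_path):
    os.environ['X509_USER_PROXY'] = proxy_path  # point gfal2 at the user proxy
    ctx = gfal2.creat_context()                 # the context captures the proxy *path* here
    return ctx.stat(surl)                       # works while the proxy file still exists

proxy_path = dump_user_proxy()
try:
    execute_operation("srm://srm-public.cern.ch:8443/srm/managerv2?SFN=/castor/cern.ch/grid/ilc/prod/clic/500gev/Z_uds/CLIC_o3_v14/SIM/00010494/000/Z_uds_sim_10494_304.slcio", proxy_path)
finally:
    os.unlink(proxy_path)  # the REA deletes the user proxy after the request

# Any StorageElement (and its gfal2 context) cached during the call above still
# points at the deleted proxy_path, so later operations through the cached SE
# fail with "No such file or directory".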

I noticed this behaviour during the development of the MultiVO transformations, because I was then running requests only with a "user" proxy. The resulting error messages from GFAL2 are somewhat cryptic:

GError: srm-ifce err: Communication error on send, err: [SE][Ls][] httpg://srm-public.cern.ch:8443/srm/managerv2: CGSI-gSOAP running on pclcd50.dyndns.cern.ch reports No such file or directory

("No such file or directory" refers to the proxy, not to the file on the storage; it took me way too long to understand what was happening...)

When running predominantly with shifter proxies for the execution of the Requests, the error messages don't show up, because the user requests simply use the shifter proxy associated with the StorageElement already in the cache. The problem is still there, though: the wrong proxy is used for the request when calling storage elements.

The gfal2 context storing the proxy location can be reproduced in plain Python:

import os
import gfal2

ctx = gfal2.creat_context()  # this picks up the proxy from $X509_USER_PROXY, so export it before starting python
ctx.stat("srm://srm-public.cern.ch:8443/srm/managerv2?SFN=/castor/cern.ch/grid/ilc/prod/clic/500gev/Z_uds/CLIC_o3_v14/SIM/00010494/000/Z_uds_sim_10494_304.slcio")

os.environ['X509_USER_PROXY'] = '/no/such/file'  # basically what the REA does, except there it points to an existing proxy; it just won't be used
ctx.stat("srm://srm-public.cern.ch:8443/srm/managerv2?SFN=/castor/cern.ch/grid/ilc/prod/clic/500gev/Z_uds/CLIC_o3_v14/SIM/00010494/000/Z_uds_sim_10494_304.slcio")  # still works, because the original proxy location is used

ctx = gfal2.creat_context()  # picks up the location now in $X509_USER_PROXY
ctx.stat("...")  # fails with "No such file or directory" (for the proxy)
@chaen
Contributor

chaen commented Jun 20, 2018

Just to note that, as agreed with Andre, I will provide a fix for this.

@chaen
Contributor

chaen commented Jun 20, 2018

Could you try this patch?

rel-v6r20...chaen:rel-v6r20_FIX_SECache

@andresailer
Contributor Author

Yes, that works.

@chaen
Contributor

chaen commented Jun 20, 2018

How about the memory consumption?

@andresailer
Contributor Author

On a 16 GB RAM server (2.7% per instance):

     PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 528253 dirac     39  19 1162m 422m  12m S  0.0  2.7   4:40.57 python /opt/dirac/pro/DIRAC/Core/scripts/dirac-agent.py RequestManagement/RequestExecutingAgent /opt/dirac/pro/etc/RequestManagement_RequestExecutingAgent.cfg
 528331 dirac     39  19 1162m 422m  12m S  0.0  2.7   4:38.30 python /opt/dirac/pro/DIRAC/Core/scripts/dirac-agent.py RequestManagement/RequestExecutingAgent /opt/dirac/pro/etc/RequestManagement_RequestExecutingAgent.cfg
 528424 dirac     39  19 1172m 417m  12m S  0.0  2.7   4:38.04 python /opt/dirac/pro/DIRAC/Core/scripts/dirac-agent.py RequestManagement/RequestExecutingAgent /opt/dirac/pro/etc/RequestManagement_RequestExecutingAgent.cfg
 528441 dirac     39  19 1162m 416m  12m S  0.0  2.7   4:36.53 python /opt/dirac/pro/DIRAC/Core/scripts/dirac-agent.py RequestManagement/RequestExecutingAgent /opt/dirac/pro/etc/RequestManagement_RequestExecutingAgent.cfg
 528358 dirac     39  19 1172m 415m  12m S  0.0  2.7   4:36.75 python /opt/dirac/pro/DIRAC/Core/scripts/dirac-agent.py RequestManagement/RequestExecutingAgent /opt/dirac/pro/etc/RequestManagement_RequestExecutingAgent.cfg

Each thread stored about 420 StorageElements in the cache

/opt/dirac/pro/runit/RequestManagement/RequestExecutingAgent/log/current:adding (139764841404160, 'CERN-SRM', None, 'ilc', '/tmp/tmpMRoMvB') number 414
/opt/dirac/pro/runit/RequestManagement/RequestExecutingAgent/log/current:adding (139764851943168, 'CERN-SRM', None, 'ilc', '/tmp/tmpY6XKUh') number 415
/opt/dirac/pro/runit/RequestManagement/RequestExecutingAgent/log/current:adding (139764931901184, 'CERN-SRM', None, 'ilc', '/tmp/tmpQBkpiS') number 421
/opt/dirac/pro/runit/RequestManagement/RequestExecutingAgent/log/current:adding (139764841666304, 'CERN-SRM', None, 'ilc', '/tmp/tmpyQBxXo') number 417
/opt/dirac/pro/runit/RequestManagement/RequestExecutingAgent/log/current:adding (139764864792320, 'CERN-SRM', None, 'ilc', '/tmp/tmpzW2WeK') number 420

Assuming all memory is taken by the cached SEs, which is of course not the case, we come to about 1 MB per StorageElement item (2.7% of 16 GB ≈ 440 MB for roughly 420 cached SEs).
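
The growth itself follows from the cache key visible in the "adding" log lines above. A minimal sketch of the effect, assuming (as those lines suggest) that the key contains the location of the temporary proxy file; the names below are hypothetical:

import threading

se_cache = {}

def get_storage_element(se_name, vo, proxy_location, protocols=None):
    # Per-thread SE cache whose key includes the proxy file path, matching the
    # tuples printed in the log lines above.
    key = (threading.get_ident(), se_name, protocols, vo, proxy_location)
    return se_cache.setdefault(key, object())  # object() stands in for a ~1 MB StorageElement

# Every non-shifter request comes with its own temporary proxy file, so every
# request adds a new cache entry for the very same SE until the cache lifetime expires.
for i in range(420):
    get_storage_element('CERN-SRM', 'ilc', '/tmp/tmp%06d' % i)

print(len(se_cache))  # 420 entries, i.e. roughly 420 MB at ~1 MB per entry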

@chaen
Contributor

chaen commented Jun 20, 2018

I see. By any chance, did you do the same measurement before?

@andresailer
Contributor Author

On the production machine the instances take 0.5% of the 16 GB, but they are not processing Requests at the moment. Assuming the SECache is empty there, subtracting that baseline brings the estimate down to about 0.86 MB per element ((2.7% - 0.5%) of 16 GB ≈ 360 MB for ~420 entries).

How many user requests do you process per 30 minutes (the cache lifetime)?

@chaen
Contributor

chaen commented Jun 20, 2018

Several tens of thousands in the case of removal or forward DISET operations :-/

@andresailer
Contributor Author

With the updates in @chaen's branch (using the proxyManager's proxyFile cache):

RequestExecutingAgent/log/current:re-using SE (140649021716224, 'CERN-SRM', None, 'ilc', '/tmp/tmpi0SB4M') number 2
RequestExecutingAgent/log/current:re-using SE (140649021716224, 'CERN-SRM', None, 'ilc', '/tmp/tmpi0SB4M') number 2

The memory stays at 0.5%.
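
For contrast with the sketch above, this is why the update keeps memory flat, assuming the proxyManager-side proxyFile cache hands back the same temporary proxy file for repeated lookups of the same credentials (names are hypothetical):

import threading

se_cache = {}
proxy_file_cache = {}

def get_cached_proxy_location(dn, group):
    # Hypothetical stand-in for the proxyManager proxyFile cache: one temp file
    # per (DN, group) instead of a fresh one per request.
    return proxy_file_cache.setdefault((dn, group), '/tmp/tmpi0SB4M')

def get_storage_element(se_name, vo, proxy_location):
    key = (threading.get_ident(), se_name, None, vo, proxy_location)
    return se_cache.setdefault(key, object())

for _ in range(420):
    get_storage_element('CERN-SRM', 'ilc', get_cached_proxy_location('someDN', 'ilc_user'))

print(len(se_cache))  # 1 -- the same SE is re-used, matching the "re-using SE ... number 2" log lines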

@andresailer
Contributor Author

Fixed in #3733
