Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Profile RucioConMon memory #12089

Open
wants to merge 4 commits into
base: master
Choose a base branch
from
Open

Conversation

amaltaro
Copy link
Contributor

Fixes #<GH_Issue_Number>

Status

<In development | not-tested | on hold | ready>

Description

Is it backward compatible (if not, which system it affects?)

<YES | NO | MAYBE>

Related PRs

<If it's a follow up work; or porting a fix from a different branch, please mention them here.>

External dependencies / deployment changes

<Does it require deployment changes? Does it rely on third-party libraries?>

@cmsdmwmbot
Copy link

Jenkins results:

  • Python3 Unit tests: failed
    • 1 new failures
  • Python3 Pylint check: failed
    • 7 warnings and errors that must be fixed
    • 1 warnings
    • 23 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 7 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/15179/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot
Copy link

Jenkins results:

  • Python3 Unit tests: failed
    • 1 new failures
    • 7 tests deleted
  • Python3 Pylint check: failed
    • 5 warnings and errors that must be fixed
    • 21 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 12 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/15180/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot
Copy link

Jenkins results:

  • Python3 Unit tests: succeeded
    • 7 tests deleted
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 6 warnings and errors that must be fixed
    • 26 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 11 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/15197/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot
Copy link

Jenkins results:

  • Python3 Unit tests: succeeded
    • 7 tests deleted
  • Python3 Pylint check: failed
    • 7 warnings and errors that must be fixed
    • 21 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 10 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/15198/artifact/artifacts/PullRequestReport.html

@amaltaro
Copy link
Contributor Author

amaltaro commented Sep 6, 2024

I finally managed to do memory measurements with the code currently provided in this PR. It compares the current RucioConMon implementation in the WMCore stack in 2 scenarios:

  1. fetching compressed data from RucioConMon (format=raw, generating line by line)
  2. fetching uncompressed data from RucioConMon (format=json, loading the whole data in memory)

I ran these tests in vocms0259, such that I could measure memory usage in the node with the grafana host monitor. See screenshot below:
memory_rucio_conmon

Some observations are:

  • even though the memory application barely change in format=raw (compressed), we can see that about 2.5GB of memory gets allocated in the "Cache" category. My guess is that it's related to the file written to the file system (when retrieving data through RucioConMon -> Services -> pycurl_manager)
  • the same "Cache" 2.5GB allocation happens in the format=json as well. But this time, the application memory footprint also jumps by almost 4GB.

As I was running memory_profiler as well, here is a breakdown of the format=raw vs ``format=json`, where we can see that the application memory barely changes in the raw/compressed implementation; but has GBs of memory footprint in the json one (no generator).

format=raw (also faster!)

(WMAgent-2.3.4.3) [xxx@xxx:install]$ python testRucioConMonMem.py 
2024-09-06 19:49:28,773:INFO:testRucioConMonMem: Fetching unmerged dump for RSE: T1_US_FNAL_Disk with compressed data: True
2024-09-06 19:49:28,788:INFO:testRucioConMonMem: Fetching data from Rucio ConMon for RSE: T1_US_FNAL_Disk.
2024-09-06 19:49:28,802:INFO:RucioConMon: Size of rseUnmerged object: 11888

2024-09-06 20:48:44,553:INFO:testRucioConMonMem: Total files received: 10877227, unique dirs: 12885
Filename: testRucioConMonMem.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    27     36.0 MiB     36.0 MiB           1   @profile
    28                                         def getUnmergedFiles(rucioConMon, logger, compressed=False):
    29     36.0 MiB      0.0 MiB           1       dirs = set()
    30     36.0 MiB      0.0 MiB           1       counter = 0
    31     36.0 MiB      0.0 MiB           1       logger.info("Fetching data from Rucio ConMon for RSE: %s.", RSE_NAME)
    32     68.6 MiB     30.6 MiB    10877228       for lfn in rucioConMon.getRSEUnmerged(RSE_NAME, zipped=compressed):
    33     68.6 MiB      2.0 MiB    10877227           dirPath = _cutPath(lfn)
    34     68.6 MiB      0.0 MiB    10877227           dirs.add(dirPath)
    35                                                 #logger.info(f"Size of dirs object: {asizeof.asizeof(dirs)}")
    36     68.6 MiB      0.0 MiB    10877227           counter += 1
    37     68.6 MiB      0.0 MiB           1       logger.info(f"Total files received: {counter}, unique dirs: {len(dirs)}")
    38     68.6 MiB      0.0 MiB           1       return dirs

2024-09-06 20:48:44,555:INFO:testRucioConMonMem: Done!

format=json:

(WMAgent-2.3.4.3) [xxx@xxx:install]$ python testRucioConMonMem.py 
2024-09-06 21:12:16,810:INFO:testRucioConMonMem: Fetching unmerged dump for RSE: T1_US_FNAL_Disk with compressed data: False
2024-09-06 21:12:16,825:INFO:testRucioConMonMem: Fetching data from Rucio ConMon for RSE: T1_US_FNAL_Disk.
2024-09-06 21:20:38,011:INFO:RucioConMon: Size of rseUnmerged object: 2812956952
2024-09-06 22:18:50,841:INFO:testRucioConMonMem: Total files received: 10877227, unique dirs: 12885
Filename: testRucioConMonMem.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    27     35.8 MiB     35.8 MiB           1   @profile
    28                                         def getUnmergedFiles(rucioConMon, logger, compressed=False):
    29     35.8 MiB      0.0 MiB           1       dirs = set()
    30     35.8 MiB      0.0 MiB           1       counter = 0
    31     35.8 MiB      0.0 MiB           1       logger.info("Fetching data from Rucio ConMon for RSE: %s.", RSE_NAME)
    32   2946.0 MiB     24.6 MiB    10877228       for lfn in rucioConMon.getRSEUnmerged(RSE_NAME, zipped=compressed):
    33   2946.0 MiB      3.0 MiB    10877227           dirPath = _cutPath(lfn)
    34   2946.0 MiB      0.2 MiB    10877227           dirs.add(dirPath)
    35                                                 #logger.info(f"Size of dirs object: {asizeof.asizeof(dirs)}")
    36   2946.0 MiB      0.0 MiB    10877227           counter += 1
    37     63.6 MiB  -2882.4 MiB           1       logger.info(f"Total files received: {counter}, unique dirs: {len(dirs)}")
    38     63.6 MiB      0.0 MiB           1       return dirs

2024-09-06 22:18:50,842:INFO:testRucioConMonMem: Done!

I will make sure these changes are reflected in #12059 and proceed with this development over there.

@vkuznet
Copy link
Contributor

vkuznet commented Sep 7, 2024

Alan, it seems to me that actual issue in accumulation of results in this for loop: for lfn in rucioConMon.getRSEUnmerged(RSE_NAME, zipped=compressed):. How about converting the code to generator and let client process it. You may measure the size of returned dirs and it may likely to be constant size you observe in grafana which is unavoidable. The JSON adds overhead to load the JSON data into the RAM.

@amaltaro
Copy link
Contributor Author

amaltaro commented Sep 8, 2024

The function rucioConMon.getRSEUnmerged(RSE_NAME, zipped=compressed) returns a generator to the client, and in this example the client is testRucioConMonMem.py (I am making a similar test to what MSUnmerged does here).
With that said, this line:

for lfn in rucioConMon.getRSEUnmerged(RSE_NAME, zipped=compressed):

is in the correct place, as here will be the place to parse each lfn and decide what to do with them (on the client side).
Please let me know if I misunderstood your comment though.

@vkuznet
Copy link
Contributor

vkuznet commented Sep 8, 2024

Alan, the issue is in RucioConMon and Service modules. Here is my insight into its behavior:

My proposal is to stream data from makeRequest which should be converted to a generator, to preserve backward compatibility you need a new API for that. This way the code should not read data from remote server but rather reads it line by line and yields it back to a client. Also, to avoid converting data from list to a set (unique set of LFNs) you have two options: either make unique constrain on BE and avoid it on a client, or if you use cache file let the external process to handle uniqueness in a file, e.g. cat lfns.txt | sort -n | unique > new file.txt and switch your cache to a new file.

From my observation the current implementation of fetching data has unavoidable memory footprint due to loading data into python after HTTP call, and the larger HTTP response the larger memory footprint the code will deal with.

@amaltaro
Copy link
Contributor Author

amaltaro commented Sep 9, 2024

Valentin, you seem to have captured well the flow of an HTTP call through the Service parent class.

To add on what you described above, I think the actual data is loaded into memory at the most base class (pycurl_manager), at these lines:
https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/Services/pycurl_manager.py#L342-L343

which can then be automatically decompressed as well, if content is in gzip format. In that scenario, the cache file in the filesystem will not be binary as well, but it will be in the content-type from the response object (json, text, etc).

To really minimize the memory footprint, we would have to stream data from server to client, fetching 1 row each time. I don't know the exact details, but I guess the server would have to support a new-line (or similar) data streaming, the connection between client and server would have to remain opened until the client exhausts all the data in the response object.

This data-streaming is somehow a conflicting idea with the custom data caching we have implemented in the Services python package. So it needs to be carefully thought and implemented.

@vkuznet
Copy link
Contributor

vkuznet commented Sep 9, 2024

What you are looking for is NDJSON data-format which server must implement, basically it is list of JSON records separated by '\n'. Doing this way client can request such data (it can be in zipped format as well) and read one record at a time, similar to how CSV data-format is processed. And, the total amount of memory required for entire set of records will be reduced to a size of a single record.

@cmsdmwmbot
Copy link

Can one of the admins verify this patch?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants