<LMON BE API> (ERROR): read_lmonp return a neg value #46

Open
antonl321 opened this issue Jul 8, 2022 · 58 comments
@antonl321

Hi,

I did some tests with the latest stat-cl.

It works fine on runs with 45 nodes x 16 MPI,
but it fails with the error below on a run with 880 nodes x 8 MPI.

[atosla@ac1-1001 ~]$ stat-cl  `pgrep -o srun` 2>~/stat-cl.err
STAT started at 2022-07-08-13:44:15
Attaching to job launcher (null):690121 and launching tool daemons...
<Jul 08 13:44:15> <Launchmon> (INFO): Just continued the RM process out of the first trap
<Jul 08 13:44:18> <LMON BE API> (ERROR): read_lmonp return a neg value

In stat-cl.err I see the following

graph is not connected, found 0 potential roots
<Jul 08 13:44:18> <STAT_FrontEnd.C: 642> STAT returned error type STAT_MRNET_ERROR: MRNet reported a Network error with message: MRNet: Topology error: file has incorrect format
<Jul 08 13:44:18> <STAT.C: 181> STAT returned error type STAT_MRNET_ERROR: Failed to launch MRNet tree()
<Jul 08 13:44:18> <STAT_lmonBackEnd.C:146> ac3-2041.bullx: STAT returned error type STAT_LMON_ERROR: Failed to receive data from FE
<Jul 08 13:44:18> <STATD.C:230> ac3-2041.bullx: STAT returned error type STAT_LMON_ERROR: Failed to connect BE
<Jul 08 13:44:18> <STAT_lmonBackEnd.C:146> ac3-1087.bullx: STAT returned error type STAT_LMON_ERROR: Failed to receive data from FE
<Jul 08 13:44:18> <STATD.C:230> ac3-1087.bullx: STAT returned error type STAT_LMON_ERROR: Failed to connect BE
<Jul 08 13:44:18> <STAT_lmonBackEnd.C:146> ac1-4064.bullx: STAT returned error type STAT_LMON_ERROR: Failed to receive data from FE
<Jul 08 13:44:18> <STATD.C:230> ac1-4064.bullx: STAT returned error type STAT_LMON_ERROR: Failed to connect BE
<Jul 08 13:44:18> <STAT_lmonBackEnd.C:146> ac3-2012.bullx: STAT returned error type STAT_LMON_ERROR: Failed to receive data from FE
<Jul 08 13:44:18> <STATD.C:230> ac3-2012.bullx: STAT returned error type STAT_LMON_ERROR: Failed to connect BE
<Jul 08 13:44:18> <STAT_lmonBackEnd.C:146> ac1-4008.bullx: STAT returned error type STAT_LMON_ERROR: Failed to receive data from FE
<Jul 08 13:44:18> <STATD.C:230> ac1-4008.bullx: STAT returned error type STAT_LMON_ERROR: Failed to connect BE
<Jul 08 13:44:18> <STAT_lmonBackEnd.C:146> ac3-2013.bullx: STAT returned error type STAT_LMON_ERROR: Failed to receive data from FE
<Jul 08 13:44:18> <STATD.C:230> ac3-2013.bullx: STAT returned error type STAT_LMON_ERROR: Failed to connect BE
<Jul 08 13:44:18> <STAT_lmonBackEnd.C:146> ac3-3013.bullx: STAT returned error type STAT_LMON_ERROR: Failed to receive data from FE
<Jul 08 13:44:18> <STATD.C:230> ac3-3013.bullx: STAT returned error type STAT_LMON_ERROR: Failed to connect BE
<Jul 08 13:44:18> <STAT_lmonBackEnd.C:146> ac3-2033.bullx: STAT returned error type STAT_LMON_ERROR: Failed to receive data from FE
<Jul 08 13:44:18> <STATD.C:230> ac3-2033.bullx: STAT returned error type STAT_LMON_ERROR: Failed to connect BE
...
@lee218llnl
Collaborator

Can you attach the .top file in your stat_results directory?

@antonl321
Author

I got the following, obviously an incomplete list of hosts

$ cat ~/stat_results/ifsMASTER.sp.x.0016/*top
ac1-1001.bullx:0 =>
	ac1-1001:0
	ac1-1002:0
	ac1-1003:0
	ac1-1006:0
	ac1-1007:0
	ac1-1008:0
	ac1-1009:0
	ac1-1012:0
	ac1-1013:0
	ac1-1014:0
	ac1-1016:0
	ac1-1017:0
	ac1-1018:0
	ac1-1019:0
	ac1-1020:0
	ac1-1021:0
	ac1-1022:0
	ac1-1023:0
	ac1-1030:0
	ac1-1031:0
	ac1-1032:0
	ac1-1033:0
	ac1-1034:0
	ac1-1035:0
	ac1-1036:0
	ac1-1037:0
	ac1-1038:0
	ac1-1039:0
	ac1-1040:0
	ac1-1041:0;

@lee218llnl
Collaborator

As a workaround, you can set the topology to "depth" with a value of "1". This will connect the STAT FE directly to all the STAT BEs. The topology file just lists the MRNet communication processes that STAT uses to make its communication more scalable. The number of hosts you see in the topology file is probably the square root of the number of nodes of your job.

That aside, I will have to do some debugging to see why it doesn't like the topology file. It might not like that your hostnames have "-" characters in them.

Anyway, let me know if the depth 1 workaround suffices until I can fix this.
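For concreteness, the flat-topology run would look roughly like this (just a sketch, reusing the pgrep launcher lookup from your original report):

# depth 1: the STAT FE connects directly to every STAT BE, no MRNet tree
stat-cl -d 1 `pgrep -o srun`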

@lee218llnl
Collaborator

I'm actually thinking that MRNet does not like the fact that there is the ".bullx" in your hostname. It may be the case that when it sees ".something" it thinks "something" is your domain name and then tries to resolve the network address using that. I'm guessing that in your case, ".bullx" is not the domain, but rather just part of the node name. Is this accurate?

@antonl321
Author

I think so. We have discussed this a while back in #33 (comment)

@antonl321
Author

The depth 1 option works (no crash), but stat-cl stays forever in "Sampling traces..." when I try to sample an 880 nodes x 8 MPI/node job.
I have a suggestion here: if it were easy to implement, it would be useful to see the progress of sampling every second or so. That way one could decide quickly whether it is worth waiting or cancelling.

@lee218llnl
Collaborator

How long did you wait? It could take a few minutes when running with this flat topology rather than using a scalable tree.

Regarding progress sampling, there is no easy way to poll the daemons for progress, so unfortunately your suggestion cannot be implemented.

Ultimately, I'm hoping that we can get a fix in MRNet so you can use a scalable tree and then in theory the operation should be fairly quick.

@antonl321
Author

Up to 30 minutes or so.
If I use a very simple test code (a few lines) I can get a stack trace in a couple of minutes even at a larger scale.
As I said in my previous message, stat-cl gets stuck in "Sampling traces...". The launch of the tool daemons goes OK. The app I'm interested in is rather complex, with deep stack traces. I wonder if some timeouts block the whole process.

@lee218llnl
Collaborator

Can you try running the same complex app with the deep stack at a smaller scale? One potential issue is contention on the file system, since all the STATDs will be trying to parse your executable's debug information.

@antonl321
Author

I have run it with 300 nodes, but it still takes a long time. I think you are right about STATD being stuck waiting for debug info from the file system. I looked on one node with bpf profile and offcpu tools, and most of the time STATD is off CPU while stack traces are being collected. Also, a large number of Lustre RPCs are being issued. Can this be improved? I need to think and ask around.

@antonl321
Author

antonl321 commented Jul 27, 2022

I have run stat-cl with -l BE -l SW -l CP on the case that hangs (880 nodes x 8 MPI/node).
I have noticed that after some initial progress is reported, all log files stop being updated at about the same time (to the second).
The last line in *.STATD.log is
<STAT_FilterDefinitions.C:661> File request down stream filter end.
EXCEPT for one file which has
<STAT_BackEnd.C:1268> sending ackowledgement with tag 103 (first = 100) and return value 0

I wonder if a hang appears in STAT itself. Can one run stat on stat? :)

One more note:
after I killed the hanging job I found in the stat_results subfolder 880 files with this name pattern
ifsMASTER.sp.x.0050.BE__.dot

@lee218llnl
Collaborator

Yes, you can run STAT on STAT, but the latter instance is likely to hang for the same reason. I think the problem is that without the MRNet tree, the STAT FE is trying to connect to 880 BEs and send data. This is clearly not scalable and will either take too long or hang. I'm hoping we can get the fix in MRNet to tolerate the "." in your system's node hostnames so we can let STAT/MRNet deploy a more scalable tree.

@antonl321
Author

Perhaps this helps: internally in our cluster the short node names are enough for comms, so if you can get them with something like hostname -s it should be a quick fix.

@lee218llnl
Collaborator

I looked at the STAT source code and found that we already have a STAT_FE_HOSTNAME environment variable. @antonl321 in your case from above, can you set this variable to "ac1-1001" or whatever the short hostname is from where you invoke STAT? If you look at the .top file that it produces, the value that you specify should be on the first line, rather than the hostname with ".bullx". You might want to test at a smaller scale, say 4 nodes and then when you run stat-cl, also add "-d 2" to the stat-cl args.
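In other words, something like this before attaching (a sketch; the short hostname and the srun PID lookup are placeholders):

export STAT_FE_HOSTNAME=`hostname -s`   # e.g. ac1-1001
stat-cl -d 2 `pgrep -o srun`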

@antonl321
Author

I'm a bit confused, what exactly should be in STAT_FE_HOSTNAME? The name of the head node, or hostname pattern, or ...?

@lee218llnl
Collaborator

It should be the short hostname of the node from where you run stat-cl or stat-gui. Does that make sense?

@antonl321
Author

Yes, it does. I'll test this when I get some free nodes on the system. Hopefully this evening or tomorrow morning.

@antonl321
Author

With STAT_PE_HOSTNAME defined and without the -d option, stat-cl fails quickly with this sort of message (see below).
Note that STATD reports it has a -d 1 flag although I didn't use one.
Without -d 1, stat-cl hangs as before.

<Jul 27 20:45:09> <STAT_BackEnd.C:655> registering signal handlers
<Jul 27 20:45:09> STATD.C:196 STATD invoked with STAT invoked with: /lus/h2resw01/hpcperm/atosla/Tools/spack/opt/spack/linux-rhel8-zen/gcc-8.4.1/stat-develop-xq35kogkta6vrg4jonmm4sblhegoxpf3/bin/STATD -d 1 -L /home/atosla/R3/tmp/stat-logs/42873620 -l BE -l SW --lmonsharedsec=393846926 --lmonsecchk=1241494145
<Jul 27 20:45:09> <STAT_lmonBackEnd.C:50> Registering unpack function with Launchmon
<Jul 27 20:45:09> <STAT_lmonBackEnd.C:58> Launchmon handshaking
<Jul 27 20:45:09> <STAT_lmonBackEnd.C:66> Waiting for launchmon to be ready
<Jul 27 20:45:09> <STAT_lmonBackEnd.C:74> Allocating process table
<Jul 27 20:45:09> <STAT_lmonBackEnd.C:89> Getting my process table entries
<Jul 27 20:45:09> <STAT_lmonBackEnd.C:115> Launchmon successfully initialized
<Jul 27 20:45:09> <STAT_lmonBackEnd.C:132> Prepearing to connect to MRNet
<Jul 27 20:45:09> <STAT_lmonBackEnd.C:142> Receiving connection information from FE
<Jul 27 20:45:09> <STAT_lmonBackEnd.C:146> ad1-1036.bullx: STAT returned error type STAT_LMON_ERROR: Failed to receive data from FE
<Jul 27 20:45:09> STATD.C:230 ad1-1036.bullx: STAT returned error type STAT_LMON_ERROR: Failed to connect BE
<Jul 27 20:45:09> <STAT_BackEnd.C:650> de-registering signal handlers

@lee218llnl
Collaborator

Perhaps it's just a typo in your message, but you said you set "STAT_PE_HOSTNAME" when it should be "STAT_FE_HOSTNAME"; that is, it looks like you put a "P" where it was supposed to be an "F". Regarding the "-d" option, that has a different meaning for the STATD process than it would with stat-cl, and "-d 1" is the correct option for STATD. That aside, can you send the .top file in the stat_results directory and also the debug log files that STAT generated?

@antonl321
Author

I have checked: in my script I have the right variable name; I mistyped it in my message.
Attached are the debug and .top files.
stat-debug.tar.gz

@lee218llnl
Collaborator

lee218llnl commented Jul 28, 2022

Do you happen to have the STAT FE log file too? This would be enabled with -l FE. It looks like this is a different error in the STATD daemons when they are trying to receive their connection info from the STAT FE, and it would help to see where the FE is in the process.

That aside, it looks like there is a potential "fix" in MRNet. Can you modify your var/spack/repos/builtin/packages/mrnet/package.py in Spack and add:

version('hntest', commit='c7d029b03fd6fea932f83c6fe6a2fd9b063956a3')

After that you will have to spack uninstall --dependents mrnet and then when you reinstall STAT in spack, please add "^mrnet@hntest" to your install spec.
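Roughly, the rebuild sequence would be (a sketch; your actual stat spec may carry more variants):

spack uninstall --dependents mrnet
spack install stat ^mrnet@hntest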

@lee218llnl
Collaborator

I should add that with the "hntest" mrnet, please set XPLAT_USE_FQDN=1 in your environment when you run, and please do not specify STAT_FE_HOSTNAME.

@lee218llnl
Collaborator

lee218llnl commented Jul 28, 2022

And when you test this, try at small scale again, perhaps 4 nodes, and add the "-d 2" option to stat-cl.
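Putting it together, the small-scale test would be something like (a sketch; assumes the hntest mrnet was picked up):

export XPLAT_USE_FQDN=1
unset STAT_FE_HOSTNAME    # per the note above, do not set this with the hntest mrnet
stat-cl -d 2 `pgrep -o srun`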

@antonl321
Author

stat logs with -l FE attached.

I have rebuilt mrnet and stat.
stat-cl -d 2 on a small test (4 nodes, 128 MPI ranks) fails; see the attached file (-d-2).
The behaviour at large scale is unchanged, that is, -d 1 hangs. -d 1 also hung on the small run once out of 3.
stat-FE-debug.tar.gz
42875702-d-2.tar.gz

@lee218llnl
Collaborator

In both cases, I still see this in the STAT FE log:

<Jul 29 13:11:33> <STAT_FrontEnd.C:642> STAT returned error type STAT_MRNET_ERROR: MRNet reported a Network error with message: MRNet: Topology error: file has incorrect format

Just to confirm, did you set XPLAT_USE_FQDN=1 in your environment? Also, can you cd to the MRNet installation prefix and grep for that environment variable to make sure you picked up the changes:

[lee218@rzwiz2:spack]$ cd opt/spack/linux-rhel7-broadwell/gcc-8.3.1/mrnet-hntest-kbsbtzcerm7qlrlxankxxhvmzr7mpcgy/
[lee218@rzwiz2:mrnet-hntest-kbsbtzcerm7qlrlxankxxhvmzr7mpcgy]$ grep XPLAT_USE_FQDN -nr *
Binary file bin/mrnet_commnode matches
Binary file bin/mrnet_topgen matches
Binary file lib/libxplat.a.5.0.1 matches
Binary file lib/libxplat.so.5.0.1 matches

@lee218llnl
Collaborator

Also did you get any error output to the terminal when you ran?

@lee218llnl
Collaborator

I noticed in the topology file:

ad2-2051.bullx:0 =>
ad2-2051:0
ad2-2055:0;

Note that ad2-2051 appears twice with the :0 rank. This is because the string comparison thinks the host with .bullx is different from the one without .bullx, and this may cause a conflict in MRNet. I pushed a change to the STAT develop branch that may fix this. Can you uninstall STAT with spack and rebuild with spack install stat@develop... If you see failures, please do the same thing with sending debug logs, topology file, and terminal output. Thanks for being patient with me... I don't have an easy way to test this myself since we don't have any systems with "." in the hostname.

@antonl321
Author

I tested my mrnet installation; it matches the variable as in your example. I launch stat-* from a script which has the line
export XPLAT_USE_FQDN=1, so it should be fine. If possible, stat in debug mode should dump the values of all environment variables of interest, just to be sure that they have the expected values.
For the new stat installation, should I keep mrnet@hntest?

@lee218llnl
Collaborator

STAT doesn't dump the values of env variables, but that is a good idea.

Please keep using mrnet@hntest from here on out. This is what I think will be needed moving forward, and if/when we get it to work, it will hopefully be merged into the MRNet master and then a release.

@antonl321
Author

I have installed the new version. I'm afraid that we are back to the negative value error when using -d 2

+ ssh ab1-3004 /lus/h2resw01/hpcperm/atosla/Tools/spack/opt/spack/linux-rhel8-zen/gcc-8.4.1/stat-develop-gshuszoyutctc5xapk2dvjhv54hezfcg/bin/stat-cl -d 2 -l FE -l BE -l SW -l CP -L /home/atosla/R3/tmp/stat-logs/29932705 426346
STAT started at 2022-07-29-18:09:42
Attaching to job launcher (null):426346 and launching tool daemons...
<Jul 29 18:09:44> <STAT_FrontEnd.C: 2108> STAT returned error type STAT_ARG_ERROR: Not enough resources for specified topology.  Please specify more resources with --nodes and --procs.
<Jul 29 18:09:44> <STAT_FrontEnd.C: 535> STAT returned error type STAT_ARG_ERROR: Failed to create topology file
<Jul 29 18:09:44> <STAT.C: 181> STAT returned error type STAT_ARG_ERROR: Failed to launch MRNet tree()
<Jul 29 18:09:44> <LMON BE API> (ERROR): read_lmonp return a neg value
<Jul 29 18:09:44> <STAT_lmonBackEnd.C:146> ab1-3005.bullx: STAT returned error type STAT_LMON_ERROR: Failed to receive data from FE
<Jul 29 18:09:44> <STATD.C:230> ab1-3005.bullx: STAT returned error type STAT_LMON_ERROR: Failed to connect BE
<Jul 29 18:09:44> <STAT_lmonBackEnd.C:146> ab1-3007.bullx: STAT returned error type STAT_LMON_ERROR: Failed to receive data from FE
<Jul 29 18:09:44> <STATD.C:230> ab1-3007.bullx: STAT returned error type STAT_LMON_ERROR: Failed to connect BE
<Jul 29 18:09:44> <STAT_lmonBackEnd.C:146> ab1-3009.bullx: STAT returned error type STAT_LMON_ERROR: Failed to receive data from FE
<Jul 29 18:09:44> <STATD.C:230> ab1-3009.bullx: STAT returned error type STAT_LMON_ERROR: Failed to connect BE
<Jul 29 18:09:42> <Launchmon> (INFO): Just continued the RM process out of the first trap
srun: error: ab1-3005: task 1: Exited with exit code 12
srun: launch/slurm: _step_signal: Terminating StepId=29932705.5
srun: error: ab1-3004: task 0: Exited with exit code 12
srun: error: ab1-3007: task 2: Exited with exit code 12
srun: error: ab1-3009: task 3: Exited with exit code 12
<Jul 29 18:09:44> <STAT_lmonBackEnd.C:146> ab1-3004.bullx: STAT returned error type STAT_LMON_ERROR: Failed to receive data from FE
<Jul 29 18:09:44> <STATD.C:230> ab1-3004.bullx: STAT returned error type STAT_LMON_ERROR: Failed to connect BE

logs attached
42875786.tar.gz

@lee218llnl
Collaborator

Can you add --procs=4 to your stat-cl command and let me know if that helps?
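That is, roughly (a sketch, reusing the earlier launcher lookup):

stat-cl -d 2 --procs=4 `pgrep -o srun`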

@antonl321
Author

Yes, that fixes the error. The small test is sampled, but the large one hangs as before.

@lee218llnl
Collaborator

can you please send me the debug log files for the large-scale hang case?

@antonl321
Author

This run generated a lot more data in the STATD files. In total I have 7.8 GB of logs, but it hangs nevertheless. I put only one STATD file, plus the other logs, in the attached tarball.
42875792.tar.gz

@lee218llnl
Collaborator

You can reduce the size of the BE logs if you remove the "-l SW" flag. We don't really want to debug stackwalker anyway. That aside, the one STATD log file you did send looks like it gathered samples OK and sent them to the FE. It may not be easy, but can you tail -n 1 the other STATD log files to see if they have a line similar to:

<Jul 29 20:20:28> <STAT_BackEnd.C:1268> sending ackowledgement with tag 103 (first = 100) and return value 0

If you find one that doesn't end with that, can you please send the file?
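A quick way to pick out the stragglers, assuming the STATD logs all sit in one directory (paths are an assumption):

# print the STATD logs whose last line is not the "return value 0" acknowledgement
for f in *.STATD.log; do
  tail -n 1 "$f" | grep -q "return value 0" || echo "$f"
done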

@antonl321
Author

In fact there are about 162 files that have "return value 0" on the last line and about 760 with something else on the last line. I attach one of them.
ad5-4093.bullx.STATD.log

@lee218llnl
Collaborator

Even in that file I see:

<Jul 29 20:20:25> <STAT_BackEnd.C:1268> sending ackowledgement with tag 103 (first = 100) and return value 0

This is the 5th to last line. So this STATD actually gathered samples OK and sent them to the FE. However, it looks like the FE never receives the merged traces. I'm still wondering if there is a lingering STATD. Can I trouble you to run again without "-l SW"? The log files should be a lot smaller (and easier for me to parse); then hopefully it'll be small enough that you can send them all to me and I can poke around instead of asking you to look.

@antonl321
Author

Surprise: without -l SW it collected the first 2 traces, but then it hung.
The logs are still 2.5 MB. Let's see if GitHub takes it. Yes!
42875794.tar.gz

@antonl321
Author

But it hung at the first "Sampling traces..." in a second run without any debug flags. Hmm!
I'm out for today.

@lee218llnl
Collaborator

All of the 880 STATD processes were killed with signal 15 (SIGTERM):

$ grep "STATD intercepted signal 15" -nr * | wc
880 7040 84480

Is it possible that the application exited while STAT was attached and gathering samples? As far as I can see, there is no error in STAT itself, but if the application exits, then SLURM may be killing off the STATD daemons. Or is it possible that you are running at the edge of the memory limit and getting OOM killed? The STATD daemons can consume quite a bit of memory when gathering stack traces with function and line information. You could try running STAT with "-t 1". This will cause STAT to only collect one stack trace with just function frames (no line number info).
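For reference, that reduced run would look something like this (a sketch):

# one sample, function frames only (no line number info), to cut STATD memory use
stat-cl -t 1 `pgrep -o srun`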

@lee218llnl
Collaborator

In looking at the *STAT.log file, it appears that STAT is taking a while to process the third trace. The good news is that you should have 2 good .dot files in the stat_results directory that you can look at. I'm not sure if it is legitimately hanging on the third one or if it is just taking a long time. Did you set a time limit on your job, or, again, did the job perhaps exit in the meantime? If you can tar up the run-specific stat_results directory, that may help too.

@antonl321
Author

I think the job was terminated because I killed the app to launch a new experiment. Unfortunately later tries weren't successful.

I was wondering, does the gdb alternative for stack trace collection work? I saw it mentioned in the documentation as a build option.
A lateral thought: I was able to collect stack traces from the large-scale app in a few minutes using an ssh loop and gstack. It might be useful for the user community if stat-view could be modularized to process gstack or perhaps gdb stack files.
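The ssh loop I used is roughly the following (a sketch; the node-list query, binary name, and output directory reflect my setup and are assumptions):

mkdir -p "$HOME/stacks"
for node in $(scontrol show hostnames "$SLURM_JOB_NODELIST"); do
    # gstack the oldest matching rank on each node and save the trace locally
    ssh "$node" 'gstack $(pgrep -o ifsMASTER)' > "$HOME/stacks/$node.txt" &
done
wait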

@lee218llnl
Collaborator

Using GDB instead of Stackwalker for the daemons won't help in this case. Stackwalker got the traces OK and the daemons successfully sent them to the STAT frontend; it's just that the STAT frontend is churning on the graphs. I'm pretty sure that if you wait long enough the STAT frontend operation would complete, but I couldn't tell you how long it might take. You could attach gdb (or even use STAT's serial attach) to the STAT frontend python process when it is in this operation to make sure it is progressing. For the log files you sent, it looks like you killed it off after 6 minutes. Would you be willing to let it run for a half hour or an hour to see if it completes? The duration of this operation is dependent on the number of nodes/edges in the traces. Actually, another thing that may help speed things up is if you run stat-cl with the "-U" flag. Instead of keeping track of every MPI rank, the edges will just include the total count of ranks that visited that edge, along with the lowest-ranked process that took that call path. Sorry for yet another test idea, but I think we are making progress.

I will also point out that STAT does have the ability to merge certain core files via the core-stack-merge script that is included in the source. It likely won't work out of the box for something like gstack, but currently works for lightweight core files (this was done back in the days of IBM BG/L, BG/P, BG/Q) and it can also use gdb to parse full core files.
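For what it's worth, the -U run would look roughly like this (a sketch):

# -U: record per-edge counts and the lowest-ranked representative instead of full rank lists
stat-cl -U `pgrep -o srun`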

@antonl321
Author

I have tried to sample the large-scale run with stat-gui. The first try came back with a graph in a few minutes, but the second one got stuck in "Sample Stack Traces" for more than 15 minutes. It could be that the sampling of some processes in some peculiar state gets stuck.

Would it be possible and wise to implement a timeout for the stack sampling? Assuming that only a small number of processes are stuck, I think that partial information (with a warning attached) could be useful to the user.

@antonl321
Author

I have tried the -U option and got crashes. But when I enable the debug option it just hangs.
I attach two stack traces for STATD; perhaps you can see something useful in them. As I wrote last week, when hanging, STATD spends most of its time off CPU, at least on the few nodes I have tested.

statd-with-U
statd-with-U-4

@lee218llnl
Collaborator

I was actually hoping to see the STAT frontend process (which is python), not the STATD process. From the debug logs, the STATD daemons are waiting on the STAT frontend.

@lee218llnl
Collaborator

Also, we had previously explored the idea of timeouts, but while a good idea and conceptually simple, it was deemed too difficult to implement. Thanks for the suggestion/request, though.

@antonl321
Author

Well, I think I have fixed it! Or at least I have made some good progress.
I have noticed that stat-gui starts the python process and quite a number of mrnet-hntest-* processes on the login node.
On my system the login nodes have some user restrictions which I think are the likely cause of the hangs.
When I start stat-gui on a compute node (exclusive allocation) I get the graph in very good time, no crashes, twice in a row.
I still need to test more, but this could be a workable solution for the system I'm working on.

@lee218llnl
Collaborator

@antonl321 I think there were a few issues raised here. Have they all been resolved? Were you able to run OK at scale? Please let me know if this is still an open issue or if I may close it.

@antonl321
Author

Hi, unfortunately the freezing of STAT at large scale has come back.
Sometime in early August I found that I could get STAT working with 880-node jobs if I used a compute node to run stat-gui.
I tested that several times and it worked fine.
About a month later, after some changes in the system config, STAT regressed. It locks up for the 880-node runs whether I use a login node or a compute node to run stat-gui. I'm in the same situation with the latest release. I have no idea what happened.
stat-gui gets stuck (more than 10-20 minutes), typically in the collecting stack traces step. For the same job configuration, using gstack I can get all the stacks into a folder in about ~2 minutes.

@lee218llnl
Collaborator

have you built STAT with fgfs? The fgfs variant should be enabled by default in Spack unless you specify ~fgfs in the install command. Since it is stuck on sampling, I'm wondering if it is contention on the file system, which fgfs should alleviate.
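You can check the installed variants with Spack (a sketch; exact spec syntax may vary on your side):

spack find -v stat          # look for +fgfs vs ~fgfs in the installed spec
spack install stat +fgfs    # rebuild with fgfs enabled if it was off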

@antonl321
Author

antonl321 commented Oct 25, 2022 via email

@lee218llnl
Collaborator

Can you gather STAT logs for me? In the "Advanced" tab of the attach dialog in the STAT GUI, you can select "Log Frontend", "Log Backend", "Log CP". Please do not log SW or SWERR at this time because they will generate large log files and at the scale you are running, this is probably not a good idea.

@antonl321
Author

Sure, but it might take a bit of time. It's a bit more difficult to get large runs on my system. Hopefully early next week.

@lee218llnl
Collaborator

one smaller scale test you can run is to launch a job on 4 nodes. Then in the stat-gui attach dialog, go to the "Topology" tab and set "Specify Topology Type" to "depth". Then click on "Modify Topology" and change it to "2". After that go back to the "Attach" tab and see if you can attach to a job as usual. This will test to make sure the MRNet tree is functioning properly on your system. If you can gather me the logs from this small scale test, I can start to debug and see if anything stands out that may cause issues at scale.

@antonl321
Author

I did a run on 16 nodes and depth=2 worked fine. I attach the logs for the same run (default params for attach).
I haven't yet had an opportunity to run a large job.
stat-logs-FE+BE+CP.tar.gz

@lee218llnl
Collaborator

Thanks for the logs. It looks like fgfs is working and should scalably handle reading binaries. I will await logs from the larger run if you are able to produce them.

@antonl321
Author

Hi, I got some logs for a run with 440 nodes. The node on which STAT ran is ad6-175.
As usual, STAT got stuck at this scale. I waited ~10 minutes after the last change to the log files. I also noticed that there are only 421 log files from compute nodes instead of 440.
stat-FE-BE-CP-440-nodes-logs.tar.gz
