Reduce volume memory usage #168

Merged: 18 commits merged from Reduce-Volume-Memory-Usage- into master on Mar 17, 2020
Conversation

@glasserm commented Mar 3, 2020

Reduce memory usage when computing variance normalization on volume files.

@mharms left a comment

Were you able to check the memory performance of this relative to the previous version?

@glasserm commented Mar 4, 2020

I will not have time to mess with this any more, so if you wish to make further adjustments, you can do that after it is merged, or we can stick with what we have, which uses more memory. This approach minimizes the memory footprint as much as is easily possible with respect to the zero-valued voxels in the NIFTI volume. The full array is cleared immediately after use, so it is never held in memory during other steps that may temporarily push memory usage above what the full array alone requires (e.g., icaDim). This was tested with both detrending and hp2000, but is anticipated to only be used with detrending given current recommendations.
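
(A minimal sketch of the clear-after-use idea; the reader function, variable names, and file name below are hypothetical, not the actual code in this PR:)

% Load the full volume time series, build the compact masked copy,
% then free the full array before memory-hungry steps such as icaDim.
fullVol = single(niftiread('rfMRI.nii.gz'));        % hypothetical reader/input
fullData = reshape(fullVol, [], size(fullVol, 4));  % voxels x timepoints
clear fullVol;                                      % free the 4-D copy
mask = max(fullData, [], 2) > min(fullData, [], 2); % voxels with any variation
data = fullData(mask, :);                           % compact copy without empty voxels
clear fullData;                                     % never hold the full array past this point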

@glasserm commented Mar 4, 2020

The above changes are fine with me. Were they tested with hp=0 and hp=2000 (and now I guess hp=-1)?

@mharms force-pushed the Reduce-Volume-Memory-Usage- branch from 5740f07 to 8ad2ca5 on March 4, 2020 17:12
@mharms commented Mar 4, 2020

FYI: The push I just forced only corrects the variable indentation. I was hoping that you could quickly test this with what you already have set up, since I'm not really set up to test under interpreted matlab.

Also, it is still unclear whether you have empirically tested the memory reduction, which I would like established before I work with @hodgem to get these changes into the compiled matlab. I wonder if the matlab profiler would make it easy to assess the memory usage...

@glasserm commented Mar 4, 2020

I can run the basic run-to-completion tests; however, I don't think it is necessary to test the memory reduction (and I don't have a way to do that): it will be reduced considerably.

@glasserm commented Mar 4, 2020

The code runs through the VN function for hp=0 and hp=2000. hp=-1 apparently is not allowed.

@mharms commented Mar 4, 2020

That's good. hp=-1 should be able to run successfully through functionhighpassandvariancenormalize.m itself. But it isn't a supported option in hcp_fix_multi_run.

@coalsont commented Mar 4, 2020

Sticking a /usr/bin/time in front of the matlab call would give you some memory numbers to work with, though I seem to recall they may be scaled strangely (so, should not be compared against things like htop).

@mharms commented Mar 4, 2020

?? /usr/bin/time reports the time used by a command, not the memory...

@coalsont commented Mar 4, 2020

tim@timsdev:~$ /usr/bin/time ls
...
0.00user 0.00system 0:00.00elapsed 71%CPU (0avgtext+0avgdata 1088maxresident)k
0inputs+0outputs (0major+342minor)pagefaults 0swaps

"maxresident" refers to memory. In posix mode, it will only report time, so you might need to give it some options in your environment.

@mharms commented Mar 4, 2020

This is interesting (from CHPC):

[mharms@login02 fixTesting]$ time ls
...
real	0m0.016s
user	0m0.000s
sys	0m0.010s
[mharms@login02 fixTesting]$ /usr/bin/time ls
...
0.00user 0.01system 0:00.02elapsed 52%CPU (0avgtext+0avgdata 920maxresident)k
0inputs+0outputs (0major+278minor)pagefaults 0swaps

The time man page reports:

    Note: some shells (e.g., bash(1)) have a built-in time command that provides less
    functionality than the command described here. To access the real command, you may
    need to specify its pathname (something like /usr/bin/time).

But then I'd expect that which time would report that time is a built-in, yet that is not the case:

[mharms@login02 fixTesting]$ which time
/usr/bin/time

I guess that isn't how which functions under bash?

@coalsont commented Mar 4, 2020

tim@timsdev:~$ type time
time is a shell keyword
tim@timsdev:~$ type which
which is /usr/bin/which
tim@timsdev:~$ type type
type is a shell builtin

@mharms commented Mar 4, 2020

Any particular reason bash doesn't report "built-ins" via which, since most people are (I suspect) much more familiar with which?

@mharms commented Mar 4, 2020

Esp. considering that which does report on aliases...

@coalsont commented Mar 4, 2020

which doesn't tell me about aliases:

tim@timsdev:~$ which ls
/bin/ls
tim@timsdev:~$ type ls
ls is aliased to `ls --color=auto'

As I edited my reply to show, which is not a builtin (at least not on my system), so it doesn't know about the internals of bash.

@coalsont commented Mar 4, 2020

And here is your answer on the chpc:

[...@login01 ~]$ type which
which is aliased to `alias | /usr/bin/which --tty-only --read-alias --show-dot --show-tilde'
[...@login01 ~]$ type type
type is a shell builtin

It comes from here, via /etc/profile:

[...@login01 ~]$ cat /etc/profile.d/which2.sh
# Initialization script for bash and sh

# export AFS if you are in AFS environment
alias which='alias | /usr/bin/which --tty-only --read-alias --show-dot --show-tilde'

@mharms commented Mar 4, 2020

I see that my comment about which was tripped up by the fact that the system itself aliases which as part of a default set of aliases (to include the --read-alias flag):

[mharms@login02 ~]$ which which
alias which='alias | /usr/bin/which --tty-only --read-alias --show-dot --show-tilde'
	/usr/bin/which
[mharms@login02 ~]$ which rm
alias rm='rm -i'
	/bin/rm
[mharms@login02 ~]$ /usr/bin/which rm
/bin/rm

I see that you just made the same discovery...

@coalsont commented Mar 4, 2020

With my change to delete(), it appears to complete successfully (with hp=2000, so it should hit that code).

@coalsont commented Mar 4, 2020

From the master commit this PR started from, here are the outputs of /usr/bin/time on just the functionhighpass... lines:

148.10user 61.93system 1:18.69elapsed 266%CPU (0avgtext+0avgdata 3794780maxresident)k
72inputs+2750040outputs (0major+30161541minor)pagefaults 0swaps

110.79user 41.44system 0:49.44elapsed 307%CPU (0avgtext+0avgdata 3721664maxresident)k
40inputs+1246304outputs (0major+20672548minor)pagefaults 0swaps

Before changing single(zeros(...)) to zeros(..., 'single'):

190.61user 77.33system 1:26.93elapsed 308%CPU (0avgtext+0avgdata 2812540maxresident)k
72inputs+2750040outputs (0major+38105443minor)pagefaults 0swaps

104.78user 43.89system 0:44.69elapsed 332%CPU (0avgtext+0avgdata 2737868maxresident)k
40inputs+1246304outputs (0major+19979866minor)pagefaults 0swaps

So, those edits saved a fair amount of memory used by matlab in that function. Strange that it took more time, maybe from the stdev computation?

After changing to zeros(..., 'single'):

190.40user 77.16system 1:25.92elapsed 311%CPU (0avgtext+0avgdata 2812632maxresident)k
72inputs+2750040outputs (0major+37486119minor)pagefaults 0swaps

107.54user 39.42system 0:44.10elapsed 333%CPU (0avgtext+0avgdata 2737484maxresident)k
32inputs+1246304outputs (0major+19668406minor)pagefaults 0swaps

So, not really any change to the maximum there.
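
(For reference, a sketch of the difference between the two allocation forms being compared; n is a made-up size:)

n = 1e8;
a = single(zeros(n, 1));   % builds a double temporary (8*n bytes) first,
                           % then converts it to single
b = zeros(n, 1, 'single'); % allocates single (4*n bytes) directly,
                           % with no double-precision temporary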

@coalsont commented:

My version, with range and masking-bypass:

First pass:

156.81user 38.57system 3:20.25elapsed 97%CPU (0avgtext+0avgdata 1577072maxresident)k
171952inputs+2750040outputs (420major+25930748minor)pagefaults 0swaps
173.58user 46.03system 4:54.48elapsed 74%CPU (0avgtext+0avgdata 1492044maxresident)k
503368inputs+2750040outputs (1831major+28643473minor)pagefaults 0swaps

Second run had more major pagefaults, so ignore the memory difference between the two.

Concatenated:

108.39user 25.36system 4:28.90elapsed 49%CPU (0avgtext+0avgdata 2687976maxresident)k
234824inputs+2487416outputs (1138major+15806211minor)pagefaults 0swaps

Some major pagefaults, so expect another 100 MB or so... I still have /usr/bin/time on melodic, etc., so I can compare to those numbers. We used std in the fix_3_clean changes too, so that needs editing.
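
(For illustration, a sketch of the range-style mask versus the std-based mask it replaced; data is a hypothetical voxels-by-timepoints single array, and range(data, 2) from the Statistics Toolbox computes the same max-minus-min quantity:)

% std-based mask: typically materializes a mean-subtracted temporary
% the same size as data
mask = std(data, [], 2) > 0;
% range-based mask: only a running max and min along the time dimension
mask = max(data, [], 2) > min(data, [], 2);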

On another note, does functionhighpass... really need to spit out mountains of numbers? It makes it slightly more difficult to find the timing results in the output.

@glasserm commented:

The numbers tell you if the golden search is behaving sensibly.

@coalsont commented:

Do you mean if the golden mean search has a bug in it, or do you mean if there isn't anything particularly good for the search to find? If it is only for debugging, then it can be commented out when not debugging.

@glasserm commented:

If there is something wrong with the data, it can floor or ceil, which is bad.

@coalsont commented Mar 11, 2020

That sounds like you can just print the end result and the initial endpoints, and that will tell you whether what it found was an initial endpoint.
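
(A sketch of that reduced logging; lo, hi, and x are hypothetical names for the initial endpoints and the search result:)

% print only the initial endpoints and the final result of the search;
% a result at an endpoint flags a floored/ceiled solution
fprintf('golden search: lo=%g hi=%g result=%g\n', lo, hi, x);
if x <= lo || x >= hi
    fprintf('WARNING: search ended at an initial endpoint\n');
end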

@coalsont commented:

So, for my 2-run test, the concatenated highpass is now only barely above the melodic memory usage, and fix_3_clean (fix -f) is now in line with fix -c:

melodic:

659.51user 6.62system 21:21.15elapsed 51%CPU (0avgtext+0avgdata 2523728maxresident)k
38480inputs+923472outputs (204major+5845224minor)pagefaults 0swaps

fix -f (LOTS of major pagefaults, so this number could be wrong, but it mostly matches a much-earlier run of fix -f):

872.93user 70.86system 56:32.20elapsed 27%CPU (0avgtext+0avgdata 2381880maxresident)k
3677648inputs+35472outputs (25280major+18657473minor)pagefaults 0swaps

fix -c:

29.95user 0.90system 0:39.67elapsed 77%CPU (0avgtext+0avgdata 2952384maxresident)k
2261832inputs+40outputs (73major+766193minor)pagefaults 0swaps

fix -a:

11.04user 4.06system 1:40.66elapsed 15%CPU (0avgtext+0avgdata 2884456maxresident)k
154048inputs+2748856outputs (324major+1970588minor)pagefaults 0swaps

So, my versions appear to peak at only slightly more memory than melodic (which we can't edit easily). Emailing them to Mike for change resolution.

Review comment on lines 27 to 28:
% Apply mask, reusing 'data' variable for memory efficiency
data = data(mask,:);
@coalsont commented:

Input variables don't count as used memory until they are assigned to, nor does a trivial copy (no indexing expression). I prefer not overwriting variables with modified versions, and in this case and data_detrend, it shouldn't improve memory usage versus my version.
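
(A minimal illustration of the copy-on-write behavior being described; the function and variable names are made up:)

function out = maskedcopy(in)   % 'in' still shares memory with the caller's array
    tmp = in;                   % trivial copy: no new allocation happens here
    mask = max(in, [], 2) > 0;  % small logical vector
    out = in(mask, :);          % indexing expression: this is where a real copy is made
    % assigning into 'in' or 'tmp' would likewise trigger a full copy
end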

@mharms replied:
I'll add the usemask stuff, but I don't see the disadvantage of overwriting variables with a modified version, especially since it is a misnomer to name the variable data_detrend when whether it is actually detrended depends on the value of DEMDT.

@mharms commented Mar 12, 2020

Ok, see the latest commits, which roll in Tim's changes. I restored Origdata as the input variable name, but use (and reuse) data for the rest of the code. That seems clean to me.

@glasserm: Given that icaDim isn't actually the memory limiting step, and given that we need to return to using the non-VN'ed data over and over, it doesn't seem worth it to me to bother with deleting data and recomputing it on the fly from the SVD within the while loop. Besides, in this version, the full time series of data_trend is only formed at the front end for the DEMDT==1 condition (which is the only condition for which we actually need to preserve that data throughout the while looping), so for the DEMDT=={0,-1} conditions, this version is actually a wash memory-wise with your earlier 6edc725 version.

Update: I realized that data can/should nonetheless be deleted once we are actually done with it (outside the while loop), so I just added that.

@mharms commented Mar 12, 2020

@coalsont: If you could run this version, that would be appreciated!

@coalsont commented:

For f1b2c5f:
First pass:

152.06user 48.16system 5:57.56elapsed 55%CPU (0avgtext+0avgdata 1574704maxresident)k
301160inputs+2750040outputs (1034major+27044101minor)pagefaults 0swaps
211.18user 58.82system 2:33.85elapsed 175%CPU (0avgtext+0avgdata 1575388maxresident)k
26400inputs+2750040outputs (145major+38181696minor)pagefaults 0swaps

Concatenated VN:

90.43user 34.90system 1:54.81elapsed 109%CPU (0avgtext+0avgdata 2751528maxresident)k
327000inputs+2487416outputs (1050major+14559735minor)pagefaults 0swaps

@glasserm commented:

Let's get this merged, compiled, and running on the CHPC.

@mharms commented Mar 12, 2020

@glasserm: Do you have scripts in place to quickly test all three DEMDT conditions? Not for memory, but just that they run successfully?

@glasserm commented:

It runs in all three conditions.

@mharms commented Mar 12, 2020

Fabulous! @coalsont: Anything else you want to change (or are still reviewing), before I request this gets recompiled?

@coalsont commented:

The excessive amount of text output isn't that important, and is easy enough to comment out when it gets annoying. I'd say go ahead.

@mharms commented Mar 13, 2020

ARGH!!! There is no doubt that we've improved the memory utilization of interpreted matlab. We've been assuming that any such gains there would readily translate to the compiled matlab as well. But, what if that isn't actually the case?!?

I just ran a test of the compiled matlab, and the results of /usr/bin/time on the 1st pass VN from the latest compile (25f3e1f) are no different than what we had from the earlier compile (624c8d1) from the code that added the masking to functionhighpass... but accomplished it via std rather than range. Which is to say, the maxresident values were ~ 4.4 x the size of the input (uncompressed) volume time series, rather than ~ 2.5 x as @coalsont achieved in his testing with interpreted matlab following the switch to 'range'.

I've checked with @hodgem and he is confident that he recompiled from the tip of this PR (although to be absolutely 100% sure, it might be worth adding an fprintf to the code, recompiling, and confirming that we see the expected output in the log).

The results above would be consistent with the hypothesis that in compiled matlab, the range operation is basically the same as std in terms of memory.

@mharms commented Mar 13, 2020

Ok, another wrinkle. The 2nd pass VN stage just completed from my compiled matlab run, and the results are even more confusing (to me at least).

The output of the /usr/bin/time that I prepended to the functionhighpass... call was:

1315.55user 117.50system 32:46.33elapsed 72%CPU (0avgtext+0avgdata 38761832maxresident)k
1338648inputs+22320072outputs (832major+6527444minor)pagefaults 0swaps

i.e., 39 GB, which represents no improvement relative to the version that created the spatial mask using std.
However, the reported Resource usage on CHPC for the overall job was:

Resources: cput=00:31:43,vmem=29324464kb,walltime=00:32:49,mem=23844428kb,energy_used=0

i.e., mem of only 24 GB. This is the biggest discrepancy between the Resources mem value and the /usr/bin/time maxresident value that I've seen.
@coalsont Any insight as to what might be going on?

@coalsont commented:

/usr/bin/time should measure the true memory peak, while other resource tracking may be polling the process, so if the peak was extremely short...

I would have expected compiled matlab to be no worse than interpreted; there doesn't seem to be any good reason their compiler should be dumber (though maybe the compiler exists for bad reasons in the first place...).

@glasserm commented:

I agree we should add an echo statement and retry.

@mharms commented Mar 15, 2020

I ran the 2nd pass VN stage two more times using the compiled matlab, just to see if my previously reported result (with the inconsistent reported peak memory: #168 (comment)) was repeatable; this time, in each case, the reported mem value from TORQUE matched the maxresident value reported by /usr/bin/time on the functionhighpass... call:

Resources: cput=02:19:43,vmem=43890700kb,walltime=02:24:14,mem=37181008kb,energy_used=0

1153.64user 6028.97system 2:24:11elapsed 83%CPU (0avgtext+0avgdata 38652224maxresident)k
1865120inputs+26811104outputs (4323major+31476466minor)pagefaults 0swaps

Resources: cput=06:36:33,vmem=43824136kb,walltime=06:40:50,mem=37187800kb,energy_used=0

1183.44user 16265.38system 6:40:45elapsed 72%CPU (0avgtext+0avgdata 38761520maxresident)k
1900048inputs+23573592outputs (9039major+19870045minor)pagefaults 0swaps

Of note, in each of these other two instances, the duration of the job on the node was much longer than in the earlier test (which took only 33 min), so the longer duration of the latest two tests may have given the TORQUE polling process the opportunity to catch the memory peak (consistent with @coalsont's hypothesis).

I'll ask @hodgem to do a recompile after I add an fprintf to the matlab code, just to be 100% sure that the recompile is hitting the intended code. But at this point, it sure looks like compiled matlab is using memory that interpreted matlab is optimized to not need.

@mharms force-pushed the Reduce-Volume-Memory-Usage- branch from 25f3e1f to ca736b1 on March 16, 2020 15:59
@mharms commented Mar 17, 2020

I have confirmed that the expected output of the fprintf statements that I added is showing up in the log files when using compiled matlab mode for the latest compile. But the memory profile is as before. I'm therefore concluding that the improved (lower) memory profile that @coalsont achieved using interpreted matlab unfortunately does not propagate to the compiled matlab.

We do still have a small reduction (~10% in compiled matlab mode in the 2nd pass VN stage), and of course the more efficient memory usage is clearly beneficial for interpreted matlab. So, I suggest we merge this PR and tag a new release (v4.2) that we can use in the QuNex container for ICAFIX.

@glasserm merged commit 49830a6 into master on Mar 17, 2020
@glasserm deleted the Reduce-Volume-Memory-Usage- branch on March 17, 2020 08:26