
RELION-3.1 Pre-read all particles into RAM #514

Closed

kaoweichun opened this issue Oct 15, 2019 · 41 comments

@kaoweichun

kaoweichun commented Oct 15, 2019

Hello,

I used RELION 3.1-beta (commit a6aaa5) to repeat a previously completed 3D auto-refine from RELION 3.0.7 without changing any settings. I ran it on a single machine (specs below) and, as always, enabled Pre-read all particles into RAM. The typical behaviour on 3.0.7 is that the master MPI process uses 250 GB RAM and each slave MPI process uses some 20 GB. In RELION 3.1, however, upon starting refinement the master MPI process used up all available RAM until mpirun crashed, before the initial noise spectra were even estimated (update: each slave MPI process used only around 4 GB RAM in this stage).

Having observed this RAM usage behaviour in RELION 3.1, I simply disabled Pre-read all particles into RAM; that avoided the problem and the refinement could proceed, but I am afraid there is an issue with memory usage.

Thanks,

WCK


Brief computer specs: 48 cores (hyper-threading enabled) / 384 GB RAM / 2 TB SSD / Open MPI 3.0.2 / 4× GTX 1080 Ti / SGE / CentOS 7.6 / CUDA 10.1

@biochem-fan (Member)

Are you sure you used the same Use parallel disc I/O? setting?

@kaoweichun (Author)

kaoweichun commented Oct 15, 2019

Yes, Use parallel disc I/O? is always off. Otherwise each slave MPI process would use the same amount of RAM as the master MPI process does, I suppose?

@biochem-fan (Member)

This is puzzling... We didn't change the code related to pre-reading images into RAM. Does this happen with other datasets as well?

@kaoweichun (Author)

kaoweichun commented Oct 16, 2019

Yes, it happened with other datasets and on another computer as well. That computer has even more RAM (768 GB) and RELION 3.1 still attempted to fill it with the master MPI process (each slave MPI process uses roughly < 10 GB RAM). It uses Open MPI 2.1.1 and 2× Tesla M60, so I would say the issue is somehow independent of the systems I am using(?).

@ashkumatov

Hi, I am having a similar issue. I monitored the RAM usage now (RELION 3.1) and before (pre-3.1 versions). In 3.1b the particles are read into RAM and then, at the first step (maximization), the RAM is suddenly filled to its maximum and there is a long wait. In subsequent steps, after RELION prints Maximization is done in XX seconds, the RAM is again filled to the maximum (actually more than NCPU × 'du -hs Extract/jobXXX') and there is a 5-10 minute wait before RELION moves to the next iteration. With the same dataset read from disk there is no waiting. In 3.0-stable there was no issue like that.

@biochem-fan (Member)

biochem-fan commented Oct 31, 2019

What are the box size, pixel size and resolution?

Are you familiar with gdb? Can you investigate what RELION is doing during the "5-10 min waiting"? Check the process ID of one of the MPI processes (not the master), attach gdb with gdb -p processID and run bt (backtrace).

reading from disk

Is this reading from the scratch disk, or from the original location?

@ashkumatov

What are the box size, pixel size and resolution?
It's never been an issue before, but OK: after 3× decimation it's 200 px at 0.7 × 3 Å/px, and this is a 2D classification. Exactly the same happens during the subsequent 3D classification.

Can you investigate what RELION is doing during "5-10 min waiting"?
Attaching to process 38416
[New LWP 38420]
[New LWP 38423]
[New LWP 38575]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007f5fb204d093 in ?? () from /usr/lib/x86_64-linux-gnu/libopen-pal.so.20

Is this reading from the scratch disk, or from the original location?
I don't think it's an issue, but essentially I move the extracted job to /ssd and then just create a softlink in the RELION directory - this works fine in RELION 3.0-stable.

@biochem-fan (Member)

Can you do bt (backtrace) in GDB?

@ashkumatov

As in 'gdb bt -p XX'?
I'm not really familiar with gdb...

@ashkumatov

By the way, I rolled back to RELION 3.0-stable on the same GPU station - it works flawlessly.

@biochem-fan (Member)

After gdb -p XX, it will show a prompt (gdb). Please type bt there.

@ashkumatov

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007f7e28d06bf9 in __GI___poll (fds=0x55a5b2dc5360, nfds=16, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
29 ../sysdeps/unix/sysv/linux/poll.c: No such file or directory.
(gdb)
(gdb) bt
#0 0x00007f7e28d06bf9 in __GI___poll (fds=0x55a5b2dc5360, nfds=16, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
#1 0x00007f7e29270403 in ?? () from /usr/lib/x86_64-linux-gnu/libopen-pal.so.20
#2 0x00007f7e2926760b in opal_libevent2022_event_base_loop () from /usr/lib/x86_64-linux-gnu/libopen-pal.so.20
#3 0x000055a5b0e984a3 in ?? ()
#4 0x000055a5b0e96aea in ?? ()
#5 0x00007f7e28c13b97 in __libc_start_main (main=0x55a5b0e96aca, argc=43, argv=0x7ffd6eedc358, init=<optimized out>,
fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffd6eedc348) at ../csu/libc-start.c:310
#6 0x000055a5b0e969ea in ?? ()

@biochem-fan (Member)

Thanks. Can you try the same with other MPI processes?

@ashkumatov

Sure.
My command: $(which relion_refine_mpi) --o Class2D/job037/run --i Extract/job019/particles.star --dont_combine_weights_via_disc --preread_images --pool 300 --pad 2 --ctf --iter 25 --tau2_fudge 1 --particle_diameter 280 --K 40 --flatten_solvent --zero_mask --strict_highres_exp 8 --oversampling 1 --psi_step 12 --offset_range 20 --offset_step 4 --norm --scale --j 1 --gpu "0:1" --pipeline_control Class2D/job037/
After it executes, it gets stuck as described above. Here are the backtraces from each process:

root@jekyll:/home# for i in $(ps -aux | grep emuser | grep relion_refine | awk '{print $2}'); do echo $i; done
6459
6468
6469
6470
root@jekyll:/home# gdb -p 6459
GNU gdb (Ubuntu 8.2-0ubuntu1~18.04) 8.2
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
http://www.gnu.org/software/gdb/bugs/.
Find the GDB manual and other documentation resources online at:
http://www.gnu.org/software/gdb/documentation/.

For help, type "help".
Type "apropos word" to search for commands related to "word".
Attaching to process 6459
[New LWP 6464]
[New LWP 6465]
[New LWP 6466]
[New LWP 6467]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007f72ce84dbf9 in __GI___poll (fds=0x5610720c5360, nfds=16, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
29 ../sysdeps/unix/sysv/linux/poll.c: No such file or directory.
(gdb) bt
#0 0x00007f72ce84dbf9 in __GI___poll (fds=0x5610720c5360, nfds=16, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
#1 0x00007f72cedb7403 in ?? () from /usr/lib/x86_64-linux-gnu/libopen-pal.so.20
#2 0x00007f72cedae60b in opal_libevent2022_event_base_loop () from /usr/lib/x86_64-linux-gnu/libopen-pal.so.20
#3 0x00005610706d14a3 in ?? ()
#4 0x00005610706cfaea in ?? ()
#5 0x00007f72ce75ab97 in __libc_start_main (main=0x5610706cfaca, argc=43, argv=0x7ffe718cd4e8, init=<optimized out>,
fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffe718cd4d8) at ../csu/libc-start.c:310
#6 0x00005610706cf9ea in ?? ()
(gdb) quit
A debugging session is active.

Inferior 1 [process 6459] will be detached.

Quit anyway? (y or n) y
Detaching from program: /usr/bin/orterun, process 6459
[Inferior 1 (process 6459) detached]
root@jekyll:/home# gdb -p 6468
Attaching to process 6468
[New LWP 6471]
[New LWP 6472]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007f70486024b9 in __brk (addr=0x56499416a000) at ../sysdeps/unix/sysv/linux/x86_64/brk.c:31
31 ../sysdeps/unix/sysv/linux/x86_64/brk.c: No such file or directory.
(gdb) bt
#0 0x00007f70486024b9 in __brk (addr=0x56499416a000) at ../sysdeps/unix/sysv/linux/x86_64/brk.c:31
#1 0x00007f7048602591 in __GI___sbrk (increment=159744) at sbrk.c:56
#2 0x00007f7048587199 in __GI___default_morecore (increment=<optimized out>) at morecore.c:47
#3 0x00007f704857fdac in sysmalloc (nb=nb@entry=160016, av=av@entry=0x7f70488d7c40 <main_arena>) at malloc.c:2489
#4 0x00007f7048580ff0 in _int_malloc (av=av@entry=0x7f70488d7c40 <main_arena>, bytes=bytes@entry=160000) at malloc.c:4125
#5 0x00007f70485832ed in __GI___libc_malloc (bytes=160000) at malloc.c:3065
#6 0x00007f7049155258 in operator new(unsigned long) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#7 0x0000563f849532c3 in MultidimArray::resize(long, long, long, long) ()
#8 0x0000563f84953ce3 in ExpImage::ExpImage(ExpImage const&) ()
#9 0x0000563f849548d0 in std::vector<ExpImage, std::allocator<ExpImage> >::operator=(std::vector<ExpImage, std::allocator<ExpImage> > const&) ()
#10 0x0000563f84960101 in void std::__stable_sort<__gnu_cxx::__normal_iterator<ExpParticle*, std::vector<ExpParticle, std::allocator<ExpParticle> > >, __gnu_cxx::__ops::_Iter_comp_iter<bool (*)(ExpParticle, ExpParticle)> >(__gnu_cxx::__normal_iterator<ExpParticle*, std::vector<ExpParticle, std::allocator<ExpParticle> > >, __gnu_cxx::__normal_iterator<ExpParticle*, std::vector<ExpParticle, std::allocator<ExpParticle> > >, __gnu_cxx::__ops::_Iter_comp_iter<bool (*)(ExpParticle, ExpParticle)>) ()
#11 0x0000563f8494fd99 in Experiment::randomiseParticlesOrder(int, bool, bool) ()
#12 0x0000563f848cc14b in MlOptimiserMpi::iterate() ()
#13 0x0000563f848895a7 in main ()
(gdb) quit
A debugging session is active.

Inferior 1 [process 6468] will be detached.

Quit anyway? (y or n) y
Detaching from program: /home/software/relion/git-relion-3.1_beta/build-relion3.1_beta-20191025_cu92/bin/relion_refine_mpi, process 6468
[Inferior 1 (process 6468) detached]
root@jekyll:/home# gdb -p 6469
Attaching to process 6469
[New LWP 6475]
[New LWP 6476]
[New LWP 6498]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
__memmove_avx_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:249
249 ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: No such file or directory.
(gdb) bt
#0 __memmove_avx_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:249
#1 0x000055d22dfd1a65 in std::vector<ExpImage, std::allocator<ExpImage> >::operator=(std::vector<ExpImage, std::allocator<ExpImage> > const&) ()
#2 0x000055d22dfdb497 in __gnu_cxx::__normal_iterator<ExpParticle*, std::vector<ExpParticle, std::allocator<ExpParticle> > > std::__move_merge<ExpParticle*, __gnu_cxx::__normal_iterator<ExpParticle*, std::vector<ExpParticle, std::allocator<ExpParticle> > >, __gnu_cxx::__ops::_Iter_comp_iter<bool (*)(ExpParticle, ExpParticle)> >(ExpParticle*, ExpParticle*, ExpParticle*, ExpParticle*, __gnu_cxx::__normal_iterator<ExpParticle*, std::vector<ExpParticle, std::allocator<ExpParticle> > >, __gnu_cxx::__ops::_Iter_comp_iter<bool (*)(ExpParticle, ExpParticle)>) ()
#3 0x000055d22dfdc036 in void std::__merge_sort_with_buffer<__gnu_cxx::__normal_iterator<ExpParticle*, std::vector<ExpParticle, std::allocator<ExpParticle> > >, ExpParticle*, __gnu_cxx::__ops::_Iter_comp_iter<bool (*)(ExpParticle, ExpParticle)> >(__gnu_cxx::__normal_iterator<ExpParticle*, std::vector<ExpParticle, std::allocator<ExpParticle> > >, __gnu_cxx::__normal_iterator<ExpParticle*, std::vector<ExpParticle, std::allocator<ExpParticle> > >, ExpParticle*, __gnu_cxx::__ops::_Iter_comp_iter<bool (*)(ExpParticle, ExpParticle)>) ()
#4 0x000055d22dfdcf5b in void std::__stable_sort_adaptive<__gnu_cxx::__normal_iterator<ExpParticle*, std::vector<ExpParticle, std::allocator<ExpParticle> > >, ExpParticle*, long, __gnu_cxx::__ops::_Iter_comp_iter<bool (*)(ExpParticle, ExpParticle)> >(__gnu_cxx::__normal_iterator<ExpParticle*, std::vector<ExpParticle, std::allocator<ExpParticle> > >, __gnu_cxx::__normal_iterator<ExpParticle*, std::vector<ExpParticle, std::allocator<ExpParticle> > >, ExpParticle*, long, __gnu_cxx::__ops::_Iter_comp_iter<bool (*)(ExpParticle, ExpParticle)>) ()
#5 0x000055d22dfdd15a in void std::__stable_sort<__gnu_cxx::__normal_iterator<ExpParticle*, std::vector<ExpParticle, std::allocator<ExpParticle> > >, __gnu_cxx::__ops::_Iter_comp_iter<bool (*)(ExpParticle, ExpParticle)> >(__gnu_cxx::__normal_iterator<ExpParticle*, std::vector<ExpParticle, std::allocator<ExpParticle> > >, __gnu_cxx::__normal_iterator<ExpParticle*, std::vector<ExpParticle, std::allocator<ExpParticle> > >, __gnu_cxx::__ops::_Iter_comp_iter<bool (*)(ExpParticle, ExpParticle)>) ()
#6 0x000055d22dfccd99 in Experiment::randomiseParticlesOrder(int, bool, bool) ()
#7 0x000055d22df4914b in MlOptimiserMpi::iterate() ()
#8 0x000055d22df065a7 in main ()
(gdb) quit
A debugging session is active.

Inferior 1 [process 6469] will be detached.

Quit anyway? (y or n) y
Detaching from program: /home/software/relion/git-relion-3.1_beta/build-relion3.1_beta-20191025_cu92/bin/relion_refine_mpi, process 6469
[Inferior 1 (process 6469) detached]
root@jekyll:/home# gdb -p 6470
Attaching to process 6470
[New LWP 6473]
[New LWP 6474]
[New LWP 6499]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
__memmove_avx_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:249
249 ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: No such file or directory.
(gdb) bt
#0 __memmove_avx_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:249
#1 0x00005643a715aa65 in std::vector<ExpImage, std::allocator<ExpImage> >::operator=(std::vector<ExpImage, std::allocator<ExpImage> > const&) ()
#2 0x00005643a7164b29 in void std::__merge_sort_with_buffer<__gnu_cxx::__normal_iterator<ExpParticle*, std::vector<ExpParticle, std::allocator<ExpParticle> > >, ExpParticle*, __gnu_cxx::__ops::_Iter_comp_iter<bool (*)(ExpParticle, ExpParticle)> >(__gnu_cxx::__normal_iterator<ExpParticle*, std::vector<ExpParticle, std::allocator<ExpParticle> > >, __gnu_cxx::__normal_iterator<ExpParticle*, std::vector<ExpParticle, std::allocator<ExpParticle> > >, ExpParticle*, __gnu_cxx::__ops::_Iter_comp_iter<bool (*)(ExpParticle, ExpParticle)>) ()
#3 0x00005643a7165f5b in void std::__stable_sort_adaptive<__gnu_cxx::__normal_iterator<ExpParticle*, std::vector<ExpParticle, std::allocator<ExpParticle> > >, ExpParticle*, long, __gnu_cxx::__ops::_Iter_comp_iter<bool (*)(ExpParticle, ExpParticle)> >(__gnu_cxx::__normal_iterator<ExpParticle*, std::vector<ExpParticle, std::allocator<ExpParticle> > >, __gnu_cxx::__normal_iterator<ExpParticle*, std::vector<ExpParticle, std::allocator<ExpParticle> > >, ExpParticle*, long, __gnu_cxx::__ops::_Iter_comp_iter<bool (*)(ExpParticle, ExpParticle)>) ()
#4 0x00005643a716615a in void std::__stable_sort<__gnu_cxx::__normal_iterator<ExpParticle*, std::vector<ExpParticle, std::allocator<ExpParticle> > >, __gnu_cxx::__ops::_Iter_comp_iter<bool (*)(ExpParticle, ExpParticle)> >(__gnu_cxx::__normal_iterator<ExpParticle*, std::vector<ExpParticle, std::allocator<ExpParticle> > >, __gnu_cxx::__normal_iterator<ExpParticle*, std::vector<ExpParticle, std::allocator<ExpParticle> > >, __gnu_cxx::__ops::_Iter_comp_iter<bool (*)(ExpParticle, ExpParticle)>) ()
#5 0x00005643a7155d99 in Experiment::randomiseParticlesOrder(int, bool, bool) ()
#6 0x00005643a70d214b in MlOptimiserMpi::iterate() ()
#7 0x00005643a708f5a7 in main ()
(gdb) quit
A debugging session is active.

Inferior 1 [process 6470] will be detached.

Quit anyway? (y or n) y
Detaching from program: /home/software/relion/git-relion-3.1_beta/build-relion3.1_beta-20191025_cu92/bin/relion_refine_mpi, process 6470
[Inferior 1 (process 6470) detached]

@biochem-fan (Member)

Thanks. This is very useful.

Another question: how many particles do you have?

@ashkumatov

Another question: how many particles do you have?
A moderate amount: 148k.

@biochem-fan (Member)

OK, I think I understand what is happening.

How many optics groups do you have?
If you have only one: does --random_seed 0 make it faster?

@ashkumatov

ashkumatov commented Oct 31, 2019

# version 30001

data_optics

loop_ 
_rlnOpticsGroupName #1 
_rlnOpticsGroup #2 
_rlnMicrographOriginalPixelSize #3 
_rlnVoltage #4 
_rlnSphericalAberration #5 
_rlnAmplitudeContrast #6 
_rlnImagePixelSize #7 
_rlnImageSize #8 
_rlnImageDimensionality #9 
opticsGroup1            1     0.784000   300.000000     2.550000     0.100000     2.352000          200            2 
# version 30001

data_particles

loop_ 

@ashkumatov

--random_seed 0
I will check later. But if optics groups are the cause, wouldn't I have the same issue when particles are not read into RAM?

@biochem-fan (Member)

biochem-fan commented Oct 31, 2019

The problem seems to be the sorting of particles by optics group:
https://github.com/3dem/relion/blob/ver3.1/src/exp_model.cpp#L394
When particles are pre-read into RAM, the ExpImage objects become larger, take more time to copy, and this might lead to memory fragmentation.

This sorting was not present in RELION 3.0. --random_seed 0 prevents calls to this function.
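To illustrate the cost being described here, the following is a minimal, self-contained sketch, not RELION code: HeavyParticle, byOpticsGroup and all sizes are hypothetical stand-ins for ExpParticle/ExpImage. It shows how stable-sorting elements that can only be copied duplicates the in-RAM image buffer on every shuffle, which is exactly the pattern visible in the backtraces above (ExpImage copy construction inside std::__stable_sort).

```cpp
// Minimal sketch (hypothetical types, not RELION code) of the cost shown
// in the backtraces above: std::stable_sort deep-copying image payloads.
#include <algorithm>
#include <cstdio>
#include <vector>

struct HeavyParticle {                 // stand-in for ExpParticle
    int optics_group = 0;              // hypothetical sort key
    std::vector<float> pixels;         // pre-read image data

    HeavyParticle() = default;
    // A user-declared copy constructor suppresses the implicit move
    // operations, mimicking a class written before C++11: every element
    // shuffle in the sort now deep-copies the pixel buffer.
    HeavyParticle(const HeavyParticle& o)
        : optics_group(o.optics_group), pixels(o.pixels) {}
    HeavyParticle& operator=(const HeavyParticle& o) {
        optics_group = o.optics_group;
        pixels = o.pixels;
        return *this;
    }
};

static bool byOpticsGroup(const HeavyParticle& a, const HeavyParticle& b) {
    return a.optics_group < b.optics_group;
}

int main() {
    std::vector<HeavyParticle> particles(1000);
    for (std::size_t i = 0; i < particles.size(); ++i) {
        particles[i].optics_group = static_cast<int>(i % 4);
        particles[i].pixels.assign(200 * 200, 0.0f);  // ~160 kB per particle
    }
    // std::stable_sort allocates a temporary buffer of up to N elements;
    // with copy-only elements, each transfer into and out of that buffer
    // duplicates the pixel data, so peak memory grows well beyond the
    // pre-read data itself and the heap can fragment.
    std::stable_sort(particles.begin(), particles.end(), byOpticsGroup);
    std::printf("sorted %zu particles\n", particles.size());
    return 0;
}
```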

@ashkumatov

I see. Thanks!

@biochem-fan (Member)

I made an improvement to the code; can you test it without setting --random_seed 0?

@biochem-fan (Member)

The latest version on the repository should fix this issue. If not, please reopen this issue.

@ashkumatov

In the latest RELION-3.1 (commit 9d7525), when reading particles into RAM, one still has to specify --random_seed 0, otherwise it eats up all the RAM and stalls. If reading from disk, the behaviour is normal.

@biochem-fan (Member)

biochem-fan commented Nov 15, 2019

Although commit 76fa3d2 reduced the memory usage, RELION 3.1 still needs more space and more operations than 3.0 because of optics groups. I don't think we can reduce it further.

If you have plenty of RAM, you can make a RAM disk and use it as a scratch space.

@biochem-fan (Member)

biochem-fan commented Nov 15, 2019

Note to self:

  • Cannot reproduce the huge memory consumption locally; although usage is larger than in 3.0, the difference is not huge.
  • Is this compiler-dependent?
  • Can we improve performance by making ExpParticle, Image, MultidimArray and MetaDataTable move-constructible and move-assignable for efficient sorting? This would be a HUGE amount of work...
  • A locally reproducible case is necessary for optimization.

@biochem-fan biochem-fan reopened this Nov 15, 2019
@biochem-fan (Member)

@nym2834610, @ashkumatov What is the number of particles? Which compiler did you use?

@nym2834610

~200k particles at 1 Å/px, box size 400. We use the bash shell.

@biochem-fan (Member)

What is the compiler, not the shell?

@ashkumatov

Actually I tried with 6k particles, which is about 10 GB. I ran it on 4 GPUs, using 5 CPUs in total. If I don't use the flag --random_seed 0, RAM consumption goes up to 240 GB, and at the peak-consumption steps there is a really long wait. If I use the flag, the behaviour is normal.

I will check a compiler version on Monday.

@biochem-fan (Member)

Is the memory consumption more or less proportional to the number of particles? How much does it use with 3k particles, for example?

@ashkumatov

It loads a proportional amount into RAM, and then at certain steps it goes up to the maximum RAM available.

@biochem-fan (Member)

Does it use all the RAM and take very long even with, say, 100 particles?

@ashkumatov

Thanks for your comment! It actually helped me find the problem: I typically compile two versions of RELION - one with CUDA 8.0 (to be able to run GCTF) and one with CUDA 9.2 - which require different versions of the C compiler. Basically, I forgot to switch back to the newer compiler when compiling with CUDA 9.2.
Now everything works. Thanks for your help!

@biochem-fan (Member)

@ashkumatov Can you comment on which compiler works and which does not?

@nym2834610

We use the CUDA 7.5 compiler. I'll try other compiler versions on Monday and let you know if the problem is gone without setting --random_seed 0.

@biochem-fan (Member)

What is the version of GCC invoked by your CUDA compiler (nvcc)?

@ashkumatov

@biochem-fan Actually, I did more tests - the problem is still there. I load 90 GB of particles into RAM for 2D classification, and at some steps the RAM gets filled up to 180 GB, so it essentially doubles.

@biochem-fan (Member)

biochem-fan commented Nov 16, 2019

I think doubling is reasonable. We need space to move particles around.

But earlier you said 10 GB of particles consumes "up to 240 GB", which is quite unexpected and something I cannot reproduce locally. Does the memory consumption differ between GCC versions (I don't care about the CUDA versions)? Does it still happen with very, very few particles, say 100?

When particles (ExpParticle) are sorted, they are copied. Of course, the old particles are freed, but memory might get fragmented. Depending on the compiler, malloc and/or std::sort can be less efficient and take more memory and time. By using C++11 move semantics, we can explicitly ask the compiler to move objects instead of copying and deleting them, thus saving time and space. This would be better, but it takes a huge effort to implement and test. Unless I can reproduce this problem locally, I cannot investigate further. A sketch of the idea is shown below.
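For illustration, here is a hedged sketch of the move-semantics idea described above, using the same hypothetical particle type as the earlier example rather than the actual RELION classes: defaulting noexcept move operations lets std::stable_sort transfer ownership of the pixel buffer instead of deep-copying it.

```cpp
// Sketch of the C++11 move-semantics fix described above (hypothetical
// types, not the actual RELION implementation).
#include <algorithm>
#include <vector>

struct MovableParticle {
    int optics_group = 0;
    std::vector<float> pixels;

    MovableParticle() = default;
    MovableParticle(const MovableParticle&) = default;             // copying still possible
    MovableParticle& operator=(const MovableParticle&) = default;
    // noexcept moves: the standard library prefers them when shuffling
    // elements, so a "copy" of a particle becomes a pointer swap instead
    // of a duplication of the whole image buffer.
    MovableParticle(MovableParticle&&) noexcept = default;
    MovableParticle& operator=(MovableParticle&&) noexcept = default;
};

static bool byOpticsGroup(const MovableParticle& a, const MovableParticle& b) {
    return a.optics_group < b.optics_group;
}

int main() {
    std::vector<MovableParticle> particles(1000);
    for (std::size_t i = 0; i < particles.size(); ++i) {
        particles[i].optics_group = static_cast<int>(i % 4);
        particles[i].pixels.assign(200 * 200, 0.0f);
    }
    // With movable elements, the temporary buffer used by stable_sort
    // holds moved-from shells rather than duplicated image data, so the
    // memory spike and the long wait largely disappear.
    std::stable_sort(particles.begin(), particles.end(), byOpticsGroup);
    return 0;
}
```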

@biochem-fan (Member)

@ashkumatov @nym2834610 @kaoweichun
In the latest commit 6d9a0da, we improved memory management. In our local tests, the huge spike in memory usage was eliminated, and the time between the end of the M step and the start of the next E step has shortened. Could you please test?

@eariascib

We also had large peaks of RAM usage that stalled jobs (RELION 3.1 downloaded on March 5). The new version solved these issues. Thanks!
