RELION-3.1 Pre-read all particles into RAM #514
Are you sure you used the same
Yes.
This is puzzling... We didn't change code related to
Yes, it happened with other datasets and on another computer as well. That computer has even more RAM (768 GB) and RELION 3.1 still attempted to fill it with the master MPI proc (each slave MPI proc uses roughly < 10 GB RAM). It uses Open MPI 2.1.1 and 2x Tesla M60, so I would say the issue is somehow independent of the systems I am using (?).
Hi. I am having a similar issue. I was monitoring the RAM usage now (RELION 3.1) and before (< 3.1 versions). In 3.1b the particles are read into RAM and then, at the first step (maximization), the RAM is suddenly filled up to its maximum and there is a long wait. In subsequent steps, after RELION prints "Maximization is done in XX seconds", the RAM is again filled up to the maximum (actually more than NCPU x 'du -hs Extract/jobXXX') and there is a 5-10 min wait before RELION moves on to the next iteration. The same dataset read from disk: no waiting. In 3.0-stable there was no issue like that.
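(As an aside, a minimal sketch of one way to watch per-rank memory while such a job runs; the grep pattern and the 5-second interval are assumptions about the setup, not something stated in this thread:)

# Show RELION processes sorted by resident memory, refreshed every 5 s
watch -n 5 'ps -eo pid,rss,etime,comm --sort=-rss | grep relion'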
What are the box size, pixel size and resolution? Can you investigate what RELION is doing during the "5-10 min waiting"? Is this from the scratch disk, or from the original location?
Can you do
as in 'gdb bt -p XX'?
BTW, I rolled back to RELION 3.0-stable on the same GPU station - it works flawlessly.
After
[Thread debugging using libthread_db enabled]
Thanks. Can you try the same with other MPI processes?
Sure.
root@jekyll:/home# for i in
For help, type "help".
Quit anyway? (y or n) y
For help, type "help".
Quit anyway? (y or n) y
For help, type "help".
Quit anyway? (y or n) y
For help, type "help".
Quit anyway? (y or n) y
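(For reference, a minimal sketch of how such a per-process backtrace loop can be written; the use of pgrep -f and gdb's batch mode are assumptions, not a reconstruction of the exact command typed above:)

# Print a backtrace of every running relion_refine process, then detach
for i in $(pgrep -f relion_refine); do
    gdb -batch -ex bt -p "$i"
done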
Thanks. This is very useful. Another question: how many particles do you have?
OK, probably I understood what is happening. How many optics groups do you have?
--random_seed 0
The problem seems to be the sorting of particles by optics groups. This sorting was not present in RELION 3.0.
I see. Thanks!
I made an improvement to the code; can you test it without setting
The latest version on the repository should fix this issue. If not, please reopen this issue.
In the latest RELION-3.1 (commit 9d7525), when reading particles into RAM, one still has to specify "--random_seed 0", otherwise it eats up all RAM and stalls. If reading from disk, the behaviour is normal.
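(For context, a minimal sketch of the kind of command line being discussed; the input/output names and classification parameters are placeholders, and only --preread_images, which corresponds to "Pre-read all particles into RAM", and the --random_seed 0 workaround are the options at issue here:)

# Hypothetical 2D classification run; everything except --preread_images and
# --random_seed 0 is an arbitrary placeholder
mpirun -n 5 relion_refine_mpi \
    --i Extract/job014/particles.star \
    --o Class2D/job015/run \
    --K 50 --iter 25 --particle_diameter 200 --ctf --gpu \
    --preread_images \
    --random_seed 0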
Although commit 76fa3d2 reduced the memory usage, RELION 3.1 still needs more space and more operations than 3.0 because of the optics groups. I don't think we can reduce this further. If you have plenty of RAM, you can make a RAM disk and use it as a scratch space.
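(A minimal sketch of that RAM-disk workaround; the mount point and the 200G size are arbitrary examples, not values from this thread:)

# Create a tmpfs-backed RAM disk and point the job's scratch directory at it
sudo mkdir -p /mnt/ramdisk
sudo mount -t tmpfs -o size=200G tmpfs /mnt/ramdisk
# then set "Copy particles to scratch directory" (--scratch_dir) to /mnt/ramdisk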
Note to self:
@nym2834610, @ashkumatov What is the number of particles? Which compiler did you use?
~200K particles at 1 Å/pix, box size 400. We use the bash shell.
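(As a rough sanity check on why pre-reading such a set fills RAM; this assumes 4 bytes per pixel, which is an assumption about how the images are held in memory, and if the build keeps them as 8-byte doubles the figure doubles:)

# Back-of-envelope size of 200,000 single-precision 400x400 images, in decimal GB
echo "$(( 200000 * 400 * 400 * 4 / 1000**3 )) GB"   # -> 128 GB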
I meant the compiler, not the shell.
Actually I tried with 6k particles, which is about 10 GB, running on 4 GPUs and using 5 CPUs in total. If I don't use the flag "--random_seed 0", RAM consumption goes up to 240 GB, and at the peak-consumption steps there is a really long wait. If I use the flag, the behaviour is normal. I will check the compiler version on Monday.
Is the memory consumption more or less proportional to the number of particles? How much does it use with 3K particles, for example?
It loads a proportional amount into RAM and then, at certain steps, it goes up to the maximum RAM available.
Does it use all RAM and take very long even with, say, 100 particles?
Thanks for your comment! It actually helped me find the problem: I typically compile two versions of RELION, one with CUDA 8.0 (to be able to run GCTF) and one with CUDA 9.2, which require different versions of the C compiler. Basically, I forgot to switch back to the newer C compiler when compiling with CUDA 9.2.
@ashkumatov Can you comment on which compiler works and which does not?
We use the CUDA 7.5 compiler. I'll try other compiler versions on Monday and let you know if the problem is gone without setting --random_seed to 0.
What is the version of GCC invoked by your CUDA compiler (nvcc)?
@biochem-fan actually, I did more tests and the problem is still there. I load 90 GB of particles into RAM for 2D classification and at some steps the RAM usage goes up to 180 GB, so it essentially doubles.
I think doubling is reasonable; we need space to move particles around. But earlier you said 10 GB of particles consumes "up to 240 GB", which is quite unexpected and something I cannot reproduce locally. Does the memory consumption differ between GCC versions (I don't care about CUDA versions)? Does it still happen with very, very few particles, say 100? When particles (
@ashkumatov @nym2834610 @kaoweichun
We also had large peaks of RAM usage that stalled the jobs (RELION 3.1 downloaded on March 5). The new version solved these issues. Thanks!
Hello,
I used RELION 3.1-beta (commit a6aaa5) to repeat a previously completed 3D auto-refine from RELION 3.0.7 without changing any settings. I ran it on a single machine (specs below) and I always enabled "Pre-read all particles into RAM". The typical behaviour on 3.0.7 is that the master MPI proc uses 250 GB RAM and each slave MPI proc uses some 20 GB RAM. However, in RELION 3.1, upon starting the refinement the master MPI proc used up all available RAM until mpirun crashed, before even estimating the initial noise spectra (update: each slave MPI proc used only around 4 GB RAM at this stage). Having observed this RAM usage behaviour of RELION 3.1, I simply disabled "Pre-read all particles into RAM", which avoided the problem and let the refinement proceed, but I am afraid there is an issue with memory usage.
Thanks,
WCK
Brief computer specs: 48 cores (hyperthreading enabled) / 384 GB RAM / 2 TB SSD / Open MPI 3.0.2 / 4x GTX 1080 Ti / SGE / CentOS 7.6 / CUDA 10.1