
3D multi-body hangs on Expectation iteration 1 #543

Closed
frozenfas opened this issue Nov 18, 2019 · 15 comments

@frozenfas

frozenfas commented Nov 18, 2019

When starting a 3D multi-body refinement in RELION 3.1 (f2c3d8), it runs to the first expectation iteration and appears to hang there. The time to complete never updates past 000/???, and in htop it appears that only one thread is running at 100%, although nvidia-top shows both GPUs at 95%. There is no error output in run.err. I have tried to start the job multiple times, sometimes leaving it in this state overnight, and I have tried with and without copying particles to scratch. The prior 3D refinement was run with the MTF files provided, and the dataset had previously been subjected to CtfRefinement with anisotropic magnification selected.

Environment:

  • OS: CentOS 7
  • MPI runtime: Open MPI 1.10.7
  • RELION version: RELION-3.1-beta-commit-f2c3d8
  • Memory: 256 GB
  • GPU: [e.g. GTX 1080Ti]

Dataset:

  • Box size: mixed = 400, 384, 384
  • Pixel size: mixed = 1.05, 1.085, 1.085
  • Number of particles: 389894
  • Description: ribosome

Job options:

  • Type of job: 3D Multibody
  • Number of MPI processes: 3
  • Number of threads: 2
 ++++ Executing new job on Mon Nov 18 09:11:02 2019
 ++++ with the following command(s): 
`which relion_refine_mpi` --continue Refine3D/job186/run_ct9_it024_optimiser.star --o MultiBody/job198/run --solvent_correct_fsc --multibody_masks multibody-mask-refine186.star --oversampling 1 --healpix_order 4 --auto_local_healpix_order 4 --offset_range 3 --offset_step 1.5 --reconstruct_subtracted_bodies --dont_combine_weights_via_disc --pool 10 --pad 1 --j 2 --gpu "" --random_seed 0 --pipeline_control MultiBody/job198/
`which relion_flex_analyse` --PCA_orient --model MultiBody/job198/run_model.star --data MultiBody/job198/run_data.star --bodies multibody-mask-refine186.star --o MultiBody/job198/analyse --do_maps --k 3 --pipeline_control MultiBody/job198/
 ++++ 

Error message:
run.err:

The following warnings were encountered upon command-line parsing: 
WARNING: Option --random_seed	is not a valid RELION argument

run.out:

Precision: BASE=double, CUDA-ACC=single 

 Reading in optimiser.star ...
 === RELION MPI setup ===
 + Number of MPI processes             = 3
 + Number of threads per MPI process   = 2
 + Total number of threads therefore   = 6
 + Master  (0) runs on host            = lab-microe20.cicbiogune.int
 + Slave     1 runs on host            = lab-microe20.cicbiogune.int
 + Slave     2 runs on host            = lab-microe20.cicbiogune.int
 =================
 uniqueHost lab-microe20.cicbiogune.int has 2 ranks.
GPU-ids not specified for this rank, threads will automatically be mapped to available devices.
 Thread 0 on slave 1 mapped to device 0
 Thread 1 on slave 1 mapped to device 0
GPU-ids not specified for this rank, threads will automatically be mapped to available devices.
 Thread 0 on slave 2 mapped to device 1
 Thread 1 on slave 2 mapped to device 1
 Running CPU instructions in double precision. 
 + Initialising multi-body refinement ...
 Auto-refine: Iteration= 1
 Auto-refine: Resolution= 3.13433 (no gain for 1 iter) 
 Auto-refine: Changes in angles= 999 degrees; and in offsets= 999 Angstroms (no gain for 0 iter) 
 Estimating accuracies in the orientational assignment ... 
  16/  16 sec ............................................................~~(,_,">
 Auto-refine: Estimated accuracy angles= 0.416 degrees; offsets= 0.3108 Angstroms
 CurrentResolution= 3.13433 Angstroms, which requires orientationSampling of at least 1.28114 degrees for a
 particle of diameter 280 Angstroms
 Oversampling= 0 NrHiddenVariableSamplingPoints= 1820
 OrientationalSampling= 3.75 NrOrientations= 140
 TranslationalSampling= 1.65375 NrTranslations= 13
=============================
 Oversampling= 1 NrHiddenVariableSamplingPoints= 58240
 OrientationalSampling= 1.875 NrOrientations= 1120
 TranslationalSampling= 0.826875 NrTranslations= 52
=============================
 Expectation iteration 1
000/??? sec ~~(,_,">                                                          [oo]

I will leave it running like this to see if it starts to respond.

@biochem-fan
Member

Your dataset is large. Did RELION 3.0 run with reasonable speed?

@frozenfas
Author

In RELION 3.0 they were processed as 3 individual datasets; I took advantage of RELION 3.1 to merge them. The previous 3D refinement in RELION 3.1 took almost a week to complete (using a ramdisk as scratch, so about half the dataset was in RAM and half on a spinning disk). In that last successful 3D refinement, the first expectation iteration took 4.63 hours. I started the 3D multi-body last night around 9 and at 7:30 this morning the time had not advanced past 000/???. I stopped that job, pulled the latest git commits, and restarted with "copy to scratch" turned off (I have had issues with this in the past). It has been running for about 1.5 hours now with no updates, but I will leave it running, maybe until tomorrow, to see if it progresses.

@biochem-fan
Member

I don't think this is RELION 3.1's problem. Your dataset may simply be too large, or your hardware not powerful enough.

If you believe this is RELION 3.1's problem, please try this:

  • Take one dataset (or even a subset of say 20,000 particles)
  • Run Refine3D and Multibody refinement in both RELION 3.0 and 3.1 and compare the speed

@frozenfas
Author

Thanks, I will try out this suggestion.

Note that on a different dataset (448/448 px; 406695 particles) but the same hardware, I was able to complete a 3D refinement with RELION 3.0.7 (first expectation iteration took 4.22 hrs, with time updates starting at 0.08 hrs) and also a multi-body refinement (first expectation iteration took 7.55 hrs, with time updates after 0.13 hrs). I guess the optics groups increase the requirements, though. I will try the suggestions and get back to you.

@frozenfas
Author

Hi. I quickly tried starting the multi-body with commit 0841d0 (the version that was able to complete the 3D auto-refine for me, and before the MTF fix), and it is at least updating the expected completion time (i.e. it updated from 000/??? to 0.12/7.06 hrs). I will let you know if it completes. Are there any issues with using this version for the multi-body refinement?

@biochem-fan
Member

If you are not merging datasets, it is completely fine.

Otherwise, Class2D/3D might give worse results without this MTF fix. Refine3D/MultiBody should be less affected, but we cannot guarantee that.

@frozenfas
Author

OK, so the safest option might be to remove anything related to the MTF correction from the optics group table and repeat the Refine3D/MultiBody without the MTF correction?

@biochem-fan
Member

If your three datasets came from the same detector and have the same pixel size, you can remove the MTF files from the optics group table.
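
For example (sketching the STAR edit from memory, so please double-check against your own file): delete the _rlnMtfFileName label, remove that column from every data row, and renumber the remaining labels. The data_optics header would then look something like this:

data_optics

loop_ 
_rlnOpticsGroup #1 
_rlnOpticsGroupName #2 
_rlnAmplitudeContrast #3 
_rlnSphericalAberration #4 
_rlnVoltage #5 
_rlnImagePixelSize #6 
_rlnImageSize #7 
_rlnImageDimensionality #8 
_rlnMicrographOriginalPixelSize #9 
_rlnMagMat00 #10 
_rlnMagMat01 #11 
_rlnMagMat10 #12 
_rlnMagMat11 #13 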

@frozenfas
Author

frozenfas commented Nov 28, 2019

Good morning. I have tried your recommended test (i.e. using a subset of particles), but so far I have only done it with different commits of version 3.1:

Refine 201

# RELION optimiser; version 3.1-beta-commit-f511ad
`which relion_refine_mpi` --o Refine3D/job201/run --auto_refine --split_random_halves --i Select/job197/particles_split10.star --ref Refine3D/job185/run_class001.mrc --ini_high 40  --dont_combine_weights_via_disc --preread_images  --pool 10 --pad 2  --ctf --ctf_corrected_ref --particle_diameter 280 --flatten_solvent --zero_mask --solvent_mask MaskCreate/job180/mask.mrc --solvent_correct_fsc  --oversampling 1 --healpix_order 2 --auto_local_healpix_order 4 --offset_range 5 --offset_step 2 --sym C1 --low_resol_join_halves 40 --norm --scale  --j 2 --gpu ""  --pipeline_control Refine3D/job201/

Refine 198

# RELION optimiser; version 3.1-beta-commit-0841d0
`which relion_refine_mpi` --o Refine3D/job198/run --auto_refine --split_random_halves --i Select/job197/particles_split10.star --ref Refine3D/job185/run_class001.mrc --ini_high 40  --dont_combine_weights_via_disc --preread_images  --pool 10 --pad 2  --ctf --ctf_corrected_ref --particle_diameter 280 --flatten_solvent --zero_mask --solvent_mask MaskCreate/job180/mask.mrc --solvent_correct_fsc  --oversampling 1 --healpix_order 2 --auto_local_healpix_order 4 --offset_range 5 --offset_step 2 --sym C1 --low_resol_join_halves 40 --norm --scale  --j 2 --gpu ""  --pipeline_control Refine3D/job198/

The two commands are identical; I just used different builds (3.1-beta-commit-f511ad vs. 3.1-beta-commit-0841d0). With 3.1-beta-commit-0841d0, Refine3D completes without error and Expectation iteration 1 takes 59 seconds, but with 3.1-beta-commit-f511ad the first Expectation iteration remains like this:

000/??? sec ~~(,_,">

even if I leave it running for an hour.

In general, I am also having problems completing a Multibody refinement even when I use 3.1-beta-commit-0841d0. In the multi-body refinement, it always crashes about a third of the way through the first expectation iteration (I have 3 optics groups with different box/pixel sizes). Could you check whether this optics group table is correct:

# version 30001

data_optics

loop_ 
_rlnOpticsGroup #1 
_rlnOpticsGroupName #2 
_rlnAmplitudeContrast #3 
_rlnSphericalAberration #4 
_rlnVoltage #5 
_rlnImagePixelSize #6 
_rlnImageSize #7 
_rlnImageDimensionality #8 
_rlnMtfFileName #9 
_rlnMicrographOriginalPixelSize #10 
_rlnMagMat00 #11 
_rlnMagMat01 #12 
_rlnMagMat10 #13 
_rlnMagMat11 #14 
           1    em15422     0.100000     2.700000   300.000000     1.050000          400            2 mtf_k2_300kV.star     1.050000     1.002644 -6.57256e-04 -7.78084e-04     0.999458 
           2  em17171-3     0.100000     2.700000   300.000000     1.085000          384            2 mtf_falcon3EC_300kV.star     1.085000     0.999117 2.071498e-04 -2.61142e-04     0.999814 
           3 em17171-12     0.100000     2.700000   300.000000     1.085000          384            2 mtf_falcon3EC_300kV.star     1.085000     1.001107     0.001209     0.001314     1.000985 
 

# version 30001

data_particles

loop_ 
_rlnCoordinateX #1 
_rlnCoordinateY #2 
_rlnImageName #3 
_rlnMicrographName #4 
_rlnCtfMaxResolution #5 
_rlnCtfFigureOfMerit #6 
_rlnDefocusU #7 
_rlnDefocusV #8 
_rlnDefocusAngle #9 
_rlnCtfBfactor #10 
_rlnCtfScalefactor #11 
_rlnPhaseShift #12 
_rlnAngleRot #13 
_rlnAngleTilt #14 
_rlnAnglePsi #15 
_rlnClassNumber #16 
_rlnNormCorrection #17 
_rlnLogLikeliContribution #18 
_rlnMaxValueProbDistribution #19 
_rlnNrOfSignificantSamples #20 
_rlnGroupName #21 
_rlnRandomSubset #22 
_rlnOpticsGroup #23 
_rlnOriginXAngst #24 
_rlnOriginYAngst #25 
_rlnGroupNumber #26 

Thanks so much for any advice you can provide.

@biochem-fan
Member

Your optics group table looks fine.

Could you please try commits 1c53280 and f29379e (i.e. before and after Sjors's change to the MTF treatment)? That is, git checkout 1c53280, then build and run the job. If it works, try git checkout f29379e, build and run the same test.
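
For example, something like this (just a minimal sketch; use your usual build directory and cmake options):

cd relion                      # your RELION 3.1 source tree
git checkout 1c53280
mkdir -p build && cd build
cmake .. && make -j 8
# run the Refine3D test with this build, then repeat with: git checkout f29379e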

@frozenfas
Author

frozenfas commented Nov 28, 2019

In summary

  1. version 3.1-beta-commit-1c5328 dies with the error "Empty MultidimArray!", which I think is related to issue #541 (Class3D, Empty MultidimArray)
  2. version 3.1-beta-commit-f29379 is currently running but seems stuck in Expectation iteration 1 as described above.

Commands and output below:
Refine 204 (error, I think, related to issue #541: Class3D, Empty MultidimArray)

# RELION optimiser; version 3.1-beta-commit-1c5328

`which relion_refine_mpi` --o Refine3D/job204/run --auto_refine --split_random_halves --i Select/job197/particles_split10.star --ref Refine3D/job185/run_class001.mrc --ini_high 40 --dont_combine_weights_via_disc --preread_images  --pool 10 --pad 2  --ctf --ctf_corrected_ref --particle_diameter 280 --flatten_solvent --zero_mask --solvent_mask MaskCreate/job180/mask.mrc --solvent_correct_fsc  --oversampling 1 --healpix_order 2 --auto_local_healpix_order 4 --offset_range 5 --offset_step 2 --sym C1 --low_resol_join_halves 40 --norm --scale  --j 2 --gpu ""  --pipeline_control Refine3D/job204/
Expectation iteration 1
000/??? sec ~~(,_,">                                                          [oo] Size(Y,X): 42x22 i=[0..41] j=[0..21]
 Empty MultidimArray!
 Size(Y,X): 42x22 i=[0..41] j=[0..21]
 Empty MultidimArray!
--------------------------------------------------------------------------
mpirun noticed that process rank 2 with PID 19701 on node lab-microe20 exited on signal 6 (Aborted).

Refine 205:

# RELION optimiser; version 3.1-beta-commit-f29379
`which relion_refine_mpi` --o Refine3D/job205/run --auto_refine --split_random_halves --i Select/job197/particles_split10.star --ref Refine3D/job185/run_class001.mrc --ini_high 40 --dont_combine_weights_via_disc --preread_images  --pool 10 --pad 2  --ctf --ctf_corrected_ref --particle_diameter 280 --flatten_solvent --zero_mask --solvent_mask MaskCreate/job180/mask.mrc --solvent_correct_fsc  --oversampling 1 --healpix_order 2 --auto_local_healpix_order 4 --offset_range 5 --offset_step 2 --sym C1 --low_resol_join_halves 40 --norm --scale  --j 2 --gpu ""  --pipeline_control Refine3D/job205/
Expectation iteration 1
000/??? sec ~~(,_,">                                                          [oo]
(in htop only 1 of 3 MPI processes is using CPU resources)

Just in case it is helpful, this is the output of:

ldd ~/local/relion-3.1b/bin/relion_refine_mpi
	linux-vdso.so.1 =>  (0x00007ffe033b1000)
	libcufft.so.10 => /usr/local/cuda/lib64/libcufft.so.10 (0x00007f6a017e0000)
	libmpi_cxx.so.1 => /usr/lib64/openmpi/lib/libmpi_cxx.so.1 (0x00007f6a015c5000)
	libmpi.so.12 => /usr/lib64/openmpi/lib/libmpi.so.12 (0x00007f6a012e1000)
	libdl.so.2 => /lib64/libdl.so.2 (0x00007f6a010dd000)
	libtiff.so.5 => /lib64/libtiff.so.5 (0x00007f6a00e69000)
	libfftw3.so.3 => /home/CICBIOGUNE/sconnell/local/relion-3.1b/lib/libfftw3.so.3 (0x00007f6a00ae8000)
	libfftw3f.so.3 => /home/CICBIOGUNE/sconnell/local/relion-3.1b/lib/libfftw3f.so.3 (0x00007f6a006da000)
	libcurand.so.10 => /usr/local/cuda/lib64/libcurand.so.10 (0x00007f69fc679000)
	libpng15.so.15 => /lib64/libpng15.so.15 (0x00007f69fc44e000)
	libcudart.so.10.1 => /usr/local/cuda/lib64/libcudart.so.10.1 (0x00007f69fc1d2000)
	libstdc++.so.6 => /lib64/libstdc++.so.6 (0x00007f69fbecb000)
	libm.so.6 => /lib64/libm.so.6 (0x00007f69fbbc9000)
	libgomp.so.1 => /lib64/libgomp.so.1 (0x00007f69fb9a3000)
	libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007f69fb78d000)
	libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f69fb571000)
	libc.so.6 => /lib64/libc.so.6 (0x00007f69fb1a3000)
	librt.so.1 => /lib64/librt.so.1 (0x00007f69faf9b000)
	/lib64/ld-linux-x86-64.so.2 (0x00007f6a09e1a000)
	libopen-rte.so.12 => /usr/lib64/openmpi/lib/libopen-rte.so.12 (0x00007f69fad1f000)
	libopen-pal.so.13 => /usr/lib64/openmpi/lib/libopen-pal.so.13 (0x00007f69faa7b000)
	libutil.so.1 => /lib64/libutil.so.1 (0x00007f69fa878000)
	libhwloc.so.5 => /lib64/libhwloc.so.5 (0x00007f69fa63b000)
	libjbig.so.2.0 => /lib64/libjbig.so.2.0 (0x00007f69fa42f000)
	libjpeg.so.62 => /lib64/libjpeg.so.62 (0x00007f69fa1da000)
	libz.so.1 => /lib64/libz.so.1 (0x00007f69f9fc4000)
	libnuma.so.1 => /lib64/libnuma.so.1 (0x00007f69f9db9000)
	libltdl.so.7 => /lib64/libltdl.so.7 (0x00007f69f9baf000)

@biochem-fan
Member

Could you please privately share a small subset (~100 particles from each optics group) of your dataset with us? We need to investigate this issue further locally. Please write to Sjors and Takanori (you can find their email addresses in CCPEM).
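
If it helps, a rough awk sketch like the one below (my own quick sketch, assuming the column layout of your data_particles table above, where _rlnOpticsGroup is column 23) should keep the first 100 particle rows per optics group; the .mrcs stacks referenced by the subset would need to be shared alongside the STAR file:

awk 'BEGIN{in_data=0}
     /^data_particles/{in_data=1}
     {
       # pass through the optics table, loop_/label/comment lines and blank lines
       if (!in_data || NF < 23 || $1 ~ /^(_|loop_|data_|#)/) {print; next}
       # particle rows: keep only the first 100 per optics group (column 23)
       if (++seen[$23] <= 100) print
     }' particles.star > subset_particles.star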

@biochem-fan
Member

We fixed this issue in commit 3bfef2b.

@biochem-fan
Member

biochem-fan commented Dec 3, 2019

Refine3D was fixed by the above commit, but the MultiBody issue remains. @scheres is working on it now.

@biochem-fan
Member

Fixed now.
