
3D multi-body hangs on Expectation iteration 1 #543

Closed
frozenfas opened this issue Nov 18, 2019 · 15 comments

@frozenfas

frozenfas commented Nov 18, 2019

When starting a 3D multi-body refinement in RELION 3.1 (f2c3d8), it runs to the first expectation iteration and appears to hang there. The time to complete never updates past 000/???, and in htop it appears that only one thread is running at 100%, although nvidia-top shows both GPUs at 95%. There is no error output in run.err. I have tried to start the job multiple times, sometimes leaving it in this state overnight, and I have tried with and without copying particles to scratch. The prior 3D refinement was run with the MTF files provided, and the dataset had previously been subjected to CtfRefinement with anisotropic magnification selected.

Environment:

  • OS: CentOS 7
  • MPI runtime: Open MPI 1.10.7
  • RELION version: RELION-3.1-beta-commit-f2c3d8
  • Memory: 256 GB
  • GPU: [e.g. GTX 1080Ti]

Dataset:

  • Box size: mixed = 400, 384, 384
  • Pixel size: mixed = 1.05, 1.085, 1.085
  • Number of particles: 389894
  • Description: ribosome

Job options:

  • Type of job: 3D Multibody
  • Number of MPI processes: 3
  • Number of threads: 2
 ++++ Executing new job on Mon Nov 18 09:11:02 2019
 ++++ with the following command(s): 
`which relion_refine_mpi` --continue Refine3D/job186/run_ct9_it024_optimiser.star --o MultiBody/job198/run --solvent_correct_fsc --multibody_masks multibody-mask-refine186.star --oversampling 1 --healpix_order 4 --auto_local_healpix_order 4 --offset_range 3 --offset_step 1.5 --reconstruct_subtracted_bodies --dont_combine_weights_via_disc --pool 10 --pad 1 --j 2 --gpu "" --random_seed 0 --pipeline_control MultiBody/job198/
`which relion_flex_analyse` --PCA_orient --model MultiBody/job198/run_model.star --data MultiBody/job198/run_data.star --bodies multibody-mask-refine186.star --o MultiBody/job198/analyse --do_maps --k 3 --pipeline_control MultiBody/job198/
 ++++ 

Error message:
run.err:

The following warnings were encountered upon command-line parsing: 
WARNING: Option --random_seed	is not a valid RELION argument

run.out:

Precision: BASE=double, CUDA-ACC=single 

 Reading in optimiser.star ...
 === RELION MPI setup ===
 + Number of MPI processes             = 3
 + Number of threads per MPI process   = 2
 + Total number of threads therefore   = 6
 + Master  (0) runs on host            = lab-microe20.cicbiogune.int
 + Slave     1 runs on host            = lab-microe20.cicbiogune.int
 + Slave     2 runs on host            = lab-microe20.cicbiogune.int
 =================
 uniqueHost lab-microe20.cicbiogune.int has 2 ranks.
GPU-ids not specified for this rank, threads will automatically be mapped to available devices.
 Thread 0 on slave 1 mapped to device 0
 Thread 1 on slave 1 mapped to device 0
GPU-ids not specified for this rank, threads will automatically be mapped to available devices.
 Thread 0 on slave 2 mapped to device 1
 Thread 1 on slave 2 mapped to device 1
 Running CPU instructions in double precision. 
 + Initialising multi-body refinement ...
 Auto-refine: Iteration= 1
 Auto-refine: Resolution= 3.13433 (no gain for 1 iter) 
 Auto-refine: Changes in angles= 999 degrees; and in offsets= 999 Angstroms (no gain for 0 iter) 
 Estimating accuracies in the orientational assignment ... 
  16/  16 sec ............................................................~~(,_,">
 Auto-refine: Estimated accuracy angles= 0.416 degrees; offsets= 0.3108 Angstroms
 CurrentResolution= 3.13433 Angstroms, which requires orientationSampling of at least 1.28114 degrees for a
 particle of diameter 280 Angstroms
 Oversampling= 0 NrHiddenVariableSamplingPoints= 1820
 OrientationalSampling= 3.75 NrOrientations= 140
 TranslationalSampling= 1.65375 NrTranslations= 13
=============================
 Oversampling= 1 NrHiddenVariableSamplingPoints= 58240
 OrientationalSampling= 1.875 NrOrientations= 1120
 TranslationalSampling= 0.826875 NrTranslations= 52
=============================
 Expectation iteration 1
000/??? sec ~~(,_,">                                                          [oo]

I will leave it running like this to see if it starts to respond.

@biochem-fan
Member

Your dataset is large. Did RELION 3.0 run with reasonable speed?

@frozenfas
Author

In RELION 3.0 they were processed as 3 individual datasets; I took advantage of RELION 3.1 to merge them. The previous 3D refinement in RELION 3.1 took almost a week to complete (using a ramdisk as scratch, so about half the dataset was in RAM and half on a spinning disk). In that last successful 3D refinement, the first expectation iteration took 4.63 hours. I started the 3D multi-body last night around 9 and at 7:30 this morning the time had not advanced past 000/???. I stopped that job, pulled the latest git commits, and restarted with "copy to scratch" turned off (I have had issues with this in the past). It has been running for about 1.5 hours now with no updates, but I will leave it running, maybe until tomorrow, to see if it progresses.

@biochem-fan
Member

I don't think this is RELION 3.1's problem. Your dataset may simply be too large, or your hardware not powerful enough.

If you believe this is RELION 3.1's problem, please try this:

  • Take one dataset (or even a subset of say 20,000 particles)
  • Run Refine3D and Multibody refinement in both RELION 3.0 and 3.1 and compare the speed

@frozenfas
Author

Thanks, I will try out this suggestion.

Note that on a different dataset (448/448 px; 406695 particles) but the same hardware, I was able to complete a 3D refinement with RELION 3.0.7 (first expectation iteration took 4.22 hrs, with time updates starting at 0.08 hrs) and also a multi-body refinement (first expectation iteration took 7.55 hrs, with time updates after 0.13 hrs). I guess the optics groups increase the requirements, though. I will try the suggestions and get back to you.

@frozenfas
Author

Hi. I quickly tried starting the multi-body with commit 0841d0 (the version that was able to complete the 3D auto-refine for me, and before the MTF fix), and it is at least updating the expected completion time (i.e. it updated from 000/??? to 0.12/7.06 hrs). I will let you know if it completes. Are there any issues with using this version for the multi-body refinement?

@biochem-fan
Member

If you are not merging datasets, it is completely fine.

Otherwise, Class2D/3D might give worse results without this MTF fix. Refine3D/MultiBody should be less affected, but we cannot guarantee that.

@frozenfas
Author

OK, so the safest option might be to remove anything related to the MTF correction from the optics group table and repeat the Refine3D/MultiBody without the MTF correction?

@biochem-fan
Member

If your three datasets came from the same detector and have the same pixel size, you can remove the MTF files from the optics group table.
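
For example (sketching the STAR edit from memory, so please double-check against your own file): delete the _rlnMtfFileName label, remove that column from every data row, and renumber the remaining labels. The data_optics header would then look something like this:

data_optics

loop_ 
_rlnOpticsGroup #1 
_rlnOpticsGroupName #2 
_rlnAmplitudeContrast #3 
_rlnSphericalAberration #4 
_rlnVoltage #5 
_rlnImagePixelSize #6 
_rlnImageSize #7 
_rlnImageDimensionality #8 
_rlnMicrographOriginalPixelSize #9 
_rlnMagMat00 #10 
_rlnMagMat01 #11 
_rlnMagMat10 #12 
_rlnMagMat11 #13 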

@frozenfas
Author

frozenfas commented Nov 28, 2019

Good morning. I have tried your recommended test (i.e. using a subset of particles), but so far I have only done it with different commits of version 3.1:

Refine 201

# RELION optimiser; version 3.1-beta-commit-f511ad
`which relion_refine_mpi` --o Refine3D/job201/run --auto_refine --split_random_halves --i Select/job197/particles_split10.star --ref Refine3D/job185/run_class001.mrc --ini_high 40  --dont_combine_weights_via_disc --preread_images  --pool 10 --pad 2  --ctf --ctf_corrected_ref --particle_diameter 280 --flatten_solvent --zero_mask --solvent_mask MaskCreate/job180/mask.mrc --solvent_correct_fsc  --oversampling 1 --healpix_order 2 --auto_local_healpix_order 4 --offset_range 5 --offset_step 2 --sym C1 --low_resol_join_halves 40 --norm --scale  --j 2 --gpu ""  --pipeline_control Refine3D/job201/

Refine 198

# RELION optimiser; version 3.1-beta-commit-0841d0
`which relion_refine_mpi` --o Refine3D/job198/run --auto_refine --split_random_halves --i Select/job197/particles_split10.star --ref Refine3D/job185/run_class001.mrc --ini_high 40  --dont_combine_weights_via_disc --preread_images  --pool 10 --pad 2  --ctf --ctf_corrected_ref --particle_diameter 280 --flatten_solvent --zero_mask --solvent_mask MaskCreate/job180/mask.mrc --solvent_correct_fsc  --oversampling 1 --healpix_order 2 --auto_local_healpix_order 4 --offset_range 5 --offset_step 2 --sym C1 --low_resol_join_halves 40 --norm --scale  --j 2 --gpu ""  --pipeline_control Refine3D/job198/

The two commands are identical; I just used different builds (3.1-beta-commit-f511ad vs. 3.1-beta-commit-0841d0). With 3.1-beta-commit-0841d0, Refine3D completes without error and Expectation iteration 1 takes 59 seconds, but with 3.1-beta-commit-f511ad the first Expectation iteration remains like this:

000/??? sec ~~(,_,">

even if I leave it running for an hour.

In general, I am also having problems completing a Multibody refinement even when I use 3.1-beta-commit-0841d0. In the multi-body refinement, it always crashes about a third of the way through the first expectation iteration (I have 3 optics groups with different box/pixel sizes). Could you check whether this optics group table is correct:

# version 30001

data_optics

loop_ 
_rlnOpticsGroup #1 
_rlnOpticsGroupName #2 
_rlnAmplitudeContrast #3 
_rlnSphericalAberration #4 
_rlnVoltage #5 
_rlnImagePixelSize #6 
_rlnImageSize #7 
_rlnImageDimensionality #8 
_rlnMtfFileName #9 
_rlnMicrographOriginalPixelSize #10 
_rlnMagMat00 #11 
_rlnMagMat01 #12 
_rlnMagMat10 #13 
_rlnMagMat11 #14 
           1    em15422     0.100000     2.700000   300.000000     1.050000          400            2 mtf_k2_300kV.star     1.050000     1.002644 -6.57256e-04 -7.78084e-04     0.999458 
           2  em17171-3     0.100000     2.700000   300.000000     1.085000          384            2 mtf_falcon3EC_300kV.star     1.085000     0.999117 2.071498e-04 -2.61142e-04     0.999814 
           3 em17171-12     0.100000     2.700000   300.000000     1.085000          384            2 mtf_falcon3EC_300kV.star     1.085000     1.001107     0.001209     0.001314     1.000985 
 

# version 30001

data_particles

loop_ 
_rlnCoordinateX #1 
_rlnCoordinateY #2 
_rlnImageName #3 
_rlnMicrographName #4 
_rlnCtfMaxResolution #5 
_rlnCtfFigureOfMerit #6 
_rlnDefocusU #7 
_rlnDefocusV #8 
_rlnDefocusAngle #9 
_rlnCtfBfactor #10 
_rlnCtfScalefactor #11 
_rlnPhaseShift #12 
_rlnAngleRot #13 
_rlnAngleTilt #14 
_rlnAnglePsi #15 
_rlnClassNumber #16 
_rlnNormCorrection #17 
_rlnLogLikeliContribution #18 
_rlnMaxValueProbDistribution #19 
_rlnNrOfSignificantSamples #20 
_rlnGroupName #21 
_rlnRandomSubset #22 
_rlnOpticsGroup #23 
_rlnOriginXAngst #24 
_rlnOriginYAngst #25 
_rlnGroupNumber #26 

Thanks so much for any advice you can provide.

@biochem-fan
Member

Your optics group table looks fine.

Could you please try commits 1c53280 and f29379e (i.e. before and after Sjors's change to the MTF treatment)? That is, git checkout 1c53280, then build and run the job. If it works, try git checkout f29379e, build and run the same test.
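
For example, something like this (just a minimal sketch; use your usual build directory and cmake options):

cd relion                      # your RELION 3.1 source tree
git checkout 1c53280
mkdir -p build && cd build
cmake .. && make -j 8
# run the Refine3D test with this build, then repeat with: git checkout f29379e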

@frozenfas
Author

frozenfas commented Nov 28, 2019

In summary

  1. version 3.1-beta-commit-1c5328 dies with the error "Empty MultidimArray!", which I think is related to issue #541 (Class3D, Empty MultidimArray)
  2. version 3.1-beta-commit-f29379 is currently running but seems stuck in Expectation iteration 1 as described above.

Commands and output below:
Refine 204 (error, I think, related to issue #541: Class3D, Empty MultidimArray)

# RELION optimiser; version 3.1-beta-commit-1c5328

`which relion_refine_mpi` --o Refine3D/job204/run --auto_refine --split_random_halves --i Select/job197/particles_split10.star --ref Refine3D/job185/run_class001.mrc --ini_high 40 --dont_combine_weights_via_disc --preread_images  --pool 10 --pad 2  --ctf --ctf_corrected_ref --particle_diameter 280 --flatten_solvent --zero_mask --solvent_mask MaskCreate/job180/mask.mrc --solvent_correct_fsc  --oversampling 1 --healpix_order 2 --auto_local_healpix_order 4 --offset_range 5 --offset_step 2 --sym C1 --low_resol_join_halves 40 --norm --scale  --j 2 --gpu ""  --pipeline_control Refine3D/job204/
Expectation iteration 1
000/??? sec ~~(,_,">                                                          [oo] Size(Y,X): 42x22 i=[0..41] j=[0..21]
 Empty MultidimArray!
 Size(Y,X): 42x22 i=[0..41] j=[0..21]
 Empty MultidimArray!
--------------------------------------------------------------------------
mpirun noticed that process rank 2 with PID 19701 on node lab-microe20 exited on signal 6 (Aborted).

Refine 205:

# RELION optimiser; version 3.1-beta-commit-f29379
`which relion_refine_mpi` --o Refine3D/job205/run --auto_refine --split_random_halves --i Select/job197/particles_split10.star --ref Refine3D/job185/run_class001.mrc --ini_high 40 --dont_combine_weights_via_disc --preread_images  --pool 10 --pad 2  --ctf --ctf_corrected_ref --particle_diameter 280 --flatten_solvent --zero_mask --solvent_mask MaskCreate/job180/mask.mrc --solvent_correct_fsc  --oversampling 1 --healpix_order 2 --auto_local_healpix_order 4 --offset_range 5 --offset_step 2 --sym C1 --low_resol_join_halves 40 --norm --scale  --j 2 --gpu ""  --pipeline_control Refine3D/job205/
Expectation iteration 1
000/??? sec ~~(,_,">                                                          [oo]
(in htop only 1 of 3 MPI processes is using CPU resources)

Just in case it is helpful, this is the output of:

ldd ~/local/relion-3.1b/bin/relion_refine_mpi
	linux-vdso.so.1 =>  (0x00007ffe033b1000)
	libcufft.so.10 => /usr/local/cuda/lib64/libcufft.so.10 (0x00007f6a017e0000)
	libmpi_cxx.so.1 => /usr/lib64/openmpi/lib/libmpi_cxx.so.1 (0x00007f6a015c5000)
	libmpi.so.12 => /usr/lib64/openmpi/lib/libmpi.so.12 (0x00007f6a012e1000)
	libdl.so.2 => /lib64/libdl.so.2 (0x00007f6a010dd000)
	libtiff.so.5 => /lib64/libtiff.so.5 (0x00007f6a00e69000)
	libfftw3.so.3 => /home/CICBIOGUNE/sconnell/local/relion-3.1b/lib/libfftw3.so.3 (0x00007f6a00ae8000)
	libfftw3f.so.3 => /home/CICBIOGUNE/sconnell/local/relion-3.1b/lib/libfftw3f.so.3 (0x00007f6a006da000)
	libcurand.so.10 => /usr/local/cuda/lib64/libcurand.so.10 (0x00007f69fc679000)
	libpng15.so.15 => /lib64/libpng15.so.15 (0x00007f69fc44e000)
	libcudart.so.10.1 => /usr/local/cuda/lib64/libcudart.so.10.1 (0x00007f69fc1d2000)
	libstdc++.so.6 => /lib64/libstdc++.so.6 (0x00007f69fbecb000)
	libm.so.6 => /lib64/libm.so.6 (0x00007f69fbbc9000)
	libgomp.so.1 => /lib64/libgomp.so.1 (0x00007f69fb9a3000)
	libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007f69fb78d000)
	libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f69fb571000)
	libc.so.6 => /lib64/libc.so.6 (0x00007f69fb1a3000)
	librt.so.1 => /lib64/librt.so.1 (0x00007f69faf9b000)
	/lib64/ld-linux-x86-64.so.2 (0x00007f6a09e1a000)
	libopen-rte.so.12 => /usr/lib64/openmpi/lib/libopen-rte.so.12 (0x00007f69fad1f000)
	libopen-pal.so.13 => /usr/lib64/openmpi/lib/libopen-pal.so.13 (0x00007f69faa7b000)
	libutil.so.1 => /lib64/libutil.so.1 (0x00007f69fa878000)
	libhwloc.so.5 => /lib64/libhwloc.so.5 (0x00007f69fa63b000)
	libjbig.so.2.0 => /lib64/libjbig.so.2.0 (0x00007f69fa42f000)
	libjpeg.so.62 => /lib64/libjpeg.so.62 (0x00007f69fa1da000)
	libz.so.1 => /lib64/libz.so.1 (0x00007f69f9fc4000)
	libnuma.so.1 => /lib64/libnuma.so.1 (0x00007f69f9db9000)
	libltdl.so.7 => /lib64/libltdl.so.7 (0x00007f69f9baf000)

@biochem-fan
Member

Could you please privately share a small subset (~100 particles from each optics group) of your dataset with us? We need to investigate this issue further locally. Please write to Sjors and Takanori (you can find their email addresses in CCPEM).
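
If it helps, a rough awk sketch like the one below (my own quick sketch, assuming the column layout of your data_particles table above, where _rlnOpticsGroup is column 23) should keep the first 100 particle rows per optics group; the .mrcs stacks referenced by the subset would need to be shared alongside the STAR file:

awk 'BEGIN{in_data=0}
     /^data_particles/{in_data=1}
     {
       # pass through the optics table, loop_/label/comment lines and blank lines
       if (!in_data || NF < 23 || $1 ~ /^(_|loop_|data_|#)/) {print; next}
       # particle rows: keep only the first 100 per optics group (column 23)
       if (++seen[$23] <= 100) print
     }' particles.star > subset_particles.star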

@biochem-fan
Member

We fixed this issue in commit 3bfef2b.

@biochem-fan
Member

biochem-fan commented Dec 3, 2019

Refine3D was fixed by the above commit, but the MultiBody issue remains. @scheres is working on it now.

@biochem-fan
Member

Fixed now.
