
RELION race condition when opening GUI #1198

Open
clil16 opened this issue Nov 8, 2024 · 10 comments

@clil16

clil16 commented Nov 8, 2024

Recently we have seen users with established RELION project directories lose the ability to open the RELION GUI in those directories with the following versions of RELION: 4.0.1-commit-e5c483 and 5.0-beta-4-commit-33b2b0. The behavior was the same each time it occurred, but it did not occur on every attempt.

We are using NFS v4.2 for our storage.

Environment:

  • OS: Red Hat Enterprise Linux 8.10
  • MPI runtime: OpenMPI 4.1.6
  • RELION version: 4.0.1-commit-e5c483, 5.0-beta-4-commit-33b2b0
  • Memory: N/A
  • GPU: N/A

Dataset:

  • Box size: N/A
  • Pixel size: N/A
  • Number of particles: N/A
  • Description: N/A

Job options:

  • Type of job: Launch GUI
  • Number of MPI processes: N/A
  • Number of threads: N/A
  • Full command (see note.txt in the job directory):
    relion
    

Error message:

in: /tmp/build/relion/src/pipeliner.cpp, line 2341
ERROR:
ERROR: PipeLine::write reported error in removing directory .relion_lock
=== Backtrace ===
relion(_ZN11RelionErrorC1ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES7_1+0x69) [0x47a639]
relion() [0x45a085]
relion (_ZN8PipeLine22checkProcessCompletionEv+0xbd9) [0x58a199]
relion (_ZN13GuiMainWindow14updateJobListsEv+0x10) [0x48a50]
relion (_ZN13GuiMainWindow23cb_select_browsegroup_iEb+0x91) [0x496021]
relion(_ZN13GuiMainWindowC2EiiPKc8FileNameiibbbb+0x30fa) [0x480aa]
relion (main+0x7fb) [0x466ccb]
/lib64/libc.so.6(__libc_start_main+0xe5) [0x7f17b4cd17e5]
relion(_start+0x2e) [0x47289e]
==================
ERROR:
ERROR: PipeLine::write reported error in removing directory .relion_lock

After tracing relion with strace, I saw the following at the end of the output:

stat(".relion_lock/lock_default_pipeline.star", {st_mode=S_IFREG|0640, st_size=23, ...}) = 0
unlink(".relion_lock/lock_default_pipeline.star") = 0
rmdir(".relion_lock")                   = -1 ENOTEMPTY (Directory not empty)
futex(0x7f0a518230a8, FUTEX_WAKE_PRIVATE, 2147483647) = 0

Once the program errors out and exits, the file .relion_lock/lock_default_pipeline.star has in fact been removed (as the return code of 0 from unlink indicates), but the rmdir on the directory seems to run before the file's removal is visible, which causes the ENOTEMPTY error.
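
To check whether the file system shows this behavior outside of RELION, one could repeat the same create/unlink/rmdir sequence in a tight loop. The program below is only a hypothetical sketch (it is not part of RELION, the names are placeholders, and whether it actually triggers the problem will depend on the NFS client's caching behavior):

// rmdir_race_check.cpp -- hypothetical standalone reproducer (not part of RELION).
// It repeats the create/unlink/rmdir sequence seen in the strace output above,
// using a throwaway directory name so it cannot clash with a real .relion_lock.
// Build with: g++ -o rmdir_race_check rmdir_race_check.cpp
#include <cerrno>
#include <cstdio>
#include <cstring>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

int main()
{
    const char *dir  = ".rmdir_race_test";
    const char *file = ".rmdir_race_test/lock_test.star";

    for (int i = 0; i < 10000; i++)
    {
        if (mkdir(dir, 0755) != 0) { std::perror("mkdir"); return 1; }

        int fd = open(file, O_CREAT | O_WRONLY, 0640);
        if (fd < 0) { std::perror("open"); return 1; }
        if (write(fd, "lock\n", 5) != 5) { std::perror("write"); return 1; }
        close(fd);

        if (unlink(file) != 0) { std::perror("unlink"); return 1; }

        // This is the call that intermittently fails with ENOTEMPTY in RELION.
        if (rmdir(dir) != 0)
        {
            std::fprintf(stderr, "iteration %d: rmdir failed: %s\n", i, std::strerror(errno));
            return 1;
        }
    }
    std::puts("no rmdir failure observed in 10000 iterations");
    return 0;
}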

I was able to fix this issue with the following patch for pipeliner.cpp on the 4.0.2 git tag:

--- src/pipeliner.cpp.org	2024-11-07 13:53:10
+++ src/pipeliner.cpp	2024-11-07 13:58:01
@@ -2300,8 +2300,18 @@
 			REPORT_ERROR("ERROR: PipeLine::write was expecting a file called "+fn_lock+ " but it is no longer there.");
 		if (std::remove(fn_lock.c_str()))
 			REPORT_ERROR("ERROR: PipeLine::write reported error in removing file "+fn_lock);
-		if (rmdir(dir_lock.c_str()))
-			REPORT_ERROR("ERROR: PipeLine::write reported error in removing directory "+dir_lock);
+		int iwait = 0;
+		int status = rmdir(dir_lock.c_str());
+		while (status != 0)
+		{
+			usleep(500000); // 0.5 s; sleep() takes whole seconds, so sleep(0.5) would not wait
+			status = rmdir(dir_lock.c_str());
+			iwait++;
+			if (iwait > 9)
+			{
+				REPORT_ERROR("ERROR: PipeLine::write reported error in removing directory " + dir_lock + ". Waited 5 seconds for the directory to be removed successfully.");
+			}
+		}
 	}
 
 	// Touch a file to indicate to the GUI that the pipeline has just changed
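
For readability, here is the same retry idea as a standalone sketch (the helper name and constants are mine, not from the RELION codebase); in RELION itself the failure path would still call REPORT_ERROR as in the patch above:

#include <string>
#include <unistd.h>

// Hypothetical sketch of the retry approach: try rmdir() up to max_attempts
// times, waiting 0.5 s between attempts, and report failure only if the
// directory never goes away (10 x 0.5 s = the 5 seconds mentioned in the patch).
static bool remove_lock_dir_with_retry(const std::string &dir_lock, int max_attempts = 10)
{
    for (int attempt = 0; attempt < max_attempts; attempt++)
    {
        if (rmdir(dir_lock.c_str()) == 0)
            return true;
        // ENOTEMPTY here usually means the just-unlinked lock file is not yet
        // invisible to this client; wait a little and try again.
        usleep(500000); // 0.5 s; note that sleep() takes whole seconds
    }
    return false;
}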

I could create a pull request as well if you'd like.

@biochem-fan
Member

biochem-fan commented Nov 8, 2024

This is very interesting. It is possible that a file system change on one host does not immediately become visible on other hosts, but I think the POSIX specification guarantees that operations are ordered within a process.

What is the backend of your NFS storage? Can't you fix this by changing server or client options (e.g. a caching parameter)? Because nobody has reported this error before, I suspect this is very specific to your system, and we are reluctant to incorporate changes only to address a sub-optimally configured system.

@clil16
Author

clil16 commented Nov 12, 2024

Thank you for taking the time to look at this issue.

This is very interesting. It is possible that a file system change on one host does not immediately become visible on other hosts, but I think the POSIX specification guarantees that operations are ordered within a process.

In this case, I was just opening the RELION GUI, so only a single host was involved. I agree it is very interesting and strange, but as I understand it, NFS is not POSIX compliant.

What is the backend of your NFS storage? Can't you fix this by changing server or client options (e.g. a caching parameter)? Because nobody has reported this error before, I suspect this is very specific to your system, and we are reluctant to incorporate changes only to address a sub-optimally configured system.

Our NFS storage system is a Dell PowerScale cluster (formerly known as Isilon) that has been working extremely well for several years, and no changes were made recently to any of the systems involved. I'm working on ways to test this more thoroughly, but because the issue is intermittent it is difficult to reproduce. I don't think our systems are sub-optimally configured, but there may be something cache-related causing the odd behavior.

@biochem-fan
Member

@clil16 Can you contact Dell support about this issue?

@sdrawson

sdrawson commented Nov 12, 2024

I just wanted to chip in and mention that we have seen very similar behavior starting in the last few months.

We have seen several instances where the lock-removal error is triggered by users running jobs, generally with the v5 beta but across several different commits. It happens intermittently when users run jobs as usual via the GUI, and frequently when jobs are triggered via the relion schemer.

We see this on multiple client machines, and from several different storage servers running both NFS v3 and v4.

@clil16
Author

clil16 commented Nov 13, 2024

@biochem-fan I can reach out to them regarding this issue, but given the note by @sdrawson it seems this may be an issue for others as well.

@biochem-fan
Member

@clil16 Please ask Dell. I believe it is the file system's responsibility to keep the ordering of operations within a process. If they say it is not, please ask them what is guaranteed and what is not.

@biochem-fan
Member

Just to make sure, aren't you using the async option when mounting NFS? The default is sync.

@clil16
Author

clil16 commented Nov 15, 2024

I will reach out to Dell and inquire.

We are using the async option, as that is the default, but we're working on testing the sync option.

For reference, here are the mount options we specify on the client (the hostnames, paths, and IP addresses have been sanitized):
nfsserver.fqdn:/path/to/share on /nfs/share type nfs4 (rw,nosuid,nodev,noatime,nodiratime,vers=4.2,rsize=1047672,wsize=1047532,namlen=255,hard,proto=tcp,nconnect=8,timeo=600,retrans=5,sec=sys,clientaddr=xxx.xxx.xxx.xxx,local_lock=none,addr=xxx.xxx.xxx.xxx)

@filonovd

Just for the record, we just tried the 'sync' option and it is a lot worse than 'async'. RELION fails almost instantly with the "error in removing directory .relion_lock" error.

P.S. I am a coworker of @sdrawson.

@biochem-fan
Member

We use auto,nofail,noatime,nolock,intr,tcp,lookupcache=pos.
