
RELION race condition when opening GUI #1198

Open
clil16 opened this issue Nov 8, 2024 · 10 comments

@clil16

clil16 commented Nov 8, 2024

Recently we have seen users with established RELION project directories lose the ability to open the RELION GUI in those directories with the following versions of RELION: 4.0.1-commit-e5c483 and 5.0-beta-4-commit-33b2b0. The behavior was the same each time it occurred, but it did not occur on every attempt.

We are using NFS v4.2 for our storage.

Environment:

  • OS: Red Hat Enterprise Linux 8.10
  • MPI runtime: OpenMPI 4.1.6
  • RELION version: 4.0.1-commit-e5c483, 5.0-beta-4-commit-33b2b0
  • Memory: N/A
  • GPU: N/A

Dataset:

  • Box size: N/A
  • Pixel size: N/A
  • Number of particles: N/A
  • Description: N/A

Job options:

  • Type of job: Launch GUI
  • Number of MPI processes: N/A
  • Number of threads: N/A
  • Full command (see note.txt in the job directory):
    relion
    

Error message:

in: /tmp/build/relion/src/pipeliner.cpp, line 2341
ERROR:
ERROR: PipeLine::write reported error in removing directory .relion_lock
=== Backtrace ===
relion(_ZN11RelionErrorC1ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES7_1+0x69) [0x47a639]
relion() [0x45a085]
relion (_ZN8PipeLine22checkProcessCompletionEv+0xbd9) [0x58a199]
relion (_ZN13GuiMainWindow14updateJobListsEv+0x10) [0x48a50]
relion (_ZN13GuiMainWindow23cb_select_browsegroup_iEb+0x91) [0x496021]
relion(_ZN13GuiMainWindowC2EiiPKc8FileNameiibbbb+0x30fa) [0x480aa]
relion (main+0x7fb) [0x466ccb]
/lib64/libc.so.6(__libc_start_main+0xe5) [0x7f17b4cd17e5]
relion(_start+0x2e) [0x47289e]
==================
ERROR:
ERROR: PipeLine::write reported error in removing directory .relion_lock

After tracing relion with strace, I saw the following at the end of the output:

stat(".relion_lock/lock_default_pipeline.star", {st_mode=S_IFREG|0640, st_size=23, ...}) = 0
unlink(".relion_lock/lock_default_pipeline.star") = 0
rmdir(".relion_lock")                   = -1 ENOTEMPTY (Directory not empty)
futex(0x7f0a518230a8, FUTEX_WAKE_PRIVATE, 2147483647) = 0

Once the program errors out and exits, the file .relion_lock/lock_default_pipeline.star has in fact been removed (as the return code of 0 from unlink indicates), but the rmdir on the directory seems to run before the file's removal is visible, which causes the ENOTEMPTY error.
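
To check whether the file system shows this behavior outside of RELION, one could repeat the same create/unlink/rmdir sequence in a tight loop. The program below is only a hypothetical sketch (it is not part of RELION, the names are placeholders, and whether it actually triggers the problem will depend on the NFS client's caching behavior):

// rmdir_race_check.cpp -- hypothetical standalone reproducer (not part of RELION).
// It repeats the create/unlink/rmdir sequence seen in the strace output above,
// using a throwaway directory name so it cannot clash with a real .relion_lock.
// Build with: g++ -o rmdir_race_check rmdir_race_check.cpp
#include <cerrno>
#include <cstdio>
#include <cstring>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

int main()
{
    const char *dir  = ".rmdir_race_test";
    const char *file = ".rmdir_race_test/lock_test.star";

    for (int i = 0; i < 10000; i++)
    {
        if (mkdir(dir, 0755) != 0) { std::perror("mkdir"); return 1; }

        int fd = open(file, O_CREAT | O_WRONLY, 0640);
        if (fd < 0) { std::perror("open"); return 1; }
        if (write(fd, "lock\n", 5) != 5) { std::perror("write"); return 1; }
        close(fd);

        if (unlink(file) != 0) { std::perror("unlink"); return 1; }

        // This is the call that intermittently fails with ENOTEMPTY in RELION.
        if (rmdir(dir) != 0)
        {
            std::fprintf(stderr, "iteration %d: rmdir failed: %s\n", i, std::strerror(errno));
            return 1;
        }
    }
    std::puts("no rmdir failure observed in 10000 iterations");
    return 0;
}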

I was able to fix this issue with the following patch for pipeliner.cpp on the 4.0.2 git tag:

--- src/pipeliner.cpp.org	2024-11-07 13:53:10
+++ src/pipeliner.cpp	2024-11-07 13:58:01
@@ -2300,8 +2300,18 @@
 			REPORT_ERROR("ERROR: PipeLine::write was expecting a file called "+fn_lock+ " but it is no longer there.");
 		if (std::remove(fn_lock.c_str()))
 			REPORT_ERROR("ERROR: PipeLine::write reported error in removing file "+fn_lock);
-		if (rmdir(dir_lock.c_str()))
-			REPORT_ERROR("ERROR: PipeLine::write reported error in removing directory "+dir_lock);
+		int iwait = 0;
+		int status = rmdir(dir_lock.c_str());
+		while (status != 0)
+		{
+			usleep(500000); // 0.5 s; sleep() takes whole seconds, so sleep(0.5) would not wait
+			status = rmdir(dir_lock.c_str());
+			iwait++;
+			if (iwait > 9)
+			{
+				REPORT_ERROR("ERROR: PipeLine::write reported error in removing directory " + dir_lock + ". Waited 5 seconds for the directory to be removed successfully.");
+			}
+		}
 	}
 
 	// Touch a file to indicate to the GUI that the pipeline has just changed
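
For readability, here is the same retry idea as a standalone sketch (the helper name and constants are mine, not from the RELION codebase); in RELION itself the failure path would still call REPORT_ERROR as in the patch above:

#include <string>
#include <unistd.h>

// Hypothetical sketch of the retry approach: try rmdir() up to max_attempts
// times, waiting 0.5 s between attempts, and report failure only if the
// directory never goes away (10 x 0.5 s = the 5 seconds mentioned in the patch).
static bool remove_lock_dir_with_retry(const std::string &dir_lock, int max_attempts = 10)
{
    for (int attempt = 0; attempt < max_attempts; attempt++)
    {
        if (rmdir(dir_lock.c_str()) == 0)
            return true;
        // ENOTEMPTY here usually means the just-unlinked lock file is not yet
        // invisible to this client; wait a little and try again.
        usleep(500000); // 0.5 s; note that sleep() takes whole seconds
    }
    return false;
}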

I could create a pull request as well if you'd like.

@biochem-fan
Member

biochem-fan commented Nov 8, 2024

This is very interesting. It is possible that a file system change on one host does not immediately become visible on other hosts, but I think the POSIX specification guarantees that operations are ordered within a process.

What is the backend of your NFS storage? Can't you fix this by changing server or client options (e.g. a caching parameter)? Because nobody has reported this error before, I suspect this is very specific to your system, and we are reluctant to incorporate changes only to address a sub-optimally configured system.

@clil16
Author

clil16 commented Nov 12, 2024

Thank you for taking the time to look at this issue.

This is very interesting. It is possible that a file system change on one host does not immediately become visible on other hosts, but I think the POSIX specification guarantees that operations are ordered within a process.

In this case, I was just opening the RELION GUI, so only a single host was involved. I agree it is very interesting and strange, but as I understand it, NFS is not POSIX compliant.

What is the backend of your NFS storage? Can't you fix this by changing server or client options (e.g. a caching parameter)? Because nobody has reported this error before, I suspect this is very specific to your system, and we are reluctant to incorporate changes only to address a sub-optimally configured system.

Our NFS storage system is a Dell PowerScale cluster (formerly known as Isilon) that has been working extremely well for several years, and no changes were made recently to any of the systems involved. I'm working on ways to test this more thoroughly, but because the issue is intermittent it is difficult to reproduce. I don't think our systems are sub-optimally configured, but there may be something cache-related causing the odd behavior.

@biochem-fan
Member

@clil16 Can you contact Dell support about this issue?

@sdrawson

sdrawson commented Nov 12, 2024

I just wanted to chip in and mention that we have seen very similar behavior starting in the last few months.

We have seen several instances where the lock-removal error is triggered by users running jobs, generally with the v5 beta but across several different commits. It happens intermittently when users run jobs as usual via the GUI, and frequently when jobs are triggered via the relion schemer.

We see this on multiple client machines, and from several different storage servers running both NFS v3 and v4.

@clil16
Author

clil16 commented Nov 13, 2024

@biochem-fan I can reach out to them regarding this issue, but given the note by @sdrawson it seems this may be an issue for others as well.

@biochem-fan
Member

@clil16 Please ask Dell. I believe it is the file system's responsibility to keep the ordering of operations within a process. If they say it is not, please ask them what is guaranteed and what is not.

@biochem-fan
Member

Just to make sure, aren't you using the async option when mounting NFS? The default is sync.

@clil16
Author

clil16 commented Nov 15, 2024

I will reach out to Dell and inquire.

We are using the async option, as that is the default, but we're working on testing the sync option.

For reference, here are the mount options we specify on the client (the hostnames, paths, and IP addresses have been sanitized):
nfsserver.fqdn:/path/to/share on /nfs/share type nfs4 (rw,nosuid,nodev,noatime,nodiratime,vers=4.2,rsize=1047672,wsize=1047532,namlen=255,hard,proto=tcp,nconnect=8,timeo=600,retrans=5,sec=sys,clientaddr=xxx.xxx.xxx.xxx,local_lock=none,addr=xxx.xxx.xxx.xxx)

@filonovd

Just for the record, we just tried the 'sync' option and it is a lot worse than 'async'. RELION fails almost instantly with the "error in removing directory .relion_lock" error.

P.S. I am a coworker of @sdrawson.

@biochem-fan
Member

We use auto,nofail,noatime,nolock,intr,tcp,lookupcache=pos.
