
High cpu usage after viewing a CryFS FUSE filesystem #1543

Closed
peesock opened this issue Dec 14, 2023 · 15 comments

Comments

@peesock

peesock commented Dec 14, 2023

CryFS is a program to encrypt a folder of files and view them in a FUSE mounted directory.

After lf even glances at a CryFS directory, it starts using about 14.5% CPU on my 8-thread machine, sometimes dropping down to ~3% for a while before going back up. Obviously CryFS has to use some CPU to build the file tree, but that finishes quite early on, as shown by the cryfs process, and the problem persists even when navigating far away from the CryFS dir.

btop also shows lf's thread count going up over time. One day I woke up to find it using over 1000 threads, which actually seemed to lag my system.

Pressing F1 to open the help text in the pager temporarily halts all CPU usage, which climbs back up when I quit the pager. Any ideas?

@Michael-Gallo
Contributor

Michael-Gallo commented Dec 17, 2023

I'm able to replicate this: about 10% CPU usage if I ever touch the mountdir directory.

It seems that the dirChan channel is somehow getting flooded with a ridiculous number of directories.

When I add a global counter that is incremented every time we reach line 384 (the dirChan receive in app.go) and print it when the program ends, I get a number in the thousands if I open lf in the mountdir for even a few seconds, versus 6 if I open any other directory.

Edit: also, adding even a 5 millisecond delay takes my CPU usage from 10% to 0.8%. The time is actually being spent in app.ui.draw on line 418.

Edit pt2: I think I see why this is happening. When a dir is received on dirChan, we call app.nav.checkDir. Then checkDir goes down a code path where it creates a newDir and passes it back into the same channel, so it loops on itself.

I'm unsure what checkDir is doing and what the purpose of that go func is, so I'm hesitant to propose a solution.
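
To make the cycle easier to see in isolation, here is a small standalone Go program. It is not lf's actual code, only a simplified illustration of the loop described above: every receive from the channel queues another send to the same channel, so the receive loop never goes idle.

package main

import (
	"fmt"
	"time"
)

func main() {
	dirChan := make(chan string, 16)
	dirChan <- "/path/to/mountdir" // the initial load

	receives := 0
	deadline := time.After(100 * time.Millisecond)
	for {
		select {
		case d := <-dirChan:
			receives++
			// The "check" always sees the dir as modified and queues a reload,
			// which re-enters this loop on the next iteration.
			dirChan <- d
		case <-deadline:
			fmt.Println("receives in 100ms:", receives)
			return
		}
	}
}

Running this prints a very large receive count after only 100ms, which is the same unbounded behavior as the counter observation above.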

@joelim-work
Collaborator

checkDir is simply used to check if a directory has been changed (e.g. a new file was created inside it) since the last time it was loaded by lf, and send an update to the dirChan channel to reload it if so. It is called frequently, through calls like app.nav.renew and app.ui.loadFile.

If the directory has been modified, the load timestamp is updated, so theoretically it should not cause an infinite loop:

lf/nav.go

Line 527 in 2a4adb3

dir.loadTime = now

You might want to check the modified time of the directory and the load timestamp to investigate further.
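
For anyone who wants to investigate this on their own machine, below is a tiny standalone probe (hypothetical, not part of lf; the modifiedSince helper and its CLI usage are made up for illustration). It performs the comparison described above: stat the directory and check whether its mtime is newer than a recorded load time, which is the condition that decides whether a reload is queued.

package main

import (
	"fmt"
	"os"
	"time"
)

// modifiedSince reports whether path's mtime is newer than loadTime.
func modifiedSince(path string, loadTime time.Time) (bool, error) {
	info, err := os.Stat(path)
	if err != nil {
		return false, err
	}
	return info.ModTime().After(loadTime), nil
}

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: staleprobe <dir>")
		return
	}
	loadTime := time.Now()
	for i := 0; i < 3; i++ {
		time.Sleep(time.Second)
		stale, err := modifiedSince(os.Args[1], loadTime)
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			return
		}
		fmt.Println("modified since load:", stale)
		if stale {
			loadTime = time.Now() // mimic dir.loadTime = now
		}
	}
}

On an ordinary directory this keeps printing false; on a directory whose mtime always reads as the current time it prints true on every pass, even though the load time is updated each iteration.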

@Michael-Gallo
Contributor

Michael-Gallo commented Dec 26, 2023

The modified time of a mounted cryfs directory always seems to be the current time. This is without any processes (other than cryfs) touching it


Just a thought: the Blocks count from os.Stat().Sys() on the mounted directory always shows 0, and I can't find any other directories on my system that behave like this. Perhaps we can use that as a sign to return early?
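
A quick standalone probe for that observation (Linux-specific, since it type-asserts to *syscall.Stat_t; not lf code): print the Blocks value that stat reports for each path given on the command line.

package main

import (
	"fmt"
	"os"
	"syscall"
)

func main() {
	for _, path := range os.Args[1:] {
		info, err := os.Stat(path)
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			continue
		}
		// Sys() exposes the raw stat data; Blocks is the allocated block count.
		if st, ok := info.Sys().(*syscall.Stat_t); ok {
			fmt.Printf("%s: %d blocks\n", path, st.Blocks)
		}
	}
}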

@joelim-work
Collaborator

That makes sense: if the modified time is constantly updated to the current time, then the directory will constantly be reloaded. I think it would be acceptable to add a check similar to the one below to perform an early return for special types of filesystems:

lf/nav.go

Lines 519 to 523 in 4882945

// XXX: Linux builtin exFAT drivers are able to predict modifications in the future
// https://bugs.launchpad.net/ubuntu/+source/ubuntu-meta/+bug/1872504
if s.ModTime().After(now) {
	return
}

Thank you for looking into the root cause.

@joelim-work
Collaborator

joelim-work commented Jan 4, 2024

So it turns out the root cause comes from cryfs (I'm guessing it's due to some underlying library), and the issue was raised a few years ago, but there haven't been any updates. I think it is fine to ignore mtime for cryfs directories as a workaround, though I would prefer it if the upstream issue were fixed.

@Michael-Gallo
Contributor

Following up on detecting this

Due to this weird behavior, the access, modify, and change times are always the same.

This code successfully flags CryFS directories for me without giving a false positive in /tmp:

// isCryfs reports whether the directory looks like a CryFS mount, based on the
// observation above that atime, mtime, and ctime are always identical there.
func isCryfs(directory string) bool {
	directoryStat, err := os.Stat(directory)
	if err != nil {
		return false
	}

	// The raw stat data is only available on platforms that expose syscall.Stat_t.
	syscallStat, isSysCallStat := directoryStat.Sys().(*syscall.Stat_t)
	if !isSysCallStat {
		return false
	}

	return syscallStat.Atim == syscallStat.Ctim && syscallStat.Ctim == syscallStat.Mtim
}

@joelim-work
Collaborator

Hi @Michael-Gallo

I think it's possible that we can try this approach, though I would still be inclined to keep the block count = 0 check. I'm not exactly keen on this kind of solution, but I think having users run into this high CPU usage situation is much worse.

cc @gokcehan
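
For illustration, a sketch of what the combined check could look like (hypothetical; it assumes the same os and syscall imports as the isCryfs function above, and simply adds the block count = 0 condition on top of the timestamp comparison):

// isCryfsStrict is a hypothetical stricter variant: it also requires the
// zero block count observed earlier, so an ordinary directory whose three
// timestamps happen to coincide is not misclassified.
func isCryfsStrict(directory string) bool {
	directoryStat, err := os.Stat(directory)
	if err != nil {
		return false
	}

	syscallStat, isSysCallStat := directoryStat.Sys().(*syscall.Stat_t)
	if !isSysCallStat {
		return false
	}

	sameTimes := syscallStat.Atim == syscallStat.Ctim && syscallStat.Ctim == syscallStat.Mtim
	return sameTimes && syscallStat.Blocks == 0
}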

@Michael-Gallo
Contributor

Just a thought

Maybe whether we run this check could be a boolean option of some sort? I think only a very small minority of users are using CryFS, so for the overwhelming majority this check can only cause unintended behavior, and it's an extra comparison they don't need.

@gokcehan
Owner

@joelim-work Unfortunately, I can't think of a good solution for this either. I think I'm ok with using @Michael-Gallo's new isCryfs function. I'm not a fan of workarounds specific to different filesystems, but sometimes it is inevitable, as with the exFAT case. Otherwise, lf becomes basically unusable on these filesystems. This was the first time I heard about CryFS, so I'm not sure if it is common enough to justify the workaround, but we can still give it a try.

A boolean option might also be a good idea, I'm not sure. But if we are thinking of a new option, how about a more general one to disable the checkDir function altogether, so that it can be used on any filesystem where it becomes a problem. Would that be acceptable?

@Michael-Gallo
Contributor

The project only has 1.9k stars on GitHub (https://github.com/cryfs/cryfs), I had also never heard of it prior to reading this issue, and fundamentally I would say the bug is on their end. So yes, I'm going to agree that we probably shouldn't have a filesystem-specific fix here.

Since the issue is fundamentally that checkDir inserts a directory into dirChan, which then calls checkDir again and causes an infinite loop: what if we made it so dirs have a boolean property with a name like loops? It would be true in almost all parts of the program, but when checkDir creates nd it would set that property to false; then *app.loop() could check that property so it doesn't call checkDir again when it's triggered by a dir that was itself created by checkDir.
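
Here is a standalone sketch of that guard (hypothetical and heavily simplified, reusing the loops field name from the proposal; not a patch against lf): values produced by the check itself are flagged so the receive loop skips re-checking them, which breaks the cycle.

package main

import (
	"fmt"
	"time"
)

type dir struct {
	path  string
	loops bool // false when this dir was produced by checkDir itself
}

func main() {
	dirChan := make(chan *dir, 16)
	dirChan <- &dir{path: "/path/to/mountdir", loops: true}

	checks := 0
	checkDir := func(d *dir) {
		checks++
		// Pretend the mtime always looks newer: queue a reload, but mark it
		// so the loop will not check it again.
		dirChan <- &dir{path: d.path, loops: false}
	}

	deadline := time.After(50 * time.Millisecond)
	for {
		select {
		case d := <-dirChan:
			if d.loops {
				checkDir(d)
			}
		case <-deadline:
			fmt.Println("checkDir calls:", checks) // 1, instead of unbounded
			return
		}
	}
}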

@joelim-work
Collaborator

After spending some more time investigating the issue, I'm not actually sure whether it's necessary to call checkDir again on directories received from dirChan, which is what allows such an infinite loop.

lf/app.go

Lines 381 to 384 in cf99626

case d := <-app.nav.dirChan:
	app.nav.checkDir(d)

According to the commit history, it looks like it was added in 3e36500 to fix issue #338. But the root cause appears to be the fact that directories were loaded asynchronously (via nav.getDirs) before reading the config file, so there is a race condition depending on whether the directories are loaded first, or the config file is read first. In this case the proper fix would have been to ensure the config file is read first before loading any directories, and this was done in 089e6dc.

@Michael-Gallo
Contributor

I opened PR #1607 to remove the call to checkDir and confirmed locally that it does fix the high CPU usage. Will wait for @gokcehan to confirm that this solution isn't likely to cause other problems.

@gokcehan
Owner

Good catch @joelim-work, thanks for digging through the history, and thanks @Michael-Gallo for submitting the patch. The only case I can think of is that maybe this can now be triggered if an option is changed dynamically, without the config file, while a directory is being loaded. Unfortunately, I have not had a chance to try it myself yet. In any case, we can consider those niche cases, so I think I'm ok with the patch as it is.

@joelim-work
Collaborator

joelim-work commented Feb 20, 2024

I did some testing with the new changes, and it looks like a race condition is possible again if set hidden is run dynamically with the following config file definition:

&lf -remote "send $id set hidden"

  1. The directory is loaded and sorted in nav.loadDirInternal with the default settings.
  2. set hidden is resolved, so now the setting value has changed. However, at this time the list of files is still empty, so there is nothing to sort.
  3. The loaded directory is then sent to nav.dirChan (I had to manually add a sleep here to ensure a race condition), but the files are sorted with the original settings, which are now out of date.

I still think it's a niche scenario though. Maybe a more proper fix would be to have nav.checkDir only check mtime and just sort the directory when receiving it from nav.dirChan, but that would end up being a bigger change.

@joelim-work
Collaborator

This is now fixed by #1607
