ccls hangs ps commands and then no responses in gvim #316
Can you confirm the ccls process is in the uninterruptible sleep state? It was probably writing cache files, but that shouldn't take a long time. https://github.com/MaskRay/ccls/wiki/Debugging
Can you dump a few entries of your …
Well, all I can say is that when I try "cat /proc/pid/cmdline" on the ccls process, it hangs and not even Ctrl-C can get me out of it. The only escape is kill -9.
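For reference, /proc/<pid>/stat and /proc/<pid>/status usually remain readable even when /proc/<pid>/cmdline blocks, because reading cmdline has to take the target's mmap_sem (the mechanism described in the rachelbythebay article cited in the issue description). A minimal sketch of checking the state that way; this is a hypothetical standalone helper, not part of ccls:

// procstate.cc - print the one-letter state of a process ('D' = uninterruptible sleep).
// Reads /proc/<pid>/stat, which does not require the target's mmap_sem.
#include <cstdlib>
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>

char ProcState(int pid) {
  std::ifstream f("/proc/" + std::to_string(pid) + "/stat");
  std::string line;
  std::getline(f, line);
  // Format: "<pid> (<comm>) <state> ..."; comm may contain spaces or parens,
  // so locate the last ')' and read the next field.
  std::string::size_type pos = line.rfind(')');
  if (pos == std::string::npos)
    return '?';
  std::istringstream rest(line.substr(pos + 1));
  char state = '?';
  rest >> state;
  return state;
}

int main(int argc, char **argv) {
  if (argc < 2)
    return 1;
  std::cout << ProcState(std::atoi(argv[1])) << "\n";
}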
See #199.
I don't think it is the same issue as #199, so I don't think it should be closed:
In offline discussion, this was diagnosed as an out-of-memory issue. It was fixed with the initialization option:

{
  "index": {
    "threads": 4
  }
}

When using LanguageClient-neovim, this object should be wrapped in initializationOptions.
I'm not sure I would call this a fix. I don't think it is ever acceptable for any program to go into the uninterruptible sleep state. ccls should try to prevent this from happening, alert the user, and give a hint to reduce the number of threads. When I first tried ccls in September 2018, six months ago, I ran into this issue and just gave up on ccls. Six months later, when I tried again, I posted this issue about it, and it took us some time after creating it to figure out what was going on. I think ccls should perform some checks when it starts up to estimate whether out-of-memory or uninterruptible sleep is possible: it could multiply std::thread::hardware_concurrency() by some estimate of memory consumption per thread and compare that to the total memory available in the system (see the sketch below). If the numbers are close, ccls can put a warning into the log so the user has some clue as to what is going on. Another option is for ccls to refuse to run, or to change the default behavior to use fewer threads so that it fits within the memory footprint. I don't think it is acceptable for ccls to ever go into uninterruptible sleep, and it should try to prevent this as much as possible.
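A rough sketch of that kind of startup check, assuming a hypothetical per-thread memory estimate (the function name, the budget fraction, and the estimate parameter are made up for illustration; this is not ccls code):

#include <cstdio>
#include <thread>
#include <unistd.h>

// Hypothetical helper: cap the indexer thread count so that
// threads * estimated-bytes-per-thread stays within a fraction of physical RAM.
int SuggestIndexerThreads(long long bytes_per_thread_estimate) {
  long long total = (long long)sysconf(_SC_PHYS_PAGES) * sysconf(_SC_PAGE_SIZE);
  int hw = (int)std::thread::hardware_concurrency();
  long long budget = total / 2;  // leave headroom for the rest of the system
  int fit = (int)(budget / bytes_per_thread_estimate);
  if (fit < 1)
    fit = 1;
  if (fit < hw)
    std::fprintf(stderr,
                 "warning: capping indexer threads at %d (hardware_concurrency=%d) "
                 "to stay within memory\n", fit, hw);
  return fit < hw ? fit : hw;
}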
I think I have a good enough understanding of both the cquery and ccls code bases to make the claim that this isn't a ccls issue. If ccls uses too much memory, you should more likely see … Though I won't reopen the issue, I'm still commenting here because I am curious and want to help. You may be less motivated to do that, but here are some tips.

Run ccls through a wrapper script:

#!/bin/sh
CCLS_TRACEME=s /path/to/ccls/Release/ccls -v=2 -file-log=/tmp/ccls.log

With CCLS_TRACEME=s, ccls will call TraceMe() and stop itself with SIGSTOP, so you can attach a debugger to the stopped process and resume it with SIGCONT.
I believe I found one of the root causes of the uninterruptible sleep! By fixing this bug, ccls no longer goes into uninterruptible sleep right away. It still goes into uninterruptible sleep much, much later, but I think I have peeled the first layer of the onion on this problem! There is probably another data structure that is not properly protected and causes the uninterruptible sleep later. From pipeline.cc, in Indexer_Parse(), around lines 322-325:
I have caught MANY, MANY threads trying to write to path2entry_index at the same time! I have attached two screenshots from the same core dump that show this. When I change every instance of locking of project->mtx to use std::unique_lock, I can delay the uninterruptible sleep by a very long time. It does still happen, though, so I think there is probably another data structure in ccls that is not properly protected. Since I have 72 CPU cores and 72 threads, I have many more threads operating in parallel than other users, which is probably why I see this problem and others do not. Reducing the number of threads also makes it less likely that multiple threads access the data structure at the same time.
I'm able to make the uninterruptible sleep go away completely by changing the locks on g_index_mutex in pipeline.cc! I changed every std::shared_lock on this mutex to std::unique_lock, as well as changing the std::lock_guard to std::unique_lock (not sure that is necessary, but I did it anyway). Now I don't go into uninterruptible sleep at all anymore, and I'm using all 72 threads! So I think there are bugs in ccls around the usage of std::shared_lock in sections of code that write to data structures. std::shared_lock should only be taken around reads, but ccls takes it around writes! This causes data corruption, and somehow that leads to uninterruptible sleep on my system. Perhaps for others it manifests as "Out of Memory", because who knows what problems corrupted data structures can lead to. So I think ccls should fix this bug. It explains why I see this issue with ccls and not cquery: ccls is not taking proper locks around data structures accessed by multiple threads.
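The general rule being described, as a minimal standalone sketch (illustrative names, not the actual ccls data structures): readers of a std::shared_mutex take std::shared_lock, while anything that mutates the guarded structure must take exclusive ownership.

#include <map>
#include <mutex>
#include <shared_mutex>
#include <string>

std::shared_mutex mtx;
std::map<std::string, int> path2entry;  // stand-in for path2entry_index

int Lookup(const std::string &path) {
  std::shared_lock lock(mtx);           // shared: many readers may hold this at once
  auto it = path2entry.find(path);
  return it == path2entry.end() ? -1 : it->second;
}

void Store(const std::string &path, int id) {
  std::lock_guard lock(mtx);            // exclusive: required for any write
  path2entry[path] = id;
}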
Great finding! Thank you for pinpointing this incorrect usage of std::shared_lock.
I don't remember why I at some point switched to std::shared_lock here. The main thread (sometimes a reader, sometimes a writer) and the indexer threads (always writers) access path2entry_index. The race condition was difficult to trigger because the time spent in the critical section is short; with 72 threads the contention becomes likely.

  IndexUpdate update = IndexUpdate::CreateDelta(nullptr, prev.get());
  on_indexed->PushBack(std::move(update),
                       request.mode != IndexMode::NonInteractive);
  if (entry.id >= 0) {
    std::shared_lock lock2(project->mtx); ///////////// should use std::lock_guard
    project->root2folder[entry.root].path2entry_index[path] = entry.id;
  }
}

This commit is included in 0.20181225.8, so I should make a new patch release, 0.20181225.9.
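For clarity, the corrected lines amount to taking the exclusive lock on that write path (same code as above, just swapping the lock type as the comment suggests):

  if (entry.id >= 0) {
    std::lock_guard lock2(project->mtx);  // exclusive: this inserts into path2entry_index
    project->root2folder[entry.root].path2entry_index[path] = entry.id;
  }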
Hope this is the exact issue that leads to "Out of Memory". I'm still very suspicious about why it does that... Did you get a chance to dump …
Thanks for your 0.20181225.9 patch! I took a look at it, but you only addressed one of the two shared_mutex uses that were causing the uninterruptible sleep. There is also g_index_mutex in pipeline.cc. I still got uninterruptible sleep from g_index_mutex until I changed all instances to use std::unique_lock, so I think you should patch that one as well to not use shared_mutex anymore; otherwise I will still have the uninterruptible sleep issue. A review of the code shows that the RemoveCache() function modifies g_index, so at a minimum a shared_lock should not be used there. I did not try changing just that one; I changed all instances of g_index_mutex to use unique_lock. It is probably safer not to use a shared mutex at all: the loss in performance is probably minimal, and you reduce the chance of future errors if g_index_mutex is not shared. I notice that the RemoveCache() function is not in the 20181225.9 release but is in the latest code, and you can see it was coded with shared_lock when it should not have been. If this were not a shared mutex, that error would not have been made. Maybe it is better to make a small sacrifice in speed for safety.
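A self-contained sketch of the point being made about RemoveCache(), using stand-ins for the pipeline.cc globals (this is the locking pattern being argued for, not the actual ccls implementation): erase() mutates g_index, so the write must hold the mutex exclusively.

#include <mutex>
#include <shared_mutex>
#include <string>
#include <unordered_map>

// Stand-ins for the globals shown in the pipeline.cc excerpts below.
struct InMemoryIndexFile { std::string content; };
std::shared_mutex g_index_mutex;
std::unordered_map<std::string, InMemoryIndexFile> g_index;

void RemoveCache(const std::string &path) {
  std::lock_guard lock(g_index_mutex);  // not std::shared_lock: erase() is a write
  g_index.erase(path);
}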
I'm still getting the uninterruptible sleep with the 0.20181225.9 tag. I ran out of time today to do anything more. To be clear, I got rid of the sleep using 0.20190308 and changing BOTH mtx and g_index_mutex to always use std::unique_lock.
I think I don't understand it. 0.20190308 is a broken version; I've deleted it.

std::optional<std::string> LoadIndexedContent(const std::string &path) {
  if (g_config->cacheDirectory.empty()) {
    std::shared_lock lock(g_index_mutex);
    auto it = g_index.find(path);
    if (it == g_index.end())
      return {};
    return it->second.content;
  }
  return ReadContent(GetCachePath(path));
}

std::unique_ptr<IndexFile> RawCacheLoad(const std::string &path) {
  if (g_config->cacheDirectory.empty()) {
    std::shared_lock lock(g_index_mutex);
    auto it = g_index.find(path);
    if (it == g_index.end())
      return nullptr;
    return std::make_unique<IndexFile>(it->second.index);
According to C++ [container.requirements.dataraces], concurrent calls to non-modifying functions like find do not introduce a data race, so std::shared_lock is sufficient for these read-only paths. Besides, if you don't set the cache directory to the empty string, g_index is not used at all.
It may not be g_index_mutex that is causing the issue with the 20181225 release. Perhaps there is another data structure that needs protecting that was fixed between 20181225 and 20190308? I'll try a few things later when I have time; I can't work on this at the moment. With 20181225.9 I get the freeze right away. With 20190308, after fixing mtx and not fixing g_index_mutex, it took a long time to get the freeze.

Also, I noticed that in the 190308 release, the JSON file that contains my cache directory is not looked at, and the default cache directory under my HOME is used. The 181225 release did respect what was in my JSON file. Not sure whether this has anything to do with it or not.

Another observation: when you use a shared_mutex and want the lock for exclusive access, aren't you supposed to use std::unique_lock rather than std::lock_guard? Not sure if that has anything to do with this.
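On the std::unique_lock vs std::lock_guard question above: both take the exclusive lock on a std::shared_mutex, so either is correct for a writer. std::lock_guard is the simpler, scope-only form, while std::unique_lock additionally supports deferred locking, early unlock(), and moving. The bug under discussion is specifically std::shared_lock around writes, not lock_guard vs unique_lock. A tiny illustration (standalone, not ccls code):

#include <shared_mutex>

std::shared_mutex m;

void WriterA() {
  std::lock_guard lock(m);   // exclusive for the whole scope
}

void WriterB() {
  std::unique_lock lock(m);  // also exclusive; can unlock() early, be moved, etc.
}

void Reader() {
  std::shared_lock lock(m);  // shared; only safe for code that does not mutate
}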
See my old comments. Would be great if …

That can't happen. The cache directory defaults to …

Before Feb 21, since …

Did you mean cacheDirectory? That option is now cache.directory.
I should have been more specific. All I have in my JSON file is "cacheDirectory", and that appeared not to be respected in the 190308 release. From what you said above, it needs to change to "cache.directory". I never specified cache.format or cacheFormat before.
I reran with the 181225.9 release and it freezes right away. I tried to look at all the stack files under /proc/PID, but there is only one:

The catf script is simply:
With the 190314 release, no freeze! And indexing is very, very fast, and uses less memory than cquery: 1.8% of memory with ccls vs 2.1% with cquery, on a machine with 256 GB of RAM. Is it worth pursuing why it freezes with the 181225 release but not with 190314? Maybe it is at least worth having ccls know what can cause the freeze and print an error before actually freezing? I'm willing to do some experiments to help track it down.

With 190314, my cache.directory is maybe not respected? I use the JSON below, but I don't see HOME/.ccls_cache or /export/sim/rbresali/cclsdata/LangClient/cache directories existing:

edit: OK, I see the cache under projectroot/.ccls_cache. But why doesn't it like my JSON file? I have this in my .vimrc and it works fine with the 181225 release (with cacheDirectory instead):
I also noticed last week that when I put index.threads in the JSON file it was not respected; I had to move it to --init. I wonder, if both --init and a JSON file are given, whether the JSON file is simply not read?
I tried again today with the 190314 release and the default 72 threads, and now I get the freeze with the 190314 release! I then limited it to 50 threads, and no freeze. I don't understand. I am not the only user of this system: some other users were doing builds/links, and I also see root running an "rngd" process that is taking significant CPU on one thread. Maybe it all depends on how other users/processes are loading the system? If ccls takes all my threads but does not have exclusive access to the machine, maybe that can cause issues? I do notice with 50 threads that the threads often enter the D state, but then they come out. If other processes are vying for my cores, maybe ccls can't get out of these states? Why can cquery use more than 110 threads on my machine without freezing while ccls can't?

Also, I just tried ccls in a debug build: the debug build freezes with 50 threads, but the release build did not freeze with 50 threads.
This may well be the issue. Can you apply the following patch?

diff --git c/src/messages/initialize.cc w/src/messages/initialize.cc
index c0344a34..d91e84a2 100644
--- c/src/messages/initialize.cc
+++ w/src/messages/initialize.cc
@@ -380,7 +380,7 @@ void Initialize(MessageHandler *m, InitializeParam &param, ReplyOnce &reply) {
LOG_S(INFO) << "start " << g_config->index.threads << " indexers";
for (int i = 0; i < g_config->index.threads; i++)
- SpawnThread(Indexer, new std::pair<MessageHandler *, int>{m, i});
+ SpawnThread(Indexer, new std::pair<MessageHandler *, int>{m, i}, i > 0);
// Start scanning include directories before dispatching project
// files, because that takes a long time.
diff --git c/src/platform.hh w/src/platform.hh
index 79aca558..9e190aba 100644
--- c/src/platform.hh
+++ w/src/platform.hh
@@ -28,5 +28,5 @@ void FreeUnusedMemory();
// Stop self and wait for SIGCONT.
void TraceMe();
-void SpawnThread(void *(*fn)(void *), void *arg);
+void SpawnThread(void *(*fn)(void *), void *arg, bool low = false);
} // namespace ccls
diff --git c/src/platform_posix.cc w/src/platform_posix.cc
index c172446b..ee14d62c 100644
--- c/src/platform_posix.cc
+++ w/src/platform_posix.cc
@@ -73,7 +73,7 @@ void TraceMe() {
raise(traceme[0] == 's' ? SIGSTOP : SIGTSTP);
}
-void SpawnThread(void *(*fn)(void *), void *arg) {
+void SpawnThread(void *(*fn)(void *), void *arg, bool low) {
pthread_t thd;
pthread_attr_t attr;
struct rlimit rlim;
@@ -85,6 +85,13 @@ void SpawnThread(void *(*fn)(void *), void *arg) {
pthread_attr_setstacksize(&attr, stack_size);
pipeline::ThreadEnter();
pthread_create(&thd, &attr, fn, arg);
+ if (low) {
+#ifdef __linux__
+ sched_param priority;
+ priority.sched_priority = 0;
+ pthread_setschedparam(thd, SCHED_IDLE, &priority);
+#endif
+ }
pthread_attr_destroy(&attr);
}
} // namespace ccls
Thanks for your patch. I still get the freeze with it and 72 threads :( My current theory is that there is still some race condition somewhere in the code where a data structure that should be protected is not. Depending on when I run ccls and the load on the system, the timing may or may not trigger it; the more threads I run, the more likely the race is to occur. When I ran over the weekend the system was lightly loaded because my coworkers were not working, so the timing was different than on Monday afternoon when I got the freeze. I think I also ran Monday morning and didn't get the freeze, probably also on a lightly loaded system. One thing is for sure: when I patched the shared_mutex issue on Sunday, it seemed to make a big difference, as I did the before/after of that patch at about the same time, and after that patch 190308 didn't freeze while 181225 did (with a similar patch).

Some more data for you: with a debug build yesterday, I got the freeze with 50 threads and also with 40 threads. When I went to 30 threads, no more freeze. But with the release build yesterday I was able to run 50 threads without a freeze.

I also now understand my JSON issues: I had to wrap everything under "initializationOptions", as you did mention earlier, but unfortunately I didn't catch that before. Since you know the code best, maybe it is time for a review to make sure all shared data structures are properly protected?
Some more interesting information. With the SCHED_IDLE patch I ran the following sequence of experiments:
1. First try: 72 threads, freeze.
2. SCHED_IDLE patch, 30 threads (cache directory not deleted): freeze.
3. SCHED_IDLE patch removed (cache directory not deleted), release build, 30 threads: freeze.
4. Base 190314 code without the patch, cache directory deleted, 20 threads: no freeze.
5. Cache directory deleted, SCHED_IDLE patch re-applied, 30 threads: no freeze.
6. Cache directory deleted, SCHED_IDLE patch, 72 threads: freeze.
7. Cache directory deleted, SCHED_IDLE patch removed, 72 threads: freeze.
8. Cache directory deleted, no SCHED_IDLE patch, 30 threads: no freeze.
I'm not sure how much more that sort of trial-and-error testing will help find the problem; the failure seems pretty random to me, so it doesn't really point in a particular direction. Has anyone considered building ccls with TSAN, just to see whether it can find anything? Something like that seems more promising, if the results are usable.
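For anyone trying this, here is a minimal, self-contained illustration (not ccls code) of the kind of defect a -fsanitize=thread build reports: two threads touching the same container without synchronization produce a "WARNING: ThreadSanitizer: data race" report on stderr at runtime.

// Build: clang++ -fsanitize=thread -g race.cc && ./a.out
#include <map>
#include <string>
#include <thread>

std::map<std::string, int> index_map;  // shared, intentionally unprotected

int main() {
  std::thread writer([] { index_map["a.cc"] = 1; });
  std::thread reader([] { (void)index_map.count("a.cc"); });
  writer.join();
  reader.join();
}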
Thanks @madscientist for the tip. I just tried enabling TSAN by doing this before my ccls build:

I'm using clang 7.0.1 to compile ccls. I re-ran with 72 threads and unfortunately it freezes very quickly again, with nothing on my stderr :( I tried with 20 threads and it doesn't freeze, but there is also nothing on my stderr :(
Below is what I do for ThreadSanitizer:

cmake -H. -BTSAN -G Ninja -DCMAKE_BUILD_TYPE=Debug \
  -DCMAKE_CXX_COMPILER=~/Android/aosp-master/prebuilts/clang/host/linux-x86/clang-r353983/bin/clang++ \
  -DCMAKE_CXX_FLAGS='-fsanitize=thread -gsplit-dwarf' -DCMAKE_EXE_LINKER_FLAGS='-fuse-ld=lld -fsanitize=thread -Wl,--gdb-index' \
  -DCMAKE_PREFIX_PATH="$HOME/llvm/Release;$HOME/llvm/Release/tools/clang;$HOME/llvm;$HOME/llvm/tools/clang"

I have not found issues so far (I don't have many threads on my current laptop, so...).

You may just use …; clang's driver already links the sanitizer runtime dependencies for you:

void tools::linkSanitizerRuntimeDeps(const ToolChain &TC,
                                     ArgStringList &CmdArgs) {
  // Force linking against the system libraries sanitizers depends on
  // (see PR15823 why this is necessary).
  CmdArgs.push_back("--no-as-needed");
  // There's no libpthread or librt on RTEMS & Android.
  if (TC.getTriple().getOS() != llvm::Triple::RTEMS &&
      !TC.getTriple().isAndroid()) {
    CmdArgs.push_back("-lpthread");
    if (!TC.getTriple().isOSOpenBSD())
      CmdArgs.push_back("-lrt");
  }
}
Observed behavior
I am trying to use ccls on my extremely large proprietary codebase, which I am unable to share.
After a short amount of time, ccls ends up in a state that is very similar to what is described here: https://rachelbythebay.com/w/2014/10/27/ps/
My "ps -ef" command hangs when it reaches "ccls". Also, if I do "cat /proc/<pid>/cmdline", I get a hang. I am using gvim 8.1.328 with the "autozimu/LanguageClient-neovim" plugin. When I try commands like "hover" or "goto definition" in gvim, there is no response.
I do not have these issues when using "cquery" on the same codebase with the same compile_commands.json and the same set of source files. I am able to browse with "cquery" but not with "ccls" due to this freeze.
If I use ccls with a much, much smaller sample project, it works fine and I don't get this hang.
I do not have root access on the machine where I do the source code development.
I encountered these issues with ccls originally in September 2018 and was using cquery instead because of this problem. I decided to try ccls again and am still getting this issue.
If I quit gvim, ccls is still hanging my ps commands. A simple "kill <pid>" won't stop ccls; I need to perform "kill -9 <pid>" to unhang ps.
Expected behavior
Should not get a hang on "ps -ef" when I use ccls, and should not get a hang when I do "cat /proc/<pid>/cmdline".
Steps to reproduce
System information
ccls version (git describe --tags):

$ git describe --tags
0.20181225.8

$ uname -a
Linux hostname 3.10.0-327.el7.x86_64 #1 SMP Thu Nov 19 22:10:57 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

gvim 8.1.328, with the autozimu/LanguageClient-neovim plugin. I don't know how to get the plugin version; I installed it last fall. It works with cquery.