Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LOG_EVERY_N stops printing when using threads #749

Closed
lucabergamini opened this issue Dec 3, 2021 · 11 comments · Fixed by #770
Closed

LOG_EVERY_N stops printing when using threads #749

lucabergamini opened this issue Dec 3, 2021 · 11 comments · Fixed by #770
Labels
Milestone

Comments

@lucabergamini
Copy link

Hi,
master is currently not working properly when using LOG_EVERY_N and threads

minimal example to reproduce:

#include <gflags/gflags.h>
#include <chrono>
#include <thread>
#include "glog/logging.h"

void f(int id) {
  while (true) {
    std::this_thread::sleep_for(std::chrono::milliseconds(10));
    LOG_EVERY_N(INFO, 10) << "test: " << id;
  }
}

int main(int argc, char* argv[]) {
  gflags::ParseCommandLineFlags(&argc, &argv, true);
  google::InitGoogleLogging(argv[0]);
  LOG(INFO) << "before threads";

  std::thread t1(f, 0);
  std::thread t2(f, 1);

  t1.join();
  t2.join();
}

This will only print for some time before stopping.

The current release does not experience this issue, so it's probably due to a regression.

context:
glog version: 503e3dec8d1fe071376befc62119a837c26612a3
compiler: g++ (Ubuntu 8.4.0-1ubuntu1~18.04) 8.4.0
bazel: 4.2.1

@sergiud
Copy link
Collaborator

sergiud commented Dec 3, 2021

What does some time mean? Minutes, hours, days?

At least logging for a couple of minutes does not show the effect you describe. Also, have you tried the CMake build? Since the Bazel build is untested in contrast to the CMake one, there may be some discrepancy.

@lucabergamini
Copy link
Author

lucabergamini commented Dec 3, 2021

What does some time mean? Minutes, hours, days?

seconds in my case.

We have a system based on bazel, so I can't use CMake easily.

If the bazel build is untested I will just revert to the release and close this

@sergiud
Copy link
Collaborator

sergiud commented Dec 3, 2021

/cc @drigz

@drigz
Copy link
Member

drigz commented Dec 3, 2021

I can repro this sometimes after 30s or so but not reliably, so I haven't been able to bisect. I built with ThreadSantizer but it didn't report any problem, and still crashed after a while. I've run a few times under gdb with and without -c dbg but haven't managed to repro there. I haven't tried with CMake.

@sergiud
Copy link
Collaborator

sergiud commented Dec 3, 2021

Could you change unsigned to int here and see if it helps?

static std::atomic<unsigned> LOG_OCCURRENCES(0), LOG_OCCURRENCES_MOD_N(0); \

@drigz
Copy link
Member

drigz commented Dec 3, 2021

I haven't tried with a signed int but I guess you're on the right track. I caught the bad state in gdb:

(gdb) info locals
occurrences_9 = {<std::__atomic_base<unsigned int>> = {static _S_alignment = 4, _M_i = 65324}, <No data fields>}
occurrences_mod_n_9 = {<std::__atomic_base<unsigned int>> = {static _S_alignment = 4, _M_i = 4294731290}, <No data fields>}

_M_i = 4294731290 suggests that LOG_OCCURRENCES_MOD_N -= n happened twice in both threads and it wrapped around to ~MAXINT, and now every thread is executing LOG_OCCURRENCES_MOD_N -= n every time.

A signed int would avoid that problem. OTOH it could plausibly cause undefined behavior through overflows if N is close to MAXINT, but maybe that's just a theoretical concern.

@sergiud
Copy link
Collaborator

sergiud commented Dec 3, 2021

Thanks. Yes, undefined behavior caused by signed integer overflow is why I changed from int to unsigned.

@lucabergamini Could you please check the change?

@sergiud
Copy link
Collaborator

sergiud commented Dec 3, 2021

@drigz Given your analysis, it seems that a typical race condition in the conditional statement occurs because only the increment but not the conditional block as a whole is atomic. This would imply that the problem was always there but was negated by the use of the signed integer.

Therefore, changing the following line

if (++LOG_OCCURRENCES_MOD_N > n) LOG_OCCURRENCES_MOD_N -= n; \

to

if (LOG_OCCURRENCES_MOD_N++ == n) LOG_OCCURRENCES_MOD_N -= n;

and keeping unsigned could probably solve the race condition. Alternatively, we could compare-and-swap which requires an unconditional subtraction but is probably the cleanest solution.

@lucabergamini
Copy link
Author

I can give it a try later @sergiud, I'm using http_archive so I need to start a fork to have a remove commit achive bazel can use

@sergiud sergiud added the bug label Dec 3, 2021
@sergiud sergiud added this to the 0.6 milestone Dec 3, 2021
@drigz
Copy link
Member

drigz commented Dec 3, 2021

if (LOG_OCCURRENCES_MOD_N++ == n) LOG_OCCURRENCES_MOD_N -= n; feels risky to me as well. If a thread terminates between incrementing and subtracting (unlikely but possible?) we would end up in the same position. Or:

  • n == 2 and LOG_OCCURRENCES_MOD_N == 1
  • three threads simultaneously execute LOG_OCCURRENCES_MOD_N++, seeing 2, 3, and 4 respectively
  • the first thread subtracts, the others do nothing, so LOG_OCCURRENCES_MOD_N==2
  • the log message is not emitted until the counter wraps around

I'm wondering if we can somehow salvage the previous behavior, which was working for a long time. Maybe keeping unsigned and checking x=LOG_OCCURRENCES_MOD_N++; if (n < x && x < MAXINT)? This would continue to count up even if we subtract too far. Or maybe we could revert to the signed int, and add an annotation to tell the compiler that LOG_OCCURRENCES_MOD_N -= n is very unlikely to overflow so we don't mind what it does - and maybe adjust the docs to recommend avoiding the combination of N close to MAXINT and multi-threaded logging.

@sergiud
Copy link
Collaborator

sergiud commented Dec 6, 2021

Good points. Switching back to a signed integer is certainly the easiest solution. I'd rather prefer keeping unsigned to avoid undefined behavior which is admittedly unlikely to occur.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants