pylint 2.7 hangs on many-core Windows machines #6965

Closed
randomascii opened this issue Jun 16, 2022 · 13 comments · Fixed by #7035
Labels: Bug 🪲 · Crash 💥 (A bug that makes pylint crash) · multiprocessing · Windows 🪟 (Bug affecting only Windows users)

@randomascii (Contributor)

Bug description

We use PyLint in Chromium and we have been switching to Python 3 and PyLint 2.7 recently. While testing on a many-core (96 logical processor) machine I found that PyLint reliably hangs. This is due to a known Python 3 bug with multiprocessing on machines with more than 56 (possibly more than 63 or 64) logical processors. We first hit this bug last year (discussed in crbug.com/1190269) but only recently discovered that PyLint is vulnerable.

If we were running PyLint from the command line then we could just pass a -j value that is capped at 56 (or 63, or 64) but I have not been able to figure out how to cap the jobs value when invoking PyLint through lint.Run().

Internally PyLint should cap its parallelism until the Python 3 bug (not yet reported I believe, sorry) is fixed.

Configuration

96 logical processor Windows machine
The version of Windows does not seem to matter. This only happens on Python 3 and it is believed to affect all versions of Python 3.

Command used

I invoke pylint quite directly using "git cl presubmit --all --force" from Chromium. This ends up invoking many steps including pylint_main.py, like this:

  from pylint import lint  # pylint: disable=bad-option-value,import-outside-toplevel
  lint.Run(argv)

https://source.chromium.org/chromium/chromium/tools/depot_tools/+/main:pylint_main.py;l=17?q=pylint_main.py&sq=

Pylint output

No output - it hangs.

Expected behavior

I expect the PyLint run to complete.

Pylint version

pylint 2.7
Python 3.8.10

OS / Environment

Windows 10

Additional dependencies

No response

randomascii added the Needs triage 📥 label Jun 16, 2022
DanielNoord added the Bug 🪲, Crash 💥, and multiprocessing labels and removed the Needs triage 📥 label Jun 16, 2022
@DanielNoord (Collaborator)

Thanks for the report @randomascii!

Seems reasonable to fix and also pretty straightforward. I would like to be able to link to a CPython issue or other related issue if possible. I saw in the Chromium tracker that somebody had an upstream bug report on their todo list. Do you know if they are still working on that? Given the age of the pylint code base, we tend to lose track of fixes for CPython (or other interpreter) behaviour unless there are explicit issues that refer to them.

@randomascii (Contributor, Author)

I am assuming that no upstream Python bug was ever filed, and I believe that that person is not currently working on Chromium and therefore will not be filing one.

BTW, if you have any advice on how to pass a jobs count through lint.Run() then that would be helpful. I suspect it is easy but I couldn't divine the incantation. Until then I'm stuck using my machine with just 28 logical processors :-)

@Pierre-Sassoulas (Member)

It seems easy to fix once we know the precise number we should use. Maybe 50 is close enough?

Regarding using the pylint API, there's this documentation:

I suppose it would work with:

from pylint.lint import Run

Run(argv=["--jobs 42", "myfile.py"])

@randomascii (Contributor, Author)

I will run some tests to try to determine the correct number.

Thanks for the pointer. It looks like the correct syntax is this:

lint.Run(['--jobs', '42'] + argv)

That is, '--jobs' and '42' need to be separate arguments. argv is our existing list of files to analyze.

@randomascii (Contributor, Author) commented Jun 16, 2022

So, the good news is that the syntax above lets me increase the number of jobs. This means that I can reproduce the hangs on my four-core eight-logical-processor laptop. So, yay! This let me determine that 61 jobs reliably causes hangs, whereas 60 does not.

This odd limit is presumably because WaitForMultipleObjects allows a maximum of MAXIMUM_WAIT_OBJECTS in its array (MAXIMUM_WAIT_OBJECTS == 64) and there must be four other handles (besides the child processes) which multiprocessing ends up waiting on.
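
A minimal repro sketch (hypothetical code, but the threshold matches what I measured):

  import multiprocessing

  def square(x):
      return x * x

  if __name__ == "__main__":
      # 60 workers completes; 61 hangs (or raises ValueError, depending
      # on the Python version) on my Windows machines.
      with multiprocessing.Pool(61) as pool:
          print(pool.map(square, range(100)))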

The bad news is that the --jobs feature does not let me decrease the number of jobs. That is, if I pass '--jobs', '42' on my 96 logical-processor machine then I still get 96 jobs (verified with procexp.exe and by using pskill.exe when things get hung). I don't know why this is the case - I couldn't find any corresponding logic in run.py, so I guess I'm looking in the wrong place.

I've noticed that spawning the child processes is extremely slow - about 1-2 processes per second. So, 60-way parallelism generally means that there is about a 30 s delay while things spin up, then almost no time spent doing work. This was happening despite my CPUs being about 70% idle. This slow startup time presumably explains the bug I saw about how ineffectual multiprocessing was for pylint. You would need a huge batch of files to justify doing significant multiprocessing.

I don't know whether the slow startup is a bug in multiprocessing or in pylint. I just measured some of our presubmits and 60-way parallelism more than doubles the time that they take, from ~30 to ~70 s.

I don't know how long the average file takes to process but presumably jobs should be set to something like this:

  jobs = _cpu_count()
  if jobs > num_files // 10:
    jobs = 1 + num_files // 10
  # multiprocessing hangs on Windows if jobs > 60. Capping at
  # 32 until the bug is understood and because of diminishing returns.
  if jobs > 32:
    jobs = 32
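
On the slow spawning: here is the sort of measurement sketch one could use to quantify the per-process overhead (hypothetical, not the code I actually used; pylint's workers also each construct a PyLinter, which adds further startup cost):

  import multiprocessing
  import time

  def noop():
      pass

  if __name__ == "__main__":
      # Time the bare spawn of 60 workers (Windows uses the spawn
      # start method, which is the slow path in question).
      start = time.perf_counter()
      procs = [multiprocessing.Process(target=noop) for _ in range(60)]
      for p in procs:
          p.start()
      print(f"spawned 60 processes in {time.perf_counter() - start:.1f} s")
      for p in procs:
          p.join()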

@DanielNoord (Collaborator) commented Jun 16, 2022

> This odd limit is presumably because WaitForMultipleObjects allows a maximum of MAXIMUM_WAIT_OBJECTS in its array (MAXIMUM_WAIT_OBJECTS == 64) and there must be four other handles (besides the child processes) which multiprocessing ends up waiting on.

This indeed seems like a bug in multiprocessing. You seem to have much more knowledge about this than I do (I had never heard of logical processors until today). I did some (short) digging in the Python bug tracker and found python/cpython#71090 and then python/cpython#72355.
I think the comment in python/cpython#71090 (comment) might allude to a potential problem here? More events being listened to, so the previous fixes no longer work?

Either way, I think a bug report to CPython would be appreciated. Now that they are on GitHub that should be fairly easy. I'd be happy to do so, but considering you clearly have more knowledge about the topic perhaps it is better if you could do this?

> The bad news is that the --jobs feature does not let me decrease the number of jobs. That is, if I pass '--jobs', '42' on my 96 logical-processor machine then I still get 96 jobs.

Hmm, that's weird. If you're able to do so it would be helpful if you could add a print(jobs) at L139 here:
https://github.com/PyCQA/pylint/blob/bc7f8dac5a1d9ac2a1f6966029f5fc134d139f2f/pylint/lint/parallel.py#L140

That's what we are passing to multiprocessing so it would narrow down the issue to either pylint or multiprocessing.

> This slow startup time presumably explains the bug I saw about how ineffectual multiprocessing was for pylint. You would need a huge batch of files to justify doing significant multiprocessing.

Yeah, this is a big issue, which I'm not actually sure how to resolve. We would need a complete rewrite of the parallel code. We currently spin up a new PyLinter instance for every job, and that is (probably) what takes so long.
But I'm not sure what the best approach is to create a PyLinterLite...

> I don't know whether the slow startup is a bug in multiprocessing or in pylint. I just measured some of our presubmits and 60-way parallelism more than doubles the time that they take, from ~30 to ~70 s.

I feel like it is a combination but we could certainly do better here.

@randomascii (Contributor, Author)

Thanks for all the help!

Now that I know enough about this issue I will file a CPython bug and get the root cause fixed.

I dropped in a bit of logging as you suggested and eventually realized that the --jobs weirdness is self-inflicted. We are already passing --jobs to PyLint when we run it in our self-defined parallel mode. When we run it in serial mode we don't pass --jobs at all, so PyLint uses its default. So, all I was doing was changing our serial mode into parallel. Sigh... And it is ironic that we were already passing --jobs to lint.Run, just in a totally different part of the code.

So, I can work around this and I will cap the value at some reasonable number to avoid this bug and reduce startup overhead.

I don't know if PyLint should add a mitigation or if it should just wait for a CPython fix.

@DanielNoord (Collaborator)

> Now that I know enough about this issue I will file a CPython bug and get the root cause fixed.

👍

> So, I can work around this and I will cap the value at some reasonable number to avoid this bug and reduce startup overhead.

🎉

> I don't know if PyLint should add a mitigation or if it should just wait for a CPython fix.

Well, I don't think this will be classified as a security bug so it won't be backported to all the versions we support. So we should at least mitigate on older versions.
And CPython development can be slow for issues with low interest. Considering nobody else has run into this yet, I'm not sure it will be top priority 😄

Btw, since you did most of the work here you're obviously more than welcome to make the bug fix for pylint yourself. You've already familiarised yourself with the relevant code. But I can understand if this issue has taken you more time than planned.

@Pierre-Sassoulas (Member)

Thank you for the analysis @randomascii, much appreciated. I opened #6967 for the other issue with multi-processing startup time.

@randomascii (Contributor, Author)

This is the change I am proposing in Chromium's depot_tools. By limiting cpu_count and _pool_size this may mitigate this Python 3 bug across multiple parts of our presubmit system. Plus, I cap the number of jobs at 1 + file_count // 10 to avoid the situation where 96 child processes are created for just a few inputs (I assume that input files are not shared across processes, but I might be wrong). Here is the proposed depot_tools/Chromium CL:

https://chromium-review.googlesource.com/c/chromium/tools/depot_tools/+/3711282
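
The core capping logic looks roughly like this (illustrative names and cap value; see the CL for the real code):

  import multiprocessing

  def capped_job_count(file_count):
      jobs = multiprocessing.cpu_count()
      # Don't create 96 child processes for just a few input files.
      jobs = min(jobs, 1 + file_count // 10)
      # Stay safely below the Windows WaitForMultipleObjects limit.
      return min(jobs, 56)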

@randomascii (Contributor, Author)

I landed my depot_tools change so Chromium no longer hangs when running presubmits, but as promised I'm investigating what's going on under the hood and how it should be changed. Whoo boy - this whole thing is a bug farm.

For starters, this CPython comment is wrong:
python/cpython@3988986#diff-84a955ea642d41d9f164e9f94499d8eb33c357ec7f4362a341951f6f67ab21a1R114
WaitForMultipleObjects can wait for a maximum of 64 objects, not 63.

However, if the maximum were 63 then, given that there are two other handles that multiprocessing waits for, 61 would be the correct maximum pool size, as this comment says:
python/cpython@3988986#diff-eb5cd13d65d2f215d8dc0d95eb1a1341ff69b226ac56ec6099e39582edecc8cfR219

However there is a CPython bug. When I use a pool size of 61 with Python 3.8.10 I get this error:
ValueError: need at most 63 handles, got a sequence of length 63

So, CPython is range-checking with the wrong value (63 instead of 64) and it is (was, actually) checking with >= instead of >. The comparison-operator bug was fixed here (issue was python/cpython#84444):
python/cpython@92b5dc7

Therefore, while the CPython pool-size limit on Windows should be 62, it is actually 60. Once the fix above ships, the limit will go up to 61, which is still not where it should be.

The other problem, however, is that the ValueError exception gets thrown but Python still hangs. If output is not being displayed then the ValueError exception is invisible and we just have a hang.
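
Putting numbers on all of this (per the reasoning above; a later comment refines the extra-handle count to three):

  MAXIMUM_WAIT_OBJECTS = 64  # the true Win32 limit for WaitForMultipleObjects
  CPYTHON_CHECK_LIMIT = 63   # the (wrong) value CPython range-checks against
  EXTRA_HANDLES = 2          # non-worker handles multiprocessing also waits on

  # With the buggy >= comparison, pool + extras must stay below 63:
  print(CPYTHON_CHECK_LIMIT - EXTRA_HANDLES - 1)  # 60, the current safe maximum
  # After the >= is fixed to >, pool + extras may equal 63:
  print(CPYTHON_CHECK_LIMIT - EXTRA_HANDLES)      # 61
  # Against the true limit of 64, the cap could be:
  print(MAXIMUM_WAIT_OBJECTS - EXTRA_HANDLES)     # 62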

BTW, after landing my fix I noticed some new Pylint warnings in some scripts that hadn't changed in a long time. I eventually tracked the new warnings to the change in parallelism. That is, when I analyzed these files with jobs=1 I got eight extra warnings that I didn't get when analyzing them with jobs=2 or higher. More details and the change that I had to land to fix these issues are here:
https://chromium-review.googlesource.com/c/chromium/src/+/3712154
Is this a known problem with parallel Pylint? I'm tempted to do serial Pylint everywhere because the parallelism buys us little and risks missing some warnings.

I had a couple of CPython bugs to file, so here they are:
python/cpython#94242 - Off-by-one error in _winapi.WaitForMultipleObjects
python/cpython#94243 - Pool constructor should reject too many processes rather than hanging later

As for Pylint, I suspect that a version of this fix that I made to wpt is appropriate:
web-platform-tests/wpt@bdc35e8

This fix wouldn't actually help Chromium (because we passed in our own jobs parameter) but it could help others.

@DanielNoord (Collaborator)

> BTW, after landing my fix I noticed some new Pylint warnings in some scripts that hadn't changed in a long time. I eventually tracked the new warnings to the change in parallelism. That is, when I analyzed these files with jobs=1 I got eight extra warnings that I didn't get when analyzing them with jobs=2 or higher. More details and the change that I had to land to fix these issues are here: https://chromium-review.googlesource.com/c/chromium/src/+/3712154 Is this a known problem with parallel Pylint? I'm tempted to do serial Pylint everywhere because the parallelism buys us little and risks missing some warnings.

Yeah, this is sadly a known issue. There are some messages that depend on the state of other files/messages, which we don't really handle well for multiprocessing. I know there was a contributor who was planning to look into this for 2.15, but obviously we can't make any promises!

@randomascii (Contributor, Author)

After discussion on python/cpython#94242 it appears that 61 is the correct maximum because Python requires space for three extra handles. The comments give the wrong reason for the 61 limit, but 61 is correct (assuming Python 3.9 or higher).

DanielNoord added a commit that referenced this issue Jun 25, 2022
Creating a multiprocessing Pool with too many processes can hit
ValueError exceptions or hangs or both. The number that counts as "too
many" depends on the Python version so this change uses 56 as a
guaranteed safe limit.

Details in #6965

Note that this only avoids the issue if an explicit jobs count is not
passed in.

Co-authored-by: Daniël van Noord <13665637+DanielNoord@users.noreply.github.com>
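
The shape of the fix is roughly the following sketch (illustrative; the actual change in #7035 may differ in details):

  import multiprocessing
  import sys

  def _cpu_count():
      """Return a job count that is safe to pass to multiprocessing.Pool."""
      count = multiprocessing.cpu_count()
      if sys.platform == "win32":
          # Pool creation can hang or raise ValueError on Windows as the
          # worker count nears the WaitForMultipleObjects limit; the exact
          # threshold varies by Python version, so cap at a safe 56.
          count = min(count, 56)
      return count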
Pierre-Sassoulas pushed a commit to Pierre-Sassoulas/pylint that referenced this issue Jun 29, 2022
Pierre-Sassoulas pushed a commit that referenced this issue Jun 29, 2022