Fix out of range access in GetRecycleMemoryInfo (#26873) #26912
@303248153 thank you for finding it out. My assumption was that the CPU indexes are always smaller than sysconf(_SC_NPROCESSORS_CONF). Since that is not the case on some kernels, we also need to fix several places in the GCToOSInterface implementations where we make the same assumption. Could you please fix that too in this PR? Here are the places that need to be fixed:
coreclr/src/gc/unix/gcenv.unix.cpp Line 333 in 0d1c180
coreclr/src/gc/unix/gcenv.unix.cpp Line 1213 in 0d1c180
Line 1010 in 0d1c180
The change would be to use |
I think those 3 locations are fine: the number of processors returned by sched_getaffinity should not be greater than sysconf(_SC_NPROCESSORS_CONF). The problem is that sched_getcpu can return a number that is not in the affinity set (when using an OpenVZ kernel); in that case GetProcessorForHeap will just return false. Also, as far as I know, the total processor count and the affinity set can change after a process has launched: for example, CPU hotplug can change the total processor count, and taskset on Linux or SetProcessAffinityMask on Windows can change the affinity set. These are special cases, so .NET Core may not support them well, but at least it should not crash or access invalid memory. |
@303248153 The problem at those locations is that the GC would not be able to see processors with indexes higher than or equal to the number it gets from sysconf(_SC_NPROCESSORS_CONF), and so would not run GC threads or create GC heaps for them in server GC mode, which would cause a performance hit. And not only that: in the case you have described above, when the affinity was set to CPUs 2 and 3 and sysconf(_SC_NPROCESSORS_CONF) returned 2, the g_processAffinitySet in gcenv.unix.cpp would be initialized to empty and GetProcessorForHeap would always fail. I am not sure what exact effect that would have on the GC. |
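The concern in this comment can be illustrated with a small C sketch. It is not the code in gcenv.unix.cpp; `count_cpus_below` is a hypothetical helper that scans an affinity set only up to a given index limit, which is the pattern being questioned here.

```c
#define _GNU_SOURCE
#include <sched.h>

/* Hypothetical helper (not from coreclr): count the CPUs that are set in
 * `set` among the indexes [0, limit). A scan bounded by a processor count
 * never looks at bits whose index is at or above that count. */
static int count_cpus_below(const cpu_set_t *set, int limit)
{
    int found = 0;
    for (int i = 0; i < limit && i < CPU_SETSIZE; ++i) {
        if (CPU_ISSET(i, set)) {
            ++found;
        }
    }
    return found;
}
```

With an affinity set containing only CPUs 2 and 3, a scan limited to 2 finds nothing, while a scan up to CPU_SETSIZE finds both. Whether such a set can actually occur together with a reported count of 2 depends on the kernel, which is what the rest of this thread investigates.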
80b8eac
to
6e91e54
Compare
@janvorli That's a misunderstanding caused by my poor description. I mean OpenVZ VPS (from hostbrz):
KVM VPS (also from hostbrz):
KVM VPS with
I just fixed my comment to make the description clearer; please review it again, thanks. |
sched_getaffinity, CPU_COUNT returns 1 So you are saying that in this case, the sched_getaffinity would always return cpuSet with CPU #0 set? That sounds strange. I would expect it to return cpuSet with one CPU set, but it can be CPU 0, 1, 2, 3, 4 or 5. If I am right, the GC code needs to be fixed. Consider this case: sched_getaffinity, CPU_COUNT returns 1 In this case, the GC code at the places I've commented on would look through the affinity set only for index 0, so it would never find CPU at index 2. |
@janvorli @echesakovMSFT
sched_getcpu() can return a number that is not in the cpuSet on OpenVZ VE (see the code and output below). I noticed that coreclr uses _SC_NPROCESSORS_CONF on ARM/ARM64 and _SC_NPROCESSORS_ONLN otherwise to get the total CPU count:
I ran the tests again, and the result surprised me:
#define _GNU_SOURCE
#include <sched.h>
#include <unistd.h>
#include <stdio.h>

int main() {
    /* Configured vs. currently online processor counts. */
    int nproc = sysconf(_SC_NPROCESSORS_CONF);
    int nprocOnline = sysconf(_SC_NPROCESSORS_ONLN);
    printf("nproc: %d\n", nproc);
    printf("nprocOnline: %d\n", nprocOnline);
    while (1) {
        cpu_set_t cpuSet;
        int st = sched_getaffinity(getpid(), sizeof(cpu_set_t), &cpuSet);
        if (st != 0) {
            perror("sched_getaffinity");
            return 1;
        }
        /* Print the affinity bit for each configured CPU index. */
        printf("affinity: ");
        for (int x = 0; x < nproc; ++x) {
            printf("%d, ", CPU_ISSET(x, &cpuSet));
        }
        /* Print the CPU this thread is currently running on. */
        printf("cpuid: %d\n", sched_getcpu());
        sleep(1);
    }
    return 0;
}
Output on OpenVZ VPS:
Output on KVM VPS:
It looks like on OpenVZ VE, _SC_NPROCESSORS_CONF reports all CPU cores but _SC_NPROCESSORS_ONLN only reports the limited value, so using _SC_NPROCESSORS_CONF on all platforms would also fix this issue. @echesakovMSFT Could you tell me why _SC_NPROCESSORS_CONF is not used on x86? @janvorli Should I fix those 3 locations even though the value from sched_getcpu() may not be in the affinity set? |
@303248153 As far as I remember, before my changes in #18053 sysconf(_SC_NPROCESSORS_ONLN) was used on all platforms to determine the number of available logical processors. This caused issues on ARM/ARM64 where sysconf(_SC_NPROCESSORS_ONLN) can return 1 at the beginning of your process startup if all other cores were "sleeping" and the runtime will use the return value to make some decisions (e.g. what type of write barriers to use - single-threaded or multi-threaded). Then later the same call to sysconf(_SC_NPROCESSORS_ONLN) can report any arbitrary number between 1 and actual number of logical processors depending on the load. As you can see, this would cause hard to find failures. We decided in #18053 to use sysconf(_SC_NPROCESSORS_CONF) instead. However, this could cause another type of issues when running in Docker container with - it would ignore the limitations set by --cpuset-cpus - I believe this was discussed in #10690. So later in #18289 I changed X86/AMD64 to use sysconf(_SC_NPROCESSORS_ONLN) as it was before #18053 and use sysconf(_SC_NPROCESSORS_CONF) on ARM/ARM64. |
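The per-architecture choice described in this comment could be sketched roughly as below. This is an illustration only, not the PAL source: `total_cpu_count` is a hypothetical name, and falling back to 1 when sysconf fails is an assumption.

```c
#include <unistd.h>

/* Hypothetical helper (not the actual PAL function): pick the sysconf
 * query by architecture, as the comment above describes. */
static long total_cpu_count(void)
{
#if defined(__arm__) || defined(__aarch64__)
    /* ARM/ARM64: the online count can fluctuate while cores sleep,
     * so ask for the configured count instead. */
    long n = sysconf(_SC_NPROCESSORS_CONF);
#else
    /* Other architectures: use the online count. */
    long n = sysconf(_SC_NPROCESSORS_ONLN);
#endif
    /* Assumption: treat an error (-1) as a single processor. */
    return n > 0 ? n : 1;
}
```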
@echesakovMSFT Thanks for your explanation! In my test, _SC_NPROCESSORS_ONLN still reports all cores when using --cpuset-cpus inside Docker:
There are two types of CPU count: the total CPU count (PAL_GetTotalCpuCount) and the logical CPU count (PAL_GetLogicalCpuCountFromOS). I think we can use _SC_NPROCESSORS_CONF to get the total CPU count on all platforms, because the logical CPU count is already calculated from sched_getaffinity, which works well inside Docker with the --cpuset-cpus option. ref: https://github.com/dotnet/coreclr/blob/master/src/pal/src/misc/sysinfo.cpp |
@303248153 thank you for the testing source code. The fact that the affinity has nothing to do with the processor we get from sched_getcpu on OpenVZ VPS is really surprising. I want to do some googling to see if that's some weird quirk or if my understanding of the affinity set is wrong. |
I was baffled by the weird affinity set reported on OpenVZ until I found this: So we can keep your change as-is without modifying the places in GC that I've mentioned, since it is clear that the bits set in the affinity are always in the range from 0.._SC_NPROCESSORS_ONLN-1 and thus everything will be processed correctly. |
@janvorli Thanks for your investigation! One question remains: should we use _SC_NPROCESSORS_CONF everywhere? Also, I checked the usage of GetCurrentProcessorNumber in the GC code: heap_select::init_cpu_mapping and heap_select::select_heap use sched_getcpu() as an array index, but the array length is MAX_SUPPORTED_CPUS and by default it will return heap 0, so it should be safe. |
6e91e54
to
3461971
Compare
As for _SC_NPROCESSORS_CONF vs. _SC_NPROCESSORS_ONLN, I'd create a separate issue to investigate and possibly make that change. Your fix is good as is. |
Glad to hear that, now I just wait for merge. |
.NET Core 3.0 causes a regression on OpenVZ setups compared to the 2.2 release; I hope this can be merged to 3.0 as well. |
Thank you for your contribution. As announced in dotnet/coreclr#27549 this repository will be moving to dotnet/runtime on November 13. If you would like to continue working on this PR after this date, the easiest way to move the change to dotnet/runtime is:
|
/azp run coreclr-ci |
Azure Pipelines successfully started running 1 pipeline(s). |
Related to #26873
On Linux, GetCurrentProcessorNumber, which uses sched_getcpu(), can return a value greater than or equal to the number of processors reported by sched_getaffinity with CPU_COUNT or by sysconf(_SC_NPROCESSORS_ONLN). For example, taskset -c 2,3 ./MyApp will make CPU_COUNT be 2, but sched_getcpu() can return 2 or 3, and an OpenVZ kernel can make sysconf(_SC_NPROCESSORS_ONLN) return a limited CPU count while sched_getcpu() still reports the real processor number.
We should use GetCurrentProcessorNumber() % NumberOfProcessors for the array index of pRecycledListPerProcessor to avoid out-of-range access.
Also, I hope it can be merged to the 3.0.x release so my OpenVZ VPS can upgrade to .NET Core 3.0 sooner :|
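The modulo indexing proposed here can be sketched in plain C. This is a minimal illustration, not the actual change: `recycled_list_slot` is a hypothetical name, and the per-processor array is represented only by its length.

```c
#include <stddef.h>

/* Hypothetical helper (not from coreclr): map a processor number that may
 * exceed the reported processor count onto a valid slot of a
 * per-processor array with `slot_count` entries. */
static size_t recycled_list_slot(unsigned cpu, size_t slot_count)
{
    /* Guard against division by zero; an empty array has no valid
     * slot, so callers must not index it at all. */
    if (slot_count == 0) {
        return 0;
    }
    return (size_t)cpu % slot_count;
}
```

With two slots (as in the taskset -c 2,3 example), processor 2 maps to slot 0 and processor 3 maps to slot 1, so both stay in range. The trade-off is that two processors can share a slot when the reported count is smaller than the real one.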