-
Notifications
You must be signed in to change notification settings - Fork 25k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Return 0 for negative "free" and "total" memory reported by the OS #42725
Conversation
We've had a situation where the MX bean reported negative values for the free memory of the OS, in those rare cases we want to return a value of 0 rather than blowing up later down the pipeline. In the event that there is a serialization or creation error with regard to memory use, this adds asserts so the failure will occur as soon as possible and give us a better location for investigation. Resolves elastic#42157
Pinging @elastic/es-core-features |
Under what circumstances did this happen. I see it was on Linux 3.16.0-8-amd64, a Debian kernel. Did we did into the JDK source to understand what happened, is this a JDK bug, an OS bug, or has our world view been shattered? Are we working around something that is fixed on modern kernels? I don’t like these band-aids without understanding their source. And if we commit the band-aid our code should have a comment telling the story of this investigation, for future passersby. |
It's possible for this native method to return -1, for Apple, and AIX systems: JNIEXPORT jlong JNICALL
Java_com_sun_management_internal_OperatingSystemImpl_getFreePhysicalMemorySize0
(JNIEnv *env, jobject mbean)
{
#ifdef __APPLE__
mach_msg_type_number_t count;
vm_statistics_data_t vm_stats;
kern_return_t res;
count = HOST_VM_INFO_COUNT;
res = host_statistics(mach_host_self(), HOST_VM_INFO, (host_info_t)&vm_stats, &count);
if (res != KERN_SUCCESS) {
throw_internal_error(env, "host_statistics failed");
return -1; // <<< --- HERE
}
return (jlong)vm_stats.free_count * page_size;
#elif defined(_ALLBSD_SOURCE)
/*
* XXBSDL no way to do it in FreeBSD
*/
// throw_internal_error(env, "unimplemented in FreeBSD")
return (128 * MB);
#elif defined(_AIX)
perfstat_memory_total_t memory_info;
if (-1 != perfstat_memory_total(NULL, &memory_info, sizeof(perfstat_memory_total_t), 1)) {
return (jlong)(memory_info.real_free * 4L * 1024L);
}
return -1; // <<< --- HERE
#else // solaris / linux
jlong num_avail_physical_pages = sysconf(_SC_AVPHYS_PAGES);
return (num_avail_physical_pages * page_size);
#endif
} I'm unable able to determine exactly how it overflowed in Linux, however, the manpage for
Additionally:
Meaning that if the
I also checked the linux kernel 3.16.x source code, HPUX would return -1 for the available physical pages (it doesn't support that option). int hpux_sysconf(int which)
{
switch (which) {
case _SC_CPU_VERSION:
return CPU_PA_RISC1_1;
case _SC_OPEN_MAX:
return INT_MAX;
default:
return -EINVAL;
}
} In GNU's libc, the long int
__get_phys_pages (void)
{
/* We have no general way to determine this value. */
__set_errno (ENOSYS);
return -1;
}
libc_hidden_def (__get_phys_pages)
weak_alias (__get_phys_pages, get_phys_pages) The static inline unsigned long global_page_state(enum zone_stat_item item)
{
long x = atomic_long_read(&vm_stat[item]);
#ifdef CONFIG_SMP
if (x < 0)
x = 0;
#endif
return x;
} I know that we don't support AIX or HPUX, but given that it's possible for an OS not to support these flags (Apple failing to return host statistics for instance, or Linux running a non-standard libc (I didn't check whether musl libc could return negative values)), I think we should be extra cautious about what values we get from the OS and be safe about returning 0 when negative values are returned. If, as might be possible, it's a serialization issue or construction issue, the extra asserts will help us narrow that down as well, since it addresses it from the |
@elasticmachine update branch |
@elasticmachine elasticsearch-ci/1 |
added team discuss to further discuss if we should merge this now Lee has figured out exactly how this can happen. |
Discussed this today and came away with the following points:
We will proceed with this issue and thanks @dakrone for the deep dive into how this is possible. |
x-pack/plugin/ml/src/test/java/org/elasticsearch/xpack/ml/MachineLearningTests.java
Show resolved
Hide resolved
LGTM pending @droberts195 comment. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for making the extra change
@dakrone Thank you so much for diving into this, that’s the kind of analysis we need for changes like this. One request though, can we transfer some of the analysis into a comment in the code for future passersby? |
@jasontedor certainly, I'll add a comment where we get back the value from the bean |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
…42725) * Return 0 for negative "free" and "total" memory reported by the OS We've had a situation where the MX bean reported negative values for the free memory of the OS, in those rare cases we want to return a value of 0 rather than blowing up later down the pipeline. In the event that there is a serialization or creation error with regard to memory use, this adds asserts so the failure will occur as soon as possible and give us a better location for investigation. Resolves #42157 * Fix test passing in invalid memory value * Fix another test passing in invalid memory value * Also change mem check in MachineLearning.machineMemoryFromStats * Add background documentation for why we prevent negative return values * Clarify comment a bit more
We are seeing this issue now on es 7.5.2 running Amazon Linux 2.
|
I'm seeing this error on a 5-node Elasticsearch 7.5.2 on centos 7.8.2003. Any idea if I should file a new issue or if it will be fixed by an upgrade or is it a sign of misconfiguration? This is upgraded from ES 6.8 and I just deleted a large index. I guess I'll try a reboot.
UPDATE: fwiw the |
I also saw this issue on a 13-node cluster with Elasticsearch v7.8.1, Ubuntu 18.04.2 LTS, OpenJDK 64-Bit v11.0.3, using G1GC, with 32GiB of memory, setting 16GB for JVM heap, IHOP 30% and reserving 25% for G1. Upgraded from 7.1.0, but then deleted old data. Fixed on restart. Edited to add a related topic on the forum:
|
We've had a series of bug fixes for cases where an OsProbe gives negative values, most often just -1, to the OsStats class. We added assertions to catch cases where we were initializing OsStats with bad values. Unfortunately, these fixes turned to not be backwards-compatible. In this commit, we simply coerce bad values to 0 when data is coming from nodes that don't have the relevant bug fixes. Relevant PRs: * #42725 * #56435 * #57317 Fixes #73459
We've had a series of bug fixes for cases where an OsProbe gives negative values, most often just -1, to the OsStats class. We added assertions to catch cases where we were initializing OsStats with bad values. Unfortunately, these fixes turned to not be backwards-compatible. In this commit, we simply coerce bad values to 0 when data is coming from nodes that don't have the relevant bug fixes. Relevant PRs: * #42725 * #56435 * #57317 Fixes #73459
We've had a series of bug fixes for cases where an OsProbe gives negative values, most often just -1, to the OsStats class. We added assertions to catch cases where we were initializing OsStats with bad values. Unfortunately, these fixes turned to not be backwards-compatible. In this commit, we simply coerce bad values to 0 when data is coming from nodes that don't have the relevant bug fixes. Relevant PRs: * #42725 * #56435 * #57317 Fixes #73459
We've had a situation where the MX bean reported negative values for the
free memory of the OS, in those rare cases we want to return a value of
0 rather than blowing up later down the pipeline.
In the event that there is a serialization or creation error with regard
to memory use, this adds asserts so the failure will occur as soon as
possible and give us a better location for investigation.
Resolves #42157