Return 0 for negative "free" and "total" memory reported by the OS #42725

dakrone · 2019-05-30T19:21:35Z

We've had a situation where the MX bean reported negative values for the
free memory of the OS, in those rare cases we want to return a value of
0 rather than blowing up later down the pipeline.

In the event that there is a serialization or creation error with regard
to memory use, this adds asserts so the failure will occur as soon as
possible and give us a better location for investigation.

Resolves #42157
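
The clamping described above can be sketched as follows. This is a minimal illustration, not the code from this change: the class name `MemoryValues` and the method `nonNegative` are hypothetical, and the real logic sits in the OS probe.

```java
// Hypothetical helper for illustration only; not the actual Elasticsearch code.
final class MemoryValues {
    private MemoryValues() {}

    /** Returns the reported byte count, or 0 when the OS bean reported a negative value. */
    static long nonNegative(long reportedBytes) {
        return reportedBytes < 0 ? 0L : reportedBytes;
    }

    public static void main(String[] args) {
        // A negative report is coerced to 0.
        System.out.println(nonNegative(-4096L));
        // A valid report passes through unchanged.
        System.out.println(nonNegative(8_589_934_592L));
    }
}
```

Coercing at the point where the value is read keeps every downstream consumer from having to handle negative sizes.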

elasticmachine · 2019-05-30T19:21:37Z

Pinging @elastic/es-core-features

jasontedor · 2019-05-31T03:28:44Z

Under what circumstances did this happen? I see it was on Linux 3.16.0-8-amd64, a Debian kernel. Did we dig into the JDK source to understand what happened? Is this a JDK bug, an OS bug, or has our world view been shattered? Are we working around something that is fixed on modern kernels? I don’t like these band-aids without understanding their source. And if we commit the band-aid, our code should have a comment telling the story of this investigation, for future passersby.

dakrone · 2019-05-31T15:44:07Z

It's possible for this native method to return -1 on Apple and AIX systems:

JNIEXPORT jlong JNICALL
Java_com_sun_management_internal_OperatingSystemImpl_getFreePhysicalMemorySize0
  (JNIEnv *env, jobject mbean)
{
#ifdef __APPLE__
    mach_msg_type_number_t count;
    vm_statistics_data_t vm_stats;
    kern_return_t res;

    count = HOST_VM_INFO_COUNT;
    res = host_statistics(mach_host_self(), HOST_VM_INFO, (host_info_t)&vm_stats, &count);
    if (res != KERN_SUCCESS) {
        throw_internal_error(env, "host_statistics failed");
        return -1; // <<< --- HERE
    }
    return (jlong)vm_stats.free_count * page_size;
#elif defined(_ALLBSD_SOURCE)
    /*
     * XXBSDL no way to do it in FreeBSD
     */
    // throw_internal_error(env, "unimplemented in FreeBSD")
    return (128 * MB);
#elif defined(_AIX)
    perfstat_memory_total_t memory_info;
    if (-1 != perfstat_memory_total(NULL, &memory_info, sizeof(perfstat_memory_total_t), 1)) {
        return (jlong)(memory_info.real_free * 4L * 1024L);
    }
    return -1; // <<< --- HERE
#else // solaris / linux
    jlong num_avail_physical_pages = sysconf(_SC_AVPHYS_PAGES);
    return (num_avail_physical_pages * page_size);
#endif
}

I'm unable to determine exactly how it overflowed on Linux; however, the manpage for sysconf does specify:

_SC_PHYS_PAGES
The number of pages of physical memory. Note that it is possible for the product of this value and the value of _SC_PAGESIZE to overflow.

Additionally:

The return value of sysconf() is one of the following:
*  On error, -1 is returned and errno is set to indicate the cause of the error (for example, EINVAL, indicating that name is invalid).

This means that if the sysconf call for _SC_AVPHYS_PAGES were to fail, it could return -1, and multiplying that value by page_size would yield a negative free memory amount. Additionally, _SC_AVPHYS_PAGES is considered a non-standard option: "These values also exist, but may not be standard." (for the _SC_-prefixed options).

*  If name corresponds to an option ... -1 is returned if the option is not supported.
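
To make the arithmetic concrete, here is a small Java sketch of the same multiplication. The class and method names are hypothetical and the 4096-byte page size is an assumption; the unchecked variant mirrors the `num_avail_physical_pages * page_size` line in the native code, while the checked variant is one possible defensive alternative, not what the JDK does.

```java
// Illustrative only: models the native multiplication in plain Java.
final class SysconfArithmetic {
    private SysconfArithmetic() {}

    /** Mirrors the unchecked product computed in the Linux/Solaris branch. */
    static long freeBytesUnchecked(long availPhysPages, long pageSize) {
        return availPhysPages * pageSize;
    }

    /** Treats a failed sysconf (-1) or an overflowing product as "unknown" and reports 0. */
    static long freeBytesChecked(long availPhysPages, long pageSize) {
        if (availPhysPages < 0 || pageSize <= 0) {
            return 0L;
        }
        try {
            return Math.multiplyExact(availPhysPages, pageSize);
        } catch (ArithmeticException overflow) {
            return 0L;
        }
    }

    public static void main(String[] args) {
        long pageSize = 4096L; // assumed page size
        System.out.println(freeBytesUnchecked(-1L, pageSize)); // -4096
        System.out.println(freeBytesChecked(-1L, pageSize));   // 0
    }
}
```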

I also checked the Linux kernel 3.16.x source code; HPUX would return -1 for the available physical pages (it doesn't support that option).

int hpux_sysconf(int which)
{
	switch (which) {
	case _SC_CPU_VERSION:
		return CPU_PA_RISC1_1;
	case _SC_OPEN_MAX:
		return INT_MAX;
	default:
		return -EINVAL;
	}
}

In GNU's libc, the sysconf implementation (the "hidden" implementation, which I admit I definitely do not understand), can return -1 for the __get_phys_pages method which is called by the sysconf implementation (it appears that it sends it off to the linux sysinfo call):

long int
__get_phys_pages (void)
{
  /* We have no general way to determine this value.  */
  __set_errno (ENOSYS);
  return -1;
}
libc_hidden_def (__get_phys_pages)
weak_alias (__get_phys_pages, get_phys_pages)

The sysinfo call can return a negative value on non-SMP kernels (which I don't think we have much of any more?):

static inline unsigned long global_page_state(enum zone_stat_item item)
{
	long x = atomic_long_read(&vm_stat[item]);
#ifdef CONFIG_SMP
	if (x < 0)
		x = 0;
#endif
	return x;
}

I know that we don't support AIX or HPUX, but given that it's possible for an OS not to support these flags (Apple failing to return host statistics for instance, or Linux running a non-standard libc (I didn't check whether musl libc could return negative values)), I think we should be extra cautious about what values we get from the OS and be safe about returning 0 when negative values are returned.

If, as might be possible, it's a serialization issue or construction issue, the extra asserts will help us narrow that down as well, since it addresses it from the Mem object construction side rather than the OS bean side.
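
As a rough sketch of the construction-side checks, assuming a simplified stand-in for the Mem object (the class name `MemSketch`, its fields, and the assertion messages are hypothetical):

```java
// Hypothetical stand-in for the Mem stats object; names and messages are assumptions.
final class MemSketch {
    private final long total;
    private final long free;

    MemSketch(long total, long free) {
        // With assertions enabled (-ea), a bad value fails here, at construction,
        // rather than later when the value is rendered or serialized.
        assert total >= 0 : "expected total memory to be non-negative, got: " + total;
        assert free >= 0 : "expected free memory to be non-negative, got: " + free;
        this.total = total;
        this.free = free;
    }

    long getTotal() {
        return total;
    }

    long getFree() {
        return free;
    }

    long getUsed() {
        return total - free;
    }
}
```

Because Java assertions are disabled by default, checks like these only fire in test runs or when the JVM is started with `-ea`; production behavior still relies on the probe returning sane values.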

dakrone · 2019-06-03T21:52:11Z

@elasticmachine update branch

dakrone · 2019-06-04T17:12:01Z

@elasticmachine elasticsearch-ci/1

jakelandis · 2019-06-06T13:34:54Z

Added the team-discuss label so we can discuss whether we should merge this, now that Lee has figured out exactly how this can happen.

jakelandis · 2019-06-13T15:26:25Z

Discussed this today and came away with the following points:

It is possible that a supported distribution's kernel could emit a negative value.
It is known that some unsupported distributions can emit a negative value.
For unsupported distributions, if there are low-friction, low/no-maintenance fixes that better enable Elasticsearch to work, we should accept those changes.

We will proceed with this issue and thanks @dakrone for the deep dive into how this is possible.

x-pack/plugin/ml/src/test/java/org/elasticsearch/xpack/ml/MachineLearningTests.java

jakelandis · 2019-06-13T17:46:16Z

LGTM pending @droberts195 comment.

droberts195

Thanks for making the extra change

jasontedor · 2019-06-14T20:27:50Z

@dakrone Thank you so much for diving into this, that’s the kind of analysis we need for changes like this. One request though, can we transfer some of the analysis into a comment in the code for future passersby?

dakrone · 2019-06-14T20:41:48Z

@jasontedor certainly, I'll add a comment where we get back the value from the bean

server/src/main/java/org/elasticsearch/monitor/os/OsProbe.java

jasontedor

LGTM.

taf2 · 2020-01-31T20:35:38Z

We are seeing this issue now on es 7.5.2 running Amazon Linux 2.

[ct@ip-10-55-31-96 ~]# curl localhost:9200/_cat/nodes
{"error":{"root_cause":[{"type":"illegal_argument_exception","reason":"Values less than -1 bytes are not supported: -2b"}],"type":"illegal_argument_exception","reason":"Values less than -1 bytes are not supported: -2b"},"status":400}[root@ip-10-55-31-96 ~]#
[ct@ip-10-55-31-96 ~]# uname -a
Linux ip-10-55-31-96.ec2.internal 4.14.152-127.182.amzn2.x86_64 #1 SMP Thu Nov 14 17:32:43 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
[ct@ip-10-55-31-96 ~]#

[ct@ip-10-55-31-96 ~]# curl localhost:9200
{
  "name" : "es7d-23",
  "cluster_name" : "es7",
  "cluster_uuid" : "UJ-qxUlHROCBIZ9pFGagHw",
  "version" : {
    "number" : "7.5.2",
    "build_flavor" : "default",
    "build_type" : "rpm",
    "build_hash" : "8bec50e1e0ad29dad5653712cf3bb580cd1afcdf",
    "build_date" : "2020-01-15T12:11:52.313576Z",
    "build_snapshot" : false,
    "lucene_version" : "8.3.0",
    "minimum_wire_compatibility_version" : "6.8.0",
    "minimum_index_compatibility_version" : "6.0.0-beta1"
  },
  "tagline" : "You Know, for Search"
}
[ct@ip-10-55-31-96 ~]#

jamshid · 2020-09-04T22:38:28Z

I'm seeing this error on a 5-node Elasticsearch 7.5.2 on centos 7.8.2003. Any idea if I should file a new issue or if it will be fixed by an upgrade or is it a sign of misconfiguration? This is upgraded from ES 6.8 and I just deleted a large index. I guess I'll try a reboot.

{"error":{"root_cause":[{"type":"illegal_argument_exception","reason":"Values less than -1 bytes are not supported: -108b","stack_trace":"[Values less than -1 bytes are not supported: -108b]; nested: IllegalArgumentException[Values less than -1 bytes are not supported: -108b];
	 org.elasticsearch.ElasticsearchException.guessRootCauses(ElasticsearchException.java:644)
	 org.elasticsearch.ElasticsearchException.generateFailureXContent(ElasticsearchException.java:57
...
Caused by: java.lang.IllegalArgumentException: Values less than -1 bytes are not supported: -108b
	 org.elasticsearch.common.unit.ByteSizeValue.<init>(ByteSizeValue.java:72)
	 org.elasticsearch.common.unit.ByteSizeValue.<init>(ByteSizeValue.java:67)
	 org.elasticsearch.index.cache.query.QueryCacheStats.getMemorySize(QueryCacheStats.java:73)
	 org.elasticsearch.rest.action.cat.RestNodesAction.buildTable(RestNodesAction.java:341)
	 org.elasticsearch.rest.action.cat.RestNodesAction$1$1$1.buildResponse(RestNodesAction.java:105)
	 org.elasticsearch.rest.action.cat.RestNodesAction$1$1$1.buildResponse(RestNodesAction.java:102)
	 org.elasticsearch.rest.action.RestResponseListener.processResponse(RestResponseListener.java:37)
	 org.elasticsearch.rest.action.RestActionListener.onResponse(RestActionListener.java:47)
	... 23 more
"}],"type":"illegal_argument_exception","reason":"Values less than -1 bytes are not supported: -108b","stack_trace":"java.lang.IllegalArgumentException: Values less than -1 bytes are not supported: -108b
	 org.elasticsearch.common.unit.ByteSizeValue.<init>(ByteSizeValue.java:72)
	 org.elasticsearch.common.unit.ByteSizeValue.<init>(ByteSizeValue.java:67)
	 org.elasticsearch.index.cache.query.QueryCacheStats.getMemorySize(QueryCacheStats.java:73)
	 org.elasticsearch.rest.action.cat.RestNodesAction.buildTable(RestNodesAction.java:341)
	 org.elasticsearch.rest.action.cat.RestNodesAction$1$1$1.buildResponse(RestNodesAction.java:105)
...

UPDATE: fwiw the _cat api error seemed to go away on its own when I checked back a couple of days later. Also some of the nodes were on centos 7.6 and 7.7 and had older openjdk-1.8.0 patch levels, maybe contributing factor? Also saw somewhere that maybe sudo swapoff -a stops the error.
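
For readers unfamiliar with where this message comes from: the stack trace shows it is raised while constructing a byte-size value. The sketch below is a simplified, hypothetical stand-in (the class `ByteSizeSketch` is not the real `ByteSizeValue`), showing the kind of validation that would produce such an error when a stats object hands it a negative number.

```java
// Simplified, hypothetical stand-in for a byte-size value type; not the real ByteSizeValue.
final class ByteSizeSketch {
    private final long bytes;

    ByteSizeSketch(long bytes) {
        // The error message implies -1 is a permitted value; anything lower is rejected.
        if (bytes < -1) {
            throw new IllegalArgumentException("Values less than -1 bytes are not supported: " + bytes + "b");
        }
        this.bytes = bytes;
    }

    long getBytes() {
        return bytes;
    }

    public static void main(String[] args) {
        try {
            new ByteSizeSketch(-108L);
        } catch (IllegalArgumentException e) {
            // Prints the rejection message for the invalid value.
            System.out.println(e.getMessage());
        }
    }
}
```

Note that in this trace the negative value comes from the query cache stats rather than from the OS probe, so it may be a different source of negative values than the one addressed here.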

RicardoGralhoz · 2020-12-03T02:43:15Z

I also saw this issue on a 13-node cluster with Elasticsearch v7.8.1, Ubuntu 18.04.2 LTS, OpenJDK 64-Bit v11.0.3, using G1GC, with 32GiB of memory, setting 16GB for JVM heap, IHOP 30% and reserving 25% for G1. Upgraded from 7.1.0, but then deleted old data. Fixed on restart.

Edited to add a related topic on the forum:
help-with-unassigned-shards-circuitbreakingexception-values-less-than-1-bytes-are-not-supported

cat cluster_stats.json 

Bad Request. Rejected
{
  "error" : { 
    "root_cause" : [ 
      {   
        "type" : "illegal_argument_exception",
        "reason" : "Values less than -1 bytes are not supported: -9223372036787056125b"
      }   
    ],  
    "type" : "illegal_argument_exception",
    "reason" : "Values less than -1 bytes are not supported: -9223372036787056125b",
    "suppressed" : [ 
      {   
        "type" : "illegal_state_exception",
        "reason" : "Failed to close the XContentBuilder",
        "caused_by" : { 
          "type" : "i_o_exception",
          "reason" : "Unclosed object or array found"
        }
      }   
    ]   
  },  
  "status" : 400 
}

We've had a series of bug fixes for cases where an OsProbe gives negative values, most often just -1, to the OsStats class. We added assertions to catch cases where we were initializing OsStats with bad values. Unfortunately, these fixes turned out not to be backwards-compatible. In this commit, we simply coerce bad values to 0 when data is coming from nodes that don't have the relevant bug fixes.

Relevant PRs:
* #42725
* #56435
* #57317

Fixes #73459
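
A minimal sketch of that kind of lenient read, using plain `java.io` streams instead of Elasticsearch's own stream classes (the class `LenientMemReader` and its methods are hypothetical):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical illustration of coercing bad values while deserializing stats.
final class LenientMemReader {
    private LenientMemReader() {}

    /** Reads one long and coerces negative values (for example from an unpatched node) to 0. */
    static long readNonNegative(DataInputStream in) throws IOException {
        long value = in.readLong();
        return value < 0 ? 0L : value;
    }

    /** Serializes a value as an older node might, then reads it back leniently. */
    static long roundTrip(long valueFromOldNode) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (DataOutputStream out = new DataOutputStream(bytes)) {
                out.writeLong(valueFromOldNode);
            }
            try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return readNonNegative(in);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Coercing on the reading side means a patched node can still accept stats from unpatched nodes in a mixed-version cluster.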

dakrone added :Data Management/Stats Statistics tracking and retrieval APIs v8.0.0 v7.2.0 v7.0.2 v7.1.2 labels May 30, 2019

dakrone added 2 commits May 30, 2019 14:02

Fix test passing in invalid memory value

1797f2c

Fix another test passing in invalid memory value

2c86349

Merge branch 'master' into reset-negative-mem-values

9e86f99

jakelandis added the team-discuss label Jun 6, 2019

jakelandis self-requested a review June 13, 2019 15:20

jakelandis removed the team-discuss label Jun 13, 2019

droberts195 reviewed Jun 13, 2019

x-pack/plugin/ml/src/test/java/org/elasticsearch/xpack/ml/MachineLearningTests.java

Also change mem check in MachineLearning.machineMemoryFromStats

04b27ea

droberts195 approved these changes Jun 14, 2019

dakrone added 2 commits June 14, 2019 16:01

Merge remote-tracking branch 'origin/master' into reset-negative-mem-values

929124d

Add background documentation for why we prevent negative return values

2fad425

jasontedor reviewed Jun 17, 2019

server/src/main/java/org/elasticsearch/monitor/os/OsProbe.java

Clarify comment a bit more

6e87a2d

jasontedor approved these changes Jun 17, 2019

dakrone removed the backport pending label Jun 19, 2019

dakrone deleted the reset-negative-mem-values branch June 19, 2019 21:59

jpountz added the >bug label Jul 5, 2019

codebrain mentioned this pull request Aug 5, 2019

[meta] 7.2 Release elastic/elasticsearch-net#3980

Closed

37 tasks

This was referenced May 18, 2020

[7.7] Report used memory as zero when total memory cannot be obtained #56905

Merged

[7.8] Report used memory as zero when total memory cannot be obtained #56909

Merged

jamshid mentioned this pull request Sep 10, 2020

BoolQueryBuilder uses ObjectParser #52880

Merged

jakelandis mentioned this pull request May 10, 2021

Nodes stats returns -1 cpu percent on my desktop #72887

Closed

This was referenced Jun 1, 2021

[8.x] OsStats must be lenient with bad data from older nodes #73610

Merged

[7.x] OsStats must be lenient with bad data from older nodes #73614

Merged

[6.8] OsStats must be lenient with bad data from older nodes #73616

Merged

jakelandis added v8.0.0-alpha1 and removed v8.0.0 labels Jul 26, 2021

droberts195 mentioned this pull request Oct 15, 2021

Allow total memory to be overridden #78750

Merged
