Return 0 for negative "free" and "total" memory reported by the OS #42725

Merged
merged 8 commits into from
Jun 19, 2019

dakrone
Member

@dakrone dakrone commented May 30, 2019

We've had a situation where the MX bean reported negative values for the
free memory of the OS, in those rare cases we want to return a value of
0 rather than blowing up later down the pipeline.

In the event that there is a serialization or creation error with regard
to memory use, this adds asserts so the failure will occur as soon as
possible and give us a better location for investigation.

Resolves #42157
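
As a rough illustration of the approach described above (not the code in this PR), a probe can coerce negative readings to 0 before they reach anything that validates byte sizes. `FreeMemoryProbe` and its method names are hypothetical, and the direct cast to `com.sun.management.OperatingSystemMXBean` is an assumption made for brevity:

```java
import java.lang.management.ManagementFactory;

// Hypothetical sketch: class and method names are illustrative, not from the PR.
final class FreeMemoryProbe {

    /** Coerces a negative byte count reported by the OS to 0; other values pass through. */
    static long clampNegativeToZero(long reportedBytes) {
        return reportedBytes < 0 ? 0L : reportedBytes;
    }

    /** Reads free physical memory from the platform MX bean and clamps the result. */
    static long freePhysicalMemoryBytes() {
        java.lang.management.OperatingSystemMXBean bean =
            ManagementFactory.getOperatingSystemMXBean();
        if (bean instanceof com.sun.management.OperatingSystemMXBean) {
            // Deprecated on newer JDKs, but still present; used here for illustration only.
            long raw = ((com.sun.management.OperatingSystemMXBean) bean)
                .getFreePhysicalMemorySize();
            return clampNegativeToZero(raw);
        }
        // The extended bean is not available on this JVM: report 0 rather than guess.
        return 0L;
    }
}
```

Clamping at the point where the value is read keeps every downstream consumer free of special cases for negative sizes.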

@dakrone dakrone added :Data Management/Stats Statistics tracking and retrieval APIs v8.0.0 v7.2.0 v7.0.2 v7.1.2 labels May 30, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-core-features

@jasontedor
Member

Under what circumstances did this happen? I see it was on Linux 3.16.0-8-amd64, a Debian kernel. Did we dig into the JDK source to understand what happened? Is this a JDK bug, an OS bug, or has our world view been shattered? Are we working around something that is fixed on modern kernels? I don’t like these band-aids without understanding their source. And if we commit the band-aid, our code should have a comment telling the story of this investigation, for future passersby.

@dakrone
Member Author

dakrone commented May 31, 2019

It's possible for this native method to return -1 on Apple and AIX systems:

JNIEXPORT jlong JNICALL
Java_com_sun_management_internal_OperatingSystemImpl_getFreePhysicalMemorySize0
  (JNIEnv *env, jobject mbean)
{
#ifdef __APPLE__
    mach_msg_type_number_t count;
    vm_statistics_data_t vm_stats;
    kern_return_t res;

    count = HOST_VM_INFO_COUNT;
    res = host_statistics(mach_host_self(), HOST_VM_INFO, (host_info_t)&vm_stats, &count);
    if (res != KERN_SUCCESS) {
        throw_internal_error(env, "host_statistics failed");
        return -1; // <<< --- HERE
    }
    return (jlong)vm_stats.free_count * page_size;
#elif defined(_ALLBSD_SOURCE)
    /*
     * XXBSDL no way to do it in FreeBSD
     */
    // throw_internal_error(env, "unimplemented in FreeBSD")
    return (128 * MB);
#elif defined(_AIX)
    perfstat_memory_total_t memory_info;
    if (-1 != perfstat_memory_total(NULL, &memory_info, sizeof(perfstat_memory_total_t), 1)) {
        return (jlong)(memory_info.real_free * 4L * 1024L);
    }
    return -1; // <<< --- HERE
#else // solaris / linux
    jlong num_avail_physical_pages = sysconf(_SC_AVPHYS_PAGES);
    return (num_avail_physical_pages * page_size);
#endif
}

I'm unable to determine exactly how it overflowed on Linux; however, the manpage for sysconf does specify:

_SC_PHYS_PAGES
The number of pages of physical memory.  Note that it is possible for the product of this value and the value of _SC_PAGESIZE to overflow.

Additionally:

The return value of sysconf() is one of the following:
*  On error, -1 is returned and errno is set to indicate the cause of the error (for example, EINVAL, indicating that name is invalid).

Meaning that if the sysconf call for _SC_AVPHYS_PAGES were to fail, it could return -1, and the JDK would multiply that value by the page_size, returning a negative free memory amount. Additionally, _SC_AVPHYS_PAGES is also considered not to be a standard option: "These values also exist, but may not be standard." (for the _SC_-prefixed options).

*  If name corresponds to an option ... -1 is returned if the option is not supported.
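
To make the arithmetic concrete, here is a small sketch of the two failure modes mentioned above, using Java's 64-bit signed `long` in place of `jlong`. `PageArithmetic` is a hypothetical name, and the 4096-byte page size is an assumption chosen only for the example:

```java
// Hypothetical sketch of the `num_avail_physical_pages * page_size` product.
final class PageArithmetic {

    /** Plain 64-bit signed multiplication: wraps around silently on overflow. */
    static long freeBytes(long availablePages, long pageSize) {
        return availablePages * pageSize;
    }

    /** Same product, but throws ArithmeticException instead of wrapping. */
    static long freeBytesChecked(long availablePages, long pageSize) {
        return Math.multiplyExact(availablePages, pageSize);
    }

    public static void main(String[] args) {
        long pageSize = 4096L; // assumed page size

        // A failed sysconf() call returns -1, so the product is simply -pageSize.
        System.out.println(freeBytes(-1L, pageSize)); // -4096

        // A page count whose product exceeds Long.MAX_VALUE wraps to a negative value.
        long hugePageCount = Long.MAX_VALUE / pageSize + 1L;
        System.out.println(freeBytes(hugePageCount, pageSize) < 0); // true
    }
}
```

Either path produces a negative "free memory" figure without any exception being raised at the point of the multiplication.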

I also checked the Linux kernel 3.16.x source code: HPUX would return -1 for the available physical pages (it doesn't support that option).

int hpux_sysconf(int which)
{
	switch (which) {
	case _SC_CPU_VERSION:
		return CPU_PA_RISC1_1;
	case _SC_OPEN_MAX:
		return INT_MAX;
	default:
		return -EINVAL;
	}
}

In GNU's libc, the sysconf implementation (the "hidden" implementation, which I admit I definitely do not understand) can return -1 from the __get_phys_pages function, which the sysconf implementation calls (it appears to hand the request off to the Linux sysinfo call):

long int
__get_phys_pages (void)
{
  /* We have no general way to determine this value.  */
  __set_errno (ENOSYS);
  return -1;
}
libc_hidden_def (__get_phys_pages)
weak_alias (__get_phys_pages, get_phys_pages)

The sysinfo call can return a negative value on non-SMP kernels (which I don't think we have much of any more?):

static inline unsigned long global_page_state(enum zone_stat_item item)
{
	long x = atomic_long_read(&vm_stat[item]);
#ifdef CONFIG_SMP
	if (x < 0)
		x = 0;
#endif
	return x;
}

I know that we don't support AIX or HPUX, but given that it's possible for an OS not to support these flags (Apple failing to return host statistics for instance, or Linux running a non-standard libc (I didn't check whether musl libc could return negative values)), I think we should be extra cautious about what values we get from the OS and be safe about returning 0 when negative values are returned.

If, as might be possible, it's a serialization issue or construction issue, the extra asserts will help us narrow that down as well, since it addresses it from the Mem object construction side rather than the OS bean side.
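
A minimal sketch of the construction-side checks mentioned above, assuming a simplified value class. `MemStats` and its fields are hypothetical stand-ins rather than the actual Mem object, and the assertions are only evaluated when the JVM runs with `-ea`:

```java
// Hypothetical stand-in for a memory stats holder; not the real class.
final class MemStats {
    private final long total;
    private final long free;

    MemStats(long total, long free) {
        // Fail as early as possible so the stack trace points at the producer of the bad value.
        assert total >= 0 : "expected total memory to be non-negative, got: " + total;
        assert free >= 0 : "expected free memory to be non-negative, got: " + free;
        this.total = total;
        this.free = free;
    }

    long getTotal() {
        return total;
    }

    long getFree() {
        return free;
    }

    /** Used memory derived from the two validated inputs. */
    long getUsed() {
        return total - free;
    }
}
```

With the probe clamping values and the constructor asserting on them, a negative value that still shows up would point to serialization or construction rather than the OS bean.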

@dakrone
Member Author

dakrone commented Jun 3, 2019

@elasticmachine update branch

@dakrone
Member Author

dakrone commented Jun 4, 2019

@elasticmachine elasticsearch-ci/1

@jakelandis
Contributor

Added the team-discuss label to further discuss whether we should merge this, now that Lee has figured out exactly how this can happen.

@jakelandis jakelandis self-requested a review June 13, 2019 15:20
@jakelandis
Contributor

Discussed this today and came away with the following points:

  • It is possible that a supported distribution's kernel could emit a negative value
  • It is known that some unsupported distributions can emit a negative value
  • For unsupported distributions, if there are low-friction, low/no-maintenance fixes that better enable Elasticsearch to work on them, we should accept those changes.

We will proceed with this issue and thanks @dakrone for the deep dive into how this is possible.

@jakelandis
Contributor

LGTM pending @droberts195 comment.

Contributor

@droberts195 droberts195 left a comment

Thanks for making the extra change

@jasontedor
Member

@dakrone Thank you so much for diving into this, that’s the kind of analysis we need for changes like this. One request though, can we transfer some of the analysis into a comment in the code for future passersby?

@dakrone
Member Author

dakrone commented Jun 14, 2019

@jasontedor certainly, I'll add a comment where we get back the value from the bean

Member

@jasontedor jasontedor left a comment

LGTM.

dakrone added a commit that referenced this pull request Jun 19, 2019
…42725)

* Return 0 for negative "free" and "total" memory reported by the OS

We've had a situation where the MX bean reported negative values for the
free memory of the OS, in those rare cases we want to return a value of
0 rather than blowing up later down the pipeline.

In the event that there is a serialization or creation error with regard
to memory use, this adds asserts so the failure will occur as soon as
possible and give us a better location for investigation.

Resolves #42157

* Fix test passing in invalid memory value

* Fix another test passing in invalid memory value

* Also change mem check in MachineLearning.machineMemoryFromStats

* Add background documentation for why we prevent negative return values

* Clarify comment a bit more
@dakrone dakrone deleted the reset-negative-mem-values branch June 19, 2019 21:59
@jpountz jpountz added the >bug label Jul 5, 2019
@taf2

taf2 commented Jan 31, 2020

We are seeing this issue now on es 7.5.2 running Amazon Linux 2.

[ct@ip-10-55-31-96 ~]# curl localhost:9200/_cat/nodes
{"error":{"root_cause":[{"type":"illegal_argument_exception","reason":"Values less than -1 bytes are not supported: -2b"}],"type":"illegal_argument_exception","reason":"Values less than -1 bytes are not supported: -2b"},"status":400}[root@ip-10-55-31-96 ~]#
[ct@ip-10-55-31-96 ~]# uname -a
Linux ip-10-55-31-96.ec2.internal 4.14.152-127.182.amzn2.x86_64 #1 SMP Thu Nov 14 17:32:43 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
[ct@ip-10-55-31-96 ~]#
[ct@ip-10-55-31-96 ~]# curl localhost:9200
{
  "name" : "es7d-23",
  "cluster_name" : "es7",
  "cluster_uuid" : "UJ-qxUlHROCBIZ9pFGagHw",
  "version" : {
    "number" : "7.5.2",
    "build_flavor" : "default",
    "build_type" : "rpm",
    "build_hash" : "8bec50e1e0ad29dad5653712cf3bb580cd1afcdf",
    "build_date" : "2020-01-15T12:11:52.313576Z",
    "build_snapshot" : false,
    "lucene_version" : "8.3.0",
    "minimum_wire_compatibility_version" : "6.8.0",
    "minimum_index_compatibility_version" : "6.0.0-beta1"
  },
  "tagline" : "You Know, for Search"
}
[ct@ip-10-55-31-96 ~]#
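
For readers unfamiliar with where this message originates: the stack traces in this thread show it being thrown from the `ByteSizeValue` constructor. The following is a simplified, hypothetical stand-in (`ByteSizeSketch`) that assumes only what the message itself states, namely that -1 is tolerated and anything lower is rejected:

```java
// Hypothetical, simplified stand-in for a byte-size value type; not the real ByteSizeValue.
final class ByteSizeSketch {
    private final long bytes;

    ByteSizeSketch(long bytes) {
        // -1 is accepted (assumed here to act as a sentinel); anything smaller is rejected.
        if (bytes < -1) {
            throw new IllegalArgumentException(
                "Values less than -1 bytes are not supported: " + bytes + "b");
        }
        this.bytes = bytes;
    }

    long getBytes() {
        return bytes;
    }
}
```

Any stat that reaches such a constructor with a value below -1 fails the whole response, which is why a single bad number can break `_cat/nodes`.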

@jamshid

jamshid commented Sep 4, 2020

I'm seeing this error on a 5-node Elasticsearch 7.5.2 on centos 7.8.2003. Any idea if I should file a new issue or if it will be fixed by an upgrade or is it a sign of misconfiguration? This is upgraded from ES 6.8 and I just deleted a large index. I guess I'll try a reboot.

{"error":{"root_cause":[{"type":"illegal_argument_exception","reason":"Values less than -1 bytes are not supported: -108b","stack_trace":"[Values less than -1 bytes are not supported: -108b]; nested: IllegalArgumentException[Values less than -1 bytes are not supported: -108b];
	 org.elasticsearch.ElasticsearchException.guessRootCauses(ElasticsearchException.java:644)
	 org.elasticsearch.ElasticsearchException.generateFailureXContent(ElasticsearchException.java:57
...
Caused by: java.lang.IllegalArgumentException: Values less than -1 bytes are not supported: -108b
	 org.elasticsearch.common.unit.ByteSizeValue.<init>(ByteSizeValue.java:72)
	 org.elasticsearch.common.unit.ByteSizeValue.<init>(ByteSizeValue.java:67)
	 org.elasticsearch.index.cache.query.QueryCacheStats.getMemorySize(QueryCacheStats.java:73)
	 org.elasticsearch.rest.action.cat.RestNodesAction.buildTable(RestNodesAction.java:341)
	 org.elasticsearch.rest.action.cat.RestNodesAction$1$1$1.buildResponse(RestNodesAction.java:105)
	 org.elasticsearch.rest.action.cat.RestNodesAction$1$1$1.buildResponse(RestNodesAction.java:102)
	 org.elasticsearch.rest.action.RestResponseListener.processResponse(RestResponseListener.java:37)
	 org.elasticsearch.rest.action.RestActionListener.onResponse(RestActionListener.java:47)
	... 23 more
"}],"type":"illegal_argument_exception","reason":"Values less than -1 bytes are not supported: -108b","stack_trace":"java.lang.IllegalArgumentException: Values less than -1 bytes are not supported: -108b
	 org.elasticsearch.common.unit.ByteSizeValue.<init>(ByteSizeValue.java:72)
	 org.elasticsearch.common.unit.ByteSizeValue.<init>(ByteSizeValue.java:67)
	 org.elasticsearch.index.cache.query.QueryCacheStats.getMemorySize(QueryCacheStats.java:73)
	 org.elasticsearch.rest.action.cat.RestNodesAction.buildTable(RestNodesAction.java:341)
	 org.elasticsearch.rest.action.cat.RestNodesAction$1$1$1.buildResponse(RestNodesAction.java:105)
...

UPDATE: fwiw the _cat api error seemed to go away on its own when I checked back a couple of days later. Also some of the nodes were on centos 7.6 and 7.7 and had older openjdk-1.8.0 patch levels, maybe contributing factor? Also saw somewhere that maybe sudo swapoff -a stops the error.

@RicardoGralhoz

RicardoGralhoz commented Dec 3, 2020

I also saw this issue on a 13-node cluster with Elasticsearch v7.8.1, Ubuntu 18.04.2 LTS, OpenJDK 64-Bit v11.0.3, using G1GC, with 32GiB of memory, setting 16GB for JVM heap, IHOP 30% and reserving 25% for G1. Upgraded from 7.1.0, but then deleted old data. Fixed on restart.

Edited to add a related topic on the forum:
help-with-unassigned-shards-circuitbreakingexception-values-less-than-1-bytes-are-not-supported

cat cluster_stats.json 

Bad Request. Rejected
{
  "error" : { 
    "root_cause" : [ 
      {   
        "type" : "illegal_argument_exception",
        "reason" : "Values less than -1 bytes are not supported: -9223372036787056125b"
      }   
    ],  
    "type" : "illegal_argument_exception",
    "reason" : "Values less than -1 bytes are not supported: -9223372036787056125b",
    "suppressed" : [ 
      {   
        "type" : "illegal_state_exception",
        "reason" : "Failed to close the XContentBuilder",
        "caused_by" : { 
          "type" : "i_o_exception",
          "reason" : "Unclosed object or array found"
        }
      }   
    ]   
  },  
  "status" : 400 
}

williamrandolph added a commit that referenced this pull request Jun 1, 2021
We've had a series of bug fixes for cases where an OsProbe gives negative
values, most often just -1, to the OsStats class. We added assertions to catch
cases where we were initializing OsStats with bad values. Unfortunately, these
fixes turned out not to be backwards-compatible. In this commit, we simply coerce
bad values to 0 when data is coming from nodes that don't have the relevant bug
fixes.

Relevant PRs:
* #42725
* #56435
* #57317

Fixes #73459
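
The idea in this commit message can be sketched as a version-gated coercion on the receiving side. Everything here is hypothetical: `MemWireSanitizer`, the integer version scheme, and `FIRST_FIXED_VERSION` are illustrative assumptions, not the actual wire format or version constants:

```java
// Hypothetical sketch of coercing bad values that arrive from older nodes.
final class MemWireSanitizer {

    /** Assumed version id from which senders already clamp negative values themselves. */
    static final int FIRST_FIXED_VERSION = 2;

    /**
     * Returns a usable free-memory value for a number received from another node.
     * Older senders may legitimately transmit negative values, so those are coerced to 0.
     */
    static long sanitizeFreeBytes(long valueFromWire, int senderVersion) {
        if (senderVersion < FIRST_FIXED_VERSION && valueFromWire < 0) {
            return 0L;
        }
        // Newer senders are expected to have clamped already; a negative value is a bug.
        assert valueFromWire >= 0 : "expected non-negative free memory, got: " + valueFromWire;
        return valueFromWire;
    }
}
```

Gating on the sender's version keeps the strict assertion for up-to-date nodes while tolerating values from nodes that predate the fixes.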
Successfully merging this pull request may close these issues.

Negative used memory causes IllegalArgumentException in _cat/nodes
9 participants