env_aws: use best-effort lookup table for CPU performance in EC2 #7828

shoenig · 2020-04-29T00:29:55Z

The current behavior of the CPU fingerprinter in AWS is that it
reads the current speed from /proc/cpuinfo (CPU MHz field).

This is because the max CPU frequency is not available by reading
anything on the EC2 instance itself. Normally on Linux one would
look at e.g. /sys/devices/system/cpu/cpuN/cpufreq/cpuinfo_max_freq
or perhaps parse the values from the CPU max MHz field in
/proc/cpuinfo, but those values are not available.
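
For context, a minimal Go sketch of the usual sysfs approach on a Linux host that does expose cpufreq. The helper names `parseKHzToMHz` and `readMaxFreqMHz` are hypothetical and not part of Nomad or go-psutil; the only fact relied on is that `cpuinfo_max_freq` reports kHz. On EC2 the read is expected to fail, which is the problem described above.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseKHzToMHz converts the text content of a cpufreq file (a value in kHz)
// into whole MHz. Hypothetical helper for illustration only.
func parseKHzToMHz(raw string) (int, error) {
	khz, err := strconv.Atoi(strings.TrimSpace(raw))
	if err != nil {
		return 0, err
	}
	return khz / 1000, nil
}

// readMaxFreqMHz reads a cpufreq file such as cpuinfo_max_freq and returns
// the value in MHz, or an error if the file is missing or malformed.
func readMaxFreqMHz(path string) (int, error) {
	raw, err := os.ReadFile(path)
	if err != nil {
		return 0, err
	}
	return parseKHzToMHz(string(raw))
}

func main() {
	const path = "/sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq"
	mhz, err := readMaxFreqMHz(path)
	if err != nil {
		// Expected on EC2, where the cpufreq files are not available.
		fmt.Println("max frequency unavailable:", err)
		return
	}
	fmt.Println("max MHz:", mhz)
}
```
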

Furthermore, no metadata about the CPU is made available in the
EC2 metadata service.
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-categories.html

Since go-psutil cannot determine the max CPU speed, it defaults to
the current CPU speed, which could be basically any number between
0 and the true max. This is particularly bad on large, powerful
reserved instances, which often idle at ~800 MHz while Nomad does
its fingerprinting (typically IO bound). Nomad then uses that idle
speed as the max, which results in a severe loss of available resources.

Since the CPU specification is unavailable programmatically (at least
not without sudo) use a best-effort lookup table. This table was
generated by going through every instance type in AWS documentation
and copy-pasting the numbers.
https://aws.amazon.com/ec2/instance-types/

This approach obviously is not ideal as future instance types will
need to be added as they are introduced to AWS. However, using the
table should only be an improvement over the status quo since right
now Nomad miscalculates available CPU resources on all instance types.

Fixes #7681
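
A rough sketch of what such a lookup table could look like in Go. This is not the code from `client/fingerprint/env_aws.go`: the type `cpuSpec`, the map `ec2CPUTable`, and the function `lookupEC2CPU` are hypothetical names, and the only row shown is the c5.24xlarge whose values appear in the debug logs in this thread.

```go
package main

import "fmt"

// cpuSpec is a hypothetical record for one row of the lookup table.
type cpuSpec struct {
	Model string // CPU description as listed in the AWS documentation
	MHz   int    // advertised clock speed in MHz
	Cores int    // core count reported for the instance type
}

// ec2CPUTable is illustrative only. The single entry uses the values
// logged for a c5.24xlarge in this pull request.
var ec2CPUTable = map[string]cpuSpec{
	"c5.24xlarge": {Model: "3.6 GHz Intel Xeon Scalable", MHz: 3600, Cores: 96},
}

// lookupEC2CPU returns the table entry for an instance type. The boolean
// is false for unknown types so the caller can keep the fingerprinted value.
func lookupEC2CPU(instanceType string) (cpuSpec, bool) {
	spec, ok := ec2CPUTable[instanceType]
	return spec, ok
}

func main() {
	if spec, ok := lookupEC2CPU("c5.24xlarge"); ok {
		fmt.Printf("model=%q MHz=%d cores=%d\n", spec.Model, spec.MHz, spec.Cores)
	}
	if _, ok := lookupEC2CPU("not-a-real.type"); !ok {
		fmt.Println("unknown instance type; keep the fingerprinted value")
	}
}
```

Returning a boolean rather than a zero-valued spec keeps the "best-effort" nature explicit: an instance type missing from the table is never mistaken for a 0 MHz CPU.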

shoenig · 2020-04-29T01:31:18Z

Running Nomad with this change on a c5.24xlarge like the original reporter:

run 1

2020-04-29T01:22:55.187Z [DEBUG] client.fingerprint_mgr.cpu: detected cpu frequency: MHz=3346
2020-04-29T01:22:55.187Z [DEBUG] client.fingerprint_mgr.cpu: detected core count: cores=96
2020-04-29T01:22:55.201Z [DEBUG] client.fingerprint_mgr.env_aws: lookup ec2 cpu model name: model="3.6 GHz Intel Xeon Scalable"
2020-04-29T01:22:55.201Z [DEBUG] client.fingerprint_mgr.env_aws: lookup ec2 cpu frequency: MHz=3600
2020-04-29T01:22:55.201Z [DEBUG] client.fingerprint_mgr.env_aws: lookup ec2 cpu cores: cores=96
2020-04-29T01:22:55.201Z [DEBUG] client.fingerprint_mgr.env_aws: setting ec2 cpu ticks: ticks=345600

run 2

2020-04-29T01:23:58.743Z [DEBUG] client.fingerprint_mgr.cpu: detected cpu frequency: MHz=1335
2020-04-29T01:23:58.743Z [DEBUG] client.fingerprint_mgr.cpu: detected core count: cores=96
2020-04-29T01:23:58.755Z [DEBUG] client.fingerprint_mgr.env_aws: lookup ec2 cpu model name: model="3.6 GHz Intel Xeon Scalable"
2020-04-29T01:23:58.755Z [DEBUG] client.fingerprint_mgr.env_aws: lookup ec2 cpu frequency: MHz=3600
2020-04-29T01:23:58.755Z [DEBUG] client.fingerprint_mgr.env_aws: lookup ec2 cpu cores: cores=96
2020-04-29T01:23:58.755Z [DEBUG] client.fingerprint_mgr.env_aws: setting ec2 cpu ticks: ticks=345600

run 3

2020-04-29T01:25:19.395Z [DEBUG] client.fingerprint_mgr.cpu: detected cpu frequency: MHz=2218
2020-04-29T01:25:19.395Z [DEBUG] client.fingerprint_mgr.cpu: detected core count: cores=96
2020-04-29T01:25:19.407Z [DEBUG] client.fingerprint_mgr.env_aws: lookup ec2 cpu model name: model="3.6 GHz Intel Xeon Scalable"
2020-04-29T01:25:19.407Z [DEBUG] client.fingerprint_mgr.env_aws: lookup ec2 cpu frequency: MHz=3600
2020-04-29T01:25:19.407Z [DEBUG] client.fingerprint_mgr.env_aws: lookup ec2 cpu cores: cores=96
2020-04-29T01:25:19.407Z [DEBUG] client.fingerprint_mgr.env_aws: setting ec2 cpu ticks: ticks=345600
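
The logged values are consistent with ticks = MHz × cores (3600 × 96 = 345600). Assuming that relationship also applies to the detected frequency, the sketch below shows what each run would have advertised without the lookup; for example, run 2 at 1335 MHz would yield 128160, roughly 37% of the true capacity. The function name `totalTicks` is hypothetical.

```go
package main

import "fmt"

// totalTicks computes total CPU capacity as frequency in MHz times core
// count. This mirrors the relationship visible in the logs; it is not
// copied from the Nomad source.
func totalTicks(mhz, cores int) int {
	return mhz * cores
}

func main() {
	const cores = 96
	fromLookup := totalTicks(3600, cores)

	// Detected frequencies from runs 1, 2, and 3 above.
	for _, detected := range []int{3346, 1335, 2218} {
		t := totalTicks(detected, cores)
		pct := float64(t) * 100 / float64(fromLookup)
		fmt.Printf("detected MHz=%d ticks=%d (%.0f%% of %d)\n", detected, t, pct, fromLookup)
	}
}
```
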

plumbing works

ubuntu@ip-172-31-84-116:~$ ./nomad node status 17
ID              = 1707aadb-c741-b6ac-b487-b1d713466260
Name            = ip-172-31-84-116
Class           = <none>
DC              = dc1
Drain           = false
Eligibility     = eligible
Status          = ready
CSI Controllers = <none>
CSI Drivers     = <none>
Uptime          = 7m3s
Host Volumes    = <none>
CSI Volumes     = <none>
Driver Status   = mock_driver,raw_exec

Node Events
Time                  Subsystem  Message
2020-04-29T01:25:37Z  Cluster    Node registered

Allocated Resources
CPU           Memory       Disk
0/345600 MHz  0 B/185 GiB  0 B/6.5 GiB

Allocation Resource Utilization
CPU           Memory
0/345600 MHz  0 B/185 GiB

Host Resource Utilization
CPU             Memory           Disk
244/345600 MHz  669 MiB/185 GiB  1.2 GiB/7.7 GiB

Allocations
No allocations placed

notnoop

I would love to see some automation tips/scripts for updating the list, and I have some nitpicky logging suggestions. LGTM otherwise.

client/fingerprint/env_aws.go

jippi · 2020-04-29T12:45:29Z

Will the client settings still take precedence when configured?


shoenig · 2020-04-29T17:12:20Z

@jippi Yep, configuring cpu_total_compute continues to take the highest precedence.
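
The precedence can be pictured with a small Go sketch. Only the first rule (a configured `cpu_total_compute` wins) is stated in this thread; placing the EC2 lookup ahead of the generic detected value is inferred from the logs, and treating 0 as "unset" is an assumption made for the example. `chooseTotalCompute` is a hypothetical name, not a Nomad function.

```go
package main

import "fmt"

// chooseTotalCompute picks the total CPU capacity in MHz.
//   configured: value of cpu_total_compute (0 assumed to mean unset)
//   lookup:     value derived from the EC2 table (0 assumed to mean unknown type)
//   detected:   value from the generic CPU fingerprinter
func chooseTotalCompute(configured, lookup, detected int) int {
	switch {
	case configured > 0:
		return configured // operator setting has the highest precedence
	case lookup > 0:
		return lookup // best-effort EC2 table entry
	default:
		return detected // fall back to what was fingerprinted
	}
}

func main() {
	fmt.Println(chooseTotalCompute(100000, 345600, 128160)) // operator override
	fmt.Println(chooseTotalCompute(0, 345600, 128160))      // EC2 lookup
	fmt.Println(chooseTotalCompute(0, 0, 128160))           // detected value
}
```
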

github-actions · 2023-01-08T02:17:59Z

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

shoenig force-pushed the b-ec2-speeds branch 2 times, most recently from c7e35d4 to 2589b85 Compare April 29, 2020 00:47

shoenig force-pushed the b-ec2-speeds branch from 2589b85 to 9230fa9 Compare April 29, 2020 01:01

shoenig marked this pull request as ready for review April 29, 2020 01:31

shoenig requested review from notnoop and jrasell April 29, 2020 01:31

notnoop approved these changes Apr 29, 2020


jippi reviewed Apr 29, 2020


shoenig and others added 3 commits April 29, 2020 10:33

env_aws: fixup log line

f47c57f

Co-Authored-By: Mahmood Ali <mahmood@hashicorp.com>

env_aws: downgrade log line

0d5d178

Co-Authored-By: Mahmood Ali <mahmood@hashicorp.com>

env_aws: combine 3 log lines into 1

a869394

shoenig mentioned this pull request Apr 29, 2020

Create a process to scrape EC2 CPU information out of AWS API #7830

Closed

shoenig merged commit a12eb8f into master Apr 29, 2020

shoenig deleted the b-ec2-speeds branch April 29, 2020 17:25

github-actions bot locked as resolved and limited conversation to collaborators Jan 8, 2023
