-
Notifications
You must be signed in to change notification settings - Fork 68
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add cloud core hour utilization statistic #1242
Add cloud core hour utilization statistic #1242
Conversation
public function __construct($query_instance = null) | ||
{ | ||
|
||
$sql = 'COALESCE((SUM(jf.core_time) / SUM(DISTINCT jf.core_time_available)) * 100, 0)'; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not the correct way to calculate utilization. This does not handle the case where there are multiple cloud resources in the database.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The utilization statistic can be calculated in much the same way as the one in the Jobs realm (except, of course, joining against a table that contains the correct resource availability data). You need a correlated subquery to pull in the correct core_time_available for each resource in the tables. The aggregate query and timeseries queries are similar but not identical.
An example of the sql for a timeseries query is below:
SELECT
jf.day_id,
jf.host_resource_id,
SUM(jf.core_time) / (SELECT
SUM(rf.core_time_available)
FROM
modw_aggregates.resourcespecsfact_by_day rf
WHERE
rf.day_id = jf.day_id
AND FIND_IN_SET(rf.resource_id,
GROUP_CONCAT(DISTINCT jf.host_resource_id)) <> 0) * 100.0 AS utilization,
SUM(jf.core_time) AS core_time,
(SELECT
SUM(rf.core_time_available)
FROM
modw_aggregates.resourcespecsfact_by_day rf
WHERE
rf.day_id = jf.day_id
AND FIND_IN_SET(rf.resource_id,
GROUP_CONCAT(DISTINCT jf.host_resource_id)) <> 0) AS total
FROM
modw_cloud.cloudfact_by_day jf
WHERE
jf.day_id BETWEEN 202000001 AND 202000006
GROUP BY jf.day_id , jf.host_resource_id;
The important bit is the join on the days table to pull in the data for the associated day and the FIND_IN_SET() to get the rows for the corresponding resource.
The aggregate query is similar to what you have already, but you have to use a where condition to limit to the appropriate resource:
SELECT
SUM(jf.core_time) / (SELECT
SUM(rf.core_time_available)
FROM
modw_aggregates.resourcespecsfact_by_day rf
WHERE
rf.day_id BETWEEN 202000001 AND 202000006
AND FIND_IN_SET(rf.resource_id,
GROUP_CONCAT(DISTINCT jf.host_resource_id)) <> 0) * 100.0 AS utilization,
SUM(jf.core_time) AS core_time,
(SELECT
SUM(rf.core_time_available)
FROM
modw_aggregates.resourcespecsfact_by_day rf
WHERE
rf.day_id = jf.day_id
AND FIND_IN_SET(rf.resource_id,
GROUP_CONCAT(DISTINCT jf.host_resource_id)) <> 0) AS total
FROM
modw_cloud.cloudfact_by_day jf
WHERE
jf.day_id BETWEEN 202000001 AND 202000006;
There is no need to add a new column to the cloudfact_ tables.
@@ -45,7 +45,8 @@ | |||
"wallduration": "COALESCE(SUM(${wallduration_case_statement}), 0)", | |||
"submission_venue_id": "sr.submission_venue_id", | |||
"domain_id": "sr.domain_id", | |||
"service_provider": "sr.service_provider" | |||
"service_provider": "sr.service_provider", | |||
"core_time_available": "COALESCE(rsa.core_time_available, 0)" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think that there is a need to add this column to this table. Typically the values stored in the rows correspond to the contribution to the overall value for the given row, whereas the value of core_time_available is the total amount available for the resource for that row.
}, | ||
{ | ||
"name": "vcpus", | ||
"type": "int(5)", | ||
"nullable": false, | ||
"default": null | ||
"default": null, | ||
"comments": "Number of vcpus available on the associated node." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you should explain what a vcpu is here too.
|
||
"indexes": [ | ||
{ | ||
"name": "index_resource", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you do any performance testing to see which (if any) of these indexes are used? I suppose this will change when the utilization calculation is fixed. I suspect that you may want a combined index on (resource_id, period_id) or (period_id, resource_id) for the best query performance.
A state reconstructor design also only needs to see the resource data when it changes and not every day. Why would the state reconstructor need to see data every day even if it didn't change? |
tests/ci/bootstrap.sh
Outdated
@@ -72,6 +72,7 @@ then | |||
then | |||
sudo -u xdmod xdmod-shredder -r openstack -d $REF_DIR/openstack -f openstack | |||
sudo -u xdmod xdmod-shredder -r nutsetters -d $REF_DIR/nutsetters -f openstack | |||
sudo -u xdmod php /data/xdmod/tools/etl/etl_overseer.php -p ingest-cloud-resource-specs -d "CLOUD_RESOURCE_SPECS_DIRECTORY=$REF_DIR/openstack_resource_specs" -r openstack -v debug |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Presumably this is intentionally not integrated into the xdmod-shredder
? You'll need to add documentation to explain how and when to use the command and include examples.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also what happens if the user forgets to run this overseer command?
The change of the StateReconstructorTransformIngestor to sql instead of php seems like it should be in a separate pull request. |
SonarCloud Quality Gate failed. 0 Bugs No Coverage information |
…lated. remove unneeded test artifacts and changed column names for more clarity and consistency.
1fd05eb
to
cec0211
Compare
configuration/etl/etl_tables.d/cloud_common/raw_resource_specs.json
Outdated
Show resolved
Hide resolved
…ext changes to headers not changes to utilization numbers
@@ -44,6 +44,9 @@ | |||
"cloud_common/asset.json", | |||
"cloud_common/instance_data.json", | |||
"cloud_common/event_asset.json", | |||
"cloud_common/raw_resource_specs.json", | |||
"cloud_common/cloud_resource_specs.json", | |||
"common/hpcdb/resourcespecsfact_by_.json", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is file in common/hpcdb
? Is this table intended to be in mod_hpcdb
? It's using the endpoint for modw_cloud
so that isn't happening.
I don't think it should be in mod_hpcdb
. That schema is intended to normalized (non-data warehouse) data. If this is cloud specific it should probably be in modw_cloud
and if it's for other realms as well then it should be in modw_aggregates
.
Looking more closely I don't think you want this in a ManageTables
at all since that is creating a table named resourcespecsfact_by_
:
MariaDB [modw_cloud]> DESC resourcespecsfact_by_;
+------------------------+----------------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+------------------------+----------------------+------+-----+---------+-------+
| ${aggregation_unit}_id | int(10) unsigned | NO | MUL | NULL | |
| year | smallint(5) unsigned | NO | | NULL | |
| ${aggregation_unit} | smallint(5) unsigned | NO | MUL | NULL | |
| resource_id | int(11) | NO | MUL | NULL | |
| core_time_available | bigint(42) | NO | | NULL | |
+------------------------+----------------------+------+-----+---------+-------+
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whoops. Yeah, you're right. This shouldn't be in common/hpcdb and it shouldn't be in mod_hpcdb. I'll change it so the table goes into mod_aggregates.
I think we should rename this from resourcespecs to something else. It is a bit confusing since we already have a resourcespecs in modw that does something a little bit different |
we actually might want to consider how this could either utilize or at least be similar to https://github.com/ubccr/xdmod-hardware since that does node level tracking over time as well... |
docs/cloud.md
Outdated
@@ -11,6 +11,8 @@ The Cloud realm in XDMoD tracks events that occur in cloud infrastructure system | |||
- The average amount of root volume disk space (in bytes) reserved by running sessions, weighted by wall hours. | |||
- Average Wall Hours per Session | |||
- The average wall time that a session was running, in hours. | |||
- Core Hour Utilization: % | |||
- A percentage that shows how many core hours were reserved over a time period against how many core hours a resource had available during that time period. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a bit more explanation as to what is meant by 'reserved'. This has multiple different meanings so it behoves us to be very precise with the definition.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jpwhite4 I updated the description to remove reserved and hopefully make it a little clearer what the statistic is measuring.
"year": "${:YEAR_VALUE}", | ||
"${AGGREGATION_UNIT}": "${:PERIOD_VALUE}", | ||
"resource_id": "crs.resource_id", | ||
"core_time_available": "SUM(((IF(crs.end_day_id <= ${:PERIOD_END_DAY_ID}, crs.end_day_id, ${:PERIOD_END_DAY_ID}) - IF(crs.start_day_id >= ${:PERIOD_START_DAY_ID}, crs.start_day_id, ${:PERIOD_START_DAY_ID})) + 1) * crs.vcpus * 24 * 3600)" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please can you provide a bit of explanation about why this IF statement is needed. I was not expecting it to be here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In particular when this code is computing the year table how does this IF statement compute the core_time_available if the resource specifications change during the year?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. I think I understand how it works now. You cannot multiply by 24 since not all days have 24 hours in them. You need to use the seconds value from the time table to work out how many seconds there are in a given day.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SELECT * FROM modw.days where seconds != 86400
basically the days when daylight savings begin and when they end (for timezones that have daylight savings).
}, | ||
"aggregation_period_query": { | ||
"overseer_restrictions": { | ||
"include_only_resource_codes": "resource_id IN ${VALUE}", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is also necessary to add the time range restriction information here too:
"overseer_restrictions": {
"last_modified_start_date": "last_modified >= ${VALUE}",
"last_modified_end_date": "last_modified <= ${VALUE}",
@@ -44,6 +44,16 @@ | |||
"name": "end_date", | |||
"type": "date", | |||
"nullable": true | |||
}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you should put a last_modified column here too so that we only have to reaggregate time periods that need reaggregation.
"start_date": "IF(r1.memory_mb = -1 AND r1.vcpus = -1, r2.fact_date, r1.fact_date)", | ||
"end_date": "CASE WHEN MIN(r2.fact_date) IS NOT NULL AND (r1.memory_mb != -1 AND r1.vcpus != -1) THEN MIN(r2.fact_date) - INTERVAL 1 DAY WHEN r1.memory_mb = -1 AND r1.vcpus = -1 AND (SELECT MAX(fact_date) FROM modw_cloud.raw_resource_specs) != MAX(r2.fact_date) THEN MAX(r1.fact_date) - INTERVAL 1 DAY ELSE CURDATE() END", | ||
"start_day_id": "IF(r1.memory_mb = -1 AND r1.vcpus = -1, YEAR(r2.fact_date) * 100000 + DAYOFYEAR(r2.fact_date), YEAR(r1.fact_date) * 100000 + DAYOFYEAR(r1.fact_date))", | ||
"end_day_id": "CASE WHEN MIN(r2.fact_date) IS NOT NULL AND (r1.memory_mb != -1 AND r1.vcpus != -1) THEN YEAR(MIN(r2.fact_date) - INTERVAL 1 DAY) * 100000 + DAYOFYEAR(MIN(r2.fact_date) - INTERVAL 1 DAY) WHEN r1.memory_mb = -1 AND r1.vcpus = -1 AND (SELECT MAX(fact_date) FROM modw_cloud.raw_resource_specs) != MAX(r2.fact_date) THEN YEAR(MAX(r1.fact_date) - INTERVAL 1 DAY) * 100000 + DAYOFYEAR(MAX(r1.fact_date) - INTERVAL 1 DAY) ELSE YEAR(CURDATE()) * 100000 + DAYOFYEAR(CURDATE()) END" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The fact_date
column is defined as not null in the definition below. Therefore this "is not null" condition is unecessary. Please double check this function and correct if necessary.
-- This sql statement inserts -1 values for the memory_mb and vcpus for a day that a compute node has been | ||
-- removed from the most recently ingested resource specifications file. The -1 helps when setting start and | ||
-- end times of a cpu and memory configuration for a compute node. | ||
INSERT INTO modw_cloud.raw_resource_specs (hostname, resource_id, memory_mb, vcpus, fact_date) SELECT |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer tabs (https://www.reddit.com/r/javascript/comments/c8drjo/nobody_talks_about_the_real_reason_to_use_tabs/), so feel free to completely ignore this.
But for all our other source indentation is done with 4 spaces.
…utilization-metrics
…shredder and xdmod-ingestor
…utilization-metrics
…777/xdmod into cloud-utilization-metrics
…utilization-metrics
…777/xdmod into cloud-utilization-metrics
There are no documentation updates in this change but the bootstrap script was edited to add a new call to the shredder. Will the new metric ''just work'' or are there additional steps that need to be done to ingest/aggregate existing data? Are there any other changes that need to be made to the scripts that run on metrics-dev and metrics for this to work for us? Same question about xdmod-dev and xdmod. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't forget to update the script that runs on metrics-dev so that it pull in the cloud utilization metrics.
This PR adds a aggregation of resource specification for a resource a group by for Cloud Core Hours utilization and changes how the date range for a compute nodes memory and cpu specifications are determined.
An aggregate table is created in modw_aggregates that lists the the aggregation time period, resource id, and core time available in seconds for that time period. Cloud aggregation selects the core time available from the appropriate aggregation table and uses that in the statistic calculation.
Determining the date range for the amount of memory and cpus for a compute node is now determined using 2 sql statements instead of the StateReconstructorTransformIngestor. This new way makes it possible to only need the resource specifications on days where the specifications change instead of every day and results in less code. It works by using a self join on any rows with a fact_date in the future that have a different amount of memory or cpus. The minimum date of the joined rows is end date for that specification.
Regression tests have been added for the group by and the component test for the StateReconstructorTransformIngestor has been remove since the StateReconstructorTransformIngestor has been removed as well.
Tests performed
Tested in docker and on metrics-dev. The metrics-dev link is https://metrics-dev.ccr.buffalo.edu:9008
Types of changes
Checklist: