Add cloud core hour utilization statistic #1242

Merged · 32 commits · Jun 29, 2020

Conversation

eiffel777 (Contributor)

This PR adds an aggregation of resource specifications for a resource, a group by for Cloud Core Hours utilization, and changes how the date range for a compute node's memory and CPU specifications is determined.

An aggregate table is created in modw_aggregates that lists the aggregation time period, resource id, and core time available in seconds for that time period. Cloud aggregation selects the core time available from the appropriate aggregation table and uses it in the statistic calculation.

The date range for the amount of memory and CPUs on a compute node is now determined using two SQL statements instead of the StateReconstructorTransformIngestor. This new approach only needs resource specifications on days where the specifications change, instead of every day, and results in less code. It works by using a self join against any rows with a later fact_date that have a different amount of memory or CPUs; the minimum date of the joined rows is the end date for that specification (see the sketch below).
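A minimal sketch of that self join, assuming the raw_resource_specs columns that appear later in this PR (hostname, resource_id, memory_mb, vcpus, fact_date) and ignoring the -1 sentinel rows discussed later; the real logic lives in the ETL JSON definitions:

-- For each specification row, the earliest later row with a different
-- memory or vcpus value marks the day after this specification ends.
SELECT
    r1.resource_id,
    r1.hostname,
    r1.fact_date AS start_date,
    MIN(r2.fact_date) - INTERVAL 1 DAY AS end_date  -- NULL when the spec is still current
FROM modw_cloud.raw_resource_specs r1
LEFT JOIN modw_cloud.raw_resource_specs r2
    ON r2.resource_id = r1.resource_id
    AND r2.hostname = r1.hostname
    AND r2.fact_date > r1.fact_date
    AND (r2.memory_mb != r1.memory_mb OR r2.vcpus != r1.vcpus)
GROUP BY r1.resource_id, r1.hostname, r1.fact_date;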

Regression tests have been added for the group by, and the component test for the StateReconstructorTransformIngestor has been removed since the StateReconstructorTransformIngestor itself has been removed.

Tests performed

Tested in Docker and on metrics-dev. The metrics-dev link is https://metrics-dev.ccr.buffalo.edu:9008

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My code follows the code style of this project as found in the CONTRIBUTING document.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

@eiffel777 eiffel777 added the Category:Cloud (Cloud Realm) and new feature (New functionality) labels Mar 3, 2020
@eiffel777 eiffel777 added this to the 9.0.0 milestone Mar 3, 2020
@eiffel777 eiffel777 self-assigned this Mar 3, 2020
public function __construct($query_instance = null)
{

$sql = 'COALESCE((SUM(jf.core_time) / SUM(DISTINCT jf.core_time_available)) * 100, 0)';
Member

This is not the correct way to calculate utilization. This does not handle the case where there are multiple cloud resources in the database: SUM(DISTINCT jf.core_time_available) collapses duplicate values, so two resources with identical core_time_available would be counted only once.

@jpwhite4 (Member) Mar 3, 2020

The utilization statistic can be calculated in much the same way as the one in the Jobs realm (except, of course, joining against a table that contains the correct resource availability data). You need a correlated subquery to pull in the correct core_time_available for each resource in the tables. The aggregate query and timeseries queries are similar but not identical.

An example of the sql for a timeseries query is below:

SELECT 
    jf.day_id,
    jf.host_resource_id,
    SUM(jf.core_time) / (SELECT 
            SUM(rf.core_time_available)
        FROM
            modw_aggregates.resourcespecsfact_by_day rf
        WHERE
            rf.day_id = jf.day_id
                AND FIND_IN_SET(rf.resource_id,
                    GROUP_CONCAT(DISTINCT jf.host_resource_id)) <> 0) * 100.0 AS utilization,
    SUM(jf.core_time) AS core_time,
    (SELECT 
            SUM(rf.core_time_available)
        FROM
            modw_aggregates.resourcespecsfact_by_day rf
        WHERE
            rf.day_id = jf.day_id
                AND FIND_IN_SET(rf.resource_id,
                    GROUP_CONCAT(DISTINCT jf.host_resource_id)) <> 0) AS total
FROM
    modw_cloud.cloudfact_by_day jf
WHERE
    jf.day_id BETWEEN 202000001 AND 202000006
GROUP BY jf.day_id , jf.host_resource_id;

The important bit is the join on the days table to pull in the data for the associated day and the FIND_IN_SET() to get the rows for the corresponding resource.
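For reference, the day_id values in these queries encode the year and day of year; the same YEAR * 100000 + DAYOFYEAR expression appears in the ETL definitions later in this PR:

-- day_id encodes year and day-of-year: 2020-01-06 -> 202000006,
-- so the BETWEEN above covers January 1-6, 2020.
SELECT YEAR('2020-01-06') * 100000 + DAYOFYEAR('2020-01-06') AS day_id;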

The aggregate query is similar to what you have already, but you have to use a where condition to limit to the appropriate resource:

SELECT 
    SUM(jf.core_time) / (SELECT 
            SUM(rf.core_time_available)
        FROM
            modw_aggregates.resourcespecsfact_by_day rf
        WHERE
            rf.day_id BETWEEN 202000001 AND 202000006
                AND FIND_IN_SET(rf.resource_id,
                    GROUP_CONCAT(DISTINCT jf.host_resource_id)) <> 0) * 100.0 AS utilization,
    SUM(jf.core_time) AS core_time,
    (SELECT 
            SUM(rf.core_time_available)
        FROM
            modw_aggregates.resourcespecsfact_by_day rf
        WHERE
            rf.day_id BETWEEN 202000001 AND 202000006
                AND FIND_IN_SET(rf.resource_id,
                    GROUP_CONCAT(DISTINCT jf.host_resource_id)) <> 0) AS total
FROM
    modw_cloud.cloudfact_by_day jf
WHERE
    jf.day_id BETWEEN 202000001 AND 202000006;

There is no need to add a new column to the cloudfact_ tables.

@@ -45,7 +45,8 @@
"wallduration": "COALESCE(SUM(${wallduration_case_statement}), 0)",
"submission_venue_id": "sr.submission_venue_id",
"domain_id": "sr.domain_id",
"service_provider": "sr.service_provider"
"service_provider": "sr.service_provider",
"core_time_available": "COALESCE(rsa.core_time_available, 0)"
Member

I don't think that there is a need to add this column to this table. Typically the values stored in the rows correspond to the contribution to the overall value for the given row, whereas the value of core_time_available is the total amount available for the resource for that row.

},
{
"name": "vcpus",
"type": "int(5)",
"nullable": false,
"default": null
"default": null,
"comments": "Number of vcpus available on the associated node."
Member

You should explain what a vcpu is here too.


"indexes": [
{
"name": "index_resource",
Member

Did you do any performance testing to see which (if any) of these indexes are used? I suppose this will change when the utilization calculation is fixed. I suspect that you may want a combined index on (resource_id, period_id) or (period_id, resource_id) for the best query performance.
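A hypothetical combined index along those lines for the daily table (the index name is illustrative):

-- Illustrative only: one composite index serving the per-day, per-resource
-- lookups in the utilization subqueries above.
ALTER TABLE modw_aggregates.resourcespecsfact_by_day
    ADD INDEX index_period_resource (day_id, resource_id);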

@jpwhite4 (Member) commented Mar 3, 2020

A state reconstructor design also only needs to see the resource data when it changes and not every day. Why would the state reconstructor need to see data every day even if it didn't change?

@@ -72,6 +72,7 @@ then
then
sudo -u xdmod xdmod-shredder -r openstack -d $REF_DIR/openstack -f openstack
sudo -u xdmod xdmod-shredder -r nutsetters -d $REF_DIR/nutsetters -f openstack
+sudo -u xdmod php /data/xdmod/tools/etl/etl_overseer.php -p ingest-cloud-resource-specs -d "CLOUD_RESOURCE_SPECS_DIRECTORY=$REF_DIR/openstack_resource_specs" -r openstack -v debug
Member

Presumably this is intentionally not integrated into the xdmod-shredder? You'll need to add documentation to explain how and when to use the command and include examples.

Member

Also what happens if the user forgets to run this overseer command?

@plessbd (Contributor) commented Mar 3, 2020

The change of the StateReconstructorTransformIngestor to SQL instead of PHP seems like it should be in a separate pull request.

tests/ci/bootstrap.sh (outdated, resolved)
@sonarcloud bot commented Mar 4, 2020

SonarCloud Quality Gate failed.

  • Bugs: A (0)
  • Vulnerabilities: A (0, with 0 Security Hotspots to review)
  • Code Smells: A (0)
  • Coverage: no coverage information
  • Duplication: 6.9%

@@ -44,6 +44,9 @@
"cloud_common/asset.json",
"cloud_common/instance_data.json",
"cloud_common/event_asset.json",
"cloud_common/raw_resource_specs.json",
"cloud_common/cloud_resource_specs.json",
"common/hpcdb/resourcespecsfact_by_.json",
Contributor

Why is this file in common/hpcdb? Is this table intended to be in mod_hpcdb? It's using the endpoint for modw_cloud, so that isn't happening.

I don't think it should be in mod_hpcdb. That schema is intended for normalized (non-data-warehouse) data. If this is cloud specific it should probably be in modw_cloud, and if it's for other realms as well then it should be in modw_aggregates.

Looking more closely, I don't think you want this in ManageTables at all, since that is creating a table literally named resourcespecsfact_by_:

MariaDB [modw_cloud]> DESC resourcespecsfact_by_;
+------------------------+----------------------+------+-----+---------+-------+
| Field                  | Type                 | Null | Key | Default | Extra |
+------------------------+----------------------+------+-----+---------+-------+
| ${aggregation_unit}_id | int(10) unsigned     | NO   | MUL | NULL    |       |
| year                   | smallint(5) unsigned | NO   |     | NULL    |       |
| ${aggregation_unit}    | smallint(5) unsigned | NO   | MUL | NULL    |       |
| resource_id            | int(11)              | NO   | MUL | NULL    |       |
| core_time_available    | bigint(42)           | NO   |     | NULL    |       |
+------------------------+----------------------+------+-----+---------+-------+

Contributor Author

Whoops. Yeah, you're right. This shouldn't be in common/hpcdb and it shouldn't be in mod_hpcdb. I'll change it so the table goes into modw_aggregates.

@plessbd (Contributor) commented Mar 27, 2020

I think we should rename this from resourcespecs to something else. It is a bit confusing since we already have a resourcespecs in modw that does something a little bit different.

@plessbd (Contributor) commented Mar 27, 2020

We actually might want to consider how this could either utilize, or at least be similar to, https://github.com/ubccr/xdmod-hardware, since that does node-level tracking over time as well...

@jtpalmer jtpalmer dismissed their stale review March 31, 2020 11:07

Requested changes were made.

docs/cloud.md (outdated)
@@ -11,6 +11,8 @@ The Cloud realm in XDMoD tracks events that occur in cloud infrastructure system
- The average amount of root volume disk space (in bytes) reserved by running sessions, weighted by wall hours.
- Average Wall Hours per Session
- The average wall time that a session was running, in hours.
- Core Hour Utilization: %
- A percentage that shows how many core hours were reserved over a time period against how many core hours a resource had available during that time period.
Member

Can you add a bit more explanation as to what is meant by 'reserved'. This has multiple different meanings so it behoves us to be very precise with the definition.

Contributor Author

@jpwhite4 I updated the description to remove reserved and hopefully make it a little clearer what the statistic is measuring.

"year": "${:YEAR_VALUE}",
"${AGGREGATION_UNIT}": "${:PERIOD_VALUE}",
"resource_id": "crs.resource_id",
"core_time_available": "SUM(((IF(crs.end_day_id <= ${:PERIOD_END_DAY_ID}, crs.end_day_id, ${:PERIOD_END_DAY_ID}) - IF(crs.start_day_id >= ${:PERIOD_START_DAY_ID}, crs.start_day_id, ${:PERIOD_START_DAY_ID})) + 1) * crs.vcpus * 24 * 3600)"
Member

Please can you provide a bit of explanation about why this IF statement is needed. I was not expecting it to be here.

Member

In particular when this code is computing the year table how does this IF statement compute the core_time_available if the resource specifications change during the year?

Member

Ok. I think I understand how it works now. You cannot multiply by 24 since not all days have 24 hours in them. You need to use the seconds value from the time table to work out how many seconds there are in a given day.

Member

SELECT * FROM modw.days where seconds != 86400

basically the days when daylight savings begin and when they end (for timezones that have daylight savings).
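A sketch of the suggested correction, assuming modw.days exposes a day_id alongside its seconds column and reusing the period macros and clamped day range from the ETL definition above (table and column names follow this PR but should be treated as illustrative):

-- Illustrative only: sum the real seconds in each day of the clamped
-- specification range rather than assuming 24 * 3600 seconds per day.
SELECT
    crs.resource_id,
    SUM(d.seconds * crs.vcpus) AS core_time_available
FROM modw_cloud.cloud_resource_specs crs
JOIN modw.days d
    ON d.day_id BETWEEN GREATEST(crs.start_day_id, ${:PERIOD_START_DAY_ID})
                    AND LEAST(crs.end_day_id, ${:PERIOD_END_DAY_ID})
GROUP BY crs.resource_id;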

},
"aggregation_period_query": {
"overseer_restrictions": {
"include_only_resource_codes": "resource_id IN ${VALUE}",
Member

I think it is also necessary to add the time range restriction information here too:

     "overseer_restrictions": {
            "last_modified_start_date": "last_modified >= ${VALUE}",
            "last_modified_end_date": "last_modified <= ${VALUE}",

@@ -44,6 +44,16 @@
"name": "end_date",
"type": "date",
"nullable": true
},
Member

you should put a last_modified column here too so that we only have to reaggregate time periods that need reaggregation.
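A minimal sketch of such a column, assuming it is added to the cloud resource specification table used elsewhere in this PR (the exact table depends on the definition being reviewed here):

-- Illustrative only: track when a row last changed so reaggregation can be
-- limited to affected time periods via the overseer_restrictions above.
ALTER TABLE modw_cloud.cloud_resource_specs
    ADD COLUMN last_modified timestamp NOT NULL
        DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP;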

"start_date": "IF(r1.memory_mb = -1 AND r1.vcpus = -1, r2.fact_date, r1.fact_date)",
"end_date": "CASE WHEN MIN(r2.fact_date) IS NOT NULL AND (r1.memory_mb != -1 AND r1.vcpus != -1) THEN MIN(r2.fact_date) - INTERVAL 1 DAY WHEN r1.memory_mb = -1 AND r1.vcpus = -1 AND (SELECT MAX(fact_date) FROM modw_cloud.raw_resource_specs) != MAX(r2.fact_date) THEN MAX(r1.fact_date) - INTERVAL 1 DAY ELSE CURDATE() END",
"start_day_id": "IF(r1.memory_mb = -1 AND r1.vcpus = -1, YEAR(r2.fact_date) * 100000 + DAYOFYEAR(r2.fact_date), YEAR(r1.fact_date) * 100000 + DAYOFYEAR(r1.fact_date))",
"end_day_id": "CASE WHEN MIN(r2.fact_date) IS NOT NULL AND (r1.memory_mb != -1 AND r1.vcpus != -1) THEN YEAR(MIN(r2.fact_date) - INTERVAL 1 DAY) * 100000 + DAYOFYEAR(MIN(r2.fact_date) - INTERVAL 1 DAY) WHEN r1.memory_mb = -1 AND r1.vcpus = -1 AND (SELECT MAX(fact_date) FROM modw_cloud.raw_resource_specs) != MAX(r2.fact_date) THEN YEAR(MAX(r1.fact_date) - INTERVAL 1 DAY) * 100000 + DAYOFYEAR(MAX(r1.fact_date) - INTERVAL 1 DAY) ELSE YEAR(CURDATE()) * 100000 + DAYOFYEAR(CURDATE()) END"
Member

The fact_date column is defined as not null in the definition below. Therefore this "is not null" condition is unnecessary. Please double-check this function and correct if necessary.

-- This sql statement inserts -1 values for the memory_mb and vcpus for a day that a compute node has been
-- removed from the most recently ingested resource specifications file. The -1 helps when setting start and
-- end times of a cpu and memory configuration for a compute node.
INSERT INTO modw_cloud.raw_resource_specs (hostname, resource_id, memory_mb, vcpus, fact_date) SELECT
Contributor

I prefer tabs (https://www.reddit.com/r/javascript/comments/c8drjo/nobody_talks_about_the_real_reason_to_use_tabs/), so feel free to completely ignore this.

But in all our other source, indentation is done with 4 spaces.

@jpwhite4 (Member)

There are no documentation updates in this change but the bootstrap script was edited to add a new call to the shredder.

Will the new metric "just work" or are there additional steps that need to be done to ingest/aggregate existing data? Are there any other changes that need to be made to the scripts that run on metrics-dev and metrics for this to work for us? Same question for xdmod-dev and xdmod.

@jpwhite4 (Member) left a comment

Don't forget to update the script that runs on metrics-dev so that it pulls in the cloud utilization metrics.

Labels: Category:Cloud (Cloud Realm), new feature (New functionality)
Projects: none yet
4 participants