Add cloud core hour utilization statistic #1242

Merged · 32 commits · Jun 29, 2020

Conversation

eiffel777 (Contributor)

This PR adds an aggregation of resource specifications for a resource, a group by for Cloud Core Hours utilization, and changes how the date range for a compute node's memory and CPU specifications is determined.

An aggregate table is created in modw_aggregates that lists the aggregation time period, resource id, and core time available in seconds for that time period. Cloud aggregation selects the core time available from the appropriate aggregation table and uses it in the statistic calculation.

The date range for the amount of memory and CPUs on a compute node is now determined using two SQL statements instead of the StateReconstructorTransformIngestor. This new approach only needs resource specifications on days where the specifications change, instead of every day, and results in less code. It works by using a self join against any rows with a later fact_date that have a different amount of memory or CPUs; the minimum date of the joined rows is the end date for that specification (see the sketch below).
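A minimal sketch of that self join, assuming the raw_resource_specs columns that appear later in this PR (hostname, resource_id, memory_mb, vcpus, fact_date) and ignoring the -1 sentinel rows discussed later; the real logic lives in the ETL JSON definitions:

-- For each specification row, the earliest later row with a different
-- memory or vcpus value marks the day after this specification ends.
SELECT
    r1.resource_id,
    r1.hostname,
    r1.fact_date AS start_date,
    MIN(r2.fact_date) - INTERVAL 1 DAY AS end_date  -- NULL when the spec is still current
FROM modw_cloud.raw_resource_specs r1
LEFT JOIN modw_cloud.raw_resource_specs r2
    ON r2.resource_id = r1.resource_id
    AND r2.hostname = r1.hostname
    AND r2.fact_date > r1.fact_date
    AND (r2.memory_mb != r1.memory_mb OR r2.vcpus != r1.vcpus)
GROUP BY r1.resource_id, r1.hostname, r1.fact_date;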

Regression tests have been added for the group by, and the component test for the StateReconstructorTransformIngestor has been removed since the StateReconstructorTransformIngestor itself has been removed.

Tests performed

Tested in Docker and on metrics-dev. The metrics-dev link is https://metrics-dev.ccr.buffalo.edu:9008

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My code follows the code style of this project as found in the CONTRIBUTING document.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

@eiffel777 eiffel777 added the Category:Cloud (Cloud Realm) and new feature (New functionality) labels Mar 3, 2020
@eiffel777 eiffel777 added this to the 9.0.0 milestone Mar 3, 2020
@eiffel777 eiffel777 self-assigned this Mar 3, 2020
public function __construct($query_instance = null)
{

$sql = 'COALESCE((SUM(jf.core_time) / SUM(DISTINCT jf.core_time_available)) * 100, 0)';
Member

This is not the correct way to calculate utilization. This does not handle the case where there are multiple cloud resources in the database: SUM(DISTINCT jf.core_time_available) collapses duplicate values, so two resources with identical core_time_available would be counted only once.

@jpwhite4 (Member) Mar 3, 2020

The utilization statistic can be calculated in much the same way as the one in the Jobs realm (except, of course, joining against a table that contains the correct resource availability data). You need a correlated subquery to pull in the correct core_time_available for each resource in the tables. The aggregate query and timeseries queries are similar but not identical.

An example of the sql for a timeseries query is below:

SELECT 
    jf.day_id,
    jf.host_resource_id,
    SUM(jf.core_time) / (SELECT 
            SUM(rf.core_time_available)
        FROM
            modw_aggregates.resourcespecsfact_by_day rf
        WHERE
            rf.day_id = jf.day_id
                AND FIND_IN_SET(rf.resource_id,
                    GROUP_CONCAT(DISTINCT jf.host_resource_id)) <> 0) * 100.0 AS utilization,
    SUM(jf.core_time) AS core_time,
    (SELECT 
            SUM(rf.core_time_available)
        FROM
            modw_aggregates.resourcespecsfact_by_day rf
        WHERE
            rf.day_id = jf.day_id
                AND FIND_IN_SET(rf.resource_id,
                    GROUP_CONCAT(DISTINCT jf.host_resource_id)) <> 0) AS total
FROM
    modw_cloud.cloudfact_by_day jf
WHERE
    jf.day_id BETWEEN 202000001 AND 202000006
GROUP BY jf.day_id , jf.host_resource_id;

The important bit is the join on the days table to pull in the data for the associated day and the FIND_IN_SET() to get the rows for the corresponding resource.
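For reference, the day_id values in these queries encode the year and day of year; the same YEAR * 100000 + DAYOFYEAR expression appears in the ETL definitions later in this PR:

-- day_id encodes year and day-of-year: 2020-01-06 -> 202000006,
-- so the BETWEEN above covers January 1-6, 2020.
SELECT YEAR('2020-01-06') * 100000 + DAYOFYEAR('2020-01-06') AS day_id;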

The aggregate query is similar to what you have already, but you have to use a where condition to limit to the appropriate resource:

SELECT 
    SUM(jf.core_time) / (SELECT 
            SUM(rf.core_time_available)
        FROM
            modw_aggregates.resourcespecsfact_by_day rf
        WHERE
            rf.day_id BETWEEN 202000001 AND 202000006
                AND FIND_IN_SET(rf.resource_id,
                    GROUP_CONCAT(DISTINCT jf.host_resource_id)) <> 0) * 100.0 AS utilization,
    SUM(jf.core_time) AS core_time,
    (SELECT 
            SUM(rf.core_time_available)
        FROM
            modw_aggregates.resourcespecsfact_by_day rf
        WHERE
            rf.day_id BETWEEN 202000001 AND 202000006
                AND FIND_IN_SET(rf.resource_id,
                    GROUP_CONCAT(DISTINCT jf.host_resource_id)) <> 0) AS total
FROM
    modw_cloud.cloudfact_by_day jf
WHERE
    jf.day_id BETWEEN 202000001 AND 202000006;

There is no need to add a new column to the cloudfact_ tables.

@@ -45,7 +45,8 @@
"wallduration": "COALESCE(SUM(${wallduration_case_statement}), 0)",
"submission_venue_id": "sr.submission_venue_id",
"domain_id": "sr.domain_id",
"service_provider": "sr.service_provider"
"service_provider": "sr.service_provider",
"core_time_available": "COALESCE(rsa.core_time_available, 0)"
Member

I don't think that there is a need to add this column to this table. Typically the values stored in the rows correspond to the contribution to the overall value for the given row, whereas the value of core_time_available is the total amount available for the resource for that row.

},
{
"name": "vcpus",
"type": "int(5)",
"nullable": false,
"default": null
"default": null,
"comments": "Number of vcpus available on the associated node."
Member

You should explain what a vcpu is here too.


"indexes": [
{
"name": "index_resource",
Member

Did you do any performance testing to see which (if any) of these indexes are used? I suppose this will change when the utilization calculation is fixed. I suspect that you may want a combined index on (resource_id, period_id) or (period_id, resource_id) for the best query performance.
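A hypothetical combined index along those lines for the daily table (the index name is illustrative):

-- Illustrative only: one composite index serving the per-day, per-resource
-- lookups in the utilization subqueries above.
ALTER TABLE modw_aggregates.resourcespecsfact_by_day
    ADD INDEX index_period_resource (day_id, resource_id);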

@jpwhite4 (Member) commented Mar 3, 2020

A state reconstructor design also only needs to see the resource data when it changes and not every day. Why would the state reconstructor need to see data every day even if it didn't change?

@@ -72,6 +72,7 @@ then
then
sudo -u xdmod xdmod-shredder -r openstack -d $REF_DIR/openstack -f openstack
sudo -u xdmod xdmod-shredder -r nutsetters -d $REF_DIR/nutsetters -f openstack
+sudo -u xdmod php /data/xdmod/tools/etl/etl_overseer.php -p ingest-cloud-resource-specs -d "CLOUD_RESOURCE_SPECS_DIRECTORY=$REF_DIR/openstack_resource_specs" -r openstack -v debug
Member

Presumably this is intentionally not integrated into the xdmod-shredder? You'll need to add documentation to explain how and when to use the command and include examples.

Member

Also what happens if the user forgets to run this overseer command?

@plessbd (Contributor) commented Mar 3, 2020

The change of the StateReconstructorTransformIngestor to SQL instead of PHP seems like it should be in a separate pull request.

tests/ci/bootstrap.sh (outdated, resolved)
@sonarcloud bot commented Mar 4, 2020

SonarCloud Quality Gate failed.

  • Bugs: A (0)
  • Vulnerabilities: A (0, with 0 Security Hotspots to review)
  • Code Smells: A (0)
  • Coverage: no coverage information
  • Duplication: 6.9%

@@ -44,6 +44,9 @@
"cloud_common/asset.json",
"cloud_common/instance_data.json",
"cloud_common/event_asset.json",
"cloud_common/raw_resource_specs.json",
"cloud_common/cloud_resource_specs.json",
"common/hpcdb/resourcespecsfact_by_.json",
Contributor

Why is this file in common/hpcdb? Is this table intended to be in mod_hpcdb? It's using the endpoint for modw_cloud, so that isn't happening.

I don't think it should be in mod_hpcdb. That schema is intended for normalized (non-data-warehouse) data. If this is cloud specific it should probably be in modw_cloud, and if it's for other realms as well then it should be in modw_aggregates.

Looking more closely, I don't think you want this in ManageTables at all, since that is creating a table literally named resourcespecsfact_by_:

MariaDB [modw_cloud]> DESC resourcespecsfact_by_;
+------------------------+----------------------+------+-----+---------+-------+
| Field                  | Type                 | Null | Key | Default | Extra |
+------------------------+----------------------+------+-----+---------+-------+
| ${aggregation_unit}_id | int(10) unsigned     | NO   | MUL | NULL    |       |
| year                   | smallint(5) unsigned | NO   |     | NULL    |       |
| ${aggregation_unit}    | smallint(5) unsigned | NO   | MUL | NULL    |       |
| resource_id            | int(11)              | NO   | MUL | NULL    |       |
| core_time_available    | bigint(42)           | NO   |     | NULL    |       |
+------------------------+----------------------+------+-----+---------+-------+

Contributor Author

Whoops. Yeah, you're right. This shouldn't be in common/hpcdb and it shouldn't be in mod_hpcdb. I'll change it so the table goes into modw_aggregates.

@plessbd (Contributor) commented Mar 27, 2020

I think we should rename this from resourcespecs to something else. It is a bit confusing since we already have a resourcespecs in modw that does something a little bit different.

@plessbd (Contributor) commented Mar 27, 2020

We actually might want to consider how this could either utilize, or at least be similar to, https://github.com/ubccr/xdmod-hardware, since that does node-level tracking over time as well...

@jtpalmer jtpalmer dismissed their stale review March 31, 2020 11:07

Requested changes were made.

docs/cloud.md (outdated)
@@ -11,6 +11,8 @@ The Cloud realm in XDMoD tracks events that occur in cloud infrastructure system
- The average amount of root volume disk space (in bytes) reserved by running sessions, weighted by wall hours.
- Average Wall Hours per Session
- The average wall time that a session was running, in hours.
- Core Hour Utilization: %
- A percentage that shows how many core hours were reserved over a time period against how many core hours a resource had available during that time period.
Member

Can you add a bit more explanation as to what is meant by 'reserved'. This has multiple different meanings so it behoves us to be very precise with the definition.

Contributor Author

@jpwhite4 I updated the description to remove reserved and hopefully make it a little clearer what the statistic is measuring.

"year": "${:YEAR_VALUE}",
"${AGGREGATION_UNIT}": "${:PERIOD_VALUE}",
"resource_id": "crs.resource_id",
"core_time_available": "SUM(((IF(crs.end_day_id <= ${:PERIOD_END_DAY_ID}, crs.end_day_id, ${:PERIOD_END_DAY_ID}) - IF(crs.start_day_id >= ${:PERIOD_START_DAY_ID}, crs.start_day_id, ${:PERIOD_START_DAY_ID})) + 1) * crs.vcpus * 24 * 3600)"
Member

Please can you provide a bit of explanation about why this IF statement is needed. I was not expecting it to be here.

Member

In particular when this code is computing the year table how does this IF statement compute the core_time_available if the resource specifications change during the year?

Member

Ok. I think I understand how it works now. You cannot multiply by 24 since not all days have 24 hours in them. You need to use the seconds value from the time table to work out how many seconds there are in a given day.

Member

SELECT * FROM modw.days where seconds != 86400

basically the days when daylight savings begin and when they end (for timezones that have daylight savings).
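A sketch of the suggested correction, assuming modw.days exposes a day_id alongside its seconds column and reusing the period macros and clamped day range from the ETL definition above (table and column names follow this PR but should be treated as illustrative):

-- Illustrative only: sum the real seconds in each day of the clamped
-- specification range rather than assuming 24 * 3600 seconds per day.
SELECT
    crs.resource_id,
    SUM(d.seconds * crs.vcpus) AS core_time_available
FROM modw_cloud.cloud_resource_specs crs
JOIN modw.days d
    ON d.day_id BETWEEN GREATEST(crs.start_day_id, ${:PERIOD_START_DAY_ID})
                    AND LEAST(crs.end_day_id, ${:PERIOD_END_DAY_ID})
GROUP BY crs.resource_id;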

},
"aggregation_period_query": {
"overseer_restrictions": {
"include_only_resource_codes": "resource_id IN ${VALUE}",
Member

I think it is also necessary to add the time range restriction information here too:

     "overseer_restrictions": {
            "last_modified_start_date": "last_modified >= ${VALUE}",
            "last_modified_end_date": "last_modified <= ${VALUE}",

@@ -44,6 +44,16 @@
"name": "end_date",
"type": "date",
"nullable": true
},
Member

you should put a last_modified column here too so that we only have to reaggregate time periods that need reaggregation.
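A minimal sketch of such a column, assuming it is added to the cloud resource specification table used elsewhere in this PR (the exact table depends on the definition being reviewed here):

-- Illustrative only: track when a row last changed so reaggregation can be
-- limited to affected time periods via the overseer_restrictions above.
ALTER TABLE modw_cloud.cloud_resource_specs
    ADD COLUMN last_modified timestamp NOT NULL
        DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP;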

"start_date": "IF(r1.memory_mb = -1 AND r1.vcpus = -1, r2.fact_date, r1.fact_date)",
"end_date": "CASE WHEN MIN(r2.fact_date) IS NOT NULL AND (r1.memory_mb != -1 AND r1.vcpus != -1) THEN MIN(r2.fact_date) - INTERVAL 1 DAY WHEN r1.memory_mb = -1 AND r1.vcpus = -1 AND (SELECT MAX(fact_date) FROM modw_cloud.raw_resource_specs) != MAX(r2.fact_date) THEN MAX(r1.fact_date) - INTERVAL 1 DAY ELSE CURDATE() END",
"start_day_id": "IF(r1.memory_mb = -1 AND r1.vcpus = -1, YEAR(r2.fact_date) * 100000 + DAYOFYEAR(r2.fact_date), YEAR(r1.fact_date) * 100000 + DAYOFYEAR(r1.fact_date))",
"end_day_id": "CASE WHEN MIN(r2.fact_date) IS NOT NULL AND (r1.memory_mb != -1 AND r1.vcpus != -1) THEN YEAR(MIN(r2.fact_date) - INTERVAL 1 DAY) * 100000 + DAYOFYEAR(MIN(r2.fact_date) - INTERVAL 1 DAY) WHEN r1.memory_mb = -1 AND r1.vcpus = -1 AND (SELECT MAX(fact_date) FROM modw_cloud.raw_resource_specs) != MAX(r2.fact_date) THEN YEAR(MAX(r1.fact_date) - INTERVAL 1 DAY) * 100000 + DAYOFYEAR(MAX(r1.fact_date) - INTERVAL 1 DAY) ELSE YEAR(CURDATE()) * 100000 + DAYOFYEAR(CURDATE()) END"
Member

The fact_date column is defined as not null in the definition below. Therefore this "is not null" condition is unnecessary. Please double-check this function and correct if necessary.

-- This sql statement inserts -1 values for the memory_mb and vcpus for a day that a compute node has been
-- removed from the most recently ingested resource specifications file. The -1 helps when setting start and
-- end times of a cpu and memory configuration for a compute node.
INSERT INTO modw_cloud.raw_resource_specs (hostname, resource_id, memory_mb, vcpus, fact_date) SELECT
Contributor

I prefer tabs (https://www.reddit.com/r/javascript/comments/c8drjo/nobody_talks_about_the_real_reason_to_use_tabs/), so feel free to completely ignore this.

But in all our other source, indentation is done with 4 spaces.

@jpwhite4 (Member)

There are no documentation updates in this change but the bootstrap script was edited to add a new call to the shredder.

Will the new metric "just work" or are there additional steps that need to be done to ingest/aggregate existing data? Are there any other changes that need to be made to the scripts that run on metrics-dev and metrics for this to work for us? Same question for xdmod-dev and xdmod.

@jpwhite4 (Member) left a comment

Don't forget to update the script that runs on metrics-dev so that it pulls in the cloud utilization metrics.

Labels: Category:Cloud (Cloud Realm), new feature (New functionality)
Projects: none yet
4 participants