Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Updating malachite metric for supporting model inference #368

Merged
merged 2 commits into from
Dec 11, 2023

Conversation

justadogistaken
Copy link
Contributor

@justadogistaken justadogistaken commented Nov 20, 2023

There are several things has been done in this pr:

  1. Adding NodeFeatureNames and ContainerFeatureNames to BorweinOptions. Just want @csfldf to have a look here, I just don't know what you mean by "fill it with adaption table" within NewBorweinConfiguration.
  2. Adding some rate metrics like cpu.nr.throttled.container, cpu.nr.period.container, cpu.throttled.time.container etc... There metrics are meaningless as the raw data. And they are put into model training in the form of rate.
  3. Separating cgroup into two files(cgroupv1, cgroupv2). And having them updated corresponding to response from malachite-core.
  4. Having some updates on borwein inference response, which dues to that we may have more model inference demands in the future (ClassificationUnderload is the new demand and it has been added already).

@justadogistaken justadogistaken added the workflow/merge-hold merge-hold: code is ready but still has dependency label Nov 20, 2023
@justadogistaken justadogistaken self-assigned this Nov 20, 2023
@justadogistaken justadogistaken force-pushed the optimize_metrics branch 2 times, most recently from caf470c to 18c7b8c Compare November 27, 2023 04:03
Copy link

codecov bot commented Nov 27, 2023

Codecov Report

Attention: 225 lines in your changes are missing coverage. Please review.

Comparison is base (e8c5fa6) 53.59% compared to head (66dc554) 53.57%.
Report is 5 commits behind head on main.

Files Patch % Lines
...ce/models/borwein/inferencesvc/inference_svc.pb.go 48.16% 95 Missing and 18 partials ⚠️
...server/agent/metric/malachite/fetcher_calculate.go 65.68% 23 Missing and 12 partials ⚠️
pkg/metaserver/agent/metric/malachite/fetcher.go 67.50% 20 Missing and 6 partials ⚠️
...tions/sysadvisor/qosaware/model/borwein/borwein.go 38.46% 14 Missing and 2 partials ⚠️
...in/inference/modelresultfetcher/borwein/borwein.go 57.57% 9 Missing and 5 partials ⚠️
pkg/util/general/file.go 0.00% 10 Missing ⚠️
...aware/resource/helper/modelctrl/borwein/borwein.go 74.19% 4 Missing and 4 partials ⚠️
pkg/agent/sysadvisor/plugin/inference/inference.go 0.00% 2 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #368      +/-   ##
==========================================
- Coverage   53.59%   53.57%   -0.03%     
==========================================
  Files         444      445       +1     
  Lines       49021    49349     +328     
==========================================
+ Hits        26273    26437     +164     
- Misses      19791    19916     +125     
- Partials     2957     2996      +39     
Flag Coverage Δ
unittest 53.57% <56.14%> (-0.03%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@justadogistaken justadogistaken changed the title [WIP]Updating malachite metric for supporting model inference Updating malachite metric for supporting model inference Nov 27, 2023
@justadogistaken justadogistaken added workflow/merge-ready merge-ready: code is ready and can be merged and removed workflow/merge-hold merge-hold: code is ready but still has dependency labels Nov 27, 2023
@waynepeking348
Copy link
Collaborator

@luomingmeng this pr changes too much for fundamental metric-info in meta-server, should it be tested carefully before it can be merged?

@justadogistaken justadogistaken force-pushed the optimize_metrics branch 7 times, most recently from 3597266 to 620cfa3 Compare November 30, 2023 08:43
waynepeking348
waynepeking348 previously approved these changes Nov 30, 2023
@justadogistaken justadogistaken force-pushed the optimize_metrics branch 3 times, most recently from 7e8d55b to 642f53f Compare December 5, 2023 11:46
containerInfo.PodNamespace, containerInfo.PodName, containerInfo.ContainerName)
klog.Warningf("getContainerFeatureValue for pod: %s/%s, container: %s failed, err: %v",
containerInfo.PodNamespace, containerInfo.PodName, containerInfo.ContainerName, err)
goto IgnoreThisContainer
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is someone against using "goto"?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

already got some goto, don't mind

c.InferenceServiceSocketAbsPath = o.InferenceServiceSocketAbsPath
c.NodeFeatureNames = o.NodeFeatureNames
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NewBorweinConfiguration has some todo comments, pls delete them

@@ -31,4 +31,5 @@ type BorweinParameter struct {

// the first key is the pod UID
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add prefix for this comments

if respContainersCnt != len(requestContainers) {
return nil, fmt.Errorf("count of resp containers: %d and request containers: %d are not same",
klog.Warningf("count of resp containers: %d and request containers: %d are not same",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If respContainersCnt != len(requestContainers), the possibility of abnormal state may increase since denominator may decrease; so we still return error here.

And if some containers lack essential features, we also inference it forcibly.

}

// AddFlags adds flags to the specified FlagSet.
func (o *BorweinOptions) AddFlags(fs *pflag.FlagSet) {
fs.StringVar(&o.InferenceServiceSocketAbsPath, "borwein-inference-svc-socket-path", o.InferenceServiceSocketAbsPath,
"socket path which borwein inference server listens at")
fs.StringSliceVar(&o.NodeFeatureNames, "borwein-node-feature-names", o.NodeFeatureNames,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

after discussion, we use config file to configure NodeFeatureNames and ContainerFeatureNames.

if err != nil {
return nil, fmt.Errorf("getContainerFeatureValue for pod: %s/%s, container: %s failed",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

after discussion,

  1. getContainerFeatureValue should be refined to return nil err and zero value when input feature isn't a key feature
  2. if err != nil, we just return err here to avoid meaningless request to inference sever

@justadogistaken justadogistaken force-pushed the optimize_metrics branch 2 times, most recently from 85c4ec2 to bc38275 Compare December 7, 2023 05:52
@waynepeking348 waynepeking348 merged commit 6237b55 into kubewharf:main Dec 11, 2023
10 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
workflow/merge-ready merge-ready: code is ready and can be merged
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants