Skip to content

Commit

Permalink
Fix PCS/TRS resource leaks and scaling issues
Browse files Browse the repository at this point in the history
CASMHMS-6299
  • Loading branch information
jwlv authored Nov 25, 2024
1 parent f8f4196 commit 5ddb830
Show file tree
Hide file tree
Showing 55 changed files with 1,967 additions and 849 deletions.
2 changes: 1 addition & 1 deletion .version
Original file line number Diff line number Diff line change
@@ -1 +1 @@
2.5.0
2.6.0
22 changes: 22 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -5,9 +5,31 @@ All notable changes to this project will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [2.6.0] - 2024-11-25

### Fixed

- Updated hms-trs-app-api vendor code (bug fixes and enhancements)
- Passed PCS's log level through to TRS to match PCS's
- Configured TRS to use connection pools for status requests to BMCs
- Renamed PCS_STATUS_HTTP_TIMEOUT to PCS_STATUS_TIMEOUT as it is not an
http timeout
- Added PCS_MAX_IDLE_CONNS and PCS_MAX_IDLE_CONNS_PER_HOST env variables
which allow overriding PCS's connection pool settings in TRS
- Added PCS_BASE_TRS_TASK_TIMEOUT env variable which allows the timeout
for power transitions and capping to be configured
- The above variables are configurable on PCS's helm chart
- At PCS start time, log all env variables that were set
- Added PodName global to facilitate easier debug of log messages
- Log start and end of large batched requests to BMCs
- Fixed many resource leaks associated with making http requests and using TRS
- Update required version of Go to 1.23 to avoid
https://github.com/golang/go/issues/59017

## [2.5.0] - 2024-10-25

### Changed
### Fixed

- Added ability to configure http status timeout and retries with
PCS_STATUS_HTTP_TIMEOUT and PCS_STATUS_HTTP_RETRIES env variables
Expand Down
6 changes: 3 additions & 3 deletions Dockerfile
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
# MIT License
#
# (C) Copyright [2020-2023] Hewlett Packard Enterprise Development LP
# (C) Copyright [2020-2024] Hewlett Packard Enterprise Development LP
#
# Permission is hereby granted, free of charge, to any person obtaining a
# copy of this software and associated documentation files (the "Software"),
Expand All @@ -23,7 +23,7 @@
# Dockerfile for building hms-power-control.

# Build base just has the packages installed we need.
FROM artifactory.algol60.net/docker.io/library/golang:1.17-alpine AS build-base
FROM artifactory.algol60.net/docker.io/library/golang:1.23-alpine AS build-base

RUN set -ex \
&& apk -U upgrade \
Expand All @@ -43,7 +43,7 @@ COPY .version $GOPATH/src/github.com/Cray-HPE/hms-power-control/.version
### Build Stage ###
FROM base AS builder

RUN set -ex && go build -v -i -tags musl -o /usr/local/bin/hms-power-control github.com/Cray-HPE/hms-power-control/cmd/hms-power-control
RUN set -ex && go build -v -tags musl -o /usr/local/bin/hms-power-control github.com/Cray-HPE/hms-power-control/cmd/hms-power-control

### Final Stage ###

Expand Down
6 changes: 3 additions & 3 deletions Dockerfile.integration.Dockerfile
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
# MIT License
#
# (C) Copyright [2022-2023] Hewlett Packard Enterprise Development LP
# (C) Copyright [2022-2024] Hewlett Packard Enterprise Development LP
#
# Permission is hereby granted, free of charge, to any person obtaining a
# copy of this software and associated documentation files (the "Software"),
Expand All @@ -23,7 +23,7 @@
# Dockerfile for building hms-power-control.

# Build base just has the packages installed we need.
FROM artifactory.algol60.net/docker.io/library/golang:1.17-alpine AS build-base
FROM artifactory.algol60.net/docker.io/library/golang:1.23-alpine AS build-base

RUN set -ex \
&& apk -U upgrade \
Expand All @@ -43,7 +43,7 @@ COPY .version $GOPATH/src/github.com/Cray-HPE/hms-power-control/.version
### Build Stage ###
FROM base AS builder

RUN set -ex && go build -v -i -tags musl -o /usr/local/bin/hms-power-control github.com/Cray-HPE/hms-power-control/cmd/hms-power-control
RUN set -ex && go build -v -tags musl -o /usr/local/bin/hms-power-control github.com/Cray-HPE/hms-power-control/cmd/hms-power-control

### Final Stage ###

Expand Down
4 changes: 2 additions & 2 deletions Dockerfile.test.unit.Dockerfile
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
# MIT License
#
# (C) Copyright [2021-2023] Hewlett Packard Enterprise Development LP
# (C) Copyright [2021-2024] Hewlett Packard Enterprise Development LP
#
# Permission is hereby granted, free of charge, to any person obtaining a
# copy of this software and associated documentation files (the "Software"),
Expand All @@ -22,7 +22,7 @@

# This file only exists as a means to run tests in an automated fashion.

FROM artifactory.algol60.net/docker.io/library/golang:1.17-alpine
FROM artifactory.algol60.net/docker.io/library/golang:1.23-alpine

RUN set -ex \
&& apk -U upgrade \
Expand Down
85 changes: 73 additions & 12 deletions cmd/hms-power-control/main.go
Original file line number Diff line number Diff line change
Expand Up @@ -34,15 +34,15 @@ import (
"syscall"
"time"

"github.com/Cray-HPE/hms-base"
base "github.com/Cray-HPE/hms-base"
"github.com/Cray-HPE/hms-certs/pkg/hms_certs"
"github.com/Cray-HPE/hms-power-control/internal/api"
"github.com/Cray-HPE/hms-power-control/internal/credstore"
"github.com/Cray-HPE/hms-power-control/internal/domain"
"github.com/Cray-HPE/hms-power-control/internal/hsm"
"github.com/Cray-HPE/hms-power-control/internal/logger"
"github.com/Cray-HPE/hms-power-control/internal/storage"
trsapi "github.com/Cray-HPE/hms-trs-app-api/v2/pkg/trs_http_api"
trsapi "github.com/Cray-HPE/hms-trs-app-api/v3/pkg/trs_http_api"
"github.com/namsral/flag"
"github.com/sirupsen/logrus"
)
Expand Down Expand Up @@ -140,23 +140,45 @@ func main() {
//CONFIGURATION
//////////////////////////////

baseTrsTaskTimeout := 40

var envstr string

envstr = os.Getenv("PCS_BASE_TRS_TASK_TIMEOUT")
if (envstr != "") {
tps,err := strconv.Atoi(envstr)
if (err != nil) {
logger.Log.Errorf("Invalid value of PCS_BASE_TRS_TASK_TIMEOUT, defaulting to %d",
baseTrsTaskTimeout)
} else {
logger.Log.Infof("Using PCS_BASE_TRS_TASK_TIMEOUT: %v", tps)
baseTrsTaskTimeout = tps
}
}

var BaseTRSTask trsapi.HttpTask
BaseTRSTask.ServiceName = serviceName
BaseTRSTask.Timeout = 40 * time.Second
BaseTRSTask.Timeout = time.Duration(baseTrsTaskTimeout) * time.Second
BaseTRSTask.Request, _ = http.NewRequest("GET", "", nil)
BaseTRSTask.Request.Header.Set("Content-Type", "application/json")
BaseTRSTask.Request.Header.Add("HMS-Service", BaseTRSTask.ServiceName)

// TODO: We could convert BaseTRSTask to a new connection pool aware TRS
// client, or create a new ConnPoolTRSTask. That all users (status,
// power cap, and transition) could easily share a single client
// definition. Not really necessary though until/if we decide to add
// non-default connection pool support to the power cap and transition
// paths.

//INITIALIZE TRS

trsLogger := logrus.New()
trsLogger.SetFormatter(&logrus.TextFormatter{
FullTimestamp: true,
})
trsLogger.SetLevel(logrus.ErrorLevel) //It could be set to logger.Log.GetLevel()
trsLogger.SetLevel(logger.Log.GetLevel())
trsLogger.SetReportCaller(true)

var envstr string
envstr = os.Getenv("TRS_IMPLEMENTATION")

if envstr == "REMOTE" {
Expand All @@ -166,19 +188,22 @@ func main() {
workerInsec.Logger = trsLogger
TLOC_rf = workerSec
TLOC_svc = workerInsec
logger.Log.Infof("Using TRS_IMPLEMENTATION: REMOTE")
} else {
workerSec := &trsapi.TRSHTTPLocal{}
workerSec.Logger = trsLogger
workerInsec := &trsapi.TRSHTTPLocal{}
workerInsec.Logger = trsLogger
TLOC_rf = workerSec
TLOC_svc = workerInsec
logger.Log.Infof("Using TRS_IMPLEMENTATION: LOCAL")
}

//Set up TRS TLOCs and HTTP clients, all insecure to start with

envstr = os.Getenv("PCS_CA_URI")
if envstr != "" {
logger.Log.Infof("Using PCS_CA_URI: %s", envstr)
caURI = envstr
}
//These are for debugging/testing
Expand Down Expand Up @@ -259,11 +284,17 @@ func main() {
CS.Init(&credStoreGlob)
}

// Capture hostname, which is the name of the pod
podName, err := os.Hostname()
if err != nil {
podName = "unknown_pod_name"
}

//DOMAIN CONFIGURATION
var domainGlobals domain.DOMAIN_GLOBALS
domainGlobals.NewGlobals(&BaseTRSTask, &TLOC_rf, &TLOC_svc, rfClient, svcClient,
rfClientLock, &Running, &DSP, &HSM, VaultEnabled,
&CS, &DLOCK, maxNumCompleted, expireTimeMins)
&CS, &DLOCK, maxNumCompleted, expireTimeMins, podName)

//Wait for vault PKI to respond for CA bundle. Once this happens, re-do
//the globals. This goroutine will run forever checking if the CA trust
Expand Down Expand Up @@ -344,15 +375,19 @@ func main() {

dlockTimeout := 60
pwrSampleInterval := 30
statusHttpTimeout := 30
statusTimeout := 30
statusHttpRetries := 3
maxIdleConns := 4000
maxIdleConnsPerHost := 4 // 4000 / 4 = 4 open conns for each of 1000 BMCs

envstr = os.Getenv("PCS_POWER_SAMPLE_INTERVAL")
if (envstr != "") {
tps,err := strconv.Atoi(envstr)
if (err != nil) {
logger.Log.Errorf("Invalid value of PCS_POWER_SAMPLE_INTERVAL, defaulting to %d",
pwrSampleInterval)
} else {
logger.Log.Infof("Using PCS_POWER_SAMPLE_INTERVAL: %v", tps)
pwrSampleInterval = tps
}
}
Expand All @@ -363,17 +398,19 @@ func main() {
logger.Log.Errorf("Invalid value of PCS_DISTLOCK_TIMEOUT, defaulting to %d",
dlockTimeout)
} else {
logger.Log.Infof("Using PCS_DISTLOCK_TIMEOUT: %v", tps)
dlockTimeout = tps
}
}
envstr = os.Getenv("PCS_STATUS_HTTP_TIMEOUT")
envstr = os.Getenv("PCS_STATUS_TIMEOUT")
if (envstr != "") {
tps,err := strconv.Atoi(envstr)
if (err != nil) {
logger.Log.Errorf("Invalid value of PCS_STATUS_HTTP_TIMEOUT, defaulting to %d",
statusHttpTimeout)
logger.Log.Errorf("Invalid value of PCS_STATUS_TIMEOUT, defaulting to %d",
statusTimeout)
} else {
statusHttpTimeout = tps
logger.Log.Infof("Using PCS_STATUS_TIMEOUT: %v", tps)
statusTimeout = tps
}
}
envstr = os.Getenv("PCS_STATUS_HTTP_RETRIES")
Expand All @@ -383,13 +420,37 @@ func main() {
logger.Log.Errorf("Invalid value of PCS_STATUS_HTTP_RETRIES, defaulting to %d",
statusHttpRetries)
} else {
logger.Log.Infof("Using PCS_STATUS_HTTP_RETRIES: %v", tps)
statusHttpRetries = tps
}
}
envstr = os.Getenv("PCS_MAX_IDLE_CONNS")
if (envstr != "") {
tps,err := strconv.Atoi(envstr)
if (err != nil) {
logger.Log.Errorf("Invalid value of PCS_MAX_IDLE_CONNS, defaulting to %d",
maxIdleConns)
} else {
logger.Log.Infof("Using PCS_MAX_IDLE_CONNS: %v", tps)
maxIdleConns = tps
}
}
envstr = os.Getenv("PCS_MAX_IDLE_CONNS_PER_HOST")
if (envstr != "") {
tps,err := strconv.Atoi(envstr)
if (err != nil) {
logger.Log.Errorf("Invalid value of PCS_MAX_IDLE_CONNS_PER_HOST, defaulting to %d",
maxIdleConnsPerHost)
} else {
logger.Log.Infof("Using PCS_MAX_IDLE_CONNS_PER_HOST: %v", tps)
maxIdleConnsPerHost = tps
}
}

domain.PowerStatusMonitorInit(&domainGlobals,
(time.Duration(dlockTimeout)*time.Second),
logger.Log,(time.Duration(pwrSampleInterval)*time.Second), statusHttpTimeout, statusHttpRetries)
logger.Log,(time.Duration(pwrSampleInterval)*time.Second),
statusTimeout, statusHttpRetries, maxIdleConns, maxIdleConnsPerHost)

domain.StartRecordsReaper()

Expand Down
2 changes: 1 addition & 1 deletion docker-compose.developer.full.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -158,7 +158,7 @@ services:
# Emulator
#
emulator-loader:
image: artifactory.algol60.net/docker.io/library/golang:1.17-alpine
image: artifactory.algol60.net/docker.io/library/golang:1.23-alpine
command: >
sh -c "apk add curl && sleep 10 &&
curl -X POST -d '{\"RedfishEndpoints\":[{
Expand Down
2 changes: 1 addition & 1 deletion docker-compose.test.ct.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -137,7 +137,7 @@ services:
# that fit this criteria. Otherwise test failures may occur.
#
emulator-loader:
image: artifactory.algol60.net/docker.io/library/golang:1.17-alpine
image: artifactory.algol60.net/docker.io/library/golang:1.23-alpine
command: >
sh -c "apk add curl && sleep 10 &&
curl -X POST -d '{\"RedfishEndpoints\":[{
Expand Down
2 changes: 1 addition & 1 deletion docker-compose.test.unit.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -150,7 +150,7 @@ services:
# Emulator
#
emulator-loader:
image: artifactory.algol60.net/docker.io/library/golang:1.17-alpine
image: artifactory.algol60.net/docker.io/library/golang:1.23-alpine
command: >
sh -c "apk add curl && sleep 10 &&
curl -X POST -d '{\"RedfishEndpoints\":[{
Expand Down
12 changes: 7 additions & 5 deletions go.mod
Original file line number Diff line number Diff line change
@@ -1,6 +1,8 @@
module github.com/Cray-HPE/hms-power-control

go 1.17
go 1.23

toolchain go1.23.1

//todo hms-base needs to be converted to hms-xname as soon as that package is available.

Expand All @@ -9,14 +11,14 @@ require (
github.com/Cray-HPE/hms-certs v1.3.3
github.com/Cray-HPE/hms-compcredentials v1.11.2
github.com/Cray-HPE/hms-hmetcd v1.10.3
github.com/Cray-HPE/hms-securestorage v1.12.2
github.com/Cray-HPE/hms-securestorage v1.13.0
github.com/Cray-HPE/hms-smd/v2 v2.4.0
github.com/Cray-HPE/hms-trs-app-api/v2 v2.1.1
github.com/Cray-HPE/hms-xname v1.0.0
github.com/Cray-HPE/hms-trs-app-api/v3 v3.0.1
github.com/Cray-HPE/hms-xname v1.3.0
github.com/google/uuid v1.2.0
github.com/gorilla/mux v1.8.0
github.com/namsral/flag v1.7.4-pre
github.com/sirupsen/logrus v1.8.1
github.com/sirupsen/logrus v1.9.3
github.com/stretchr/testify v1.7.2
)

Expand Down
Loading

0 comments on commit 5ddb830

Please sign in to comment.