NC | Online Upgrade | Health CLI update config directory and upgrade checks #8532

Merged 1 commit on Dec 8, 2024
2 changes: 1 addition & 1 deletion docs/NooBaaNonContainerized/CI&Tests.md
@@ -88,7 +88,7 @@ Run `NC mocha tests` with root permissions -
#### NC mocha tests
The following is a list of `NC mocha test` files -
1. `test_nc_nsfs_cli.js` - Tests NooBaa CLI.
2. `test_nc_nsfs_health` - Tests NooBaa Health CLI.
2. `test_nc_health` - Tests NooBaa Health CLI.
3. `test_nsfs_glacier_backend.js` - Tests NooBaa Glacier Backend.
4. `test_nc_with_a_couple_of_forks.js` - Tests the `bucket_namespace_cache` when running with a couple of forks. Please notice that it uses `nc_coretest` with setup that includes a couple of forks.

38 changes: 36 additions & 2 deletions docs/NooBaaNonContainerized/Health.md
@@ -34,6 +34,14 @@ For more details about NooBaa RPM installation, see - [Getting Started](./Gettin
- Iterating buckets under the config directory.
- Confirming the existence of the bucket's configuration file and its validity as a JSON file.
- Verifying that the underlying storage path of a bucket exists.
- `Config directory health`
- Checks that the system config file (`system.json`) and the config directory data exist.
- Returns the config directory status.
- `Config directory upgrade health`
- Checks that the system config file (`system.json`) and the config directory data exist.
- Checks whether an upgrade is in progress.
- Returns an error if no upgrade is in progress but the config directory phase is locked.
- Returns a message if no upgrade is in progress and the config directory is unlocked.
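
The config directory upgrade checks listed above can be sketched as a small standalone function. This is an illustrative sketch only, not the code in `health.js`: the function name `describe_config_dir_upgrade`, the error wording, and the handling of an unexpected phase are assumptions, while the phase strings and the "no in-progress upgrade" message are taken from the output example in this document.

```javascript
'use strict';

// Phase values as they appear in the Health CLI output example.
const CONFIG_DIR_LOCKED = 'CONFIG_DIR_LOCKED';
const CONFIG_DIR_UNLOCKED = 'CONFIG_DIR_UNLOCKED';

// Hypothetical helper that mirrors the documented checks, in order.
function describe_config_dir_upgrade(system_data) {
    // 1. The system config and its config directory data must exist.
    const config_dir = system_data && system_data.config_directory;
    if (!config_dir) return { error: 'system config or config directory data is missing' };
    // 2. An ongoing upgrade is reported as-is.
    if (config_dir.in_progress_upgrade) return { in_progress_upgrade: config_dir.in_progress_upgrade };
    // 3. No ongoing upgrade, but the directory is still locked: report an error.
    if (config_dir.phase === CONFIG_DIR_LOCKED) {
        return { error: 'no ongoing upgrade, but the config directory is locked' };
    }
    // 4. No ongoing upgrade and the directory is unlocked: report a message.
    if (config_dir.phase === CONFIG_DIR_UNLOCKED) {
        return { message: 'there is no in-progress upgrade' };
    }
    // Assumption of this sketch: any other phase value is reported as an error.
    return { error: `unexpected config directory phase: ${config_dir.phase}` };
}

console.log(describe_config_dir_upgrade({ config_directory: { phase: CONFIG_DIR_UNLOCKED } }));
console.log(describe_config_dir_upgrade({ config_directory: { phase: CONFIG_DIR_LOCKED } }));
```

Returning a plain object with either `error`, `message`, or `in_progress_upgrade` keeps the result easy to embed in a larger JSON report.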

* Health CLI requires root permissions.

@@ -148,6 +156,11 @@ The output of the Health CLI is a JSON object containing the following propertie
- Enum: 'PERSISTENT' | 'TEMPORARY'
- Description: For TEMPORARY error types, NooBaa attempts multiple retries before updating the status to reflect an error. Currently, TEMPORARY error types are only observed in checks for invalid NooBaa endpoints.

- `config_directory`
- Type: Object {"phase": "CONFIG_DIR_UNLOCKED" | "CONFIG_DIR_LOCKED","config_dir_version": String,
"upgrade_package_version": String, "upgrade_status": Object, "error": Object }.
- Description: An object containing the config directory information and the config directory upgrade information.
- Example: { "phase": "CONFIG_DIR_UNLOCKED", "config_dir_version": "1.0.0", "upgrade_package_version": "5.18.0", "upgrade_status": { "message": "there is no in-progress upgrade" }}
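
As a rough illustration of how a caller might consume this object, the sketch below summarizes a config directory object taken from a parsed Health CLI report. The helper name `summarize_config_directory` and the summary format are hypothetical; the field names and sample values come from the property description above. The object is passed in directly, so the sketch makes no assumption about where it sits in the full report.

```javascript
'use strict';

// Sample values copied from the documented example of this property.
const sample = {
    phase: 'CONFIG_DIR_UNLOCKED',
    config_dir_version: '1.0.0',
    upgrade_package_version: '5.18.0',
    upgrade_status: { message: 'there is no in-progress upgrade' }
};

// Hypothetical helper: turns the config directory object into a short summary.
function summarize_config_directory(config_directory) {
    if (!config_directory) {
        return { ok: false, summary: 'config directory information is missing' };
    }
    // An `error` property marks the config directory as unhealthy.
    if (config_directory.error) {
        return { ok: false, summary: `config directory error: ${JSON.stringify(config_directory.error)}` };
    }
    const { phase, config_dir_version, upgrade_package_version } = config_directory;
    return {
        ok: true,
        summary: `phase=${phase} config_dir_version=${config_dir_version} upgrade_package_version=${upgrade_package_version}`
    };
}

console.log(summarize_config_directory(sample));
```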

## Example
```sh
@@ -225,6 +238,14 @@ Output:
}
],
"error_type": "PERSISTENT"
},
"config_directory": {
"phase": "CONFIG_DIR_UNLOCKED",
"config_dir_version": "1.0.0",
"upgrade_package_version": "5.18.0",
"upgrade_status": {
"message": "there is no in-progress upgrade"
}
}
}
}
@@ -243,7 +264,8 @@ Output:
- The config file of bucket1 is invalid. Therefore, NooBaa health reports INVALID_CONFIG.
- The underlying file system directory of bucket3 is missing. Therefore, NooBaa health reports STORAGE_NOT_EXIST.


- config_directory:
- The config directory phase is unlocked, the config directory version is "1.0.0", the matching source code/package version is "5.18.0", and there is no ongoing upgrade.


## Health Errors
@@ -365,4 +387,16 @@ The following error codes will be associated with a specific Bucket or Account s
- Reasons:
- Bucket missing owner account.
- Resolutions:
- Check for owner_account property in bucket config file.
- Check for owner_account property in bucket config file.

#### 8. Config Directory is invalid
- Error code: `INVALID_CONFIG_DIR`
- Error message: Config directory is invalid
- Reasons:
- `system.json` is missing - NooBaa was never started.
- The config directory property is missing in `system.json` - the user didn't run the config directory upgrade when upgrading from 5.17.z to 5.18.0.
- Config directory upgrade error.
- Resolutions:
- Start the NooBaa service.
- Run `noobaa-cli upgrade`.
- Check `in_progress_upgrade` for the exact reason for the failure.
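
A minimal sketch of how the reasons above could be mapped to the listed resolutions. The error strings are the ones returned by the config directory checks in the `health.js` changes of this PR. Pairing each reason with a resolution by list order is an assumption, and `suggest_resolution` is a hypothetical helper, not part of NooBaa.

```javascript
'use strict';

// Error strings from the config directory checks, paired with the documented resolutions.
// The pairing follows the order of the two lists above and is an assumption of this sketch.
const RESOLUTIONS = new Map([
    ['system data is missing', 'Start the NooBaa service'],
    ['config directory data is missing, must upgrade config directory', 'Run `noobaa-cli upgrade`'],
    ['last_upgrade_failed', 'Check the upgrade details for the exact reason for the failure'],
]);

// Hypothetical helper: returns a suggested resolution for a config directory error string.
function suggest_resolution(config_dir_error) {
    return RESOLUTIONS.get(config_dir_error) || 'Unrecognized config directory error';
}

console.log(suggest_resolution('system data is missing'));
console.log(suggest_resolution('last_upgrade_failed'));
```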
90 changes: 82 additions & 8 deletions src/manage_nsfs/health.js
@@ -14,6 +14,7 @@ const { TYPES } = require('./manage_nsfs_constants');
const { get_boolean_or_string_value, throw_cli_error, write_stdout_response, get_bucket_owner_account_by_id } = require('./manage_nsfs_cli_utils');
const { ManageCLIResponse } = require('./manage_nsfs_cli_responses');
const ManageCLIError = require('./manage_nsfs_cli_errors').ManageCLIError;
const { CONFIG_DIR_LOCKED, CONFIG_DIR_UNLOCKED } = require('../upgrade/nc_upgrade_manager');


const HOSTNAME = 'localhost';
@@ -60,6 +61,10 @@ const health_errors = {
error_code: 'MISSING_ACCOUNT_OWNER',
error_message: 'Bucket account owner not found',
},
INVALID_CONFIG_DIR: {
error_code: 'INVALID_CONFIG_DIR',
error_message: 'Config directory is invalid',
},
UNKNOWN_ERROR: {
error_code: 'UNKNOWN_ERROR',
error_message: 'An unknown error occurred',
@@ -117,12 +122,16 @@ class NSFSHealth {
endpoint_state = await this.get_endpoint_response();
memory = await this.get_service_memory_usage();
}
// TODO: add more health status based on system.json, e.g. RPM upgrade issues
const system_data = await this.config_fs.get_system_config_file({ silent_if_missing: true });
const config_directory_status = this._get_config_dir_status(system_data);

let bucket_details;
let account_details;
const response_code = endpoint_state ? endpoint_state.response.response_code : 'NOT_RUNNING';
const service_health = service_status !== 'active' || pid === '0' || response_code !== 'RUNNING' ? 'NOTOK' : 'OK';

const error_code = await this.get_error_code(service_status, pid, response_code);
const endpoint_response_code = (endpoint_state && endpoint_state.response?.response_code) || 'UNKNOWN_ERROR';
const health_check_params = { service_status, pid, endpoint_response_code, config_directory_status };
const service_health = this._calc_health_status(health_check_params);
const error_code = this.get_error_code(health_check_params);
if (this.all_bucket_details) bucket_details = await this.get_bucket_status();
if (this.all_account_details) account_details = await this.get_account_status();
const health = {
@@ -136,6 +145,7 @@ class NSFSHealth {
endpoint_state,
error_type: health_errors_tyes.TEMPORARY,
},
config_directory_status,
accounts_status: {
invalid_accounts: account_details === undefined ? undefined : account_details.invalid_storages,
valid_accounts: account_details === undefined ? undefined : account_details.valid_storages,
@@ -161,7 +171,7 @@ class NSFSHealth {
delay_ms: config.NC_HEALTH_ENDPOINT_RETRY_DELAY,
func: async () => {
endpoint_state = await this.get_endpoint_fork_response();
if (endpoint_state.response.response_code === fork_response_code.NOT_RUNNING.response_code) {
if (endpoint_state.response?.response_code === fork_response_code.NOT_RUNNING.response_code) {
throw new Error('Noobaa endpoint is not running, all the retries failed');
}
}
@@ -173,13 +183,23 @@ class NSFSHealth {
return endpoint_state;
}

async get_error_code(nsfs_status, pid, endpoint_response_code) {
if (nsfs_status !== 'active' || pid === '0') {
/**
* get_error_code returns the error code per the failed check
* @param {{service_status: String,
* pid: string,
* endpoint_response_code: string,
* config_directory_status: Object }} health_check_params
* @returns {Object}
*/
get_error_code({ service_status, pid, endpoint_response_code, config_directory_status }) {
if (service_status !== 'active' || pid === '0') {
return health_errors.NOOBAA_SERVICE_FAILED;
} else if (endpoint_response_code === 'NOT_RUNNING') {
return health_errors.NOOBAA_ENDPOINT_FAILED;
} else if (endpoint_response_code === 'MISSING_FORKS') {
return health_errors.NOOBAA_ENDPOINT_FORK_MISSING;
} else if (config_directory_status.error) {
return health_errors.INVALID_CONFIG_DIR;
}
}

@@ -239,7 +259,7 @@ class NSFSHealth {
const fork_count_response = await this.make_endpoint_health_request(url_path);
if (!fork_count_response) {
return {
response_code: fork_response_code.NOT_RUNNING,
response: fork_response_code.NOT_RUNNING,
total_fork_count: total_fork_count,
running_workers: worker_ids,
};
@@ -421,6 +441,60 @@ class NSFSHealth {
err_obj
};
}

/**
* _get_config_dir_status returns the config directory phase, version,
* matching package_version, upgrade_status and the error, if one occurred.
* @param {Object} system_data
* @returns {Object}
*/
_get_config_dir_status(system_data) {
if (!system_data) return { error: 'system data is missing' };
const config_dir_data = system_data.config_directory;
if (!config_dir_data) return { error: 'config directory data is missing, must upgrade config directory' };
const config_dir_upgrade_status = this._get_config_dir_upgrade_status(config_dir_data);
return {
phase: config_dir_data.phase,
config_dir_version: config_dir_data.config_dir_version,
upgrade_package_version: config_dir_data.upgrade_package_version,
upgrade_status: config_dir_upgrade_status,
error: config_dir_upgrade_status.error || undefined
};
}

/**
* _get_config_dir_upgrade_status returns one of the following
* 1. the status of an ongoing upgrade, if valid it returns an object with upgrade details
* 2. if upgrade is not ongoing but config dir is locked, the error details of the upgrade's last_failure will return
* 3. if upgrade is not ongoing and config dir is unlocked, a corresponding message will return
* @param {Object} config_dir_data
* @returns {Object}
*/
_get_config_dir_upgrade_status(config_dir_data) {
if (config_dir_data.in_progress_upgrade) return { in_progress_upgrade: config_dir_data.in_progress_upgrade };
if (config_dir_data.phase === CONFIG_DIR_LOCKED) {
return { error: 'last_upgrade_failed', last_failure: config_dir_data.upgrade_history?.last_failure };
}
if (config_dir_data.phase === CONFIG_DIR_UNLOCKED) {
return { message: 'there is no in-progress upgrade' };
}
}

/**
* _calc_health_status calculates the overall health status of NooBaa NC
* @param {{service_status: String,
* pid: string,
* endpoint_response_code: string,
* config_directory_status: Object }} health_check_params
* @returns {'OK' | 'NOTOK'}
*/
_calc_health_status({ service_status, pid, endpoint_response_code, config_directory_status }) {
const is_unhealthy = service_status !== 'active' ||
pid === '0' ||
endpoint_response_code !== 'RUNNING' ||
config_directory_status.error;
return is_unhealthy ? 'NOTOK' : 'OK';
}
}

async function get_health_status(argv, config_fs) {
2 changes: 1 addition & 1 deletion src/test/unit_tests/nc_index.js
@@ -11,7 +11,7 @@ require('./test_chunk_fs');
require('./test_namespace_fs_mpu');
require('./test_nb_native_fs');
require('./test_nc_nsfs_cli');
require('./test_nc_nsfs_health');
require('./test_nc_health');
require('./test_nsfs_access');
require('./test_nsfs_integration');
require('./test_bucketspace_fs');
2 changes: 1 addition & 1 deletion src/test/unit_tests/sudo_index.js
@@ -21,5 +21,5 @@ require('./test_bucketspace_versioning');
require('./test_bucketspace_fs');
require('./test_nsfs_versioning');
require('./test_nc_nsfs_cli');
require('./test_nc_nsfs_health');
require('./test_nc_health');
