Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(dynamic config): Refactor and make several config fields reloadable at runtime #8240

Merged
merged 112 commits into from
Jan 19, 2023

Conversation

nikurt
Copy link
Contributor

@nikurt nikurt commented Dec 19, 2022

Introduce mutable fields in ClientConfig.
Intoruce the infrastructure to reload config.json and notify Client that certain fields were updated.
Refactoring to inline dyn_config.json into config.json.

yanganto and others added 30 commits November 25, 2022 13:34
This reverts commit efb319c.
Co-authored-by: nikurt <86772482+nikurt@users.noreply.github.com>
Co-authored-by: nikurt <86772482+nikurt@users.noreply.github.com>
This is the next step in the refactoring steps for changing gas profiles
to track gas by parameter (#8033).

Here we make `ExtCostsConfig` opaque and look up parameters by
```rust
pub fn cost(&self, param: ExtCosts) -> Gas
```
instead of using a specific field inside.

There are side-effects for this in
1. parameter definition
2. JSON RPC 
3. parameter estimator

1) We no longer load the parameters through
"parameter table -> JSON -> serde_deser" steps because `ExtCostsConfig`
no longer has serde derives. Instead each `ExtCosts` maps to a
`Parameter` that allows looking up the value directly from
`ParameterTable`.
This explicit mapping also replaces the `Parameter::ext_costs()`
iterator previously used to find all parameters that are ext costs.

We used to define `wasm_read_cached_trie_node` in `53.txt` and fill
old values with serde default. Serde was removed here, so I changed it
to define the parameter in the base files. This is equivalent to the
old behavior, only it is less clear when we added the parameter.

2) JSON RPC must keep the old format. Thus, I added `ExtCostsConfigView`
and `VMConfigView` there. It is a direct copy-paste of the old structs
in the old format but without serde magic to fill in missing values.

3) The estimator generates a `ExtCostsConfig` from estimations. This is
now done through a mapping from estimated costs to `ExtCosts`.

# Testing

The exact JSON output is checked in existing tests
`test_json_unchanged`.
Follow up to #8073, #8109, and #8110 for the two remaining tests in `integration-tests/src/tests/network/routing.rs`.
Added create_test_signer that creates the signer for the given account with the seed that matches the account name.

Can be used in tests only.
* fix a bug in skip procesing

* add test

* remove comments
removed edge field from Connection to avoid redundancy with GraphSnapshot.local_edges
removed sending a duplicate RoutingTableUpdate message in response to RequestUpdateNonce
They were already rejected by wasmer2 before this change, just after preparation. So this should not be a protocol change.

Second attempt at #8029, without a protocol change this time
read_memory and write_memory have to perform the same checks as
fits_memory already does.  So rather than panicking, change the
methods to return an error.  This makes the interface panic-free
and in some situations allows fits_memory call to be skipped.

However, separate fits_memory check may still be necessary so
keep the method but document in detail why it’s needed and, to
keep interfaces consistent, change it to return a Result.

Finally, read_memory_u8 is never used so get rid of it.

While at it, add tests for WasmerMemory.  Essentially copies
of tests from Wasmer2Memory.
in the mirror code, where we're starting an indexer for the target chain, the genesis records file is usually very big since we're forking mainnet or testnet state to run a mocknet test. So starting the indexer with full genesis validation takes quite a long time and we don't really need it.
With the wrapping computation of `new_used_gas` any contract is in full
control of this value, including those that are actually less than
either operand (due to overflows.)

This is not actually a big deal by itself, if not for the fact that it
gives an opportunity for the attacker to nuke the entire network out of
service.

The in-line comments largely explain the reasoning behind the fix and
why it preserves the protocol-facing behaviour, thus not necessitating a
protocol version change. To reiterate what the comments say, basically
the only case that we want to resolve is for when the attacker picks
a value of `new_used_gas` such that it is less than `burnt_gas`. Instead
of asserting and aborting we simply set `self.promises_gas = 0`.

---

The fix in this PR has been included in 1.29.3 (72d4a4d) and 1.30.0-rc.6 (4290758).
The promise_and method processes one promise index at a time.  There’s
no benefit in having all them in memory.  Rather than copying the
indices from Vec<u8> into a new Vec<u64> simply do the little endian
conversion while going through the former vector.  This saves an
allocation.
…ts (#8195)

Bumps [certifi](https://github.com/certifi/python-certifi) from 2021.10.8 to 2022.12.7.
- [Release notes](https://github.com/certifi/python-certifi/releases)
- [Commits](certifi/python-certifi@2021.10.08...2022.12.07)

---
updated-dependencies:
- dependency-name: certifi
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Update the security policy to include hackenproof for security issue submissions.
Adds a separate page `/debug/pages/tier1_network_info` displaying information about TIER1 connections. For TIER1 connections we mostly display the same information as TIER2 connections, except that:
- TIER1 messages don't include height, last block hash, or nonce info, so these columns are omitted
- TIER1 nodes have configured public proxies; a column is added to display their addresses

Includes a couple of minor tweaks to the TIER2 `network_info` page (show PeerId of the current node at the top of the page, rename the connection table's `AccountId` column to `PeerId`).
Deprecating AnnounceAccount in favor of AccountData, when finding routing target. AnnounceAccount will be still used as a fallback. In the next release, once we confirm that AnnounceAccount is not used any more, we will remove the fallback. We will stop broadcasting AnnounceAccount only once MIN_SUPPORTED_VERSION > 57.
@nikurt
Copy link
Contributor Author

nikurt commented Jan 7, 2023

@mm-near PTAL

mm-near
mm-near previously requested changes Jan 9, 2023
nearcore/src/config.rs Show resolved Hide resolved
nearcore/src/dyn_config.rs Outdated Show resolved Hide resolved
nearcore/src/dyn_config.rs Outdated Show resolved Hide resolved
nearcore/src/dyn_config.rs Outdated Show resolved Hide resolved
nearcore/src/dyn_config.rs Show resolved Hide resolved
chain/client/src/client_actor.rs Outdated Show resolved Hide resolved
chain/client/src/client_actor.rs Outdated Show resolved Hide resolved
core/chain-configs/src/updateable_config.rs Show resolved Hide resolved
core/dyn-configs/README.md Show resolved Hide resolved
core/o11y/src/lib.rs Show resolved Hide resolved
@nikurt nikurt requested a review from mm-near January 10, 2023 12:43
core/chain-configs/src/updateable_config.rs Outdated Show resolved Hide resolved
core/chain-configs/src/updateable_config.rs Show resolved Hide resolved
core/chain-configs/src/updateable_config.rs Outdated Show resolved Hide resolved
core/dyn-configs/src/lib.rs Outdated Show resolved Hide resolved
core/dyn-configs/src/lib.rs Outdated Show resolved Hide resolved
core/dyn-configs/src/lib.rs Outdated Show resolved Hide resolved
core/dyn-configs/src/lib.rs Outdated Show resolved Hide resolved
None
}
};
let updateable_client_config =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is better UX if the updateable configs are read through a separate file instead of the original config file. It has a few benefits:

  • Users won't be able to modify a config field that is not actually upgradeable. Right now, if user does that, the code just silently drops the update, which could be a source of confusion to the user. On top of that, the node can be in trouble when they restart.
  • Users can't accidentally change the config file to a bad state, so they won't be in trouble when they restart the node.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC, the suggestion is to have config.json which is assumed to be in a good state; and to duplicate the updateable fields into a separate file, e.g. dyn_config.json.

If in the future we decide to make max_num_peers updateable at runtime, will that mean that the same field will be configurable in config.json as network.max_num_peers and in dyn_config.json as max_num_peers?

I would find that UX confusing, because reading config.json is no longer enough to understand the node's configuration. There is always a chance that some field is overridden in dyn_config.json.
On the other hand, if I want to change max_num_peers, why shouldn't I change it in both config.json and dyn_config.json?
Which means that to read the configuration I need to read two config files; and to update the configuration I need to update two config files.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think my main concern with let user update config.json directly is that in that case, the values in config.json may not actually reflect what the config the node is using. But I also see your point about two sources of truth.

What about we let the user update the dyn_config.json but the code will automatically update config.json to reflect the updated value after dyn_config.json is applied. This way, the user only needs to rely on config.json for the source of truth. And they don't have to worry about messing up the config.json when the node is running.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To make it even safer, the program can also save a copy of the old config value, so the user has something to fall back to in case they make a mistake and want to revert to the previous config.

Copy link
Contributor

@yanganto yanganto Jan 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just an idea about this concern. Could the following rules be an easier solution?

  • Introduce a new config field allow_config_change: true (default is false) in the original config.json, and this value is not allowed changed in runtime. Just in case people do not want to change config in runtime and keep same behavior as current.
  • If the dyn_config is valid and applied. Could we save a config snapshot as config.{timesampts}.json?
  • If the snapshot can not store, neard will panic in case of any config changes without a clear snapshot.
  • (Optional) When the neard start and allow_config_change is true, it also creates config.{timesampts}.json such that the behavior will much more consistent.

The reason to use config.{timesampts}.json but not config_log.txt is

  • user will easier to reuse the config.{timesampts}.json by renaming it to config.json.
  • it is easier to use diff tools for two config files
  • keep a single place to store the log
  • once the dyn_config is applied, the snapshot is a kind of receipt, and we can just write a one-line log config.{timesampts}.json applied without verbose.

If these look good from your end, I am happy to implement this. 😄

Copy link
Contributor Author

@nikurt nikurt Jan 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mzhangmzz

the code will automatically update config.json

I find it very confusing when a program modifies its own configs.

the user has something to fall back to in case they make a mistake

Unclear how useful this is. I rely on the (persistent) undo history. Some users make and manage backups of the configs.
In general, I don't want to prescribe how the users should manage their config.json file. There are too many exceptions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yanganto

allow_config_change

Seems redundant when a SIGHUP is required to trigger a config reload. If the user doesn't send a signal, nothing will happen.

save a config snapshot

I believe many users would like to manage config versions themselves.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also like the idea of a single 'config.json' file.

For the config{timestamp}.json - the part I don't like, is that it pollutes the directory a little - what we could do instead, is to have the /config HTTP endpoint (or /debug/config) that would always report the current config. (this will allow people to do diffs in an easy way if needed)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/config is already work in progress: https://pagodaplatform.atlassian.net/browse/ND-273

@mzhangmzz
Copy link
Contributor

mzhangmzz commented Jan 12, 2023

Please also update the description of the PR.

Ok(())
}
// TODO: Add more config validation.
// TODO: Validate `ClientConfig` instead.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for validating ClientConfig instead. Ideally, the validation should happen as soon as ClientConfig is loaded.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mind if I leave this to a follow-up code change?

}
if let Some(statistics) = statistics {
rocksdb_metrics::export_stats_as_metrics(statistics);
}

if let Some(config_updater) = &config_updater {
config_updater.log_error();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure about passing the error all the way from neard to client and to here just for logging. Without this, the logic for ConfigUpdater is much simpler as the struct does not contain any state. I see two other options:

  • Simply let the node panic in neard after loading the new config if the config is updated to an invalid setting. This may be a little harsh, but I think it is also reasonable. When restarting the node, it panics if the config is invalid. So I think it is ok to terminate the node if the updated config is invalid.
  • Log the error into a separate file such as config_log.txt. This way, the error message will not be lost in the other log messages, so you don't have store the message and log it every time with the info().

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have a tradeoff between code complexity and making it more user-friendly. Or rather, more difficult to shoot yourself in the foot and not notice it.

I lean strongly on the side of user-friendliness.
There is also a metric exported about the config being invalid.

let the node panic

That is user-anti-friendly. I can imagine updating the config during an incident, and accidentally killing a very valuable node. Don't want that.
And if a validator accidentally kills a node and misses blocks, that is not good either.

config_log.txt

That is non-trivial to use properly 🤔
I would like to use it like

killall -SIGHUP neard ; cat ~/.near/config_log.txt || echo 'Config valid'

But it may take longer than a moment to apply that config change, parse the config and validate it.
To be sure it's applied a delay is needed.

killall -SIGHUP neard ; sleep 10; cat ~/.near/config_log.txt || echo 'Config valid'

But I also cannot guarantee that 10 seconds is a sufficient delay.
And anyway, if I can remember to check config_log.txt, I can remember to check the Prometheus metric.

@nikurt nikurt requested a review from mzhangmzz January 13, 2023 17:55
chain/client/src/config_updater.rs Outdated Show resolved Hide resolved
chain/client/src/config_updater.rs Outdated Show resolved Hide resolved
chain/client/src/config_updater.rs Outdated Show resolved Hide resolved
chain/client/src/config_updater.rs Outdated Show resolved Hide resolved
chain/client/src/config_updater.rs Outdated Show resolved Hide resolved
core/dyn-configs/README.md Show resolved Hide resolved
@near-bulldozer near-bulldozer bot merged commit 748ac49 into near:master Jan 19, 2023
near-bulldozer bot pushed a commit that referenced this pull request Jan 30, 2023
Co-authored-by: Jakob Meier <mail@jakobmeier.ch>
ppca pushed a commit to ppca/nearcore that referenced this pull request Jan 30, 2023
Co-authored-by: Jakob Meier <mail@jakobmeier.ch>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.