Skip to content
This repository has been archived by the owner on Nov 15, 2023. It is now read-only.

bad memory behaviour substrate 3.0.0 #8117

Closed
darkfriend77 opened this issue Feb 14, 2021 · 34 comments
Closed

bad memory behaviour substrate 3.0.0 #8117

darkfriend77 opened this issue Feb 14, 2021 · 34 comments
Labels
J2-unconfirmed Issue might be valid, but it’s not yet known.

Comments

@darkfriend77
Copy link

darkfriend77 commented Feb 14, 2021

If been installing the https://github.com/paritytech/substrate/releases/tag/v3.0.0 on two nodes and they have both a bad memory behavior, the same installation is running smoothly with substrate 2.0.0 on two other nodes, see comparison.

NEW NODE

same configuration same code, but with substrate 2.0.0
OLD NODE

seem there is something bad happening ... how can I help to find the bug? The nodes crash every hour, when mem is running full.

Installed both on 2 GB Memory / 50 GB Disk - Ubuntu 20.04 (LTS) x64

NEW NODES
https://polkadot.js.org/apps/?rpc=wss%3A%2F%2Fmogiway-01.dotmog.com#/explorer
https://polkadot.js.org/apps/?rpc=wss%3A%2F%2Fmogiway-02.dotmog.com#/explorer

OLD NODES
https://polkadot.js.org/apps/?rpc=wss%3A%2F%2Fnode01.dotmog.com#/explorer
https://polkadot.js.org/apps/?rpc=wss%3A%2F%2Fnode02.dotmog.com#/explorer

@github-actions github-actions bot added the J2-unconfirmed Issue might be valid, but it’s not yet known. label Feb 14, 2021
@bkchr
Copy link
Member

bkchr commented Feb 14, 2021

Where is your node implementation? How do you run the node? (which CLI flags)

@darkfriend77
Copy link
Author

https://github.com/dotmog/substrate/tree/dotmog_v3.0.0

I run the bin/node/

compiled with

cargo build --release -p node-cli

running it with this service ...

[Unit] Description=DOT Mogwai Validator [Service] WorkingDirectory='$REPO_DIR' ExecStart='$NODE_TEMPLATE' --base-path '$VALIDATOR_DIR' --chain='$CHAIN_SPEC' --port 30333 --ws-port 9944 --rpc-port 9933 --validator --rpc-methods=Unsafe --name "'$NODE_NAME'" --rpc-cors all Restart=always RestartSec=120 [Install] WantedBy=multi-user.target

./target/release/substrate --base-path /home/dotmog/substrate/tmp/validator-chain --chain=/home/dotmog/substrate/bin/node/scripts/chain-spec/dotmog_livenet_raw.json --port 30333 --ws-port 9944 --rpc-port 9933 --validator --rpc-methods=Unsafe --name "MOGIWAY-01" --rpc-cors all

@darkfriend77
Copy link
Author

darkfriend77 commented Feb 14, 2021

I reproduced the issue running the node clean on the tag v3.0.0 on my own machine

substrate --dev

mem usage climbs up for around 15MB/min

Not sure if this the way it should be ...

anything I can help?

time 20:03
20:03
time 20:13
20:13
time 20:23
20:23
time 20:33
20:33

@bkchr
Copy link
Member

bkchr commented Feb 14, 2021

This is windows with the Linux subsystem?

@darkfriend77
Copy link
Author

Its a windows vm setup like described here (guide), that was just a test to reproduce? The nodes who crash are running on digital ocean, ubuntu. Is there something I can do on my ubuntu nodes to gather more informations about this behaviour?

@darkfriend77
Copy link
Author

darkfriend77 commented Feb 15, 2021

forgot to stop the node on my vm ... 14 h later, pretty sure this is a bug, can somebody confirm?

This is just the clean build of tag v3.0.0, not any extrinsic or any communication done with the node during that time ...

image

@darkfriend77
Copy link
Author

darkfriend77 commented Feb 15, 2021

I created a fresh new vm on digital ocean applied following commands to install substrate ...

4 GB Memory / 80 GB Disk / LON1 - Ubuntu 20.04 (LTS) x64

curl -sSL https://repos.insights.digitalocean.com/install.sh | sudo bash
sudo apt-get update -y
sudo apt-get upgrade -y
curl https://getsubstrate.io -sSf | bash -s -- --fast
source ~/.cargo/env
rustup update nightly
rustup update stable
rustup target add wasm32-unknown-unknown --toolchain nightly
git clone https://github.com/dotmog/substrate.git
cd substrate
git checkout tags/v3.0.0
cargo build --release -p node-cli

then

./target/release/substrate --dev

... monitoring with top and digital oceans monitoring tool.

start
image

10 min later
image

image

100 min
image

image

more comming

@darkfriend77
Copy link
Author

image

stopping it now ...

image

@2075
Copy link
Contributor

2075 commented Feb 16, 2021

on my end (tested on macos) I have an initial footprint of around 90MB and after roughly one hour 1.8GB. is this due to the state kept in memory, or is there a better explanation I am not aware of? your mem chart looks more like leaking. weird, as it is kind of dynamic in the first hour or so.

@darkfriend77
Copy link
Author

darkfriend77 commented Feb 16, 2021

the first hour is setting up the machine and compiling substrate ... time from digital ocean monitoring is 1 hour shifted in the grafics, so the substrate process was started on 13:05 in the blue diagram.

@2075 thx, for confirming the issue.

@gnunicorn
Copy link
Contributor

gnunicorn commented Feb 16, 2021

@darkfriend77 you are running a single --dev-instance without any connections to other nodes? Did you run the nodes previously under the same conditions? Could you try connecting a two nodes with --alice and --bob? Just want to make this isn't just a bunch of network messages queuing up (--dev is a special scenario that deactivates a few safeguards that we have in place for disconnected nodes).

On the top, when the graph says 100% memory, how much physical memory is that? Substrate/Polkadot use about 500Mb RAM for us here...

@darkfriend77
Copy link
Author

darkfriend77 commented Feb 16, 2021

First I noticed the bad behaviour on a basic installation with two nodes which where only updated to 3.0.0 and they where working fine on 2.0.0 .. post

Then I reproduced the behaviour on my local machine under windows, only using substrate --dev post

And the latest was reproducing it with a vm on digital ocean only using substrate --dev, on a 4gb mem machine.
I tryed to show all steps here that I used for the latest test. post

@darkfriend77 you are running a single --dev-instance without any connections to other nodes? Did you run the nodes previously under the same conditions? Could you try connecting a two nodes with --alice and --bob? Just want to make this isn't just a bunch of network messages queuing up (--dev is a special scenario that deactivates a few safeguards that we have in place for disconnected nodes).

The first test is a basic 2 peer installation, so it's not --dev related.

@darkfriend77
Copy link
Author

Here an attempt to profile it.

image

image

@darkfriend77
Copy link
Author

@gnunicorn @bkchr can one of you guys confirm the issue concerning bad memory usage on the Substrate v3.0/0.9 – Apollo 14 release, please.

@bkchr
Copy link
Member

bkchr commented Feb 20, 2021

@darkfriend77 We are aware that there is something. We don't know yet what it is. However, we already know that something on Windows with Linux subsystem triggers some huge memory leak as you already detected.

@darkfriend77
Copy link
Author

@darkfriend77 We are aware that there is something. We don't know yet what it is. However, we already know that something on Windows with Linux subsystem triggers some huge memory leak as you already detected.

@bkchr Okay, thank you for the answer.

Just to clarify, I've reproduced the bug only on WSL. I observed it on the native linux nodes first runing substrate 3.0.0. Only then I started to look at it the bug.

Ubuntu 20.04 (LTS) x64 / 4 GB Mem / 80 GB Disk memory leak LINK

@bkchr
Copy link
Member

bkchr commented Feb 20, 2021

Yeah I have seen this. We are investigating it, but it is rather hard, because we can not reproduce it that easily.

@arkpar
Copy link
Member

arkpar commented Feb 21, 2021

@bkchr Profiling screenshot suggests It could be caused by a TLS leak in RegEx cache when sr_25519_vrf_sign is called.
If for some reason we spawn lots of short-lived threads (again), or there's an issue with ThreadLocal on a particular platform, it may end up caching regex data for each new thread and never destroying it.
This issue could be relevant: rust-lang/regex#362

  1. Why do we even parse a key from string on each signing?
  2. Regex instance should be compiled once and cached somewhere.
  3. We should check if tokio runtime is configured properly and that it does not create new threads all the time. My wild guess is that this might be happening on machines with only 1 or 2 CPU cores.

@bkchr
Copy link
Member

bkchr commented Feb 22, 2021

This issue really looks related. However, I also checked our nodes and the number of activate threads is around 15. So, we don't have an unlimited number of active threads and if I read this issue correctly, cleaning up the thread local should work.

@arkpar
Copy link
Member

arkpar commented Feb 22, 2021

Apparently it still does not work on Windows, if you read the last few comments on the issue.

The number of active threads at any given moment may be hovering at 15, but if some of them are shut down and restarted all the time that may be a problem. Also how many cores do our nodes have? I suspect the issue might only surface on underpowered machines with low CPU count. E.g. On a machine with 2 cores tokio thread pool only keeps around 2 threads and spawns additional short lived threads when required for blocking tasks, or when the load is high. On a machine with 8 cores it keeps 8 threads and there's rarely any need to spawn additional. That's just my guess but I wouldn't be surprised if there's some weird interaction like that going on.

@darkfriend77
Copy link
Author

I will do a test on 3 machines native Ubuntu 20.04 (LTS) x64 on Digital Ocean... same setup Linux only

curl -sSL https://repos.insights.digitalocean.com/install.sh | sudo bash
sudo apt-get update -y
sudo apt-get upgrade -y
curl https://getsubstrate.io -sSf | bash -s -- --fast
source ~/.cargo/env
rustup update nightly
rustup update stable
rustup target add wasm32-unknown-unknown --toolchain nightly
git clone https://github.com/dotmog/substrate.git
cd substrate
git checkout tags/v3.0.0
cargo build --release -p node-cli
./target/release/substrate --dev
  1. MACHINE 2 CPU 4 GB RAM native Ubuntu 20.04 (LTS) x64
  2. MACHINE 4 CPU 8 GB RAM native Ubuntu 20.04 (LTS) x64
  3. MACHINE 8 CPU 16 GB RAM native Ubuntu 20.04 (LTS) x64

Will let them run for a few hours then post the monitoring graphs and informations ... maybe this helps you to elaborate more ... on this issue

@darkfriend77
Copy link
Author

darkfriend77 commented Feb 22, 2021

start

image

... after a few blocks

image

... after ~2 hours

image

@anvie
Copy link
Contributor

anvie commented Feb 22, 2021

@darkfriend77 I'm curious with this issue, but unfortunately I can't reproduce your result, can you apply this patch, recompile and retest?, please update us with the result. The patch I made just to validate my assumption of my early investigation.

issue-8117-workaround.patch.zip

@darkfriend77
Copy link
Author

@anvie will do that.

@arkpar
Copy link
Member

arkpar commented Feb 22, 2021

The number of active threads at any given moment may be hovering at 15, but if some of them are shut down and restarted all the time that may be a problem. Also how many cores do our nodes have? I suspect the issue might only surface on underpowered machines with low CPU count. E.g. On a machine with 2 cores tokio thread pool only keeps around 2 threads and spawns additional short lived threads when required for blocking tasks, or when the load is high. On a machine with 8 cores it keeps 8 threads and there's rarely any need to spawn additional. That's just my guess but I wouldn't be surprised if there's some weird interaction like that going on.

I've checked the code to confirm this. Tokio blocking pool uses idle worker threads if available. And if not, it spawns a temporary thread. And we do execute all network handlers as blocking tasks. Having short lived temporary threads is a bad idea in general, as it leads to hard-to-diagnose issues, such as this one. I'd suggest reverting #5909 and maybe setting a fixed minimum of worker threads instead.
@bkchr @tomaka

@arkpar
Copy link
Member

arkpar commented Feb 22, 2021

@anvie making Regex static won't help with the leak I'm afraid. It still caches some data per thread in the TLS. At least it still leaks in the example given in this comment, even though Regex instance is shared.

@tomaka
Copy link
Contributor

tomaka commented Feb 22, 2021

I'd suggest reverting #5909

#5909 pins the network worker to one long-lived thread, which seems to be exactly what you want?

@tomaka
Copy link
Contributor

tomaka commented Feb 22, 2021

As a general opinion, I don't think we should avoid spawning threads because of unfixed leaks with thread-local storage. That's killing the messenger.

@darkfriend77
Copy link
Author

here what it looks like after 2 hours ... #8117 (comment)

@darkfriend77
Copy link
Author

after around 4 hours ... it seems that the machine setup doesn't make a difference, just the process will get killed when memory reachs max limit, happend on the 4gb machine.

image

@arkpar
Copy link
Member

arkpar commented Feb 22, 2021

I'd suggest reverting #5909

#5909 pins the network worker to one long-lived thread, which seems to be exactly what you want?

Right, I've misinterpreted it. The short-lived threads have nothing to do with it indeed. It looks to be an issue in thread-local recently fixed here: Amanieu/thread_local-rs#30
Which means making regexps static should fix it as well. @anvie do you want to make a PR with your patch?

@gnunicorn
Copy link
Contributor

so, if I read this correctly, there is no code-change necessary for the fix from our side, but a cargo update (make sure it updates thread_local) would do (as 1.1.3 should resolve for ^1.1.2). Could you do that, build that and confirm that fixes it for you, @darkfriend77 ?

We don't even have to release a new version, but we could tag for a new docker-image with the updated dependency, if that shows to fix the error.

@darkfriend77
Copy link
Author

darkfriend77 commented Feb 23, 2021

@gnunicorn updateing and creating the test in a few minutes
cargo update -p thread_local
thread_local v1.1.2 -> v1.1.3

start

image

after 1.5 hourse seems fixed .. ^^

image

:-)

@darkfriend77
Copy link
Author

I will close the issue ...

image

cargo update -p thread_local

liuchengxu added a commit to chainx-org/ChainX that referenced this issue Feb 24, 2021
liuchengxu added a commit to chainx-org/ChainX that referenced this issue Mar 18, 2021
* Enable clippy check

* Revert clippy check

* Remove CC1 (#446)

* Bump 2.0.8-1 (#449)

* Update Substrate to bb0fb2965f9eb59c305ce0c294962d3317a29ece

but incomplete, basically make it compile, there are still some FIXMEs and TODOs.

* Fix format

* Fix test compile

* Remove debugging print

* Workaround for a potential Substrate glitch

* Add malan runtime (#463)

* ChainX TC0

* Update btc testnet header

* Use Testnet in xpallet-system

* Add ptc0 chain

* Bump spec version to 7

* Fast governance processure

* Add malan runtime

* Add malan.json

* Rename chainx-dev-runtime to dev-runtime

* Use dns in bootnode

* Fix clippy under runtime-benchmarks feature

* Update error message for invalid chain option

* Add two more malan bootnodes

* Rebuild malan genesis config

* Update btc genesis

* Use new manlan config

* Update malan.json

* Use new malan.json

* Rebuild malan runtime

Since we currently has only one executor(`chainx`), we must use
spec_name = "chainx" in the runtime code.

* Update malan.json

* Use malan.json

* Clean up unused code

* Revert chainx.json changes

* Fix malan runtime build

* Add btc_2.0.json

* Handle multi runtime properly (#473)

* Handle multi runtime properly

* Split out service crate

* Run cargo fmt

* Reenable new_light()

* Split out service/client.rs

* Cargo fmt

* Clean up

* Update Substrate to 93b231e79f5b4e551c34234e89fa4a2e5e9c1510

* Remove tmp btc_2.0.json

* Update Substrate to 81ca765646c35c7676ec2f86e718bf1f6a5cc274

* Update Substrate to bc69520ae4abb78ca89a57ff9af8887d598ce048

* Update Substrate to 075796f75f754a712ebb417c8b17633f7b88adf1

* Update Substrate to f14488dfca012659297d2b4676fab91c179095dd

* Fix test

* Fix todo! with current impl

* Update Substrate to 3.0.0

* MIgrate assets-registrar test to construct_runtime!

* Migrate xpallet_assets test to construct_runtime!

* Migrate all tests to construct_runtime!

* Update orml-* to 0.4.0

* Remove patches in Cargo.toml

* Impl ReportLongevity properly

* Cargo fmt

* Reset malan (#479)

* Add Proxy module (#478)

* Add Proxy module

* Cargo fmt

* Bump spec version

* Update to the latest master of Substrate

Ref paritytech/substrate#8117

* Fix wrong feature gate on grandpa after splitting out service crate

* Cargo fmt

* Replace sc-finality-grandpa-wrap-sync git dep using crates.io dep
liuchengxu added a commit to chainx-org/ChainX that referenced this issue Mar 19, 2021
* Fix wrong feature gate about grandpa in service

* Update btc bridge 2.0 to Substrate 3.0.0 (#509)

* Enable clippy check

* Revert clippy check

* Remove CC1 (#446)

* Bump 2.0.8-1 (#449)

* Update Substrate to bb0fb2965f9eb59c305ce0c294962d3317a29ece

but incomplete, basically make it compile, there are still some FIXMEs and TODOs.

* Fix format

* Fix test compile

* Remove debugging print

* Workaround for a potential Substrate glitch

* Add malan runtime (#463)

* ChainX TC0

* Update btc testnet header

* Use Testnet in xpallet-system

* Add ptc0 chain

* Bump spec version to 7

* Fast governance processure

* Add malan runtime

* Add malan.json

* Rename chainx-dev-runtime to dev-runtime

* Use dns in bootnode

* Fix clippy under runtime-benchmarks feature

* Update error message for invalid chain option

* Add two more malan bootnodes

* Rebuild malan genesis config

* Update btc genesis

* Use new manlan config

* Update malan.json

* Use new malan.json

* Rebuild malan runtime

Since we currently has only one executor(`chainx`), we must use
spec_name = "chainx" in the runtime code.

* Update malan.json

* Use malan.json

* Clean up unused code

* Revert chainx.json changes

* Fix malan runtime build

* Add btc_2.0.json

* Handle multi runtime properly (#473)

* Handle multi runtime properly

* Split out service crate

* Run cargo fmt

* Reenable new_light()

* Split out service/client.rs

* Cargo fmt

* Clean up

* Update Substrate to 93b231e79f5b4e551c34234e89fa4a2e5e9c1510

* Remove tmp btc_2.0.json

* Update Substrate to 81ca765646c35c7676ec2f86e718bf1f6a5cc274

* Update Substrate to bc69520ae4abb78ca89a57ff9af8887d598ce048

* Update Substrate to 075796f75f754a712ebb417c8b17633f7b88adf1

* Update Substrate to f14488dfca012659297d2b4676fab91c179095dd

* Fix test

* Fix todo! with current impl

* Update Substrate to 3.0.0

* MIgrate assets-registrar test to construct_runtime!

* Migrate xpallet_assets test to construct_runtime!

* Migrate all tests to construct_runtime!

* Update orml-* to 0.4.0

* Remove patches in Cargo.toml

* Impl ReportLongevity properly

* Cargo fmt

* Reset malan (#479)

* Add Proxy module (#478)

* Add Proxy module

* Cargo fmt

* Bump spec version

* Update to the latest master of Substrate

Ref paritytech/substrate#8117

* Fix wrong feature gate on grandpa after splitting out service crate

* Cargo fmt

* Replace sc-finality-grandpa-wrap-sync git dep using crates.io dep

* Cargo fmt
tash-2s added a commit to tash-2s/open-emoji-battler that referenced this issue Mar 25, 2021
liuchengxu added a commit to chainx-org/ChainX that referenced this issue May 6, 2021
* Update Substrate

* Fix cargo fmt

* Add runtime common

* Rename Trait to Config

* Fix tests

* Fix runtime benchmarks

* Make it compile

* Make tests compile

* Fix format

* Fix bitcoin header test

* Enable clippy check

* Revert clippy check

* Update Substrate to bb0fb2965f9eb59c305ce0c294962d3317a29ece

but incomplete, basically make it compile, there are still some FIXMEs and TODOs.

* Fix format

* Fix test compile

* Remove debugging print

* Workaround for a potential Substrate glitch

* Fix malan runtime build

* Add btc_2.0.json

* Update Substrate to 93b231e79f5b4e551c34234e89fa4a2e5e9c1510

* Remove tmp btc_2.0.json

* Update Substrate to 81ca765646c35c7676ec2f86e718bf1f6a5cc274

* Update Substrate to bc69520ae4abb78ca89a57ff9af8887d598ce048

* Update Substrate to 075796f75f754a712ebb417c8b17633f7b88adf1

* Update Substrate to f14488dfca012659297d2b4676fab91c179095dd

* Fix test

* Fix todo! with current impl

* Update Substrate to 3.0.0

* MIgrate assets-registrar test to construct_runtime!

* Migrate xpallet_assets test to construct_runtime!

* Migrate all tests to construct_runtime!

* Update orml-* to 0.4.0

* Remove patches in Cargo.toml

* Impl ReportLongevity properly

* Cargo fmt

* Update to the latest master of Substrate

Ref paritytech/substrate#8117

* Fix wrong feature gate on grandpa after splitting out service crate

* Cargo fmt

* Replace sc-finality-grandpa-wrap-sync git dep using crates.io dep

* Link the Substrate issue in comment

* Do some cleaning

* Use sp_staking::SessionIndex and uprade wasm-builder to 4.0
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
J2-unconfirmed Issue might be valid, but it’s not yet known.
Projects
None yet
Development

No branches or pull requests

7 participants