Skip to content
This repository has been archived by the owner on Nov 6, 2020. It is now read-only.

enable lto, panic = "abort" in release build #10356

Closed
wants to merge 5 commits into from

Conversation

ordian
Copy link
Collaborator

@ordian ordian commented Feb 14, 2019

  • shaves 11 MB off binary size (from 44MB to 33MB for RUSTFLAGS=" -C link-arg=-s" cargo build --release, -8MB thanks to LTO)
  • run time should be faster, but I haven't measured

Last time we disabled panic = "abort" in #8983 (cc @sorpaas) due to failing build.

@ordian ordian added A0-pleasereview 🤓 Pull request needs code review. A2-insubstantial 👶 Pull request requires no code review (e.g., a sub-repository hash update). F7-footprint 🐾 An enhancement to provide a smaller (system load, memory, network or disk) footprint. labels Feb 14, 2019
@ordian ordian added this to the 2.4 milestone Feb 14, 2019
@ordian ordian changed the title enable lto, panic = "abort" enable lto, panic = "abort" in release build Feb 14, 2019
@5chdn 5chdn added B1-patch-beta 🕷🕷 B0-patch-stable 🕷 Pull request should also be back-ported to the stable branch. labels Feb 14, 2019
Copy link
Collaborator

@niklasad1 niklasad1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is awesome but it increases the build-time significantly (roughly 2x) for me but definitely worth the effort, however I'm not really aware of the exact differences between panic=unwind and panic=abort I'll leave it to the pro's

# this branch
➜  parity-ethereum git:(optimize-release-build) ✗ size --format=SysV ./target/release/parity
size --format=SysV ./target/release/parity
./target/release/parity  :
section                  size       addr
.interp                    28        736
.note.ABI-tag              32        764
.note.gnu.build-id         36        796
.gnu.hash                1004        832
.dynsym                  9744       1840
.dynstr                  9346      11584
.gnu.version              812      20930
.gnu.version_r            768      21744
.rela.dyn             1237392      22512
.rela.plt                4584    1259904
.init                      23    1265664
.plt                     3072    1265696
.plt.got                  504    1268768
.text                25521701    1269312
.fini                       9   26791016
.rodata               4913113   26791936
.eh_frame_hdr          168428   31705052
.eh_frame             1142048   31873480
.gcc_except_table       66692   33015528
.tdata                    818   33089920
.tbss                     680   33090752
.init_array              1064   33090752
.fini_array                 8   33091816
.data.rel.ro           980640   33091840
.dynamic                  624   34072480
.got                     5616   34073104
.data                   11408   34078720
.bss                    31744   34090144
.comment                   56          0
Total                34111994
# master
parity-ethereum git:(master) ✗ size --format=SysV ./target/release/parity
./target/release/parity  :
section                  size       addr
.interp                    28        736
.note.ABI-tag              32        764
.note.gnu.build-id         36        796
.gnu.hash                1004        832
.dynsym                  9960       1840
.dynstr                  9537      11800
.gnu.version              830      21338
.gnu.version_r            800      22168
.rela.dyn             1708824      22968
.rela.plt                4608    1731792
.init                      23    1736704
.plt                     3088    1736736
.plt.got                  504    1739824
.text                28453253    1740352
.fini                       9   30193608
.rodata               8373294   30195712
.eh_frame_hdr          525348   38569008
.eh_frame             3148480   39094360
.gcc_except_table     1451248   42242840
.tdata                    752   43701120
.tbss                     672   43701888
.init_array              1064   43701888
.fini_array                 8   43702952
.data.rel.ro          1221272   43702976
.dynamic                  624   44924248
.got                    49208   44924872
.data                   11466   44974080
.bss                    31760   44985568
.comment                   56          0
Total                45007788

@niklasad1 niklasad1 requested a review from arkpar February 14, 2019 15:57
@ordian
Copy link
Collaborator Author

ordian commented Feb 14, 2019

This is awesome but it increases the build-time significantly (roughly 2x) for me

Don't we build --release only on releasable branches (stable and beta)?
Also, IIUC, profile.release shouldn't affect cargo test --release (should be covered by profile.bench).
EDIT: I might be wrong

The release profile, used for cargo build --release (and the dependencies
for cargo test --release, including the local library or binary)

Copy link
Collaborator

@sorpaas sorpaas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Last time when we disabled abort, besides building fails, one of the other reason is that we got bug reports that some state trie items are missing, or blocks are missing, which triggers the client to panic. One of the probable reason is that the client doesn't shutdown correctly (through another panic, for example), and left in the middle of an import. In that case, we can be in a situation where we commit some metadata, but not actually committed the block itself, or we may in the middle of committing the state trie. Using rewind would potentially allows the shutdown to be happen more gracefully in case a panic happened (while the panicking thread would fail, other threads can receive the shutdown signal and do stuff if needed), but I also haven't really done deep analysis of how much benefits we would gain with rewind.

We also discussed enabling lto last time. The reason why we didn't at that time was that devops worried that lto would make release build machine fail, because it can take a large amount of memory. But if devops confirm that this is fine now (if we have upgraded our machines!), then this shouldn't be an issue any more.

@ordian
Copy link
Collaborator Author

ordian commented Feb 14, 2019

Regarding LTO:

./test.sh on CI on this branch:

7326.09user 2098.40system 14:40.37elapsed 1070%CPU 5.8 GB mem

./test.sh on CI for other PRs:

4002.26user 536.6.9system 11:39.47elapsed 648%CPU 4.3 GB mem
4006.64user 1955.58system 15:45.66elapsed 630%CPU 5.4 GB mem link
4752.03user 447.98system 18:36.40elapsed 465%CPU link

Looks like the overall time for building + running tests hasn't changed much. Max RSS does not seem to be too large.
Alternatively, we can backport this PR to beta and stable without merging into master.

Now on to panic = "abort":

@arkpar argues here that:

Database has to be resistant to power loss anyway.

But now, in addition to KeyValueDB, we have blooms-db, which is not power loss resistant. So that's a valid concern, I guess.

EDIT: OTOH, we're using parking_lot, which is not panic safe, so panic = "abort" could potentially make unclean shutdown situation better as well.

@arkpar
Copy link
Collaborator

arkpar commented Feb 14, 2019

When I last checked (around 2017) panic="abort" did have a substantial effect of performance. Block import was around 5-10% faster.
The fact that there are three databases now makes atomic updates much harder. It would have been much better to use column families instead, as these can be written with a single atomic transaction.
Even with panic="unwind"the three databases my go out of sync because of the power loss or process termination. There needs to be some additional code that would detect this by e.g. comparing best block number or hash stored in all of the databases. In case of mismatch it would either have to rewind the main database to an older block, or rebuild the missing traces.

Anyway, panic and abort in the middle of an import won't lead to main database corruption because transaction atomicity is guaranteed by rocksdb, and as far as I know all of the block import is done in a single transaction.

@ordian
Copy link
Collaborator Author

ordian commented Feb 15, 2019

Ok, it seems like we all agree on the LTO change, if someone from devops can confirm it won't kill our machines (@gabreal ?).

Anyway, panic and abort in the middle of an import won't lead to main database corruption because transaction atomicity is guaranteed by rocksdb, and as far as I know all of the block import is done in a single transaction.

This matches what @debris said here:

Existence of blooms is not consensus critical. If one of the writes fail (or partially fail), the particular block will be reimported on the next launch of parity and the database will overwrite corrupted data

If we agree on panic = "abort", should #8999 be reverted?
If we don't, I can revert that change, or, alternatively, port it only to beta, to see if it would lead to more bug reports. Any thoughts?

@gabreal
Copy link
Contributor

gabreal commented Feb 15, 2019

As far as I can see, there should be no problem of memory anymore, but be aware that the test job run time increases approx about 7min (for this job there is only one sample). @TriplEight might also like to know.

@5chdn
Copy link
Contributor

5chdn commented Feb 15, 2019

Don't we build --release only on releasable branches

tests are run in release mode afaik.

@TriplEight
Copy link
Collaborator

TriplEight commented Feb 18, 2019

Thanks for mentioning @gabreal .

tests are run in release mode afaik.

Yes and here. And would be really good if we could avoid it to make this features work only for the release builds.

Currently merging this will slow down our already slow test pipelines.

Please read about LTO here, really a good read, and there are some benchmarks of different features.

Compilers build and optimize compilation units individually. LTO is a mechanism for taking those 
individually-optimized object files and finding more opportunities for optimization when they are 
considered as a group. This is great if you want to have the fastest possible program, but the price you 
pay are higher compilation times. In Rust, the compilation unit is the crate; if the heavy computations 
are all done in your crate, then maybe LTO isn’t necessary.

Regarding panic handling, panic = abort just does not include unwind libraries to the release, that's how it becomes lighter.

By removing support for unwinding, you'll get smaller binaries. You will lose the ability to catch panics. 

That way we are refusing bug stack traces from users.

@ordian
Copy link
Collaborator Author

ordian commented Feb 18, 2019

Currently merging this will slow down our already slow test pipelines.

That's why I suggested to only backport it to stable and beta, w/o merging into master.

That way we are refusing bug stack traces from users.

catch panics != stack traces, we'll see stack traces

@TriplEight
Copy link
Collaborator

That way it makes sense @ordian .

@TriplEight
Copy link
Collaborator

TriplEight commented Feb 18, 2019

Just had a talk with Pierre. He suggested that we should set the

[profile.test]
opt-level=0 # or 1
panic=unwind
lto=false

that way we can get rid of --release for all the tests, in general it should make tests faster.

@ordian
Copy link
Collaborator Author

ordian commented Feb 18, 2019

in general it should make tests faster.

It would make tests faster to compile but slower to run, I think the reason we're using cargo test --release in the first place is because the overall (compile + run) time is faster than the one w/o --release, but it would be nice to see if it can be reduced with opt-level=1.

@TriplEight
Copy link
Collaborator

I'll try this opt-level=1 locally, will report

@gabreal
Copy link
Contributor

gabreal commented Feb 18, 2019

I think the reason we're using cargo test --release in the first place is because the overall (compile + run) time is faster than the one w/o --release,

In fact I changed that a while ago to have the tests perform as close as possible as they are used in the wild.

@ordian ordian force-pushed the optimize-release-build branch from 94e5824 to 3866c81 Compare February 20, 2019 10:05
@ordian ordian force-pushed the optimize-release-build branch from 3866c81 to e91ff1f Compare February 20, 2019 10:19
@ordian
Copy link
Collaborator Author

ordian commented Feb 20, 2019

I've split the change into two commits, the second commit also removes the std::process::abort() call from the panic hook.

My suggestion is to backport the first commit to stable and both to beta.

@TriplEight
Copy link
Collaborator

@5chdn 5chdn removed this from the 2.4 milestone Feb 21, 2019
@5chdn 5chdn modified the milestones: 2.5, 2.6 Feb 21, 2019
5chdn and others added 3 commits March 26, 2019 23:31
* fix broken sync

* correct seal fields

* ethcore: fix comment

* parity: remove duplicate params

* clique: fix whitespaces

* ethcore: fix goerli chain spec

* refactor signer_snapshot into pending/finalized state

* move close_block_extra_data after seal is applied

* refactor most of the logic into the signer_snapshot

* clique: refactor locking logic out of the consensus engine interface

* Fix jsonspec and add an unittest

* Replace space with tabs

* Unbroke sync

* Fix broken sync

* 1/2 state tracking without votes

* 2/2 implement vote tracking

* ci: use travis for goerli

* ci: setup a clique network

* ci: sync a görli node

* add clique deploy script

* ci: fix paths in clique deploy script

* ci: use docker compose

* ci: fix travis job names

* ci: fix build deps

* ci: massively reduce tests

* Revert "ci: massively reduce tests"

This reverts commit 6369f0b.

* ci: run cargo test directly

* ci: separate build and test stages

* ci: cache rust installation

* ci: simplify ci stages

* ci: make clique deploy script executable

* ci: shutdown goerli sync after 20min

* ci: remove slow sync stage

* ci: use timeout to finish jobs

* ci: fix build path

* ci: use absolute paths to end this confusion

* ci: add geth and parity to path

* ci: be more verbose

* ci: allow for more relaxed caching timeout

* ci: update repositories for custom ppa

* ci: fix typo in file name

* ci: fix docker compose file

* ci: add ethkey to docker

* ci: make sure deploy script is up to date with upstream

* ci: stop docker container after certain time

* ci: force superuser to update permissions on docker files

* ci: reduce run time of script to ~30 min

* ci: remove duplicate caching in travis

* remove trace statements

* clique: add more validation involving the recent signer list

* ethcore: enable constantinople for rinkeby

* ethcore: fix whitespaces in rinkeby spec

* ethcore: reformat goerli.json

* Revert "ci: remove duplicate caching in travis"

This reverts commit a562838.

* tmp commit

* another tmp commit

* it builds!

* add sealing capabilities

* add seal_header hook to allow separation of block seal/importing code paths

* clique: remove populate_from_parent.

* add panic

* make turn delay random

* initialize OpenBlock properly in 'enact'

* misc: remove duplicate lines

* misc: fix license headers

* misc: convert spaces to tabs

* misc: fix tabs

* Update Cargo.toml

* Update Cargo.toml

* Update Cargo.toml

* clique: ensure validator restores state before trying to seal

* clique: make 'state' return an Error.  Make some error messages more clear

* Fix compile error after rebase & toolchain upgrade

* fix a bunch of import warnings

* Refactor code

* Fix permissions

* Refactoring syncing

* Implement full validator checks

* Refactor util functions to seperate file

* mining 1

* ethcore: add chainspec for kotti

* ethcore: rename pre-goerli configs

* ethcore: load kotti chain spec

* cli: add kotti to params

* Implement working local sealing

* making sealing & syncing work together

* Relax timestamp checking

* ethcore: prepare for the real goerli to launch

* Implement NOTURN wiggle properly & cleanupnup warnings

* Implement vote casting

* Update docs & skip signing if no signer

* Optimize step-service interval

* Record state on local sealed block

* Fix script filemode

* Cleaning up codebase

* restore enact trace logging

* Delete clique.sh and move sync.sh

* remove travis.yml

* Remove dead code

* Cleanup compile warning

* address review comments

* adding more comments and removing unwrap()

* ci: remove sync script

* Address review comments

* fix compile error

* adding better debugging for timing

* Implement an dedicated thread for sealing timing

* fix(add helper for timestamp overflows) (#10330)

* fix(add helper timestamp overflows)

* fix(simplify code)

* fix(make helper private)

* snap: official image / test (#10168)

* official image / test

* fix / test

* bit more necromancy

* fix paths

* add source bin/df /test

* add source bin/df /test2

* something w paths /test

* something w paths /test

* add source-type /test

* show paths /test

* copy plugin /test

* plugin -> nil

* install rhash

* no questions while installing rhash

* publish snap only for release

* fix(docker): fix not receives SIGINT (#10059)

* fix(docker): fix not receives SIGINT

* fix: update with reviews

* update with review

* update

* update

* Don't add discovery initiators to the node table (#10305)

* Don't add discovery initiators to the node table

* Use enums for tracking state of the nodes in discovery

* Dont try to ping ourselves

* Fix minor nits

* Update timeouts when observing an outdated node

* Extracted update_bucket_record from update_node

* Fixed typo

* Fix two final nits from @todr

* change docker image based on debian instead of ubuntu due to the chan… (#10336)

* change docker image based on debian instead of ubuntu due to the changes of the build container

* role back docker build image and docker deploy image to ubuntu:xenial based (#10338)

* Bundle protocol and packet_id together in chain sync (#10315)

Define a new `enum` where devp2p subprotocol packet ids (currently eth and par) are defined. Additionally provide functionality to query id value and protocol of a given id object.

* snap: prefix version and populate candidate channel (#10343)

* snap: populate candidate releases with beta snaps to avoid stale channel

* snap: prefix version with v*

* addressing review comments

* engine: fix copyright header

* scripts: restore permissions on sign command

* ethcore: enforce tabs

* ethcore: enforce tabs

* ethcore: enforce tabs

* addressing comments

* addressing comments

* addressing more comments

* addressing more comments

* addressing more comments

* addressing more comments

* addressing more comments

* json-spec: fix clique epoch to non-zero u64

* ci: enable travis for parity goerli

* ci: don't separate build and test step

* ci: don't run c++ tests on travis

* ci: simplify cargo test to squeeze into travis timeout

* ci: don't run tests on travis at all

* style(fixes)

* fix(add tests)

* fix(recent_signer bug)

* fix(complete all tests)

* fix(nits)

* fix(simplify asserts)

* fix(cliqueState): simplify code

* fix(nits)

* docs(comments what's need to fixed)

* fix(revert unintended changes)

* fix(tests)

* fix(logs): voting logs

* fix(readability + more logs)

* fix(sync)

* docs(add missing licens header)

* fix(log): info! -> trace!

* docs(fix nits) + fix(remove assert)

* perf(use counter instead of vec)

* fix(remove needless block in match)

* fix(faulty comment)

* grumbles(docs for tests)

* fix(nits)

* fix(revert_vote): only remove vote when votes == 0

* fix(vote counter): checked arithmetics

* fix(simplify tests)

* fix(nits)

* fix(clique): err types

* fix(clique utils): make use of errors

* fix(cleanup nits)

* fix(clique sealing): don't read state no signer

* fix(replace Vec<Signers> with BTreeSet<Signers>)

* fix(tests): BTreeSet and more generic helpers

* fix(nits)

* fix(ethcore_block_seal): remove needless `Box`

* fix(faulty log): info -> trace

* fix(checked SystemTime): prevent SystemTime panics

* style(chain cfg): space after `:`

* style(fn enact): fix whitespace

* docs(clique): StepService

* docs(nit): fix faulty comment

* docs(fix typo)

* style(fix bad indentation)

* fix(bad regex match)

* grumble(on_seal_block): make `&mut` to avoid clone

* docs(on_seal_block): fix faulty documentation

* Delete .travis.yml

* docs: remove eth hf references in spec

* Update client.rs

* fix(nits)

* fix(clique step): `RwLock` -> `AtomicBool`

* fix(clique): use `Duration::as_millis`

* Clean up some Clique documentation
* Add trace information to eth_estimateGas

* replace unwrap better version

* change vm::Error formatter to more user-friendly

* remove extra error format

* use map_or instead sequence of map/unwrap_or
@soc1c soc1c modified the milestones: 2.5, 2.6 Apr 2, 2019
@dvdplm
Copy link
Collaborator

dvdplm commented Jun 4, 2019

To my ears it sounds like we want lto = true for release builds and lto = false for tests (but otherwise keep the same optimisation level as release for tests).

Panic handling seems more contentious and it sounds like we need more testing and hand-wringing on that one.

@ordian can you PR those two changes separately? The diff here is a little confusing with all the Clique code in.

@ordian
Copy link
Collaborator Author

ordian commented Jun 4, 2019

To my ears it sounds like we want lto = true for release builds and lto = false for tests (but otherwise keep the same optimisation level as release for tests).

I'm not sure how to enable it this way, since we run tests with --release flag if I'm not mistaken. So I suggested to not this PR into master, but rather backport commits into stable and beta. What do you think?

The diff here is a little confusing with all the Clique code in.

Yeah, that's due to force rebase on master.

@ordian ordian closed this Jun 4, 2019
@dvdplm
Copy link
Collaborator

dvdplm commented Jun 4, 2019

since we run tests with --release flag

Forgive me, but I'm slow. Can't we stop running tests that way and instead switch to:

[profile.test]
# proper optimization levels to match --release here
lto = false

…i.e. switch CI to run cargo test instead of cargo test --release? I must be missing something!

@ordian ordian deleted the optimize-release-build branch June 4, 2019 13:25
@TriplEight
Copy link
Collaborator

@dvdplm yes, this thread is hard.

Can't we stop running tests that way and instead switch to

AFAIU we can do that.

@TriplEight
Copy link
Collaborator

TriplEight commented Jun 4, 2019

@dvdplm, @ordian something went wrong, I suggest falling back.
Pipeline with this change: https://gitlab.parity.io/parity/parity-ethereum/pipelines/39507/builds
Any other pipeline: https://gitlab.parity.io/parity/parity-ethereum/pipelines/39408/builds
Check out the time it took to test, it's 48 vs 11 min!

@dvdplm
Copy link
Collaborator

dvdplm commented Jun 5, 2019

@TriplEight the links are the same there, not sure where to look. Can fix? :)

@TriplEight
Copy link
Collaborator

@dvdplm I edited in the GitHub UI

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
A0-pleasereview 🤓 Pull request needs code review. A2-insubstantial 👶 Pull request requires no code review (e.g., a sub-repository hash update). B0-patch-stable 🕷 Pull request should also be back-ported to the stable branch. F7-footprint 🐾 An enhancement to provide a smaller (system load, memory, network or disk) footprint.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants