
ci: Switch to Github Actions and remove Travis CI #4357

Merged: 37 commits into ElementsProject:master on Jan 28, 2021

Conversation

@cdecker (Member) commented on Jan 26, 2021:

With Travis CI refusing to build any of our configurations, it's time to switch to something hopefully a bit more stable.

@rustyrussell (Contributor) left a comment:

Great improvements in here! Pretty epic, in fact.

I will test on my local machine to see if any of the test fixes broke things for me.

Resolved review threads on .github/workflows/ci.yaml and tests/test_pay.py.
We weren't waiting for the transactions to enter the mempool, which could throw off all of our fine-tuned block counts. Now we just wait for the expected number of txs.
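A minimal sketch of that pattern, assuming pyln-testing's `wait_for` helper and its `node_factory`/`bitcoind` fixtures; the withdraw scenario is illustrative, not the exact test that was fixed:

```
from pyln.testing.utils import wait_for


def test_mempool_sync(node_factory, bitcoind):
    l1, l2 = node_factory.line_graph(2)
    before = len(bitcoind.rpc.getrawmempool())
    l1.rpc.withdraw(l2.rpc.newaddr()['bech32'], 10000)
    # Block until the tx actually reaches the mempool before mining,
    # otherwise later fine-tuned block-count assertions drift.
    wait_for(lambda: len(bitcoind.rpc.getrawmempool()) == before + 1)
    bitcoind.generate_block(1)
```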
We were getting a number of incompatibility warnings due to the dependencies being expressed too rigidly. This loosens the requirement definitions to be compatible with a known good version, and while we're at it we also bump all outdated requirements.
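For illustration, the difference between a rigid pin and a PEP 440 compatible-release specifier; the package name and versions are examples, not the exact entries that were changed:

```
# Before: an exact pin that triggers incompatibility warnings downstream.
pytest==6.0.1
# After: compatible with a known good version, i.e. >= 6.2 and < 7.0.
pytest ~= 6.2
```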
The timeout on the pay future was too short under valgrind.
We weren't waiting for the `dev_fail` transaction to hit the mempool,
throwing the results off.
Syncing with the blockchain was slower than our timeout...
It requires `--dev-force-features`, which isn't available without `DEVELOPER=1`.
We also make the logic a bit nicer to read. The failure was due to
more than one status message being present if we look at the wrong
time:

```
arr = ['CLOSINGD_SIGEXCHANGE:We agreed on a closing fee of 20334 satoshi for tx:17f1e9d377840edf79d8b6f1ed0faba59bb307463461...9b98', 'CLOSINGD_SIGEXCHANGE:Waiting for another closing fee offer: ours was 20334 satoshi, theirs was 20332 satoshi,']

     def only_one(arr):
         """Many JSON RPC calls return an array; often we only expect a single entry
         """
 >       assert len(arr) == 1
 E       AssertionError
```
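A hedged sketch of the nicer check: rather than asserting that the status array has exactly one entry, match the line we expect. The field layout follows `listpeers`; the exact status prefix is illustrative:

```
status = l1.rpc.listpeers(l2.info['id'])['peers'][0]['channels'][0]['status']
# Both the agreement line and the "waiting for another offer" line can
# be present at once, so don't use only_one(); look for the expected one.
assert any(s.startswith('CLOSINGD_SIGEXCHANGE:We agreed') for s in status)
```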
They were using TIMEOUT / 2, which may be way too long (running up against the test timeout), so we use a still-ludicrous 30 seconds instead.
We were getting a couple of starvations, so we need a fair filelock. I
also wasn't too happy with the lock as is, so I hand-coded it quickly.

Should be correct, but the overall timeout will tell us how well we
are doing on CI.
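The hand-coded lock itself isn't shown in this conversation; below is a minimal sketch of one fair (FIFO) scheme built on ticket files. The class name and on-disk layout are illustrative, not the code that landed:

```
import os
import time


class FairFileLock:
    """First-come-first-served file lock: waiters queue on ticket files,
    so no process can starve as it could with a plain retry-until-free
    lock."""

    def __init__(self, directory):
        self.directory = directory
        os.makedirs(directory, exist_ok=True)
        self.ticket = None

    def acquire(self):
        # Ticket names sort by arrival time; the pid breaks ties between
        # processes that grab a ticket in the same nanosecond.
        self.ticket = 'ticket-{:020d}-{:08d}'.format(time.time_ns(),
                                                     os.getpid())
        open(os.path.join(self.directory, self.ticket), 'w').close()
        # We hold the lock once our ticket is the oldest outstanding one.
        while sorted(os.listdir(self.directory))[0] != self.ticket:
            time.sleep(0.1)

    def release(self):
        os.unlink(os.path.join(self.directory, self.ticket))
        self.ticket = None
```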
We were sometimes waiting only 5 seconds, which is way too short on a
heavily loaded machine such as CI. Making it 30 seconds and collecting
it in a single place so we can adjust more easily.
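Collected in one place, that might look like the following; the env override mirrors how pyln-testing exposes its timeout, but treat the default as illustrative:

```
import os

# One knob for all polling waits: 30 seconds by default, overridable
# through the environment for particularly slow CI machines.
TIMEOUT = int(os.environ.get('TIMEOUT', '30'))
```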
We were using a lot of Docker conventions which are not necessary in the script itself.
We have a couple of very heavy tests bunched together; randomizing the order could lessen the peak load.
This should really be set by the environment, either by creating a pytest.ini or by setting the PYTEST_OPTS envvar.
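As an illustration, an environment-provided `pytest.ini` could carry those options; the options shown are placeholders, not the CI's actual settings:

```
[pytest]
addopts = -v --timeout=1800
```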
If we're quick (or the node is slow) we end up reconnecting before our
counterparty has realized the state transition, resulting in an
unexpected re-establish.
The test was not considering that concurrent sendrawtx of the same tx is not stable: either endpoint may submit it first. Now we just check state transitions and the mempool.
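A hedged sketch of the stable check, reusing pyln-testing's `wait_for`; the helper name, txid, and log line are stand-ins for what the test actually derives:

```
from pyln.testing.utils import wait_for


def wait_for_broadcast(bitcoind, node, txid):
    # Either endpoint may win the sendrawtransaction race, so only the
    # mempool contents and the state transition are stable to assert on.
    wait_for(lambda: txid in bitcoind.rpc.getrawmempool())
    node.daemon.wait_for_log('to CLOSINGD_COMPLETE')
```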
Reconnections and unsynchronized states were causing us some issues.
openchannel internally generates blocks, which may cause nodes to be
out of sync and ignore "future" channel announcements, resulting in
bad gossip.
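One way to avoid that, sketched with pyln-testing's `sync_blockheight` helper (the wrapper and node list are illustrative):

```
from pyln.testing.utils import sync_blockheight


def settle_gossip(bitcoind, nodes):
    # Bring every node up to bitcoind's tip before gossip assertions,
    # so none of them discards "future" channel announcements.
    sync_blockheight(bitcoind, nodes)
```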
rerunfailures still isn't working.
Can be used in a second stage to generate stats and detect flaky
tests.
We don't have a good way of referring to the configuration that failed, so let's give them a numeric ID. Particularly useful for the artifacts, which would otherwise be overwritten.
We are printing `repr(obj)`, which is not pretty-printed, is hard to read, and can't even be copied into JSON tools for inspection. We now print the JSONified and indented calls and responses for easier debugging based solely on the logs (useful for CI!).

Changelog-Added: pyln-testing: The RPC client will now pretty-print requests and responses to facilitate log-based debugging.
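A minimal sketch of the idea using the stdlib `json` and `logging` modules; the logger name and message format are illustrative, not pyln-client's exact output:

```
import json
import logging

log = logging.getLogger('pyln.rpc')


def log_call(method, payload, response):
    # Indented JSON instead of repr(obj): readable in CI logs and
    # copy-pasteable into JSON tools.
    log.debug('calling %s with payload %s', method,
              json.dumps(payload, indent=2, sort_keys=True))
    log.debug('%s response: %s', method,
              json.dumps(response, indent=2, sort_keys=True))
```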
@rustyrussell (Contributor) commented:
Tried rerunning, and it breaks in different ways every time. I can't see how to run a single job, it only lets me do them all :(

@cdecker (Member, Author) commented on Jan 28, 2021:

> Tried rerunning, and it breaks in different ways every time. I can't see how to run a single job, it only lets me do them all :(

Yep, that's a common feature request on GitHub's own tracker. Hopefully they'll implement it some time soon. I'll continue stabilizing tests after this is merged, making CI more reliable and reducing the number of reruns we need. So far I have a failure rate of about 1/4, but each fixed test has a big effect on that.

Since we have the reports generated as artifacts, we can easily integrate them into my flaky test tracking tool (http://46.101.246.115:5001/) and then hunt them down more easily.

@cdecker marked this pull request as ready for review on January 28, 2021 at 10:57.
@cdecker (Member, Author) commented on Jan 28, 2021:

I should have mentioned that I've been rescheduling the runs a couple of times to see if new flakes show up, and to fix known ones, so not surprisingly some of the runs will remain in a failed state because I followed them up with a fix :-)

@rustyrussell (Contributor) commented:
Ack f2b8355

@rustyrussell merged commit a5f16ab into ElementsProject:master on Jan 28, 2021.