Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

[meta] Synapse does not currently pass all of Complement's tests #8421

Closed
13 tasks done
anoadragon453 opened this issue Sep 29, 2020 · 16 comments
Closed
13 tasks done

[meta] Synapse does not currently pass all of Complement's tests #8421

anoadragon453 opened this issue Sep 29, 2020 · 16 comments
Assignees
Labels
A-Testing Issues related to testing in complement, synapse, etc

Comments

@anoadragon453
Copy link
Member

anoadragon453 commented Sep 29, 2020

Synapse is currently failing to pass all of the tests in Complement. Fixing this would be a requirement to removing the "soft-fail" attribute for running it in CI.

As of now, the tests Synapse is failing are:

  • TestRegistration/parallel/POST_/register_returns_the_same_device_id_as_that_in_the_request
  • TestRegistration/parallel/POST_/register_can_create_a_user
  • TestRegistration/parallel/POST_/register_downcases_capitals_in_usernames
  • TestOutboundFederationProfile
  • TestOutboundFederationIgnoresMissingEventWithBadJSONForRoomVersion6
  • TestJoinViaRoomIDAndServerName
  • TestOutboundFederationSend
  • TestMediaWithoutFileName/parallel/Can_upload_without_a_file_name
  • TestMediaWithoutFileName/parallel/Can_download_without_a_file_name_locally
  • TestMediaWithoutFileName/parallel/Can_download_without_a_file_name_over_federation
  • TestRoomCreationReportsEventsToMyself
  • TestKeysQueryWithDeviceIDAsObjectFails
  • TestJoinFederatedRoomWithUnverifiableEvents

Anecdotally, it's worth mentioning that all tests failed when my personal VPN was on, failing with:

2020/09/29 13:19:04 Failed to create docker network: Error response from daemon: could not find an available, non-overlapping IPv4 address pool among the defaults to assign to the network
    federation_room_send_test.go:22: Deploy: Failed to construct blueprint: failed to find built images via ImageList: did they all build ok?
@clokep
Copy link
Member

clokep commented Oct 14, 2020

@anoadragon453 Was this just using the ./scripts-dev/complement.sh script? Anything else to do?

@clokep
Copy link
Member

clokep commented Oct 20, 2020

From some discussion with @kegsay he thinks the error above is due to a VPN.

I was able to get an actual error out (at least from a particular test):

    client.go:196: Making POST request to http://localhost:32932/_matrix/client/r0/join/%23flibble:host.docker.internal?access_token=MDAxMWxvY2F0aW9uIGhzMQowMDEzaWRlbnRpZmllciBrZXkKMDAxMGNpZCBnZW4gPSAxCjAwMWRjaWQgdXNlcl9pZCA9IEBhbGljZTpoczEKMDAxNmNpZCB0eXBlID0gYWNjZXNzCjAwMjFjaWQgbm9uY2UgPSBLM1ljSmV6NTNiZGFXOm5UCjAwMmZzaWduYXR1cmUggOcv-jrBDB-mNGvv7tELo-w6mh1t9ob3CK0ZeYxDNFEK
    client.go:196: Request body: {}
    client.go:262: POST /_matrix/client/r0/join/#flibble:host.docker.internal => 401 Unauthorized (6.531592ms)
    client.go:196: HTTP/1.1 401 Unauthorized
        Transfer-Encoding: chunked
        Access-Control-Allow-Headers: Origin, X-Requested-With, Content-Type, Accept, Authorization
        Access-Control-Allow-Methods: GET, HEAD, POST, PUT, DELETE, OPTIONS
        Access-Control-Allow-Origin: *
        Cache-Control: no-cache, no-store, must-revalidate
        Content-Type: application/json
        Date: Tue, 20 Oct 2020 18:55:35 GMT
        Server: Synapse/1.21.2
        
        55
        {"errcode":"M_UNKNOWN_TOKEN","error":"Unrecognised access token","soft_logout":false}
        0

So looks like something is going run with logins, which would not be great. :)

@clokep
Copy link
Member

clokep commented Oct 29, 2020

I took a look at this I think what is happening is that the test run is starting with a fresh database (instead of re-using the database from the blueprint run that creates users). This is with a sqlite database and @kegsay said that the Docker flow makes a full copy of the file system, so to look into whether the data is being flushed to sqlite3 before the registration endpoint returns a response.

I believe that Python flushes when you commit, which we do before the endpoint returns so I was a bit surprised by this not working.

I was running the following to only run a subset of tests:

COMPLEMENT_DEBUG=1 COMPLEMENT_BASE_IMAGE=complement-synapse go test -v -count=1 -run '^(TestMediaWithoutFileName)$' -parallel=1 -failfast ./tests/

I also tried making some local changes and running the following in my synapse dir:

docker build -t matrixdotorg/synapse:latest -f docker/Dockerfile .

@anoadragon453
Copy link
Member Author

@anoadragon453 Was this just using the ./scripts-dev/complement.sh script? Anything else to do?

Sorry yes, just this. Thanks for providing some more commands.

I took a look at this I think what is happening is that the test run is starting with a fresh database (instead of re-using the database from the blueprint run that creates users).

This sounds plausible.

@anoadragon453
Copy link
Member Author

anoadragon453 commented Nov 2, 2020

So I believe this is indeed happening due to the container being copied before the SQLite3 database is written to. This likely won't be a problem with Postgres as we now autocommit while running.

I've tracked down the meat of this to the following steps.

We create all necessary homeserver containers from the provided image: https://github.com/matrix-org/complement/blob/0e345aa81e79dbf147a8a4e92198f2a62223457d/internal/docker/builder.go#L220-L224

and run all blueprint commands in them (creating users, rooms, etc): https://github.com/matrix-org/complement/blob/0e345aa81e79dbf147a8a4e92198f2a62223457d/internal/docker/builder.go#L262-L288

The homeserver in each container is ready and has all the information we need to run the requested test against it. However, this information has not actually been committed to the db yet. It's just sitting inside of Synapse - in RAM effectively.

Information about each created container is stored in results, which is then iterated over in the next loop: https://github.com/matrix-org/complement/blob/0e345aa81e79dbf147a8a4e92198f2a62223457d/internal/docker/builder.go#L244-L251

Each of the containers is committed, which creates a new image from the changes in the container. However, only the filesystem is copied here. Everything in RAM is thrown away.

I believe stopping the container before committing, thus shutting down Synapse and allowing it time to flush all changes to the db, will help here.

Now excuse me as I broke my docker in the meantime and can't finishing debugging this :^)

Edit: Rebooted to fix docker. Though unfortunately it seems stopping the container before committing it still doesn't carry the necessary database changes over to the deployed image that's used in the test 🤔

@anoadragon453
Copy link
Member Author

Right, forget all of the above. It's because the database is put in /data, which is a docker volume.

This is defined at: https://github.com/matrix-org/synapse/blob/develop/docker/Dockerfile#L75

This won't be saved on committing to an image, so you end up with an entirely new database on startup.

@anoadragon453
Copy link
Member Author

anoadragon453 commented Nov 2, 2020

DB fix is at matrix-org/complement#29

@anoadragon453
Copy link
Member Author

anoadragon453 commented Nov 2, 2020

I seem to get:

2020/11/02 20:47:48  : Failed to kill container : Error response from daemon: page not found

randomly, which fails all tests and requires a reboot to fix once it starts. Anyone have a solution handy? :)

According to JC this is due to the container not starting correctly, which the defer ... killContainer bit then returns an HTTP 404 error code on (because it can't kill a container it never created).

Go's HTTP client then reports "page not found". So... why isn't the container getting started? I don't know, problem for tomorrow :)

@anoadragon453
Copy link
Member Author

So it turns out Synapse can sometimes take some time to start, and Complement gives up waiting for it after 2500ms.

Setting the COMPLEMENT_VERSION_CHECK_ITERATIONS environment variable to something higher than the default 50 helps (I've set it to 100).

@anoadragon453
Copy link
Member Author

anoadragon453 commented Nov 3, 2020

The next problem to solve is certificate verification failure:

2020-11-03 11:54:53,784 - synapse.http.matrixfederationclient - 544 - INFO - GET-2 - {GET-O-1} [host.docker.internal] Request failed: GET matrix://host.docker.internal/_matrix/media/r0/download/host.docker.internal/PlainTextFile?allow_remote=false: ResponseNeverReceived:[Error([('SSL routines', 'tls_process_server_certificate', 'certificate verify failed')])]
2020-11-03 11:54:53,785 - synapse.util.retryutils - 215 - INFO - GET-2 - Connection to host.docker.internal was unsuccessful (<class 'synapse.api.errors.RequestSendFailed'>(Failed to send request: ResponseNeverReceived: [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', 'tls_process_server_certificate', 'certificate verify failed')]>])); backoff now 600000
2020-11-03 11:54:53,786 - synapse.rest.media.v1.media_repository - 392 - WARNING - GET-2 - Request failed fetching remote media host.docker.internal/PlainTextFile: RequestSendFailed("Failed to send request: ResponseNeverReceived: [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', 'tls_process_server_certificate', 'certificate verify failed')]>]")
2020-11-03 11:54:53,788 - synapse.http.server - 74 - INFO - GET-2 - <SynapseRequest at 0x7f89919db190 method='GET' uri='/_matrix/media/r0/download/host.docker.internal/PlainTextFile?access_token=<redacted>' clientproto='HTTP/1.1' site=8008> SynapseError: 502 - Failed to fetch remote media

We need to configure Synapse - much like in the demo scripts - to be able to allow self-signed certificates for federation traffic.

@anoadragon453
Copy link
Member Author

anoadragon453 commented Nov 3, 2020

It looks like we actually try to create certificates that each homeserver will trust through a common CA - so I'll aim to get that working rather than just disabling verification everywhere.

@anoadragon453
Copy link
Member Author

Looks like getting certificate validation working won't be possible until matrix-org/complement#28 or similar lands.

This is because we're not testing Synapse against itself, but a fake complement federation server. This server has its own dummy CA which is generating certificates we can't trust.

The above linked PR will expose the certificate of that dummy CA to the homeserver containers to trust - at that point we can turn federation certificate verification back on.

After this there seem to be no more fundamental issues plaguing Synapse support for Complement 🎉 Just actual issues which will take Synapse and spec work to solve :)

@anoadragon453
Copy link
Member Author

PR for certification verification disabling: matrix-org/complement#30

@anoadragon453
Copy link
Member Author

anoadragon453 commented Nov 9, 2020

There's been a few PRs landing in Complement to fix some more of the wide-sweeping test failures Synapse is experiencing:

Testing with these all merged, I've updated the test results in the original comment.

@anoadragon453
Copy link
Member Author

anoadragon453 commented Dec 9, 2020

Note that the original comment is being continuously updated with the current state of the tests and the PRs that will allow them to pass.

@anoadragon453
Copy link
Member Author

As of now, Synapse develop now passes all of Complements tests. The PRs that made this possible were:

Phew!

@richvdh richvdh added the A-Testing Issues related to testing in complement, synapse, etc label Aug 25, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
A-Testing Issues related to testing in complement, synapse, etc
Projects
None yet
Development

No branches or pull requests

3 participants