Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

User-Interactive Auth sessions need to be replicated across the processes of a worker-mode deployment #6705

Closed
anoadragon453 opened this issue Jan 14, 2020 · 17 comments
Assignees
Labels
z-bug (Deprecated Label)

Comments

@anoadragon453
Copy link
Member

Description

So I was trying to figure out why the /auth/xxx/fallback/web was working on a local homeserver instance but not on matrix.org.

In this flow, there's a sessions dict which governs the current state of your login/registration progress. When you complete a auth stage, your session in this dict is modified. This sessions dict is available to registration and login methods.

Turns out that in the recommended worker setup, you can end up with two workers, each with their own instance of sessions. One tracks your login progress, and another that gets modified when you successfully complete web fallback. However, because they are separate, you don't actually see any progress when you query /register after successfully completing web fallback registration stages.

Essentially /auth/xxx/fallback/web needs to get routed to the same worker as /register.

Synapse v1.8.0

@anoadragon453
Copy link
Member Author

Also slightly worrying as I think mobile use this to do recaptcha...?

@neilisfragile neilisfragile added z-bug (Deprecated Label) z-p2 (Deprecated Label) labels Jan 22, 2020
@richvdh
Copy link
Member

richvdh commented Feb 23, 2020

We should investigate if this is actually breaking clients.

@deepbluev7
Copy link
Contributor

deepbluev7 commented Feb 23, 2020

Yes, it is breaking clients. I can confirm that, as I just implemented the user interactive auth part of registration in nheko. Basically we try the first stage, but after the user completed the challenge and clicks okay, synapse still returns completed: []. I'm pretty sure I fixed our bugs and the fault is now with synapse/matrix.org in this case. If you want to verify that, it should be possible to test it with the latest nightly build of nheko (still building atm).

@richvdh
Copy link
Member

richvdh commented Mar 6, 2020

thinking about this again: it doesn't need any more investigation. It just needs an update to the documentation, and an update to the matrix.org deployment (and worker-based modular.im deployments).

@clokep
Copy link
Member

clokep commented Mar 6, 2020

I'm going to do this.

@richvdh
Copy link
Member

richvdh commented Mar 7, 2020

Thanks for picking this up @clokep. Just a reminder to us both that the update needs deploying on the matrix.org and modular.im infrastructure - I suggest you ping @benbz and @jaywink next week to see about getting that done.

@deepbluev7
Copy link
Contributor

Does this fix all cases, where auth fallbacks are used? I'd imagine all requests, that use interactive authentification, need to be routed to the same endpoint as the request, that requires auth. For example I believe that the auth fallback for device deletion needs to be routed to the same worker, that handles that request. Or am I missing something like this not being handled by a worker, so it should work fine?

@richvdh
Copy link
Member

richvdh commented Mar 7, 2020

ohhhhh gosh that's a good point. this is going to be much harder to fix.

@clokep
Copy link
Member

clokep commented Mar 9, 2020

I took another look at what uses the sessions object from the auth handler (nothing directly uses it outside of the class, but there's a variety of indirect users via _get_session_info and _save_session).

I think the following patterns use it:

  • /register$
  • /auth/.*/fallback/web$
  • /account/password$
  • /account/deactivate$
  • /account/3pid/add$
  • /delete_devices
  • /devices/.+$ (deleting a device)
  • /keys/device_signing/upload$
  • /login$ (for non-JWT and non-token logins)

The first two on this list are now documented, not sure if we just need to document the rest of these or if there's more to do. Just documenting them seems fairly fragile though (we will likely run into this again in the future).

@deepbluev7
Copy link
Contributor

Yeah, those endpoints match my expectation, of where I know UI auth is used. I think just documenting them is a bit brittle. I can't even find some of them in the doc, so I guess those are handled by the master process? I think it would make the most sense, if all of those could be handled by the client_reader and successive requests from the same client to the client reader were routed to the same instance of the client_reader, but I don't really know where the cut-off for that would be (5 minutes?). I think there should be some way for workers to cross-talk maybe, so that they can continue fallback auth, if needed. Since this only affects requests that contain /auth/.*/fallback/web or auth inside the request body, maybe a better solution than documenting it could be found? Maybe if redis is used for communication between master processes, that could be extended to UI auth?

@clokep
Copy link
Member

clokep commented Mar 10, 2020

Maybe if redis is used for communication between master processes, that could be extended to UI auth?

This is where my mind went to as well, especially with the work that @erikjohnston has been doing.

@richvdh
Copy link
Member

richvdh commented Mar 17, 2020

interim solutions to this could be:

  • just document that everything that does UI auth has to go to the master - though that is unsatisfactory because some of them can be quite high-cpu (particularly the bcrypting part of password-setting/registration)
  • create an internal http API on the master, which just handles UI auth, and which workers can call.

@clokep
Copy link
Member

clokep commented Mar 27, 2020

Somewhat related, #649 (comment) notes that UI auth sessions get dropped over restarts.

@clokep
Copy link
Member

clokep commented Apr 1, 2020

In addition to #6705 (comment) we could attempt to solve this by persisting the data more permanently:

  • Use Redis (this could be nice since I think it has utilities to auto-expire content), but I'm unsure of the final plan for Redis for workers in the future. It also means that the implementation for worker vs. non-worker will diverge more heavily, but I don't think it would be terrible to abstract.
  • Persist the data into SQL. This means we'll need to run some sort of periodic task to prune data, but besides that I don't see a ton of downsides.

@richvdh
Copy link
Member

richvdh commented Apr 2, 2020

sticking it in the db sounds good

@richvdh richvdh changed the title Worker configuration for web fallback registration is incorrect User-Interactive Auth sessions need to be replicated across the processes of a worker-mode deployment Apr 17, 2020
@clokep
Copy link
Member

clokep commented Apr 30, 2020

Fixed by #7302.

@clokep clokep closed this as completed Apr 30, 2020
@richvdh
Copy link
Member

richvdh commented May 7, 2020

Also slightly worrying as I think mobile use this to do recaptcha...?

For the record, no, none of riot-ios/riot-andriod/riotX use fallback auth for recaptcha.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
z-bug (Deprecated Label)
Projects
None yet
Development

No branches or pull requests

5 participants