-
Notifications
You must be signed in to change notification settings - Fork 1.7k
rust: add refresh token authentication #4647
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Summary: Previously, `LogdirLoader` and `RunLoader` were both hard-tied to the native OS filesystem, via `walkdir::WalkDir` and `std::fs::File`, respectively. This patch introduces a `Logdir` trait that abstracts over listing and reading event files. A `DiskLogdir` implementation recovers the current behavior (mostly; see below), and we can slot in adapters for more filesystems as desired. At this point, it’s convenient to drop the hyper-precise semantics around non-UTF-8 run name collisions, which only occur if you have two runs whose names are invalid Unicode and are equal after lossy conversion. We originally handled this precisely because Rust made it easy to do so, but it’s never come up as an issue in the real world. It’s no longer quite so convenient to handle, so we cull the complexity. Test Plan: Existing unit tests suffice, and a `--load_fast` end-to-end test still checks out. wchargin-branch: rust-logdir-trait wchargin-source: 7535f5d9c7997b111e80edee5be9f1726f8fbe6b
wchargin-branch: rust-logdir-trait wchargin-source: ee5495623979251b83234ee51f6c92ae46a3643d
wchargin-branch: rust-logdir-trait wchargin-source: a97ab9996a1a87b98951441ab44c1461879f9a2c # Conflicts: # tensorboard/data/server/bench.rs # tensorboard/data/server/cli.rs # tensorboard/data/server/logdir.rs
wchargin-branch: rust-logdir-trait wchargin-source: a97ab9996a1a87b98951441ab44c1461879f9a2c
wchargin-branch: rust-logdir-trait wchargin-source: 3a683c56a8013e9e46389ff53891a794b661cd43
wchargin-branch: rust-logdir-trait wchargin-source: bdad59cb2dee154d63518e05bfa0d252301e8c83
wchargin-branch: rust-logdir-trait wchargin-source: 15bddd7be7ef775d415d2fecf835814bc35e27c8
wchargin-branch: rust-logdir-trait wchargin-source: 15bddd7be7ef775d415d2fecf835814bc35e27c8
Summary: The [`reqwest`] crate provides a high-level HTTP interface, similar in spirit to Python’s `requests` package. It has built-in support for JSON serialization and deserialization via `serde` (which we already use) and is based on the `hyper` stack (which we also already use), so it should fit in nicely. We’ll use it to make requests to GCS. [`reqwest`]: https://crates.io/crates/reqwest Test Plan: It builds: `bazel build //third_party/rust:reqwest`. wchargin-branch: rust-dep-reqwest wchargin-source: d05d3de3a3d44974e282574ec965ad6bd0024246
Summary: This patch implements the extent of the Google Cloud Storage protocol that TensorBoard needs: list objects in a bucket with a given prefix, and read partial contents of an object. It turns out to be really easy. For comparison, [TensorFlow also rolls its own GCS client][tf]. Theirs is more complex because it needs to handle writable files and support general-purpose caching patterns. By contrast, we have a simple one-pass read pattern and already assume that files are append-only, so we avoid both the complexity and pathological interactions like #1225. For now, this only serves public buckets and objects. Authentication is also easy (and doesn’t require crypto or anything complicated), but, for ease of review, we defer it to a future patch. [tf]: https://github.com/tensorflow/tensorflow/tree/r2.4/tensorflow/core/platform/cloud Test Plan: Included a simple client that supports `gsutil ls` and `gsutil cat`. Run with `RUST_LOG=debug cargo run --release --bin gsutil` and more args: - `ls tensorboard-bench-logs` to list all 33K objects in the bucket, across 34 pages of list operations (3.3s on my machine); - `ls tensorboard-bench-logs --prefix mnist/` to list just a single logdir, which should be much faster (0.1 seconds on my machine, which includes setting up the keep-alive connection); - `cat tensorboard-bench-logs mnist/README --to=11` to print the first 12 bytes (`Range: bytes=0-11` inclusive) of an object; - `cat tensorboard-bench-logs mnist/README --from=9999` to print nothing, since the object is shorter than 9999 bytes. wchargin-branch: rust-gcs-client wchargin-source: d9e404df57ecf5ee80089b810835a241084ffbc8
wchargin-branch: rust-gcs-client wchargin-source: 48943e73d17f4dadaeb7aa83b6eeaa3ec8f78707
Summary: You can now run `tensorboard --load_fast --logdir gs://...` to launch a Rust-backed TensorBoard reading from GCS. This patch introduces two new `Logdir` implementations: one that connects to the GCS client, and one that delegates to either a `DiskLogdir` or a `gcs::Logdir` depending on the form of the `--logdir` flag. Test Plan: Run: ``` bazel run //tensorboard -- --logdir gs://tensorboard-bench-logs/edge_cgan --load_fast ``` In practice, this makes `tensorboard(1)` fairly slow, because the DebuggerV2 plugin aggressively scans the GCS logdir. Thus, it might take a while even for the server link to appear. You can comment out the debugger plugin in `default.py` to get an idea of the underlying speed. wchargin-branch: rust-gcs-logdir wchargin-source: 85d75544b0b2391dbc735270940c61ce783b7748
Summary: Our GCS integration can now read standard Google credentials, read from either the `GOOGLE_APPLICATION_CREDENTIALS` environment variable or the well-known path under `~/.config/gcloud`. Refresh tokens are exchanged for access tokens, which are cached until they’re about to expire. Outgoing requests use a cached access token or fetch a new one if necessary. Test Plan: Run TensorBoard with `--load_fast` and a `--logdir` pointing to a bucket with restricted access. Note (with `RUST_LOG=rustboard=debug`) that access tokens are only fetched once and are then reused. wchargin-branch: rust-gcs-auth wchargin-source: 292a7fe347c136284d13d24e3ae89aa1e8760f50
wchargin-branch: rust-gcs-logdir wchargin-source: 68e19b39d5704335d5c05074da5f0c44e069c431 # Conflicts: # tensorboard/data/server/BUILD # tensorboard/data/server/cli.rs # tensorboard/data/server/gcs.rs # tensorboard/data/server/gcs/client.rs
wchargin-branch: rust-gcs-logdir wchargin-source: 68e19b39d5704335d5c05074da5f0c44e069c431
wchargin-branch: rust-gcs-auth wchargin-source: 56d6f3c2c148e2104a12565b325ec7ec215cfa1e
Ping me internally if you want a pointer. (edit: http://b/179067453#comment2) |
psybuzz
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, everything makes sense. It was fun to read about things I wasn't familiar with: XDG_CONFIG_HOME, RwLocks in action, etc!
When I try to run
bazel run tensorboard:dev --define=link_data_server=true -- --load_fast --logdir gs://tensorboard-bench-logs/edge_cgan --bind_all
I get:
thread 'main' panicked at 'Cannot drop a runtime in a context where blocking is not allowed. This happens when a runtime is dropped from within an asynchronous context.', external/raze__tokio__1_0_2/src/runtime/blocking/shutdown.rs:51:21
Presumably, the blocking is because of the Reqwest HttpClient, but I'm not sure why Rust thinks blocking is not allowed here. Have you seen this before?
tensorboard/data/server/gcs/auth.rs
Outdated
| /// [`Self::fetch`] will always return `None`. | ||
| /// | ||
| /// This exists as an optimization so that a [`TokenStore`] doesn't need to check locks all the | ||
| /// time when the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfinished comment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops, yes; thanks.
tensorboard/data/server/gcs/auth.rs
Outdated
|
|
||
| /// Checks whether `token` represents a token that will still be valid for at least the given | ||
| /// `lifetime`, and if so returns a reference to the inner access token. | ||
| fn check_expiry(token: &Option<BoundedToken>, lifetime: Duration) -> Option<&AccessToken> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I presume this means the token must be valid for at least lifetime from now, in order to be usable. In the TF side [1], it seems like they have a much longer 60sec margin, compared to the 10sec in this PR (client.rs L76). Do we believe that the TF lifetime is too conservative?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I hadn’t bothered to check their padding value. Happy to use 60s. Thanks!
| ); | ||
| Credentials::RefreshToken(RefreshToken(refresh_token)) | ||
| } | ||
| Err(e) => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If serde fails to deserialize the contents of the credentials file, it goes into this error case, which is good. However, does the match still fall into this case when serde properly deserializes the contents into a proper struct, which does not conform to the RefreshTokenCreds struct? (e.g. if it doesn't have a client_id field)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question. Indeed: that becomes a decode error, which we
propagate up (as an io::ErrorKind::InvalidData error, through the
mapping in gcs::logdir::reqwest_to_io_error from #4646).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting, thanks
| f.debug_struct("RefreshTokenCreds") | ||
| .field("client_id", &self.client_id) | ||
| .field("client_secret", &self.client_secret) | ||
| .field("refresh_token", &format_args!("<redacted>")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just checking: if the <redacted> here is to cover the case when we accidentally save/log our refresh tokens somewhere, should we redact the client_secret too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My understanding is that the client “secret” is actually not secret:
tensorboard/tensorboard/uploader/auth.py
Lines 46 to 47 in b306651
| # The client "secret" is public by design for installed apps. See | |
| # https://developers.google.com/identity/protocols/OAuth2?csw=1#installed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
…but, you know, on second thought, it costs nothing to hide the client
“secret”, so I might as well, just in case my understanding is missing
some pieces. Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah - it's not secret for well-known installed apps like gcloud or our uploader, since it's not possible to hide it, but I think it could be a real secret for a generic GOOGLE_APPLICATION_CREDENTIALS path.
tensorboard/data/server/gcs/auth.rs
Outdated
| return rb; | ||
| } | ||
| let token = self.token.read().expect("failed to read auth token"); | ||
| if let Some(t) = Self::check_expiry(&*token, lifetime) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking out loud:
At first glance, I was confused by what '&' meant, but I think it makes sense now. The type of token is a RwLockReadGuard<Option<BoundedToken>>, and what we really want to check is the Option<BoundedToken>. Doing the deref '' removes the guard, and the next '&' gets a reference to the Option, since we don't want to do a value copy.
If we simply did a &token, the code still compiles, but then we're just getting a reference to the guard, which is not the same as the reference to what it is guarding (the Option).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep! Exactly right; good tracing.
wchargin-branch: rust-gcs-logdir wchargin-source: f8b766436143e6faa603ca551c0ceb8813ea829c
wchargin-branch: rust-gcs-logdir wchargin-source: f8b766436143e6faa603ca551c0ceb8813ea829c
wchargin
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the review!
| f.debug_struct("RefreshTokenCreds") | ||
| .field("client_id", &self.client_id) | ||
| .field("client_secret", &self.client_secret) | ||
| .field("refresh_token", &format_args!("<redacted>")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My understanding is that the client “secret” is actually not secret:
tensorboard/tensorboard/uploader/auth.py
Lines 46 to 47 in b306651
| # The client "secret" is public by design for installed apps. See | |
| # https://developers.google.com/identity/protocols/OAuth2?csw=1#installed |
tensorboard/data/server/gcs/auth.rs
Outdated
| return rb; | ||
| } | ||
| let token = self.token.read().expect("failed to read auth token"); | ||
| if let Some(t) = Self::check_expiry(&*token, lifetime) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep! Exactly right; good tracing.
tensorboard/data/server/gcs/auth.rs
Outdated
|
|
||
| /// Checks whether `token` represents a token that will still be valid for at least the given | ||
| /// `lifetime`, and if so returns a reference to the inner access token. | ||
| fn check_expiry(token: &Option<BoundedToken>, lifetime: Duration) -> Option<&AccessToken> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I hadn’t bothered to check their padding value. Happy to use 60s. Thanks!
tensorboard/data/server/gcs/auth.rs
Outdated
| /// [`Self::fetch`] will always return `None`. | ||
| /// | ||
| /// This exists as an optimization so that a [`TokenStore`] doesn't need to check locks all the | ||
| /// time when the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops, yes; thanks.
| ); | ||
| Credentials::RefreshToken(RefreshToken(refresh_token)) | ||
| } | ||
| Err(e) => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question. Indeed: that becomes a decode error, which we
propagate up (as an io::ErrorKind::InvalidData error, through the
mapping in gcs::logdir::reqwest_to_io_error from #4646).
wchargin-branch: rust-gcs-auth wchargin-source: 9c1d46133d64d3dd5f22d7d5e801868ca54a6a60 # Conflicts: # tensorboard/data/server/cli/dynamic_logdir.rs
wchargin-branch: rust-gcs-auth wchargin-source: 9c1d46133d64d3dd5f22d7d5e801868ca54a6a60
Thanks much for catching this! Looks like it only happens in debug mode, |
| ); | ||
| Credentials::RefreshToken(RefreshToken(refresh_token)) | ||
| } | ||
| Err(e) => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting, thanks
wchargin-branch: rust-gcs-auth wchargin-source: 42ce43c59cfebd640e4bdd2920c6e31134c30305
wchargin-branch: rust-gcs-auth wchargin-source: 0545ce350bb5b64a516aa559fc7c043c8121d840
wchargin-branch: rust-gcs-logdir wchargin-source: 3a14dc3d2fdf168b291cea1d58a97f93b6190d87
wchargin-branch: rust-gcs-auth wchargin-source: 74a2cbce08f5c68bf36b715cc952e9337841a79d
wchargin-branch: rust-gcs-auth wchargin-source: 7016f8659ccbf02456bf0d01d3dfe8b294c61d6c # Conflicts: # tensorboard/data/server/cli.rs # tensorboard/data/server/cli/dynamic_logdir.rs # tensorboard/data/server/gcs/client.rs # tensorboard/data/server/gcs/logdir.rs
wchargin-branch: rust-gcs-auth wchargin-source: 7016f8659ccbf02456bf0d01d3dfe8b294c61d6c
| f.debug_struct("RefreshTokenCreds") | ||
| .field("client_id", &self.client_id) | ||
| .field("client_secret", &self.client_secret) | ||
| .field("refresh_token", &format_args!("<redacted>")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah - it's not secret for well-known installed apps like gcloud or our uploader, since it's not possible to hide it, but I think it could be a real secret for a generic GOOGLE_APPLICATION_CREDENTIALS path.
| } | ||
|
|
||
| /// Checks whether `token` represents a token that will still be valid for at least the given | ||
| /// `lifetime`, and if so returns a reference to the inner access token. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Slightly non-obvious IMO from the name that check_expiry also unwraps the token. Optional but maybe consider making it a method on BoundedToken (could replace valid_at) like unwrap_if_valid_for()?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, sure: changed as suggested.
wchargin-branch: rust-gcs-auth wchargin-source: 0d3abc195a783abb8e87178ddbcdb3b91a0d2b33
Summary:
Our GCS integration can now read standard Google credentials, read from
either the
GOOGLE_APPLICATION_CREDENTIALSenvironment variable or thewell-known path under
~/.config/gcloud. Refresh tokens are exchangedfor access tokens, which are cached until they’re about to expire.
Outgoing requests use a cached access token or fetch a new one if
necessary.
Test Plan:
Run TensorBoard with
--load_fastand a--logdirpointing to a bucketwith restricted access. Note (with
RUST_LOG=rustboard=debug) thataccess tokens are only fetched once and are then reused.
Use
gcloud auth application-default loginto generate credentials.Note also that if you cap access token lifetime to 30 seconds—
—(and change the margin from 60 seconds to 10 seconds) then access
tokens are still only refreshed once every 30 seconds, even among
concurrent loads.
wchargin-branch: rust-gcs-auth