Skip to content

Conversation

@wchargin
Copy link
Contributor

@wchargin wchargin commented Feb 3, 2021

Summary:
Our GCS integration can now read standard Google credentials, read from
either the GOOGLE_APPLICATION_CREDENTIALS environment variable or the
well-known path under ~/.config/gcloud. Refresh tokens are exchanged
for access tokens, which are cached until they’re about to expire.
Outgoing requests use a cached access token or fetch a new one if
necessary.

Test Plan:
Run TensorBoard with --load_fast and a --logdir pointing to a bucket
with restricted access. Note (with RUST_LOG=rustboard=debug) that
access tokens are only fetched once and are then reused.

Use gcloud auth application-default login to generate credentials.

Note also that if you cap access token lifetime to 30 seconds—

diff --git a/tensorboard/data/server/gcs/auth.rs b/tensorboard/data/server/gcs/auth.rs
--- a/tensorboard/data/server/gcs/auth.rs
+++ b/tensorboard/data/server/gcs/auth.rs
@@ -296,3 +296,3 @@ impl RefreshTokenCreds {
             access_token: res.access_token,
-            expires: Instant::now() + Duration::from_secs(res.expires_in),
+            expires: Instant::now() + Duration::from_secs(res.expires_in.min(30)),
         })

—(and change the margin from 60 seconds to 10 seconds) then access
tokens are still only refreshed once every 30 seconds, even among
concurrent loads.

wchargin-branch: rust-gcs-auth

Summary:
Previously, `LogdirLoader` and `RunLoader` were both hard-tied to the
native OS filesystem, via `walkdir::WalkDir` and `std::fs::File`,
respectively. This patch introduces a `Logdir` trait that abstracts over
listing and reading event files. A `DiskLogdir` implementation recovers
the current behavior (mostly; see below), and we can slot in adapters
for more filesystems as desired.

At this point, it’s convenient to drop the hyper-precise semantics
around non-UTF-8 run name collisions, which only occur if you have two
runs whose names are invalid Unicode and are equal after lossy
conversion. We originally handled this precisely because Rust made it
easy to do so, but it’s never come up as an issue in the real world.
It’s no longer quite so convenient to handle, so we cull the complexity.

Test Plan:
Existing unit tests suffice, and a `--load_fast` end-to-end test still
checks out.

wchargin-branch: rust-logdir-trait
wchargin-source: 7535f5d9c7997b111e80edee5be9f1726f8fbe6b
wchargin-branch: rust-logdir-trait
wchargin-source: ee5495623979251b83234ee51f6c92ae46a3643d
wchargin-branch: rust-logdir-trait
wchargin-source: a97ab9996a1a87b98951441ab44c1461879f9a2c

# Conflicts:
#	tensorboard/data/server/bench.rs
#	tensorboard/data/server/cli.rs
#	tensorboard/data/server/logdir.rs
wchargin-branch: rust-logdir-trait
wchargin-source: a97ab9996a1a87b98951441ab44c1461879f9a2c
wchargin-branch: rust-logdir-trait
wchargin-source: 3a683c56a8013e9e46389ff53891a794b661cd43
wchargin-branch: rust-logdir-trait
wchargin-source: bdad59cb2dee154d63518e05bfa0d252301e8c83
wchargin-branch: rust-logdir-trait
wchargin-source: 15bddd7be7ef775d415d2fecf835814bc35e27c8
wchargin-branch: rust-logdir-trait
wchargin-source: 15bddd7be7ef775d415d2fecf835814bc35e27c8
Summary:
The [`reqwest`] crate provides a high-level HTTP interface, similar in
spirit to Python’s `requests` package. It has built-in support for JSON
serialization and deserialization via `serde` (which we already use) and
is based on the `hyper` stack (which we also already use), so it should
fit in nicely. We’ll use it to make requests to GCS.

[`reqwest`]: https://crates.io/crates/reqwest

Test Plan:
It builds: `bazel build //third_party/rust:reqwest`.

wchargin-branch: rust-dep-reqwest
wchargin-source: d05d3de3a3d44974e282574ec965ad6bd0024246
Summary:
This patch implements the extent of the Google Cloud Storage protocol
that TensorBoard needs: list objects in a bucket with a given prefix,
and read partial contents of an object. It turns out to be really easy.

For comparison, [TensorFlow also rolls its own GCS client][tf]. Theirs
is more complex because it needs to handle writable files and support
general-purpose caching patterns. By contrast, we have a simple one-pass
read pattern and already assume that files are append-only, so we avoid
both the complexity and pathological interactions like #1225.

For now, this only serves public buckets and objects. Authentication is
also easy (and doesn’t require crypto or anything complicated), but, for
ease of review, we defer it to a future patch.

[tf]: https://github.com/tensorflow/tensorflow/tree/r2.4/tensorflow/core/platform/cloud

Test Plan:
Included a simple client that supports `gsutil ls` and `gsutil cat`. Run
with `RUST_LOG=debug cargo run --release --bin gsutil` and more args:

  - `ls tensorboard-bench-logs` to list all 33K objects in the bucket,
    across 34 pages of list operations (3.3s on my machine);
  - `ls tensorboard-bench-logs --prefix mnist/` to list just a single
    logdir, which should be much faster (0.1 seconds on my machine,
    which includes setting up the keep-alive connection);
  - `cat tensorboard-bench-logs mnist/README --to=11` to print the first
    12 bytes (`Range: bytes=0-11` inclusive) of an object;
  - `cat tensorboard-bench-logs mnist/README --from=9999` to print
    nothing, since the object is shorter than 9999 bytes.

wchargin-branch: rust-gcs-client
wchargin-source: d9e404df57ecf5ee80089b810835a241084ffbc8
wchargin-branch: rust-gcs-client
wchargin-source: 48943e73d17f4dadaeb7aa83b6eeaa3ec8f78707
Summary:
You can now run `tensorboard --load_fast --logdir gs://...` to launch a
Rust-backed TensorBoard reading from GCS. This patch introduces two new
`Logdir` implementations: one that connects to the GCS client, and one
that delegates to either a `DiskLogdir` or a `gcs::Logdir` depending on
the form of the `--logdir` flag.

Test Plan:
Run:

```
bazel run //tensorboard -- --logdir gs://tensorboard-bench-logs/edge_cgan --load_fast
```

In practice, this makes `tensorboard(1)` fairly slow, because the
DebuggerV2 plugin aggressively scans the GCS logdir. Thus, it might take
a while even for the server link to appear. You can comment out the
debugger plugin in `default.py` to get an idea of the underlying speed.

wchargin-branch: rust-gcs-logdir
wchargin-source: 85d75544b0b2391dbc735270940c61ce783b7748
Summary:
Our GCS integration can now read standard Google credentials, read from
either the `GOOGLE_APPLICATION_CREDENTIALS` environment variable or the
well-known path under `~/.config/gcloud`. Refresh tokens are exchanged
for access tokens, which are cached until they’re about to expire.
Outgoing requests use a cached access token or fetch a new one if
necessary.

Test Plan:
Run TensorBoard with `--load_fast` and a `--logdir` pointing to a bucket
with restricted access. Note (with `RUST_LOG=rustboard=debug`) that
access tokens are only fetched once and are then reused.

wchargin-branch: rust-gcs-auth
wchargin-source: 292a7fe347c136284d13d24e3ae89aa1e8760f50
@wchargin wchargin added type:feature core:rustboard //tensorboard/data/server/... labels Feb 3, 2021
@google-cla google-cla bot added the cla: yes label Feb 3, 2021
wchargin-branch: rust-gcs-logdir
wchargin-source: 68e19b39d5704335d5c05074da5f0c44e069c431

# Conflicts:
#	tensorboard/data/server/BUILD
#	tensorboard/data/server/cli.rs
#	tensorboard/data/server/gcs.rs
#	tensorboard/data/server/gcs/client.rs
wchargin-branch: rust-gcs-logdir
wchargin-source: 68e19b39d5704335d5c05074da5f0c44e069c431
wchargin-branch: rust-gcs-auth
wchargin-source: 56d6f3c2c148e2104a12565b325ec7ec215cfa1e
@wchargin wchargin requested a review from psybuzz February 4, 2021 02:27
@wchargin
Copy link
Contributor Author

wchargin commented Feb 4, 2021

Run TensorBoard with --load_fast and a --logdir pointing to a bucket
with restricted access.

Ping me internally if you want a pointer. (edit: http://b/179067453#comment2)

Copy link
Contributor

@psybuzz psybuzz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, everything makes sense. It was fun to read about things I wasn't familiar with: XDG_CONFIG_HOME, RwLocks in action, etc!

When I try to run
bazel run tensorboard:dev --define=link_data_server=true -- --load_fast --logdir gs://tensorboard-bench-logs/edge_cgan --bind_all

I get:

thread 'main' panicked at 'Cannot drop a runtime in a context where blocking is not allowed. This happens when a runtime is dropped from within an asynchronous context.', external/raze__tokio__1_0_2/src/runtime/blocking/shutdown.rs:51:21

Presumably, the blocking is because of the Reqwest HttpClient, but I'm not sure why Rust thinks blocking is not allowed here. Have you seen this before?

/// [`Self::fetch`] will always return `None`.
///
/// This exists as an optimization so that a [`TokenStore`] doesn't need to check locks all the
/// time when the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfinished comment?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, yes; thanks.


/// Checks whether `token` represents a token that will still be valid for at least the given
/// `lifetime`, and if so returns a reference to the inner access token.
fn check_expiry(token: &Option<BoundedToken>, lifetime: Duration) -> Option<&AccessToken> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I presume this means the token must be valid for at least lifetime from now, in order to be usable. In the TF side [1], it seems like they have a much longer 60sec margin, compared to the 10sec in this PR (client.rs L76). Do we believe that the TF lifetime is too conservative?

https://github.com/tensorflow/tensorflow/blob/r2.4/tensorflow/core/platform/cloud/google_auth_provider.cc#L63

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hadn’t bothered to check their padding value. Happy to use 60s. Thanks!

);
Credentials::RefreshToken(RefreshToken(refresh_token))
}
Err(e) => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If serde fails to deserialize the contents of the credentials file, it goes into this error case, which is good. However, does the match still fall into this case when serde properly deserializes the contents into a proper struct, which does not conform to the RefreshTokenCreds struct? (e.g. if it doesn't have a client_id field)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question. Indeed: that becomes a decode error, which we
propagate up (as an io::ErrorKind::InvalidData error, through the
mapping in gcs::logdir::reqwest_to_io_error from #4646).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting, thanks

f.debug_struct("RefreshTokenCreds")
.field("client_id", &self.client_id)
.field("client_secret", &self.client_secret)
.field("refresh_token", &format_args!("<redacted>"))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just checking: if the <redacted> here is to cover the case when we accidentally save/log our refresh tokens somewhere, should we redact the client_secret too?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding is that the client “secret” is actually not secret:

# The client "secret" is public by design for installed apps. See
# https://developers.google.com/identity/protocols/OAuth2?csw=1#installed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

…but, you know, on second thought, it costs nothing to hide the client
“secret”, so I might as well, just in case my understanding is missing
some pieces. Done.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah - it's not secret for well-known installed apps like gcloud or our uploader, since it's not possible to hide it, but I think it could be a real secret for a generic GOOGLE_APPLICATION_CREDENTIALS path.

return rb;
}
let token = self.token.read().expect("failed to read auth token");
if let Some(t) = Self::check_expiry(&*token, lifetime) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking out loud:
At first glance, I was confused by what '&' meant, but I think it makes sense now. The type of token is a RwLockReadGuard<Option<BoundedToken>>, and what we really want to check is the Option<BoundedToken>. Doing the deref '' removes the guard, and the next '&' gets a reference to the Option, since we don't want to do a value copy.

If we simply did a &token, the code still compiles, but then we're just getting a reference to the guard, which is not the same as the reference to what it is guarding (the Option).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep! Exactly right; good tracing.

wchargin-branch: rust-gcs-logdir
wchargin-source: f8b766436143e6faa603ca551c0ceb8813ea829c
wchargin-branch: rust-gcs-logdir
wchargin-source: f8b766436143e6faa603ca551c0ceb8813ea829c
Copy link
Contributor Author

@wchargin wchargin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review!

f.debug_struct("RefreshTokenCreds")
.field("client_id", &self.client_id)
.field("client_secret", &self.client_secret)
.field("refresh_token", &format_args!("<redacted>"))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding is that the client “secret” is actually not secret:

# The client "secret" is public by design for installed apps. See
# https://developers.google.com/identity/protocols/OAuth2?csw=1#installed

return rb;
}
let token = self.token.read().expect("failed to read auth token");
if let Some(t) = Self::check_expiry(&*token, lifetime) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep! Exactly right; good tracing.


/// Checks whether `token` represents a token that will still be valid for at least the given
/// `lifetime`, and if so returns a reference to the inner access token.
fn check_expiry(token: &Option<BoundedToken>, lifetime: Duration) -> Option<&AccessToken> {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hadn’t bothered to check their padding value. Happy to use 60s. Thanks!

/// [`Self::fetch`] will always return `None`.
///
/// This exists as an optimization so that a [`TokenStore`] doesn't need to check locks all the
/// time when the
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, yes; thanks.

);
Credentials::RefreshToken(RefreshToken(refresh_token))
}
Err(e) => {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question. Indeed: that becomes a decode error, which we
propagate up (as an io::ErrorKind::InvalidData error, through the
mapping in gcs::logdir::reqwest_to_io_error from #4646).

wchargin-branch: rust-gcs-auth
wchargin-source: 9c1d46133d64d3dd5f22d7d5e801868ca54a6a60

# Conflicts:
#	tensorboard/data/server/cli/dynamic_logdir.rs
wchargin-branch: rust-gcs-auth
wchargin-source: 9c1d46133d64d3dd5f22d7d5e801868ca54a6a60
@wchargin
Copy link
Contributor Author

wchargin commented Feb 5, 2021

When I try to run
bazel run tensorboard:dev --define=link_data_server=true -- --load_fast --logdir gs://tensorboard-bench-logs/edge_cgan --bind_all

I get:

thread 'main' panicked at 'Cannot drop a runtime in a context where blocking is not allowed. This happens when a runtime is dropped from within an asynchronous context.', external/raze__tokio__1_0_2/src/runtime/blocking/shutdown.rs:51:21

Presumably, the blocking is because of the Reqwest HttpClient, but I'm not sure why Rust thinks blocking is not allowed here. Have you seen this before?

Thanks much for catching this! Looks like it only happens in debug mode,
which is why I didn’t catch it. Fixed in #4646.

);
Credentials::RefreshToken(RefreshToken(refresh_token))
}
Err(e) => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting, thanks

wchargin-branch: rust-gcs-auth
wchargin-source: 42ce43c59cfebd640e4bdd2920c6e31134c30305
wchargin-branch: rust-gcs-auth
wchargin-source: 0545ce350bb5b64a516aa559fc7c043c8121d840
wchargin-branch: rust-gcs-logdir
wchargin-source: 3a14dc3d2fdf168b291cea1d58a97f93b6190d87
wchargin-branch: rust-gcs-auth
wchargin-source: 74a2cbce08f5c68bf36b715cc952e9337841a79d
Base automatically changed from wchargin-rust-gcs-logdir to master February 9, 2021 22:09
wchargin-branch: rust-gcs-auth
wchargin-source: 7016f8659ccbf02456bf0d01d3dfe8b294c61d6c

# Conflicts:
#	tensorboard/data/server/cli.rs
#	tensorboard/data/server/cli/dynamic_logdir.rs
#	tensorboard/data/server/gcs/client.rs
#	tensorboard/data/server/gcs/logdir.rs
wchargin-branch: rust-gcs-auth
wchargin-source: 7016f8659ccbf02456bf0d01d3dfe8b294c61d6c
f.debug_struct("RefreshTokenCreds")
.field("client_id", &self.client_id)
.field("client_secret", &self.client_secret)
.field("refresh_token", &format_args!("<redacted>"))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah - it's not secret for well-known installed apps like gcloud or our uploader, since it's not possible to hide it, but I think it could be a real secret for a generic GOOGLE_APPLICATION_CREDENTIALS path.

}

/// Checks whether `token` represents a token that will still be valid for at least the given
/// `lifetime`, and if so returns a reference to the inner access token.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Slightly non-obvious IMO from the name that check_expiry also unwraps the token. Optional but maybe consider making it a method on BoundedToken (could replace valid_at) like unwrap_if_valid_for()?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, sure: changed as suggested.

wchargin-branch: rust-gcs-auth
wchargin-source: 0d3abc195a783abb8e87178ddbcdb3b91a0d2b33
@wchargin wchargin merged commit 1782ca1 into master Feb 10, 2021
@wchargin wchargin deleted the wchargin-rust-gcs-auth branch February 10, 2021 02:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cla: yes core:rustboard //tensorboard/data/server/... type:feature

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants