
Conversation

@wchargin wchargin (Contributor) commented Feb 3, 2021

Summary:
You can now run `tensorboard --load_fast --logdir gs://...` to launch a
Rust-backed TensorBoard reading from GCS. This patch introduces two new
`Logdir` implementations: one that connects to the GCS client, and one
that delegates to either a `DiskLogdir` or a `gcs::Logdir` depending on
the form of the `--logdir` flag.
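
A minimal sketch of that delegation (the stub types and constructor
shapes here are assumptions for illustration, not the patch's exact
code):

```rust
use std::path::PathBuf;

// Stand-ins for the real implementations; only the dispatch matters here.
struct DiskLogdir(PathBuf);
mod gcs {
    pub struct Logdir(pub String); // bucket plus optional object prefix
}

enum DynLogdir {
    Disk(DiskLogdir),
    Gcs(gcs::Logdir),
}

impl DynLogdir {
    /// Picks an implementation from the form of the `--logdir` flag:
    /// `gs://...` goes to GCS; anything else is a local directory.
    fn new(logdir: &str) -> DynLogdir {
        match logdir.strip_prefix("gs://") {
            Some(rest) => DynLogdir::Gcs(gcs::Logdir(rest.to_string())),
            None => DynLogdir::Disk(DiskLogdir(PathBuf::from(logdir))),
        }
    }
}
```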

Test Plan:
Both the main `//tensorboard` server and the data-server benchmark work
against a GCS logdir:

```
bazel run //tensorboard -- \
    --logdir gs://tensorboard-bench-logs/edge_cgan --load_fast
bazel run //tensorboard/data/server:bench -- \
    --logdir gs://tensorboard-bench-logs/edge_cgan
```

wchargin-branch: rust-gcs-logdir

Summary:
Previously, `LogdirLoader` and `RunLoader` were both hard-tied to the
native OS filesystem, via `walkdir::WalkDir` and `std::fs::File`,
respectively. This patch introduces a `Logdir` trait that abstracts over
listing and reading event files. A `DiskLogdir` implementation recovers
the current behavior (mostly; see below), and we can slot in adapters
for more filesystems as desired.
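
The `open` half of the trait appears verbatim in the review below; a
plausible overall shape (the `discover` signature and the
`EventFileBuf` stand-in are assumptions) looks like:

```rust
use std::io::{self, Read};
use std::path::PathBuf;

/// Stand-in for the crate's event-file path newtype (assumed shape).
pub struct EventFileBuf(pub PathBuf);

/// Illustrative shape of the `Logdir` trait: list event files, then
/// open them for reading.
pub trait Logdir {
    /// Readable event-file type; an associated type, which matters for
    /// the dispatch discussion in review below.
    type File: Read;
    /// Lists event files currently present in the logdir.
    fn discover(&self) -> io::Result<Vec<EventFileBuf>>;
    /// Opens an event file previously returned by `discover`.
    fn open(&self, path: &EventFileBuf) -> io::Result<Self::File>;
}
```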

At this point, it’s convenient to drop the hyper-precise semantics
around non-UTF-8 run name collisions, which only occur if you have two
runs whose names are invalid Unicode and are equal after lossy
conversion. We originally handled this precisely because Rust made it
easy to do so, but it’s never come up as an issue in the real world.
It’s no longer quite so convenient to handle, so we cull the complexity.

Test Plan:
Existing unit tests suffice, and a `--load_fast` end-to-end test still
checks out.

wchargin-branch: rust-logdir-trait
wchargin-source: 7535f5d9c7997b111e80edee5be9f1726f8fbe6b
Summary:
The [`reqwest`] crate provides a high-level HTTP interface, similar in
spirit to Python’s `requests` package. It has built-in support for JSON
serialization and deserialization via `serde` (which we already use) and
is based on the `hyper` stack (which we also already use), so it should
fit in nicely. We’ll use it to make requests to GCS.

[`reqwest`]: https://crates.io/crates/reqwest
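
As a flavor of the API (a minimal sketch assuming the crate's
`blocking` and `json` features; the URL and struct are placeholders,
not anything from this patch):

```rust
use serde::Deserialize;

/// Placeholder shape for a JSON payload.
#[derive(Deserialize)]
struct Status {
    ok: bool,
}

/// One blocking GET, deserialized straight into a struct via `serde`.
fn fetch_status() -> Result<Status, reqwest::Error> {
    reqwest::blocking::get("https://example.com/status.json")?.json()
}
```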

Test Plan:
It builds: `bazel build //third_party/rust:reqwest`.

wchargin-branch: rust-dep-reqwest
wchargin-source: d05d3de3a3d44974e282574ec965ad6bd0024246
Summary:
This patch implements the extent of the Google Cloud Storage protocol
that TensorBoard needs: list objects in a bucket with a given prefix,
and read partial contents of an object. It turns out to be really easy.

For comparison, [TensorFlow also rolls its own GCS client][tf]. Theirs
is more complex because it needs to handle writable files and support
general-purpose caching patterns. By contrast, we have a simple one-pass
read pattern and already assume that files are append-only, so we avoid
both the complexity and pathological interactions like #1225.

For now, this only serves public buckets and objects. Authentication is
also easy (and doesn’t require crypto or anything complicated), but, for
ease of review, we defer it to a future patch.

[tf]: https://github.com/tensorflow/tensorflow/tree/r2.4/tensorflow/core/platform/cloud
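
To make the protocol concrete, here is a rough sketch of those two
calls against the public GCS JSON API using `reqwest::blocking`; the
struct shapes and helper names are illustrative, not the patch's
actual code:

```rust
use serde::Deserialize;

/// Subset of the GCS object-list response that we care about.
#[derive(Deserialize)]
struct ListResponse {
    #[serde(default)]
    items: Vec<Object>,
    #[serde(rename = "nextPageToken")]
    next_page_token: Option<String>,
}

#[derive(Deserialize)]
struct Object {
    name: String,
}

/// Lists one page of objects in `bucket` under `prefix`.
fn list_page(
    http: &reqwest::blocking::Client,
    bucket: &str,
    prefix: &str,
    page_token: Option<&str>,
) -> reqwest::Result<ListResponse> {
    let url = format!("https://storage.googleapis.com/storage/v1/b/{}/o", bucket);
    let mut req = http.get(&url).query(&[("prefix", prefix)]);
    if let Some(token) = page_token {
        req = req.query(&[("pageToken", token)]);
    }
    req.send()?.json()
}

/// Reads `object` from byte `from` through byte `to`, inclusive.
/// (Real code must percent-encode `object`, since names contain `/`.)
fn read_range(
    http: &reqwest::blocking::Client,
    bucket: &str,
    object: &str,
    from: u64,
    to: u64,
) -> reqwest::Result<Vec<u8>> {
    let url = format!(
        "https://storage.googleapis.com/storage/v1/b/{}/o/{}?alt=media",
        bucket, object
    );
    let res = http
        .get(&url)
        .header("Range", format!("bytes={}-{}", from, to))
        .send()?;
    Ok(res.bytes()?.to_vec())
}
```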

Test Plan:
Included a simple client that supports `gsutil ls` and `gsutil cat`. Run
with `RUST_LOG=debug cargo run --release --bin gsutil` and more args:

  - `ls tensorboard-bench-logs` to list all 33K objects in the bucket,
    across 34 pages of list operations (3.3s on my machine);
  - `ls tensorboard-bench-logs --prefix mnist/` to list just a single
    logdir, which should be much faster (0.1 seconds on my machine,
    which includes setting up the keep-alive connection);
  - `cat tensorboard-bench-logs mnist/README --to=11` to print the first
    12 bytes (`Range: bytes=0-11` inclusive) of an object;
  - `cat tensorboard-bench-logs mnist/README --from=9999` to print
    nothing, since the object is shorter than 9999 bytes.

wchargin-branch: rust-gcs-client
wchargin-source: d9e404df57ecf5ee80089b810835a241084ffbc8
Summary:
You can now run `tensorboard --load_fast --logdir gs://...` to launch a
Rust-backed TensorBoard reading from GCS. This patch introduces two new
`Logdir` implementations: one that connects to the GCS client, and one
that delegates to either a `DiskLogdir` or a `gcs::Logdir` depending on
the form of the `--logdir` flag.

Test Plan:
Run:

```
bazel run //tensorboard -- --logdir gs://tensorboard-bench-logs/edge_cgan --load_fast
```

In practice, this makes `tensorboard(1)` fairly slow, because the
DebuggerV2 plugin aggressively scans the GCS logdir. Thus, it might take
a while even for the server link to appear. You can comment out the
debugger plugin in `default.py` to get an idea of the underlying speed.

wchargin-branch: rust-gcs-logdir
wchargin-source: 85d75544b0b2391dbc735270940c61ce783b7748
@wchargin wchargin added type:feature core:rustboard //tensorboard/data/server/... labels Feb 3, 2021
@wchargin wchargin changed the base branch from wchargin-diffbase-gcs to wchargin-rust-gcs-client February 4, 2021 00:55
Base automatically changed from wchargin-rust-gcs-client to master February 4, 2021 01:32
@wchargin wchargin requested a review from nfelt February 4, 2021 01:35
Comment on lines +28 to +32:

```rust
/// A logdir dynamically dispatched over supported implementations.
pub enum DynLogdir {
    Disk(DiskLogdir),
    Gcs(gcs::Logdir),
}
```
wchargin (author):
@nfelt, re: discussion: you can see from the structure of these types
and implementations how they could be easily macroed into existence (all
except `DynLogdir::new`, of course). It's basically "the same thing with
more `$` signs":
https://gist.github.com/wchargin/f44c370bebb61210dbeb11037f0b2d99

I think that I have a weak preference for the explicit implementation,
but the macro's actually not that bad. It does make new cases a one-line
change. Open to suggestions.
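
The macro flavor is roughly this shape (a simplified sketch reusing
the `DynLogdir` enum quoted above, not the gist's exact code):

```rust
// Simplified sketch: one match template stamped out per trait method.
// This works only when the method's return type doesn't mention the
// associated type; `DynLogdir::new` and `open` still need real code.
macro_rules! dispatch {
    ($self:expr, $method:ident($($arg:expr),*)) => {
        match $self {
            DynLogdir::Disk(inner) => inner.$method($($arg),*),
            DynLogdir::Gcs(inner) => inner.$method($($arg),*),
        }
    };
}
```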

nfelt (reviewer):
Explicit seems clearer to me, especially given that we only have 2 implementations now.

What I had in mind when talking about "it would be nice to have a more automatic way to do this" was really more about whether there was something that would generalize not just to additional implementations but also cover the full set of trait methods as well (i.e. the dispatching logic for each method would be autogenerated without even having to include the macro template for each method). So it'd be sort of a compromise position between enums and trait objects, where the downside is that you need both a closed set of types known to the compiler and the trait must be object safe, but the upside is that you have no vtable overhead and you don't have to hand-roll the dispatching code for each trait method.

I'm not even sure that would work for this case where we have the associated type as a complicating factor, but it seems like the sort of thing that would in principle be possible to autogenerate in the basic case.

But this is just hypothesizing; no changes required or expected.
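
As a footnote on the associated-type complication: `dyn Logdir` would
require fixing `File`, so an object-safe route typically erases it
behind a boxed reader. A hedged sketch of that workaround (names
illustrative; not anything proposed here):

```rust
use std::io::{self, Read};

// Illustrative only: an object-safe companion trait that erases the
// associated `File` type by boxing the reader (one vtable hop each).
trait ErasedLogdir {
    fn open_erased(&self, path: &str) -> io::Result<Box<dyn Read>>;
}
```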

@wchargin wchargin (author) commented Feb 9, 2021

@nfelt: Friendly ping? This blocks #4647, and I would like to get these
merged with a bit of buffer to release a data server with GCS support
and teach tb-nightly to depend on it.


```rust
/// Read large chunks from GCS to reduce network roundtrips.
const BUFFER_CAPACITY: usize = 1024 * 1024 * 128;
```
nfelt (reviewer):
Just wondering, did we determine that the buffer is a bottleneck below this size? It's just a significant chunk of RAM when we add it up across many files, so my default inclination would probably be to start out smaller with something like 8-16 MiB. Or maybe at least worth making it tunable with an env var or something (like the TF GCS readahead buffer was).

wchargin (author):

Nope, I haven’t analyzed this. I picked 128 MiB because we use that
number for some similar readahead buffers internally, but I’m happy to
tone it down to 16 MiB for now. Agreed that it should be tunable, and
I’m happy to run benchmark sweeps, too; just don’t want to introduce too
much right off the bat.

With a quick unscientific analysis, I can fully load edge_cgan from
GCS in about 18 seconds with a 128 MiB buffer or 20 seconds with 16 MiB.
The standard deviation here is on the order of seconds in each case, so
it seems fine to proceed conservatively.
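
A sketch of the tunable suggested here, defaulting to the conservative
16 MiB (the env var name is hypothetical, not an actual flag):

```rust
use std::env;

/// Buffer capacity, overridable at runtime via an environment
/// variable (name illustrative only).
fn buffer_capacity() -> usize {
    const DEFAULT: usize = 16 * 1024 * 1024; // 16 MiB
    env::var("TENSORBOARD_DATA_SERVER_GCS_BUFFER_BYTES")
        .ok()
        .and_then(|s| s.parse().ok())
        .unwrap_or(DEFAULT)
}
```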


```rust
fn open(&self, path: &EventFileBuf) -> io::Result<Self::File> {
    // Paths as returned by `discover` are always valid unicode.
    // valid Unicode.
```
nfelt (reviewer):
extra line?

wchargin (author):

Fixed; thanks.

```rust
    .gcs
    .read(&self.bucket, &self.object, range)
    .map_err(reqwest_to_io_error)?;
(&mut buf[0..result.len()]).copy_from_slice(&result);
```
nfelt (reviewer):

I may not be reading this right, but it looks like we have 2 copies here: one inside `gcs::Client`, where we do `body.to_vec()` to go from `Bytes` to `Vec<u8>`, and then another one here, where we go from `Vec<u8>` to our final `[u8]`.

I doubt it's really a big deal performance-wise, but if that is indeed happening, it seems slightly nicer if we could do only one copy? E.g., if we were to have our client directly return `Bytes` rather than `Vec<u8>`, it seems like we could directly call `copy_from_slice(Bytes)`, since `Bytes` is `Borrow<[u8]>`?

wchargin (author):

Indeed! I was planning to do this as an optimization in a follow-up,
since using `Bytes` directly requires adding a dep on the `bytes` crate.
But it's well spotted, and a perfectly appropriate use of `Bytes`.

I don't see an easy way to zero-copy the bytes all the way from the HTTP
response into the `io::Read` output: e.g., neither `reqwest` nor `hyper`
seems to expose a `response.bytes_into(&mut buf)`.

There is a similar large copy in the `ReadBlob` RPC handler, which in
previous drafts I had replaced with an `Arc<[u8]>`, but `Bytes` might be
better there, too. (After all, `Bytes` is "`Arc<[u8]>` on steroids".)
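
For a sense of why `Bytes` helps here, a standalone sketch of the
crate's semantics:

```rust
use bytes::Bytes;

fn main() {
    // One heap allocation; everything below shares it by refcount.
    let body = Bytes::from(vec![0u8; 1024]);
    let head = body.slice(0..16); // zero-copy view into the same buffer
    let alias = body.clone();     // bumps a refcount; no byte copy
    assert_eq!(head.len(), 16);
    assert_eq!(alias.len(), 1024);
}
```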

nfelt (reviewer):

Gotcha, fine by me to do it in a follow-up.

I may be misunderstanding, but isn't it impossible to do a zero-copy transfer of HTTP response bytes into an `&mut [u8]`, which is what `io::Read` takes? Since the signature requires that we stuff the bytes into a mutable buffer we don't control? Unless what you're saying is that we can actually change the underlying buffer pointed to by `buf` to literally just point to the one occupied by `Bytes`, but my understanding was that we can only mutate the contents of `buf`, not repoint it entirely.

wchargin (author), Feb 9, 2021:

I agree that I don’t see a way to do it with the reqwest/et al. APIs,
but you could imagine that it could be possible if each of the layers of
the stack supported it. Something like:

```rust
// mod reqwest::blocking::response
impl Response {
    /// Streams some bytes from the response body directly into the
    /// given buffer. Returns the number of bytes written.
    fn bytes_into<T: BufMut>(&self, x: &mut T) -> crate::Result<usize>;
}

// mod gcs::client
impl Client {
    fn read(&self, bucket, object, buf: impl BufMut) -> self::Result<usize> {
        let res = self.http.get(...).send()?;
        res.bytes_into(&mut buf)
    }
}

// mod gcs::logdir
impl Read for File {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        let n = self.gcs.read(bucket, obj, buf)?;
        self.pos += n as u64;
        Ok(n)
    }
}
```

Or, from a lower-level perspective: at the end of the day, we receive
network data by calling `recvfrom(2)` and passing the kernel a mutable
pointer to a user-space buffer into which to write the received data, so
there's no fundamental reason that that pointer can't point into the
buffer given to `io::Read::read` (at least, as far as I can tell).

> but my understanding was that we can only mutate the contents of `buf`, not repoint it entirely.

Correct, and that is indeed not what I am alluding to.

@wchargin wchargin (author) left a review comment:

thanks!


wchargin added a commit that referenced this pull request Feb 9, 2021
Summary:
The [`bytes`] crate facilitates zero-copy network programming
with the `Bytes` struct, which is a kind of “`Arc<[u8]>` on steroids”
that can be sliced and passed around with shared ownership. Other parts
of our stack already use `Bytes` (notably, `tonic`, `reqwest`, and
`prost`), so by using it ourselves we can remove some unnecessary
clones, as discussed in [review on #4646]. Since this already exists as
a transitive dep, this patch just exposes it for direct use by our code.

[`bytes`]: https://crates.io/crates/bytes
[review on #4646]: #4646 (comment)

Test Plan:
It builds: `bazel build //tensorboard/data/server/cargo:bytes`.

wchargin-branch: rust-dep-bytes
wchargin-source: 13445ec2b94489c55c33e0c3eea775d9e06aa683
@wchargin wchargin merged commit d93cf13 into master Feb 9, 2021
@wchargin wchargin deleted the wchargin-rust-gcs-logdir branch February 9, 2021 22:09
wchargin added a commit that referenced this pull request Feb 10, 2021