
Conversation

@wchargin wchargin (Contributor) commented Feb 3, 2021

Summary:
You can now run `tensorboard --load_fast --logdir gs://...` to launch a
Rust-backed TensorBoard reading from GCS. This patch introduces two new
`Logdir` implementations: one that connects to the GCS client, and one
that delegates to either a `DiskLogdir` or a `gcs::Logdir` depending on
the form of the `--logdir` flag.
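
A minimal sketch of that delegation (the stub types and constructor
shapes here are assumptions for illustration, not the patch's exact
code):

```rust
use std::path::PathBuf;

// Stand-ins for the real implementations; only the dispatch matters here.
struct DiskLogdir(PathBuf);
mod gcs {
    pub struct Logdir(pub String); // bucket plus optional object prefix
}

enum DynLogdir {
    Disk(DiskLogdir),
    Gcs(gcs::Logdir),
}

impl DynLogdir {
    /// Picks an implementation from the form of the `--logdir` flag:
    /// `gs://...` goes to GCS; anything else is a local directory.
    fn new(logdir: &str) -> DynLogdir {
        match logdir.strip_prefix("gs://") {
            Some(rest) => DynLogdir::Gcs(gcs::Logdir(rest.to_string())),
            None => DynLogdir::Disk(DiskLogdir(PathBuf::from(logdir))),
        }
    }
}
```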

Test Plan:
Both the main `//tensorboard` server and the data-server benchmark work
against a GCS logdir:

```
bazel run //tensorboard -- \
    --logdir gs://tensorboard-bench-logs/edge_cgan --load_fast
bazel run //tensorboard/data/server:bench -- \
    --logdir gs://tensorboard-bench-logs/edge_cgan
```

wchargin-branch: rust-gcs-logdir

Summary:
Previously, `LogdirLoader` and `RunLoader` were both hard-tied to the
native OS filesystem, via `walkdir::WalkDir` and `std::fs::File`,
respectively. This patch introduces a `Logdir` trait that abstracts over
listing and reading event files. A `DiskLogdir` implementation recovers
the current behavior (mostly; see below), and we can slot in adapters
for more filesystems as desired.
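
The `open` half of the trait appears verbatim in the review below; a
plausible overall shape (the `discover` signature and the
`EventFileBuf` stand-in are assumptions) looks like:

```rust
use std::io::{self, Read};
use std::path::PathBuf;

/// Stand-in for the crate's event-file path newtype (assumed shape).
pub struct EventFileBuf(pub PathBuf);

/// Illustrative shape of the `Logdir` trait: list event files, then
/// open them for reading.
pub trait Logdir {
    /// Readable event-file type; an associated type, which matters for
    /// the dispatch discussion in review below.
    type File: Read;
    /// Lists event files currently present in the logdir.
    fn discover(&self) -> io::Result<Vec<EventFileBuf>>;
    /// Opens an event file previously returned by `discover`.
    fn open(&self, path: &EventFileBuf) -> io::Result<Self::File>;
}
```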

At this point, it’s convenient to drop the hyper-precise semantics
around non-UTF-8 run name collisions, which only occur if you have two
runs whose names are invalid Unicode and are equal after lossy
conversion. We originally handled this precisely because Rust made it
easy to do so, but it’s never come up as an issue in the real world.
It’s no longer quite so convenient to handle, so we cull the complexity.

Test Plan:
Existing unit tests suffice, and a `--load_fast` end-to-end test still
checks out.

wchargin-branch: rust-logdir-trait
wchargin-source: 7535f5d9c7997b111e80edee5be9f1726f8fbe6b
Summary:
The [`reqwest`] crate provides a high-level HTTP interface, similar in
spirit to Python’s `requests` package. It has built-in support for JSON
serialization and deserialization via `serde` (which we already use) and
is based on the `hyper` stack (which we also already use), so it should
fit in nicely. We’ll use it to make requests to GCS.

[`reqwest`]: https://crates.io/crates/reqwest
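
As a flavor of the API (a minimal sketch assuming the crate's
`blocking` and `json` features; the URL and struct are placeholders,
not anything from this patch):

```rust
use serde::Deserialize;

/// Placeholder shape for a JSON payload.
#[derive(Deserialize)]
struct Status {
    ok: bool,
}

/// One blocking GET, deserialized straight into a struct via `serde`.
fn fetch_status() -> Result<Status, reqwest::Error> {
    reqwest::blocking::get("https://example.com/status.json")?.json()
}
```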

Test Plan:
It builds: `bazel build //third_party/rust:reqwest`.

wchargin-branch: rust-dep-reqwest
wchargin-source: d05d3de3a3d44974e282574ec965ad6bd0024246
Summary:
This patch implements the extent of the Google Cloud Storage protocol
that TensorBoard needs: list objects in a bucket with a given prefix,
and read partial contents of an object. It turns out to be really easy.

For comparison, [TensorFlow also rolls its own GCS client][tf]. Theirs
is more complex because it needs to handle writable files and support
general-purpose caching patterns. By contrast, we have a simple one-pass
read pattern and already assume that files are append-only, so we avoid
both the complexity and pathological interactions like #1225.

For now, this only serves public buckets and objects. Authentication is
also easy (and doesn’t require crypto or anything complicated), but, for
ease of review, we defer it to a future patch.

[tf]: https://github.com/tensorflow/tensorflow/tree/r2.4/tensorflow/core/platform/cloud
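
To make the protocol concrete, here is a rough sketch of those two
calls against the public GCS JSON API using `reqwest::blocking`; the
struct shapes and helper names are illustrative, not the patch's
actual code:

```rust
use serde::Deserialize;

/// Subset of the GCS object-list response that we care about.
#[derive(Deserialize)]
struct ListResponse {
    #[serde(default)]
    items: Vec<Object>,
    #[serde(rename = "nextPageToken")]
    next_page_token: Option<String>,
}

#[derive(Deserialize)]
struct Object {
    name: String,
}

/// Lists one page of objects in `bucket` under `prefix`.
fn list_page(
    http: &reqwest::blocking::Client,
    bucket: &str,
    prefix: &str,
    page_token: Option<&str>,
) -> reqwest::Result<ListResponse> {
    let url = format!("https://storage.googleapis.com/storage/v1/b/{}/o", bucket);
    let mut req = http.get(&url).query(&[("prefix", prefix)]);
    if let Some(token) = page_token {
        req = req.query(&[("pageToken", token)]);
    }
    req.send()?.json()
}

/// Reads `object` from byte `from` through byte `to`, inclusive.
/// (Real code must percent-encode `object`, since names contain `/`.)
fn read_range(
    http: &reqwest::blocking::Client,
    bucket: &str,
    object: &str,
    from: u64,
    to: u64,
) -> reqwest::Result<Vec<u8>> {
    let url = format!(
        "https://storage.googleapis.com/storage/v1/b/{}/o/{}?alt=media",
        bucket, object
    );
    let res = http
        .get(&url)
        .header("Range", format!("bytes={}-{}", from, to))
        .send()?;
    Ok(res.bytes()?.to_vec())
}
```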

Test Plan:
Included a simple client that supports `gsutil ls` and `gsutil cat`. Run
with `RUST_LOG=debug cargo run --release --bin gsutil` and more args:

  - `ls tensorboard-bench-logs` to list all 33K objects in the bucket,
    across 34 pages of list operations (3.3s on my machine);
  - `ls tensorboard-bench-logs --prefix mnist/` to list just a single
    logdir, which should be much faster (0.1 seconds on my machine,
    which includes setting up the keep-alive connection);
  - `cat tensorboard-bench-logs mnist/README --to=11` to print the first
    12 bytes (`Range: bytes=0-11` inclusive) of an object;
  - `cat tensorboard-bench-logs mnist/README --from=9999` to print
    nothing, since the object is shorter than 9999 bytes.

wchargin-branch: rust-gcs-client
wchargin-source: d9e404df57ecf5ee80089b810835a241084ffbc8
Summary:
You can now run `tensorboard --load_fast --logdir gs://...` to launch a
Rust-backed TensorBoard reading from GCS. This patch introduces two new
`Logdir` implementations: one that connects to the GCS client, and one
that delegates to either a `DiskLogdir` or a `gcs::Logdir` depending on
the form of the `--logdir` flag.

Test Plan:
Run:

```
bazel run //tensorboard -- --logdir gs://tensorboard-bench-logs/edge_cgan --load_fast
```

In practice, this makes `tensorboard(1)` fairly slow, because the
DebuggerV2 plugin aggressively scans the GCS logdir. Thus, it might take
a while even for the server link to appear. You can comment out the
debugger plugin in `default.py` to get an idea of the underlying speed.

wchargin-branch: rust-gcs-logdir
wchargin-source: 85d75544b0b2391dbc735270940c61ce783b7748
@wchargin wchargin added type:feature core:rustboard //tensorboard/data/server/... labels Feb 3, 2021
@wchargin wchargin changed the base branch from wchargin-diffbase-gcs to wchargin-rust-gcs-client February 4, 2021 00:55
Base automatically changed from wchargin-rust-gcs-client to master February 4, 2021 01:32
@wchargin wchargin requested a review from nfelt February 4, 2021 01:35
Comment on lines +28 to +32:

```rust
/// A logdir dynamically dispatched over supported implementations.
pub enum DynLogdir {
    Disk(DiskLogdir),
    Gcs(gcs::Logdir),
}
```
wchargin (author):
@nfelt, re: discussion: you can see from the structure of these types
and implementations how they could be easily macroed into existence (all
except `DynLogdir::new`, of course). It's basically "the same thing with
more `$` signs":
https://gist.github.com/wchargin/f44c370bebb61210dbeb11037f0b2d99

I think that I have a weak preference for the explicit implementation,
but the macro's actually not that bad. It does make new cases a one-line
change. Open to suggestions.
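
The macro flavor is roughly this shape (a simplified sketch reusing
the `DynLogdir` enum quoted above, not the gist's exact code):

```rust
// Simplified sketch: one match template stamped out per trait method.
// This works only when the method's return type doesn't mention the
// associated type; `DynLogdir::new` and `open` still need real code.
macro_rules! dispatch {
    ($self:expr, $method:ident($($arg:expr),*)) => {
        match $self {
            DynLogdir::Disk(inner) => inner.$method($($arg),*),
            DynLogdir::Gcs(inner) => inner.$method($($arg),*),
        }
    };
}
```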

nfelt (reviewer):
Explicit seems clearer to me, especially given that we only have 2 implementations now.

What I had in mind when talking about "it would be nice to have a more automatic way to do this" was really more about whether there was something that would generalize not just to additional implementations but also cover the full set of trait methods as well (i.e. the dispatching logic for each method would be autogenerated without even having to include the macro template for each method). So it'd be sort of a compromise position between enums and trait objects, where the downside is that you need both a closed set of types known to the compiler and the trait must be object safe, but the upside is that you have no vtable overhead and you don't have to hand-roll the dispatching code for each trait method.

I'm not even sure that would work for this case where we have the associated type as a complicating factor, but it seems like the sort of thing that would in principle be possible to autogenerate in the basic case.

But this is just hypothesizing; no changes required or expected.
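
As a footnote on the associated-type complication: `dyn Logdir` would
require fixing `File`, so an object-safe route typically erases it
behind a boxed reader. A hedged sketch of that workaround (names
illustrative; not anything proposed here):

```rust
use std::io::{self, Read};

// Illustrative only: an object-safe companion trait that erases the
// associated `File` type by boxing the reader (one vtable hop each).
trait ErasedLogdir {
    fn open_erased(&self, path: &str) -> io::Result<Box<dyn Read>>;
}
```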

@wchargin wchargin (author) commented Feb 9, 2021

@nfelt: Friendly ping? This blocks #4647, and I would like to get these
merged with a bit of buffer to release a data server with GCS support
and teach tb-nightly to depend on it.


```rust
/// Read large chunks from GCS to reduce network roundtrips.
const BUFFER_CAPACITY: usize = 1024 * 1024 * 128;
```
nfelt (reviewer):
Just wondering, did we determine that the buffer is a bottleneck below this size? It's just a significant chunk of RAM when we add it up across many files, so my default inclination would probably be to start out smaller with something like 8-16 MiB. Or maybe at least worth making it tunable with an env var or something (like the TF GCS readahead buffer was).

wchargin (author):

Nope, I haven’t analyzed this. I picked 128 MiB because we use that
number for some similar readahead buffers internally, but I’m happy to
tone it down to 16 MiB for now. Agreed that it should be tunable, and
I’m happy to run benchmark sweeps, too; just don’t want to introduce too
much right off the bat.

With a quick unscientific analysis, I can fully load edge_cgan from
GCS in about 18 seconds with a 128 MiB buffer or 20 seconds with 16 MiB.
The standard deviation here is on the order of seconds in each case, so
it seems fine to proceed conservatively.
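
A sketch of the tunable suggested here, defaulting to the conservative
16 MiB (the env var name is hypothetical, not an actual flag):

```rust
use std::env;

/// Buffer capacity, overridable at runtime via an environment
/// variable (name illustrative only).
fn buffer_capacity() -> usize {
    const DEFAULT: usize = 16 * 1024 * 1024; // 16 MiB
    env::var("TENSORBOARD_DATA_SERVER_GCS_BUFFER_BYTES")
        .ok()
        .and_then(|s| s.parse().ok())
        .unwrap_or(DEFAULT)
}
```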


```rust
fn open(&self, path: &EventFileBuf) -> io::Result<Self::File> {
    // Paths as returned by `discover` are always valid unicode.
    // valid Unicode.
```
nfelt (reviewer):
extra line?

wchargin (author):

Fixed; thanks.

```rust
    .gcs
    .read(&self.bucket, &self.object, range)
    .map_err(reqwest_to_io_error)?;
(&mut buf[0..result.len()]).copy_from_slice(&result);
```
nfelt (reviewer):

I may not be reading this right, but it looks like we have 2 copies here: one inside `gcs::Client`, where we do `body.to_vec()` to go from `Bytes` to `Vec<u8>`, and then another one here, where we go from `Vec<u8>` to our final `[u8]`.

I doubt it's really a big deal performance-wise, but if that is indeed happening, it seems slightly nicer if we could do only one copy? E.g., if we were to have our client directly return `Bytes` rather than `Vec<u8>`, it seems like we could directly call `copy_from_slice(Bytes)`, since `Bytes` is `Borrow<[u8]>`?

wchargin (author):

Indeed! I was planning to do this as an optimization in a follow-up,
since using `Bytes` directly requires adding a dep on the `bytes` crate.
But it's well spotted, and a perfectly appropriate use of `Bytes`.

I don't see an easy way to zero-copy the bytes all the way from the HTTP
response into the `io::Read` output: e.g., neither `reqwest` nor `hyper`
seems to expose a `response.bytes_into(&mut buf)`.

There is a similar large copy in the `ReadBlob` RPC handler, which in
previous drafts I had replaced with an `Arc<[u8]>`, but `Bytes` might be
better there, too. (After all, `Bytes` is "`Arc<[u8]>` on steroids".)
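
For a sense of why `Bytes` helps here, a standalone sketch of the
crate's semantics:

```rust
use bytes::Bytes;

fn main() {
    // One heap allocation; everything below shares it by refcount.
    let body = Bytes::from(vec![0u8; 1024]);
    let head = body.slice(0..16); // zero-copy view into the same buffer
    let alias = body.clone();     // bumps a refcount; no byte copy
    assert_eq!(head.len(), 16);
    assert_eq!(alias.len(), 1024);
}
```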

nfelt (reviewer):

Gotcha, fine by me to do it in a follow-up.

I may be misunderstanding, but isn't it impossible to do a zero-copy transfer of HTTP response bytes into an `&mut [u8]`, which is what `io::Read` takes? Since the signature requires that we stuff the bytes into a mutable buffer we don't control? Unless what you're saying is that we can actually change the underlying buffer pointed to by `buf` to literally just point to the one occupied by `Bytes`, but my understanding was that we can only mutate the contents of `buf`, not repoint it entirely.

wchargin (author), Feb 9, 2021:

I agree that I don’t see a way to do it with the reqwest/et al. APIs,
but you could imagine that it could be possible if each of the layers of
the stack supported it. Something like:

```rust
// mod reqwest::blocking::response
impl Response {
    /// Streams some bytes from the response body directly into the
    /// given buffer. Returns the number of bytes written.
    fn bytes_into<T: BufMut>(&self, x: &mut T) -> crate::Result<usize>;
}

// mod gcs::client
impl Client {
    fn read(&self, bucket, object, buf: impl BufMut) -> self::Result<usize> {
        let res = self.http.get(...).send()?;
        res.bytes_into(&mut buf)
    }
}

// mod gcs::logdir
impl Read for File {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        let n = self.gcs.read(bucket, obj, buf)?;
        self.pos += n as u64;
        Ok(n)
    }
}
```

Or, from a lower-level perspective: at the end of the day, we receive
network data by calling `recvfrom(2)` and passing the kernel a mutable
pointer to a user-space buffer into which to write the received data, so
there's no fundamental reason that that pointer can't point into the
buffer given to `io::Read::read` (at least, as far as I can tell).

> but my understanding was that we can only mutate the contents of `buf`, not repoint it entirely.

Correct, and that is indeed not what I am alluding to.

@wchargin wchargin (author) left a review comment:

thanks!


wchargin added a commit that referenced this pull request Feb 9, 2021
Summary:
The [`bytes`] crate facilitates zero-copy network programming
with the `Bytes` struct, which is a kind of “`Arc<[u8]>` on steroids”
that can be sliced and passed around with shared ownership. Other parts
of our stack already use `Bytes` (notably, `tonic`, `reqwest`, and
`prost`), so by using it ourselves we can remove some unnecessary
clones, as discussed in [review on #4646]. Since this already exists as
a transitive dep, this patch just exposes it for direct use by our code.

[`bytes`]: https://crates.io/crates/bytes
[review on #4646]: #4646 (comment)

Test Plan:
It builds: `bazel build //tensorboard/data/server/cargo:bytes`.

wchargin-branch: rust-dep-bytes
wchargin-source: 13445ec2b94489c55c33e0c3eea775d9e06aa683
@wchargin wchargin merged commit d93cf13 into master Feb 9, 2021
@wchargin wchargin deleted the wchargin-rust-gcs-logdir branch February 9, 2021 22:09
wchargin added a commit that referenced this pull request Feb 10, 2021