
Reference numbers for reverb_benchmark.py #303

Open
hartikainen opened this issue Jun 20, 2023 · 2 comments

hartikainen commented Jun 20, 2023

Hey,

I'm setting up a custom ACME learning algorithm, and the learning currently seems to be quite heavily bottlenecked by the reverb dataset sampling. I tried to understand the reverb dynamics using reverb_benchmark.py, but I'm seeing quite different results with different python/conda environments. With my own full environment, which has a ton of other packages installed besides acme and reverb, I see the following numbers (run on a 40-CPU machine):

Results with multiple dependencies
$ python -m acme.datasets.reverb_benchmark
2023-06-20 10:13:52.573591: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2023-06-20 10:13:52.573642: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2023-06-20 10:13:55.323693: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcublas.so.11'; dlerror: libcublas.so.11: cannot open shared object file: No such file or directory
2023-06-20 10:13:55.323800: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcublasLt.so.11'; dlerror: libcublasLt.so.11: cannot open shared object file: No such file or directory
2023-06-20 10:13:55.323886: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcufft.so.10'; dlerror: libcufft.so.10: cannot open shared object file: No such file or directory
2023-06-20 10:13:55.323974: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcurand.so.10'; dlerror: libcurand.so.10: cannot open shared object file: No such file or directory
2023-06-20 10:13:55.324065: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcusolver.so.11'; dlerror: libcusolver.so.11: cannot open shared object file: No such file or directory
2023-06-20 10:13:55.324174: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcusparse.so.11'; dlerror: libcusparse.so.11: cannot open shared object file: No such file or directory
2023-06-20 10:13:55.324294: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudnn.so.8'; dlerror: libcudnn.so.8: cannot open shared object file: No such file or directory
2023-06-20 10:13:55.324315: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1850] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
[reverb/cc/platform/tfrecord_checkpointer.cc:162]  Initializing TFRecordCheckpointer in /tmp/tmpelvcljx6.
[reverb/cc/platform/tfrecord_checkpointer.cc:552] Loading latest checkpoint from /tmp/tmpelvcljx6
[reverb/cc/platform/default/server.cc:71] Started replay server on port 42699
I0620 10:13:55.333339 140594273768512 reverb_benchmark.py:74] Processed 0 steps
I0620 10:13:55.993398 140594273768512 reverb_benchmark.py:74] Processed 1000 steps
I0620 10:13:56.636906 140594273768512 reverb_benchmark.py:74] Processed 2000 steps
I0620 10:13:57.264292 140594273768512 reverb_benchmark.py:74] Processed 3000 steps
I0620 10:13:57.903438 140594273768512 reverb_benchmark.py:74] Processed 4000 steps
I0620 10:13:58.585648 140594273768512 reverb_benchmark.py:74] Processed 5000 steps
I0620 10:13:59.272802 140594273768512 reverb_benchmark.py:74] Processed 6000 steps
I0620 10:13:59.924327 140594273768512 reverb_benchmark.py:74] Processed 7000 steps
I0620 10:14:00.837419 140594273768512 reverb_benchmark.py:74] Processed 8000 steps
I0620 10:14:01.628472 140594273768512 reverb_benchmark.py:74] Processed 9000 steps
Processing batch_size=256 prefetch_size=0
[reverb/cc/client.cc:165] Sampler and server are owned by the same process (766127) so Table default is accessed directly without gRPC.
[reverb/cc/client.cc:165] Sampler and server are owned by the same process (766127) so Table default is accessed directly without gRPC.
[reverb/cc/client.cc:165] Sampler and server are owned by the same process (766127) so Table default is accessed directly without gRPC.
[reverb/cc/client.cc:165] Sampler and server are owned by the same process (766127) so Table default is accessed directly without gRPC.
Iteration 0 finished in 4.220823764801025s
Iteration 1 finished in 4.302377462387085s
Iteration 2 finished in 3.8878955841064453s
Processing batch_size=256 prefetch_size=1
[reverb/cc/client.cc:165] Sampler and server are owned by the same process (766127) so Table default is accessed directly without gRPC.
Iteration 0 finished in 3.7061126232147217s
Iteration 1 finished in 3.785097599029541s
Iteration 2 finished in 3.6178388595581055s
Processing batch_size=256 prefetch_size=4
[reverb/cc/client.cc:165] Sampler and server are owned by the same process (766127) so Table default is accessed directly without gRPC.
Iteration 0 finished in 3.877387285232544s
Iteration 1 finished in 3.884636878967285s
Iteration 2 finished in 3.94667911529541s
Processing batch_size=2048 prefetch_size=0
Iteration 0 finished in 35.439427852630615s
Iteration 1 finished in 32.06883978843689s
Iteration 2 finished in 33.00999927520752s
Processing batch_size=2048 prefetch_size=1
Iteration 0 finished in 33.30985379219055s
Iteration 1 finished in 34.13146257400513s
Iteration 2 finished in 34.06477403640747s
Processing batch_size=2048 prefetch_size=4
[reverb/cc/client.cc:165] Sampler and server are owned by the same process (766127) so Table default is accessed directly without gRPC.
Iteration 0 finished in 31.547253370285034s
Iteration 1 finished in 32.04759335517883s
Iteration 2 finished in 32.44247269630432s
[...]

This gives me a throughput of only some ~~tens of samples~~ tens of thousands of samples per second, independent of batch_size and prefetch_size. I went ahead and installed a fresh environment with just dm-acme[jax,tf,envs] @ git+https://github.com/deepmind/acme.git@546a47a0154b50145dd9ac3fb3ca57c62e69805f and protobuf>=3.20,<3.21 installed through pip, and that gives me the following results:

Results with simple dependencies
$ python -m acme.datasets.reverb_benchmark
2023-06-20 10:27:26.396250: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2023-06-20 10:27:26.396307: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2023-06-20 10:27:29.425242: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2023-06-20 10:27:29.425343: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcublas.so.11'; dlerror: libcublas.so.11: cannot open shared object file: No such file or directory
2023-06-20 10:27:29.425406: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcublasLt.so.11'; dlerror: libcublasLt.so.11: cannot open shared object file: No such file or directory
2023-06-20 10:27:29.425466: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcufft.so.10'; dlerror: libcufft.so.10: cannot open shared object file: No such file or directory
2023-06-20 10:27:29.425525: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcurand.so.10'; dlerror: libcurand.so.10: cannot open shared object file: No such file or directory
2023-06-20 10:27:29.425582: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcusolver.so.11'; dlerror: libcusolver.so.11: cannot open shared object file: No such file or directory
2023-06-20 10:27:29.425640: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcusparse.so.11'; dlerror: libcusparse.so.11: cannot open shared object file: No such file or directory
2023-06-20 10:27:29.425699: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudnn.so.8'; dlerror: libcudnn.so.8: cannot open shared object file: No such file or directory
2023-06-20 10:27:29.425713: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1850] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
[reverb/cc/platform/tfrecord_checkpointer.cc:162]  Initializing TFRecordCheckpointer in /tmp/tmpn78_87az.
[reverb/cc/platform/tfrecord_checkpointer.cc:552] Loading latest checkpoint from /tmp/tmpn78_87az
[reverb/cc/platform/default/server.cc:71] Started replay server on port 37829
I0620 10:27:29.435296 140393266591552 reverb_benchmark.py:73] Processed 0 steps
I0620 10:27:30.165984 140393266591552 reverb_benchmark.py:73] Processed 1000 steps
I0620 10:27:30.897103 140393266591552 reverb_benchmark.py:73] Processed 2000 steps
I0620 10:27:31.645348 140393266591552 reverb_benchmark.py:73] Processed 3000 steps
I0620 10:27:32.466589 140393266591552 reverb_benchmark.py:73] Processed 4000 steps
I0620 10:27:33.119303 140393266591552 reverb_benchmark.py:73] Processed 5000 steps
I0620 10:27:33.919817 140393266591552 reverb_benchmark.py:73] Processed 6000 steps
I0620 10:27:34.704758 140393266591552 reverb_benchmark.py:73] Processed 7000 steps
I0620 10:27:35.436144 140393266591552 reverb_benchmark.py:73] Processed 8000 steps
I0620 10:27:36.208516 140393266591552 reverb_benchmark.py:73] Processed 9000 steps
Processing batch_size=256 prefetch_size=0
[reverb/cc/client.cc:165] Sampler and server are owned by the same process (780869) so Table default is accessed directly without gRPC.
[reverb/cc/client.cc:165] Sampler and server are owned by the same process (780869) so Table default is accessed directly without gRPC.
[reverb/cc/client.cc:165] Sampler and server are owned by the same process (780869) so Table default is accessed directly without gRPC.
[reverb/cc/client.cc:165] Sampler and server are owned by the same process (780869) so Table default is accessed directly without gRPC.
Iteration 0 finished in 1.1302223205566406s
Processing batch_size=256 prefetch_size=1
[reverb/cc/client.cc:165] Sampler and server are owned by the same process (780869) so Table default is accessed directly without gRPC.
Iteration 0 finished in 1.1211729049682617s
Processing batch_size=256 prefetch_size=4
[reverb/cc/client.cc:165] Sampler and server are owned by the same process (780869) so Table default is accessed directly without gRPC.
Iteration 0 finished in 1.08022141456604s
Processing batch_size=2048 prefetch_size=0
Iteration 0 finished in 8.891673564910889s
Processing batch_size=2048 prefetch_size=1
Iteration 0 finished in 9.745131015777588s
Processing batch_size=2048 prefetch_size=4
[reverb/cc/client.cc:165] Sampler and server are owned by the same process (780869) so Table default is accessed directly without gRPC.
Iteration 0 finished in 8.422712802886963s
Processing batch_size=16384 prefetch_size=0
Iteration 0 finished in 75.7203643321991s
Processing batch_size=16384 prefetch_size=1
Iteration 0 finished in 81.72792387008667s
Processing batch_size=16384 prefetch_size=4
Iteration 0 finished in 81.6889877319336s
[reverb/cc/platform/default/server.cc:84] Shutting down replay server

The throughput here is ~2-3 times faster than with the earlier setup. I'm trying to figure out what the difference between the two environments is (the acme and reverb versions are the same; it's just that the other packages might have overwritten some shared dependencies in the first one). But before that, it would be nice to have some reference results for this script, to understand how far the current throughput is from what is reasonably expected.
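
For context, the measurement I care about corresponds roughly to the sketch below. This is not the benchmark script verbatim, just a minimal version of the same idea; it assumes a reverb server is already running at `address` and populated the way reverb_benchmark.py populates it, and the exact make_reverb_dataset arguments may differ slightly from what the script passes.

```python
import time

from acme import datasets


def time_sampling(address, batch_size, prefetch_size, num_batches):
  """Builds a batched reverb dataset and times `num_batches` sampled batches."""
  dataset = datasets.make_reverb_dataset(
      server_address=address,
      batch_size=batch_size,
      prefetch_size=prefetch_size,
  )
  iterator = iter(dataset)
  start = time.time()
  for _ in range(num_batches):
    next(iterator)  # Pull one batch of `batch_size` samples from the table.
  elapsed = time.time() - start
  print(f'{num_batches} batches of {batch_size} in {elapsed:.2f}s '
        f'(~{num_batches * batch_size / elapsed:,.0f} samples/s)')
```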

Are there any numbers for this available publicly?


hartikainen commented Jun 20, 2023

Apologies, right after posting this and looking at reverb_benchmark.py more closely, I see that the dataset is sampled 1000 times per reported iteration, making the first script's throughput some tens of thousands of samples per second, rather than tens of samples as I said above. This seems like a much more reasonable number, although I'm still confused as to why there's such a big difference between the two installations.
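
Concretely, with 1000 sampled batches per reported iteration, a back-of-the-envelope calculation using the batch_size=256 iteration times from the two runs above gives:

```python
def samples_per_second(batch_size, iteration_seconds, batches_per_iteration=1000):
  return batch_size * batches_per_iteration / iteration_seconds

print(samples_per_second(256, 3.9))   # full environment: ~66k samples/s
print(samples_per_second(256, 1.13))  # fresh pip environment: ~227k samples/s
```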

It would still be nice to get some reference numbers, or confirmation that my numbers are in the right ballpark, if possible.

hartikainen changed the title from "Baseline numbers for reverb_benchmark.py" to "Reference numbers for reverb_benchmark.py" on Jun 20, 2023
hartikainen commented

Comparing my environments, the main difference was that the former (slower) one had grpcio==1.51.1 and libgrpc==1.51.1 installed through conda, whereas the latter (faster) one had grpcio==1.54.2 and libgrpc==1.54.2 installed through pip. After updating the former environment to match the latter, I now see similar performance in both, i.e. the former environment became ~2-3 times faster, giving about 256 * 1000 samples / 3.5s ~= 70k samples/s. That still leaves me wondering whether this is what one would expect from the benchmark.
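
One easy way to double-check which grpcio a given environment actually resolves at runtime, regardless of whether it was installed through pip or conda, is:

```python
import grpc

print(grpc.__version__)  # 1.51.1 in the slower environment, 1.54.2 in the faster one
```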
