
[Perf] Replaced the S3 connector implementation to use the AWS transfer manager #4272

Closed
wants to merge 18 commits

Conversation

@paul-amonson (Contributor)

RE: #3938

Summary:

Averages a 3-4x or more improvement in speed. The improvement is smaller for small files but still positive. There is more run-to-run variability than I would like, but Orri said this is expected. Discussed improvements/tuning at higher layers with Orri as future work.

Note:

Commas have been added to large numbers for readability.

Both runs used these parameters:

-gap 0 -num_in_run 1 -bytes 900,000,000 -seed 6000 -s3_config s3_config -path s3://velox-benchmark-cesg-hyperscalers-aws/test-files/1000MB.file

s3_config Contents:

hive.s3.transfer-manager-max-threads=25

Old GetObject Implementation 900MB:

Run: 900,000,000 Gap: 0 Count: 1 Repeats: 3
85.04564 MB/s 1 pread
94.39024 MB/s 1 preadv
99.87883 MB/s multiple pread
245.17638 MB/s 1 pread mt
238.32858 MB/s 1 preadv mt
248.44832 MB/s multiple pread mt

New TransferManager Implementation 900MB:

Run: 900,000,000 Gap: 0 Count: 1 Repeats: 3
401.64597 MB/s 1 pread
589.92163 MB/s 1 preadv
647.39044 MB/s multiple pread
288.90533 MB/s 1 pread mt
699.8762 MB/s 1 preadv mt
826.0291 MB/s multiple pread mt

@paul-amonson (Contributor Author)

Hi, I am going to need some help fixing the CI for this change. On my dev box I installed the AWS SDK and the build works, but apparently not all of the AWS SDK is available in CI, as it cannot find the transfer library.

@majetideepak (Collaborator) commented Mar 13, 2023

@paul-amonson can you share the details of the hardware you used to evaluate?
How does GetObject with 25 threads compare against TransferManager with max-threads = 25?
What is the reason behind the degradation for TM (288.90533 MB/s, 1 pread mt)?
Can you also share the results for smaller reads?

@paul-amonson (Contributor Author)

I used an r6i.8xlarge EC2 instance in AWS. The smaller numbers (below 20,000,000 bytes) vary widely, but in all cases "pread mt" always underperforms. Using the TransferManager API instead of home-grown threading was a choice to reduce code changes and new code (i.e., fewer new bugs).

GetObject:

Run: 20,000,000 Gap: 0 Count: 1 Repeats: 5
55.317657 MB/s 1 pread
54.81104 MB/s 1 preadv
58.95969 MB/s multiple pread
111.007256 MB/s 1 pread mt
240.77356 MB/s 1 preadv mt
216.78947 MB/s multiple pread mt

TransferManager:

Run: 20,000,000 Gap: 0 Count: 1 Repeats: 5
117.97726 MB/s 1 pread
202.66833 MB/s 1 preadv
233.9537 MB/s multiple pread
68.899185 MB/s 1 pread mt
282.2013 MB/s 1 preadv mt
279.0428 MB/s multiple pread mt

auto downloadHandle = transfer_manager->DownloadFile(
    bucket_, key_, offset, length, [&]() {
      return Aws::New<UnderlyingStreamWrapper>("TestTag", &streamBuffer);
    });
downloadHandle->WaitUntilFinished();
Contributor

Does WaitUntilFinished() wait indefinitely, or does it time out after some time?
If there is a timeout, please mention the value in a comment.

Contributor Author

Waits forever... good catch. Is there a timeout value that should be observed, or should this be a new configuration value?

Contributor

Default to 5s, and expose the value as a configuration.
@majetideepak - what do you suggest?

Collaborator

Can we end up in a case where 5s is not enough? Ideally, we want to avoid configs.

Contributor Author

Anything I do to add a timeout value here will be a busy wait; there is no TransferHandle::Wait() with a timeout. I can test completion of all the parts, but I don't think I can recommend a busy loop. I agree with @majetideepak about avoiding configs; I added one for the TransferManager max threads on request, but I always thought a 25-thread maximum was a good number.


It might be good to log some message somewhere if these requests take a long time.

Contributor Author

Ok, you all win! I will put in the busy loop with logging warnings. BTW, how do I log in Velox? I looked in random files but it's not obvious.
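
(For illustration, a minimal sketch of such a busy-wait with periodic warnings, assuming the AWS SDK's TransferHandle API; the helper name and intervals are illustrative, not this PR's code. Velox logs via glog, so LOG(WARNING) works.)

```cpp
// Sketch only: poll the transfer handle instead of the indefinite
// WaitUntilFinished(), warning periodically if the download runs long.
#include <aws/transfer/TransferHandle.h>
#include <chrono>
#include <thread>
#include <glog/logging.h>

void waitWithWarnings(
    const std::shared_ptr<Aws::Transfer::TransferHandle>& handle) {
  using namespace std::chrono;
  const auto kWarnEvery = seconds(5); // illustrative warning interval
  const auto start = steady_clock::now();
  auto lastWarn = start;
  auto status = handle->GetStatus();
  while (status == Aws::Transfer::TransferStatus::NOT_STARTED ||
         status == Aws::Transfer::TransferStatus::IN_PROGRESS) {
    std::this_thread::sleep_for(milliseconds(100)); // poll granularity
    if (steady_clock::now() - lastWarn >= kWarnEvery) {
      LOG(WARNING) << "S3 download still in progress after "
                   << duration_cast<seconds>(steady_clock::now() - start)
                          .count()
                   << "s";
      lastWarn = steady_clock::now();
    }
    status = handle->GetStatus();
  }
}
```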

Contributor Author

BTW, I will not implement a timed abort for now; it can be added later. The warnings will suffice.

@akashsha1 akashsha1 requested review from oerling and mbasmanova March 14, 2023 19:30
@majetideepak (Collaborator) commented Mar 14, 2023

@paul-amonson Can you share the S3 API cost implication with the Transfer Manager?

I see 216.78947 MB/s for multiple pread mt with GetObject (16 threads default) > 117.97726 MB/s for 1 pread with TM (up to 25 threads). Is this analysis right?

There is always enough parallelism at the operator level, and I worry that TM might introduce more overhead. See another PR #4276.
Can you share the numbers with hive.s3.transfer-manager-max-threads=1? Is this the same as GetObject?

@paul-amonson (Contributor Author)

It appears setting max threads to 1 is NOT the same as GetObject. Looking at the TransferManager source did not show any huge benefits, except that TM uses GetObjectAsync in the thread pool. It does use internal buffers and therefore has memcpys. I will continue looking to see why there is this difference with 1 thread.

GetObject Small:

Run: 20,000,000 Gap: 0 Count: 1 Repeats: 5
78.848434 MB/s 1 pread
89.246666 MB/s 1 preadv
86.416084 MB/s multiple pread
155.94931 MB/s 1 pread mt
321.95337 MB/s 1 preadv mt
308.8326 MB/s multiple pread mt

TransferManager Small:

Threads Available: 1
Run: 20,000,000 Gap: 0 Count: 1 Repeats: 5
130.71059 MB/s 1 pread
211.11209 MB/s 1 preadv
240.55115 MB/s multiple pread
116.59836 MB/s 1 pread mt
275.49197 MB/s 1 preadv mt
122.494675 MB/s multiple pread mt

GetObject Large:

Run: 900,000,000 Gap: 0 Count: 1 Repeats: 3
96.863335 MB/s 1 pread
93.18586 MB/s 1 preadv
99.95754 MB/s multiple pread
245.16904 MB/s 1 pread mt
238.02472 MB/s 1 preadv mt
248.7209 MB/s multiple pread mt

TransferManager Large:

Threads Available: 1
Run: 900,000,000 Gap: 0 Count: 1 Repeats: 3
286.04272 MB/s 1 pread
554.0179 MB/s 1 preadv
841.80756 MB/s multiple pread
665.0374 MB/s 1 pread mt
633.59564 MB/s 1 preadv mt
750.433 MB/s multiple pread mt

@@ -222,6 +227,10 @@ class S3Config {
"hive.s3.iam-role-session-name", std::string("velox-session"));
}

int getTranferManagerMaxThreads() const {
  return config_->get<int>("hive.s3.transfer-manager-max-threads", 25);
}
@pedroerp (Contributor) commented Mar 15, 2023

I know the file was already like this, but it would be nice to pull out these literal strings to the top of the file as constants and give them a name (kS3TransferManagerMaxThreadsConfig or something similar).
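
(A sketch of the suggested refactor; the constant name comes from the comment above, and the getter body is from this PR's diff.)

```cpp
// Config key pulled out to a named constant at the top of the file.
static constexpr const char* kS3TransferManagerMaxThreadsConfig =
    "hive.s3.transfer-manager-max-threads";

int getTranferManagerMaxThreads() const {
  return config_->get<int>(kS3TransferManagerMaxThreadsConfig, 25);
}
```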

Contributor Author

Addressed.

Collaborator

We need to move this to HiveConfig. Filed #4300
Should be done in a separate PR.
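
(For context, a minimal sketch of wiring this config into a TransferManager thread pool using the standard AWS SDK API; apart from the config getter, the names here are assumptions, not this PR's exact code.)

```cpp
#include <aws/core/utils/memory/stl/AWSAllocator.h>
#include <aws/core/utils/threading/Executor.h>
#include <aws/transfer/TransferManager.h>

// hive.s3.transfer-manager-max-threads bounds the executor's pool size.
auto executor = Aws::MakeShared<Aws::Utils::Threading::PooledThreadExecutor>(
    "VeloxS3TransferManager", s3Config.getTranferManagerMaxThreads());
Aws::Transfer::TransferManagerConfiguration tmConfig(executor.get());
tmConfig.s3Client = client_; // existing std::shared_ptr<Aws::S3::S3Client>
auto transferManager = Aws::Transfer::TransferManager::Create(tmConfig);
```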


@majetideepak (Collaborator)

> I will continue looking to see why there is this difference with 1 thread.

@paul-amonson Do we know anything about this? We should also run the VeloxTpchBenchmark with data on S3 with this change to get further insights.

@paul-amonson paul-amonson requested a review from mrambacher March 20, 2023 17:39
@paul-amonson (Contributor Author)

> I will continue looking to see why there is this difference with 1 thread.
>
> @paul-amonson Do we know anything about this? We should also run the VeloxTpchBenchmark with data on S3 with this change to get further insights.

The difference appears to be inside TransferManager. When 1 is specified for TransferManager's max_threads, two threads are in fact used: the internal DoDownload API is called on one thread, which then uses GetObjectAsync on a second thread. Plain GetObject, by contrast, does all the work on the caller's thread.

I am trying to run the suggested benchmark, but I need to ramp up on Kube clusters first.
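
(For comparison, a minimal sketch of the single-call ranged GetObject pattern the old implementation follows, where all work stays on the caller's thread; the helper and its signature are illustrative, not this PR's code.)

```cpp
#include <aws/s3/S3Client.h>
#include <aws/s3/model/GetObjectRequest.h>
#include <cstdint>
#include <string>

// One synchronous ranged read: no SDK-managed threads and no extra memcpy
// through TransferManager's internal buffers.
bool readRange(
    Aws::S3::S3Client& client,
    const std::string& bucket,
    const std::string& key,
    uint64_t offset,
    uint64_t length,
    char* buffer) {
  Aws::S3::Model::GetObjectRequest request;
  request.SetBucket(bucket.c_str());
  request.SetKey(key.c_str());
  // HTTP Range header covering bytes [offset, offset + length - 1].
  request.SetRange(("bytes=" + std::to_string(offset) + "-" +
                    std::to_string(offset + length - 1))
                       .c_str());
  auto outcome = client.GetObject(request);
  if (!outcome.IsSuccess()) {
    return false;
  }
  outcome.GetResult().GetBody().read(
      buffer, static_cast<std::streamsize>(length));
  return true;
}
```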

@paul-amonson (Contributor Author)

CI is failing while installing MinIO. I am not sure why; my changes in this PR all built OK, including installing the new AWS SDK. I need help understanding what is wrong. Thanks.

#!/bin/bash -eo pipefail
set -xu
cd ~/adapter-deps/install/bin/
wget https://dl.min.io/server/minio/release/linux-amd64/archive/minio-20220526054841.0.0.x86_64.rpm
rpm -i minio-20220526054841.0.0.x86_64.rpm
rm minio-20220526054841.0.0.x86_64.rpm

  • cd /root/adapter-deps/install/bin/
    /bin/bash: line 1: cd: /root/adapter-deps/install/bin/: No such file or directory

Exited with code exit status 1
CircleCI received exit code 1

@majetideepak (Collaborator)

> /bin/bash: line 1: cd: /root/adapter-deps/install/bin/: No such file or directory

The upgraded SDK package is being installed at a different location (not /root/adapter-deps/install/) or is not creating a bin directory; the bin directory is missing as a result. You can create the bin directory in the latter case.

@paul-amonson (Contributor Author)

So yet another failure:

FAILED: third_party/arrow_ep/src/arrow_ep-stamp/arrow_ep-download
cd /root/project/_build/debug/third_party/arrow_ep/src && /usr/bin/cmake -P /root/project/_build/debug/third_party/arrow_ep/src/arrow_ep-stamp/download-arrow_ep.cmake && /usr/bin/cmake -P /root/project/_build/debug/third_party/arrow_ep/src/arrow_ep-stamp/verify-arrow_ep.cmake && /usr/bin/cmake -P /root/project/_build/debug/third_party/arrow_ep/src/arrow_ep-stamp/extract-arrow_ep.cmake && /usr/bin/cmake -E touch /root/project/_build/debug/third_party/arrow_ep/src/arrow_ep-stamp/arrow_ep-download
-- Downloading...
dst='/root/project/_build/debug/third_party/arrow_ep/src/apache-arrow-8.0.0.tar.gz'
timeout='none'
-- Using src='https://dlcdn.apache.org/arrow/arrow-8.0.0/apache-arrow-8.0.0.tar.gz'
-- Retrying...
-- Using src='https://dlcdn.apache.org/arrow/arrow-8.0.0/apache-arrow-8.0.0.tar.gz'
-- Retry after 5 seconds (attempt #2) ...
-- Using src='https://dlcdn.apache.org/arrow/arrow-8.0.0/apache-arrow-8.0.0.tar.gz'
-- Retry after 5 seconds (attempt #3) ...
-- Using src='https://dlcdn.apache.org/arrow/arrow-8.0.0/apache-arrow-8.0.0.tar.gz'
-- Retry after 15 seconds (attempt #4) ...
-- Using src='https://dlcdn.apache.org/arrow/arrow-8.0.0/apache-arrow-8.0.0.tar.gz'
-- Retry after 60 seconds (attempt #5) ...
-- Using src='https://dlcdn.apache.org/arrow/arrow-8.0.0/apache-arrow-8.0.0.tar.gz'
CMake Error at arrow_ep-stamp/download-arrow_ep.cmake:159 (message):
Each download failed!

I tried a wget on the above URL and got a 404 error. Ideas?

@majetideepak (Collaborator) left a comment

> We should also run the VeloxTpchBenchmark with data on S3 with this change to get further insights.

@paul-amonson We need more benchmark data to evaluate this change. Let's start by comparing the TableScan times of velox-tpch-benchmark queries. The benchmark will need some modifications (listing files) since the data is now on S3.
I am also setting up an i4i EC2 instance to evaluate this change.

@majetideepak (Collaborator)

I ran query 6 with the modified tpch-benchmark here #4522 and saw a performance degradation.
Hardware Instance type: i4i.4xlarge (https://aws.amazon.com/ec2/instance-types/i4i/)

This branch took 57.83 seconds while the main branch took 31.76 seconds. The TableScan with this change is taking a lot more time.

$ velox_tpch_benchmark -data_path="/home/ec2-user/tables/sf1000" -run_query_verbose=6 -num_splits_per_file=128 -num-drivers=16
Execution time: 57.83s
Splits total: 8192, finished: 8192
-- Aggregation[FINAL a0 := sum("a0")] -> a0:DOUBLE
   Output: 1 rows (32B, 1 batches), Cpu time: 151.53us, Blocked wall time: 0ns, Peak memory: 1.00MB, Memory allocations: 2, Threads: 1
  -- LocalPartition[GATHER] -> a0:DOUBLE
     Output: 32 rows (1.00KB, 32 batches), Cpu time: 409.46us, Blocked wall time: 57.82s, Peak memory: 0B, Memory allocations: 0
     LocalPartition: Input: 16 rows (512B, 16 batches), Output: 16 rows (512B, 16 batches), Cpu time: 369.11us, Blocked wall time: 0ns, Peak memory: 0B, Memory allocations: 0, Threads: 16
     LocalExchange: Input: 16 rows (512B, 16 batches), Output: 16 rows (512B, 16 batches), Cpu time: 40.35us, Blocked wall time: 57.82s, Peak memory: 0B, Memory allocations: 0, Threads: 1
    -- Aggregation[PARTIAL a0 := sum(ROW["p0"])] -> a0:DOUBLE
       Output: 16 rows (512B, 16 batches), Cpu time: 585.99ms, Blocked wall time: 0ns, Peak memory: 1.00MB, Memory allocations: 48, Threads: 16
      -- Project[expressions: (p0:DOUBLE, multiply(ROW["l_extendedprice"],ROW["l_discount"]))] -> p0:DOUBLE
         Output: 57025095 rows (815.16MB, 300716 batches), Cpu time: 734.99ms, Blocked wall time: 0ns, Peak memory: 1.00MB, Memory allocations: 33, Threads: 16
        -- TableScan[table: lineitem, range filters: [(discount, DoubleRange: [0.050000, 0.070000] no nulls), (quantity, DoubleRange: (-inf, 24.000000) no nulls), (shipdate, BigintRange: [8766, 9130] no nulls)]] -> l_shipdate:DATE, l_extendedprice:DOUBLE, l_quantity:DOUBLE, l_discount:DOUBLE
           Input: 57025095 rows (19.30GB, 300716 batches), Output: 57025095 rows (19.30GB, 300716 batches), Cpu time: 6m 40s, Blocked wall time: 0ns, Peak memory: 320.00MB, Memory allocations: 2432146, Threads: 16, Splits: 8192

 $ velox_tpch_benchmark_orig -data_path="/home/ec2-user/tables/sf1000" -run_query_verbose=6 -num_splits_per_file=128 -num-drivers=16
Execution time: 31.76s
Splits total: 8192, finished: 8192
-- Aggregation[FINAL a0 := sum("a0")] -> a0:DOUBLE
   Output: 1 rows (32B, 1 batches), Cpu time: 77.46us, Blocked wall time: 0ns, Peak memory: 1.00MB, Memory allocations: 2, Threads: 1
  -- LocalPartition[GATHER] -> a0:DOUBLE
     Output: 32 rows (1.00KB, 32 batches), Cpu time: 418.71us, Blocked wall time: 31.76s, Peak memory: 0B, Memory allocations: 0
     LocalPartition: Input: 16 rows (512B, 16 batches), Output: 16 rows (512B, 16 batches), Cpu time: 354.51us, Blocked wall time: 0ns, Peak memory: 0B, Memory allocations: 0, Threads: 16
     LocalExchange: Input: 16 rows (512B, 16 batches), Output: 16 rows (512B, 16 batches), Cpu time: 64.20us, Blocked wall time: 31.76s, Peak memory: 0B, Memory allocations: 0, Threads: 1
    -- Aggregation[PARTIAL a0 := sum(ROW["p0"])] -> a0:DOUBLE
       Output: 16 rows (512B, 16 batches), Cpu time: 556.75ms, Blocked wall time: 0ns, Peak memory: 1.00MB, Memory allocations: 48, Threads: 16
      -- Project[expressions: (p0:DOUBLE, multiply(ROW["l_extendedprice"],ROW["l_discount"]))] -> p0:DOUBLE
         Output: 57025095 rows (809.44MB, 300716 batches), Cpu time: 686.31ms, Blocked wall time: 0ns, Peak memory: 1.00MB, Memory allocations: 33, Threads: 16
        -- TableScan[table: lineitem, range filters: [(discount, DoubleRange: [0.050000, 0.070000] no nulls), (quantity, DoubleRange: (-inf, 24.000000) no nulls), (shipdate, BigintRange: [8766, 9130] no nulls)]] -> l_shipdate:DATE, l_extendedprice:DOUBLE, l_quantity:DOUBLE, l_discount:DOUBLE
           Input: 57025095 rows (19.29GB, 300716 batches), Output: 57025095 rows (19.29GB, 300716 batches), Cpu time: 1m 14s, Blocked wall time: 0ns, Peak memory: 416.00MB, Memory allocations: 2432152, Threads: 16, Splits: 8192

@akashsha1 (Contributor)

> I ran query 6 with the modified tpch-benchmark here #4522 and saw a performance degradation. Hardware Instance type: i4i.4xlarge (https://aws.amazon.com/ec2/instance-types/i4i/)
>
> This branch took 57.83 seconds while the main branch took 31.76 seconds. The TableScan with this change is taking a lot more time.

What are your observations on the other TPC-H queries?
From the metrics, I see that with TransferManager the peak memory usage of TableScan is lower and the CPU time is higher.
I am not fully familiar with Velox metrics, but maybe this is an indication that resources are not being fully utilized by the transfer manager, and further tuning (prefetch, read-ahead, coalescing, etc.) will help.

@majetideepak (Collaborator) commented Apr 11, 2023

TPC-H query 1 shows similar behavior: 2m 44s vs. 3m 26s, dominated by TableScan. Ignore the negative row counts; the counter variable type needs to be updated.

~/velox_tpch_benchmark_orig -data_path="/home/ec2-user/tables/sf1000" -run_query_verbose=1 -num_splits_per_file=128 -num-drivers=16
Execution time: 2m 44s
Splits total: 8192, finished: 8192
          -- TableScan[table: lineitem, range filters: [(shipdate, BigintRange: [-9223372036854775808, 10471] no nulls)]] -> l_returnflag:VARCHAR, l_linestatus:VARCHAR, l_quantity:DOUBLE, l_extendedprice:DOUBLE, l_discount:DOUBLE, l_tax:DOUBLE, l_shipdate:DATE
             Input: -1341716328 rows (328.64GB, 301056 batches), Output: -1341716328 rows (328.64GB, 301056 batches), Cpu time: 2m 7s, Blocked wall time: 0ns, Peak memory: 712.00MB, Memory allocations: 4249900, Threads: 16, Splits: 8192
~/velox_tpch_benchmark -data_path="/home/ec2-user/tables/sf1000" -run_query_verbose=1 -num_splits_per_file=128 -num-drivers=16
Execution time: 3m 26s
Splits total: 8192, finished: 8192
          -- TableScan[table: lineitem, range filters: [(shipdate, BigintRange: [-9223372036854775808, 10471] no nulls)]] -> l_returnflag:VARCHAR, l_linestatus:VARCHAR, l_quantity:DOUBLE, l_extendedprice:DOUBLE, l_discount:DOUBLE, l_tax:DOUBLE, l_shipdate:DATE
             Input: -1341716328 rows (328.64GB, 301056 batches), Output: -1341716328 rows (328.64GB, 301056 batches), Cpu time: 4m 59s, Blocked wall time: 0ns, Peak memory: 736.00MB, Memory allocations: 4249900, Threads: 16, Splits: 8192

@majetideepak (Collaborator) commented Apr 11, 2023

Can you re-run the velox_s3read_benchmark with -bytes 2000000? This would give 50 repeats, and the parallel reads would emulate a real scenario. Repeats in parallel mode are equivalent to the number of concurrent threads invoking the read calls.

@paul-amonson (Contributor Author)

> TPC-H query 1 shows similar behavior: 2m 44s vs. 3m 26s, dominated by TableScan. Ignore the negative row counts; the counter variable type needs to be updated.

Can you set the S3 threads to 1 to see if this is on par with GetObject()? FYI: I am still unable to run the velox_tpch_benchmark without a segfault.

@paul-amonson (Contributor Author) commented May 4, 2023

There is currently no benefit to adding the transfer manager to the S3 connector as long as there are libcurl issues with higher numbers of threads being used for download. A future solution that does not use (or that fixes) libcurl may warrant revisiting this, but for now I am closing this PR.

Our focus is going to be on tuning Velox IO for S3 (much work is already being done in this direction).
