
[Perf] Replaced the S3 connector implementation to use the AWS transfer manager #4272

Closed
wants to merge 18 commits

Conversation

@paul-amonson (Contributor)

RE: #3938

Summary:

Averages a 3-4x or more improvement in speed. The improvement is smaller for small files but still positive. There is more run-to-run variability than I would like, but Orri said this is expected. Discussed improvements/tuning at higher layers with Orri as future work.

Note:

Commas have been added to large numbers for readability.

Both runs used these parameters:

-gap 0 -num_in_run 1 -bytes 900,000,000 -seed 6000 -s3_config s3_config -path s3://velox-benchmark-cesg-hyperscalers-aws/test-files/1000MB.file

s3_config Contents:

hive.s3.transfer-manager-max-threads=25

Old GetObject Implementation 900MB:

Run: 900,000,000 Gap: 0 Count: 1 Repeats: 3
85.04564 MB/s 1 pread
94.39024 MB/s 1 preadv
99.87883 MB/s multiple pread
245.17638 MB/s 1 pread mt
238.32858 MB/s 1 preadv mt
248.44832 MB/s multiple pread mt

New TransferManager Implementation 900MB:

Run: 900,000,000 Gap: 0 Count: 1 Repeats: 3
401.64597 MB/s 1 pread
589.92163 MB/s 1 preadv
647.39044 MB/s multiple pread
288.90533 MB/s 1 pread mt
699.8762 MB/s 1 preadv mt
826.0291 MB/s multiple pread mt

@paul-amonson (Contributor Author)

Hi, I am going to need some help fixing the CI for this change. On my dev box I installed the AWS SDK and the build works, but apparently not all of the AWS SDK is available in CI, as it cannot find the transfer library.

@majetideepak (Collaborator) commented Mar 13, 2023

@paul-amonson can you share the details of the hardware you used to evaluate?
How does GetObject with 25 threads compare against TransferManager with max-threads = 25?
What is the reason behind the degradation for TM (288.90533 MB/s, 1 pread mt)?
Can you also share the results for smaller reads?

@paul-amonson (Contributor Author)

I used an r6i.8xlarge EC2 instance in AWS. The smaller numbers (below 20,000,000 bytes) vary widely, but in all cases "pread mt" always underperforms. Using the TransferManager API instead of home-grown threading was a choice to reduce code changes and new code (i.e., fewer new bugs).

GetObject:

Run: 20,000,000 Gap: 0 Count: 1 Repeats: 5
55.317657 MB/s 1 pread
54.81104 MB/s 1 preadv
58.95969 MB/s multiple pread
111.007256 MB/s 1 pread mt
240.77356 MB/s 1 preadv mt
216.78947 MB/s multiple pread mt

TransferManager:

Run: 20,000,000 Gap: 0 Count: 1 Repeats: 5
117.97726 MB/s 1 pread
202.66833 MB/s 1 preadv
233.9537 MB/s multiple pread
68.899185 MB/s 1 pread mt
282.2013 MB/s 1 preadv mt
279.0428 MB/s multiple pread mt

auto downloadHandle = transfer_manager->DownloadFile(
    bucket_, key_, offset, length, [&]() {
      return Aws::New<UnderlyingStreamWrapper>("TestTag", &streamBuffer);
    });
downloadHandle->WaitUntilFinished();
Contributor

Does WaitUntilFinished() wait indefinitely, or does it time out after some time?
If there is a timeout, please mention the value in a comment.

Contributor Author

Waits forever... good catch. Is there a timeout value that should be observed, or should this be a new configuration value?

Contributor

Default to 5s, and expose the value as a configuration.
@majetideepak - what do you suggest?

Collaborator

Can we end up in a case where 5s is not enough? Ideally, we want to avoid configs.

Contributor Author

Anything I do to add a timeout value here will be a busy wait; there is no TransferHandle::Wait() with a timeout. I can test completion of all the parts, but I don't think I can recommend a busy loop. I agree with @majetideepak about avoiding configs; I added one for the TransferManager max threads on request, but I always thought a 25-thread maximum was a good number.


It might be good to log some message somewhere if these requests take a long time.

Contributor Author

Ok, you all win! I will put in the busy loop with logging warnings. BTW, how do I log in Velox? I looked in random files but it's not obvious.
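
(For illustration, a minimal sketch of such a busy-wait with periodic warnings, assuming the AWS SDK's TransferHandle API; the helper name and intervals are illustrative, not this PR's code. Velox logs via glog, so LOG(WARNING) works.)

```cpp
// Sketch only: poll the transfer handle instead of the indefinite
// WaitUntilFinished(), warning periodically if the download runs long.
#include <aws/transfer/TransferHandle.h>
#include <chrono>
#include <thread>
#include <glog/logging.h>

void waitWithWarnings(
    const std::shared_ptr<Aws::Transfer::TransferHandle>& handle) {
  using namespace std::chrono;
  const auto kWarnEvery = seconds(5); // illustrative warning interval
  const auto start = steady_clock::now();
  auto lastWarn = start;
  auto status = handle->GetStatus();
  while (status == Aws::Transfer::TransferStatus::NOT_STARTED ||
         status == Aws::Transfer::TransferStatus::IN_PROGRESS) {
    std::this_thread::sleep_for(milliseconds(100)); // poll granularity
    if (steady_clock::now() - lastWarn >= kWarnEvery) {
      LOG(WARNING) << "S3 download still in progress after "
                   << duration_cast<seconds>(steady_clock::now() - start)
                          .count()
                   << "s";
      lastWarn = steady_clock::now();
    }
    status = handle->GetStatus();
  }
}
```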

Contributor Author

BTW, I will not implement a timed abort for now; it can be added later. The warnings will suffice.

@akashsha1 akashsha1 requested review from oerling and mbasmanova March 14, 2023 19:30
@majetideepak (Collaborator) commented Mar 14, 2023

@paul-amonson Can you share the S3 API cost implication with the Transfer Manager?

I see 216.78947 MB/s for multiple pread mt with GetObject (16 threads default) > 117.97726 MB/s for 1 pread with TM (up to 25 threads). Is this analysis right?

There is always enough parallelism at the operator level, and I worry that TM might introduce more overhead. See another PR #4276.
Can you share the numbers with hive.s3.transfer-manager-max-threads=1? Is this the same as GetObject?

@paul-amonson (Contributor Author)

It appears setting max threads to 1 is NOT the same as GetObject. Looking at the TransferManager source did not show any huge benefits, except that TM uses GetObjectAsync in the thread pool. It does use internal buffers and therefore has memcpys. I will continue looking to see why there is this difference with 1 thread.

GetObject Small:

Run: 20,000,000 Gap: 0 Count: 1 Repeats: 5
78.848434 MB/s 1 pread
89.246666 MB/s 1 preadv
86.416084 MB/s multiple pread
155.94931 MB/s 1 pread mt
321.95337 MB/s 1 preadv mt
308.8326 MB/s multiple pread mt

TransferManager Small:

Threads Available: 1
Run: 20,000,000 Gap: 0 Count: 1 Repeats: 5
130.71059 MB/s 1 pread
211.11209 MB/s 1 preadv
240.55115 MB/s multiple pread
116.59836 MB/s 1 pread mt
275.49197 MB/s 1 preadv mt
122.494675 MB/s multiple pread mt

GetObject Large:

Run: 900,000,000 Gap: 0 Count: 1 Repeats: 3
96.863335 MB/s 1 pread
93.18586 MB/s 1 preadv
99.95754 MB/s multiple pread
245.16904 MB/s 1 pread mt
238.02472 MB/s 1 preadv mt
248.7209 MB/s multiple pread mt

TransferManager Large:

Threads Available: 1
Run: 900,000,000 Gap: 0 Count: 1 Repeats: 3
286.04272 MB/s 1 pread
554.0179 MB/s 1 preadv
841.80756 MB/s multiple pread
665.0374 MB/s 1 pread mt
633.59564 MB/s 1 preadv mt
750.433 MB/s multiple pread mt

@@ -222,6 +227,10 @@ class S3Config {
"hive.s3.iam-role-session-name", std::string("velox-session"));
}

int getTranferManagerMaxThreads() const {
  return config_->get<int>("hive.s3.transfer-manager-max-threads", 25);
}
@pedroerp (Contributor) commented Mar 15, 2023

I know the file was already like this, but it would be nice to pull out these literal strings to the top of the file as constants and give them a name (kS3TransferManagerMaxThreadsConfig or something similar).
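
(A sketch of the suggested refactor; the constant name comes from the comment above, and the getter body is from this PR's diff.)

```cpp
// Config key pulled out to a named constant at the top of the file.
static constexpr const char* kS3TransferManagerMaxThreadsConfig =
    "hive.s3.transfer-manager-max-threads";

int getTranferManagerMaxThreads() const {
  return config_->get<int>(kS3TransferManagerMaxThreadsConfig, 25);
}
```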

Contributor Author

Addressed.

Collaborator

We need to move this to HiveConfig. Filed #4300
Should be done in a separate PR.
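
(For context, a minimal sketch of wiring this config into a TransferManager thread pool using the standard AWS SDK API; apart from the config getter, the names here are assumptions, not this PR's exact code.)

```cpp
#include <aws/core/utils/memory/stl/AWSAllocator.h>
#include <aws/core/utils/threading/Executor.h>
#include <aws/transfer/TransferManager.h>

// hive.s3.transfer-manager-max-threads bounds the executor's pool size.
auto executor = Aws::MakeShared<Aws::Utils::Threading::PooledThreadExecutor>(
    "VeloxS3TransferManager", s3Config.getTranferManagerMaxThreads());
Aws::Transfer::TransferManagerConfiguration tmConfig(executor.get());
tmConfig.s3Client = client_; // existing std::shared_ptr<Aws::S3::S3Client>
auto transferManager = Aws::Transfer::TransferManager::Create(tmConfig);
```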


@majetideepak (Collaborator)

> I will continue looking to see why there is this difference with 1 thread.

@paul-amonson Do we know anything about this? We should also run the VeloxTpchBenchmark with data on S3 with this change to get further insights.

@paul-amonson paul-amonson requested a review from mrambacher March 20, 2023 17:39
@paul-amonson (Contributor Author)

> I will continue looking to see why there is this difference with 1 thread.
>
> @paul-amonson Do we know anything about this? We should also run the VeloxTpchBenchmark with data on S3 with this change to get further insights.

The difference appears to be inside TransferManager. When 1 is specified for TransferManager's max_threads, two threads are in fact used: the internal DoDownload API is called on one thread, which then uses GetObjectAsync on a second thread. Plain GetObject, by contrast, does all the work on the caller's thread.

I am trying to run the suggested benchmark, but I need to ramp up on Kube clusters first.
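
(For comparison, a minimal sketch of the single-call ranged GetObject pattern the old implementation follows, where all work stays on the caller's thread; the helper and its signature are illustrative, not this PR's code.)

```cpp
#include <aws/s3/S3Client.h>
#include <aws/s3/model/GetObjectRequest.h>
#include <cstdint>
#include <string>

// One synchronous ranged read: no SDK-managed threads and no extra memcpy
// through TransferManager's internal buffers.
bool readRange(
    Aws::S3::S3Client& client,
    const std::string& bucket,
    const std::string& key,
    uint64_t offset,
    uint64_t length,
    char* buffer) {
  Aws::S3::Model::GetObjectRequest request;
  request.SetBucket(bucket.c_str());
  request.SetKey(key.c_str());
  // HTTP Range header covering bytes [offset, offset + length - 1].
  request.SetRange(("bytes=" + std::to_string(offset) + "-" +
                    std::to_string(offset + length - 1))
                       .c_str());
  auto outcome = client.GetObject(request);
  if (!outcome.IsSuccess()) {
    return false;
  }
  outcome.GetResult().GetBody().read(
      buffer, static_cast<std::streamsize>(length));
  return true;
}
```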

@paul-amonson (Contributor Author)

CI is failing while installing MinIO. I am not sure why; my changes in this PR all built OK, including installing the new AWS SDK. I need help understanding what is wrong. Thanks.

#!/bin/bash -eo pipefail
set -xu
cd ~/adapter-deps/install/bin/
wget https://dl.min.io/server/minio/release/linux-amd64/archive/minio-20220526054841.0.0.x86_64.rpm
rpm -i minio-20220526054841.0.0.x86_64.rpm
rm minio-20220526054841.0.0.x86_64.rpm

  • cd /root/adapter-deps/install/bin/
    /bin/bash: line 1: cd: /root/adapter-deps/install/bin/: No such file or directory

Exited with code exit status 1
CircleCI received exit code 1

@majetideepak (Collaborator)

> /bin/bash: line 1: cd: /root/adapter-deps/install/bin/: No such file or directory

The upgraded SDK package is being installed at a different location (not /root/adapter-deps/install/) or is not creating a bin directory; the bin directory is missing as a result. You can create the bin directory in the latter case.

@paul-amonson (Contributor Author)

So yet another failure:

FAILED: third_party/arrow_ep/src/arrow_ep-stamp/arrow_ep-download
cd /root/project/_build/debug/third_party/arrow_ep/src && /usr/bin/cmake -P /root/project/_build/debug/third_party/arrow_ep/src/arrow_ep-stamp/download-arrow_ep.cmake && /usr/bin/cmake -P /root/project/_build/debug/third_party/arrow_ep/src/arrow_ep-stamp/verify-arrow_ep.cmake && /usr/bin/cmake -P /root/project/_build/debug/third_party/arrow_ep/src/arrow_ep-stamp/extract-arrow_ep.cmake && /usr/bin/cmake -E touch /root/project/_build/debug/third_party/arrow_ep/src/arrow_ep-stamp/arrow_ep-download
-- Downloading...
dst='/root/project/_build/debug/third_party/arrow_ep/src/apache-arrow-8.0.0.tar.gz'
timeout='none'
-- Using src='https://dlcdn.apache.org/arrow/arrow-8.0.0/apache-arrow-8.0.0.tar.gz'
-- Retrying...
-- Using src='https://dlcdn.apache.org/arrow/arrow-8.0.0/apache-arrow-8.0.0.tar.gz'
-- Retry after 5 seconds (attempt #2) ...
-- Using src='https://dlcdn.apache.org/arrow/arrow-8.0.0/apache-arrow-8.0.0.tar.gz'
-- Retry after 5 seconds (attempt #3) ...
-- Using src='https://dlcdn.apache.org/arrow/arrow-8.0.0/apache-arrow-8.0.0.tar.gz'
-- Retry after 15 seconds (attempt #4) ...
-- Using src='https://dlcdn.apache.org/arrow/arrow-8.0.0/apache-arrow-8.0.0.tar.gz'
-- Retry after 60 seconds (attempt #5) ...
-- Using src='https://dlcdn.apache.org/arrow/arrow-8.0.0/apache-arrow-8.0.0.tar.gz'
CMake Error at arrow_ep-stamp/download-arrow_ep.cmake:159 (message):
Each download failed!

I tried a wget on the above URL and got a 404 error. Ideas?

@majetideepak (Collaborator) left a comment

> We should also run the VeloxTpchBenchmark with data on S3 with this change to get further insights.

@paul-amonson We need more benchmark data to evaluate this change. Let's start by comparing the TableScan times of velox-tpch-benchmark queries. The benchmark will need some modifications (listing files) since the data is now on S3.
I am also setting up an i4i EC2 instance to evaluate this change.

@majetideepak (Collaborator)

I ran query 6 with the modified tpch-benchmark here #4522 and saw a performance degradation.
Hardware Instance type: i4i.4xlarge (https://aws.amazon.com/ec2/instance-types/i4i/)

This branch took 57.83 seconds while the main branch took 31.76 seconds. The TableScan with this change is taking a lot more time.

$ velox_tpch_benchmark -data_path="/home/ec2-user/tables/sf1000" -run_query_verbose=6 -num_splits_per_file=128 -num-drivers=16
Execution time: 57.83s
Splits total: 8192, finished: 8192
-- Aggregation[FINAL a0 := sum("a0")] -> a0:DOUBLE
   Output: 1 rows (32B, 1 batches), Cpu time: 151.53us, Blocked wall time: 0ns, Peak memory: 1.00MB, Memory allocations: 2, Threads: 1
  -- LocalPartition[GATHER] -> a0:DOUBLE
     Output: 32 rows (1.00KB, 32 batches), Cpu time: 409.46us, Blocked wall time: 57.82s, Peak memory: 0B, Memory allocations: 0
     LocalPartition: Input: 16 rows (512B, 16 batches), Output: 16 rows (512B, 16 batches), Cpu time: 369.11us, Blocked wall time: 0ns, Peak memory: 0B, Memory allocations: 0, Threads: 16
     LocalExchange: Input: 16 rows (512B, 16 batches), Output: 16 rows (512B, 16 batches), Cpu time: 40.35us, Blocked wall time: 57.82s, Peak memory: 0B, Memory allocations: 0, Threads: 1
    -- Aggregation[PARTIAL a0 := sum(ROW["p0"])] -> a0:DOUBLE
       Output: 16 rows (512B, 16 batches), Cpu time: 585.99ms, Blocked wall time: 0ns, Peak memory: 1.00MB, Memory allocations: 48, Threads: 16
      -- Project[expressions: (p0:DOUBLE, multiply(ROW["l_extendedprice"],ROW["l_discount"]))] -> p0:DOUBLE
         Output: 57025095 rows (815.16MB, 300716 batches), Cpu time: 734.99ms, Blocked wall time: 0ns, Peak memory: 1.00MB, Memory allocations: 33, Threads: 16
        -- TableScan[table: lineitem, range filters: [(discount, DoubleRange: [0.050000, 0.070000] no nulls), (quantity, DoubleRange: (-inf, 24.000000) no nulls), (shipdate, BigintRange: [8766, 9130] no nulls)]] -> l_shipdate:DATE, l_extendedprice:DOUBLE, l_quantity:DOUBLE, l_discount:DOUBLE
           Input: 57025095 rows (19.30GB, 300716 batches), Output: 57025095 rows (19.30GB, 300716 batches), Cpu time: 6m 40s, Blocked wall time: 0ns, Peak memory: 320.00MB, Memory allocations: 2432146, Threads: 16, Splits: 8192

 $ velox_tpch_benchmark_orig -data_path="/home/ec2-user/tables/sf1000" -run_query_verbose=6 -num_splits_per_file=128 -num-drivers=16
Execution time: 31.76s
Splits total: 8192, finished: 8192
-- Aggregation[FINAL a0 := sum("a0")] -> a0:DOUBLE
   Output: 1 rows (32B, 1 batches), Cpu time: 77.46us, Blocked wall time: 0ns, Peak memory: 1.00MB, Memory allocations: 2, Threads: 1
  -- LocalPartition[GATHER] -> a0:DOUBLE
     Output: 32 rows (1.00KB, 32 batches), Cpu time: 418.71us, Blocked wall time: 31.76s, Peak memory: 0B, Memory allocations: 0
     LocalPartition: Input: 16 rows (512B, 16 batches), Output: 16 rows (512B, 16 batches), Cpu time: 354.51us, Blocked wall time: 0ns, Peak memory: 0B, Memory allocations: 0, Threads: 16
     LocalExchange: Input: 16 rows (512B, 16 batches), Output: 16 rows (512B, 16 batches), Cpu time: 64.20us, Blocked wall time: 31.76s, Peak memory: 0B, Memory allocations: 0, Threads: 1
    -- Aggregation[PARTIAL a0 := sum(ROW["p0"])] -> a0:DOUBLE
       Output: 16 rows (512B, 16 batches), Cpu time: 556.75ms, Blocked wall time: 0ns, Peak memory: 1.00MB, Memory allocations: 48, Threads: 16
      -- Project[expressions: (p0:DOUBLE, multiply(ROW["l_extendedprice"],ROW["l_discount"]))] -> p0:DOUBLE
         Output: 57025095 rows (809.44MB, 300716 batches), Cpu time: 686.31ms, Blocked wall time: 0ns, Peak memory: 1.00MB, Memory allocations: 33, Threads: 16
        -- TableScan[table: lineitem, range filters: [(discount, DoubleRange: [0.050000, 0.070000] no nulls), (quantity, DoubleRange: (-inf, 24.000000) no nulls), (shipdate, BigintRange: [8766, 9130] no nulls)]] -> l_shipdate:DATE, l_extendedprice:DOUBLE, l_quantity:DOUBLE, l_discount:DOUBLE
           Input: 57025095 rows (19.29GB, 300716 batches), Output: 57025095 rows (19.29GB, 300716 batches), Cpu time: 1m 14s, Blocked wall time: 0ns, Peak memory: 416.00MB, Memory allocations: 2432152, Threads: 16, Splits: 8192

@akashsha1 (Contributor)

> I ran query 6 with the modified tpch-benchmark here #4522 and saw a performance degradation. Hardware Instance type: i4i.4xlarge (https://aws.amazon.com/ec2/instance-types/i4i/)
>
> This branch took 57.83 seconds while the main branch took 31.76 seconds. The TableScan with this change is taking a lot more time.

What are your observations on the other TPC-H queries?
From the metrics, I see that with TransferManager the peak memory usage of TableScan is lower and the CPU time is higher.
I am not fully familiar with Velox metrics, but maybe this is an indication that resources are not being fully utilized by the transfer manager, and further tuning (prefetch, read-ahead, coalescing, etc.) will help.

@majetideepak (Collaborator) commented Apr 11, 2023

TPC-H query 1 shows similar behavior: 2m 44s vs. 3m 26s, dominated by TableScan. Ignore the negative row counts; the counter variable type needs to be updated.

~/velox_tpch_benchmark_orig -data_path="/home/ec2-user/tables/sf1000" -run_query_verbose=1 -num_splits_per_file=128 -num-drivers=16
Execution time: 2m 44s
Splits total: 8192, finished: 8192
          -- TableScan[table: lineitem, range filters: [(shipdate, BigintRange: [-9223372036854775808, 10471] no nulls)]] -> l_returnflag:VARCHAR, l_linestatus:VARCHAR, l_quantity:DOUBLE, l_extendedprice:DOUBLE, l_discount:DOUBLE, l_tax:DOUBLE, l_shipdate:DATE
             Input: -1341716328 rows (328.64GB, 301056 batches), Output: -1341716328 rows (328.64GB, 301056 batches), Cpu time: 2m 7s, Blocked wall time: 0ns, Peak memory: 712.00MB, Memory allocations: 4249900, Threads: 16, Splits: 8192
~/velox_tpch_benchmark -data_path="/home/ec2-user/tables/sf1000" -run_query_verbose=1 -num_splits_per_file=128 -num-drivers=16
Execution time: 3m 26s
Splits total: 8192, finished: 8192
          -- TableScan[table: lineitem, range filters: [(shipdate, BigintRange: [-9223372036854775808, 10471] no nulls)]] -> l_returnflag:VARCHAR, l_linestatus:VARCHAR, l_quantity:DOUBLE, l_extendedprice:DOUBLE, l_discount:DOUBLE, l_tax:DOUBLE, l_shipdate:DATE
             Input: -1341716328 rows (328.64GB, 301056 batches), Output: -1341716328 rows (328.64GB, 301056 batches), Cpu time: 4m 59s, Blocked wall time: 0ns, Peak memory: 736.00MB, Memory allocations: 4249900, Threads: 16, Splits: 8192

@majetideepak (Collaborator) commented Apr 11, 2023

Can you re-run the velox_s3read_benchmark with -bytes 2000000? This would give 50 repeats, and the parallel reads would emulate a real scenario. Repeats in parallel mode are equivalent to the number of concurrent threads invoking the read calls.

@paul-amonson (Contributor Author)

> TPC-H query 1 shows similar behavior: 2m 44s vs. 3m 26s, dominated by TableScan. Ignore the negative row counts; the counter variable type needs to be updated.

Can you set the S3 threads to 1 to see if this is on par with GetObject()? FYI: I am still unable to run the velox_tpch_benchmark without a segfault.

@paul-amonson (Contributor Author) commented May 4, 2023

There is currently no benefit to adding the transfer manager to the S3 connector as long as there are libcurl issues with higher numbers of threads being used for download. A future solution that does not use (or that fixes) libcurl may warrant revisiting this, but for now I am closing this PR.

Our focus is going to be on tuning Velox IO for S3 (much work is already being done in this direction).
