Faster ListingTable partition listing (#6182) #6183

tustvold · 2023-05-01T16:51:31Z

Which issue does this PR close?

Closes #6182

Rationale for this change

There were reports of slow listing performance.

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

tustvold · 2023-05-01T16:52:08Z

datafusion/core/src/datasource/listing/helpers.rs

+
+    let batch = RecordBatch::try_new(schema.clone(), arrays)?;
+
+    // TODO: Plumb this down


This was a pre-existing issue.

alamb · 2023-05-16T20:34:40Z

I think it was even worse (the old code creates an entire SessionContext, right?)

kylebrooks-8451 · 2023-05-01T23:28:07Z

I'll give this a try tomorrow, thanks for this!

tustvold · 2023-05-02T10:56:01Z

datafusion/core/src/datasource/listing/url.rs

-            "" => path,
-            p => path.strip_prefix(p)?.strip_prefix(DELIMITER)?,
-        };
+        let stripped = path.as_ref().strip_prefix(self.prefix.as_ref())?;


The previous logic would return None for an exact match
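To make the edge case concrete, here is a minimal sketch on plain `&str` values. It is not the PR's code: `strip_prefix_old` mirrors the removed match arms shown in the diff, while `strip_prefix_exact_ok` is a hypothetical variant that tolerates an exact match. Only `str::strip_prefix` from the standard library is relied on.

```rust
const DELIMITER: &str = "/";

/// Mirrors the removed logic: strip the prefix, then require a delimiter.
/// When `path == prefix`, nothing is left to hold the delimiter, so the
/// second `strip_prefix` fails and the whole call yields `None`.
fn strip_prefix_old<'a>(path: &'a str, prefix: &str) -> Option<&'a str> {
    match prefix {
        "" => Some(path),
        p => path.strip_prefix(p)?.strip_prefix(DELIMITER),
    }
}

/// Hypothetical variant: an exact match returns the empty remainder
/// instead of `None`; a partial segment match is still rejected.
fn strip_prefix_exact_ok<'a>(path: &'a str, prefix: &str) -> Option<&'a str> {
    let stripped = path.strip_prefix(prefix)?;
    if stripped.is_empty() || prefix.is_empty() {
        return Some(stripped);
    }
    stripped.strip_prefix(DELIMITER)
}
```

With a path of `a/b` and a prefix of `a/b`, the first function returns `None` and the second returns `Some("")`. Both still reject `a/bc`, where the prefix ends in the middle of a path segment.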

tustvold · 2023-05-05T13:48:02Z

@kylebrooks-8451 did you manage to give this a go and did it help your use-case?

tustvold · 2023-05-16T05:53:11Z

Marking as ready for review, as this resolves a longstanding TODO and should scale significantly better.

Perhaps @yahoNanJing you might be able to give this a test, as I seem to remember you having some non-trivial workloads that exercise this functionality?

kylebrooks-8451 · 2023-05-16T12:12:24Z

> @kylebrooks-8451 did you manage to give this a go and did it help your use-case?

@tustvold - We did test this on a very large table. We gave it 10 minutes to list and it was still running. For reference, PyArrow with the adlfs fsspec filesystem takes around 2-3 minutes on this same table. Let me test it again on this latest commit.

Edit:

Correction - This was using a trailing / in the URI, which caused it to try to infer the schema for all partitions, which is why it was slow. There is still an issue with how to infer the schema for hierarchical namespace Azure Storage accounts, but that is a separate issue.

tustvold · 2023-05-16T12:19:22Z

> we gave it 10 minutes to list and it still was running

That is disappointing, I presume you are running in release mode? I will push a commit that adds some logs so that we can get some insight into what it is spending its time doing

kylebrooks-8451 · 2023-05-16T12:23:57Z

> > we gave it 10 minutes to list and it still was running
>
> That is disappointing, I presume you are running in release mode? I will push a commit that adds some logs so that we can get some insight into what it is spending its time doing

I see the commit to add logs. Let me build this in release on an Azure VM and try to run against our largest parquet dataset.

tustvold · 2023-05-16T12:26:07Z

> Azure VM

Aah, I didn't realise this was Azure... Azure Blob Storage is notoriously slow. Still, we should be able to at least match fsspec. Interested to see where it is spending time.

kylebrooks-8451 · 2023-05-16T13:58:06Z

Still debugging this, I noticed this error:

Error: ObjectStore(Generic { store: "MicrosoftAzure", source: ListRequest { source: Error { retries: 0, message: "request error", source: Some(reqwest::Error { kind: Request, url: Url { scheme: "https", cannot_be_a_base: false, username: "", password: None, host: Some(Domain("[redacted].blob.core.windows.net")), port: None, path: "/[redacted]", query: Some("restype=container&comp=list&prefix=[redacted]&delimiter=[redacted]"), fragment: None }, source: hyper::Error(Connect, ConnectError("tcp open error", Os { code: 24, kind: Uncategorized, message: "Too many open files" })) }) } } })

tustvold · 2023-05-16T14:09:23Z

> Still debugging this, I noticed this error:

Aah, I worried that might happen. We should probably limit the maximum number of concurrent requests when listing the partitions. Will push a fix later today.

kylebrooks-8451 · 2023-05-16T16:23:51Z

@tustvold It's working now and very fast: 7 seconds for this table. I'm worried these might not be real results, because it seems to be reading only the partition folders and not any parquet files. The inferred schema is only partition columns no actual data from parquet.

Is there some setting I'm missing to read the files / infer their schema?

tustvold · 2023-05-16T16:34:37Z

> The inferred schema is only partition columns no actual data from parquet.

Is it possible your query has predicates that aren't satisfied by any of the partitions, i.e. it is pruning everything out?

tustvold · 2023-05-16T18:08:06Z

Following some investigation with @kylebrooks-8451 I believe the conclusions to be:

- This PR drastically improves the listing performance for queries
- Work still remains to make schema inference comparably fast

alamb

Thank you @tustvold -- I think this PR really improves the code. ❤️

My only concern about this PR is that it removes some tests -- there is probably a good reason for doing so, but I wanted to get the answer / rationale before approving

cc @thinkharderdev / @Dandandan I wonder if you have time to review this idea. It seems like a great idea to me, but I don't think I have a great way to test it.

alamb · 2023-05-16T20:30:08Z

datafusion/core/src/datasource/listing/helpers.rs

@@ -153,225 +151,239 @@ pub fn split_files(
        .collect()
 }

+struct Partition {


I think adding some comments about specifically what the depth and files fields mean would help readability. For example, are the files only files or do they include paths, and what does depth signify?

alamb · 2023-05-16T20:33:42Z

datafusion/core/src/datasource/listing/helpers.rs

+    futures.push(partition.list(store));
+
+    while let Some((partition, paths)) = futures.next().await.transpose()? {
+        if let Some(next) = pending.pop() {


Is there an invariant that if pending is non-empty, then futures.len() prior to the loop is CONCURRENCY_LIMIT? I am trying to work out why only one pending future is pushed to futures rather than pushing while futures.len() < CONCURRENCY_LIMIT.

tustvold · 2023-05-17

Each iteration of the loop can complete at most one future, therefore freeing up at most one "slot" in futures. If pending contains anything, it implies that we were at CONCURRENCY_LIMIT before we polled futures, and therefore we can add at most one future.
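The slot-accounting argument above can be modelled synchronously with the standard library alone. This is only a sketch and not the PR's implementation: the real code polls asynchronous listing futures, whereas here popping the front of a queue stands in for exactly one future completing. `Dir`, `list_children`, `walk`, the uniform fan-out, and the limit of 4 are all hypothetical.

```rust
use std::collections::VecDeque;

/// Hypothetical cap on simultaneously outstanding listings.
const CONCURRENCY_LIMIT: usize = 4;

/// A directory still to be listed: its path and its depth below the root.
struct Dir {
    path: String,
    depth: usize,
}

/// Stand-in for a remote listing call: every directory has `fanout` children.
fn list_children(dir: &Dir, fanout: usize) -> Vec<Dir> {
    (0..fanout)
        .map(|i| Dir {
            path: format!("{}/p{}", dir.path, i),
            depth: dir.depth + 1,
        })
        .collect()
}

/// Lists the root, then every directory shallower than `max_depth`.
/// Returns every path seen and the peak number of in-flight listings.
fn walk(max_depth: usize, fanout: usize) -> (Vec<String>, usize) {
    let mut in_flight: VecDeque<Dir> = VecDeque::new();
    let mut pending: Vec<Dir> = Vec::new();
    let mut out = Vec::new();

    in_flight.push_back(Dir { path: "root".to_string(), depth: 0 });
    let mut peak = in_flight.len();

    // Popping the front models exactly one listing completing.
    while let Some(dir) = in_flight.pop_front() {
        // One slot was just freed, so at most one queued listing is promoted.
        if let Some(next) = pending.pop() {
            in_flight.push_back(next);
        }
        for child in list_children(&dir, fanout) {
            if child.depth >= max_depth {
                out.push(child.path); // deepest level: record, do not list
            } else if in_flight.len() < CONCURRENCY_LIMIT {
                in_flight.push_back(child);
            } else {
                pending.push(child);
            }
        }
        out.push(dir.path);
        peak = peak.max(in_flight.len());
    }
    (out, peak)
}
```

Calling `walk(3, 3)` visits the whole tree while `in_flight` never grows past `CONCURRENCY_LIMIT`. Whenever `pending` is non-empty the working set was already full, so promoting a single entry per completed listing is enough to keep it full.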

alamb · 2023-05-16T20:34:40Z

datafusion/core/src/datasource/listing/helpers.rs

+
+    let batch = RecordBatch::try_new(schema.clone(), arrays)?;
+
+    // TODO: Plumb this down


I think it was even worse (the old code creates an entire SessionContext right)?

alamb · 2023-05-16T20:39:42Z

datafusion/core/src/datasource/listing/url.rs

-                    None => true,
-                };
-
+                let glob_match = self.contains(path);


this is a nice refactoring

alamb · 2023-05-16T20:40:17Z

datafusion/core/src/datasource/listing/helpers.rs

            )
        );
    }

-    #[test]
-    fn test_path_batch_roundtrip_no_partiton() {


Why were these tests removed?

tustvold · 2023-05-17

These tested logic that no longer exists. We no longer encode filenames to a RecordBatch, filter it, and convert it back; instead we evaluate the expressions directly.
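As a rough illustration of evaluating a filter directly against partition values parsed from paths, without building an intermediate batch, consider the sketch below. It is not the PR's code: `partition_values` and `prune` are hypothetical helpers, the filter is a plain Rust closure standing in for a real expression, and Hive-style `column=value` directory names are assumed.

```rust
/// Hypothetical helper: reads Hive-style `column=value` directory segments
/// that follow `table_root` in `path`, one per expected partition column.
fn partition_values<'a>(
    table_root: &str,
    path: &'a str,
    cols: &[&str],
) -> Option<Vec<&'a str>> {
    // Assumes `path` names a file strictly beneath `table_root`.
    let rest = path.strip_prefix(table_root)?.strip_prefix('/')?;
    let mut segments = rest.split('/');
    let mut values = Vec::with_capacity(cols.len());
    for col in cols {
        let (key, value) = segments.next()?.split_once('=')?;
        if key != *col {
            return None; // segment belongs to a different column
        }
        values.push(value);
    }
    Some(values)
}

/// Keeps only the paths whose partition values satisfy `keep`.
/// Paths that do not follow the partition layout are dropped.
fn prune<'a>(
    table_root: &str,
    paths: &[&'a str],
    cols: &[&str],
    keep: impl Fn(&[&str]) -> bool,
) -> Vec<&'a str> {
    let mut kept = Vec::new();
    for &path in paths {
        if let Some(values) = partition_values(table_root, path, cols) {
            if keep(values.as_slice()) {
                kept.push(path);
            }
        }
    }
    kept
}
```

For example, `prune("tbl", &paths, &["year", "month"], |v| v[0] == "2023")` keeps only the files under `year=2023`. Each path is parsed and tested once, with no round trip through an intermediate table.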

alamb · 2023-05-16T20:41:53Z

> Work still remains to make schema inference comparably fast

I can file a ticket for this if you would like

kylebrooks-8451 · 2023-05-16T20:47:02Z

For some perspective, this PR is able to list a ~27,000 partition table in Azure Blob Storage in 7 seconds, whereas a PyArrow Dataset using the adlfs fsspec FileSystem takes 2.5 minutes for the same table. The old DataFusion code before this PR never finished after waiting > 10 minutes. Fantastic work @tustvold!

alamb · 2023-05-16T21:07:09Z

> For some perspective, this PR is able to list a ~27,000 partition table in Azure Blob Storage in 7 seconds, whereas a PyArrow Dataset using the adlfs fsspec FileSystem takes 2.5 minutes for the same table. The old DataFusion code before this PR never finished after waiting > 10 minutes. Fantastic work @tustvold!

🎉 that is amazing -- thank you for the measurement @kylebrooks-8451

alamb · 2023-05-17T19:45:50Z

Schema inference PR: #6366

github-actions bot added the core Core DataFusion crate label May 1, 2023

Faster ListingTable partition listing (apache#6182)

61d922c

tustvold force-pushed the faster-listing branch from f3646c0 to 61d922c Compare May 1, 2023 16:52

Fix strip_prefix

41dc9db

tustvold added 2 commits May 2, 2023 12:25

Fix strip_prefix

a821ba7

Implement list_with_delimiter for MirroringObjectStore

2755b10

tustvold mentioned this pull request May 3, 2023

Faster prefix match in object_store path handling apache/arrow-rs#4164

Merged

tustvold added 2 commits May 3, 2023 22:08

Use split_terminator

be2668d

Fix MirroringObjectStore::list_with_delimiter

bea7578

tustvold marked this pull request as ready for review May 16, 2023 05:54

tustvold added 2 commits May 16, 2023 07:07

Merge remote-tracking branch 'upstream/main' into faster-listing

11d3bcc

Fix logical conflict

8fb9e8e

Add logs

160144b

tustvold added 2 commits May 16, 2023 16:06

Limit concurrency

48089ad

Increase concurrency limit

8f65fb3

tustvold mentioned this pull request May 16, 2023

Return NotFound for directories in Head and Get (#4230) apache/arrow-rs#4231

Merged

alamb reviewed May 16, 2023

Review feedback

fd2edc2

tustvold merged commit 9f808f4 into apache:main May 17, 2023

alamb mentioned this pull request May 17, 2023

Concurrent Parquet Schema Inference #6366

Merged
