Consider resolving a clickbench files as Utf8 (rather than binary) #12510

Closed
alamb opened this issue Sep 17, 2024 · 3 comments

alamb commented Sep 17, 2024

Is your feature request related to a problem or challenge?

The ClickBench benchmark queries use two datasets: a "single file" hits.parquet and a "partitioned" version (hits_partitioned) that has 100 files in a directory. Both hold the same data.

However, DataFusion resolves hits.parquet such that columns like URL are Utf8 or Utf8View, while the same columns in hits_partitioned are resolved as Binary or BinaryView.

This has caused some small slowdowns while enabling StringView by default -- see #12509

You can see the schema resolution by:

cd benchmarks
# download hits.parquet
./bench.sh data clickbench_1
# download hits_partitioned
./bench.sh data clickbench_partitioned

Then run datafusion-cli:

cd data
# hits.parquet has Utf8 columns
datafusion-cli -c 'describe "hits.parquet"' | grep Utf8
| Title                 | Utf8      | NO          |
| URL                   | Utf8      | NO          |
| Referer               | Utf8      | NO          |
...
| UTMContent            | Utf8      | NO          |
| UTMTerm               | Utf8      | NO          |
| FromTag               | Utf8      | NO          |

# hits_partitioned has Binary type for the same columns
datafusion-cli -c 'describe "hits_partitioned"' | grep Binary
| Title                 | Binary    | YES         |
| URL                   | Binary    | YES         |
| Referer               | Binary    | YES         |
...
| UTMContent            | Binary    | YES         |
| UTMTerm               | Binary    | YES         |
| FromTag               | Binary    | YES         |

It seems that, for some reason, the individual files are all resolved to Binary:

datafusion-cli -c 'describe "hits_partitioned/hits_99.parquet"' | grep Binary
| Title                 | Binary    | YES         |
| URL                   | Binary    | YES         |
| Referer               | Binary    | YES         |
| FlashMinor2           | Binary    | YES         |
| UserAgentMinor        | Binary    | YES         |
...
datafusion-cli -c 'describe "hits_partitioned/hits_60.parquet"' | grep Binary
| Title                 | Binary    | YES         |
| URL                   | Binary    | YES         |
| Referer               | Binary    | YES         |
| FlashMinor2           | Binary    | YES         |
| UserAgentMinor        | Binary    | YES         |
...
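
The mismatch shown in the outputs above can be checked mechanically. The following is a minimal sketch, not part of DataFusion: it parses rows in the `| name | type | nullable |` layout printed by `describe` and reports columns whose type or nullability differ. The sample rows are copied from the output above; the helper names `parse_describe` and `schema_mismatches` are hypothetical.

```python
# Sketch: compare two `describe` outputs column by column.
# Sample rows are copied from the issue text; helper names are hypothetical.

SINGLE_FILE = """
| Title                 | Utf8      | NO          |
| URL                   | Utf8      | NO          |
| Referer               | Utf8      | NO          |
"""

PARTITIONED = """
| Title                 | Binary    | YES         |
| URL                   | Binary    | YES         |
| Referer               | Binary    | YES         |
"""


def parse_describe(text):
    """Parse `| name | type | nullable |` rows into {name: (type, nullable)}."""
    columns = {}
    for line in text.splitlines():
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        # Skip blank lines, "..." elisions, and anything that is not a 3-cell row.
        if len(cells) == 3 and cells[0]:
            name, data_type, is_nullable = cells
            columns[name] = (data_type, is_nullable == "YES")
    return columns


def schema_mismatches(left, right):
    """Return {name: (left_entry, right_entry)} for shared columns that differ."""
    return {
        name: (left[name], right[name])
        for name in left
        if name in right and left[name] != right[name]
    }


single = parse_describe(SINGLE_FILE)
partitioned = parse_describe(PARTITIONED)
for name, (a, b) in schema_mismatches(single, partitioned).items():
    print(f"{name}: hits.parquet={a} hits_partitioned={b}")
```

In practice the two strings would be replaced by the full output of the two `datafusion-cli` commands; the sketch only assumes the three-cell row layout shown above.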

Describe the solution you'd like

Ideally, both ClickBench datasets would resolve to the same schema: in this case Utf8, given the contents of the files and the queries that treat these columns as strings.

Describe alternatives you've considered

No response

Additional context

No response

alamb added the enhancement label Sep 17, 2024
alamb changed the title from "Consider resolving a column stored as both Binary and Utf8 as Utf8 (rather than binary)" to "Consider resolving a clickbench files as Utf8 (rather than binary)" Sep 17, 2024

thinh2 commented Sep 17, 2024

take


alamb commented Sep 22, 2024

I looked into this issue more. I think the schema is fundamentally different in the files, and short of some sort of configuration to always cast Binary --> String, there isn't any way we would be able to special case this.
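
To make the "configuration to always cast Binary --> String" idea concrete, here is a minimal schema-level sketch. It is an illustration only: the function `resolve_schema` and the flag name `binary_as_string` are hypothetical and are not taken from this thread or presented as a DataFusion API. Fields are modeled as plain `(name, type, nullable)` tuples.

```python
# Sketch of an opt-in "treat binary columns as strings" schema rewrite.
# `resolve_schema` and `binary_as_string` are hypothetical names.

# Binary types and the string types they would be reinterpreted as.
BINARY_TO_STRING = {
    "Binary": "Utf8",
    "BinaryView": "Utf8View",
}


def resolve_schema(fields, binary_as_string=False):
    """Return the fields, rewriting binary types to string types if requested.

    fields: list of (name, data_type, nullable) tuples.
    """
    if not binary_as_string:
        return list(fields)
    return [
        (name, BINARY_TO_STRING.get(data_type, data_type), nullable)
        for name, data_type, nullable in fields
    ]


partitioned = [("WatchID", "Int64", True), ("Title", "Binary", True)]
print(resolve_schema(partitioned))
print(resolve_schema(partitioned, binary_as_string=True))
```

Two caveats apply to any such option. The rewrite only changes the declared type, so a real implementation would still need to check that the stored bytes are valid UTF-8. It also leaves nullability untouched, so the partitioned columns would still be nullable while the single-file columns are not.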

hits.parquet

Metadata for file: hits.parquet

version: 1
num of rows: 99997497
created by: parquet-cpp version 1.5.1-SNAPSHOT
message schema {
  REQUIRED INT64 WatchID;
  REQUIRED INT32 JavaEnable (INTEGER(16,true));
  REQUIRED BYTE_ARRAY Title (STRING);
...

Thus I am closing this issue as won't do -- please let me know if you have found something different @thinh2

hits_partitioned/hits_55.parquet

Metadata for file: hits_partitioned/hits_55.parquet

version: 1
num of rows: 1000000
created by: parquet-cpp version 1.5.1-SNAPSHOT
message schema {
  OPTIONAL INT64 WatchID;
  OPTIONAL INT32 JavaEnable (INTEGER(16,true));
  OPTIONAL BYTE_ARRAY Title;
...

hits_55.parquet.schema.txt
hits.parquet.schema.txt
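
The difference between the two metadata dumps comes down to the `(STRING)` annotation and the REQUIRED/OPTIONAL repetition. The sketch below shows how those lines lead to the types reported by `describe`. It is a deliberately simplified mapping that handles only the three kinds of lines shown above, not the reader's actual conversion code, and the function name `arrow_field` is hypothetical.

```python
import re

# One Parquet schema line, e.g. "REQUIRED BYTE_ARRAY Title (STRING);"
FIELD = re.compile(
    r"^\s*(REQUIRED|OPTIONAL)\s+(\w+)\s+(\w+)(?:\s+\((.+)\))?;\s*$"
)


def arrow_field(line):
    """Map one Parquet schema line to (name, arrow_type, nullable).

    Simplified: covers only the physical/logical type pairs shown in the
    two metadata dumps above.
    """
    match = FIELD.match(line)
    if match is None:
        raise ValueError(f"not a schema field line: {line!r}")
    repetition, physical, name, annotation = match.groups()
    nullable = repetition == "OPTIONAL"
    if physical == "BYTE_ARRAY":
        # Without a STRING annotation the bytes carry no text semantics.
        data_type = "Utf8" if annotation == "STRING" else "Binary"
    elif physical == "INT32" and annotation == "INTEGER(16,true)":
        data_type = "Int16"  # signed 16-bit integer
    elif physical == "INT64" and annotation is None:
        data_type = "Int64"
    else:
        raise ValueError(f"type not covered by this sketch: {line!r}")
    return name, data_type, nullable


print(arrow_field("  REQUIRED BYTE_ARRAY Title (STRING);"))
print(arrow_field("  OPTIONAL BYTE_ARRAY Title;"))
```

The first call corresponds to hits.parquet and the second to hits_partitioned/hits_55.parquet: the same column name ends up as a non-nullable Utf8 field in one and a nullable Binary field in the other.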

alamb closed this as not planned Sep 22, 2024

alamb commented Sep 22, 2024

I had another idea here: #12509 (comment)
