Evaluate use of INT64 (and setting pq schema as UINT_64) #33

Open
wseaton opened this issue Oct 26, 2021 · 0 comments

Comments


wseaton commented Oct 26, 2021

The INT64 Go type is being used in a lot of places, which translates to the UINT_64 logical type in parquet. This type is incompatible w/ Spark <= 3.1:

org.apache.spark.sql.AnalysisException: Parquet type not supported: INT64 (UINT_64)

This on its own is fine, but it seems to be used in cases where the underlying data is not very large (say, a max value of 2048.0). Is there a way to run an aggregation over the entire series before export and downcast to the smallest suitable logical type? Or maybe even issue a Prometheus query across a wider date range to grab the max, to help prevent schema (read: type) drift? I think this may be partly to blame for the large memory usage on export.
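As a rough sketch of what I mean (hypothetical helper, not the project's actual code): take one pass over the series to find its max, then pick the narrowest Parquet integer annotation that fits, only falling back to UINT_64 when the signed 64-bit range is actually exceeded:

```go
package main

import "fmt"

// pickIntLogicalType is a hypothetical illustration of the downcasting idea:
// scan the exported series once for its maximum value and choose the
// narrowest Parquet integer converted type that can hold it, instead of
// defaulting everything to UINT_64.
func pickIntLogicalType(values []uint64) string {
	var max uint64
	for _, v := range values {
		if v > max {
			max = v
		}
	}
	switch {
	case max <= 1<<31-1:
		return "INT_32" // readable by Spark <= 3.1
	case max <= 1<<63-1:
		return "INT_64" // signed 64-bit is also fine for Spark <= 3.1
	default:
		return "UINT_64" // only needed for values beyond the signed 64-bit range
	}
}

func main() {
	series := []uint64{0, 17, 2048}
	fmt.Println(pickIntLogicalType(series)) // INT_32
}
```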

There's currently no great Spark workaround for the above, since we need to use spark.read.option("mergeSchema", "true") to account for the schema drift internal to Prometheus. The best solution is to use bleeding-edge Spark 3.2.0, which has its own problems 😬

@wseaton wseaton changed the title Evaluate use of INT64 (UINT_64) Evaluate use of INT64 (and setting pq schema as UINT_64) Oct 28, 2021