Can I read DataColumnStatistics of the column only before reading the entire column data ? #252

Closed
mirosuav opened this issue Feb 6, 2023 · 5 comments

@mirosuav
Contributor

mirosuav commented Feb 6, 2023

Hi

Is there a way to read only the DataColumnStatistics before loading the entire column data into memory?
Essentially, I have a method that checks whether a search value can exist in a data column by comparing it against the column's Min and Max values:

public static bool ValueExistsInColumnRange<T>(this DataColumn column, T value)
    where T : IComparable<T>
{
    if (value is null || column.Statistics.MinValue is null || column.Statistics.MaxValue is null ||
        ((T)column.Statistics.MinValue).CompareTo(value) > 0 || ((T)column.Statistics.MaxValue).CompareTo(value) < 0)
        return false;

    return true;
}

If the value cannot exist within the column's range, I can skip the entire column without loading its values into memory. However, I've noticed that ParquetRowGroupReader.ReadColumnAsync loads the entire column data into memory.
How can I load only the column statistics and optionally load the column data on demand?
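
For illustration, the "check statistics first, load data on demand" pattern being asked for could be sketched roughly as follows. This is a minimal, library-independent sketch: `ColumnChunkInfo<T>`, `StatsPruning`, and the `LoadValues` delegate are hypothetical names invented for this example and are not part of Parquet.Net. The delegate stands in for whatever call actually reads the column data.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for one row group's column metadata (not a Parquet.Net type).
public sealed class ColumnChunkInfo<T> where T : struct, IComparable<T>
{
    public T? Min { get; }
    public T? Max { get; }
    // Deferred read: the values are only materialised when this delegate is invoked.
    public Func<T[]> LoadValues { get; }

    public ColumnChunkInfo(T? min, T? max, Func<T[]> loadValues)
    {
        Min = min;
        Max = max;
        LoadValues = loadValues;
    }
}

public static class StatsPruning
{
    // True when the value lies within [Min, Max], or when statistics are missing
    // (assumption: without statistics the chunk cannot be ruled out).
    public static bool MightContain<T>(ColumnChunkInfo<T> chunk, T value)
        where T : struct, IComparable<T>
    {
        if (chunk.Min is null || chunk.Max is null)
            return true;

        return chunk.Min.Value.CompareTo(value) <= 0
            && chunk.Max.Value.CompareTo(value) >= 0;
    }

    // Scans chunks in order, loading values only for chunks whose range admits the value.
    public static bool Contains<T>(IEnumerable<ColumnChunkInfo<T>> chunks, T value)
        where T : struct, IComparable<T>
    {
        foreach (var chunk in chunks)
        {
            if (!MightContain(chunk, value))
                continue; // pruned: LoadValues is never called for this chunk

            if (Array.IndexOf(chunk.LoadValues(), value) >= 0)
                return true;
        }
        return false;
    }
}
```

Note that this sketch treats missing statistics as "cannot rule out" and loads the data, whereas the method above returns false in that case. Which behaviour is appropriate depends on whether the files are guaranteed to carry statistics.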

@mirosuav mirosuav changed the title Can I read DataColumnStatistics of the column only before readin the entire column data ? Can I read DataColumnStatistics of the column only before reading the entire column data ? Feb 6, 2023
@aloneguid
Owner

aloneguid commented Feb 6, 2023

No, not now, but it should be easy to implement.

@aloneguid
Owner

@mirosuav You should be able to access RowGroupReader.ThriftRowGroup metadata. It's a bit rough and may not suit your use case.

@mirosuav
Contributor Author

mirosuav commented Apr 18, 2023

Thanks, @aloneguid. I already have a working solution for that, will PR it once I'm done testing.

@aloneguid
Owner

@mirosuav just wondering, is this in any way related to GeoParquet?

@mirosuav
Contributor Author

@aloneguid no, I don't know GeoParquet :) We're doing our own research comparing different storage formats for big real-time data.
