[BUG]: Getting different length for keyColumn and valueColumn of a partition column #489

Open
shamimashik opened this issue Mar 18, 2024 · 2 comments

shamimashik commented Mar 18, 2024

Library Version

4.23.4

OS

Windows

OS Architecture

64 bit

How to reproduce?

I'm seeing a difference in DataColumn.Data.Length for the keyColumn and the valueColumn of the partitionValues column.

Here are the paths I'm using:
keyPath: "add/partitionValues/key_value/key"
valuePath: "add/partitionValues/key_value/value"

For keyPath, DataColumn.Data.Length is 64865, whereas for valuePath it is 64867.

Note that this issue was not present in version 3.10.0.

Failing test

Code that I used to verify the issue: 

private async Task<DataColumn[]> ReadParquetMyFileAsync(bool treatByteArrayAsString) {
    List<DataColumn> dataColumns = new List<DataColumn>();
    string name = "<filename>.checkpoint.parquet";
    string keyPath = "add/partitionValues/key_value/key";
    string valuePath = "add/partitionValues/key_value/value";

    using(Stream s = OpenTestFile(name)) {
        using(ParquetReader pr = await ParquetReader.CreateAsync(
            s, new ParquetOptions { TreatByteArrayAsString = treatByteArrayAsString })) {
            DataField[] dataFields = pr.Schema.GetDataFields();
            Dictionary<string, DataField> dataFieldMapping = this.RetrieveDataFieldMapping(dataFields);
            for(int i = 0; i < pr.RowGroupCount; ++i) {
                using ParquetRowGroupReader groupReader = pr.OpenRowGroupReader(i);
                if(dataFieldMapping.TryGetValue(keyPath, out DataField keyField) &&
                    dataFieldMapping.TryGetValue(valuePath, out DataField valueField)) {
                    DataColumn keyColumn = await groupReader.ReadColumnAsync(keyField);
                    DataColumn valueColumn = await groupReader.ReadColumnAsync(valueField);
                    Array keyColumnData = keyColumn.Data;
                    Array valueColumnData = valueColumn.Data;
                    dataColumns.Add(keyColumn);
                    dataColumns.Add(valueColumn);

                    string result = string.Empty;
                    // Iterates over the key array only, so any entries of valueColumn beyond keyColumn.Data.Length are not printed.
                    for(int dataIndex = 0; dataIndex < keyColumn.Data.Length; ++dataIndex) {
                        string key = keyColumnData.GetValue(dataIndex).ToString();
                        // Read the value once and print "null" for missing values.
                        object val = valueColumnData.GetValue(dataIndex);
                        result += "[" + dataIndex + "] " + key + ": " + (val == null ? "null" : val.ToString()) + "\n";
                    }
                    Console.WriteLine(result);
                }
            }

            return dataColumns.ToArray();
        }
    }
}
@shamimashik shamimashik changed the title [BUG]: Getting different length for keyColumn and valueColumn of partition column [BUG]: Getting different length for keyColumn and valueColumn of a partition column Mar 18, 2024

mukunku (Contributor) commented Apr 29, 2024

I noticed this a while ago too. I thought that it was intentional to save space if all the values after a certain index are null.

I wrote my code the following way to accommodate the key and value array lengths not being the same.

https://github.com/mukunku/ParquetViewer/blob/77c70c9d2a95c96de28c5701717c08c362d8eb13/src/ParquetViewer.Engine/ParquetEngine.Processor.cs#L252-L273
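
The linked code is not reproduced here. As a rough illustration of the general idea only, the sketch below pairs two arrays of different lengths by index and treats any missing entry as null. `PairKeysWithValues` is a hypothetical helper, not part of Parquet.Net or ParquetViewer, and it assumes that keys and values still line up by index and that only trailing entries are missing, which this thread has not confirmed.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not a Parquet.Net or ParquetViewer API): pairs each key with
// the value at the same index and substitutes null where an array has no entry.
// Assumption: index alignment holds and only trailing entries are missing.
static List<KeyValuePair<object?, object?>> PairKeysWithValues(Array keys, Array values) {
    int count = Math.Max(keys.Length, values.Length);
    var pairs = new List<KeyValuePair<object?, object?>>(count);
    for(int i = 0; i < count; ++i) {
        object? key = i < keys.Length ? keys.GetValue(i) : null;
        object? val = i < values.Length ? values.GetValue(i) : null;
        pairs.Add(new KeyValuePair<object?, object?>(key, val));
    }
    return pairs;
}

// Usage with made-up data: three keys, but only two values.
Array keyData = new string[] { "year", "month", "day" };
Array valueData = new string[] { "2024", "03" };
foreach(var pair in PairKeysWithValues(keyData, valueData)) {
    Console.WriteLine($"{pair.Key ?? "null"}: {pair.Value ?? "null"}");
}
// Prints:
// year: 2024
// month: 03
// day: null
```

In a reader like the one in the failing test, `keyColumn.Data` and `valueColumn.Data` would take the place of the made-up arrays. If the extra entries turn out not to be trailing, index-based pairing like this would misalign keys and values, so it is a workaround rather than a fix.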

aloneguid (Owner) commented

@shamimashik do you have a test file I can use to reproduce this?
