[BUG]: Nulls in a FixedLengthByteArray column throw an exception #486

Open
mbnshtck opened this issue Dec 21, 2024 · 1 comment

Comments

mbnshtck commented Dec 21, 2024

Issue Description

Reading a column that contains null FixedLengthByteArray values fails:

  • If the column contains more than one null item, an exception is thrown.
  • If the first item in the column is null, an exception is thrown.
  • If the column contains exactly one null item and it is not the first item, the first item's value is returned instead of null.

Environment Information

  • ParquetSharp Version: 16.01
  • .NET Framework/SDK Version: .NET Framework 4.7.2
  • Operating System: Windows 11

Steps To Reproduce

// Verbatim string, so that "\t" and "\b" in the path are not treated as escape sequences
string path = @"c:\temp\bronze_ppe_export_datatable_null_decimals_first_not_null.gz.parquet";
int rowGroupIndex = 0;
int columnIndex = 1;

var reader = new ParquetSharp.ParquetFileReader(path);
var rowGroupReader = reader.RowGroup(rowGroupIndex);
var numRows = rowGroupReader.MetaData.NumRows;
var values = new ParquetSharp.FixedLenByteArray[numRows];

var columnReader = rowGroupReader.Column(columnIndex);
using var column = (ParquetSharp.ColumnReader<ParquetSharp.FixedLenByteArray>)columnReader;
// This call fails when the column contains null values
var rowsRead = column.ReadBatch(numRows, values, out var valuesRead);

bronze_ppe_export_datatable_null_decimals_first_not_null.gz.zip

Expected Behavior

column.ReadBatch(numRows, values, out var valuesRead); throws an exception. The read should succeed and report null items as null.

Additional Context (Optional)

@adamreeve changed the title from "[BUG]: <title>Null FixedLengthByteArray throw execption" to "[BUG]: Nulls in a FixedLengthByteArray column throw an execption" on Dec 23, 2024
@adamreeve changed the title from "[BUG]: Nulls in a FixedLengthByteArray column throw an execption" to "[BUG]: Nulls in a FixedLengthByteArray column throw an exception" on Dec 23, 2024
adamreeve (Contributor) commented

Hi @mbnshtck. This is caused by using the ReadBatch overload that doesn't take defLevels and repLevels parameters. This is only valid to do if you know a column has no null values. We should add some documentation to these methods as this behaviour is non-obvious and currently not documented.

If you want to work with the raw FixedLenByteArray values, you will also need to read the definition levels (repetition levels can be ignored for non-nested data). Also note that the values are not read "spaced": all non-null values are packed contiguously at the start of the output array, so the definition levels are what tell you which rows are null:

const int rowGroupIndex = 0;
const int columnIndex = 1;
using var reader = new ParquetFileReader(path);
using var rowGroup = reader.RowGroup(rowGroupIndex);
using var column = (ColumnReader<ParquetSharp.FixedLenByteArray>) rowGroup.Column(columnIndex);
var numRows = rowGroup.MetaData.NumRows;

var definitionLevels = new short[numRows];
var values = new ParquetSharp.FixedLenByteArray[numRows];
var rowsRead = column.ReadBatch(numRows, definitionLevels, repLevels: null, values, out var valuesRead);

Assert.That(rowsRead, Is.EqualTo(8));
Assert.That(valuesRead, Is.EqualTo(2));
var valueOffset = 0;
var value = new byte[column.ColumnDescriptor.TypeLength];
for (var i = 0; i < numRows; i++)
{
    if (definitionLevels[i] == 1)
    {
        Marshal.Copy(values[valueOffset].Pointer, value, 0, column.ColumnDescriptor.TypeLength);
        Console.WriteLine($"values[{i}] = {Convert.ToHexString(value)}");
        valueOffset += 1;
    }
    else
    {
        Console.WriteLine($"values[{i}] = null");
    }
}

But you can also use the higher-level LogicalColumnReader class to read these values as decimal typed data. This will also handle interpreting the definition levels:

const int rowGroupIndex = 0;
const int columnIndex = 1;
using var reader = new ParquetFileReader(path);
using var rowGroup = reader.RowGroup(rowGroupIndex);
using var column = rowGroup.Column(columnIndex).LogicalReader<decimal?>();
var numRows = rowGroup.MetaData.NumRows;
var values = column.ReadAll((int)numRows);

for (var i = 0; i < numRows; i++)
{
    Console.WriteLine($"values[{i}] = {(values[i].HasValue ? values[i]!.Value.ToString() : "null")}");
}
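
For the raw-reader approach, the densely packed values and the definition levels can be recombined into a "spaced" array with one slot per row. The following is only a minimal sketch, assuming a flat (non-nested) optional column whose maximum definition level is 1. `DenseToSpaced` and its parameter names are hypothetical and not part of ParquetSharp:

```csharp
using System;

static class DefinitionLevelExample
{
    // Hypothetical helper (not a ParquetSharp API): expands densely packed
    // non-null values into an array with one entry per row, using the
    // definition levels to decide which rows are null.
    // Assumes a flat optional column, so a row is non-null exactly when its
    // definition level equals maxDefinitionLevel.
    public static T?[] DenseToSpaced<T>(
        short[] definitionLevels, T[] denseValues, short maxDefinitionLevel = 1)
        where T : struct
    {
        var spaced = new T?[definitionLevels.Length]; // all entries start as null
        var next = 0; // index of the next unread value in denseValues
        for (var i = 0; i < definitionLevels.Length; i++)
        {
            if (definitionLevels[i] == maxDefinitionLevel)
            {
                spaced[i] = denseValues[next++];
            }
        }
        return spaced;
    }
}
```

For example, definition levels `{ 1, 0, 1, 0 }` with dense values `{ 10, 20 }` give `{ 10, null, 20, null }`.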
