Add support for nested struct field based filter expressions in Iceberg #122

@prodeezy

I tested struct filter pushdown in Iceberg after applying these dependent code changes:

  1. Spark PR for struct pushdown
  2. Iceberg writers for Parquet
  3. Changes to Metrics collection to add struct metrics in Iceberg

Iceberg rejects the filter with this validation error:

Caused by: com.netflix.iceberg.exceptions.ValidationException: Cannot find field 'location.lat' in struct: struct<1: age: optional int, 2: name: optional string, 3: friends: optional map<string, int>, 4: location: optional struct<7: lat: optional double, 8: lon: optional double>>
  at com.netflix.iceberg.exceptions.ValidationException.check(ValidationException.java:42)
  at com.netflix.iceberg.expressions.UnboundPredicate.bind(UnboundPredicate.java:76)
  at com.netflix.iceberg.expressions.Projections$BaseProjectionEvaluator.predicate(Projections.java:138)
  at com.netflix.iceberg.expressions.Projections$BaseProjectionEvaluator.predicate(Projections.java:94)
  at com.netflix.iceberg.expressions.ExpressionVisitors.visit(ExpressionVisitors.java:147)
  at com.netflix.iceberg.expressions.ExpressionVisitors.visit(ExpressionVisitors.java:160)
  at com.netflix.iceberg.expressions.Projections$BaseProjectionEvaluator.project(Projections.java:108)
  at com.netflix.iceberg.expressions.InclusiveManifestEvaluator.<init>(InclusiveManifestEvaluator.java:57)
  at com.netflix.iceberg.BaseTableScan$1.load(BaseTableScan.java:153)
  at com.netflix.iceberg.BaseTableScan$1.load(BaseTableScan.java:149)
  at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
  at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
  at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
  at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2257)

Test gist: https://gist.github.com/prodeezy/001cf155ff0675be7d307e9f842e1dac

Based on discussions on the dev mailing list and Issue #78, we want to support nested struct filtering in Iceberg. For now, though, we want to avoid mixed cases such as a struct inside a map or a struct inside an array, because those change the semantics of the expression. For example, a.b = 5 can be run on a: struct<b: int> but can't be run on a: list<struct<b: int>>.
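
To make the intended semantics concrete, here is a minimal sketch of resolving a dotted field name by descending through structs only. The `Type` and `Kind` classes and the `resolve` method are hypothetical, simplified stand-ins written for this illustration. They are not Iceberg's type or binding API, and list and map element types are deliberately left unmodeled. The example schema mirrors the one in the error message above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class NestedFieldLookup {
  // Hypothetical, simplified type model for illustration; not Iceberg's Types API.
  enum Kind { PRIMITIVE, STRUCT, LIST, MAP }

  static final class Type {
    final Kind kind;
    // Child fields by name; only populated for STRUCT types in this sketch.
    final Map<String, Type> fields = new LinkedHashMap<>();

    Type(Kind kind) {
      this.kind = kind;
    }

    Type field(String name, Type type) {
      fields.put(name, type);
      return this;
    }
  }

  // Resolve a dotted name such as "location.lat", descending through structs only.
  // Returns null when a segment is missing or when the path would cross a list or map.
  static Type resolve(Type root, String dottedName) {
    Type current = root;
    for (String part : dottedName.split("\\.")) {
      if (current.kind != Kind.STRUCT) {
        return null; // cannot step into a primitive, list, or map
      }
      current = current.fields.get(part);
      if (current == null) {
        return null; // no field with this name at this level
      }
    }
    return current;
  }

  // Schema shaped like the one in the validation error: age, name, friends, location.
  static Type exampleSchema() {
    Type location = new Type(Kind.STRUCT)
        .field("lat", new Type(Kind.PRIMITIVE))
        .field("lon", new Type(Kind.PRIMITIVE));
    return new Type(Kind.STRUCT)
        .field("age", new Type(Kind.PRIMITIVE))
        .field("name", new Type(Kind.PRIMITIVE))
        .field("friends", new Type(Kind.MAP))
        .field("location", location);
  }

  public static void main(String[] args) {
    Type schema = exampleSchema();
    // A flat lookup of the full dotted name in the top-level struct finds nothing.
    System.out.println(schema.fields.containsKey("location.lat"));
    // Walking the path segment by segment finds the nested field.
    System.out.println(resolve(schema, "location.lat") != null);
    // Paths that cross a map (or a list) are rejected rather than reinterpreted.
    System.out.println(resolve(schema, "friends.key") != null);
  }
}
```

Returning null for any path that crosses a list or map keeps a.b = 5 limited to the case where it has a single, unambiguous value per row, which is the restriction described above.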

Issue #78 focuses on adding metrics in Iceberg for struct fields. This issue addresses expression handling for those fields once the metrics are available.

/cc @aokolnychyi @rdblue
