feat(table/scanner): Initial pass for planning a scan and returning the files to use #118
Conversation
@Fokko @nastra This should be ready for review now, though there's a weirdness in the number of data files being created for one of the integration-testing tables on CI here vs. when I run docker compose and the provisioning locally. I don't know enough about spark-iceberg internals to know whether that is a quirk, expected, or something I should change the tests for. Any ideas? I've added a comment in the test code below.
@nastra Any further comments?

Thanks for the patience here @zeroshade. I'll do a full review in the next 2-3 days.
io/s3.go (outdated):
```go
			HostnameImmutable: true,
		}, nil
	})))
	opts = append(opts, func(o *s3.Options) {
```
For the future, I think it would be great to extract such things into a separate (small) PR. That way we can get PRs reviewed faster; otherwise it's quite difficult to find long periods of time to review a huge chunk of new code that is mixed with other changes.
Sorry about this. I originally intended for this to be a separate change, but the CI started failing without it and I wasn't able to get CI testing of the new changes otherwise, so I ended up adding it to this PR.
literals.go (outdated):
```go
type TypedLiteral[T LiteralType] interface {
	Comparator() Comparator[T]
}

type NumericLiteral interface {
```
do we need this type?
It's used with the transforms, specifically for `truncateNumber`. Looking through this, I could probably change it to be unexported if we want, though.
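For illustration, here is a rough sketch of what this interface and its use in a truncate transform might look like; the method set and the helper below are assumptions, not necessarily the PR's definitions:

```go
// Hypothetical sketch: a numeric literal that can step to the
// adjacent representable value, which a truncate-transform
// projection could use to turn exclusive bounds into inclusive ones.
// Only the NumericLiteral name appears in the diff above; Literal is
// the package's existing literal interface.
type NumericLiteral interface {
	Literal
	Increment() Literal // next representable value
	Decrement() Literal // previous representable value
}

// toInclusiveUpper (hypothetical) converts an exclusive upper bound
// into an inclusive one before the truncate width is applied.
func toInclusiveUpper(lit NumericLiteral) Literal {
	return lit.Decrement()
}
```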
manifest.go (outdated):
```go
for k, v := range input {
	switch v := v.(type) {
	case map[string]any:
		for typname, val := range v {
```
nit: did you mean to call this `typeName`?
manifest.go (outdated):
```go
//
// Becomes:
//
// map[string]any{"ts": map[string]any{"int.date": time.Time{}}}
```
are we losing the field-id here?
Currently yes, the hamba/avro library doesn't return the field-id back to us at all. I think I can possibly leverage the https://pkg.go.dev/github.com/hamba/avro/v2#Schema object in the library to get the field-id property, but I haven't spent enough time there yet to work out the best way to handle it.
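A rough sketch of that idea (an assumption for illustration, not code from this PR), parsing a schema with hamba/avro and reading the custom attribute back, assuming `Prop` surfaces non-reserved properties such as `field-id`:

```go
package main

import (
	"fmt"

	"github.com/hamba/avro/v2"
)

func main() {
	// Parse a schema whose fields carry Iceberg's custom "field-id"
	// attribute, then read it back from the parsed schema.
	schema := avro.MustParse(`{
		"type": "record",
		"name": "manifest_entry",
		"fields": [
			{"name": "status", "type": "int", "field-id": 0},
			{"name": "snapshot_id", "type": "long", "field-id": 1}
		]
	}`)

	rec := schema.(*avro.RecordSchema)
	for _, f := range rec.Fields() {
		// Prop returns custom attributes preserved during parsing
		// (assumed behavior for field-level properties).
		fmt.Println(f.Name(), "field-id:", f.Prop("field-id"))
	}
}
```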
manifest.go (outdated):
```go
	case map[string]any:
		for typname, val := range v {
			switch typname {
			case "int.date":
```
can you elaborate on where this string representation is coming from?
The representation is coming from the hamba/avro library. When we unmarshal the data, it constructs the key as `type.logical-type`. As described in the comment above, the avro schema has something like `"type": {"type": "int", "logicalType": "date"}`, so the hamba/avro library will denote that with the key `int.date`. All of the representations below come from the avro specification for logical types.
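Based on the example above, a minimal standalone sketch of unwrapping that representation (the helper name is hypothetical; it assumes the union decodes to a one-entry map whose value hamba/avro has already converted, e.g. to a time.Time, as shown in the `map[string]any{"int.date": time.Time{}}` example):

```go
package main

import (
	"fmt"
	"time"
)

// unwrapUnion unwraps hamba/avro's union encoding: a one-entry
// map[string]any keyed by "<type>.<logicalType>".
func unwrapUnion(v any) any {
	m, ok := v.(map[string]any)
	if !ok || len(m) != 1 {
		return v
	}
	for typName, val := range m {
		switch typName {
		case "int.date", "long.timestamp-micros":
			// The library has already decoded these logical types,
			// so the value arrives as a time.Time.
			return val.(time.Time)
		default:
			return val
		}
	}
	return v
}

func main() {
	decoded := map[string]any{"int.date": time.Date(2024, 1, 1, 0, 0, 0, 0, time.UTC)}
	fmt.Println(unwrapUnion(decoded)) // 2024-01-01 00:00:00 +0000 UTC
}
```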
table/evaluators.go (outdated):
```go
	return (&manifestEvalVisitor{partitionFilter: boundFilter}).Eval, nil
}

type manifestEvalVisitor struct {
```
I'm a little confused: is this new code or just code that has been moved around?
This is just code that was moved around. It doesn't need to be exported, so I was able to shift it into the table package to limit access rather than export it from the main iceberg package.
(force-pushed from e7a8eba to 66797d6)
@zeroshade could you please rebase this one now that all the other PRs are merged?
(force-pushed from 7a111bd to 2087008)
@nastra All rebased already 😄
(force-pushed from 4c952ef to f4e008e)
```go
// for some reason when I run the provisioning locally i get 5 data files
// but GHA CI running spark provisioning ends up with only 4 files?
// anyone know why?
{"test_uuid_and_fixed_unpartitioned", iceberg.AlwaysTrue{}, 4},
```
5 should be correct, right?
I think the default max parallelism of Spark is capped by the number of CPUs, so probably one of the data files contains two rows.
it's probably fine to follow up on this issue in a separate PR @zeroshade
```python
from pyiceberg.types import FixedType, NestedField, UUIDType

spark = SparkSession.builder.getOrCreate()
```
Love it
Very rough initial implementation of metrics evaluation and a simple scanner for Tables that produces the list of `FileScanTask`s needed to perform a scan, along with positional delete files and so on.

This also includes a framework and setup for integration testing, adapted from the approach used in pyiceberg: docker images plus a file of tests that are only executed when the `integration` build tag is set, which a new workflow uses to run them.

This provides an end-to-end case of using a table and a row-filter expression to perform manifest and metrics evaluations and create the plan for scanning. The next step would be actually fetching the data!
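To make the shape of this concrete, here is a hypothetical usage sketch; the module path, option and method names (`Scan`, `WithRowFilter`, `PlanFiles`), and the task fields are assumptions for illustration, not a description of the PR's final API:

```go
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/apache/iceberg-go"
	"github.com/apache/iceberg-go/table"
)

// planScan sketches planning a scan with a row filter: the scanner
// binds the filter to the table schema, prunes manifests and data
// files via the metrics evaluators, and returns file-scan tasks.
func planScan(ctx context.Context, tbl *table.Table) {
	scan := tbl.Scan(
		table.WithRowFilter(
			iceberg.GreaterThanEqual(iceberg.Reference("ts"), "2024-01-01")))

	tasks, err := scan.PlanFiles(ctx)
	if err != nil {
		log.Fatal(err)
	}
	for _, t := range tasks {
		// Each task pairs a data file with any positional delete
		// files that apply to it (field names assumed).
		fmt.Println(t.File.FilePath(), "deletes:", len(t.DeleteFiles))
	}
}

func main() {} // placeholder; obtaining a *table.Table requires a catalog
```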