API, Core: Add geometry and geography types support #12346

Open: wants to merge 7 commits into base: main

Member

@Kontinuation Kontinuation commented Feb 20, 2025

This adds 2 primitive types to iceberg-api and iceberg-core for supporting geospatial data types, partially implementing the iceberg geo spec: #10981

The newly added primitive types are:

  • geometry(C): geometries with linear/planar edge interpolation
  • geography(C, A): geometries with non-linear edge interpolation; the algorithm for interpolating edges is parameterized by A.

@jiayuasu
Member

@rdblue @szehon-ho @flyrain please review when you have time 🙏🏻

Collaborator

@szehon-ho szehon-ho left a comment

Hey, really sorry for the delay; let's work on it this week. Some early comments:


private final Geometry geometry;

public Geography(Geometry geometry) {
Collaborator

Why do we need interop between Geography and Geometry? I assumed we would just have a class for each.

Member Author

We define Geography this way to reuse the data structures for coordinates and geospatial objects (Point, LineString, etc.), as well as the WKB/WKT functions provided by JTS. Geography does not interoperate with Geometry; they are different classes that share the same data structure.

Collaborator

Can we have a base class and two implementing classes then? It's confusing as it is.

Member Author

I'll try removing Geometry and Geography from iceberg-api, and use ByteBuffer as the underlying Java class for these types.

@@ -88,11 +91,64 @@ public boolean test(T value) {
return String.valueOf(value).startsWith((String) literal.value());
case NOT_STARTS_WITH:
return !String.valueOf(value).startsWith((String) literal.value());
case ST_INTERSECTS:
Collaborator

@szehon-ho szehon-ho Feb 26, 2025

Let's split out predicate pruning support. I feel 80% of the changes are there to support it; let's focus in this PR on getting the API right.

Member Author

Predicates and pruning were removed from this PR.

Collaborator

Thank you

build.gradle Outdated
@@ -292,6 +292,7 @@ project(':iceberg-api') {

dependencies {
implementation project(path: ':iceberg-bundled-guava', configuration: 'shadow')
api libs.jts.core
Collaborator

@szehon-ho szehon-ho Mar 4, 2025

I am not comfortable leaking the JTS API from the Iceberg API, as this forces all consumers to depend on JTS. Can we instead encapsulate this dependency in iceberg-core? I'm going to check with other Iceberg PMC members on this.

At the very least, this should be an implementation dependency.

if (primitive.typeId() == Type.TypeID.GEOMETRY) {
Types.GeometryType geometryType = (Types.GeometryType) primitive;
generator.writeStartObject();
generator.writeStringField("type", "geometry");
Collaborator

Nit: use the constant for "type"

Member Author

@Kontinuation Kontinuation Mar 5, 2025

Replaced constant literals such as "type", "geometry", "geography", "crs" and "algorithm" with constants.

@@ -141,11 +143,34 @@ static void toJson(Types.MapType map, JsonGenerator generator) throws IOExceptio
}

static void toJson(Type.PrimitiveType primitive, JsonGenerator generator) throws IOException {
generator.writeString(primitive.toString());
if (primitive.typeId() == Type.TypeID.GEOMETRY) {
Collaborator

Nit: use a switch statement like the rest of the code.

Member Author

Refactored to use switch-case.

generator.writeStringField("type", "geometry");
String crs = geometryType.crs();
if (!crs.isEmpty()) {
generator.writeStringField("crs", geometryType.crs());
Collaborator

Nit: "crs" seems used frequently enough to warrant a constant.

@@ -535,6 +535,8 @@ private static int estimateSize(Type type) {
return ((Types.FixedType) type).length();
case BINARY:
case VARIANT:
case GEOMETRY:
case GEOGRAPHY:
Collaborator

Can we add a comment explaining how we arrived at 80?

Member Author

We cannot give an accurate estimate without statistics about the data, so I've simply reused the number for BINARY and VARIANT.

80 is roughly the size of a polygon or linestring with 4 coordinates, which denotes a box. This overestimates for a dataset full of points and underestimates for datasets containing complex shapes, but I think it is a reasonable guess.
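The figure of 80 can be sanity-checked against the standard 2D WKB layout (1-byte byte-order flag, 4-byte geometry type, 4-byte counts, 16 bytes per XY coordinate). The sketch below is only illustrative arithmetic; the class and method names are hypothetical and are not part of this PR.

```java
// Hypothetical helper, not from the PR: sizes of small 2D geometries in standard WKB.
final class WkbSizeEstimate {
  private static final int BYTE_ORDER = 1;                   // 1-byte endianness flag
  private static final int GEOMETRY_TYPE = 4;                // uint32 geometry type code
  private static final int COUNT = 4;                        // uint32 count (points or rings)
  private static final int XY_COORDINATE = 2 * Double.BYTES; // two 8-byte doubles

  // Point: header plus one coordinate, no count field.
  static int pointSize() {
    return BYTE_ORDER + GEOMETRY_TYPE + XY_COORDINATE;
  }

  // LineString: header, point count, then the coordinates.
  static int lineStringSize(int numPoints) {
    return BYTE_ORDER + GEOMETRY_TYPE + COUNT + numPoints * XY_COORDINATE;
  }

  // Polygon with a single ring: header, ring count, point count, then the coordinates.
  static int singleRingPolygonSize(int numPoints) {
    return BYTE_ORDER + GEOMETRY_TYPE + COUNT + COUNT + numPoints * XY_COORDINATE;
  }
}
```

With 4 coordinates this gives 73 bytes for a linestring and 77 bytes for a single-ring polygon, both close to 80, while a 2D point takes 21 bytes.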

Collaborator

Ok I meant code comment :)

Member Author

I've added a code comment explaining why we use 80 (even though the choice was made arbitrarily, as in https://github.com/apache/iceberg/pull/11324/files#r1821277533), and the geospatial types no longer fall through to the same case as BINARY and VARIANT.

@@ -70,6 +75,20 @@ public static Type fromTypeName(String typeString) {
return TYPES.get(lowerTypeString);
}

if (lowerTypeString.startsWith("geometry")) {
Collaborator

Why can't we put the startsWith in the regex, like DECIMAL and FIXED?

Member Author

We cannot match the regex against lowerTypeString, because we want to extract the original CRS from the type string rather than a lowercased one. I'll see how to make it more concise and get rid of startsWith.
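One possible way to drop startsWith while keeping the CRS in its original case is to match a case-insensitive pattern against the original type string. This is a minimal sketch under assumptions: the pattern, class name, and method name are hypothetical and do not reflect the code that was eventually committed.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not from the PR.
final class GeometryTypeParsing {
  // Matches "geometry" or "geometry(<crs>)"; only the keyword is case-insensitive,
  // the captured CRS keeps whatever case the caller wrote.
  private static final Pattern GEOMETRY =
      Pattern.compile("geometry\\s*(?:\\(\\s*([^)]*?)\\s*\\))?", Pattern.CASE_INSENSITIVE);

  // Returns the CRS as written, "" when no CRS is given, or null for non-geometry strings.
  static String parseCrs(String typeString) {
    Matcher matcher = GEOMETRY.matcher(typeString);
    if (!matcher.matches()) {
      return null;
    }
    String crs = matcher.group(1);
    return crs == null ? "" : crs;
  }
}
```

Because the match runs on the original string, "GEOMETRY(EPSG:4326)" yields the CRS "EPSG:4326" unchanged, and no separate lowercased copy is needed for this branch.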

return new Literals.GeometryLiteral(value);
}

static Literal<Geography> of(Geography value) {
Collaborator

I guess here, let's specifically not leak Geography into the API; we should just have a wrapper class as well.

Member Author

I'll remove GeometryLiteral and GeographyLiteral. Iceberg expressions should only work with bounding boxes, so there's no need to involve full-fledged Geometry and Geography objects.


}

public static GeographyType of(String crs) {
return of(crs, DEFAULT_ALGORITHM);
Collaborator

@szehon-ho szehon-ho Mar 5, 2025

Another option is to allow null for the algorithm, right? The code can then default it, as per the spec.

Member Author

Fixed. The algorithm field is null when no algorithm is specified.

@Kontinuation
Member Author

Kontinuation commented Mar 5, 2025

I have removed the JTS dependency from iceberg-api; the underlying Java type of both Geometry and Geography is now ByteBuffer. JTS is now an implementation dependency of iceberg-core.

I have removed the code for converting between ByteBuffer and geospatial types, removed GeometryLiteral/GeographyLiteral, and removed default initial/write value support for geospatial fields. I'll proceed with adding them once we all agree on the API of the primitive geospatial types.

Matcher geography = GEOGRAPHY_PARAMETERS.matcher(typeString.substring(9));
if (geography.matches()) {
return GeographyType.of(geography.group(1), geography.group(2));
}
Contributor

In both geometry and geography cases, do you think we should throw IllegalArgumentException when patterns don't match?

Member Author

We'll throw an IllegalArgumentException at the bottom of this function if none of the patterns match: "Cannot parse type string to primitive: " + typeString. I think this is already informative. Do you think we need a more specific error message for geospatial types?

Contributor

No worries, I didn't pay attention to the very bottom. LGTM.

import org.apache.iceberg.types.Type;
import org.apache.iceberg.util.SerializableFunction;

class Identity<T> implements Transform<T, T> {
private static final Identity<?> INSTANCE = new Identity<>();

private static final Set<Type.TypeID> UNSUPPORTED_TYPES =
Contributor

👍

}

public static GeometryType of(String crs) {
return new GeometryType(crs == null ? "" : crs);
Contributor

(nit) Maybe crs == null can be crs == null || crs.isEmpty()

Member Author

@Kontinuation Kontinuation Mar 6, 2025

I believed that including crs.isEmpty() was redundant in this situation, so I didn't add it, as we would return an empty string regardless.

// We need to set outputDimension = 4 for XYM geometries to make JTS WKTWriter or WKBWriter work
// correctly.
// The WKB/WKT writers will ignore Z ordinate for XYM geometries.
if (!Double.isNaN(coordinate.getZ())) {
Contributor

@hsiang-c hsiang-c Mar 5, 2025

Perhaps we could do this (not a must):

if (Coordinate.NULL_ORDINATE != coordinate.getZ())

Member Author

@Kontinuation Kontinuation Mar 6, 2025

I'm afraid we cannot do this. Coordinate.NULL_ORDINATE is NaN, and comparing NaN with any floating-point number (including NaN itself) using != always yields true.
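The NaN behaviour described above can be shown in plain Java without JTS. In this sketch the NULL_ORDINATE constant is a local stand-in for Coordinate.NULL_ORDINATE, and the class and method names are hypothetical.

```java
// Hypothetical demo, not from the PR: why "!=" cannot detect a missing Z ordinate.
final class NanComparisonDemo {
  // Local stand-in for JTS Coordinate.NULL_ORDINATE, which is NaN.
  static final double NULL_ORDINATE = Double.NaN;

  // Suggested check: always true, because NaN != x holds for every x, including NaN.
  static boolean hasZByInequality(double z) {
    return NULL_ORDINATE != z;
  }

  // Check used in the PR snippet: false exactly when the Z ordinate is NaN.
  static boolean hasZByIsNaN(double z) {
    return !Double.isNaN(z);
  }
}
```

For a coordinate without Z (z is NaN), hasZByInequality still reports true, while hasZByIsNaN correctly reports false.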

import org.locationtech.jts.geom.Geometry;
import org.locationtech.jts.geom.GeometryFactory;

public class TestGeometryUtil {
Contributor

It seems like you covered Coordinate and its subclasses except ExtendedCoordinate, am I right?

Member Author

Yes. As far as I know, ExtendedCoordinate is part of the JTS example code, so we don't need to handle it. JTS will only be used internally in Iceberg, which exchanges geospatial data with other systems through WKB/bounds byte buffers, so there's no need to take non-standard extensions of Coordinate into consideration.

Contributor

Thanks for the explanation.

Contributor

@hsiang-c hsiang-c left a comment

LGTM

Collaborator

@szehon-ho szehon-ho left a comment

Thanks @Kontinuation for removing the API dependency on JTS from iceberg-api; this looks a lot closer to me.

}

public static EdgeInterpolationAlgorithm fromName(String algorithmName) {
try {
Collaborator

Add a precondition: check for null and throw an exception?

return EdgeInterpolationAlgorithm.valueOf(algorithmName.toUpperCase(Locale.ENGLISH));
} catch (IllegalArgumentException e) {
throw new IllegalArgumentException(
String.format("Invalid edge interpolation algorithm name: %s", algorithmName), e);
Collaborator

Nit: can remove redundant 'name'
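Taken together, the two review suggestions on fromName (a null precondition and a shorter message) might look like the sketch below. It is an assumption-laden illustration: the enum constants are guessed from the geo spec and may not match the PR, the wrapper class name is hypothetical, and a plain null check stands in for whatever precondition helper the project uses.

```java
import java.util.Locale;

// Hypothetical sketch, not the PR's code.
final class EdgeAlgorithmLookup {
  // Assumed constants; the real enum in the PR may differ.
  enum EdgeInterpolationAlgorithm {
    SPHERICAL,
    VINCENTY,
    THOMAS,
    ANDOYER,
    KARNEY
  }

  static EdgeInterpolationAlgorithm fromName(String algorithmName) {
    // Precondition: reject null up front instead of failing with a NullPointerException.
    if (algorithmName == null) {
      throw new IllegalArgumentException("Invalid edge interpolation algorithm: null");
    }
    try {
      return EdgeInterpolationAlgorithm.valueOf(algorithmName.toUpperCase(Locale.ENGLISH));
    } catch (IllegalArgumentException e) {
      // Message without the redundant word "name".
      throw new IllegalArgumentException(
          String.format("Invalid edge interpolation algorithm: %s", algorithmName), e);
    }
  }
}
```

With this shape, both a null input and an unknown name surface as IllegalArgumentException, so callers handle a single failure type.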

}

public static GeometryType get() {
return of("");
Collaborator

Why not new GeometryType("")?

}

public static GeographyType get() {
return of("");
Collaborator

Why not new GeographyType("")?

@@ -90,5 +90,42 @@ public void testNestedFieldBuilderIdCheck() {
assertThatExceptionOfType(NullPointerException.class)
.isThrownBy(() -> required("field").ofType(Types.StringType.get()).build())
.withMessage("Id cannot be null");

assertThat(Types.fromPrimitiveString("geometry")).isEqualTo(Types.GeometryType.get());
Collaborator

This should be in its own test.

@@ -277,6 +313,17 @@ private static Types.MapType mapFromJson(JsonNode json) {
}
}

private static Types.GeometryType geometryFromJson(JsonNode json) {
String crs = JsonUtil.getStringOrNull("crs", json);
Collaborator

nit: these (and following) can be replaced by the constants?

return TypeID.GEOMETRY;
}

public String crs() {
Collaborator

Maybe annotate this with NotNull?

Member Author

The NotNull annotation requires an additional dependency, and I noticed that it is not currently used in iceberg-api or iceberg-core. Would you like me to add the additional dependency and annotate this method?

Collaborator

Got it, never mind; it may not be worth it.

private final String crs;

private GeometryType(String crs) {
this.crs = crs;
Collaborator

Precondition that crs is not null?

}

private static int getOutputDimension(Geometry geom) {
int dimension = 2;
Collaborator

nit: make a constant to avoid instantiating it every time?

@szehon-ho
Collaborator

Also (since I can't comment on files that are not in the change):

Do we need to add the geo types to the following places?

  1. Types.java TYPES constant?
  2. TestSchemaUnionByFieldName primitiveTypes()?
  3. TestSchema TEST_TYPES?
  4. TestTypesUtil testTypes()?
