-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
API, Core: Add geometry and geography types support #12346
base: main
Are you sure you want to change the base?
Conversation
b39e32c
to
904f503
Compare
904f503
to
896bc19
Compare
@rdblue @szehon-ho @flyrain please review when you have time 🙏🏻 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey , really sorry for delay, let's work on it this week. Some early comments
|
||
private final Geometry geometry; | ||
|
||
public Geography(Geometry geometry) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need interop between Geography and Geometry? I assume we have just have a class for each.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We define Geography this way to reuse the data structures of coordinates and geospatial objects (Points, LineString, etc.), as well as WKB/WKT functions provided by JTS. Geography does not interoperate with Geometry, they are different classes but have the same data structure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we have a base class, and two implementing classes then? its confusing as it is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll try removing Geometry and Geography from iceberg-api, and use ByteBuffer
as the underlying Java class for these types.
@@ -88,11 +91,64 @@ public boolean test(T value) { | |||
return String.valueOf(value).startsWith((String) literal.value()); | |||
case NOT_STARTS_WITH: | |||
return !String.valueOf(value).startsWith((String) literal.value()); | |||
case ST_INTERSECTS: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lets split out predicate pruning support. I feel 80% of the changes is to support those, let's focus in this pr to get the API right.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Predicates and pruning were removed from this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you
896bc19
to
4d1bb83
Compare
build.gradle
Outdated
@@ -292,6 +292,7 @@ project(':iceberg-api') { | |||
|
|||
dependencies { | |||
implementation project(path: ':iceberg-bundled-guava', configuration: 'shadow') | |||
api libs.jts.core |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not comfortable leaking the JTS API from Iceberg API, as this forces all the consumer to now depend on JTS. Can we instead encapsulate this dep in iceberg-core? Going to check with other Iceberg PMC's on this.
At very least this should be implementation.
if (primitive.typeId() == Type.TypeID.GEOMETRY) { | ||
Types.GeometryType geometryType = (Types.GeometryType) primitive; | ||
generator.writeStartObject(); | ||
generator.writeStringField("type", "geometry"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: use the constant for "type"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Replaced constant literals such as "type", "geometry", "geography", "crs" and "algorithm" with constants.
@@ -141,11 +143,34 @@ static void toJson(Types.MapType map, JsonGenerator generator) throws IOExceptio | |||
} | |||
|
|||
static void toJson(Type.PrimitiveType primitive, JsonGenerator generator) throws IOException { | |||
generator.writeString(primitive.toString()); | |||
if (primitive.typeId() == Type.TypeID.GEOMETRY) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: use switch statement like rest of the code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Refactored to use switch-case.
generator.writeStringField("type", "geometry"); | ||
String crs = geometryType.crs(); | ||
if (!crs.isEmpty()) { | ||
generator.writeStringField("crs", geometryType.crs()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: "crs" seems used frequently enough to warrant a constant.
@@ -535,6 +535,8 @@ private static int estimateSize(Type type) { | |||
return ((Types.FixedType) type).length(); | |||
case BINARY: | |||
case VARIANT: | |||
case GEOMETRY: | |||
case GEOGRAPHY: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we put a comment how we arrived at 80?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We cannot give an accurate estimation without statistics of the data, so I've simply reused the number for BINARY and VARIANT.
80 is roughly size of a polygon or linestring with 4 coordinates, which denotes a box. This is larger for a dataset full of points and smaller for datasets containing complex shapes, but I think it is a reasonable guess.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok I meant code comment :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added reasons why we use 80 (even though the choice was made randomly like https://github.com/apache/iceberg/pull/11324/files#r1821277533) and avoid falling back into the same case as BINARY and VARIANT.
@@ -70,6 +75,20 @@ public static Type fromTypeName(String typeString) { | |||
return TYPES.get(lowerTypeString); | |||
} | |||
|
|||
if (lowerTypeString.startsWith("geometry")) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why cant we put the startsWith in the regex, like DECIMAL and FIXED
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We cannot match the regex with lowerTypeString
, because we want to extract the original CRS from the type instead of a lower case one. I'll see how to make it more concise and get rid of startsWith.
return new Literals.GeometryLiteral(value); | ||
} | ||
|
||
static Literal<Geography> of(Geography value) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess here, let's specifically not leak the Geography in the API, we should just have a wrapper class as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll remove GeometryLiteral
ad GeographyLiteral
. The iceberg expressions should only work with bounding boxes, so there's no need to involve full-fledged Geometry and Geography objects.
|
||
private final Geometry geometry; | ||
|
||
public Geography(Geometry geometry) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we have a base class, and two implementing classes then? its confusing as it is.
} | ||
|
||
public static GeographyType of(String crs) { | ||
return of(crs, DEFAULT_ALGORITHM); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
another option is to allow null for algorithm right? and the code can default it , as per the spec?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed. The algorithm field is null when no algorithm is specified.
4d1bb83
to
73f6022
Compare
…ry and geography types to ByteBuffer
I have removed dependency to JTS from iceberg-api, now the underlying Java type of Geometry and Geography are I have removed code for converting between ByteBuffer and geospatial types, removed GeometryLiteral/GeographyLiteral, as well as default initial/write value support for geospatial fields. I'll proceed adding them once we are all agree on the API of primitive geospatial types. |
Matcher geography = GEOGRAPHY_PARAMETERS.matcher(typeString.substring(9)); | ||
if (geography.matches()) { | ||
return GeographyType.of(geography.group(1), geography.group(2)); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In both geometry
and geography
cases, do you think we should throw IllegalArgumentException
when patterns don't match?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We'll throw an IllegalArgumentException
at the bottom of this function if none of the pattern matches: "Cannot parse type string to primitive: " + typeString
. I think this is already informative. Do you think that we need a more specific error message for geospatial types?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No worries, I didn't pay attention to the very bottom. LGTM.
import org.apache.iceberg.types.Type; | ||
import org.apache.iceberg.util.SerializableFunction; | ||
|
||
class Identity<T> implements Transform<T, T> { | ||
private static final Identity<?> INSTANCE = new Identity<>(); | ||
|
||
private static final Set<Type.TypeID> UNSUPPORTED_TYPES = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
} | ||
|
||
public static GeometryType of(String crs) { | ||
return new GeometryType(crs == null ? "" : crs); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(nit) Maybe crs == null
can be crs == null || crs.isEmpty()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believed that including crs.isEmpty()
was redundant in this situation, so I didn't add it, as we would return an empty string regardless.
// We need to set outputDimension = 4 for XYM geometries to make JTS WKTWriter or WKBWriter work | ||
// correctly. | ||
// The WKB/WKT writers will ignore Z ordinate for XYM geometries. | ||
if (!Double.isNaN(coordinate.getZ())) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps we could do (not a must).
if (Coordinate.NULL_ORDINATE != coordinate.getZ())
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm afraid that we cannot do this. Coordinate.NULL_ORDINATE
is NaN, comparing it with other floating point number (including NaN) using != will always yield true.
import org.locationtech.jts.geom.Geometry; | ||
import org.locationtech.jts.geom.GeometryFactory; | ||
|
||
public class TestGeometryUtil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like you covered Coordinate
and its subclasses except ExtendedCoordinate
, am I right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. As far as I know, ExtendedCoordinate
is part of JTS example code so we don't need to handle it. JTS will only be used internally in iceberg and it exchanges geospatial data with other systems through WKB/Bounds byte buffer, so there's no need to take non-standard extensions of Coordinate
into consideration.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the explanation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @Kontinuation for removing the api dependency on jts from iceberg-api, this looks a lot closer to me
} | ||
|
||
public static EdgeInterpolationAlgorithm fromName(String algorithmName) { | ||
try { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Precondition, check null and throw exception?
return EdgeInterpolationAlgorithm.valueOf(algorithmName.toUpperCase(Locale.ENGLISH)); | ||
} catch (IllegalArgumentException e) { | ||
throw new IllegalArgumentException( | ||
String.format("Invalid edge interpolation algorithm name: %s", algorithmName), e); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: can remove redundant 'name'
} | ||
|
||
public static GeometryType get() { | ||
return of(""); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not , new GeometryType("")
} | ||
|
||
public static GeographyType get() { | ||
return of(""); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not new GeographyType("")
@@ -90,5 +90,42 @@ public void testNestedFieldBuilderIdCheck() { | |||
assertThatExceptionOfType(NullPointerException.class) | |||
.isThrownBy(() -> required("field").ofType(Types.StringType.get()).build()) | |||
.withMessage("Id cannot be null"); | |||
|
|||
assertThat(Types.fromPrimitiveString("geometry")).isEqualTo(Types.GeometryType.get()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should be in its own test
@@ -277,6 +313,17 @@ private static Types.MapType mapFromJson(JsonNode json) { | |||
} | |||
} | |||
|
|||
private static Types.GeometryType geometryFromJson(JsonNode json) { | |||
String crs = JsonUtil.getStringOrNull("crs", json); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: these (and following) can be replaced by the constants?
return TypeID.GEOMETRY; | ||
} | ||
|
||
public String crs() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe can annotate NotNull?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The NotNull
annotation requires an additional dependency, and I noticed that it is not currently used in iceberg-api or iceberg-core. Would you like me to add the additional dependency and annotate this method?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it, nvm it may not be worth it
private final String crs; | ||
|
||
private GeometryType(String crs) { | ||
this.crs = crs; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Precondition that crs is not null?
} | ||
|
||
private static int getOutputDimension(Geometry geom) { | ||
int dimension = 2; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: make a constant to avoid instantiating it every time?
Also, (as can't comment on files that are not in the change) Do we need to add Geo types to following places?
|
fc193c6
to
eb1896a
Compare
This adds 2 primitive types to
iceberg-api
andiceberg-core
for supporting geospatial data types, partially implementing the iceberg geo spec: #10981The newly added primitive types are:
geometry(C)
: geometries with linear/planar edge interpolationgeography(C, A)
: geometries with non-linear edge interpolation, the algorithm for interpolating edges is parameterized byA
.