[rfc] Add RFC 103 - OGR_SCHEMA open option #11071

elpaso · 2024-10-22T08:28:07Z

Formatted preview: https://gdal--11071.org.readthedocs.build/en/11071/development/rfc/rfc103_schema_open_option.html

jratike80 · 2024-10-22T08:43:20Z

doc/source/development/rfc/rfc103_schema_open_option.rst

+
+- Additional JSON properties will be ignored while parsing the schema.
+
+- If the schema contains a field that is not present in the dataset, a warning will be raised and the field will be ignored.


What happens if some fields are missing?

Nothing: auto-detected types will be used. The idea is that you can override only a part of the schema.

I have added a sentence to clarify that partial overrides are possible (in fact, they probably are the main use case).
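To make the partial-override behaviour concrete, here is a minimal sketch in plain Python. It does not call GDAL; `apply_patch`, the `detected` mapping and the field names are hypothetical and only mimic the rules quoted above: fields absent from the schema keep their auto-detected type, and a schema field that is not in the dataset triggers a warning and is ignored.

```python
import warnings


def apply_patch(detected, patch_fields):
    """Return a copy of `detected` (field name -> type) with overrides applied.

    Hypothetical helper: it only illustrates the rules discussed in this
    thread and is not how any GDAL driver is implemented.
    """
    result = dict(detected)
    for field in patch_fields:
        name = field["name"]
        if name not in result:
            # Field listed in the schema but missing from the dataset:
            # warn and skip it.
            warnings.warn(f"field {name!r} not found in dataset, ignored")
            continue
        if "type" in field:
            result[name] = field["type"]
    return result


# Auto-detected schema (assumed values, for illustration only).
detected = {"id": "Integer", "name": "String", "height": "Real"}

# Only "id" is overridden; "missing" does not exist in the dataset.
patch = [
    {"name": "id", "type": "Integer64"},
    {"name": "missing", "type": "String"},
]

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    patched = apply_patch(detected, patch)
```

In this sketch `name` and `height` keep their detected types, which is the "override only a part of the schema" use case.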

I believe it will not take long before someone wants an option to rename fields by the same mechanism.

{ "name": "field1", "new_name": "description", "type": "string" },

And then someone would like to have an alternative to the OGR SQL EXCLUDE for removing some fields:

The syntax * EXCLUDE ([fields]) can be used to select all fields except those listed in parentheses.

They will not have that option if the fields which are missing from the schema are autodetected. But it is not possible to make both you and them happy at the same time, and you are the author...

I believe we should add a required top-level JSON property "schema_type" whose only supported value currently will be "patch", to mean that we alter the autodetected schema for the parts we specify with the JSON document. If in the future we would want to allow full schema replacement, that would be done with "schema_type" = "full" or something like that

@jratike80 I agree that field renaming will be very useful. Instead of an implicit operation, where new_name means replace name or new_type means replace type, I'd prefer to use an explicit operation (sort of like https://jsonpatch.com/#replace) which says: replace field name with {"name": "NEW_NAME", "type": "NEW_TYPE"}.

I'll read the whole RFC soon, maybe what I've proposed won't be realistic.

Now that I think about it, field renaming can be done with SQL, right? While overriding the original field type cannot.

SQL is flexible and type casting is supported. I have used this kind of SQL with ogr2ogr:

SELECT 'FIN' AS "country", CAST("mtk_id" as text) as "id", 'CC BY 4.0' AS "license", ST_Simplify("geometry",5) AS "geometry" FROM "source"

I am not sure how well GDAL detects those CASTed attributes generally, but they have worked for my use case with GeoPackage as the output format.
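As a rough illustration of renaming and casting in SQL, the sketch below runs a cut-down version of that statement with Python's built-in `sqlite3` module instead of ogr2ogr. The table contents are made up, and `ST_Simplify` is left out because it needs SpatiaLite; this only shows that the aliases and the CAST determine the output column names and value types.

```python
import sqlite3

# In-memory database standing in for the "source" layer (made-up data).
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE "source" ("mtk_id" INTEGER, "name" TEXT)')
conn.execute('INSERT INTO "source" VALUES (42, ?)', ("lake",))

# Constant column, plus a renamed and cast column, as in the statement above.
cursor = conn.execute(
    """
    SELECT 'FIN' AS "country",
           CAST("mtk_id" AS TEXT) AS "id"
    FROM "source"
    """
)
column_names = [description[0] for description in cursor.description]
row = cursor.fetchone()
conn.close()
```

Here `mtk_id` comes back as the text column `id`, so the rename and the cast happen in a single step.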

However, some users find it odd to need to know SQL, and having several SQL dialects, from which GDAL selects one by hidden rules, adds confusion for beginners. But it is hard to draw the line between when to use SQL, when to add a new specific open option or ogr2ogr switch, and when to write an OGR VRT file.

For me, OGR_SCHEMA feels more user friendly than SQL CAST, especially because I fear that different dialects may not perform exactly the same casts. If renaming were also supported, users would not need to do one part of the mapping with OGR_SCHEMA and another part with SQL.

> For me the OGR_SCHEMA feels more user friendly than SQL CAST

Not only that, but type casting at the SQL level is too late. If a driver has wrongly guessed that a field is Integer32 whereas later records hold 64-bit values, the driver will present truncated/wrong values to the SQL engine. There's no way to "fix" that at that level; it must be done by the driver itself.
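A small simulation of why a later cast cannot repair the damage. It assumes the driver clamps out-of-range values to the 32-bit limits; the exact behaviour (clamping, wrapping, a warning) depends on the driver, and `read_as_int32` is a hypothetical stand-in, not a GDAL function.

```python
INT32_MIN, INT32_MAX = -(2**31), 2**31 - 1


def read_as_int32(raw):
    """Simulate a driver storing a parsed value in a 32-bit integer field.

    Assumption: out-of-range values are clamped to the int32 limits.
    """
    return max(INT32_MIN, min(INT32_MAX, int(raw)))


# Raw text values as they might appear in a source file; the second one
# does not fit in 32 bits.
raw_values = ["12", "3000000000"]

# What the SQL engine receives when the field was guessed as Integer32.
seen_by_sql = [read_as_int32(value) for value in raw_values]

# Widening afterwards (the equivalent of a CAST to a 64-bit type) cannot
# bring back the information that was lost when the value was read.
cast_after_the_fact = [int(value) for value in seen_by_sql]

# Declaring the field as 64-bit before reading keeps the value intact.
read_as_int64 = [int(value) for value in raw_values]
```

The override therefore has to reach the driver before any record is read, which is what an open option allows.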

jratike80 · 2024-10-22T09:00:35Z

doc/source/development/rfc/rfc103_schema_open_option.rst

+- GeoJSON
+- SQLite
+- GML
+


GML can already use .xsd and .gfs, and CSV can have a .csvt sidecar, but maybe the more options the better. Writing a .csvt file by hand is especially inconvenient.

rouault · 2024-10-22T12:22:23Z

doc/source/development/rfc/rfc103_schema_open_option.rst

+- If the schema is a valid JSON document but does not contain the expected fields or it is a no-op
+  (does not contain any actionable instruction), a warning will be raised and the schema will be ignored.
+
+- Additional JSON properties will be ignored while parsing the schema.


I would suggest that we ignore the schema if an unexpected JSON property is found, with the error message mentioning that property. That way we will be able to add more capabilities in the future, while giving users certainty that what they specified is fully taken into account.

The rationale for ignoring was to allow the easy workflow ogrinfo -json -> edit the field types -> use that JSON document as an input for ogr2ogr. If we error out on unsupported properties the user will need to remove all unsupported properties.

It is fine for me, just wanted to point this out.

Same observation goes for the "schema_type" recommended below.

> If we error out on unsupported properties the user will need to remove all unsupported properties.

If the error message is sufficiently explicit, I don't think that's much of a concern. But there are clearly pros and cons.
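The two positions can be compared with a short sketch. `KNOWN_FIELD_KEYS` and `check_field` are hypothetical, and the set of recognised keys is an assumption drawn from properties mentioned in this thread, not the final RFC. The sample field is the one shown in `ogrinfo -json` output, which carries properties the schema would not use.

```python
# Assumed set of recognised per-field keys (illustrative only).
KNOWN_FIELD_KEYS = {"name", "type", "subType", "width", "precision"}


def check_field(field, strict):
    """Return the recognised part of `field`.

    strict=True  -> reject unknown properties, naming the first offender.
    strict=False -> silently drop unknown properties.
    """
    unknown = sorted(set(field) - KNOWN_FIELD_KEYS)
    if unknown and strict:
        raise ValueError(
            f"unsupported property {unknown[0]!r} in field {field.get('name')!r}"
        )
    return {key: value for key, value in field.items() if key in KNOWN_FIELD_KEYS}


# A field copied from ogrinfo -json output.
from_ogrinfo = {
    "name": "a",
    "type": "String",
    "subType": "JSON",
    "nullable": True,
    "uniqueConstraint": False,
}

# Lenient mode: the ogrinfo output can be reused without manual cleanup.
lenient_result = check_field(from_ogrinfo, strict=False)

# Strict mode: the user is told exactly which property was not understood.
try:
    check_field(from_ogrinfo, strict=True)
    strict_error = None
except ValueError as exc:
    strict_error = str(exc)
```

Lenient parsing favours the ogrinfo round trip; strict parsing guarantees that nothing the user wrote is silently dropped.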

rouault · 2024-10-22T12:23:48Z

doc/source/development/rfc/rfc103_schema_open_option.rst

+
+- Additional JSON properties will be ignored while parsing the schema.
+
+- If the schema contains a field that is not present in the dataset, a warning will be raised and the field will be ignored.


I believe we should add a required top-level JSON property "schema_type" whose only supported value currently will be "patch", to mean that we alter the autodetected schema for the parts we specify with the JSON document. If in the future we would want to allow full schema replacement, that would be done with "schema_type" = "full" or something like that

rouault · 2024-10-22T12:24:53Z

doc/source/development/rfc/rfc103_schema_open_option.rst

+    "fields": [
+        {
+        "name": "field1",
+        "type": "string"


We would also need to support "subtype" (for String JSON, or Integer Boolean). And perhaps "width" and "precision" too.

It makes sense.
Is subtype also exposed by ogrinfo -json?

For schema_type I was thinking to make it a layer-level property, this way a combination of patch and full can be used for individual layers.

I'm working on a json schema document to clarify this.
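A sketch of what a layer-level `schema_type` could mean, under stated assumptions: the key name and the values `"patch"` and `"full"` follow the spelling used in the comments above (a later commit mentions camel-casing it), and `resolve_layer_schema` plus the layer specs are hypothetical.

```python
def resolve_layer_schema(detected_fields, layer_spec):
    """Combine auto-detected fields with one layer entry of the schema.

    "patch": keep every detected field and override only the listed ones.
    "full":  replace the detected field list with the listed fields.
    Hypothetical helper; in patch mode, overrides that match no detected
    field are simply not applied here.
    """
    mode = layer_spec.get("schema_type")
    if mode == "patch":
        overrides = {field["name"]: field for field in layer_spec["fields"]}
        return [
            {**field, **overrides.get(field["name"], {})}
            for field in detected_fields
        ]
    if mode == "full":
        return [dict(field) for field in layer_spec["fields"]]
    raise ValueError(f"unsupported schema_type: {mode!r}")


detected = [
    {"name": "a", "type": "Integer"},
    {"name": "b", "type": "String"},
]

# Two layers using different modes in the same document.
patch_spec = {
    "name": "layer1",
    "schema_type": "patch",
    "fields": [{"name": "a", "type": "Integer64"}],
}
full_spec = {
    "name": "layer2",
    "schema_type": "full",
    "fields": [{"name": "b", "type": "String"}],
}

patched = resolve_layer_schema(detected, patch_spec)
replaced = resolve_layer_schema(detected, full_spec)
```

Keeping the mode on each layer entry lets one document patch some layers and fully replace others.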

rouault · 2024-10-22T12:56:45Z

doc/source/development/rfc/rfc103_schema_open_option.rst

+Implementation
+--------------
+
+A new open option named SCHEMA will be added to the following drivers:


We should mention this open option will be a reserved one. If a driver uses it, it must be for that purpose

And I believe we should modify ogr2ogr so that -mapFieldType uses the capability offered by the SCHEMA open option when available

jratike80 · 2024-10-22T13:29:04Z

Oh, names. PostGIS driver has already ACTIVE_SCHEMA and SCHEMAS. OAPIF has IGNORE_SCHEMA.
Perhaps this new open option should have some more original name, like OGR_SCHEMA.

rouault · 2024-10-22T13:34:00Z

> Perhaps this new open option should have some more original name, like OGR_SCHEMA.

+1

rouault · 2024-10-22T16:59:59Z

doc/source/development/rfc/rfc103_schema_open_option.rst

+            {
+            "name": "field1",
+            "type": "string",
+            "subtype": "JSON"


Actually I see we have spelled it "subType" in ogrinfo -json output:

"fields":[ { "name":"a", "type":"String", "subType":"JSON", "nullable":true, "uniqueConstraint":false }

Schema in https://github.com/OSGeo/gdal/blob/master/apps/data/ogrinfo_output.schema.json#L117

I have added a JSON schema to the RFC: https://github.com/OSGeo/gdal/pull/11071/files#diff-9d0c9dc12d0b4a70d63be37cb390f1e9970d226296c30e8883c73a911c82806bR55

sgillies

I made a couple comments and suggestions inline.

sgillies · 2024-10-30T15:23:09Z

doc/source/development/rfc/rfc103_schema_open_option.rst

+            },
+            {
+            "name": "field2",
+            "newName": "new_field2"


@elpaso @rouault Could we make patching easy to understand by treating names and field types in the same way? For example:

Suggested change

-            "newName": "new_field2"
+            "newName": "new_field2",
+            "newType": "String",
+            "newSubType": "JSON"

If we do this, we don't have to spend any energy explaining why names and types are different.

That makes sense to me. As an alternative we could treat the fields as an object such as:

"fields": { "field1": { "name": "field1_renamed" ... } }

That would mean abandoning the idea of using the output of ogrinfo -json as a template for the schema, but perhaps we have already missed that train.

I hope we do not lose everything from ogrinfo -json. It would be rather demanding to write a working schema from scratch, and having a separate option or utility for printing the default schema feels like duplication. But better duplication than leaving users on their own.

@elpaso yeah, the original sin (so to speak) is that the OGR schema is represented as an array of fields (this is OGR's internal model) instead of as a dictionary. Thus we have to reference fields by "the item in the fields array where name is val" instead of just "fields[val]". I think it makes more sense to accept that than to change it.
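The "item in the fields array where name is val" lookup, combined with the suggested `newName`/`newType`/`newSubType` keys, could look like the sketch below. Both helpers are hypothetical and operate on plain dictionaries shaped like `ogrinfo -json` field entries, not on OGR objects.

```python
def find_field(fields, name):
    """Return the entry of the `fields` array whose "name" equals `name`."""
    for field in fields:
        if field["name"] == name:
            return field
    return None


def apply_override(fields, override):
    """Apply one override in place; return False if the field is not found.

    The new* keys follow the suggestion in this thread and are handled
    uniformly: each one replaces the matching property of the field.
    """
    target = find_field(fields, override["name"])
    if target is None:
        return False
    if "newName" in override:
        target["name"] = override["newName"]
    if "newType" in override:
        target["type"] = override["newType"]
    if "newSubType" in override:
        target["subType"] = override["newSubType"]
    return True


# Field array in the same shape as ogrinfo -json output (made-up fields).
fields = [
    {"name": "field1", "type": "Integer"},
    {"name": "field2", "type": "Integer"},
]
override = {
    "name": "field2",
    "newName": "new_field2",
    "newType": "String",
    "newSubType": "JSON",
}
applied = apply_override(fields, override)
```

The array form keeps the ogrinfo layout, at the cost of a linear search for each override.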

Co-authored-by: Even Rouault <even.rouault@spatialys.com>

[rfc] Add RFC 103 - OGR SCHEMA open option

3231dd9

elpaso marked this pull request as draft October 22, 2024 08:28

elpaso added 2 commits October 22, 2024 10:38

formatting

3c24e35

Fix multi-layered schema

af9a72a

jratike80 reviewed Oct 22, 2024

Specify partial overrides

7750356

jratike80 reviewed Oct 22, 2024

rouault reviewed Oct 22, 2024

Apply suggestions from reviewers

1b0f100

elpaso force-pushed the rfc_103_schema_open_option branch from 91a4f1e to 1b0f100 Compare October 22, 2024 14:08

Add subtype

a2cf908

elpaso changed the title ~~[rfc] Add RFC 103 - OGR SCHEMA open option~~ [rfc] Add RFC 103 - OGR_SCHEMA open option Oct 22, 2024

rouault mentioned this pull request Oct 22, 2024

Way to override field definitions at the OGR driver level? #10943

Open

rouault reviewed Oct 22, 2024

Add Patch/Full mode and schema

1c1a024

rouault added the funded through GSP Work funded through the GDAL Sponsorship Program label Oct 24, 2024

Add rename functionality

0f1001d

jratike80 reviewed Oct 30, 2024

elpaso added 2 commits October 30, 2024 14:23

typo

45c4601

Add metadata

cc972d9

elpaso marked this pull request as ready for review October 30, 2024 14:27

rouault reviewed Oct 30, 2024

sgillies reviewed Oct 30, 2024

elpaso and others added 2 commits October 30, 2024 17:03

camel case schemaType

9d64e15

Update doc/source/development/rfc/rfc103_schema_open_option.rst

3278566

Co-authored-by: Even Rouault <even.rouault@spatialys.com>
