[rfc] Add RFC 103 - OGR_SCHEMA open option #11071

Open
wants to merge 12 commits into
base: master
1 change: 1 addition & 0 deletions doc/source/development/rfc/index.rst
@@ -107,3 +107,4 @@ RFC list
rfc98_build_requirements_gdal_3_9
rfc99_geometry_coordinate_precision
rfc101_raster_dataset_threadsafety
rfc103_schema_open_option
275 changes: 275 additions & 0 deletions doc/source/development/rfc/rfc103_schema_open_option.rst
@@ -0,0 +1,275 @@
.. _rfc-103:

===================================================================
RFC 103: Add an OGR_SCHEMA open option to selected OGR drivers
===================================================================

=============== =============================================
Author:         Alessandro Pasotti
Contact:        elpaso at itopen.it
Status:         Draft
Created:        2024-10-22
=============== =============================================

Summary
-------

This RFC enables users to specify an OGR_SCHEMA open option in the OGR
drivers that support it.

The new option will be used to override the auto-detected field types and to rename detected fields.

Motivation
----------

Several OGR drivers, such as CSV, GeoJSON and SQLite, must guess the attribute
data types. The auto-detected types are not always correct, and the user has no
way to override them at opening time.

A secondary goal is to allow users to rename fields at opening time: some drivers
have limitations regarding field names and have specific laundering rules that
may yield field names that are not ideal for the user.

For details, see the discussion attached to the issue: https://github.com/OSGeo/gdal/issues/10943

Implementation
--------------

A new reserved open option named OGR_SCHEMA will be added to the following drivers,
chosen because they are the most likely to benefit from the field type override feature:

- CSV
- GeoJSON
- SQLite
- GML

Collaborator

GML can already use .xsd and .gfs, and CSV can have a .csvt sidecar, but maybe the more options the better. In particular, writing a .csvt file by hand is not convenient at all.

Additional drivers that are not usually affected by the field type guessing issue may still
benefit from the field rename feature, and may be added to the supported drivers later.

The option will be used to specify the schema in the form of a JSON document (or a path to a JSON file).

The schema document will support both "Full" and "Patch" modes.
"Patch" mode will be the default and will allow partial overrides of the auto-detected field types
and names, while "Full" mode will produce a layer with only the fields specified in the schema.
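
The difference between the two modes can be illustrated with a small pure-Python sketch. It is not the driver implementation: ``apply_layer_schema`` is a hypothetical helper, and fields are modelled as plain dictionaries shaped like the ``fields`` entries of the schema document.

```python
def apply_layer_schema(detected, overrides, schema_type="Patch"):
    """Return the field definitions of a layer after applying overrides.

    detected:  list of dicts with the auto-detected "name" and "type".
    overrides: list of dicts shaped like the "fields" entries of OGR_SCHEMA.
    This helper is hypothetical and only illustrates the intended semantics.
    """
    # Copy the detected fields so the input list is left untouched.
    by_name = {f["name"]: dict(f) for f in detected}
    touched = []
    for override in overrides:
        if override["name"] not in by_name:
            raise ValueError(f"field {override['name']!r} not found in layer")
        field = by_name[override["name"]]
        # Type-related keys replace the auto-detected values.
        for key in ("type", "subType", "width", "precision"):
            if key in override:
                field[key] = override[key]
        # "newName" renames the field.
        if "newName" in override:
            field["name"] = override["newName"]
        touched.append(override["name"])
    if schema_type == "Full":
        # Full: only the fields listed in the schema survive.
        return [by_name[name] for name in touched]
    # Patch: every detected field is kept, in its original order.
    return [by_name[f["name"]] for f in detected]
```

With three detected fields and overrides for two of them, "Patch" returns all three fields (two of them modified), while "Full" returns only the two fields named in the schema.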

The structure of the JSON document is largely inspired by the output of the `ogrinfo -json` command,
with the notable exception that (for the scope of this RFC) only the information related to the type and
name of the fields will be considered.

The JSON schema for the OGR_SCHEMA open option will be as follows:

.. code-block:: json

    {
      "$schema": "http://json-schema.org/draft-07/schema#",
      "description": "Schema for OGR_SCHEMA open option",
      "oneOf": [
        {
          "$ref": "#/definitions/dataset"
        }
      ],
      "definitions": {
        "dataset": {
          "type": "object",
          "properties": {
            "layers": {
              "type": "array",
              "description": "The list of layers contained in the schema",
              "items": {
                "$ref": "#/definitions/layer"
              }
            }
          },
          "required": [
            "layers"
          ],
          "additionalProperties": false
        },
        "schema_type": {
          "enum": [
            "Patch",
            "Full"
          ],
          "default": "Patch"
        },
        "layer": {
          "type": "object",
          "properties": {
            "name": {
              "description": "The name of the layer",
              "type": "string"
            },
            "schema_type": {
              "description": "The type of schema operation: Patch or Full",
              "$ref": "#/definitions/schema_type"
            },
            "fields": {
              "description": "The list of field definitions",
              "type": "array",
              "items": {
                "$ref": "#/definitions/field"
              }
            }
          },
          "required": [
            "name",
            "fields"
          ],
          "additionalProperties": false
        },
        "field": {
          "description": "The field definition",
          "additionalProperties": true,
          "type": "object",
          "properties": {
            "name": {
              "type": "string"
            }
          },
          "anyOf": [
            {
              "type": "object",
              "properties": {
                "type": {
                  "$ref": "#/definitions/fieldType"
                },
                "subType": {
                  "$ref": "#/definitions/fieldSubType"
                },
                "width": {
                  "type": "integer"
                },
                "precision": {
                  "type": "integer"
                }
              },
              "required": [
                "type"
              ]
            },
            {
              "description": "The new name of the field",
              "type": "object",
              "properties": {
                "newName": {
                  "type": "string"
                }
              },
              "required": [
                "newName"
              ]
            }
          ],
          "required": [
            "name"
          ]
        },
        "fieldType": {
          "enum": [
            "Integer",
            "Integer64",
            "Real",
            "String",
            "Binary",
            "IntegerList",
            "Integer64List",
            "RealList",
            "StringList",
            "Date",
            "Time",
            "DateTime"
          ]
        },
        "fieldSubType": {
          "enum": [
            "None",
            "Boolean",
            "Int16",
            "Float32",
            "JSON",
            "UUID"
          ]
        }
      }
    }

Here is an example of a schema document that overrides the type of one field and renames another, using the default "Patch" mode:

.. code-block:: json

    {
      "layers": [
        {
          "name": "layer1",
          "fields": [
            {
              "name": "field1",
              "type": "String",
              "subType": "JSON"
            },
            {
              "name": "field2",
              "newName": "new_field2"
            }
          ]
        }
      ]
    }
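
As a minimal sketch of how a client might build such a document, the following Python snippet serializes the example above into a ``NAME=VALUE`` open option string. The file name ``input.csv`` is hypothetical, and the commented-out ``gdal.OpenEx`` call assumes a GDAL build that implements this RFC.

```python
import json

# Same content as the "Patch" example above.
schema = {
    "layers": [
        {
            "name": "layer1",
            "fields": [
                {"name": "field1", "type": "String", "subType": "JSON"},
                {"name": "field2", "newName": "new_field2"},
            ],
        }
    ]
}

# Open options are passed to drivers as "NAME=VALUE" strings.
open_option = "OGR_SCHEMA=" + json.dumps(schema)

# Assumption: with a GDAL build that implements this RFC, the string could be
# passed through the existing open options mechanism, for example:
#   from osgeo import gdal
#   ds = gdal.OpenEx("input.csv", gdal.OF_VECTOR, open_options=[open_option])
```

Building the document with ``json.dumps`` avoids quoting mistakes that are easy to make when writing the JSON inline on a command line.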


For multi-layer datasets, the schema will be specified as a list of layers, each with its own field definitions and Patch/Full mode:

.. code-block:: json

    {
      "layers": [
        {
          "name": "layer1",
          "schema_type": "Full",
          "fields": [
            {
              "name": "field1",
              "type": "String",
              "subType": "JSON"
            },
            {
              "name": "field2",
              "newName": "new_field2"
Contributor

@elpaso @rouault Could we make patching easy to understand by treating names and field types in the same way? For example:

Suggested change
"newName": "new_field2"
"newName": "new_field2",
"newType": "String",
"newSubType": "JSON"

If we do this, we don't have to spend any energy explaining why names and types are different.

Collaborator Author

That makes sense to me. As an alternative we could treat the fields as an object such as:

"fields" : { "field1" : { "name" : "field1_renamed" ...}}

That would mean abandoning the idea of using the output of ogrinfo -json as a template for the schema, but perhaps we have already missed that train.

Collaborator

I hope we do not lose everything from ogrinfo -json. It would be rather demanding to write a working schema from scratch, and having a separate option or utility for printing the default schema feels like duplication. But duplication is better than leaving users on their own.

Contributor

@elpaso yeah, the original sin (so to speak) is that the OGR schema is represented as an array of fields (this is OGR's internal model) instead of as a dictionary. Thus we have to reference fields by "the item in the fields array where name is val" instead of just "fields[val]". I think it makes more sense to accept that than to change it.

            }
          ]
        },
        {
          "name": "layer2",
          "schema_type": "Patch",
          "fields": [
            {
              "name": "field1",
              "type": "String",
              "subType": "JSON"
            },
            {
              "name": "field2",
              "newName": "new_field2"
            }
          ]
        }
      ]
    }


The new option will be used by applications such as `ogr2ogr` to override the auto-detected field types and the auto-detected (and possibly laundered) field names.

A preliminary draft of the implementation can be found at:
https://github.com/elpaso/gdal/commits/enhancement-gh10943-fields-schema-override/


To advertise the new feature, the following metadata items will be used:
`GDAL_DMD_OGR_SCHEMA_OPEN_OPTION_FLAGS=OverrideType OverrideName`



Errors and warnings
-------------------

- If the schema is not a valid JSON document, a critical error will be raised.

- If the schema is a valid JSON document but does not validate against the JSON schema, a critical error will be raised.

- If the schema contains a field that is not present in the dataset, a critical error will be raised.
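
The three rules above can be sketched as a sequence of checks. This is only an illustration: ``SchemaError`` and ``check_schema`` are hypothetical names, the structural check is a simplified stand-in for full JSON Schema validation, and treating an unknown layer name as "field not present" is an assumption that the RFC does not state.

```python
import json


class SchemaError(Exception):
    """Hypothetical stand-in for the critical error raised by a driver."""


def check_schema(text, layer_fields):
    """Check an OGR_SCHEMA document against the three error rules.

    layer_fields maps each layer name to the set of its detected field names.
    Returns the parsed document when all checks pass.
    """
    # Rule 1: the document must be valid JSON.
    try:
        doc = json.loads(text)
    except json.JSONDecodeError as exc:
        raise SchemaError(f"not a valid JSON document: {exc}") from exc

    # Rule 2: simplified structural check (not full JSON Schema validation).
    layers = doc.get("layers") if isinstance(doc, dict) else None
    if not isinstance(layers, list):
        raise SchemaError("document must be an object with a 'layers' array")
    for layer in layers:
        if not isinstance(layer, dict):
            raise SchemaError("each layer must be an object")
        if not isinstance(layer.get("name"), str):
            raise SchemaError("each layer needs a string 'name'")
        if not isinstance(layer.get("fields"), list):
            raise SchemaError("each layer needs a 'fields' array")
        known = layer_fields.get(layer["name"], set())
        for field in layer["fields"]:
            if not isinstance(field, dict) or not isinstance(field.get("name"), str):
                raise SchemaError("each field needs a string 'name'")
            if "type" not in field and "newName" not in field:
                raise SchemaError("each field needs a 'type' or a 'newName'")
            # Rule 3: the field must exist in the dataset.
            if field["name"] not in known:
                raise SchemaError(f"field {field['name']!r} not found in dataset")
    return doc
```

Running the checks in this order means a malformed document is reported before any comparison with the dataset takes place.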