Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[rfc] Add RFC 103 - OGR_SCHEMA open option #11071

Open
wants to merge 12 commits into
base: master
Choose a base branch
from

Conversation

elpaso
Copy link
Collaborator

@elpaso elpaso commented Oct 22, 2024

@elpaso elpaso marked this pull request as draft October 22, 2024 08:28

- Additional JSON properties will be ignored while parsing the schema.

- If the schema contains a field that is not present in the dataset, a warning will be raised and the field will be ignored.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if some fields are missing?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nothing: auto-detected types will be used. The idea is that you can override only a part of the schema.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added a sentence to clarify that partial overrides are possible (in fact, they probably are the main use case).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe that it does not take long before someone wants to have an option to rename the fields by the same.

{
            "name": "field1",
            "new_name": "description"
            "type": "string"
 },

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And then someone would like to have an alternative for the OGR SQL EXCLUDE for removing some fields

The syntax * EXCLUDE ([fields]) can be used to select all fields except those listed in parentheses.

They will not have that option if the fields which are missing from the schema are autodetected. But it is not possible to make both you and them happy at the same time, and you are the author...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe we should add a required top-level JSON property "schema_type" whose only supported value currently will be "patch", to mean that we alter the autodetected schema for the parts we specify with the JSON document. If in the future we would want to allow full schema replacement, that would be done with "schema_type" = "full" or something like that

Copy link
Contributor

@sgillies sgillies Oct 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jratike80 I agree that field renaming will be very useful. Instead of an implicit operation, where new_name means replace name or new_type means replace type, I'd prefer to use an explicit operation (sort of like https://jsonpatch.com/#replace) which says: replace field name with {"name": "NEW_NAME", "type": "NEW_TYPE"}.

I'll read the whole RFC soon, maybe what I've proposed won't be realistic.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that I think about it, field renaming can be done with SQL, right? While overriding the original field type cannot.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SQL is flexible and type casting is supported. I seem to have used this kind of SQL with ogr2ogr

SELECT 
'FIN' AS "country",
CAST("mtk_id" as text) as "id", 
'CC BY 4.0' AS "license",
ST_Simplify("geometry",5) AS "geometry"
FROM "source"

I am not sure how well GDAL detects those CASTed attributes generally, but they have worked for my use case with GeoPackage as the outputformat.

However, some users feel that a need to know some SQL is weird, and having several SQL dialects, from which GDAL makes a selection by hidden rules, makes more confusion for beginners. But it is hard to set the limits when to use SQL, when to add a new specific open option or ogr2ogr switch, or when to write an OGR VRT file.

For me the OGR_SCHEMA feels more user friendly than SQL CAST especially because I fear that different dialects may not do exactly similar casts. By supporting also renaming users would not need to to one part of mapping with OGR_SCHEMA and another one with SQL.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For me the OGR_SCHEMA feels more user friendly than SQL CAST

not only that, but type casting at the SQL level is too late. If a driver has wrongly guessed that a field was Integer32 wheras later records hold 64-bit values, the driver will present truncated/wrong values to the SQL engine. There's no way to "fix" that at that level. That must be something done by the driver itself

- GeoJSON
- SQLite
- GML

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GML can already use ,xsd and .gfs, and csv can have a .csvt sidecar, but maybe the more options the better. Especially manual writing of .csvt file is not convenient at all.

doc/source/development/rfc/rfc103_schema_open_option.rst Outdated Show resolved Hide resolved
- If the schema is a valid JSON document but does not contain the expected fields or it is a no-op
(does not contain any actionable instruction), a warning will be raised and the schema will be ignored.

- Additional JSON properties will be ignored while parsing the schema.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest that we ignore the schema if an expected JSON property is found with the error message mentionning that property. That way this will enable use to add more capabilites in the future, while giving the user a certainty that what he specified is fully taken into account.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The rationale for ignoring was to allow the easy workflow ogrinfo -json -> edit the field types -> use that JSON document as an input for ogr2ogr. If we error out on unsupported properties the user will need to remove all unsupported properties.

It is fine for me, just wanted to point this out.

Same observation goes for the "schema_type" recommended below.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we error out on unsupported properties the user will need to remove all unsupported properties.

If the error message is sufficiently explicit, I don't think that's much a concern. But there are clearly pros/cons


- Additional JSON properties will be ignored while parsing the schema.

- If the schema contains a field that is not present in the dataset, a warning will be raised and the field will be ignored.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe we should add a required top-level JSON property "schema_type" whose only supported value currently will be "patch", to mean that we alter the autodetected schema for the parts we specify with the JSON document. If in the future we would want to allow full schema replacement, that would be done with "schema_type" = "full" or something like that

"fields": [
{
"name": "field1",
"type": "string"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We would need also to support "subtype" (for String JSON, or Integer Boolean). And perhaps "width" and "precision" too

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It makes sense.
Is it subtype also exposed by ogrinfo -json ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For schema_type I was thinking to make it a layer-level property, this way a combination of patch and full can be used for individual layers.

I'm working on a json schema document to clarify this.

Implementation
--------------

A new open option named SCHEMA will be added to the following drivers:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should mention this open option will be a reserved one. If a driver uses it, it must be for that purpose

And I believe we should modify ogr2ogr so that -mapFieldType uses the capability offered by the SCHEMA open option when available

@jratike80
Copy link
Collaborator

Oh, names. PostGIS driver has already ACTIVE_SCHEMA and SCHEMAS. OAPIF has IGNORE_SCHEMA.
Perhaps this new open option should have some more original name, like OGR_SCHEMA.

@rouault
Copy link
Member

rouault commented Oct 22, 2024

Perhaps this new open option should have some more original name, like OGR_SCHEMA.

+1

@elpaso elpaso changed the title [rfc] Add RFC 103 - OGR SCHEMA open option [rfc] Add RFC 103 - OGR_SCHEMA open option Oct 22, 2024
{
"name": "field1",
"type": "string",
"subtype": "JSON"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually I see we have spelled it "subType" in ogrinfo -json output:

      "fields":[
        {
          "name":"a",
          "type":"String",
          "subType":"JSON",
          "nullable":true,
          "uniqueConstraint":false
        }

Schema in https://github.com/OSGeo/gdal/blob/master/apps/data/ogrinfo_output.schema.json#L117

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rouault rouault added the funded through GSP Work funded through the GDAL Sponsorship Program label Oct 24, 2024
@elpaso elpaso marked this pull request as ready for review October 30, 2024 14:27
Copy link
Contributor

@sgillies sgillies left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made a couple comments and suggestions inline.

doc/source/development/rfc/rfc103_schema_open_option.rst Outdated Show resolved Hide resolved
},
{
"name": "field2",
"newName": "new_field2"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@elpaso @rouault Could we make patching easy to understand by treating names and field types in the same way? For example:

Suggested change
"newName": "new_field2"
"newName": "new_field2",
"newType": "String",
"newSubType": "JSON"

If we do this, we don't have to spend any energy explaining why names and types are different.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense to me. As an alternative we could treat the fields as an object such as:

"fields" : { "field1" { "name" : "field1_renamed" ...}}

that would mean that we abandon the idea to use the output of ogrinfo -json as a template for the schema, but perhaps we have already lost that train.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hope not to lose everything from the ogrinfo -json. It would be rather demanding to write a working schema from a scratch, and having a separate option or utility for printing the default schema feels like duplication. But rather duplication than leaving users alone.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@elpaso yeah, the original sin (so to speak) is that the OGR schema is represented as an array of fields (this is OGR's internal model) instead of as a dictionary. Thus we have to reference fields by "the item in the fields array where name is val" instead of just "fields[val]". I think it makes more sense to accept that than to change it.

elpaso and others added 2 commits October 30, 2024 17:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
funded through GSP Work funded through the GDAL Sponsorship Program
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants