-
-
Notifications
You must be signed in to change notification settings - Fork 2.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[rfc] Add RFC 103 - OGR_SCHEMA open option #11071
base: master
Are you sure you want to change the base?
Conversation
|
||
- Additional JSON properties will be ignored while parsing the schema. | ||
|
||
- If the schema contains a field that is not present in the dataset, a warning will be raised and the field will be ignored. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens if some fields are missing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nothing: auto-detected types will be used. The idea is that you can override only a part of the schema.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have added a sentence to clarify that partial overrides are possible (in fact, they probably are the main use case).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe that it does not take long before someone wants to have an option to rename the fields by the same.
{
"name": "field1",
"new_name": "description"
"type": "string"
},
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And then someone would like to have an alternative for the OGR SQL EXCLUDE for removing some fields
The syntax * EXCLUDE ([fields]) can be used to select all fields except those listed in parentheses.
They will not have that option if the fields which are missing from the schema are autodetected. But it is not possible to make both you and them happy at the same time, and you are the author...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe we should add a required top-level JSON property "schema_type" whose only supported value currently will be "patch", to mean that we alter the autodetected schema for the parts we specify with the JSON document. If in the future we would want to allow full schema replacement, that would be done with "schema_type" = "full" or something like that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jratike80 I agree that field renaming will be very useful. Instead of an implicit operation, where new_name
means replace name
or new_type
means replace type
, I'd prefer to use an explicit operation (sort of like https://jsonpatch.com/#replace) which says: replace field name
with {"name": "NEW_NAME", "type": "NEW_TYPE"}
.
I'll read the whole RFC soon, maybe what I've proposed won't be realistic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that I think about it, field renaming can be done with SQL, right? While overriding the original field type cannot.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SQL is flexible and type casting is supported. I seem to have used this kind of SQL with ogr2ogr
SELECT
'FIN' AS "country",
CAST("mtk_id" as text) as "id",
'CC BY 4.0' AS "license",
ST_Simplify("geometry",5) AS "geometry"
FROM "source"
I am not sure how well GDAL detects those CASTed attributes generally, but they have worked for my use case with GeoPackage as the outputformat.
However, some users feel that a need to know some SQL is weird, and having several SQL dialects, from which GDAL makes a selection by hidden rules, makes more confusion for beginners. But it is hard to set the limits when to use SQL, when to add a new specific open option or ogr2ogr switch, or when to write an OGR VRT file.
For me the OGR_SCHEMA feels more user friendly than SQL CAST especially because I fear that different dialects may not do exactly similar casts. By supporting also renaming users would not need to to one part of mapping with OGR_SCHEMA and another one with SQL.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For me the OGR_SCHEMA feels more user friendly than SQL CAST
not only that, but type casting at the SQL level is too late. If a driver has wrongly guessed that a field was Integer32 wheras later records hold 64-bit values, the driver will present truncated/wrong values to the SQL engine. There's no way to "fix" that at that level. That must be something done by the driver itself
- GeoJSON | ||
- SQLite | ||
- GML | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
GML can already use ,xsd and .gfs, and csv can have a .csvt sidecar, but maybe the more options the better. Especially manual writing of .csvt file is not convenient at all.
- If the schema is a valid JSON document but does not contain the expected fields or it is a no-op | ||
(does not contain any actionable instruction), a warning will be raised and the schema will be ignored. | ||
|
||
- Additional JSON properties will be ignored while parsing the schema. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would suggest that we ignore the schema if an expected JSON property is found with the error message mentionning that property. That way this will enable use to add more capabilites in the future, while giving the user a certainty that what he specified is fully taken into account.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The rationale for ignoring was to allow the easy workflow ogrinfo -json
-> edit the field types -> use that JSON document as an input for ogr2ogr
. If we error out on unsupported properties the user will need to remove all unsupported properties.
It is fine for me, just wanted to point this out.
Same observation goes for the "schema_type" recommended below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we error out on unsupported properties the user will need to remove all unsupported properties.
If the error message is sufficiently explicit, I don't think that's much a concern. But there are clearly pros/cons
|
||
- Additional JSON properties will be ignored while parsing the schema. | ||
|
||
- If the schema contains a field that is not present in the dataset, a warning will be raised and the field will be ignored. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe we should add a required top-level JSON property "schema_type" whose only supported value currently will be "patch", to mean that we alter the autodetected schema for the parts we specify with the JSON document. If in the future we would want to allow full schema replacement, that would be done with "schema_type" = "full" or something like that
"fields": [ | ||
{ | ||
"name": "field1", | ||
"type": "string" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We would need also to support "subtype" (for String JSON, or Integer Boolean). And perhaps "width" and "precision" too
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It makes sense.
Is it subtype
also exposed by ogrinfo -json
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For schema_type
I was thinking to make it a layer-level property, this way a combination of patch and full can be used for individual layers.
I'm working on a json schema document to clarify this.
Implementation | ||
-------------- | ||
|
||
A new open option named SCHEMA will be added to the following drivers: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should mention this open option will be a reserved one. If a driver uses it, it must be for that purpose
And I believe we should modify ogr2ogr so that -mapFieldType uses the capability offered by the SCHEMA open option when available
Oh, names. PostGIS driver has already |
+1 |
91a4f1e
to
1b0f100
Compare
{ | ||
"name": "field1", | ||
"type": "string", | ||
"subtype": "JSON" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually I see we have spelled it "subType" in ogrinfo -json output:
"fields":[
{
"name":"a",
"type":"String",
"subType":"JSON",
"nullable":true,
"uniqueConstraint":false
}
Schema in https://github.com/OSGeo/gdal/blob/master/apps/data/ogrinfo_output.schema.json#L117
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have added a JSON schema to the RFC: https://github.com/OSGeo/gdal/pull/11071/files#diff-9d0c9dc12d0b4a70d63be37cb390f1e9970d226296c30e8883c73a911c82806bR55
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made a couple comments and suggestions inline.
}, | ||
{ | ||
"name": "field2", | ||
"newName": "new_field2" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@elpaso @rouault Could we make patching easy to understand by treating names and field types in the same way? For example:
"newName": "new_field2" | |
"newName": "new_field2", | |
"newType": "String", | |
"newSubType": "JSON" |
If we do this, we don't have to spend any energy explaining why names and types are different.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes sense to me. As an alternative we could treat the fields as an object such as:
"fields" : { "field1" { "name" : "field1_renamed" ...}}
that would mean that we abandon the idea to use the output of ogrinfo -json
as a template for the schema, but perhaps we have already lost that train.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I hope not to lose everything from the ogrinfo -json
. It would be rather demanding to write a working schema from a scratch, and having a separate option or utility for printing the default schema feels like duplication. But rather duplication than leaving users alone.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@elpaso yeah, the original sin (so to speak) is that the OGR schema is represented as an array of fields (this is OGR's internal model) instead of as a dictionary. Thus we have to reference fields by "the item in the fields
array where name
is val
" instead of just "fields[val]
". I think it makes more sense to accept that than to change it.
Co-authored-by: Even Rouault <even.rouault@spatialys.com>
Formatted preview: https://gdal--11071.org.readthedocs.build/en/11071/development/rfc/rfc103_schema_open_option.html