
Better support for sparse properties by declaring schema when available #174

Open
michaelkirk opened this issue Sep 15, 2023 · 1 comment
Labels
enhancement New feature or request

Comments

@michaelkirk
Member

I want to convert an FGB to a CSV. This already works for a typical FGB, but I'd like to take advantage of the FGB format to save some space by skipping a feature's empty properties.

I think solving this problem might have some more general purpose use in geozero.

Because an FGB's properties are prefixed with their column index, when a particular feature has no value for a column you could omit the column altogether, rather than spending 6 bytes just to say "no value for this column". I've made this change in a demo FGB feature branch here: https://github.com/michaelkirk/flatgeobuf/tree/mkirk/empty-fields.

In theory there's no problem writing this back out to another FGB or to a flexible format like geojson, but some other output formats, like csv (and maybe also gpx, shapefile, and arrow?), need to know the schema up front.

I think it can be broken down to a few cases:

  1. It's irrelevant for geometry-only formats such as wkt and geo-types, so we don't need to worry about them.
  2. Formats that support sparse properties, such as fgb and geojson, could be serialized more succinctly by omitting empty values. This should probably be a configurable option on the writer.
  3. Formats that support constant-time access to their schema, such as csv and fgb (arrow? gpkg?), can be deserialized in one pass. Other formats, like geojson, do not. That means it's not currently possible to convert sparse geojson to something rigid like csv, because "new" columns might appear after some CSV rows have already been written. An additional pass before writing, to ascertain the schema, could address this, but that has some drawbacks, and in any case it doesn't currently exist. (There are no guarantees about any geojson in the wild having regular columns anyway, so we're already facing that problem to a degree.)
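To make case 3 concrete, here's a minimal illustrative sketch (not geozero code; the helper and feature data are made up): with sparse properties and no up-front schema, the full column set is only known after the whole stream has been seen, so a header written after the first feature can turn out to be too narrow.

```rust
/// Collect the full column set from a stream of sparse feature properties,
/// in order of first appearance. (Hypothetical helper, not part of geozero.)
fn collect_columns(features: &[Vec<(&str, &str)>]) -> Vec<String> {
    let mut columns: Vec<String> = Vec::new();
    for props in features {
        for (key, _value) in props {
            if !columns.iter().any(|c| c == key) {
                columns.push((*key).to_string());
            }
        }
    }
    columns
}
```

A one-pass CSV writer that commits to a header after the first feature would miss any column that first appears later in the stream.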

As for a potential step forward:

/// Feature processing trait
#[allow(unused_variables)]
pub trait FeatureProcessor: GeomProcessor + PropertyProcessor {
    /// Begin of dataset processing
-    fn dataset_begin(&mut self, name: Option<&str>) -> Result<()> {
+    fn dataset_begin(&mut self, name: Option<&str>, schema: Option<Vec<ColumnArgs>>) -> Result<()> {
        Ok(())
    }

Reading from an fgb would call dataset_begin(Some(name_from_header), Some(feature_schema_from_header)), whereas reading from geojson would call dataset_begin(None, None).

Note that this would mean introducing something like FGB's ColumnArgs and ColumnType to geozero.
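As a rough sketch of what that might look like in geozero (the names and fields here are assumptions loosely modeled on FlatGeoBuf's column metadata, not an actual proposal for the exact shape):

```rust
/// Hypothetical column value type, loosely modeled on FGB's ColumnType.
#[derive(Debug, Clone, PartialEq)]
pub enum ColumnType {
    Bool,
    Int,
    Double,
    String,
    // ... further variants as needed
}

/// Hypothetical column metadata, loosely modeled on FGB's ColumnArgs.
#[derive(Debug, Clone)]
pub struct ColumnArgs {
    pub name: String,
    pub col_type: ColumnType,
    pub nullable: bool,
}
```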

Formats that require a rigid schema, like csv, could use that data to correctly "fill in the blanks" when reading features with sparse properties.
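For illustration, the "fill in the blanks" step could look something like this (a sketch, not geozero's actual CSV writer): given the up-front schema, the writer emits exactly one cell per column and leaves a cell empty when the feature has no value for that column.

```rust
use std::collections::HashMap;

/// Render one CSV row for a sparse feature, given an up-front schema.
/// Missing properties become empty cells. (Illustrative sketch only;
/// no quoting/escaping handling.)
fn csv_row(schema: &[&str], props: &HashMap<&str, String>) -> String {
    schema
        .iter()
        .map(|col| props.get(col).cloned().unwrap_or_default())
        .collect::<Vec<_>>()
        .join(",")
}
```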

This definitely introduces some complexity into the library. Overall, I'm not sure if it's worth it. What do people think?

@michaelkirk michaelkirk added the enhancement New feature or request label Sep 15, 2023
@pka
Member

pka commented Sep 15, 2023

I'm not against an additional argument in dataset_begin, but also not sure if it's worth it...
