Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GeoJSON to binary arrays improvements #703

Merged
merged 30 commits into from
Mar 28, 2020

Conversation

kylebarron
Copy link
Collaborator

@kylebarron kylebarron commented Mar 21, 2020

@ibgreen

  • Index array length fixed, where the last value is coordinate length
  • Mixed coordinate dimensions: Should be fixed; if mixed coordinate lengths, 0 is left for the missing values
  • Numeric properties added to typed arrays, where each property is contained within the points/lines/polygons.numericProps object. It was helpful to make it its own object, since I loop over the keys of numericProps a couple times.
  • Calculations from the first pass are combined with user-provided options. So, for example, if a user didn't want any numeric properties in binary arrays, they could pass numericProps = [].
  • Feature properties (minus any numeric keys) are added to properties arrays by geometry type. If the user passes numericProps = [], all properties are kept as JSON.
  • Both featureIndex and globalFeatureIndex arrays are created.
  • Add new tests/fix changes to existing tests
  • Use smallest data type necessary for TypedArrays
  • An option for user-defined missing values for positions and numeric feature properties? I.e. if non-zero, fill this as the default value for positions when it's initialized, and then I think the third coordinate won't get overwritten when filling a 2D point.

Closes #702

@coveralls
Copy link

Coverage Status

Coverage decreased (-0.0005%) to 62.04% when pulling a4a0361 on kylebarron:geojson-to-binary into 9b8c059 on uber-web:master.

@coveralls
Copy link

coveralls commented Mar 21, 2020

Coverage Status

Coverage increased (+0.2%) to 62.24% when pulling 262a1d4 on kylebarron:geojson-to-binary into 9b8c059 on uber-web:master.

@kylebarron
Copy link
Collaborator Author

@ibgreen any feedback?

Copy link
Collaborator

@ibgreen ibgreen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kylebarron This looks great!

modules/gis/src/lib/geojson-to-binary.js Outdated Show resolved Hide resolved
modules/gis/src/lib/geojson-to-binary.js Show resolved Hide resolved
if (feature.properties) {
for (const key in feature.properties) {
const val = feature.properties[key];
numericProps[key] = numericProps[key] ? isNumeric(val) : numericProps[key];
Copy link
Collaborator

@ibgreen ibgreen Mar 26, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic may well be correct, but I get confused here. Do we want this to be true only if all the values in all features are numeric? If so maybe the following is easier to follow?

// initalize
for (const key in feature.properties) {propIsNumericKey[key] = true;}
...
// each iteration, check if still considered numeric
if (propIsNumericKey[key]) {
  propIsNumericKey[key] = Number.isFinite(val);
}

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want this to be true only if all the values in all features are numeric?

I think so, yes.

If so maybe the following is easier to follow?

That snippet is initializing based on the properties in the first feature? The problem with that is that not all features are required to have all keys. With MVT in particular, if I'm not mistaken, a property could be defined in only a single feature within a layer.

So if we want numericKey to be true if the values for that prop in all features are numeric and exist, then initializing based on the properties in the first feature are fine. Otherwise, to initialize based on all properties I think it would be necessary to loop over all features just to do the initialization.

I'm currently interpreting numericKey should be true if the values for that prop in all features are either numeric or do not exist. So I need to keep track of 1) have I seen this key in a previous feature, 2) has it been non-numeric in any previous feature.

It might be good to add a comment like

// If property has been numeric in all previous features in which the
// property existed, check if numeric in this feature

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. This can get pretty sophisticated, so you should draw the line here for now and then we can come back and make further improvements if we feel so inclined.

For later:

  • An approach used in binary columnar tables is to optionally generate a second null array to track which objects are not present (e.g. a Uint8Array).
  • We could detect the type of the array:
    • For each element check if integer. If row is all integer, generate Uint32Array
    • Track max/min values of all values
    • if (-32000, 32000), generate Uint16Array
    • If (-128, 127) generate Uint8Array
    • etc

Maybe create a follow up tasks with such ideas for a later date?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can get pretty sophisticated, so you should draw the line here for now and then we can come back and make further improvements if we feel so inclined.

Good idea, it's clear this can be a black hole.

One more idea I thought of that can be a future task: should numericKeys depend on the geometry type?

For example, if I have a set of GeoJSON features where a numeric property x only exists on Point features, and I have 10 Point features but 1000 Polygon features, the current code instantiates a Float32Array for every Polygon vertex, when every vertex is missing data.

I don't think this would necessarily be uncommon either; I have a mountain_peak layer of points in my MVTs with a numeric elevation value, but no other layer has that property, so there would be # line positions + # polygon positions extra empty data just for that property.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes that is a good observation, that can be another optimization.

Right now I think we should be concerned with the "combinatorial explosion" of options + input dependent outputs, specifically relating to the testability and ease-of-use of it all.

Less optionality is probably good for now

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you're right. It's easy to go on a feature-adding spree. If there's anything you think should be taken out let me know.

Now I'm pivoting to adding more tests... My initial approach would be to add tests to firstPass and secondPass separately, to more easily test edge cases. Do you know if it's possible to write tests for a non-exported function in JS? Or is it ok to export the functions for use in tests? I saw one approach on the web like

export const _testing = [firstPass, secondPass]

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I sometimes add a TEST_EXPORTS: const TEST_EXPORTS = {firstPass, secondPass}

Also by using an object you can then do:

import {TEST_EXPORTS} from ...; 
const {firstPass, secondPass} = TEST_EXPORTS;

modules/gis/src/lib/geojson-to-binary.js Outdated Show resolved Hide resolved
modules/gis/src/lib/geojson-to-binary.js Outdated Show resolved Hide resolved
modules/gis/src/lib/geojson-to-binary.js Outdated Show resolved Hide resolved
@ibgreen
Copy link
Collaborator

ibgreen commented Mar 26, 2020

@kylebarron Just mark the PR as "ready for review" and ping me and I will land.

@kylebarron kylebarron marked this pull request as ready for review March 28, 2020 00:46
@kylebarron
Copy link
Collaborator Author

@ibgreen
I think I covered a decent number of edge cases in the tests I added. I also updated the docs. Do you think they're clear enough?

Copy link
Collaborator

@ibgreen ibgreen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kylebarron This is a spectacular PR. I added a few comments, but will land this now.

I feel that collecting any remaining ideas that came up here in the tracker issue would be good.

Have you tried to build the website and made sure that your new docs show up? I could help with that.

`geojsonToBinary` returns an object containing typed arrays sorted by geometry
type. `positions` corresponds to 2D or 3D coordinates; `objectIds` returns the
index of the vertex in the initial `features` array.
type. `positions` is a flat array of coordinates; `globalFeatureIds` references
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For easier reading and a cleaner look, use a markdown table, or at least a bulleted list?

// Array of x, y or x, y, z positions
positions: {value: Float32Array, size: coordLength},
positions: {value: PositionDataType, size: coordLength},
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are in the process of introducing typescript defintions for loaders.gl 2.2. With minimal extra effort, we could actually turn what you wrote here into typescript types in a geojson-to-binary.d.ts. Look for other .d.ts files to see how it is done.

| PositionDataType | `Float32Array` or `Float64Array` | `Float32Array` | Data type used for positions arrays |
| Option | Type | Default | Description |
| ---------------- | -------- | ----------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------- |
| PositionDataType | `class` | `Float32Array` | Data type used for positions arrays. |
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This reminds me: It is best if the option values are representable in JSON, i.e. numbers, strings, etc. So you could support "float32" and "float64", or positionPrecision: 32 | 64

The reason is that loaders.gl is integrated with deck.gl, and deck.gl exposes loader options in its API, and for instance pydeck and non-JS language bindings go through the deck.gl JSON API intermediary.

}

export const TEST_EXPORTS = {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: I normally put this last in file


// Test value equality, missing third dimension imputed as 0
t.deepEqual(points.positions.value, [100, 0, 1, 100, 0, 0, 101, 1, 0]);
t.deepEqual(lines.positions.value, [
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use // prettier-ignore just before line 125 to prevent auto formatting?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Flat binary arrays followup
3 participants