Add optional layer metadata at instantiation #952

jgoizueta · 2018-05-08T18:10:09Z

This is experimental to be used by CartoVL. It shouldn't be public/documented at the moment.

Fixes #948

Also fixed bug where sampling query generation needed results of count queries

phase (not only its tasks) must be executed after the tasks of previous phases

eliminate dependency on the order of PostgreSQL results

jgoizueta · 2018-05-09T13:52:15Z

Notes:

This allows obtaining additional metadata in map instantiation. When aggregation occurs, the metadata is about the original, unaggregated data source.
This is to fulfill current Carto VL needs, but we can discuss having aggregation metadata too in the future.

The metadata can be requested by adding a metadata entry to the layer options, which can have these properties:

featureCount: (true/false) actual row count (as opposed to the default estimated row count)
geometryType: (true/false) type of the geometry column
columns: (true/false) names and types of the columns
columnStats: (true/false/object) same as columns and also some basic stats: max, min, avg, sum for numbers; max, min for dates; and top categories for text columns. The number of top categories is 1024 by default and can be controlled by passing an object like { topCategories: 100 } here.
sample: (integer) returns a sample of the table of approximately this size (number of rows).
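As an illustration of the options listed above, here is a minimal sketch of a layer options object that requests metadata, together with a helper showing how the true/false/object form of columnStats could be resolved. The helper name resolveColumnStats, the table name and the surrounding structure are assumptions made for this example, not the code in this PR; only the property names and the 1024 default come from the notes.

```javascript
// Default number of top categories, as stated in the notes above.
const DEFAULT_TOP_CATEGORIES = 1024;

// Hypothetical helper (not part of this PR): turns the columnStats option,
// which may be true, false or an object, into a concrete setting.
function resolveColumnStats(columnStats) {
    if (!columnStats) {
        return { enabled: false };
    }
    const hasCustomLimit = typeof columnStats === 'object' &&
        Number.isInteger(columnStats.topCategories);
    return {
        enabled: true,
        topCategories: hasCustomLimit ? columnStats.topCategories : DEFAULT_TOP_CATEGORIES
    };
}

// Example layer options requesting every kind of metadata.
const layerOptions = {
    sql: 'SELECT * FROM my_table', // hypothetical table name
    metadata: {
        featureCount: true,
        geometryType: true,
        columns: true,
        columnStats: { topCategories: 100 },
        sample: 1000
    }
};

console.log(resolveColumnStats(layerOptions.metadata.columnStats));
```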

The requested metadata is returned in the response for each layer (metadata.layers) in meta.stats, in addition to estimatedFeatureCount which we already had.

I've preserved the existing behaviour for estimatedFeatureCount, which takes the value -1 if any SQL error occurs when computing it. I think this is questionable, as it may be hiding unexpected problems. For the new metadata I'm not filtering SQL errors.

jgoizueta · 2018-05-10T15:28:01Z

I'm reviewing this because of a major problem: at the point where metadata is now being computed, only the aggregated query (in case of aggregation) is available.
We could extract the original query since it appears as ( ... ) _cdb_query but a less hacky approach would be desirable.

Also the previously existing stat estimatedFeatureCount refers to the aggregated results in case of aggregation. So, to be consistent we should be explicit in the metadata parameters about pre/post aggregation metadata and support both in some cases (like count).

Tests for metadata in the presence of aggregation must be added.

jgoizueta · 2018-05-11T08:17:44Z

Well, regarding the existing stat estimatedFeatureCount, it never really worked with aggregation because it executes the aggregation query without performing Mapnik token substitution (scale_denominator, etc.). This wasn't noticed because, as stated in my previous comment, we were silently returning -1 in case of error.

So, now we should be free to change the behaviour of this for aggregation and return the estimated count for the pre-aggregation query in this default stat.

All stats are now computed pre-aggregation

Code to help compute post-aggregation stats remains for testing

Also change aggregated stats to not filter a single tile

Remove usage of PhasedExecution

This achieves better query execution granularity and removes questionable usage of a shared results object. It introduces a couple of behavior changes:
* estimatedFeatureCount doesn't ignore errors now
* sample always uses estimatedFeatureCount, even if the actual count is also computed.

jgoizueta · 2018-05-18T20:35:07Z

Hey @dgaubert can you review this again?

There was an interesting problem with the tests (well, "interesting" wasn't the word I had in mind while I was pulling my hair out trying to figure it out).

The test "layergroup can hold substitution tokens" was failing, but only on Travis (not even when using the Docker tests locally). I was finally able to reproduce it locally by setting global.environment.enabledFeatures.layerStats = true; in the test.
So we have some test-order issue with some test leaving an indeterminate state in layerStats.

Now, the problem, which has existed for a long time and has been revealed now, is this: we always compute the row count estimate stat. But this has been failing if the SQL query contains Mapnik tokens (because we make no substitution before executing it).
But since we were ignoring any SQL errors for that query (setting row count to -1) we didn't notice.

I've fixed the substitution problem, but I haven't looked at the layerStats configuration problem.
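To make the substitution problem concrete, here is a minimal sketch of replacing Mapnik tokens before running a stats query. The function name substituteTokens and the placeholder values are assumptions for illustration only; the actual fix in this PR may use different values and a different mechanism.

```javascript
// Placeholder values (assumptions): chosen only so that the resulting SQL can be
// parsed and executed when no real tile context is available.
const DEFAULT_TOKEN_VALUES = {
    bbox: 'ST_MakeEnvelope(-20037508.34, -20037508.34, 20037508.34, 20037508.34, 3857)',
    scale_denominator: '0',
    pixel_width: '1',
    pixel_height: '1'
};

// Hypothetical helper: replaces !token! occurrences with concrete SQL fragments.
function substituteTokens(sql, values = DEFAULT_TOKEN_VALUES) {
    return sql.replace(
        /!(bbox|scale_denominator|pixel_width|pixel_height)!/g,
        (match, token) => values[token]
    );
}

console.log(substituteTokens('SELECT * FROM t WHERE the_geom_webmercator && !bbox!'));
```

Without a step like this, the database receives the literal !bbox! text and raises a syntax error, which is the error that was being silently turned into -1.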

Keep current production behavior of ignoring errors when computing this stat and returning -1. This is done so as not to introduce any instability in production at the moment.

jgoizueta · 2018-05-21T09:52:12Z

I've reverted the behaviour in case of error when computing estimatedFeatureCount (computed in each instantiation) to ignore errors and return -1, so we can safely deploy this in production without introducing new risks.

@oleurud do you think it's worth making that conditional on the environment (so that in development, staging, etc. the errors aren't ignored)? (if (process.env.NODE_ENV === 'production') ...)

simon-contreras-deel · 2018-05-21T10:02:13Z

Maybe an environment configuration parameter would be the best option (easy and fast to enable/disable)
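A minimal sketch of what such a configuration switch could look like. The flag name ignoreStatsErrors and the function name are hypothetical, and the example is synchronous for brevity, whereas the stats code in this PR is promise-based.

```javascript
// Hypothetical wrapper: returns -1 on error when the flag is set (the current
// production behaviour), and rethrows otherwise so problems surface in
// development or staging.
function estimatedFeatureCountOrFallback(computeCount, config = { ignoreStatsErrors: true }) {
    try {
        return computeCount();
    } catch (err) {
        if (config.ignoreStatsErrors) {
            return -1;
        }
        throw err;
    }
}

// Usage: a failing computation is hidden only when the flag is enabled.
const failing = () => { throw new Error('syntax error at or near "!"'); };
console.log(estimatedFeatureCountOrFallback(failing, { ignoreStatsErrors: true }));
```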

The sampling probability is now being computed using an estimate of the table row count. This could lead to probabilities that are too high (samples that are too large) if the estimate is not accurate. To avoid potential problems with large samples we've added a LIMIT to the sampling queries.
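The idea in this commit message can be sketched as follows. This is an illustration under stated assumptions, not the query generated by this PR: the function name sampleQuery, the random() filter, the _sampled alias and the choice of the requested sample size as the LIMIT are all made up for the example.

```javascript
// Hypothetical helper: builds a sampling query from the requested sample size
// and the estimated row count.
function sampleQuery(sql, sampleSize, estimatedCount) {
    // Probability of keeping each row; clamped to 1 when the estimate is
    // missing, non-positive or smaller than the requested sample.
    const probability = estimatedCount > 0 ? Math.min(1, sampleSize / estimatedCount) : 1;
    // If the estimate is too low the probability is too high, so the LIMIT
    // (an assumed value here) caps the number of rows returned.
    return `SELECT * FROM (${sql}) _sampled WHERE random() < ${probability} LIMIT ${sampleSize}`;
}

console.log(sampleQuery('SELECT * FROM t', 100, 1000));
```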

simon-contreras-deel

Looks good, but let me check again this afternoon (I need to read some things)

simon-contreras-deel · 2018-05-21T10:30:11Z

lib/cartodb/backends/layer-stats/mapnik-layer-stats.js

+    if (field.type === 'number') {
+        return ['min', 'max', 'avg', 'sum'];
+    }
+    if (field.type === 'date') { // TODO other types too?


Could it be an else if?

Yep, I think I've omitted quite a few elses lately because of returns inside conditions.
🤔 do you think the explicit else is preferable for clarity?

Forget it, you are right.

simon-contreras-deel · 2018-05-21T10:50:00Z

lib/cartodb/backends/layer-stats/mapnik-layer-stats.js

+
+// columns are returned as an object { columnName1: { type1: ...}, ..}
+// for consistency with SQL API
+function formatResultFields(dbConnection, flds) {


It could be prettier (not required, but it would be easier to understand):

    function formatResultFields(dbConnection, fields = []) {
        let nfields = {};
        // for...of iterates over the field objects (for...in would yield indexes)
        for (let field of fields) {
            const cname = dbConnection.typeName(field.dataTypeID);
            let tname;
            if (!cname) {
                tname = 'unknown(' + field.dataTypeID + ')';
            } else {
                tname = fieldType(cname);
            }
            nfields[field.name] = { type: tname };
        }
        return nfields;
    }

You caught me copy-pasting from SQL API!! 😊

simon-contreras-deel · 2018-05-21T10:52:20Z

lib/cartodb/backends/layer-stats/mapnik-layer-stats.js

+
+    // TODO: compute _sample with _featureCount when available
+
+    Promise.all([


Looks good :)

simon-contreras-deel · 2018-05-21T12:53:22Z

lib/cartodb/backends/layer-stats/mapnik-layer-stats.js

+            ({ estimatedFeatureCount }) => _sample(ctx, estimatedFeatureCount)
+                .then(s => mergeResults([s, { estimatedFeatureCount }]))
+        ),
+        _featureCount(ctx),


A question to understand it: if any of _featureCount, _aggrFeatureCount, _geometryType or _columns fails, we will return an error as the response. I know that the user must request it explicitly, but I am not sure whether a metadata failure should make the whole request fail. What is your point of view?

It is not a reason to stop the PR, and I am not asking to change it. It just raises some doubts.

IMHO, if the user requests any specific metadata and an error happens, we should return that error because we weren't able to process the request completely.

simon-contreras-deel

You have my blessings ;)

Thanks for a very nice issue (and the comments on the PR) allowing everyone to understand very well the use case, the solution, and the caveats

dgaubert · 2018-05-21T14:16:45Z

test/acceptance/aggregation.js

+                                        threshold: 1
+                                    },
+                                    metadata: {
+                                        aggrFeatureCount: 10


Is 10 the zoom value?

dgaubert · 2018-05-21T14:24:01Z